Wednesday, March 18, 2009

Final Version of Multi-Version Documents Paper Published by Elsevier

The final version of my MVD paper has now appeared online. This hyperlink is permanent and can be used in citations. The paper reference is Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online, International Journal of Human-Computer Studies, 67.6, 497-514.

Thesis Submission

Also I have now submitted my thesis. The final title was 'Multiple Versions and Overlap in Digital Text'. Here's the abstract:

This thesis is unusual in that it tries to solve a problem that exists between two widely separated disciplines: the humanities (and to some extent also linguistics) on the one hand and information science on the other.

Chapter 1 explains why it is essential to strike a balance between study of the solution and problem domains.

Chapter 2 surveys the various models of cultural heritage text, starting in the remote past, through the coming of the digital era to the present. It establishes why current models are outdated and need to be revised, and also what significance such a revision would have.

Chapter 3 examines the history of markup in an attempt to trace how inadequacies of representation arose. It then examines two major problems in cultural heritage and linguistics digital texts: overlapping hierarchies and textual variation. It assesses previously proposed solutions to both problems and explains why they are all inadequate. It argues that the problem of overlapping hierarchies is a subset of the textual variation problem, and explains why markup cannot be the solution to either.

Chapter 4 develops a new data model for representing cultural heritage and linguistics texts, called a 'variant graph', which separates the natural overlapping structures from the content. It develops a simplified list-form of the graph that scales well as the number of versions increases. It also describes the main operations that need to be performed on the graph and explores their algorithmic complexities.

Chapter 5 draws on research in bioinformatics and text processing to develop a greedy algorithm that aligns n versions with non-overlapping block transpositions in O(MN) time in the worst case, where M is the size of the graph and N is the length of the new version being added or updated. It shows how this algorithm can be applied to texts in corpus linguistics and the humanities, and tests an implementation of the algorithm on a variety of real-world texts.

Tuesday, March 10, 2009

MVD is Not a Replacement for Markup

Some people still think of MVD as a replacement for markup. It isn't. It complements markup systems or any technology that can represent content. As I said on the main page, 'What's a Multi-Version Document?', an MVD represents the overlapping structure of a set of versions or markup perspectives. It doesn't need to represent any of the detail of the content, which is the responsibility of the markup.
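To make that division of labour concrete, here is a minimal sketch of the list form of a variant graph, assuming each entry is simply a pair of a version set and a text fragment. The names Pair, read_version and versions_of are hypothetical and chosen for illustration only; this is not the actual MVD file format or library API, and it ignores transpositions. The fragments are opaque to the list, so they can carry whatever markup the content needs.

```python
from typing import FrozenSet, List, NamedTuple


class Pair(NamedTuple):
    """One entry in the list: a fragment and the versions that share it."""
    versions: FrozenSet[str]
    text: str


def read_version(mvd: List[Pair], version: str) -> str:
    """Rebuild one version by joining, in order, every fragment that belongs to it."""
    return "".join(pair.text for pair in mvd if version in pair.versions)


def versions_of(mvd: List[Pair]) -> List[str]:
    """List every version identifier mentioned anywhere in the document."""
    found = set()
    for pair in mvd:
        found |= pair.versions
    return sorted(found)


# Two versions, "A" and "B", that differ in a single word.
# The shared fragments are stored once; the markup lives inside the
# fragments and is never interpreted by the list itself.
MVD = [
    Pair(frozenset({"A", "B"}), "<p>The quick "),
    Pair(frozenset({"A"}), "brown"),
    Pair(frozenset({"B"}), "red"),
    Pair(frozenset({"A", "B"}), " fox</p>"),
]

if __name__ == "__main__":
    for v in versions_of(MVD):
        print(v, read_version(MVD, v))
```

In this sketch the variation is recorded entirely by the version sets, while the paragraph tags are just content. That is the sense in which the two technologies complement each other.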

I realise that it's easy, and natural, to seek to dismiss radical ideas simply because they are radical. The difference in this case is that MVD is a technology that definitely works. It's not all that radical anyway. Consider the direction in which multiple-sequence alignment is going in biology. Biologists have also realised that the best way to represent multi-version genomes or protein sequences is via a directed graph (e.g. Raphael et al., 2004. A novel method for multiple alignment of sequences with repeated and shuffled elements, Genome Research, 14, 2336-2346). I prefer to think of that idea as parallel to mine. Their 'A-Bruijn' graph is rather different from my MVD, but it represents the same kind of data in much the same way. Acceptance that this basic idea can also be applied to texts in humanities and linguistics is just a matter of time.

The Inadequacy of Markup

If markup is adequate for linguistics texts, why is it that every year someone thinks up a new way to manipulate markup systems to try to represent overlap? If it were adequate there would be no need for new systems, but we continue to see 1-3 new papers on the subject every year. It's seen as a game. Look at the Balisage website: 'There's nothing so practical as a good theory'. Perceived as an unsolvable problem, overlap is the perfect topic for a paper or a thesis.

In the humanities, overlap in markup systems is more than an annoyance; it wrecks the whole process of digitisation. In simple texts you can just about get by, but it's a question of degree. Try to use markup to record the following structures:

  1. Deletion of a paragraph break
  2. Deletion of underlining
  3. Changes to document structure
  4. Transposition
  5. Overlapping variants

These can all be done somehow in markup, I admit, but very poorly. And they are features that occur all the time in original texts. The fundamental problem is that you can't adequately fit a non-hierarchical structure into a hierarchical template. To choose markup alone as a medium to preserve our textual cultural heritage is to resign yourself to mangling that information.

Why do we have to use markup to record complex structures it was never designed to represent? Hand that complexity over to the computer and let it work it out. That's what MVD lets you do. If you are getting a headache shuffling around angle brackets and xml:ids, then think again. Is this any proper way for humans of the 21st century to interact with the texts of their forebears?