Wednesday, April 28, 2010

Revised Variants Command

The revised variants command for nmerge now works like this: you specify a range in a particular version, and it computes all the variants that leave and rejoin that path. Mathematically this is very simple. Unfortunately, variants must be aligned on word boundaries. It doesn't make sense to compute them on character boundaries (as they necessarily are in the MVD). If you did that, you would end up with variants like 'Q1:a' and have no idea what the context of this 'a' in version 'Q1' is. The problem is that in extending a variant to its natural word boundaries you can of course encounter more variation, which means that you can end up duplicating variants. To get around this, several fixes were required:

  1. Equal variants in different versions can be merged. So 'Q1:Map' and 'Q2:Map' become 'Q1,Q2:Map'. Cool.
  2. A variant can also be part of another variant. The versions are the same, so you just drop the smaller variant.
  3. Because of imperfections in the nmerge program, a 'variant' can have the same text as the base version. To catch this, each computed variant is compared with the corresponding text of the base version and dropped if it is the same.
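
The three fixes above can be sketched roughly as follows. This is only an illustration in Python, not code taken from nmerge: the function names, the (versions, text, base_text) tuples and the order in which the fixes are applied are all assumptions made for the example.

```python
from collections import defaultdict

def tidy_variants(variants):
    """Apply the three fixes to a list of (versions, text, base_text) tuples.

    versions  : set of version names, e.g. {"Q1"}
    text      : the word-aligned variant text
    base_text : the corresponding word-aligned text of the base version
    (This tuple layout is hypothetical, chosen for the sketch.)
    """
    # Fix 3: drop any 'variant' whose text is the same as the base text.
    kept = [(set(vs), text) for vs, text, base in variants if text != base]

    # Fix 1: merge equal variants that occur in different versions.
    merged = defaultdict(set)
    for vs, text in kept:
        merged[text] |= vs

    # Fix 2: drop a variant contained in a longer one with the same versions.
    items = list(merged.items())
    result = []
    for text, vs in items:
        contained = any(
            other != text and text in other and vs == other_vs
            for other, other_vs in items
        )
        if not contained:
            result.append((sorted(vs), text))
    return result

def format_variant(versions, text):
    """Render a variant in the bracketed form used in the output below."""
    return "[" + ",".join(versions) + ":" + text + "]"

if __name__ == "__main__":
    sample = [
        ({"Q1"}, "Map", "map"),
        ({"Q2"}, "Map", "map"),
        ({"F2"}, "faire", "fayre"),
        ({"F2"}, "Mother faire", "Mother fayre"),
        ({"F3"}, "Mother", "Mother"),
    ]
    for versions, text in tidy_variants(sample):
        print(format_variant(versions, text))
```

The sample data is made up to exercise each fix once: the two 'Map' variants merge, the shorter F2 variant is absorbed by the longer one, and the F3 'variant' disappears because it matches the base.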

Getting all that working has taken a month. Here's the output of nmerge -c variants -m kinglear.mvd -o 2000 -k 100 -v 1 (variants in King Lear, base version 1, at offset 2000, length of range = 100):

[Q1:for,]
[Q1,Q2:mother]
[F3,F4:fair,]
[F2,Q1,Q2:faire,]
[Q1,Q2:&]
[F2,F3,F4:whorson]
[Q1,Q2:whoreson]

Here are the six original versions that contain these variants. Note that the initial 'r:' gets extended back to the first word boundary and is in fact 'for:':

F1: r: yet was his Mother fayre, there was good sport at his making, and the horson must be acknowledged
F2: r: yet was his Mother faire, there was good sport at his making, and the whorson must be acknowledged
F3: r: yet was his Mother fair, there was good sport at his making, and the whorson must be acknowledged
F4: r: yet was his Mother fair, there was good sport at his making, and the whorson must be acknowledged
Q1: r, yet was his mother faire, there was good sport at his making, & the whoreson must be acknowledged
Q2: r yet was his mother faire, there was good sport at his making, & the whoreson must be acknowledged
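
The extension of a range to its word boundaries, as with 'r:' becoming 'for:', could be sketched like this. It is a minimal Python illustration that assumes words are delimited by whitespace only; the function name is hypothetical and nmerge itself may decide boundaries differently.

```python
def extend_to_word_boundaries(text, start, end):
    """Widen the half-open range [start, end) to whole words.

    Assumption for this sketch: a word boundary is any whitespace
    character or either end of the text.
    """
    # Walk left until the preceding character is whitespace.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Walk right until the next character is whitespace.
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

if __name__ == "__main__":
    line = "for: yet was his Mother fayre,"
    # The range covering just 'r:' grows to cover 'for:'.
    s, e = extend_to_word_boundaries(line, 2, 4)
    print(line[s:e])
```

Punctuation attached to a word stays with it under this rule, which is consistent with variants such as 'faire,' in the output above.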

Now you might say that a collation program could do as much, but I don't think so. In a collation program you have to collate the entire text of all the versions against the chosen base text to get that output, then sift through it to find the right location. Nmerge computes variants over ranges in the base text; actually, it reads them from the MVD. And the base version can be changed at will. This makes it possible to display variants dynamically in a GUI.

Now all I have to do is call this via Ajax from the Joomla GUI. I'll need to filter it so that residual tags and entities get turned into something useful. Time, though, is beginning to run out.

Monday, April 19, 2010

Inadequacy of Embedded Markup

My paper on 'The Inadequacy of Embedded Markup for Cultural Heritage Texts' has just been published online by Literary and Linguistic Computing. It should be interesting to see what people make of it. It's not good to criticise, but sometimes if you don't, the opposition will just keep saying that what we already have is good enough. And I'm tired of that.