Tuesday, July 29, 2008

Improvements to Alignment

In preparation to publicly releasing the NMerge code I have been revising the alignment algorithm, as there were significant weaknesses in the naïve approach I had previously adopted. In cases like the Sibylline Gospel, there is a great deal of variation in a small space, and simply requiring the user to choose a base text for alignment doesn’t work very well. There seem to be a few reasons for this:

  1. The user doesn't actually know to which existing version a new version should be aligned. It should be the task of the software to find this out.
  2. There can't be a distinction, as I originally supposed, between updating an existing version and adding a new one. The newly revised version might resemble another version more than the one it came from. For example, we might change ‘the quick brown dog’ back to ‘the quick brown fox’. In the case of manuscript traditions like the Sibylline Gospel this happens all the time, because the versions aren’t a succession of edits, but a set of parallel alternatives. To avoid this problem I now automatically calculate the most similar version already in the MVD and then align with that.
  3. When you add a new arc to the graph you need to look for identical paths not merely identical arcs that are already in that position. Now when the program finds an existing path with the same text, spanning the same two end-points, the new arc can be discarded and its version simply added to the path instead: In this simplified real-world example, the new D-version was aligned with version C, overall its most similar version. However, in this location the D-variant ‘milia hominum’ already exists in version A. When the program tried to add the D-Arc it realised that there was already an A-path with the same text and instead merely added the D-version to that path.

This is only the first stage of a series of improvements. Still to come:

  • Calling the alignment algorithm recursively to handle contamination, i.e. aligning to more than one base text. After aligning to the most similar version, any significant portions of the new version that didn't align can be realigned to the most similar version between the two endpoints of the unaligned section.
  • Calculating what is a variant of what, one use of which might be to generate a kind of apparatus criticus
  • Calculating transpositions. These can be done after the other alignments are complete: any leftover insertion/deletion pairs meeting certain criteria can be tested for equality, and the transposition carried out.

XML Awareness

While NMerge remains ignorant of XML, as it should, the public interface class now breaks up ‘words’ based on angle-brackets as well as white space. In addition I am contemplating moving this code and the class that XML-izes the differences between versions into a separate package. The latter functionality is needed by the wiki since a difference might occur in the middle of a tag, and the marker for this needs to be moved to the start of the next piece of real text, so it can be displayed.