Thursday, December 4, 2008

The MVD File Format

Several people have asked me what is inside an MVD, so I thought I would put it on the record.

The idea behind the Multi-Version Document or MVD format is to use the list form of the variant graph as the basis for an encoding of a single work, in all its versions or markup perspectives, as a single digital entity. The advantage of this form of digital document is that a work can be viewed, searched, compared across versions and edited as one file. For example, all versions of Homer's Iliad or the seven markup perspectives of the American National Corpus (Ide and Suderman, 2006) could be encapsulated in a single compact and editable representation. The relationships between the various parts of each version, the what-is-a-variant-of-what information, are also recorded. Storing a multi-version work as a set of separate files has the great disadvantage of requiring this kind of data to be recalculated each time it is needed. In an MVD it has been calculated once and is thus built in.

If the content of each version is itself XML, then XML is a poor format for an MVD. An MVD may, however, be written in binary or XML format. In the latter case, the XML content of each version, inside the XML encoding of the MVD structure, is escaped. That is, all instances of '<', '>' and '&' have to be replaced by their equivalent entities '&lt;', '&gt;' and '&amp;'. The purpose of the XML form of an MVD is merely to allow the researcher to look inside it to see what is there. Editing it by hand is virtually impossible, because the delicate list format produced by Algorithm 1 can so easily be broken.
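As a rough illustration of the escaping step, the sketch below uses Python's standard xml.sax.saxutils module. The sample line of markup is invented for the example, and nothing here is taken from the MVD software itself.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical content of one version, itself a fragment of XML.
version_text = '<l n="1">Sing, goddess, of wrath & ruin</l>'

# escape() replaces '&', '<' and '>' with '&amp;', '&lt;' and '&gt;',
# so the fragment can sit inside the XML encoding of the MVD as plain text.
escaped = escape(version_text)
print(escaped)

# unescape() reverses the substitution when the version is read back.
assert unescape(escaped) == version_text
```

The round trip is lossless, which is all the XML form of an MVD needs in order to let a researcher look inside.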

A tinker-proof binary format is therefore preferred. If desired for archival purposes, an MVD can be written out as a set of separate XML files, although the format is itself archivable because its content is encoded using only open-source software. The structure of an MVD is shown below:

The outer wrapper is a Base64 encoding, expressing binary data as plain text.

The inner wrapper is the ZIP encoding performed by the open-source Zlib library (Gailly and Adler, 1995). This serves the double purpose of scrambling the data to deter tinkering, and compressing it so that one MVD typically occupies little more space than a single original version. Even the alteration of a single byte of the outer Base64 wrapper will very likely break the inner ZIP encoding and the document will fail to load, as it should. Inside the ZIP container are the four parts that comprise the real content:

Magic: the presence of this hexadecimal string guarantees that this is an MVD
Groups: these are labels for a hierarchy of arbitrary depth used to group versions or other groups
Versions: these provide a simple description sufficient to identify the ID, short name and long name of each version, whether or not it is a partial version, and its group
Pairs: the pairs list that defines the variant graph itself
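To make the four parts concrete, here is a minimal sketch of how they might be modelled in memory. All class and field names are my own invention for illustration, the magic value is a placeholder rather than the real one, and the layout of a pair (a set of version IDs plus the text those versions share) is an assumption about the list form of the variant graph, not a description of the actual byte layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Placeholder only: the real magic string is not given in this post.
MAGIC = bytes.fromhex("deadc0de")

@dataclass
class Group:
    name: str
    parent: Optional[int] = None   # index of the enclosing group, None at top level

@dataclass
class Version:
    id: int
    short_name: str                # e.g. a siglum
    long_name: str
    group: int                     # index into MVD.groups
    partial: bool = False

@dataclass
class Pair:
    versions: Set[int]             # assumed: IDs of the versions sharing this text
    data: str                      # assumed: the fragment of text they share

@dataclass
class MVD:
    magic: bytes = MAGIC
    groups: List[Group] = field(default_factory=list)
    versions: List[Version] = field(default_factory=list)
    pairs: List[Pair] = field(default_factory=list)

    def group_path(self, version_id: int) -> List[str]:
        """Names of the groups enclosing a version, outermost first."""
        version = next(v for v in self.versions if v.id == version_id)
        path, index = [], version.group
        while index is not None:
            path.insert(0, self.groups[index].name)
            index = self.groups[index].parent
        return path

    def text_of(self, version_id: int) -> str:
        """Read one version by joining, in order, the pairs it belongs to."""
        return "".join(p.data for p in self.pairs if version_id in p.versions)


mvd = MVD(
    groups=[Group("Manuscripts"), Group("Drafts", parent=0)],
    versions=[Version(1, "A", "First draft", group=1),
              Version(2, "B", "Second draft", group=1)],
    pairs=[Pair({1, 2}, "The quick "), Pair({1}, "brown"),
           Pair({2}, "red"), Pair({1, 2}, " fox")],
)
print(mvd.text_of(1))       # The quick brown fox
print(mvd.group_path(2))    # ['Manuscripts', 'Drafts']
```

In this toy example the shared text is stored once and only the variants "brown" and "red" are stored separately, which is the sense in which the what-is-a-variant-of-what information is built in.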

No further detail is needed; adding more would in fact damage the general applicability of the format. Groups can be used to express any desired classification system for versions. The short name of a version would typically be a siglum or other brief label for convenient reference, and the long name would typically be a full version name. All other details of a version's text are the responsibility of the content format.
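The two wrappers described earlier can be sketched in a few lines using Python's standard base64 and zlib modules. This assumes the default zlib stream format and a payload that is already a byte string; the function names wrap and unwrap are mine, and how the four parts are actually serialised before wrapping is not shown.

```python
import base64
import zlib

def wrap(payload: bytes) -> str:
    """Compress the payload with zlib, then express it as Base64 text."""
    return base64.b64encode(zlib.compress(payload)).decode("ascii")

def unwrap(text: str) -> bytes:
    """Undo wrap(): decode the Base64 text, then decompress with zlib."""
    return zlib.decompress(base64.b64decode(text))

# A repetitive stand-in for the serialised magic, groups, versions and pairs.
payload = b"The quick brown fox. " * 50
wrapped = wrap(payload)

assert unwrap(wrapped) == payload
print(len(payload), len(wrapped))   # the wrapped text is much shorter here
```

A zlib stream ends with an Adler-32 checksum of the uncompressed data, so a damaged stream is normally rejected with zlib.error on decompression rather than loaded silently.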

References

J.-L. Gailly and M. Adler (1995) Zlib

N. Ide and K. Suderman (2006) Integrating Linguistic Resources: The American National Corpus Model. In Proceedings of the Fifth Language Resources and Evaluation Conference.