Saturday, April 16, 2011

From TEI to HRIT and back again

Since we are designing a software suite to more or less replace embedded markup, there has to be some way to import legacy texts. At first I thought the problem was insurmountable. Even if the original encoders had stuck to recommended guidelines such as those of the TEI (Text Encoding Initiative), they would have been forced to customise their encoding in two ways:

  1. By adding custom tags and attributes, and
  2. By making a selection of tags from the large number of available ones

In the second case it is clear that any general solution that embraced an arbitrary subset of TEI would have to support all of it. Since there are currently 519 tags in the scheme, and (probably) thousands of attributes, that is a daunting prospect for any programmer, especially as we are talking about meaningful conversion into an entirely different software system, not a simple one-for-one mapping. And with respect to point 1, any customised tags would either have to be left out, or their function would need to be specified by the user.

Solving the problem

When forced to perform the task, however, I soon realised that any customised tags must have already been specified by a user who understood XML. So that same user could supply a customised table of conversion in XML to say what should be done with them. If they didn't follow the Guidelines then they have to do a little extra work, but they're not shut out.

And in the second case only a small subset of TEI is regularly used by digital humanists. For the purposes of defining versions, for example, only a small number of tags come into play, and even customised ones would have to follow one of only a couple of basic patterns, which could be programmed in as general functions. The customisations could be handled by a 'recipe', or set of instructions on how to convert the files. A default recipe would be provided for standard files, which the user could extend or change at will.
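The idea of a default recipe that the user can extend or change can be sketched roughly as follows. This is only an illustration: the real recipe is an XML file whose format is not described here, so the Python dictionary, the action names 'split' and 'keep', and the functions build_recipe and action_for are all assumptions made for the example.

```python
# Hypothetical stand-in for a recipe: element name -> action for the importer.
# The real recipe is an XML file; a dict is used here only to show the idea.
DEFAULT_RECIPE = {
    "app": "split",
    "rdg": "split",
    "del": "split",
    "add": "split",
    "choice": "split",
}


def build_recipe(user_rules=None):
    """Return the default recipe with the user's rules layered on top."""
    recipe = dict(DEFAULT_RECIPE)      # copy, so the defaults stay untouched
    recipe.update(user_rules or {})    # user rules extend or override defaults
    return recipe


def action_for(recipe, tag):
    """Look up what to do with an element; unknown elements are kept as-is."""
    return recipe.get(tag, "keep")


# A user with a customised tag adds one rule and overrides one default.
recipe = build_recipe({"myvariant": "split", "choice": "keep"})
for tag in ("app", "choice", "myvariant", "p"):
    print(tag, "->", action_for(recipe, tag))
```

The point of layering user rules over the defaults is that someone with standard files supplies nothing at all, while someone with customised tags only has to describe the differences.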

Why do this at all?

Because HRIT format is much more powerful than TEI:

  1. It allows arbitrary overlap of properties
  2. It does not mandate any standard tag names
  3. It supports versions natively, including transpositions
  4. It allows mixing and matching of markup sets in the one text

That's more than enough reason to move from TEI to HRIT. Another way of looking at it is to say that, rather than replacing TEI, HRIT seeks to enhance it and to use it as an interchange format between HRIT and non-HRIT users. It depends on what kind of 'spin' you prefer.
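To make point 1 in the list above concrete, here is a minimal sketch of two properties that overlap without either containing the other, which is exactly what nested XML elements cannot express directly. The Range class and its offset and length fields are assumptions for the sake of illustration, not HRIT's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A named property covering text[offset : offset + length]."""
    name: str
    offset: int
    length: int

    @property
    def end(self):
        return self.offset + self.length


def overlaps(a, b):
    """True if a and b share some text but neither contains the other."""
    intersect = a.offset < b.end and b.offset < a.end
    a_in_b = b.offset <= a.offset and a.end <= b.end
    b_in_a = a.offset <= b.offset and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


text = "I wandered lonely as a cloud"
line = Range("line", 0, 17)           # covers "I wandered lonely"
emphasis = Range("emphasis", 11, 17)  # covers "lonely as a cloud"

print(text[line.offset:line.end])
print(text[emphasis.offset:emphasis.end])
print(overlaps(line, emphasis))
```

Because each property simply points into the plain text, the two ranges can coexist without any workaround such as milestones or fragmented elements.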

Two-way conversion

Any conversion applied to legacy files (or, if you prefer, current files) would have to be reversible. Those who had imported their files into HRIT and changed their minds later on would feel 'locked in' if they couldn't back out, and those who hadn't made the switch would likewise be frightened off by that very prospect. So the overall process looks like this. Red/green arrows indicate as yet unavailable/available paths:

'TEI' refers to any TEI-encoded file. The two-way process works like this:

  • Splitter splits the TEI file into N versions. By default it splits <app><rdg>...</rdg></app> structures as well as nested <del> and <add> and <choice> structures into versions. Unsplitter, not yet written, will take the versions (possibly modified) and try to put them back into one file, although this may be difficult. The recipe file is used by splitter to direct the splitting. It can be customised by the user to control which elements are split and how.
  • Stripper removes the remaining markup from each separate version in TEI format. A different recipe file specifies simplifications of elements intended to be rendered as formats in the final HTML. One simplification might be the reduction of <hi rend="italic"> to the property 'italics'. The output of stripper is the HRIT standoff XML format (but stripper is written in such a way that another format can be added if required). It expresses every TEI element as a potentially overlapping property with possible 'annotations' or attributes. These attributes are ignored by the formatter but are not lost. Elements like the TEI header, which contain metadata about the text, are entirely hidden but also not lost. This is to enable later reversal of the stripping process. Each version produces a pair of files, plain text and markup, which are separately merged into a single CorTex and a single CorCode file respectively. It is these files that are edited and read by the HRIT system.
  • Formatter takes the properties of the CorCode and combines them with the information from the CSS file into HTML. The CSS is used not only to change the appearance of the text on a web page but also to transform the markup. For example the CSS rule span.italics can be used to change the appearance of italics, but also to convert properties called 'italics' into spans of class 'italics'. In this way we can avoid use of XSLT. But what about the 'annotations' that were originally attributes in the TEI-XML? They are simply ignored (although not lost). If you want to convert an element plus some attribute(s) into an HTML element using formatter, you must first specify a rule to simplify them to a plain property using stripper's recipe file.
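The stripping step described above can be sketched roughly as follows. This is not the actual stripper: the function strip_markup, the SIMPLIFY table and the dictionary layout of each range are assumptions for illustration, the output is a Python list instead of the HRIT standoff XML format, and the example uses a small fragment without the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplification rules of the kind a stripper recipe might hold:
# (element name, attribute, value) -> plain property name.
SIMPLIFY = {("hi", "rend", "italic"): "italics"}


def strip_markup(xml_text):
    """Return (plain_text, ranges) for a well-formed XML fragment."""
    root = ET.fromstring(xml_text)
    chunks = []   # pieces of plain text, in document order
    ranges = []   # one dict per element: name, offset, length, annotations
    pos = 0       # current offset into the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        name, annotations = elem.tag, dict(elem.attrib)
        # Apply the first matching simplification rule, if any.
        for (tag, attr, value), prop in SIMPLIFY.items():
            if elem.tag == tag and elem.attrib.get(attr) == value:
                name = prop
                del annotations[attr]  # the attribute is absorbed by the rule
                break
        entry = {"name": name, "offset": start, "length": 0,
                 "annotations": annotations}
        ranges.append(entry)
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:             # text following the child element
                chunks.append(child.tail)
                pos += len(child.tail)
        entry["length"] = pos - start  # known only after the children are done

    walk(root)
    return "".join(chunks), ranges


plain, props = strip_markup('<p>The <hi rend="italic" n="1">quick</hi> fox</p>')
print(plain)
for prop in props:
    print(prop)
```

Attributes that no rule consumes (here the n attribute) stay attached to the property as annotations, so a formatter can ignore them while a later reverse conversion can still recover them.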