Saturday, June 11, 2011

From arbitrary overlap to HTML

If we try to represent original documents not authored in the digital medium, we soon discover that the pen or the printed type used to create them were not constrained, as modern embedded markup languages are, to represent only tree-structures. It would thus be very liberating to encode such documents for digital presentation on the Web, using arbitrary overlapping external properties instead of an embedded hierarchy of tags. This would provide a number of distinct advantages:

  1. Properties could represent the source texts more accurately.
  2. Different sets of properties could be combined in the same document.
  3. With appropriate software, it will be easier to edit separate text and markup files than complex embedded markup.
  4. Texts and markup, as separate building blocks, could be exchanged and reused for other applications.

These are winning arguments for at least digital humanists, and maybe for other people who use embedded markup.

How it works

To see how this can be done let's specify some fictitious properties that apply to random ranges of a short text:

0,12,'banana'
3,7,'pear'
12,9,'refrigerator'
13,4,'orange'
18,12,'pineapple'
22,34,'guava'
35,12,'grape'
48,9,'penguin'
52,17,'dog'

What this means is that the text, which is at least 69 bytes long, is 'marked-up' by a series of arbitrary properties. The offsets in the text where these properties start are specified by the first number in each line, their lengths by the second number, and their names by the quoted strings. Of course, in a real-world application the names would more likely be 'p' or 'span' or 'table' etc.

Reduction to intervals

But how can we turn this apparent chaos into syntactically correct HTML? The approach taken here is to break up the properties into a series of 'intervals' where all the properties are the same throughout. For example between offsets 52 and 55 the properties 'dog', 'guava' and 'penguin' are all active.

Intervals defined by overlapping properties

Although dividing the properties into intervals removes the overlap it also creates too many short sequences for efficient HTML. So the next step is to work out where we might be able join them up. To do that we need to know which tags may appear inside which other tags. In other words, we need a kind of basic schema.

On-the-fly deduced schemas

Fortunately we already have the HTML schema. Since we will be using CSS (cascading style sheets) to format the text, we can use CSS rules to tell us which HTML elements will represent our properties, and then work backwards to figure out how they will nest.

The size of the problem can be reduced by reflecting that not all of our overlapping properties will be rendered in HTML. Some have other uses, for example, to provide programming information, are not needed in the current view, or are intended for future use. So it is safe to ignore any properties that aren't mentioned in the CSS file. However, because HTML syntax is fairly loose, this only gives us part of the answer.

The missing information can be deduced from the properties themselves. If we are recording a play, for example, properties like 'line' almost always nest inside 'speech'. In those few cases where they don't, we can split the 'line' property so that it always nests within the dominant tag 'speech'.

Merging this statistical information about property nesting into that derived from the HTML schema allows us to reconstruct almost all of the structure needed to render the document correctly. However, since this relies on statistics, it is not absolutely guaranteed to work in all cases, but even when it doesn't we will still have valid HTML. The worst that can happen is that the formatting won't look right.

Using the deduced hierarchy information

One simple way to express a hierarchy is to record which elements may appear inside which other elements. If you have 10 elements that need rendering this means you must compute a 10x10 matrix. 10 is probably a realistic number in practice, but even with all 107 HTML5 tags a matrix of just 11,449 ints or 45K would suffice.

Properties (left) that may appear within other properties (top)

In my test program I just specified the nesting matrix manually. Following the natural analogy a 'guava' may appear inside a 'penguin', 'dog' or 'refrigerator', but an 'orange' cannot be inside a 'pineapple'. In the finished program, of course, such a matrix would be computed as described above.

From this hierarchy information we can easily work out when to start and stop tags. For each interval visited in order we separate the ranges into three sets:

  1. closing: properties that are present in the previous interval but missing in the current one
  2. opening: properties that are present in the current interval but were absent in the previous one
  3. continuing: properties that are present in both the preceding and current intervals

After this initial classification we use the nesting matrix to correct any anomalies. Any property in the 'continuing' set that is contained by one from the 'closing' or 'opening' sets must be moved to the 'closing' set and also added to the 'opening' set. This is because, in order to preserve well-formedness, the closure or opening of the parent will force the closing and re-opening of a continuing child.

Remaining problems

Even after these measures two problems remain:

  1. Since we allowed arbitrary overlap there is nothing to prevent two incompatible properties such as 'guava' and 'grape' being defined for the same interval. Neither may contain the other, so if they occur together one must be dropped.
  2. Tags must be written out in the correct order: highest level containers first and the lowest level tags last. This can be achieved by sorting the ranges within each interval by their descending position in the hierarchy. We also use a stack to ensure that closing tags match and come out in the correct order.

The result

Once these adjustments have been made, the intervals can be printed out one at a time, the closing tags followed by the opening ones. Here is the output of the test program, with dots representing the text for clarity:

<banana>............</banana><refrigerator>.<orange>....</orange>.<pineapple>...</pineapple></refrigerator><pineapple>.........</pineapple><guava>..................</guava><penguin><guava>....</guava></penguin><dog><penguin><guava>....</guava>.</penguin>............</dog>

In this crazy random example the conflicting properties 'pear' and 'grape' had to be dropped. There was no way to render them given the containment rules. But the result is still well-formed XML and it would be HTML if we had used a CSS file to transform the properties.

Where to go from here

This test solution needs to be incorporated into the formatter tool and the whole thing converted into a php extension, so it can be used as a direct replacement for XSLT.