Thursday, December 10, 2009

LLC paper accepted in record time

My LLC paper, the 9,000-word one about the inadequacy of markup for cultural heritage texts, has been accepted, exactly 30 days after it was submitted. I was expecting a wait of two years or so; this makes it the fastest acceptance of any paper I have ever written. Who says the humanities are sleepy? Obviously they thought it was important enough to approve straight away. And I don't think the reviewers were bored. Unlike many papers in the field it doesn't concentrate on a small specialised area - the digitisation of one author's works, say - but on the digitisation of all of them. A pizza and beer tonight.

Sunday, December 6, 2009

Progress Table for MVD Joomla Components

As I did for nmerge I have drawn up a table for the various components of the Joomla! solution. I'll update this as each component is completed. Red means not done, yellow means partly written, orange means completely written and working but not tested, green means tested. At the current rate of progress this application will take until June to be fully finished and tested, and perhaps even that is optimistic.

The structure is that there will be one component: mvd, which will have a number of views, and one plugin:

view mvd_list - View to display a list of available MVDs. Allow user to create new MVDs and delete old ones. Open MVDs for viewing in various ways.
view editversions - View to allow user to edit version information for a given MVD.
view twin - View two versions of an MVD side by side.
view single - View a single version of an MVD.
view msedit - Edit the source text for a version next to the relevant facsimile page. (NEW)
view singledit - Edit the transcription without the accompanying facsimile.
view tree - View the genealogy of the versions of an MVD as a phylogenetic tree or stemma. (NEW)
plugin search - Indexed search plugin for all pages that require it. (NEW)
search site - Advanced search for files, descriptions and contents. (NEW)
import - Various import options: plain text to MVD, TEI XML to MVD etc. (NEW)
export - Export an MVD to source format (e.g. XML). (NEW)

Friday, December 4, 2009

Launch of Harpur Website

Although there's not much there yet, I launched the Harpur Archive test website last weekend. It is a Joomla installation, and in it I intend to build all the technology from the Alpha wiki prototype, plus a new version of nmerge. In short, I will try to build reusable and easy-to-use Joomla components and experiment with them there. So if you are interested, watch this space.

Markup Inadequacy Paper

While I'm on the subject of news, on the 13th of November I submitted a long paper (9,000 words) to Literary and Linguistic Computing entitled 'The Inadequacy of Embedded Markup for Cultural Heritage Texts'. It's provocative, and it's meant to be. I am basically calling the establishment's bluff, daring them to try to stop this. I think we've gone on quite long enough with an inadequate means of recording our historical texts in digital form. So this is my attempt to make it stop. Here's the abstract:

Embedded generalized markup, as applied by digital humanists to the recording and studying of our textual cultural heritage, suffers from a number of serious technical drawbacks. As a result of its evolution from early printer control languages, generalized markup can only express a document’s ‘logical’ structure via a repertoire of permissible printed format structures. In addition to the well-researched overlap problem, the embedding of markup codes into texts that never had them when written leads to a number of further difficulties: the inclusion of potentially obsolescent technical and subjective information into texts that are supposed to be archivable for the long term, the manual encoding of information that could be better computed automatically, and the obscuring of the text by highly complex technical data. Many of these problems can be alleviated by asserting a separation between the versions of which many cultural heritage texts are composed, and their content. In this way the complex interconnections between versions can be handled automatically, leaving only simple markup for individual versions to be handled by the user.

Friday, November 27, 2009

Interedition Handout

I've had some positive feedback from the recent meeting of the Interedition initiative in Brussels. One of my colleagues distributed a handout that was favourably received, and which has already brought me one offer of collaboration. Since it expresses the essence of MVD in non-technical form and has a stunning graphic comparing Charles Harpur's 1845 and 1888 editions of the Creek of the Four Graves, which are only around 40% similar, I thought I'd share it with you:

Multi-Version Documents and the Harpur Archive

The Multi-Version Document or MVD system is designed to automate as far as possible the work of editing our textual cultural heritage. Existing markup-based approaches pose serious problems for the modern digital scholarly editor, including:

  1. Failure to adequately and accurately represent ordinary textual phenomena
  2. Obscuring the text and confusing the editor with excessive density of technical markup
  3. Requiring manual tasks that could be performed much better and automatically by computer
  4. Embedding subjective and potentially obsolescent technical information into texts that are supposed to be archived for the long term

These problems can mostly be overcome by separating the versions from their content. In this way editing a text becomes relatively simple, because all the complexities of versions (insertions, deletions, variants and transpositions) are handled automatically. Instead the editor works on a simplified text marked up only with the textual structure of each version.

An MVD represents 'the work' as an interrelated set of versions that can be searched, compared, edited and archived as a single, compact digital entity. An MVD also has a zero footprint. You can always get out the texts in exactly the same form as you put them in.

What we have now:

The following tools are available for download from the Google Code site:

  1. The nmerge commandline tool. This can be used to create, edit and manipulate MVDs.
  2. The Alpha wiki prototype. This can be used to visualise and edit MVDs. For copyright reasons it only has one example text: all major versions of Act 1 Scene 1 from Shakespeare’s King Lear.

Future Developments

We are currently developing a plugin for Joomla! that will incorporate all the current technology, with further enhancements, so that a humanities web archive can be easily built and deployed on ordinary web hosts with only a low level of technical expertise. This will be used as the basis of the new Digital Variants website and also the Harpur Text Archive. Progress reports will be posted on the MVD blog.

References

Schmidt, D. (2009a). Merging Multi-Version Texts: a Generic Solution to the Overlap Problem. In: Usdin, B.T. (ed) Proceedings of Balisage: The Markup Conference 2009. doi:10.4242/BalisageVol3.Schmidt01.

Schmidt, D. and Colomb, R. (2009). A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6: 497-514.

Schmidt, D., Brocca, N. and Fiormonte, D. (2008). A Multi-Version Wiki. In: L.L. Opas-Hänninen, M. Jokelainen, I. Juuso, T. Seppänen (eds), Proceedings of Digital Humanities 2008, Oulu, Finland, June, 2008, pp. 187-188.

Multi-Version Documents. http://multiversiondocs.blogspot.com.

Merge and edit N versions in one document. http://code.google.com/p/multiversiondocs/.

Tuesday, November 24, 2009

Minor updates to nmerge, Alpha

I have added a README to Alpha to help install it and get it working. It didn't have one, which was an oversight. I also noticed that the nmerge installer didn't work properly, due to my inexperience with automake. In fact it installed correctly; it just complained about the Java source code directory, which wasn't listed properly in the makefile. I'll try to be more careful in future.

Sunday, November 1, 2009

C++ Version of nmerge

One problem with the current design of nmerge is that it is written in Java. The commandline tool is a thin C wrapper around the Java library, and you can't pass in arguments to increase the available memory, so it simply fails on large files. Also, if you want to run it on servers that don't have, or won't allow, Java (true of many commercial hosting sites) you're out of luck. Since the Digital Variants people, and probably a large number of humanities projects, will have these problems too, I have decided to convert it into pure C++. This should be relatively easy, and the benefits are:

  1. Memory usage will be limited only by what is available on the machine, not by what is allocated to the Java Virtual Machine (JVM).
  2. nmerge-c++ will be callable from PHP or another scripting language without requiring installation of a JVM.
  3. nmerge can optionally write to a database instead of directly to disk. This is usually the only way you can save changes on a commercial hosting site.
  4. The C++ version will use far less memory than the Java version and should be a bit faster.

Overall, these changes will facilitate the building of a practical web application or plugin, which can be added to existing sites. Initially, my intention is to produce a Joomla! plugin that other people can use.

Some changes that will be possible in this revision include:

  1. Grouped transpositions. By assessing individual transposition candidates as a group it will be possible to detect larger transpositions that contain small corrections.
  2. Proper multi-tasking of the merging process in C++ will hopefully speed up the algorithm considerably.

That's the plan. I thought I'd let you know where I'm taking this, and it is to turn it into a generally usable tool.

There is at least one drawback, of course. C++ is cumbersome to write code in, compared to the relative heaven of Java. It's like painting a room with a brush instead of a roller.

Friday, October 23, 2009

Whoops!

A favourite quotation, from Edsger Dijkstra, is that 'testing shows the presence, not the absence of bugs'. This is very true. It's like cockroaches in Australian homes: you can squash a few and think you've got them all, but how do you know there isn't a whole colony hiding in the skirting boards? I'm guilty of putting in a '!' when I shouldn't have. My only excuse is that I was jetlagged in Montreal and fed up with preparing my presentation. For some reason I put in that 'not', which prevented nmerge from finding any left-side transpositions at all. All I can say is: 'Whoops!'
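
Just to illustrate the kind of slip (a minimal Java sketch with made-up names, not the actual nmerge source), a single stray negation in a guard is enough to switch off a whole code path without the compiler ever complaining:

// Illustrative only: hypothetical names, not the real nmerge code.
class TranspositionGuard {
    static boolean acceptLeftTransposition(int score, int threshold) {
        // The accidentally committed form: the stray '!' inverts the test,
        // so no left-side transposition candidate is ever accepted.
        // return !(score >= threshold);

        // The intended form:
        return score >= threshold;
    }

    public static void main(String[] args) {
        // prints true with the fix; always false with the '!'
        System.out.println(acceptLeftTransposition(10, 5));
    }
}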

I'll fix it in the next hour or so and upload the new version as 1.0.2, and update Alpha too. The transposition algorithm is not perfect - I never claimed it was, as the Balisage paper makes clear, particularly at the end - but it is workable. One thing you should keep in mind is that this is a unique program in its field. Several people have written merging programs for humanistic texts, and a couple have even included transpositions (MEDITE, JNDiff), but only between two texts at a time. I merge N texts into one digital representation.

One thing I'd like to do soon is make it find transpositions in groups (the lack of which is a flaw that Peter Robinson rightly pointed out). And it could be even faster, if I can work out how to parallelise the algorithm. That's why I 'built' this fancy i7 computer.

The good thing about computing variants automatically rather than manually is that nothing is final: any improvement in the algorithm is immediately visible. Making systematic changes to a manually encoded set of texts with complex variants, by contrast, is far from trivial.

Tuesday, September 1, 2009

New Versions of nmerge and Alpha posted

The difference here is that nmerge now includes the full source code, released under the GPL v3, and also contains a single example text that I can give away under the same license. It is the first scene of Shakespeare's King Lear. I have tried to make it as true to the source texts as I can but it's a lot of work getting markup to look like a manuscript. I never realised before how much the tags interfere with that. It's very annoying. Anyway, let me know if there are any mistakes. Or any ideas on how Alpha can be improved. I'm sure there are lots.

Of course it is full of markup hacks, mainly lines split over speeches, but I couldn't fix that without introducing another layer for each MS. I'd prefer to use some technology other than markup for the content, but there isn't one yet. Oh well!

Here's the link.

Thursday, August 13, 2009

Balisage Presentation 13 August 2009

My talk at Balisage in Montreal went very much as planned. The slides took as long as I had paced them to last: 28 minutes. Then I did two software demonstrations, one of the nmerge commandline tool and another of the Alpha multi-version wiki. The former is more or less finished (though I keep tweaking it) and the Alpha wiki is about half done, but usable. There was time afterwards for a few questions. The best of these came from Fabio Vitali, who works with Angelo Di Iorio on a diff-calculating algorithm for edited XML texts. He convinced me after the talk that their method of computing diffs has some advantages over my simplistic greedy approach for XML texts. But my method, I think, is still a good fallback in the general case. The best thing would be to incorporate the basic idea of their JNDiff algorithm - making the merging algorithm optionally XML-aware - rather than try to use their code, which is not really open source yet.

I think the paper went down well because of the demos. No one else whose talk I saw presented any finished software; it was mostly work in progress - the usual conference fare. But reactions were not very critical. They had little to say, I think, because it was not about an application of XSLT or XQuery - their favourite tools. Still, the talk has at least exposed the MVD idea to a wider audience. No more excuses for not mentioning it when discussing solutions to overlapping hierarchies.

I received favourable comments, which seemed genuine, from the upper reaches of the Balisage hierarchy, and I am encouraged by that.

I have updated nmerge with the version I demonstrated at Montreal. Also there is a copy of the wiki in its current state, minus any MVDs. I can't use any of the usual examples because of copyright restrictions. So I'll have to create some of my own pretty soon.

Wednesday, August 12, 2009

The Biggest Advantage of Using MVDs

Now that I am here in Montreal, preparing to defend my ideas before 100 experts from around the world, I suddenly realised that all this time I have failed to notice the biggest advantage of MVDs. It is this: the only alternative to computing the interrelations between multi-version texts is to encode them manually. In speaking of the supposed advantages of standard XML tools, what is often forgotten is the enormous human cost of training people to use markup, and of getting them to encode it and check it against the originals. I know from experience that this is very expensive. We literally spent thousands of man-hours encoding variants in Wittgenstein. If we had had a tool for doing that automatically, much of that time and money would have been saved.

Another advantage of computing interrelations automatically is that it is so easy to get back what you put in, unmolested. Hand-encoded XML hard-wires the interconnections between versions, and getting back the original text can be a hard problem if you decide later to change to another technology. With nmerge I just press the "archive" button and it is done.

If computers are good for anything they are good for saving human effort.

Sunday, July 26, 2009

Alpha Prototype Ready

I am renaming the multi-version wiki Alpha, simply because it's easier to say than Phaidros. It's a bit of a joke, really, because 'Alpha' was just the description of the product I developed for DH2008. It was the 'alpha' release of that.

The old Alpha didn't do transpositions, and to remedy this deficiency I have been labouring hard for the past year. nmerge was revised to support transpositions, but I hadn't integrated it into the multi-version wiki. When I finally saw the result of the new nmerge in the web browser, it was suddenly clear that there were still some bugs in the transposition algorithm. Finding out exactly what was going wrong took me about a week of solid debugging. But it is done now and I am finally satisfied. Now I have something to take to Montréal to show the audience, and I can say: 'Hey folks, you said this conference was all about theory, but here's something that actually works.' I think that is a pretty good argument.

In this screendump of part of the TwinView of Galiano's 'El mapa de las aguas' you can see the transposition of 'otras de un hachazo' from after 'de un bocado rabioso' (in version B, left) to before it (in version C, right). To detect cases like this consistently by hand would be nearly impossible.

Red text is deleted in the left-hand version with respect to the version on the right. Blue text is inserted, and transpositions are shown in grey. Black text is merged and, like transpositions, clicking on it aligns the text on each side. The use of these simple HTML features results in a surprisingly effective UI.

Character-Level vs Word-Level Alignment

The use of character-level alignment by default is new to this version. For example, the expression 'el molino chico' became 'el molino' through the deletion of the character sequence 'o chic'. This goes to show that what humans would expect – the deletion of ' chico' – and what the computer detects, don't always correspond. I don't think that is a bad thing. The alternative would be to fail to see changes of spelling such as 'desaparecido' for 'desparecido' or the capitalisation of 'Ojos' for 'ojos'. A word-level granularity would puzzle the reader while he/she tried to work out the difference. It is clearer to see small changes like these highlighted, so I agree with the MEDITE people that character-level alignment is more powerful. After all, you can always reduce character-level granularity to word-level but if you only have word-level alignment you are stuck with it.
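
To see why, here is a minimal sketch (my own code, not the nmerge alignment algorithm) that trims the common suffix before the common prefix; anchored this way it reports the machine's answer, 'o chic', while anchoring the prefix first would report the human's ' chico'. Both alignments are equally minimal:

public class CharDiff {
    public static void main(String[] args) {
        String a = "el molino chico", b = "el molino";
        int limit = Math.min(a.length(), b.length());
        // Trim the common suffix first: "o"
        int suffix = 0;
        while (suffix < limit
                && a.charAt(a.length() - 1 - suffix) == b.charAt(b.length() - 1 - suffix)) {
            suffix++;
        }
        // Then trim the common prefix, without overlapping the suffix: "el molin"
        int prefix = 0;
        while (prefix < limit - suffix && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        // What remains of a is the deletion: "o chic".
        // Anchoring the prefix first instead would report " chico" instead.
        System.out.println("deleted: \"" + a.substring(prefix, a.length() - suffix) + "\"");
    }
}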

'Collation' programs based on XML use word-level granularity because a finer resolution would make the markup impossibly complex (you'd have to mark up each letter separately). That doesn't have to be a restriction once we abandon the print-oriented concept of 'apparatus.' For the digital medium, at least, a new digital presentation of variation is needed. Let it evolve.

Thursday, July 2, 2009

Interface 09 and Multi-Version Wiki

We will be presenting a poster at Interface09 at the University of Southampton. There will also be a demo of the multi-version wiki, which I hope will be an iteration further on from the one presented at Oulu for Digital Humanities 2008. The new multi-version wiki is simply the old wiki with the new nmerge library added, but that includes support for transpositions, which is kind of important. It is a Jetty 6 based web application, viewed in your browser, that allows you to view and edit MVDs in a variety of intuitive ways.

Digital Variants Portal

Eventually the wiki will be broken up and integrated into the Digital Variants Website I am building. In this form the wiki will be a series of portlets inside a portal. Each portlet conforms to JSR 286 and is implemented in Jetspeed 2. A portal allows the user to configure his or her own interface on the web using the portlet components. It also promotes reuse of the portlets by other parties. We are going for broke with this design: I for one don't believe that deficient or obsolescent technology has any place in designs for the future. If we can build it, we will.

Friday, June 5, 2009

nmerge 1.0 posted

OK, I've posted the first beta version of nmerge, for UNIX/Linux/OSX only. I'll add a Windows installer as soon as I can get around to it. Of course I expect it to go wrong immediately, even though I have tested it thoroughly; I can only really gather more information by trying it on other files. And it currently comes with no example files.

Some basic installation instructions for non-GNU aficionados:

  1. Download the nmerge-1.0.tar.gz file using the link above
  2. Open a terminal window, navigate to the downloaded file and unpack it with tar xzf nmerge-1.0.tar.gz, or just double-click on it if you have a Mac
  3. In the terminal window type cd nmerge-1.0
  4. ./configure
  5. make
  6. sudo make install

You should now have a command "nmerge". If it complains about Java, make sure you have a valid JRE installed. It must be at least version 1.5.0 (1.4.2 is no good). To find out, type java -version in the terminal window. If necessary, download a more recent JRE from Sun. (You only need the JRE, not the JDK, unless you also want to develop Java software.) If it still doesn't work, you have an issue that you should post on Google Code.

The first update will contain the source code and documentation. I left them out because of my inexperience with GNU automake.

Tuesday, June 2, 2009

Balisage Paper Accepted

My Balisage paper about how to create and edit MVD files has been accepted. I have already bought the flight tickets and registered, so I will be going to Montreal on August 11-14. That's the other side of the world for me, and I think I must be mad. But this is the only way to properly air the MVD concept and get reactions from the people most likely to raise valid objections. If it passes their scrutiny, then I think the idea will be vindicated as far as it can be at this stage. The draft paper is here, although it is rather technical. I will post my simplified slide show when I have it.

Out of the Tunnel

Well, it all works. Now I just have to build an installable package for it. To be honest I don't think many people, if any, will want to use nmerge. It's too user-unfriendly, because it has no real user interface. People want a GUI these days, and nmerge is designed to be the Swiss army knife for whatever GUI you might want to put on top of it. Nevertheless I will post it as soon as possible with a GNU-style installer, and maybe a Windows one if that is not too hard (perhaps using Nullsoft). The main point is that a milestone has been reached: the MVD file format is born. (Hooray!)

After that it will be time to add my own GUI, which is just an update of the Phaidros wiki, untouched for nearly a year now. I want to add some killer features: e.g. Tree View, which will show the genealogy of a set of versions as a graphical tree that you can configure and regenerate according to taste. I have some other ideas too, which can be blended in gradually.

Thursday, May 28, 2009

The Light at the End of the Tunnel

Well, I finally got 'compare' to work properly. The delay was caused by having to redesign the 'chunking' mechanism that delivers the text back to the browser as a series of blocks that share the same characteristics, so that all the deleted text can be made red, the inserted text blue and the merged text black. The user can also click on the black text and be taken to the corresponding part of the compared text. Very important, but also very tricky to get absolutely right. And in this version I had to allow for transpositions, which are even more complicated. But now at last it works. I will post the project on Google Code in the morning, because I am too tired now.
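
The basic idea of chunking is simple enough to sketch (this is my own minimal illustration with made-up names, not the actual nmerge code): walk the compared text and merge consecutive pieces that share the same state into one block, which the browser can then colour as a unit. Transpositions add further states and cross-references between the two sides, which is where it gets tricky.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of chunking (hypothetical names, not the nmerge classes):
// collapse a sequence of per-segment states into maximal blocks that share
// one state, so each block can be rendered in a single colour
// (deleted = red, inserted = blue, merged = black).
class Chunker {
    enum State { DELETED, INSERTED, MERGED }

    static class Chunk {
        final State state;
        final String text;
        Chunk(State state, String text) { this.state = state; this.text = text; }
    }

    static List<Chunk> chunk(List<String> segments, List<State> states) {
        List<Chunk> chunks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        State current = null;
        for (int i = 0; i < segments.size(); i++) {
            if (current != null && states.get(i) != current) {
                chunks.add(new Chunk(current, buffer.toString()));
                buffer.setLength(0);
            }
            current = states.get(i);
            buffer.append(segments.get(i));
        }
        if (current != null) {
            chunks.add(new Chunk(current, buffer.toString()));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> segments = List.of("el molin", "o chic", "o");
        List<State> states = List.of(State.MERGED, State.DELETED, State.MERGED);
        for (Chunk c : chunk(segments, states)) {
            System.out.println(c.state + ": \"" + c.text + "\"");
        }
    }
}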

[Command status table: usage | create | help | add | del | desc | arch | unarch | export | import | update | read | list | comp | find | vars]

Several Days Later ...

Almost done testing the code. Just a few minor problems with find (again) and variants. The latter could be quite a useful feature in the GUI. For example, selecting a piece of text could conceivably show its variants dynamically in a sub-window at the bottom. I favour an in-line solution using popup text, but that will have to wait. This feature should demonstrate that we don't need to 'collate' separate physical versions any longer to get this information.

Thursday, May 14, 2009

HyperNietzsche vs MVD

I decided after all to make some general remarks about the recently proposed 'Encoding Model for Genetic Editions' being promoted by the HyperNietzsche people and the TEI. Since this is being put forward as a rival solution for a small subset of multi-version texts covered by my solution, I thought that readers of this blog might like to know the main reasons why I think that the MVD technology is much the better of the two.

One Work = One Text

Because it is difficult to record many versions in one file using markup, the proposal recommends a document-centric approach. In this method each physical document is encoded separately, even when they are just drafts of the one text. As a result there is a great deal of redundancy in their representation. They interconnect the variants between documents by means of links which are weighted with a probability, and they see in this their main advantage over MVD. But this is based purely on a misunderstanding of the MVD model. The weights can of course be encoded in the version information of the MVD as user-constructed paths. We can have an 80% probable version and a 20% probable version just as well as physical versions.

Actually I think it is wrong to encode one transcriber's opinion about the probability that a certain combination of variants is 'correct'. A transcription should just record the text and any interpretations should be kept separate. How else can it be shared? The display of alternative paths is a task for the software, mediated by the user's preferences.

The main disadvantage in having multiple copies of the same text is that every subsequent operation on the text has to reestablish or maintain the connections between bits that are supposed to be the same. You thus have much more work to do than in an MVD. I believe that text that is the same across versions should literally be the same text. This simplifies the whole approach to multi-version texts. I also don't believe that humanists want to maintain complex markup that essentially records interconnections between versions, when this same information can be recorded automatically as simple identity.

OHCO Thesis Redux

The section on 'grouping changes' implies that manuscript texts have a structure that can be broken down into a hierarchy of changes that can be conveniently grouped and nested arbitrarily. Similarly, in section 4.1 a strict hierarchy is imposed, consisting of document->writing surface->zone->line. Since Barnard's paper of 1988, which pointed out the failure of markup to adequately represent even a simple case of nested speeches and lines in Shakespeare - sometimes a line is spread over two speeches - the problem of overlap has become the dominant issue in the digital encoding of historical texts. This representation, which seeks to reassert the OHCO thesis (since withdrawn by its own authors), will fail to adequately represent these genetic texts until it is recognised that they are fundamentally non-hierarchical. The last 20 years of research cannot simply be ignored. It is no longer possible to propose something for the future that does not address the overlap problem. And MVD neatly disposes of it.

Collation of XML Texts

I am also curious as to how they propose to 'collate' XML documents arranged in this structure, especially when the variants are distributed via two mechanisms: as markup in individual files and also as links between documentary versions. Collation programs work by comparing basically plain text files, containing only light markup for references in COCOA or empty XML elements (as in the case of Juxta). The virtual absence of collation programs able to process arbitrary XML renders this proposal at least very difficult to achieve. It would be better if a purely digital representation of the text were the objective, since in this case, an apparatus would not be needed.

Transpositions

The mechanism for transposition as described also sounds infeasible. It is unclear what is meant by the proposed standoff mechanism. However, if this allows chunks of transposed text to be moved around this will fail if the chunks contain non-well-formed markup or if the destination location does not permit that markup in the schema at that point. Also if transpositions between physical versions are allowed - and this actually comprises the majority of cases - how is such a mechanism to work, especially when transposed chunks may well overlap?

Simplicity = Limited Scope

Much is made in the supporting documentation of the HyperNietzsche Markup Language (HNML) and 'GML' (Genetic Markup Language) of the greater simplicity of the proposed encoding schemes. Clearly, the more general an encoding scheme, the less succinct it is going to be. Since the proposal is to incorporate the encoding model for genetic editions into the TEI, this advantage will surely be lost. In any case there seems to be very little in the proposal that cannot already be encoded as well (or as poorly, depending on your point of view) in the TEI Guidelines as they now stand.

Friday, May 8, 2009

A Slight Delay in a Good Cause

OK, I'm not finished yet, though I said I would be by now, but software is like that. Sorry. I decided that in order to really test the program properly I should have a complete test suite that I can run after making any changes, to make sure that everything in the release is OK. Well, when I say 'make sure': a test can only tell you that a bug is present, not that there are none. But that's a lot better than letting the user find them. If I release something that is incomplete or not fully tested, I know the sceptics will attack the flaws. They will say 'See, it doesn't work, I told you so!' I can't afford that, so I have to be careful. So far I have tests for 14 out of the 16 commands.

I also added an unarchive command to go with the archive command. With 'archive' users can save an MVD as a set of versions in a folder, plus a small XML file instructing nmerge how to reassemble them into an MVD. This contains all the version and group information etc. So if you don't believe the MVD format will last, it doesn't matter. You always have the archive and that is in whatever format the original files were in. A user could even construct such an archive manually. The 'unarchive' command takes this archive and builds an MVD from it in one step.

Here's a progress bar for the tests. Green means there is a test routine and it passes. Yellow means there is a test routine but it doesn't pass yet. Red means there is no test routine and I don't know for sure if it works, but it might. There was an intermittent problem with update, but this is now fixed.

[Test status table: usage | create | help | add | del | desc | arch | unarch | export | import | update | read | list | comp | find | vars]

I'm going for a beta version with this release. I think it's good enough.

OK, now there's a project on Google Code. I must say it was much easier than creating a SourceForge project. They wanted me to write an epic about it, and even then I had to wait 1-3 days for their royal approval. On Google Code it was instant. Cool.

Thursday, April 30, 2009

Nmerge tool code-complete

The nmerge commandline tool is now code-complete. I guess it's a 'pre-alpha' version. Since this is a revision of a previous working version, though, testing should not take too long. I would estimate that after the Labour Day weekend (Monday 4 May) I should have an alpha version. But with software you never know. This version supports the new merging algorithm from the submitted Balisage 2009 paper, which works pretty well.

Nmerge is also a Java library that can be used from within a Java application, like the Phaidros wiki, to provide support for Multi-Version Documents. Once it has stabilised I will rewrite it as a C++ commandline tool. But for now we have to put up with a slightly more cumbersome syntax. Here is the "usage" statement produced by the program, so you can get some idea of what it does. Once it is reasonably well tested I will put the source code on SourceForge under the GPL v3.

The command syntax is a bit complicated, but so is what it is trying to do. I envisage that this tool could be used in a shell or commandline script to automate, say, the construction of an MVD from a set of files. At least that's what I use it for. In any case the -h option prints out an example or two of how to use each command. The -c option specifies the command you want to perform on the MVD, and the other arguments are the parameters that the command uses, provided they make sense. If they don't you'll get an error message.

With the nmerge tool MVD becomes a real format. There's no GUI, because if I added one you couldn't take it away and put in your own. If you need one, wait for Phaidros.

usage: java -jar nmerge.jar [-c command] [-a archive] [-b backup] 
     [-d description] [-e encoding] [-f string] [-g group] [-h command] 
     [-k length] [-l longname] [-m MVD] [-n mask] [-o offset] [-p]
     [-s shortname] [-t textfile] [-v version] [-w with] [-x XMLfile]
     [-?] 

-a archive - folder to use with archive and unarchive commands
-b backup - the version number of a backup (for partial versions)
-c command - operation to perform. One of:
     add - add the specified version to the MVD
     archive - save MVD in a folder as a set of separate versions
     compare - compare specified version 'with' another version
     create - create a new empty MVD
     description - print or change the MVD's description string
     delete - delete specified version from the MVD
     export - export the MVD as XML
     find - find specified text in all versions or in specified version
     import - convert XML file to MVD
     list - list versions and groups
     read - print specified version to standard out
     update - replace specified version with contents of textfile
     unarchive - convert an MVD archive into an MVD
     variants - find variants of specified version, offset and length
-d description - specified when setting/changing the MVD description
-e encoding - the encoding of the version's text e.g. UTF-8
-f string - to be found (used with command find)
-g group - name of group for new version
-h command - print example for command
-k length - find variants of this length in the base version's text
-l longname - the long name/description of the new version (quoted)
-m MVD - the MVD file to create/update
-n mask - mask out which kind of data in new mvd: none, xml or text
-o offset - in given version to look for variants
-p - specified version is partial
-s shortname - short name or siglum of specified version
-t textfile - the text file to add to/update in the MVD
-v version - number of version for command (starting from 1)
-w with - another version to compare with version
-x XML - the XML file to export or import
-? - print this message

Thursday, April 23, 2009

MVDs in binary or XML?

A pattern is emerging in the effect that the MVD concept is having on people. They take on board its power at representing variation, but they don't like the idea of representing the data in binary form. Instead they think it is possible to represent variation in some form of XML. So far I've heard proposals to use TEI-XML, RDF or GraphML. It's tempting, of course, to carry on using XML when this is the tool we are all most familiar with. However, my point in developing the MVD format was precisely to get around the limitations of all forms of markup. You can't represent a variant graph in XML satisfactorily if the text whose variation you are recording is itself XML – and it usually is. The reason is that you can't represent cases where the markup itself varies: for example the deletion of a paragraph break:

<del></p><p></del>???

Of course there are hacks to get around this particular case, but they have negative consequences. What you end up doing is modifying the markup to accommodate weaknesses in the representational power of markup itself. I think that is a fundamentally flawed strategy. It is just another form of putting presentational information into markup that is supposed to be generic. If you try to represent variation in a set of texts, or in one text, using markup, you very quickly run up against the problem of overlap. And markup, as we all know, is very poor at representing that. The only way to completely get around the overlap problem is to represent variation using a non-markup-based technology. That's the whole point of MVDs, and it doesn't seem to have been widely acknowledged yet.

Sunday, April 5, 2009

MergeTester released

For the thesis I wrote MergeTester, a simple utility that implements the merging algorithm from chapter 5. Although not a practical program, it does demonstrate how the algorithm works and allows the user to test it on folders of versions in any format. It builds up a variant graph of the versions and prints it out one arc at a time. From the printout the user could manually reconstruct the graph, or part of it.
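
For a concrete picture of what gets printed, an arc in a variant graph pairs a fragment of text with the set of versions that share it. A minimal sketch (my own names, not MergeTester's actual classes) might look like this:

import java.util.BitSet;

// Minimal sketch of a variant-graph arc (hypothetical names, not the actual
// MergeTester code): each arc carries a text fragment and the set of versions
// that share it. Printing the arcs one at a time gives enough information to
// reconstruct the graph by hand.
class Arc {
    final BitSet versions;  // which versions pass through this arc
    final String text;      // the fragment of text those versions share

    Arc(BitSet versions, String text) {
        this.versions = versions;
        this.text = text;
    }

    @Override
    public String toString() {
        return "versions " + versions + ": \"" + text + "\"";
    }

    public static void main(String[] args) {
        BitSet shared = new BitSet();
        shared.set(1);
        shared.set(2);  // versions 1 and 2 share this fragment
        System.out.println(new Arc(shared, "some shared text"));
    }
}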

The advantage of the program is that its workings are not obscured by any other code, and it does not depend on third-party libraries. Any comments and reports of bugs found will be gratefully received!

At the moment I am incorporating it into nmerge, which will also be released shortly. Nmerge can convert a variant graph into an MVD, so the merging algorithm will then become practical.

Wednesday, March 18, 2009

Final Version of Multi-Version Documents Paper Published by Elsevier

The final version of my MVD paper has now appeared online. This hyperlink is permanent and can be used in citations. The paper reference is: Schmidt, D. and Colomb, R. (2009). A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6: 497-514.

Thesis Submission

Also I have now submitted my thesis. The final title was 'Multiple Versions and Overlap in Digital Text'. Here's the abstract:

This thesis is unusual in that it tries to solve a problem that exists between two widely separated disciplines: the humanities (and to some extent also linguistics) on the one hand and information science on the other.

Chapter 1 explains why it is essential to strike a balance between study of the solution and problem domains.

Chapter 2 surveys the various models of cultural heritage text, starting in the remote past, through the coming of the digital era to the present. It establishes why current models are outdated and need to be revised, and also what significance such a revision would have.

Chapter 3 examines the history of markup in an attempt to trace how inadequacies of representation arose. It then examines two major problems in cultural heritage and linguistics digital texts: overlapping hierarchies and textual variation. It assesses previously proposed solutions to both problems and explains why they are all inadequate. It argues that overlapping hierarchies is a subset of the textual variation problem, and also why markup cannot be the solution to either problem.

Chapter 4 develops a new data model for representing cultural heritage and linguistics texts, called a 'variant graph', which separates the natural overlapping structures from the content. It develops a simplified list-form of the graph that scales well as the number of versions increases. It also describes the main operations that need to be performed on the graph and explores their algorithmic complexities.

Chapter 5 draws on research in bioinformatics and text processing to develop a greedy algorithm that aligns n versions with non-overlapping block transpositions in O(MN) time in the worst case, where M is the size of the graph and N is the length of the new version being added or updated. It shows how this algorithm can be applied to texts in corpus linguistics and the humanities, and tests an implementation of the algorithm on a variety of real-world texts.

Tuesday, March 10, 2009

MVD is Not a Replacement for Markup

Some people still think of MVD as a replacement for markup. It isn't. It complements markup systems, or any technology that can represent content. As I said on the main page, What's a Multi-Version Document?, an MVD represents the overlapping structure of a set of versions or markup perspectives. It doesn't need to represent any of the detail of the content, which is the responsibility of the markup.

I realise that it's easy, and natural, to dismiss radical ideas simply because they are radical. The difference in this case is that MVD is a technology that definitely works. It's not all that radical anyway. Consider the direction in which multiple-sequence alignment is going in biology: researchers there have also realised that the best way to represent multi-version genomes or protein sequences is via a directed graph (e.g. Raphael et al., 2004. A novel method for multiple alignment of sequences with repeated and shuffled elements, Genome Research, 14, 2336-2346). I prefer to think of that idea as parallel to mine, and their 'A-Bruijn' graph is rather different from my MVD, but it represents the same kind of data in much the same way. Acceptance that this basic idea can also be applied to texts in the humanities and linguistics is just a matter of time.

The Inadequacy of Markup

If markup is adequate for linguistics texts, why is it that every year someone thinks up a new way to manipulate markup systems to try to represent overlap? If it were adequate there would be no need for new systems, but we continue to see 1-3 new papers on the subject every year. It's seen as a game. Look at the Balisage website: 'There's nothing so practical as a good theory'. Perceived as an unsolvable problem, overlap is the perfect topic for a paper or a thesis.

In the humanities, overlap in markup systems is more than an annoyance; it wrecks the whole process of digitisation. In simple texts you can just about get by, but it's a question of degree. Try to use markup to record the following structures:

  1. Deletion of a paragraph break
  2. Deletion of underlining
  3. Changes to document structure
  4. Transposition
  5. Overlapping variants

These can all be done somehow in markup, I admit, but very poorly. And they are features that occur all the time in original texts. The fundamental problem is that you can't adequately fit a non-hierarchical structure into a hierarchical template. To choose markup alone as a medium to preserve our textual cultural heritage is to resign yourself to mangling that information.

Why do we have to use markup to record complex structures it was never designed to represent? Hand that complexity over to the computer and let it work it out. That's what MVD lets you do. If you are getting a headache shuffling around angle brackets and xml:ids, then think again. Is this any proper way for humans of the 21st century to interact with the texts of their forebears?

Wednesday, February 18, 2009

MVD Paper available online

Elsevier have published the paper I wrote with Bob Colomb about Multi-Version Documents online. The Greek text has dropped out of Figure 16, but the rest is good. I hope this has an impact, and it is certainly something I will be referring to in future. It represents everything I knew about the MVD idea and its implications as of December 2008.

Thesis Complete

This morning I submitted a near-final draft of my thesis 'Multiple Versions and Overlap in Digital Text' to my two supervisors. The last chapter describes some new work on aligning multi-version texts automatically. Here's a table taken from the thesis which summarises its performance on a variety of multi-version texts.

The SZ column is the average version size in kilobytes, NV is the number of versions, TT is the total time taken to merge all versions, AT is the average time to merge one version after the first, both in seconds. The test machine had a 1.66GHz Core Duo processor, using one core. The Romulo doesn't merge properly at the moment because there is almost nothing in common between the versions, so the merge times don't mean much in this case.

The key is the AT column, which is how long it takes to 'save' an edited version back into the document. As you can see, it's pretty fast, considering that this is a hard problem. As far as quality goes, I can't see any bad alignments or false transpositions, except in the Malvezzi case. Once I can coerce the input into a sensible format this should also work.

Balisage

It looks as if I will be going to Balisage this year. I will be presenting a boiled-down version of Chapter 5 of the thesis, which is all new work. I'll be very interested to hear their reactions, especially as I can now demonstrate the theory. (Their motto is 'There is nothing so practical as a good theory'.)