Almost Ready for Prime Time

24 May 2012

We now have two versions of a demo up and ready to run. Both allow a user to pull data from the witness files, which contain manuscript transcriptions, select texts to compare, run the texts through a version of Collatex, and then present the results as an alignment table (a “synopsis” or “partitur” in some text-critical dialects) and as a text with apparatus.
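To make the idea of an alignment table concrete, here is a minimal sketch in Python. It does not use Collatex: the standard library's `difflib.SequenceMatcher` stands in for the collation step, and the function name `align`, the gap marker, and the toy witnesses are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align(base, other, gap="-"):
    """Pair up the tokens of two witnesses, padding with `gap` where one lacks a reading.

    This is a stand-in for a real collation engine, not Collatex itself.
    """
    rows = []
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_seg = list(base[i1:i2])
        b_seg = list(other[j1:j2])
        # Pad the shorter segment so both sides fill the same number of columns.
        width = max(len(a_seg), len(b_seg))
        a_seg += [gap] * (width - len(a_seg))
        b_seg += [gap] * (width - len(b_seg))
        rows.extend(zip(a_seg, b_seg))
    return rows


# Toy witnesses (hypothetical): the first one chosen acts as the base text.
witness_a = ["a", "b", "c", "d"]
witness_b = ["a", "c", "d", "e"]

for base_reading, other_reading in align(witness_a, witness_b):
    print(f"{base_reading}\t{other_reading}")
```

Each printed row is one column of the synopsis; a text with apparatus could be derived from the same rows by printing the base reading and noting only the rows where the two readings differ.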

The second of these is still buggy (and the cause of both a couple of late nights and the lateness of this post, for which I apologize heartily to the nice people at MITH), but it does a couple of additional things:

  • Prioritization. While the ability to generate all sorts of different apparatus is a desideratum, at present what we can do is choose the order in which results are presented, and, in the case of presenting a text with apparatus, the first text chosen becomes the base text for comparison.
  • Tokenizing. I am now able to tokenize in two steps. The first step produces “rich” tokens that retain data about the individual words (e.g., abbreviations, which should be compared based on their expanded text rather than on the abbreviation as written), as well as other data in the text (page breaks, etc.). From these we can create “regularized” tokens. For now I have regularized the tokens by removing all yods and waws. Additional candidates might include dealing with prepositions that are sometimes but not always attached in medieval Mishnah manuscripts (e.g., shel), final aleph/heh, and final nun/mem. The resulting “simple” tokens are passed to Collatex (alternatively, we can allow Collatex to process the “rich” tokens), and the collation output is then merged with the rich tokens.
  • Presentation. Because the “rich” tokens retain information about the witness, it is possible to generate a “text-with-apparatus” in which the base text can be presented with formatting and contextual information that may be useful to the reader. (Disclaimer: Here is a big bug: The XSLT that joins the two lists of tokens inserts the non-words (page breaks etc.) in a position that is offset by one location. Any suggestions?)
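The two-step tokenizing and the merge back onto the rich tokens can be sketched as follows. This is an illustration in Python rather than the project's XSLT, and the names `RichToken`, `regularize`, `simple_tokens`, and `merge` are hypothetical. The merge shown here advances through the collation output only when it meets a word token, which is one way to keep non-words (page breaks, etc.) in their original positions; whether that addresses the offset in the actual stylesheet is an open question.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

YOD = "\u05d9"  # Hebrew letter yod
WAW = "\u05d5"  # Hebrew letter waw


@dataclass
class RichToken:
    written: str                    # the form as written in the witness
    expanded: Optional[str] = None  # expansion, if the written form is an abbreviation
    kind: str = "word"              # "word", or a non-word such as "pb" (page break)


def regularize(token: RichToken) -> str:
    """Compare on the expanded text where available, with yods and waws removed."""
    text = token.expanded if token.expanded is not None else token.written
    return text.replace(YOD, "").replace(WAW, "")


def simple_tokens(rich: List[RichToken]) -> List[str]:
    """Only words are sent to the collator; non-words are held back."""
    return [regularize(t) for t in rich if t.kind == "word"]


def merge(rich: List[RichToken], collated: List[str]) -> List[Tuple[RichToken, Optional[str]]]:
    """Re-attach collation results to rich tokens without shifting the non-words."""
    results = iter(collated)
    merged = []
    for token in rich:
        if token.kind == "word":
            merged.append((token, next(results)))  # consumes one collation result
        else:
            merged.append((token, None))           # non-word: consumes nothing
    return merged


# Hypothetical witness fragment: an abbreviation, a page break, a plain word.
rich = [
    RichToken("\u05e8'", expanded="\u05e8\u05d1\u05d9"),  # r' expanded to "rabbi"
    RichToken("", kind="pb"),
    RichToken("\u05e9\u05dc\u05d5\u05dd"),                # "shalom", written with waw
]
simple = simple_tokens(rich)
merged = merge(rich, ["result-1", "result-2"])  # placeholder collation output
```

Because the page break never enters the list passed to the collator, the collation output has one entry per word, and the merge has to skip the non-words explicitly rather than pair the two lists by index.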

Next up: modifying the demo to present multi-column synopses, and linking in Talmudic and Commentary citations.


4 Responses

  1. I have been thinking about digital representation of bilingual (or multilingual) texts for some time and have not found much information on them. As I was thinking about what kinds of texts might be good candidates, the Talmud occurred to me, and I found your site. I’d love to hear more about the theoretical thinking that went into your choices of how to represent your text!

    • Hayim Lapin

      Thanks for your comment.
      While I will probably integrate a translation into the text at some point, the project is currently monolingual in terms of the text represented (although the interface and metadata are entirely in English; at a later stage we will need to internationalize this, at least to include Hebrew).
      In terms of theoretical thinking about representation of the text, my concern was less about language and more about manuscript. Talmud-oriented digital projects tend to aim at extracting text. While this is significant, it is not the end product. My goal was to view the transcription and markup as a guide to the preserved page and to retain data about scribal practices such as abbreviation or adding surplus text, so that the end product is as much a contribution to medieval book culture as it is to the text-history of a 2nd-3rd C legal text.

      • Interesting! I was thinking less of a translation of the Talmud than of the Talmud itself as a bilingual text. Am I wrong that some is in Hebrew and some Aramaic? But yes, metadata and interface internationalization are also a recurring thought for me, as well.

        • Hayim Lapin

          You are absolutely correct that the Talmud includes both Aramaic and Hebrew. For the present, I am working only with the Mishnah, which is an early layer in the Talmud that is almost entirely in Hebrew.
          For the Talmud there are some sets of material where distinctions are relatively easy: a lot of the Hebrew is in citations, and a lot of the Aramaic in “editorial” discussion (in scare quotes because the issue of editing is so vexed). More problematic is when that editorial layer itself has language shifts.
          What are your thoughts?
