Thinking about the end product

25 Jan 2012

Since my last post, I have been working on a grant application. This has given me an opportunity to take stock. I’ve also had some very helpful conversations with scholars in the field: Juan Garcés and Matt Munson in Hebrew Biblical Studies, Tim Finney in New Testament and Desmond Schmidt in textual computing and classics.

1. Collation. Judging from a few samples run with very simple normalization and tokenization, CollateX will remain error-prone unless the algorithm changes significantly. Examples: (1) In a Mishnah section with repeated words, slight differences in spelling pushed a whole clause off to the second match. (2) In another passage, CollateX failed to detect a missing clause in the text and aligned non-matching tokens. My estimate is that the error rate is currently above 10% (for one passage it was about 15%). Better normalization will improve this result. This raises the question of whether the normalization (or, which may amount to the same thing, having CollateX ignore certain characters in comparison) can be carried out automatically, and what this would look like, or whether, as Desmond Schmidt assures me, the whole enterprise is wrongheaded.

2. Statistical measures, now done by hand, but ideally automated. I have now invested in a license for SPSS. This and my old friend Excel have allowed me to run some preliminary analyses. First: run collations on every Mishnah section in my sample chapter using a few representative witnesses. Transfer the output to Excel; manually fix the alignment (remember, high error rate). Then start flagging variations. I have opted for a method that is akin to what Schmidt and Tim Finney have used: effectively, create a master document with all possible readings, and use a binary encoding (1, 0) to record whether each reading appears in a given witness. (Since the text is already tokenized, I used individual tokens, aka words, not characters, for estimating distance.) Use SPSS to generate a distance matrix, multi-dimensional scaling (MDS), and clustering. I have also experimented with sites providing a graphical interface to bioinformatics software (FastME and Phylip) to produce phylogenetic trees.
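As a rough illustration of the binary encoding and distance matrix described above, here is a minimal Python sketch. The witness names, the toy readings, and the choice of a simple mismatch proportion as the distance measure are all assumptions made for illustration; the SPSS analysis may well have used a different measure.

```python
from itertools import combinations

# Hypothetical toy data: one row per witness, one column per reading in the
# master document. 1 = the reading appears in that witness, 0 = it does not.
readings = {
    "W1": [1, 0, 1, 1, 0],
    "W2": [1, 0, 1, 0, 0],
    "W3": [0, 1, 0, 1, 1],
}

def mismatch_distance(a, b):
    """Proportion of readings on which two witnesses disagree (0.0 to 1.0)."""
    if len(a) != len(b):
        raise ValueError("witness vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(table):
    """Return (sorted witness names, symmetric dict-of-dicts of distances)."""
    names = sorted(table)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = mismatch_distance(table[a], table[b])
        matrix[a][b] = matrix[b][a] = d
    return names, matrix

names, matrix = distance_matrix(readings)
for name in names:
    print(name, [round(matrix[name][other], 2) for other in names])
```

A matrix of this shape is the kind of input that an MDS or clustering routine would then take.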

The results were interesting enough that I wanted to see the results with more careful identification of variation (I’m doing these by hand, after all) and more witnesses. I used the sections with the fullest representation among witnesses (Chapter 2, Mishnah 1-2), choosing a total of 10 witnesses. The results I got were consistent with those from the larger text sample and fewer witnesses, but neither represented the accepted wisdom on the relationship between manuscripts. I therefore divided the cases between no-variation, substantive (different word, different gender, change in grammatical form), and orthographic (initial waw, matres lectionis, spacing between preposition and word). As an example, the Greek word emporia generated no fewer than six variant spellings, but all represented a recognizable version of the word: orthographic, not substantive variation.

MDS for Orthographic Differences, 10 Witnesses

MDS for Substantive Differences

MDS for Substantive Differences, 10 Witnesses

Now, there were some interesting results: the manuscripts thought to be of the “Palestinian type” clustered closely on substantive differences, considerably less so (and differently) on orthographic differences.

The lesson: Orthographic and substantive variations do not coincide, probably due to scribal decision-making (and inconsistency). Substantive differences seem to be better for grouping text families. (They may be easier to identify automatically as well: normalizing orthography to improve collation erases orthographic difference (by definition), while retaining non-orthographic difference.) But linguistic and orthographic differences are of research significance too. We may need a way for the user to flag readings to be compared.
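To make the point about normalization concrete, here is a minimal Python sketch of a normalizer that keeps the original token alongside a comparison key. The ignore set (waw and yod, treated crudely as matres lectionis wherever they occur) is an assumption for illustration only; a real rule would have to distinguish consonantal uses, and nothing here reflects how CollateX itself is configured.

```python
import unicodedata

# Hypothetical ignore set: waw (U+05D5) and yod (U+05D9). Dropping them
# everywhere is a deliberate oversimplification for this sketch.
IGNORED = {"\u05D5", "\u05D9"}

def normalize(token, ignored=IGNORED):
    """Return a comparison key: no vowel points, punctuation, or ignored letters."""
    decomposed = unicodedata.normalize("NFD", token)
    kept = [
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")   # combining marks
        and not unicodedata.category(ch).startswith("P")  # punctuation
        and ch not in ignored
    ]
    return "".join(kept)

def tokenize(text):
    """Split on whitespace and pair each original token with its key."""
    return [(raw, normalize(raw)) for raw in text.split()]

def same_after_normalization(a, b):
    """True when two spellings differ only in ways the normalizer erases."""
    return normalize(a) == normalize(b)
```

Spellings that differ only by an added waw or yod compare as equal under this key, while a different word does not, which is the sense in which normalization erases orthographic difference and retains the rest. Keeping the original token next to the key means the orthographic evidence is still available for separate study.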

Rooted Tree (Phylip) for Substantive Differences, 10 Witnesses

Unrooted Phylogenetic Tree, 10 Witnesses

As for visualization, we are not yet ready for phylogenetic stemmata, certainly not of the rooted type. The underlying assumption of a steady evolutionary clock and the lack of any allowance for contamination make the results interesting from a heuristic point of view, but unreliable in fact. We might think of an unrooted tree as a way of imagining the MDS space with links showing connections. The phylogenetic links in my examples are identical in the rooted and unrooted trees, although for grouping families the unrooted tree makes more intuitive sense of the data (closer MSS appear closer). Both trees, however, make the various close relations (the so-called “Palestinian tradition”) into the ancestors or early descendants of distinct traditions. This would require more work to establish; more generally, a phylogenetic scheme will require a model better suited to the data.


5 Responses

  1. hlapin

    Commented by Hayim for Desmond Schmidt:

    If I understand this rightly, even the cluster analysis method considers the text as a whole, rather than parts of it independently. That’s the key problem in constructing a phylogenetic tree of a set of texts that include contamination, since contamination isn’t part of the biological model of evolution (at least not between different species). So I can’t help wondering whether computing sameness might be a better criterion than computing difference. The number of readings in common between two texts could likewise be expressed via a matrix and could be used to reconstruct a “contaminated” tree, that is, one with dotted lines joining branches. But it would need a different tree-building algorithm altogether.
    The different results obtained by considering substantive and orthographic differences separately are interesting but it takes a lot of effort to subdivide the data this way, effort specific to a particular text. For a general method I’m not convinced that merging the two kinds of information together would produce anomalous results.

  2. hlapin

    Another comment from Desmond Schmidt. [and I’m tracking down the problem with “Error 1”.]

    Taking the above idea further I’d like to propose for comment the following method. This is already somewhere in Greg’s Calculus of Variants, but I can’t remember where.
    It seems that a basic stemma (NOT a phylogenetic tree) is an expression of literal similarities between versions. But textual traditions also contain: a) contaminations and b) lost manuscripts. Neither is modelled by biological software. These can be expressed mathematically by first computing the basic tree using an existing method, but based on *similarities* not differences. Then:
    a) For each pair of versions not directly related, check whether their similarities differ from the corresponding reading of their common ancestor. If so, they are contaminated.
    b) If two versions both derived directly from the same ancestor share similarities that differ from the corresponding reading in the ancestor, then we can postulate a lost MS as ancestor of both versions, since the same mistake is unlikely to occur twice.
    All of this can be easily computed from an MVD. It would be interesting to test this idea against an artificial tradition to verify that it is accurate.
    Drawing the tree could then be done using a custom routine. Forget about biological software.
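The two checks proposed in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each version is represented as a list with one reading per variation site, the common ancestor's readings are taken as known, and the function names and labels are hypothetical. It is not the MVD-based computation the comment has in mind.

```python
def shared_departures(a, b, ancestor):
    """Sites where versions a and b agree with each other but not with the ancestor."""
    return [
        i for i, (x, y, z) in enumerate(zip(a, b, ancestor))
        if x == y and y != z
    ]

def diagnose(a, b, ancestor, siblings):
    """Apply checks (a) and (b) to one pair of versions.

    siblings=True  -> both versions derive directly from `ancestor` (check b)
    siblings=False -> the versions are not directly related (check a)
    """
    sites = shared_departures(a, b, ancestor)
    if not sites:
        return "no shared departures", sites
    if siblings:
        return "postulate lost common ancestor", sites
    return "possible contamination", sites

# Hypothetical toy tradition with four variation sites.
ancestor = ["a", "b", "c", "d"]
v1 = ["a", "x", "c", "d"]
v2 = ["a", "x", "c", "e"]
print(diagnose(v1, v2, ancestor, siblings=True))
```

Testing a routine like this against an artificial tradition, as the comment suggests, would show how often a shared departure is merely coincidental.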

  3. Great site! — and the new format is much nicer. Yasher Koah!
    So much to comment on, but related to what you’re discussing, I would question the assertion that text contamination has no biological parallel.
    http://en.wikipedia.org/wiki/Horizontal_gene_transfer#Importance_in_evolution

    • hlapin

      True, but the standard software packages do not make that assumption. It needs special processing as Desmond notes in his first comment.

  4. OK. What type of contamination are we discussing? From Tosefta, Amoraim? Before or after we’re assuming the Mishna was committed to writing? Some cases are going to be clear later additions, like at the ends of tractates, but many are going to be very difficult to identify. I’m hoping that your research will help identify more of those sections.
    In any case, it seems that the majority of the Mishna could be analyzed with algorithms used for biological phenotyping: for example, sections which appear in all manuscript versions of the Mishna, but with significant orthographic variation (assuming that these variations do not appear verbatim in parallels, which might indicate contamination).
    These sections (the majority) could be analyzed algorithmically and the “junk DNA” (l’havdil) could be thrown out. They will have to be analyzed separately using “higher” critical methods.
