Modelling spelling variation from electronic diplomatic transcripts
The presentation demonstrates the adequacy of N-gram model perplexity, a standard metric in natural language processing, as an objective similarity metric for Middle English spelling data, despite the lexical differences between texts. N-gram models have rarely been constructed for the variable spelling systems characteristic of Middle English, most likely because a successful model presupposes a sizable body of training data. The tradition has instead been for the researcher to assess similarity by visual, predominantly qualitative, comparison of the spelling forms of selected words collected from samples of texts. Diplomatic transcripts of longer medieval English texts are, however, increasingly becoming available in electronic form. Their arrival promises full models, optimised through smoothing and interpolation, as a basis for quantification and rigorous testing. Two examples, both relevant to textual studies, illustrate the adequacy of the perplexity metric. (1) A scribe’s spelling is always biased in the direction of his exemplars. This bias opens a window on the number of scribes behind the exemplars for a text executed in a single hand, when other factors such as authorship and poetic form are held constant. (2) Testing of models trained on a corpus totalling ten manuscripts demonstrates that initial position in the verse line regularly prompted scribes to suppress their tendency to introduce their own spelling forms in favour of replicating those encountered in their exemplars. The discussion attributes this behaviour to the operation of two mechanisms. One mechanism is psycholinguistic in origin, while the other is rooted in manuscript production and so implies a codicological dimension to spelling variation.
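As a minimal sketch of how perplexity can serve as a similarity score, the following Python trains a character trigram model on one transcript and scores another against it. It is an illustration under stated assumptions, not the models used in the presentation: simple add-k smoothing stands in for the smoothing and interpolation mentioned above, the function names are hypothetical, and the two sample strings are toy spellings chosen for the example, not transcriptions from any manuscript in the corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Yield (context, next_char) pairs; the start is padded with '^'."""
    padded = "^" * (n - 1) + text
    for i in range(n - 1, len(padded)):
        yield padded[i - n + 1:i], padded[i]


def train(text, n=3):
    """Count character n-grams and their contexts in a training text."""
    ngram_counts = Counter()
    context_counts = Counter()
    for context, ch in char_ngrams(text, n):
        ngram_counts[(context, ch)] += 1
        context_counts[context] += 1
    return ngram_counts, context_counts


def perplexity(model, text, alphabet, n=3, k=1.0):
    """Perplexity of `text` under `model`, using add-k smoothing.

    Assumption: add-k smoothing is a simplification chosen for brevity;
    `n` must match the order the model was trained with.
    """
    ngram_counts, context_counts = model
    v = len(alphabet)
    log_prob = 0.0
    count = 0
    for context, ch in char_ngrams(text, n):
        p = (ngram_counts[(context, ch)] + k) / (context_counts[context] + k * v)
        log_prob += math.log2(p)
        count += 1
    return 2 ** (-log_prob / count)


# Toy strings for illustration only (hypothetical, not manuscript data).
exemplar_text = "whan that aprill with his shoures soote"
copy_text = "whanne that april with hise schowres sote"

# The alphabet must cover every character that will be scored.
alphabet = set(exemplar_text) | set(copy_text)

model = train(exemplar_text)
print("self:", perplexity(model, exemplar_text, alphabet))
print("copy:", perplexity(model, copy_text, alphabet))
```

Lower perplexity means the test text's spelling is better predicted by the model, so the score can be read as a distance from the training text's spelling system. With strings this short the numbers carry no weight; the point of longer transcripts is to supply enough data for the estimates to be reliable.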
Jacob Thaisen is Associate Professor of Literacy Studies in the Department of Cultural Studies and Languages at the University of Stavanger. He has a PhD in medieval English, and scribal copying practices have been central to the projects in which he is or has been involved: in relation to textual transmission for the Canterbury Tales Project, in relation to sociolinguistic and pragmatic factors for the Middle English Scribal Texts programme, and in relation to cognitive processes for the Cognitive Processes in Copying a Text project [Copycat]. He is currently a research fellow at the Netherlands Institute of Advanced Studies, where he is pursuing a project on structured variation in script.