The Lampeter Corpus of Early Modern English Tracts:
Choosing Material to be Digitised
Background to the Lampeter Corpus
Entering a bookseller's shop in the seventeenth century, the British reader had an overwhelming range of subjects to choose from. Topics ranged from the dignity of the priesthood to the curative effects of mistletoe, from the prevalence of pestilence to the advantages of trade via Gibraltar. Equally, the present-day scholar wanting to study these books is faced with a wide range of material to analyse. Despite the destructive powers of time, thousands of books from printing's early centuries are still in existence. The Tract Collection, part of the Founders' Library at the University of Wales in Lampeter, contains 11,395 pamphlets (that is, non-fictional works smaller in size than books) published between 1520 and 1843. On a visit to the library in 1991, the linguistic scholar Josef Schmied saw the possibility of digitising the collection in order to exploit its rich potential for research in his field, and a team of scholars (Eva Hertl, Claudia Claridge and Rainer Siemund) was assembled to prepare the corpus. But a total of over 11,000 pamphlets was far too many to digitise. Before any project could commence, therefore, criteria had to be established to define the group of pamphlets that would eventually become the Lampeter Corpus. This article examines how these criteria were developed, illustrating both the practical and intellectual concerns informing their creation.

Figure 1 - The Lampeter Corpus Homepage
Choosing the Tract Collection
In effect, the Lampeter team carried out two selection processes during the formation of the Corpus. Firstly, there was the initial selection of the Tract Collection, the wide collection of pamphlets in the University of Wales Library (as distinct from groups of pamphlets held in other libraries); secondly, there was the task of choosing specific pamphlets from within the Tract Collection. This second task was a difficult one, as numerous intellectual concerns affected the team's choice. The initial selection of the Tract Collection, however, proved more straightforward because of various practical conditions that facilitated the process of digitisation, conditions that would assist any digitisation project.
Before the Corpus team took an interest, the Tract Collection had received little academic attention, and very few of the pamphlets had been re-published in the three hundred or so years following their initial printing. Perhaps even more importantly, working from the library collection allowed the Corpus team to transcribe an unmediated form of the documents, without secondary interpretation. This 'absolute authenticity', as the manual of information accompanying the Corpus calls it, was crucial in ensuring that the digitised copies of the pamphlets were as close as possible to the original documents. The library at Lampeter was also a generous host, permitting direct access to the collection; at other libraries, gaining access to such large numbers of fragile and semi-fragile texts would not be such an easy task. As ever with digitisation projects, pragmatic considerations such as these shape what is and is not done.
Choosing particular pamphlets
Yet having chosen the Tract Collection, the Lampeter team were still faced with a welter of available pamphlets. Criteria needed to be developed to reduce this number. Whereas practical circumstances had governed the choice of the Tract Collection as a whole, the Lampeter team now had to think of the particular needs of their own discipline. Which pamphlets would be the most interesting for linguists? Which particular pamphlets would yield the most effective results? Would these pamphlets adequately reflect the larger Tract Collection? Thus at this more precise level, selection was more of an academic than a digital concern.
The first step was to choose the chronological period between 1640 and 1740, one of the most interesting periods for those studying the development of the English language. By the mid-eighteenth century, codifications like Dr. Johnson's Dictionary or Bishop Lowth's Short Introduction to English Grammar marked the beginning of a much more homogenised English language. The preceding hundred years were thus the last period of a more diverse English. Charting this evolution from a diverse to a more standardised language is of great interest to linguists, not just in the project team, but in the discipline as a whole.
The type of document that was going to be digitised also had to be clarified. The pamphlets of the Tract Collection include many sub-genres, among them shorter writings by literary figures who have already been well studied. Thus miscellaneous works by Milton, Dryden, Defoe and Swift, amongst others, were excluded by the corpus team. Early newspapers, which have a format distinct from that of the pamphlet, were also omitted. By excluding literary output and journalistic reportage, the project was able to give a much more coherent shape to the Lampeter Corpus. For more practical reasons, the length of the pamphlets also had to be taken into consideration, especially as these documents can vary greatly in their word count. It was decided to set a minimum of 3,000 and a maximum of 25,000 words per document.
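The genre and length criteria described above can be sketched as a simple filter. This is a minimal illustration only, not the team's actual procedure: the Pamphlet record, the genre labels and the sample titles are hypothetical, and treating the 3,000 and 25,000 word limits as inclusive is an assumption.

```python
from dataclasses import dataclass

# Word-count limits stated in the article; treating them as inclusive is an assumption.
MIN_WORDS = 3_000
MAX_WORDS = 25_000

# Hypothetical genre labels for the two excluded categories
# (literary output and early newspapers).
EXCLUDED_GENRES = {"literary", "newspaper"}


@dataclass
class Pamphlet:
    """Hypothetical catalogue record for one item in the collection."""
    title: str
    genre: str
    word_count: int


def is_eligible(p: Pamphlet) -> bool:
    """Return True if the item passes the genre and length criteria."""
    return p.genre not in EXCLUDED_GENRES and MIN_WORDS <= p.word_count <= MAX_WORDS


# Invented sample records, for illustration only.
candidates = [
    Pamphlet("Hypothetical tract on trade", "pamphlet", 12_400),
    Pamphlet("Hypothetical short sermon", "pamphlet", 1_800),
    Pamphlet("Hypothetical news-sheet", "newspaper", 4_000),
    Pamphlet("Hypothetical long treatise", "pamphlet", 31_000),
]

eligible = [p.title for p in candidates if is_eligible(p)]
print(eligible)
```

Only the first sample record passes: the others are too short, of an excluded type, or too long.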
In order to trace accurately the evolution of English before its eighteenth-century standardisations, the Corpus needed to be constructed so that it could be considered an acceptable reflection of the larger (English) universe of printed pamphlets at that time. The Tract Collection had been a good collection to choose in this respect: one historian has noted that "it reflects the range of English pamphlet publishing from 1640 to 1730." The Lampeter team was quick to acknowledge, of course, that there could be no definitive reflection of publishing in the Early Modern period. Imperfection is implicit in such a selection process, but intelligent criteria minimise the extent of this imperfection. Subject and decade were felt to be two criteria that could perform this task.
Six subject fields were chosen which were felt to reflect the variety in British publishing in the era in question. These domains were Religion, Economy and Trade, Science, Law, Politics, and Miscellaneous. This division was based not only on the scholarly judgement of the team, but also on an existing categorisation of the Tract Collection and the work of previous scholars. This domain structure was to be intertwined with a decade structure. Each of the ten decades in the 1640-1740 period was to contain twelve texts, two from each of the domains, thereby creating a corpus of 120 texts.
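The arithmetic of this sampling frame can be checked with a short sketch. The domain names and the figure of two texts per domain per decade come from the description above; labelling the ten decades as the 1640s through the 1730s is an assumption made for illustration.

```python
from itertools import product

# The six subject domains named in the article.
DOMAINS = [
    "Religion",
    "Economy and Trade",
    "Science",
    "Law",
    "Politics",
    "Miscellaneous",
]

# Assumption: the ten decades are labelled by their first year, 1640 to 1730.
DECADES = list(range(1640, 1740, 10))

TEXTS_PER_CELL = 2  # two texts per domain in each decade

# Every (decade, domain) pair is one cell of the sampling grid.
cells = list(product(DECADES, DOMAINS))

texts_per_decade = len(DOMAINS) * TEXTS_PER_CELL
target_size = len(cells) * TEXTS_PER_CELL

print(len(DECADES), len(cells), texts_per_decade, target_size)
```

Ten decades and six domains give sixty cells; two texts per cell give twelve texts per decade and 120 texts in all.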

Figure 2 - Some Pamphlets in the Science Domain
Parameters such as domain and decade do, in one way, restrict the amount of data available to the researcher. They also facilitate research, however, allowing scholars to ask questions not only of the broader picture, but of sub-divisions within it. So, for instance, a researcher working on the Lampeter Corpus can compare a fiery tract in the Religion domain with a more sober reflection from the Science field, or pamphlets written under the Cromwellian Protectorate in the 1650s with those of the Restoration in the following decade. If chosen wisely, criteria can help develop a researcher's line of questioning.
A suitable total length for the corpus was a further criterion. The Lampeter team wanted a corpus of around a million words in total, a figure which previous experience had indicated was high enough to give reliable answers to most linguistic questions. While the total was obviously difficult to estimate at the outset, it was found that the 120 texts selected came to just over a million words, and therefore neither additions nor subtractions needed to be made.
The final criteria were introduced so as to encourage diversity in the Corpus. The two texts from the same subject in the same decade were on different themes. No author featured twice. There was a preference for acknowledged rather than anonymous authors, so those working on the texts would find it easier to position them in a particular historical framework. (Those using the Lampeter Corpus can also find biographical information on the authors provided by the team.)
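These diversity rules lend themselves to a mechanical check. The sketch below is an illustration under stated assumptions rather than the team's method: the Text record, its field names and the sample entries are hypothetical, and anonymous works are represented by an author of None so that they are not counted as one repeated author.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Text:
    """Hypothetical record for one selected pamphlet."""
    decade: int            # first year of the decade, e.g. 1650
    domain: str
    author: Optional[str]  # None stands for an anonymous work
    theme: str


def diversity_problems(texts: List[Text]) -> List[str]:
    """List violations of the two hard rules: no author twice, and
    different themes for texts sharing a decade and domain."""
    problems = []

    # Rule 1: no named author may appear more than once.
    author_counts = Counter(t.author for t in texts if t.author is not None)
    for author, n in sorted(author_counts.items()):
        if n > 1:
            problems.append(f"author appears {n} times: {author}")

    # Rule 2: texts in the same decade and domain must differ in theme.
    themes_by_cell = defaultdict(list)
    for t in texts:
        themes_by_cell[(t.decade, t.domain)].append(t.theme)
    for (decade, domain), themes in sorted(themes_by_cell.items()):
        if len(set(themes)) < len(themes):
            problems.append(f"repeated theme in {decade}s {domain}")

    return problems


# Invented sample selection, for illustration only.
sample = [
    Text(1650, "Science", "Author A", "medicine"),
    Text(1650, "Science", "Author B", "astronomy"),
    Text(1660, "Religion", "Author A", "church government"),
    Text(1660, "Religion", None, "church government"),
]

for problem in diversity_problems(sample):
    print(problem)
```

The preference for acknowledged over anonymous authors is a soft criterion and is deliberately left out of the check.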
Once specific criteria had been considered and decided upon, selecting the individual pamphlets to make up the 120 texts became a more straightforward process. For some of the topics there was little room for manoeuvre: there were, for example, very few science tracts in certain decades, so the choice was almost made automatically for the team. In other fields, where there was a greater number of pamphlets, choice was partially dependent on content (with a slight preference for more interesting texts), and partially dependent on the title correctly describing the pamphlet's contents.
Re-using the Lampeter Corpus
The Lampeter Corpus was designed by linguists with the research needs of linguists in mind. This, however, does not rule out scholars in other disciplines making use of the digitised pamphlets. An additional stipulation for a pamphlet's entry into the Lampeter Corpus, in fact one that team member Claudia Claridge called "potentially the most important one", was that each pamphlet had to be a complete source. This meant that texts were excluded if they were not a first edition, were taken from another source, or were incomplete in some way (e.g. missing pages, paragraphs edited out). When added to the earlier criterion of a pamphlet being digitised first-hand, it ensured that each individual text in the Corpus would remain a trusted and complete primary source. Dedications, addresses and appendices were all retained. While the pamphlets themselves were selected with the needs of linguists in mind, the content of each digitised pamphlet is presented in such a way that it remains a legitimate primary source for any scholar.
Achieving this goal is more than a matter of selecting complete pamphlets. The process of converting the documents to electronic form must be accompanied by a thorough mark-up procedure. This involves, during the process of digitisation, marking titles, instances of formatting, quotations, unusual grammatical incidents etc., so the researcher can understand, via her screen, how the text was presented in the original print copy.
With help from the Oxford Text Archive, the Lampeter team used the Standard Generalised Markup Language (commonly known as SGML) to prepare the Corpus in such a fashion, thus producing a reliable digital version of the printed original. Respecting the integrity of an original text maintains the breadth of possible uses when it is presented in digital format. Thus scholars from disciplines outside linguistics can make good use of the content of the Lampeter Corpus. And while the Corpus is unlikely to provide all the primary data required for, say, an historian's research, its digital content can still be used alongside primary material resting in other libraries.
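To illustrate what mark-up of this kind makes possible, the sketch below parses a small invented fragment and recovers its title, its quotations, its italicised passages and its plain running text. Everything in the fragment is an assumption made for illustration: the element names (text, head, p, quote, hi) and the rend attribute are hypothetical and are not claimed to be the Corpus's actual tag set, and the fragment is written in well-formed XML-style syntax only so that Python's standard library can read it.

```python
import xml.etree.ElementTree as ET

# Invented fragment with hypothetical element names; not taken from the Corpus.
SAMPLE = """<text>
  <head>A Hypothetical Discourse of Trade</head>
  <p>The author cites <quote>a borrowed saying</quote> and stresses
  <hi rend="italic">certain words</hi> in the printed page.</p>
</text>"""

root = ET.fromstring(SAMPLE)

# The marked-up title of the pamphlet.
title = root.findtext("head")

# Passages marked as quotations in the original.
quotes = [q.text for q in root.iter("quote")]

# Passages that were set in italic type in the original.
italics = [h.text for h in root.iter("hi") if h.get("rend") == "italic"]

# The running text with all mark-up removed and whitespace normalised.
plain = " ".join("".join(root.find("p").itertext()).split())

print(title)
print(quotes)
print(italics)
print(plain)
```

The same encoded text therefore serves both the linguist, who may want the plain running words, and the scholar interested in how the page was originally presented.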
The digitisation of the Lampeter Corpus was informed by practical and academic issues. The 'convenience' of the Tract Collection (its availability and its size for instance) facilitated its conversion to digital format. From then on, it was largely academic issues that determined which pamphlets were to be digitised. Criteria were developed which created a trustworthy sample of the period in question, as well as establishing a forum for questions germane to the needs of linguists. And because this was accompanied by intelligent mark-up, the Lampeter team created a resource that can bear fruit not only for linguists but for any interested scholar.
Many thanks to Claudia Claridge for her help in the preparation of this case study.
Footnotes
1. "Life is ruled and governed by opinion": The Lampeter Corpus of Early Modern English Tracts. Manual of Information (unpublished manuscript), compiled by Claudia Claridge, 3.
2. The Lampeter Corpus of Early Modern English Tracts. Manual of Information, 1.
3. Email correspondence with Claudia Claridge, May 2000.
4. The mark-up strategy is documented in Chapter 3 of the Manual of Information. Further information can also be found in Lou Burnard's essay Encoding the Lampeter Corpus.
Links and Bibliography
Claridge, Claudia. Forthcoming 2000. Multi-word Verbs in Early Modern English. A Corpus-based Study. Amsterdam: Rodopi.
Schmied, Josef, Claudia Claridge, and Rainer Siemund. Unpublished manuscript. "Life is ruled and governed by opinion": The Lampeter Corpus of Early Modern English Tracts. Manual of Information.
Burnard, Lou. Encoding the Lampeter Corpus - http://users.ox.ac.uk/~lou/wip/
Siemund, Rainer, and Claudia Claridge. 1997. "The Lampeter Corpus of Early Modern English Tracts." ICAME Journal 21, 61-70, http://icame.uib.no/ij21/.
The Oxford Text Archive - http://ota.ahds.ac.uk/. To reach the Lampeter Corpus, type in 'Lampeter' in the Quick Search box on the Archive's homepage. Visitors will also find information on the Standard Generalised Markup Language (SGML).