The Newton Project:
Implementing and Exploiting XML

Content written on 7th November 2005 by Alastair Dunning
Introduction

One of the advantages of digitisation is that it can allow for new perspectives on familiar themes. The cheaper publication costs of digitisation allow for neglected documents to be disseminated and in turn this can promote fresh interpretations that challenge received ideas.

This has been one of the inspirations of the Newton Project, based at Imperial College London. Isaac Newton is of course well known as a scientist and a mathematician. The project aims to broaden the understanding of Newton by creating digital editions of his theological manuscripts. Many of Newton's intellectual preoccupations were not with what the twenty-first century labels science, but with areas such as alchemy and theology. The Newton Project aims to focus attention on the manuscripts dealing with these themes, giving contemporary scholars the opportunity to absorb these hitherto little-known documents and synthesise them with existing interpretations of Newton's life and work.

The project has twice been awarded funding by the Arts and Humanities Research Council, first in 1999 and then in 2004, together worth over £800,000. Readers can visit the project website at http://www.newtonproject.ic.ac.uk and explore both visual and transcribed versions of the manuscripts, in both diplomatic and normalised forms.

The project also makes a worthy case study because of the way it has evolved into a successful undertaking. This article highlights two overlapping themes which have contributed towards this success. The first is the importance of rich XML (eXtensible Mark-up Language) encoding in producing a suitably professional digital edition of the manuscripts, one which not only satisfies the usual standards of academic rigour but is flexible enough to allow a wide range of intellectual questions to be asked of it. The second, related theme is that exploiting such tagging in a consistent, scholarly fashion requires time and experimentation - it is not, the experience of this project indicates, something that can be done immediately at the outset.

Background to the project

Although the project received its first tranche of funding in 1999, work actually began a couple of years before this. Stephen Snobelen, currently a member of the editorial board, and Rob Iliffe, the project's current director, began transcribing the manuscripts into Microsoft Word for a possible print edition.

As the Internet evolved, it quickly became apparent that other methods of dissemination were feasible, and Rob Iliffe began considering options for delivering the manuscripts on the web.

Once the project began to evolve, many issues relating to disseminating the manuscripts on the web came to the fore. It was obvious, for instance, that it would not be sufficient to place only digitised images of the manuscripts on the web. Not only would users be unable to search for specific terms or concepts, but the images alone would simply not be user-friendly. The manuscripts themselves are difficult to read; reading digital surrogates on screen would be trickier still.

The project therefore needed to progress with developing textual transcripts of the manuscripts but to do so in a way that both reflected their complexity and catered for scholarly needs for searching and analysis.

The 1999 funding bid to the AHRB (the Arts and Humanities Research Board, predecessor of the AHRC) therefore outlined the possibility of exploiting XML, and team member John Young, who had some basic experience in mark-up, was brought in to start tagging the digitised documents.

Rich mark-up
Image from Isaac Newton's post-1710 Seven Statements on Religion. Source: King's College Library, Cambridge, England: Keynes Ms. 6, http://www.newtonproject.ic.ac.uk/web_keynes/k0006.html

Above is an example of one of Newton's manuscripts. Confronted with such a manuscript, the transcribers had immediate decisions to make about the level of mark-up to be incorporated into the digital version. Should deletions be included? Obvious spelling mistakes? How should old characters be represented, such as the A and E welded together (Æ)?

Less obvious issues also needed consideration. Sometimes Newton would replace terms in his papers, implying a subtle shift in concept. For example, in one manuscript, Newton strikes a line through his use of the word Jews and replaces it with Israel - a small but significant change to his argument. In other places, the alterations were seemingly more mundane, such as changes between modern and older English spellings (for example, Newton switches between using honor and honour).
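To make the idea concrete, here is a minimal Python sketch of how such a substitution might be recorded and then read back in two ways. It assumes the TEI-style elements del (deleted text) and add (added text); the sentence around the two words is invented for illustration, and the project's own Newton DTD may encode such changes differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the words "Jews" and "Israel" come from the
# case study; the surrounding sentence is invented for illustration.
FRAGMENT = "<p>the laws given to <del>Jews</del> <add>Israel</add> by Moses</p>"


def render(elem, diplomatic):
    """Return the text of elem, treating deletions according to the view."""
    if elem.tag == "del":
        # Diplomatic view keeps the deletion, visibly marked;
        # normalised view drops the deleted words altogether.
        body = f"[deleted: {''.join(elem.itertext())}]" if diplomatic else ""
    else:
        body = (elem.text or "") + "".join(render(c, diplomatic) for c in elem)
    # Tail text follows the element's closing tag in the running text.
    return body + (elem.tail or "")


def reading(xml_text, diplomatic):
    """Render a fragment and collapse any doubled whitespace."""
    return " ".join(render(ET.fromstring(xml_text), diplomatic).split())


print(reading(FRAGMENT, diplomatic=True))
print(reading(FRAGMENT, diplomatic=False))
```

The point of the sketch is that one encoded transcription can yield both readings; nothing is lost by recording the deletion, whereas a transcription that silently kept only the final wording could never recover it.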

Any type of digitisation project has decisions to make about the level of detail of data capture. This applies not just to textual projects (where it is the quantity of mark-up that can vary), but also to imaging projects (where the resolution and quality of the digitised image, amongst many other variables, can vary) and to database projects (where the extent of the data model built to house the original source can vary).

The Newton Project's overall policy has very much been one of rich encoding. The team considered it vital to record every textual aspect of the manuscripts. Their rationale was that the project was seeking to create a definitive edition of a primary source. And considering that scholarly editions of such works are only going to be produced, say, every fifty years, it seemed obvious to dedicate time to ensuring the editions are of such a quality that users are not obliged to return to the physical source to continue their research.

Team member John Young points out that marking up even seemingly mundane aspects of the manuscripts is a vital ingredient in creating such a definitive resource. Systematic analysis of minor changes in the manuscripts could shine light on larger issues. For example, Young wonders if changes in spelling (such as from honor to honour) signalled Newton's anxiety about his roots - "Is the country boy trying to get rid of his schoolboy spelling habits and use the 'proper' Latin form?" Elsewhere, Young comments on more avenues for exploration that analysis of the marked-up texts could open up. He has noted that when Newton makes any mistakes in Biblical quotes or references, he "almost invariably didn't just cross the mistake out but scribbled over it comprehensively. Cross with himself, or ashamed of himself, for misquoting God? It's the sort of question you could gather the data for by searching for 'Bible references in word strikethrough' and 'Bible references cancelled' and comparing them."
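The kind of comparison Young describes could be gathered along the following lines. This is only a sketch under stated assumptions: the type values on del ("strikethrough" for a simple line, "over" for comprehensive overwriting) and the ref element with type="bible" are hypothetical names chosen for the example, and the sample passages and references are invented rather than taken from the manuscripts.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data; tag and attribute names are assumptions.
SAMPLE = """<text>
  <p>as it is written in <del type="over"><ref type="bible">Dan. 7.24</ref></del>
     <add><ref type="bible">Dan. 7.25</ref></add></p>
  <p>the <del type="strikethrough">beast</del> <add>dragon</add> of
     <ref type="bible">Rev. 12.3</ref></p>
  <p>see <del type="over"><ref type="bible">Matt. 24.14</ref></del></p>
</text>"""


def cancelled_bible_refs(xml_text):
    """Count Bible references found inside each kind of deletion."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for deletion in root.iter("del"):
        kind = deletion.get("type", "unspecified")
        refs = [r for r in deletion.iter("ref") if r.get("type") == "bible"]
        # Adding zero still records that this kind of deletion occurred.
        counts[kind] += len(refs)
    return counts


counts = cancelled_bible_refs(SAMPLE)
for kind, total in sorted(counts.items()):
    print(kind, total)
```

A real investigation would also need the total number of deletions of each kind, so that Bible references can be compared against the general rate of overwriting; the sketch does not handle deletions nested inside other deletions, which would be counted twice.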

It is incorporating rich mark-up into the digital texts that allows users of the digital manuscripts to explore these questions. This type of flexibility can only come from rich encoding.

The TEI (Text Encoding Initiative) Guidelines

The Newton Project chose the guidelines suggested by the Text Encoding Initiative (TEI) as its way of exploiting XML. The TEI provides a set of guidelines indicating how materials can be marked up for scholarly use. It is particularly useful where a project is pursuing a rich encoding strategy, as it provides detailed advice on how to mark up various idiosyncrasies and peculiarities located in a source text.

Using the TEI provides several other advantages. It is developed by the academic community and is therefore well equipped to deal with the needs of creating scholarly editions; because it is a commonly adopted standard it is used by other digitisation projects, making it easier for the fruits of such projects to be interoperable (this is particularly important if you want your texts to be searchable as part of a larger corpus of material, for example to allow linguists to study philological trends); and it provides a common framework for technical staff to build applications exploiting the marked-up text. There are contextual advantages as well - an international user community can provide help with tricky issues and give support in other areas of the project.

But the project team was at pains to point out that while 100% compliance with the TEI guidelines is an ideal aim, it is difficult to execute in practice. One particular issue the project faced was in representing the alchemical symbols that Newton deployed within his manuscripts, an issue made more complex by the fact that he gave them different functions. Sometimes they could refer to the physical substance that they commonly denoted during Newton's time. In other instances, Newton employed the symbols as note markers in the text, referring the reader to another point on the page. Such symbols do not form part of the Unicode character set in which the project was transcribing the text, so the project had to develop other tags, outside the TEI guidelines, to represent them. The project initially found full TEI compliance problematic in other respects, particularly with regard to nesting restrictions on certain tags, and produced its own variant Newton DTD (Document Type Definition), an XML-compliant set of rules that circumvented these issues. However, the TEI guidelines have themselves been comprehensively revised over the last few years, largely in response to the issues raised by user groups such as the Newton Project, and the team is currently working on a new, expanded DTD which will be fully compatible with the latest TEI methodology.
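As an illustration of the symbol problem, the Python sketch below uses a purely hypothetical sym element whose role attribute distinguishes a symbol standing for a substance from one used as a note marker. The element name, its attributes and the sample sentence are all assumptions made for this example; they are not the tags the Newton DTD actually defines.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical encoding: <sym> is an invented element, not a Newton DTD tag.
SAMPLE = """<p>dissolve the <sym name="mercury" role="substance"/> in the water
<sym name="mercury" role="note-marker"/> and add <sym name="sulphur" role="substance"/></p>"""


def symbol_uses(xml_text):
    """Tally each (symbol, role) pair so the two functions can be compared."""
    root = ET.fromstring(xml_text)
    return Counter((s.get("name"), s.get("role")) for s in root.iter("sym"))


def fallback_text(xml_text):
    """Plain-text rendering for a flat paragraph, naming symbols in brackets."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "sym":
            name = child.get("name")
            if child.get("role") == "substance":
                parts.append(f"[{name}]")
            else:
                parts.append(f"[note mark: {name}]")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


print(symbol_uses(SAMPLE))
print(fallback_text(SAMPLE))
```

Recording the symbol's identity and its function separately means the same glyph can be displayed, searched for, or skipped depending on what it is doing at that point in the text.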

Delaying transcription

The process for defining and executing the tagging has had a long gestation period. The early taggers had little in-depth experience of XML, but more importantly the full range of issues associated with marking up the manuscripts was not appreciated. The team therefore needed time to work through the manuscripts and see how the written word could be represented in digital form, as happened in the case of the alchemical symbols cited above.

The project warned that commencing the project with an "army of transcribers would have been a disaster". The project gained much from starting off with just one or two of the team playing around with the conceptual problems faced in representing the manuscripts in digital form. There was no official pilot project, but these initial two years of work served much the same purpose, allowing the team to develop the Newton DTD mentioned before.

Besides creating this DTD, it has been important for the project to develop a larger policy on guidelines for tagging and for the actual process of transcription (www.newtonproject.ic.ac.uk/images/pdfs/techspec.pdf). Numerous team members, working at different times and in different places, have been involved in the process of transcription. This involves not only adding the mark-up but actually reading the manuscripts and typing in the text. The transcription policy provides a full set of procedures on how team members should deal with each aspect of Newton's manuscripts. The policy is, according to John Young, "effectively a plain English version of the Newton DTD". Transcribers are informed what tags to use when there are any unexpected items in the text. It also tells taggers how to deal with mistakes in punctuation and spelling (in the original), capitalisation, abbreviations and more unusual features, such as changes in word direction (Newton sometimes wrote vertically along margins), use of different languages (Greek or Hebrew), alterations and deletions.

As with the DTD, it has evolved over many different versions, gradually developing according to discoveries made by team members when tagging up the texts. But it has been absolutely vital for the project to compose such a document, both as a comprehensive reference guide for project contributors and as an in-depth guide for later users of the resource.

Exploiting the mark-up

As mentioned above, one of the aims of the Newton Project is to allow users to exploit the rich encoding so that they can ask particular scholarly questions of the manuscripts. John Young comments that the project's guiding ethos "is that it shouldn't be up to us to decide on our users' behalf what they may or may not want to look for", and he continues that the project aims "to provide them [users] with the means to investigate whatever they choose". Young points to two broad types of analysis which the Newton Project could facilitate. One is 'text-historical', allowing users to trace how a particular document was modified over time, not just by Newton but also by later scribes who edited and glossed. The other is philological, allowing users to question how punctuation and spelling changed over time.

The desire to implement such an online system, where users can execute such queries via the project website, has made the project acknowledge an important point. The greater the range of functionality planned for an online resource, the greater the need for the technical expertise that can manage such a process.

As happens with many digitisation projects, particularly those with ambitious aims for making their material available on the Internet, the project did not originally factor a technical manager into its budget. Until this was rectified, too much of a burden was placed on non-technical staff to perform technical operations within the project. And while the staff within computing services at Imperial College were helpful in installing software and dealing with hardware and security issues, like most institutional computing services they are not mandated to help with the day-to-day technical issues that the Newton Project is dealing with.

However, once technical officer Mike Hawkins was brought on board, many of these difficulties eased. Some of his tasks, while not too technically demanding for him, eased the workload of others - maintaining and updating the website, ensuring files were uploaded, and testing that manuscripts were accessible through different platforms and browsers. Other more complex applications facilitated delivery of the digital resources; one recent tool allowed for the automated creation of JPEGs suitable for web delivery, derived from the TIFF originals. Mike Hawkins also pointed out that having a technical officer allowed the team to play with new ideas and open up new avenues for exploiting the electronic manuscripts.

But it is some recent developments that give an idea of the more advanced work done by the technical manager.

Currently, the website offers users two different views of each manuscript, one normalised, one diplomatic. As explained earlier, the project wishes to expand this to allow users to see customisable views which reveal different facets of the "material culture or scribal history of each document" - additions, deletions, corrections etc. One pertinent example is in providing users with views that indicate where supplementary hands made changes or additions to the manuscript. It will therefore be possible to see immediately what Newton wrote and what others wrote.

Mike Hawkins explains further: "What we are ideally aiming for is options as complex as 'show me all the text in primary hand a with alterations in hand b' or 'show me all the text in hand a and all alterations except those by hands q, r or s'. Thus, these toggles could be used not only to highlight the importance of certain individuals but also to remove textual features not relevant to the given investigator. So readers could choose to view all the editorial changes by Newton but not those in other hands, thus recreating the text as Newton wrote it. Or they could highlight only the changes made by other hands, providing insight into scribal culture - what activities Newton preferred to do himself and what he was prepared to delegate."
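The toggles Hawkins describes can be sketched in a few lines of Python. The example assumes that each addition carries a hand attribute identifying who wrote it (the TEI add element allows this), handles additions only, and uses an invented sentence; the function name and its parameters are hypothetical and say nothing about how the project's own delivery system is built.

```python
import xml.etree.ElementTree as ET

# Invented sample: primary text in hand "a", additions in hands a, b and q.
SAMPLE = ('<p hand="a">The prophecy <add hand="a">of Daniel</add> is plain'
          '<add hand="b"> to the learned</add><add hand="q"> [checked]</add>.</p>')


def view(xml_text, show_hands=None, hide_hands=()):
    """Render the text, keeping only additions by the selected hands.

    show_hands=None means every hand is shown; hide_hands always wins.
    Additions without a hand attribute are hidden once a filter is given.
    """
    def visible(elem):
        hand = elem.get("hand")
        if hand in hide_hands:
            return False
        return show_hands is None or hand in show_hands

    def render(elem):
        body = ""
        # The primary text is always kept; only additions are filtered.
        if elem.tag != "add" or visible(elem):
            body = (elem.text or "") + "".join(render(c) for c in elem)
        return body + (elem.tail or "")

    return " ".join(render(ET.fromstring(xml_text)).split())


print(view(SAMPLE))                      # every hand
print(view(SAMPLE, show_hands={"a"}))    # the text as hand a left it
print(view(SAMPLE, hide_hands={"q"}))    # everything except hand q
```

The second call corresponds to "recreating the text as Newton wrote it", on the assumption that hand a is Newton's; the third corresponds to showing all alterations except those by a named hand.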

Obviously, this is all made possible by rigorous attention to marking up such details during the transcription. Without it, it would not be possible to question the digitised manuscripts at such granularity. But developing such a system is not just a matter of adding the appropriate mark-up. It requires significant technical skill to allow for the preparation and delivery of such customisable texts on the project website.

But locating the necessary staff to carry out such work is perhaps even more difficult. The project was at pains to point out the importance of Mike Hawkins having both technical expertise and academic interest in the project. He could understand the scholarly aims of the project and the potential demands of users and make sure these were part of a successful technical implementation. This has been vital to the development of the online manuscripts and the effort to exploit the rich text mark-up. The Newton Project attributes much of this to the fact that it avoided a 'marriage of ignorance' where computing specialist and academic historian fail to communicate their ideas to one another.

Conclusion

The Newton Project is now continuing work with its second tranche of AHRC funding, and hopes to conclude in 2009. The team estimates that, as of 2005, it has digitised 40% of the approximately 2.5m words in the theological papers. But its aims go beyond simple digitisation. Explanatory material is being added to the website, as well as digital versions of out-of-print monographs relating to Newton, and some more modern work relating to early modern science and religion. The team also hopes to add a second strand of mark-up, relating to the content rather than the structure of the manuscripts. It is this prospect which excites the team and allows them to expand their horizons even further. By marking up Newton's excerpts from the Bible or writings of the Church Fathers, the project will be able to provide automated links to these sources, enriching and expanding the possibilities for scholarly analysis. It is with techniques such as these, the team believes, that the project can fully exploit the digital nature of its creations.
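A rough idea of how content mark-up could drive automated links is sketched below in Python. It assumes a hypothetical ref element with type="bible" and an n attribute holding a normalised citation; the base address is a placeholder, not a real service, and the sample sentence is invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Placeholder address for an online Bible text; purely an assumption.
BASE_URL = "https://example.org/bible/"

# Invented sample; the ref element and its attributes are hypothetical.
SAMPLE = ('<p>the number of the beast '
          '<ref type="bible" n="Rev 13:18">Apoc. 13.18</ref> is named</p>')


def bible_links(xml_text, base_url=BASE_URL):
    """Return (text as written, link target) for each marked Bible reference."""
    root = ET.fromstring(xml_text)
    links = []
    for ref in root.iter("ref"):
        citation = ref.get("n")
        if ref.get("type") == "bible" and citation:
            # The manuscript wording is kept for display; the normalised
            # citation, percent-encoded, forms the link target.
            links.append(("".join(ref.itertext()), base_url + quote(citation)))
    return links


for written, target in bible_links(SAMPLE):
    print(written, "->", target)
```

Keeping the normalised citation in an attribute leaves Newton's own wording untouched in the transcription while still giving software something consistent to link from.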

Many thanks to Mike Hawkins, Rob Iliffe and John Young for their help in putting this case study together.