Plotting a Course in Database Creation:
The Gloucester Port Books Database
Plotting a Course - Introduction
Customs officials in Early Modern England were enthusiastic recorders of the ships, cargoes and merchants that travelled between domestic ports up and down the country's major waterways. By noting the hugely varied shipments that were being transported through the country (bottled ales, tanned leather and horseshoe moulds for example), they could ensure that the due taxes were being paid, and that goods were not being smuggled abroad. They could also ensure that merchants and ship-owners were not transgressing the dictates of various Navigation Acts.
The officials working at the busy port of Gloucester, the headport of the River Severn, were no exception to this, and during the early modern period, they created a large set of administrative records. These coastal Port Books, now housed at the Public Record Office, form a valuable historical resource, recording the trade on the major navigable river in pre-industrial England. They feature details that would interest, amongst others, the economic, the social and of course the maritime historian. They primarily record voyage sources or destinations, names of ships, masters and merchants, and the cargo of each ship that passed through or docked at Gloucester. The Gloucester Port Books also form an historical document that is ripe for digitisation. A source that contains over 37,000 voyages and is recorded in a format with a fair degree of uniformity is something that can be meaningfully represented in a digital format.
The project to develop the Gloucester Port Books Database (1575 - 1765) was undertaken by a team of historians based at the University of Wolverhampton in the School of Humanities and Social Sciences. To be able to exploit a digital edition of the Port Books, they needed to chart a course from interpreting and transcribing the primary sources to constructing and completing the database - going from often tattered early modern manuscripts to the tabular display in Figure 1. This case study looks at how the Port Books team navigated this course.

Figure 1 - A table from the database showing some records for the port of Gloucester in 1700
Setting Sail - Defining Structure
Once one starts modelling an historical source in electronic form, it is difficult to make fundamental changes. As the History Data Service Guide to Good Practice Digitising History advises, the initial database design "ultimately influences whether the database can be used effectively by both the project itself and further researchers." The structure that pins the database together is the skeleton that holds all the data in place. It is therefore essential to plan the original structure intelligently - something that the Gloucester Port Books team achieved. This can be a complex process, and as the bibliography to Digitising History attests, much has been written detailing how best to go about this. The case study here gives a brief indication of the nature of the task.
Any database must feature a core entity, the unique element around which the rest of the data revolves. At the very beginning of the project, the Port Books team had a choice between their core entity being each voyage or each record in the Port Books - a single voyage could have two or even three separate entries in the Port Books. For example, one journey made in 1699 by the ship William from Worcester carried, amongst other goods, cider, hops, salt, (unidentified) hair and wool - the record is depicted in Figure 2. This results in two entries, as port officials recorded wool cargoes separately. With examples such as this in mind, the Port Books team decided not upon each voyage as the entity, but each separate record. While a database organised via voyage might have made quantitative analysis a little easier, it would have created a much more confusing database - each port book entry had one related merchant, but the cargo on each particular voyage could have been the cargo of different merchants. Choosing the voyage as the core entity, with several merchants for each voyage, would have required a more complex structure. Choosing the port book entry as the core entity was also easier for cataloguing reasons. Each separate record could be given a unique identifier related to the Public Record Office's accession and folio numbers; this would be more difficult for voyages that have records on two separate manuscripts. When the Gloucester project did come to change their database in the 1990s, from a rudimentary flat-file database (where all the data is held in one single table) to a more sophisticated relational database (where the data is held in several interlinking tables), having designed the database in a coherent way made the experience a more fluid one. It was not a seamless process - changing database systems rarely is - but one that was possible because of the earlier design of the database.

Figure 2 - Details of the ship William, which was checked by customs officials in 1699. The cargo includes one bag of hair.
There were also sound historical reasons for adopting each record as the fundamental entity. In investigating the Port Books, the team could not always be sure if two separate entries actually referred to the same voyage. If they combined these two records into one, they could not be sure of its historical validity, i.e. whether the two were part of the same voyage. As the team wanted the Gloucester Port Books Database to avoid an unnecessary amount of contemporary interpretation, they chose not to squeeze two distinct port book entries into one database record.
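The choice of the port book entry as the core entity can be illustrated with a small relational sketch. This is not the project's actual schema: the table names, column names and placeholder identifiers below are assumptions made for illustration, using Python's built-in sqlite3 module. Only the ship, year, origin and commodities come from the William example above.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with one row per port book entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entry (
            entry_id TEXT PRIMARY KEY,  -- hypothetical: would derive from accession and folio numbers
            year     INTEGER,
            ship     TEXT,
            origin   TEXT,
            merchant TEXT               -- one merchant per entry
        );
        CREATE TABLE cargo (
            cargo_id  INTEGER PRIMARY KEY,
            entry_id  TEXT NOT NULL REFERENCES entry(entry_id),
            commodity TEXT NOT NULL
        );
    """)
    # Two entries for what may be a single voyage: wool was recorded separately.
    # Identifiers are placeholders; merchant names are not given in the case study.
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?, ?, ?)",
        [("DEMO-1", 1699, "William", "Worcester", None),
         ("DEMO-2", 1699, "William", "Worcester", None)],
    )
    conn.executemany(
        "INSERT INTO cargo (entry_id, commodity) VALUES (?, ?)",
        [("DEMO-1", "cider"), ("DEMO-1", "hops"), ("DEMO-1", "salt"),
         ("DEMO-1", "hair"), ("DEMO-2", "wool")],
    )
    return conn

def entries_for(conn, ship, year):
    """Return {entry_id: [commodities]} without merging entries into voyages."""
    rows = conn.execute(
        "SELECT e.entry_id, c.commodity FROM entry e "
        "JOIN cargo c ON c.entry_id = e.entry_id "
        "WHERE e.ship = ? AND e.year = ? ORDER BY e.entry_id, c.cargo_id",
        (ship, year),
    ).fetchall()
    grouped = {}
    for entry_id, commodity in rows:
        grouped.setdefault(entry_id, []).append(commodity)
    return grouped

if __name__ == "__main__":
    print(entries_for(build_demo(), "William", 1699))
```

Keeping the two entries as separate rows leaves the question of whether they belong to one voyage to the researcher, who can still group them at query time by ship, year and origin.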
Besides defining the structure of the database, the Port Books team also had to decide how much of the port book data they wished to see represented in the database. Would knowledge of shipowners' names improve the quality of historical analysis? Do historians need to know about the movement of very small quantities of cargo? Any data that is going to be analysed needs a separate field within the database. Realising that the Gloucester Port Books Database could be a valuable resource for historians of different methodological hues, the team acknowledged the importance of creating a resource where the data still offered the full richness of interpretation. Some historians would be interested in executing quantitative analyses to get a broad idea of the ebb and flow of trade in the port; others would be more interested in linguistic examination of the terms used for describing cargo loads. It was therefore necessary to hold on to as much of the original data as possible.
The data recorded in the first version of the database had focussed only on the names of ships and their places of origin and destination. Data relating to the ships' cargoes was recorded but lumped together in one large 'bucket' field. When the database was transferred to the relational system, the team could include the data from the 'bucket' field in distinct tables. Additionally, other information (say on the town of residence of the merchant) only appearing in a few records could then get a rightful additional field - this allowed the team to present interesting idiosyncratic evidence. Recording what seemed to be extraneous data in the first instance reaped benefits when the Port Books team could make use of a more powerful database.
Continuing the Course - Transcribing data
Having decided upon a structure for their database, the Port Books team needed to fill it with data. They were adamant that this would be a comprehensive database, having rejected the historical validity of creating only a sampled version of the Gloucester Port Books. But the team realised that cataloguing the 37,000 plus records would require a fair degree of effort. Amateur historians, working in local history groups dotted around towns and villages near the River Severn, were the first to be recruited to help with transcription. The Port Books team noted that while this was not an ideal solution, there was no other method, given the time and resources available to the project, of getting to grips with such an enormous quantity of data. The database team provided the local history groups with the relevant historical documents and instructions for how this transcription was to proceed. These instructions detailed what needed to be transcribed and also the procedure for standardising the data in the Port Books. In addition, the team emphasised to the transcribers that if there were ambiguous entries they were unsure how best to standardise, they should simply enter the data in its original form.
The collected data was then passed on to the University of Wolverhampton, where the team was fortunate in being able to make use of a Data Preparation Department to convert the transcribed data to digital form. Once the data was in electronic form, the Port Books team set about checking it with their more professional eye. In some cases, they retranscribed some of the sources themselves (particularly those including the use of Latin or those involving a degree of illegibility, whether because of tricky calligraphy or physical condition). A systematic check was made on all commodity terms and units of measurement that appeared fewer than three times in the whole database or did not appear in pre-1800 sources cited by the Oxford English Dictionary. The team also made some random checks on the data. If they discovered unusual data in one field (strangely spelt names, for instance, or unusual quantities of cargo) the entire port book record would be re-examined. While the team performed a large number of corrections (which, they estimated, went up to around 30,000 in number), they acknowledged that for a database with around 2 million individual items of data, most of which were entered by amateur transcribers, complete accuracy was something of an illusory target. Continually chasing after that target would become, after a certain period, a pointless task.
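The frequency part of this systematic check is easy to picture in code. The sketch below is an illustration only, not the team's procedure: the function name, the case-folding step and the sample list are assumptions, and the Oxford English Dictionary comparison is not attempted.

```python
from collections import Counter

def rare_terms(terms, threshold=3):
    """Return terms that occur fewer than `threshold` times, sorted alphabetically.

    Terms are compared case-insensitively after trimming whitespace
    (an assumption; the project's own matching rules are not documented here).
    """
    counts = Counter(term.strip().lower() for term in terms)
    return sorted(term for term, n in counts.items() if n < threshold)

if __name__ == "__main__":
    # Invented sample of transcribed commodity terms, for illustration only.
    sample = ["iron", "iron", "iron", "cider", "cider", "cider",
              "yronne", "hops", "hops", "hops", "hair"]
    print(rare_terms(sample))  # ['hair', 'yronne']
```

Terms flagged in this way are candidates for a second look at the manuscript, not automatic corrections: a rare term may be a transcription slip or a genuinely unusual cargo.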
Maintaining a Steady Line - Standardisation and Coding
Inserting such a large quantity of data raises a flood of questions as regards levels of standardisation. Though the Port Books are reasonably uniform sources, there are still many different spellings for the same names and cargoes. Many issues have to be negotiated and resolved. The need to standardise the source so as to make it available for analysis competes with the desire to protect the integrity of the original source. Equally, reducing the complexities of the original source reduces the diverse questions and historical methods that could make appropriate use of it. Additionally, any project needs to take into consideration the practical side of the equation - the ability of the transcribers to follow the prescribed standards and the effect (which can be positive or negative) that pursuing a course of standardisation will have on the time needed to insert the data.
The Gloucester Port Books team decided that standardisation had to be limited to retain the source's richness. How the cargo was dealt with gives a good example of this. Local spellings of the same object were levelled out, so for example, cargo recorded as irone and yronne in the Port Books became iron in the database. However, different types of iron (British made iron, iron anvils, pig iron, cast iron) were recorded as distinct items. Therefore, users could still make searches on very particular types of cargo, as well as the larger family it came from. The matter was, however, somewhat more problematic than this suggests. Often entries in the Port Books disguise a particular cargo. Somebody searching for "rod iron" will miss this commodity when it is recorded as "rod and bar iron". The Gloucester team made it possible to search an aggregated list of all commodities that contain a particular word. The scholar inputting "rod" would discover that cargoes such as "rolled and rod and plate iron" and "rods and baskets and cradles" have been recorded in the database, and could then make a selective search, based on only a certain family of commodities over the whole database. Figure 3 shows a user preparing a search of cargoes related to "rod iron". The Gloucester team also added a glossary to indicate other words that users might be interested in searching for. Typing "iron" into the glossary reminds users that they could also search for "anvils", "hoops" or "steel".
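The two ideas in this paragraph, levelling out spellings and searching the aggregated commodity list, can be sketched as follows. This is an illustration, not the published interface: the function names are hypothetical, the variant table holds only the two spellings quoted above, and simple substring matching is an assumption chosen because it finds "rods" as well as "rod".

```python
# Spelling variants quoted in the text; a real table would be far longer.
SPELLING_VARIANTS = {"irone": "iron", "yronne": "iron"}

def standardise(term):
    """Replace known variant spellings word by word, leaving other words alone."""
    words = term.lower().split()
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)

def commodities_containing(word, commodities):
    """List every distinct commodity whose name contains `word` (substring match)."""
    needle = word.lower()
    return sorted({c for c in commodities if needle in c.lower()})

if __name__ == "__main__":
    # Commodity names taken from the examples in the text.
    commodities = [
        "rod iron", "rod and bar iron", "rolled and rod and plate iron",
        "rods and baskets and cradles", "pig iron", "cast iron", "iron anvils",
    ]
    hits = commodities_containing("rod", commodities)
    print(hits)
    # The user then picks the family of interest from the hits, here the iron cargoes.
    print([c for c in hits if "iron" in c])
```

Separating the broad listing from the user's own selection keeps the interpretive step, deciding which cargoes count as rod iron, in the hands of the researcher.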

Figure 3 - A user searches for all instances of "rod" in the Gloucester Port Books. All cargoes using the word "rod" are on the left. The user's chosen cargoes, relating to rod iron, are on the right.
The Port Books team had to deal with the standardisation of more than cargoes. Local spellings of proper nouns (names of places, ships and merchants' Christian names) were regularised. Christian names were reduced to their first three letters, so Christopher, for example, became Chr. These abbreviations were extended when the database was transferred to the relational system, by the construction of a table giving both versions (Chr and Christopher). Regular transcribers often continued to use the abbreviated form for speed, which was then automatically converted in the main table. The exceptions to this rule of regularisation were the surnames of merchants, boat masters and their crew. The names of these important figures were left in the original form. One other category requiring clarification was that concerning weights of cargoes. There were so many ways of measuring goods, each with its own localised slant, that to standardise would have been to reject the subtle nuances between systems. This left, for example, the six terms used in the Port Books that imply a hundredweight (c, cwt, hundred, hund, centum, 100 weight) in their original forms. The entire process of regularisation was somewhat helped by the local transcribers, who had a natural tendency to standardisation, although again in monitoring the data, the Port Books team had to be aware of any instances of over-standardisation.
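The table pairing abbreviated and full Christian names amounts to a simple lookup. The sketch below assumes a Python dictionary in place of the database table; the names FORENAMES and expand_forename are hypothetical, and only the Chr and Christopher pair comes from the text.

```python
# Lookup table pairing the abbreviated and full forms of a Christian name.
# Only this one pair is given in the case study.
FORENAMES = {"Chr": "Christopher"}

def expand_forename(abbreviation, table=FORENAMES):
    """Return the full name for an abbreviation, or the input unchanged if unknown."""
    return table.get(abbreviation, abbreviation)

if __name__ == "__main__":
    print(expand_forename("Chr"))  # Christopher
    print(expand_forename("Xyz"))  # Xyz (no entry, so left as transcribed)
```

Leaving unknown abbreviations untouched mirrors the instruction given to transcribers: where standardisation is uncertain, the original form is kept.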
The Port Books team made two interesting additions to the relational database. The unregularised surnames of merchants and boat masters were linked to a standardised table that suggested a more common name for the merchant in question. An obvious example is Smythe and Smithe being linked to the more common Smith. Thus this related table, developed by the Port Books team, helps users identify whether a merchant given two slightly different appellations in the Port Books is actually the same person. A word-recognition tool, also incorporated into the final resource, performed a similar operation, indicating to users a larger family of names that the merchant's surname could belong to.
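The linked surname table can be pictured as a mapping from each original spelling to a suggested common form, with the original spelling left untouched. The sketch below is an assumption-laden illustration: the names SURNAME_LINKS, suggested_surname and group_by_suggestion are hypothetical, and only the Smythe, Smithe and Smith example comes from the text.

```python
from collections import defaultdict

# Original spelling -> suggested common form (example from the text).
SURNAME_LINKS = {"Smythe": "Smith", "Smithe": "Smith"}

def suggested_surname(original):
    """Return the suggested common form; unlinked surnames map to themselves."""
    return SURNAME_LINKS.get(original, original)

def group_by_suggestion(surnames):
    """Group original spellings under their suggested common form."""
    groups = defaultdict(set)
    for surname in surnames:
        groups[suggested_surname(surname)].add(surname)
    return {common: sorted(originals) for common, originals in groups.items()}

if __name__ == "__main__":
    print(group_by_suggestion(["Smythe", "Smithe", "Smith"]))
    # {'Smith': ['Smith', 'Smithe', 'Smythe']}
```

Because the link is only a suggestion held in a separate table, a researcher who doubts that Smythe and Smithe are the same merchant can ignore it and work from the original spellings.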
Disembarking - Conclusion
Once the database had been completed, it was fully operative, although difficult to use without a good knowledge of the database program (Visual FoxPro) that underpinned it. Wishing to seek publication, the Port Books team formed a partnership with a publishing house and a software company to create an interface. As often happens on such occasions, this was not a straightforward process. Information technology consultants rarely have a grasp of the most suitable way of packaging an academic production. The Port Books team had to produce a detailed specification to ensure that the interface was fully responsive to the potential needs of scholarly researchers. Equally, the computerised tutorial that accompanies the final product had to be devised by members of the Port Books team so as to explain fully to those not well-acquainted with computers how best to manipulate the computerised data from the Gloucester Port Books.
The Gloucester Port Books Database is now a completed resource, permitting researchers to consider a wide range of historical questions regarding this particular branch of England's maritime past. No digital resource project follows a straightforward route, but the Port Books team has been able to reach its destination by keeping to some of the basic principles in database creation - a sensible structure at the outset, the inclusion of data relevant to researchers' needs, and a sensitive process of standardisation. By following these directions, the Gloucester Port Books Database can be considered a resource that represents its original source in an accurate yet adaptable manner.
Footnotes
1. J.J.Cox, N.C.Cox, D.P. Hussey, G.J. Milne, A.P. Wakelin, M.D.G. Wanklyn.
2. Sean Townsend, Cressida Chappell & Oscar Struijvé (1999) Digitising History: a guide to creating electronic resources from historical documents. http://ahds.ac.uk/history/creating/guides/digitising-history/sect34.html
3. For a case study on the need to create precise academic specifications for resource interfaces, see the article on the Exeter Cathedral Keystones and Carvings
Many thanks to Nancy Cox and David Hussey for their help in producing this case study.