The Digitisation Process: an introduction to some key themes
This paper is a companion piece to the presentation 'Digitisation: an introduction to key themes' given as part of the AHDS Digitisation Workshop series. It is not intended as a practical guide to digitisation, but rather has a broad scope, giving an overview of the characteristics of digital objects and the principles involved in their capture, management and use. The paper has three parts. First, it looks at some general issues, going on to discuss possible pathways available to the digitiser in moving from analogue to digital. Second, the paper outlines three distinct types of digital object - text, image and time-based media - and examines their defining attributes. Third, the paper discusses the matter of managing digital information and of organising digital objects in a particular 'data model'.
2. What is Digitising?
For the purposes of this paper, the term 'digitisation' is a shorthand phrase that describes the process of making an electronic version of a 'real world' object or event, enabling the object to be stored, displayed and manipulated on a computer, and disseminated over networks and/or the World Wide Web. The physical or analogue object is 'captured' by some device such as a scanner, digital camera or recorder, which converts the analogue features of the object to numerical values, enabling them to be 'read' electronically. The numerical system used by computers is called binary and is made up of a series of ones and zeros, commonly referred to as 'bits' of information. A fundamental point to note about any digitisation process is that the binary or digital channels are relatively narrow, and only a partial representation of an analogue object can ever be rendered in digital form. In other words, the digital object can only ever be a version of the real thing. The digitiser therefore has to make informed decisions about what level of detail is required in the digital version of an object, for that digital version to serve its intended purpose.
3. Fit for Purpose
The notion of 'fit for purpose' is central to all digitisation processes. To make decisions on any technical issue, the digitiser must have a clear understanding of how they expect the digital object to be utilised. This could include, for example: increasing access; facilitating new research; aiding conservation; adding value, perhaps by including the objects in a learning and teaching environment; or comparing and searching objects alongside electronic resources of a similar nature. The possibilities are many, and none is mutually exclusive. Questions such as who is expected to use the resource, why, and what users are expected to do with it should be addressed at the outset. When the answers to these questions are known, the digitiser can then make effective and proper technical decisions about how to proceed.
For example, the National Gallery in London was involved in a project called Methodology for Art Reproduction in Colour (MARC). This project developed a camera which enabled very high resolution scans of paintings, at anything up to 20,000 x 20,000 pixels in size. These high resolution scans were used by the Gallery to conduct research into changes in the appearance of paintings during conservation treatment, to track light-induced colour change of pigments, and to note long-term colour changes in paintings. The project also used infrared imaging to study underlying layers and sketches evident beneath the top layers of paint.
On the other hand, another project - the digitisation of historical Old Bailey court proceedings - was concerned simply with creating a digital record of thousands of court documents. This project decided there was no likelihood of requiring large format prints, or of the kind of further detailed image study undertaken at the National Gallery. In this case, a conventional scanner capturing at perhaps 1024 x 1024 pixels - roughly one twentieth of the National Gallery's images in each dimension - was sufficient. These examples represent two ends of the digitisation spectrum. Both approaches are valid, as both serve their purpose. Other information papers in this series explore further the way that the intended uses of a digital resource affect how it should be created.
The greater the level of detail stored in a digital file, the larger its file size, which in turn has an impact on the storage and dissemination of the object. If a high resolution object is created, it is considered best practice to store at least two versions of the object: the master high resolution uncompressed file that is archived at point of capture, and a surrogate compressed version derived from the master for dissemination purposes. The uncompressed master or archival version should, wherever possible, be stored in an open, non-proprietary file format. More detail is provided on specific preservation standards and uncompressed and compressed file formats in the 'Digital Objects' section below.
The two types of compression available to the digitiser are called 'lossless' and 'lossy'. Lossless compression enables a reduction in file size without losing any information within the file, hence lossless. It does this using algorithms (or software routines) that reduce the number of bits used to represent data in a file, thereby reducing its size while retaining all the original information. Lossless compression typically achieves space savings of up to thirty percent.
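The lossless round trip can be illustrated with a short sketch in Python, using the general-purpose zlib library. The sample text is invented, and because it is deliberately repetitive the saving shown is far larger than the thirty percent typical of real-world data:

```python
import zlib

# A repetitive text sample: lossless compression exploits such redundancy.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("ascii")

compressed = zlib.compress(text, level=9)
restored = zlib.decompress(compressed)

# Lossless: decompression restores every original bit of the file.
assert restored == text
print(f"original:   {len(text)} bytes")
print(f"compressed: {len(compressed)} bytes")
```

The key point is the assertion: after decompression the file is identical, bit for bit, to the original.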
Lossy compression, on the other hand, reduces file size with a corresponding loss of data. It works by eliminating information from the file that the program deems superfluous. The lost information is either unnoticeable to the user, or can be approximated during decompression by interpolation from the data that remains; it can never be exactly recovered. Lossy is probably the appropriate type of compression for data that is originally analogue rather than digital, such as video and sound clips, and continuous-tone greyscale or colour images. Joint Photographic Experts Group (JPEG) and Moving Picture Experts Group (MPEG) are two common types of lossy compression, and the file reduction can be quite dramatic, resulting in files that are possibly ninety percent smaller.
When starting to digitise, it is generally best to work from a source as close to the original analogue object as possible. Depending on the nature of the source this could mean capturing directly, using, for example, a flatbed scanner to digitise a text document, a digital camera to capture an object, or a digital camcorder or audio recorder to capture moving images or sound. However, it may be that the source is one step removed from the original: say, a slide or photograph of an object, or an interview stored on analogue audio tape. The further removed from the original - say a photocopy of a document, or a second analogue recording of a recording - the less likely the resulting file is to be a true representation of the original object or event, and the more likely that erroneous data will have crept into the file.
Making the decision about the method of capture is very much a project decision, based on the type of source material being digitised, the equipment and staff skills available, and the budget allocated for both equipment and staff time. Sometimes decisions about how to capture are self-evident and easy to make. For example, if digitising a collection of 35mm slides, a slide scanner is probably the best solution; if scanning a series of flat documents no larger than A4, an A4 flatbed scanner would be a good choice. Of course, decisions are not always as straightforward as this. For example, if capturing a museum collection that includes flat paintings, three-dimensional objects and written documents, there is more than one way to go about digitising the collection. If digitising a manuscript, there is often the need to have text transcribed beside an image. If filming an event, the final resource is likely to include moving image, sound and text transcription, and perhaps also still images.
6. Digital Objects
As noted above, there are various issues concerned with the capture of digital objects, depending on the nature of the object itself. There are three broad types of digital object: text based, image based and time based. The paper will now look at each of these in turn, and draw out some of the main issues involved in digitising them.
6.1 Text Based
Text processing and the scholarly use of digital texts are relatively long established, going back to the early days of computing in the late 1960s. The early text format used in computing was the American Standard Code for Information Interchange, more commonly known as ASCII text, sometimes referred to as plain text. ASCII is usually an adequate format for sharing documents between computers and applications when the text contains only simple, modern English prose. Simple US-ASCII uses 7 bits of information for character encoding, giving 128 possible codes, with the eighth bit historically reserved for error checking (parity). Thus ASCII can encode only the upper and lower case characters of the Roman alphabet, the digits, and the more common punctuation characters.
A second and later text format is Unicode. Unicode solves the problems of limitation evident in the use of ASCII plain text. The aim of Unicode is for all of the characters in all of the world's languages, including some languages of the past, to be mapped unambiguously onto a distinct numerical code. Unicode is widely expected to become the standard for character encoding, and is already supported by the latest versions of the major operating systems. To date, Unicode offers mappings for all major languages, and coverage of less commonly supported languages is on-going.
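The difference in coverage between ASCII and Unicode can be demonstrated in a few lines of Python (the accented character is simply an illustrative example):

```python
# ASCII covers only code points 0-127 (7 bits).
assert ord("A") == 65
assert "A".encode("ascii") == b"A"

# A character outside that range needs Unicode.
ch = "é"                      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
assert ord(ch) == 0xE9
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("é cannot be encoded as ASCII")

# The UTF-8 encoding of Unicode stores it in two bytes,
# and decoding restores the character unambiguously.
data = ch.encode("utf-8")
assert data == b"\xc3\xa9"
assert data.decode("utf-8") == ch
```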
There are two main methods used to digitise existing texts - transcription and Optical Character Recognition (OCR).
6.1.1 Text transcription
Transcription is probably the simplest method of digitisation, as it requires only a person, a keyboard and a monitor. Transcription can be very accurate, particularly when working on documents with complex layouts and passages of text that are difficult to read: hand-written diaries with notes in margins and text flowing at odd angles, for example, or newsprint made up of blocks of unrelated text on the page. However, text transcription can also be time consuming and costly, particularly if the work is outsourced. Also, spelling and other errors made by the transcriber can be difficult to find and fix, as they are random. It is best practice for any text transcription to have two people working on a document, one person transcribing and another proof-reading.
6.1.2 Optical Character Recognition (OCR)
The second method used to digitise text is scanning using OCR software. This is a more automated method of digitisation: the document is scanned, and a computer program then 'reads' the resultant digital image. OCR software employs various methods to achieve its results, such as:
- Pattern Recognition: matches each character against pre-recorded images held in a database; good for documents with a uniform typeface.
- Feature Extraction: recognises characters by their shape; again, this works well with high quality prints.
- Structural Analysis: examines the structure of each character (how many vertical and horizontal lines it contains), and copes better with poor quality texts.
- Neural Networks: compares each character with characters the software has been trained to recognise, so the neural nets change and 'grow' over time. Each character is given a confidence level and, again, this approach copes better with texts of poor quality.
OCR has the benefit of being much faster than transcription, and therefore more economical, especially for clear type-written documents with a simple layout. Furthermore, errors made by the software tend to be systematic, and therefore easier to find and fix. OCR is less effective when document layouts are complicated and text is difficult to read. Again, as with text transcription, it is best practice to error-check and spell-check documents thoroughly, ideally using two people.
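As an illustration of the pattern recognition approach, the following Python sketch matches a character cell against stored templates by counting differing pixels, and attaches a confidence level. The three-by-three 'templates' are invented for the example; real OCR engines are vastly more sophisticated:

```python
# Hypothetical 3x3 bitmap templates for three glyphs (1 = ink, 0 = blank).
TEMPLATES = {
    "I": (0,1,0, 0,1,0, 0,1,0),
    "L": (1,0,0, 1,0,0, 1,1,1),
    "T": (1,1,1, 0,1,0, 0,1,0),
}

def recognise(cell):
    """Return the glyph whose template differs from `cell` in the fewest
    pixels (Hamming distance), plus a confidence score in [0, 1]."""
    best, dist = min(
        ((ch, sum(a != b for a, b in zip(tpl, cell)))
         for ch, tpl in TEMPLATES.items()),
        key=lambda pair: pair[1])
    return best, 1 - dist / len(cell)

# A noisy 'T' with one flipped pixel still matches its template best.
glyph, confidence = recognise((1,1,1, 0,1,0, 0,0,0))
print(glyph, round(confidence, 2))   # T 0.89
```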
6.2 Image Based
6.2.1 Raster (or Bit-mapped) Images
Raster images are made up of what are known as pixels (picture elements), and each pixel stores information about the colour of one point in the image. A black and white (bi-tonal) image uses one 'bit' of information per pixel: each pixel is either black or white. A greyscale image typically uses 8 bits per pixel, giving 256 shades running from white to black through grey. For a colour (RGB) image there are commonly 8 bits per channel of red, green and blue respectively, making a 24 bit image. At the time of writing, the latest capture equipment can go up to 48 bits, with 16 bits of information per red, green and blue channel. Raster images are the most common image type on the web, for example in the file formats JPEG and GIF, and raster data models are often used in geographic information systems (GIS) to represent continuous surfaces, such as satellite images or historic maps. Geographical information systems are explained further in the 'Data Models' section below.
The resolution of an image concerns the number of pixels held within the digital file, and is measured in pixels per inch (ppi). The more pixels stored per inch, the greater the density of the colour information, and therefore the greater the detail evident in the image. This is known as 'scan' resolution, and it is important to scan at an appropriate resolution. The appropriateness of the resolution chosen depends on the intended purpose of the digital image. However, note that resolution or ppi is only an indicator of image size, and therefore of 'quality', when we know the dimensions of the original analogue object. For example, scanning an A4 document (roughly 8.3 x 11.7 inches) at 300 ppi will produce a digital image of approximately 2490 x 3510 pixels (the dimensions of the original multiplied by the ppi). Scanning a postage stamp that is 1 inch x 1 inch in size will produce a digital image that is 300 pixels x 300 pixels. Both images are scanned at the same 300 ppi resolution, but the resulting digital images are vastly different in size. The correct way to refer to the size of a digital image should, therefore, always be its pixel dimensions.
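The arithmetic above can be expressed as a short Python sketch (the A4 dimensions are approximate, and the 24 bit colour depth is an illustrative assumption):

```python
def scan_dimensions(width_in, height_in, ppi):
    """Pixel dimensions of a scan: physical size multiplied by resolution."""
    return round(width_in * ppi), round(height_in * ppi)

def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw (uncompressed) raster file size in bytes."""
    return width_px * height_px * bits_per_pixel // 8

# An A4 page (about 8.3 x 11.7 inches) scanned at 300 ppi...
w, h = scan_dimensions(8.3, 11.7, 300)
print(w, h)                                    # 2490 3510
# ...stored as 24 bit colour needs about 26 MB uncompressed.
print(uncompressed_bytes(w, h, 24) // 1_000_000, "MB")

# The same 300 ppi applied to a 1 x 1 inch stamp gives a far smaller image.
print(scan_dimensions(1, 1, 300))              # (300, 300)
```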
In the context of a GIS, each pixel represents a known area on the surface of the earth, summarising a square of known dimension: for a high resolution aerial photograph a pixel may have a 'ground resolution' of 25cm, while for a low resolution satellite image it may summarise an area of several square kilometres.
6.2.2 Vector Images
Another type of image is a vector graphic. Rather than being made up of pixels, these images are co-ordinate based: two points a and b define a line, and three or more points define an area. A common file format used to create vector graphics is Encapsulated PostScript (.eps). Scalable Vector Graphics (.svg) is a newer format that utilises XML technologies, and may well become an industry standard. A significant benefit of vector images is that, because they are co-ordinate based - as opposed to pixel-based raster images - they can be zoomed to any size without pixelation. It is best practice to store the topology, in other words the order, contiguity and relative position of the points. Vector graphics are often used in virtual reality and 3-D modelling, as well as in Macromedia Flash applications. A good simple example of a vector graphic at work is a font used in a word processor. These fonts, or rather the images used to represent them, are vector based, and when you increase the size, say from 10 point to 28 point, the image increases in size without any degradation. Vector data is also widely used in architecture and cartography, such as in computer aided design (CAD) or in geographic information systems. In these contexts the ability to use a single co-ordinate system allows diverse types of information to be reconciled.
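The principle can be shown in a simple Python sketch: a shape held as co-ordinates is enlarged merely by multiplying those co-ordinates, with no loss of detail (the triangle is an invented example):

```python
# A triangle defined by three co-ordinate points rather than pixels.
triangle = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]

def scale(points, factor):
    """Scaling a vector shape multiplies its co-ordinates, so no detail
    is lost at any zoom level (unlike enlarging a raster image)."""
    return [(x * factor, y * factor) for x, y in points]

print(scale(triangle, 2.0))   # [(0.0, 0.0), (8.0, 0.0), (4.0, 6.0)]
```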
6.3 Time based
Digitising time-based media (sound and video) raises some distinct concerns for the digitiser. The first major issue to be addressed is the large size of the digital files produced, relative to most other forms of digitisation. To give an idea of scale, one second of digitised sound produces a file around the same size as one quarter of the complete works of Shakespeare digitised as text; one and a half seconds of digitised video is the same size as the entire digitised works of Shakespeare. The second issue, related to the size of the files, is how to enable sound and video to be disseminated and displayed over the web. As with large raster images, some form of file compression is required to do this, and appropriate types of compression are listed below. Finally, unlike most other types of digital object, which are readily viewable on screen, most compressed sound and video requires a plug-in or viewer to use the file.
6.3.1 Sound
The process of moving from analogue to digital sound is called sampling. To convert an analogue sound to digital, one must sample the signal many times per second. The frequency of this sampling is measured in Hertz (Hz), and the precision of each sample is measured in bits. For high quality digitisation a minimum sampling rate of 36kHz is normal, and the standard highest frequency for most computers is 44.1kHz. In terms of bit depth, 16 bits per sample is considered good enough; at 44.1kHz this gives an overall uncompressed data rate of around 1,400 kilobits per second for stereo sound. Common types of uncompressed sound file format are Microsoft's WAVeform PCM encoding (.wav), the default sound format on the MS Windows platform, and the Audio Interchange File Format (.aiff), the default format for the Apple MacOS. These formats provide accurate, high quality and lossless sound files. However, the file sizes are large, and therefore suitable for archival master versions of files but not for dissemination over the web.
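The file sizes involved can be estimated with a simple calculation, sketched here in Python (CD-quality stereo settings are assumed for the example):

```python
def audio_bytes(seconds, sample_rate=44_100, bits=16, channels=2):
    """Uncompressed (PCM) audio size: samples/sec x bytes/sample x channels."""
    return seconds * sample_rate * (bits // 8) * channels

# One minute of CD-quality stereo sound, before any compression:
size = audio_bytes(60)
print(f"{size / 1_000_000:.1f} MB")   # 10.6 MB
```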
Compressed 'lossy' audio formats use specialist codecs (COmpressor/DECompressor) to compress audio data. Popular variants include MPEG Audio Layer 3 (MP3), Ogg Vorbis and RealAudio. MP3 is the most common compressed format. A sample rate of 44.1kHz and a bit-rate of 192kbps or higher is advised to preserve quality. Using MP3 it is possible to reduce files to around one twelfth of the size of lossless .wav files.
6.3.2 Moving Image
A digital video file is a sequence of still images (frames) played in rapid succession, usually accompanied by audio data played in tandem. When played at a sufficient rate (12 frames per second or higher), the image sequence creates the illusion that the onscreen object is moving.
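Because every frame is itself a full raster image, uncompressed video grows very large very quickly, as this back-of-the-envelope Python sketch shows (the frame size, colour depth and frame rate are illustrative assumptions, loosely based on PAL video):

```python
def raw_video_bytes(seconds, width=720, height=576, bits=24, fps=25):
    """Uncompressed video size: every frame is a full raster image."""
    frame_bytes = width * height * bits // 8
    return seconds * fps * frame_bytes

# One minute of uncompressed PAL-sized video:
print(raw_video_bytes(60) // 1_000_000, "MB")   # 1866 MB
```

This is why compression is unavoidable for disseminating video over the web.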
Some common video formats are MPEG-1, MPEG-2, Audio Video Interleaved (.avi) and QuickTime (.qt). Higher compression rates, and therefore smaller files, may be achieved by using a third party codec such as DivX. For example, a 450MB MPEG-1 file may be compressed by more than half, to around 180MB, as a DivX file. MPEG-1 is a stable and documented format, which can be played back by most current computers and digital video players without the need for additional software. However, MPEG-2 and, more recently, MPEG-4 are better suited to high quality video, and do not place restrictions on resolution and dimensions, as happens with MPEG-1. MPEG-2 and MPEG-4 are likely to be the recommended archival formats for digital video in the next few years. There are many proprietary codecs available for playing desktop video, such as QuickTime, RealMedia and Windows Media Video, which are useful for specific tasks, but none of these formats should be considered for a master archival version of a file.
7. Data Models
The digital objects referred to above are, of course, single entities. Any digitisation process is likely to create many hundreds, sometimes thousands, of these objects. These may consist of objects of one type - say, one hundred digital images - or sometimes combinations of text, image and time-based objects. It is necessary, therefore, to manage and organise these resources efficiently to enable their effective use and retrieval. In essence this is an information management issue, and is sometimes referred to as 'data modelling'. In other words, data modelling is the mechanism by which individual digital objects are stored, organised and managed together. For the purposes of this paper, there are essentially four conceptual methods of organising data: lists, hierarchies, sets, and geometric or co-ordinate based systems. Each one is explained in further detail below.
7.1 Choosing a Data Model
Choosing a data model for your project's digital output is based on three fundamental questions. How is the data organised naturally? What is the intended purpose of the resource, going back to the idea of 'fit for purpose'? And, following on from this, what are the intended users' expectations and experience when using resources of a similar nature?
A simple example of a data model would be a list of contact details giving, say, contact name and email address. The best way to model this list data is in a tabular format: either in a spreadsheet or, if slightly more complex data manipulation and searching is required, in one table of a database. The wrong way to organise this data would be to store it in a word-processed document, say an MS Word file. This would prohibit the effective use of the digital resource, as sorting and finding contacts would be made more difficult.
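A minimal sketch in Python illustrates the point (the names and addresses are invented): held in tabular form, the contacts can be sorted and searched programmatically, which free-flowing text does not readily allow:

```python
import csv
import io

# A tiny contact list held in tabular form (here CSV; a spreadsheet or a
# single database table models the same structure).
raw = """name,email
Carter,j.carter@example.org
Adams,b.adams@example.org
Baker,l.baker@example.org
"""

contacts = list(csv.DictReader(io.StringIO(raw)))

# Tabular data sorts and filters trivially.
for row in sorted(contacts, key=lambda r: r["name"]):
    print(row["name"], row["email"])
```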
Another method of modelling data is in a hierarchy. This is the classic tree structure, common for storing files inside folders on a computer, and visible graphically when viewing files on a Windows machine using Windows Explorer. There are various resources that suit such organisation. Most texts are organised naturally in a hierarchical fashion: a book contains chapters, which consist of pages, which consist of sentences, and so on; a poem has stanzas, which have lines, which have words. As a result, all text mark-up, whether Extensible Mark-up Language (XML), Hypertext Mark-up Language (HTML) or Standard Generalised Mark-up Language (SGML), follows this pattern. This is sometimes referred to as a 'parent-child' model. It is also common for archival resources to be organised in a hierarchy, as they are logically ordered in this way in the analogue world - for example, in boxes that contain files that contain documents. Therefore, most electronic archives will be stored in some form of hierarchical database.
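The parent-child structure can be illustrated with a short Python sketch over a small invented XML document:

```python
import xml.etree.ElementTree as ET

# A book marked up hierarchically: book > chapter > page ('parent-child').
doc = ET.fromstring("""
<book title="Example">
  <chapter n="1">
    <page n="1">Opening text...</page>
    <page n="2">More text...</page>
  </chapter>
  <chapter n="2">
    <page n="3">Further text...</page>
  </chapter>
</book>
""")

# The tree can be walked from parent to child, as in a folder structure.
for chapter in doc.findall("chapter"):
    pages = [p.get("n") for p in chapter.findall("page")]
    print("chapter", chapter.get("n"), "pages", pages)
```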
Sets are an effective method of storage, particularly for objects that have clear relationships with one another. A good example of the set data model is a relational database, where one object can have many related objects. This is known as a 'one to many' relationship. The advantage here is that more objects can be added to the 'many' side of the relationship without unnecessary repetition of information pertaining to the main object. For example, a relational database that contains images may have one main image with six related images showing different views. In this case, the main image information would be entered once only. Thus, relational databases avoid the unnecessary duplication of the same information in a database, sometimes known as avoiding redundancy.
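The 'one to many' arrangement can be sketched with Python's built-in sqlite3 module (the image and its views are invented examples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE view  (id INTEGER PRIMARY KEY,
                        image_id INTEGER REFERENCES image(id),
                        angle TEXT);
""")
# The main image is entered once only...
con.execute("INSERT INTO image VALUES (1, 'Bronze figurine')")
# ...while any number of related views point back to it (one-to-many),
# so its details are never duplicated.
con.executemany("INSERT INTO view (image_id, angle) VALUES (1, ?)",
                [("front",), ("back",), ("left",), ("right",)])

rows = con.execute("""
    SELECT image.title, view.angle
    FROM image JOIN view ON view.image_id = image.id
""").fetchall()
print(rows)
```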
Some data sets can be reconciled and explored most effectively by exploiting shared geography. So, for example, modern and historic maps could be plotted together with observations taken in the field or digitised from another source. A common model for storing such data is a Geographical Information System (GIS). GIS combine many of the features of relational databases with image processing tools, and thus present a powerful set of tools for unifying diverse data sets and for analysing relationships between them. In simple terms, a GIS uses geography as a primary key for its data. So, for example, it is possible to ask questions such as 'show me all the information within 500 metres of where I am', or to produce visualisations of data that would not normally lend itself to being shown in this way - tasks that would be impossible or very time consuming by other means. This is made possible when the different sets share a common vocabulary of co-ordinates.
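In essence, such a query is a filter over shared co-ordinates, as this deliberately simplified Python sketch shows (the find-spots and grid references are invented; a real GIS offers far richer indexing and analysis):

```python
import math

# Hypothetical find-spots recorded as (name, easting, northing) in metres,
# all in one shared co-ordinate system - the key to any GIS query.
features = [
    ("church ruin",  451_200, 466_900),
    ("field system", 451_650, 467_100),
    ("burial mound", 453_400, 468_800),
]

def within(x, y, radius, records):
    """Return every record within `radius` metres of the point (x, y)."""
    return [name for name, fx, fy in records
            if math.hypot(fx - x, fy - y) <= radius]

# 'Show me all the information within 500 metres of where I am':
print(within(451_300, 467_000, 500, features))
```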
For example, fieldwork at the medieval settlement at Cottam in North Yorkshire drew upon existing maps and aerial photographs, as well as information from previous metal detecting to develop several seasons of excavation and surface observation. In turn the GIS was used to help interpret the results of this fieldwork. This was only possible because the different data sources could be reconciled to a common framework of co-ordinates, provided by the Ordnance Survey's National Grid.
8. Choosing Software
It is more important to get the data model right than it is to choose a particular piece of software to represent that data model. In principle, for example, if you choose to organise your data in a relational database, it does not matter whether you choose MS Access, FileMaker Pro, Paradox, MySQL, PostgreSQL or any other relational database software. In essence these products all store data in the same way. There are caveats to this depending on use. For example, if the database is to be searchable over the web, or if a large amount of data is being stored and retrieval speeds are important, then the more robust MySQL would be preferred over MS Access. However, this does not alter the basic premise that the underlying data is being designed and stored in the same model, whatever software is chosen. Similarly, a Lotus spreadsheet will store information in the same way as an MS Excel spreadsheet, and Corel WordPerfect stores text in the same way as MS Word.
Having said this, there are important considerations to bear in mind when choosing software. Such as:
- making sure the software performs the tasks required
- picking software that is well used and has good support
- choosing software that has good import and export functions
- choosing software that supports recognised international standards
9. Conclusion
This paper has given an introduction to the fundamental principles involved in the digitisation process. It has outlined some key concepts and themes: defining digitisation, examining pathways, introducing the notion of 'fit for purpose', and assessing archival concerns and dissemination compression techniques. The paper has also looked at the three broad types of digital object - text, image and time-based media - and commented on their defining characteristics, as well as on some issues involved in their capture. Finally, the paper introduced the concept of the data model for storing, organising and managing digital files, and gave some examples of common methods of structuring data: lists, hierarchies, sets and geography/geometry. An overriding theme throughout the paper has been the notion of 'fit for purpose'. This concept underpins the whole digitisation process, and without addressing it - the end goal, the use and the users of a digital resource - many, if not all, of the questions that emerge as part of any digitisation project cannot be answered effectively.