Digitisation. A Project Planning Checklist
- Project design
- Project implementation
- Long-term maintenance and use
- Further reading
- Appendix. Estimating digital reformatting costs
This document offers practical guidance to those considering a digitisation project. It takes the form of a checklist of strategic issues which need to be addressed in a project's design phase. The issues are ordered to follow the life course of a digital resource from its inception through its development, maintenance and use, because decisions taken about a digital resource at any one stage of its life have ramifications for decisions taken at other stages.
The planning phase is critical to the success of any project. It will determine whether, how, and at what cost digital resources are created and, just as importantly, how those resources, once created, will be used. Issues that need to be addressed include:
A simple but essential cost-benefit analysis which may involve:
- a clear and precise statement of what any digitisation project is trying to achieve
- a clear understanding of the potential benefits it will offer and to whom they will be offered
- a clear understanding of the needs of intended user communities
- a clear understanding of the costs involved in not conducting a digitisation project
- a survey of existing digital resources and digitisation projects that may complement (or make redundant) the digitisation project being considered
What to digitise?
Where digitisation projects entail the production of digital surrogates for items within existing collections, an element of selectivity is involved. That selectivity should be guided by clear and consistently applied criteria which may take account of:
- a project's aims (which items or collection of items, when digitised, will support these best)
- what items are most readily available for digitisation (availability may be restricted, for example, by copyright or by their physical media)
- items for which digital surrogates already exist elsewhere and can be acquired (e.g. by purchase or subscription)
An initial review of the technical requirements that will ensure a digital resource actually serves the purposes for which it is made. The review may take account of the following with regard to the creation, management, and delivery or use of a data resource:
- network, hardware and software requirements
- technical standards (e.g. file formats, encoding methods, compression techniques)
What costs and what funding?
Having defined the aims, content, and technical aspects of a digitisation project, it should be possible to estimate costs and to assess how and from what source(s) these may be met. (See the Appendix, which supplies a costing model for use by managers of projects digitising paper-based information.)
Although implementation is largely a technical and administrative matter, it is essential that techniques and administrative practices are suited to a project's aims and to the funding and technologies available to it. Accordingly, implementation strategies need to be assessed as part of project design.
How to make the data?
This phase will involve review and selection of data creation strategies (e.g. OCR, keyboard entry, or digital photography, whether conducted in-house or contracted out) and related hardware and software. The review will also involve selection of those standards and best practices that will help digitisation projects maximise their achievements while minimising their costs. Standards and best practices deserve special consideration because many people find them bewildering. Selection will depend in part on what kind of data resource is being created (standards appropriate for digital images are different from those appropriate for electronic texts or GIS), and in part on the uses to which a data resource is intended to be put (imaging standards appropriate for web delivery of thumbnails are different from those used for archive-quality digital reproductions). There are also different kinds of standards which serve very different purposes, as follows:
- Technical standards facilitate data interchange across networks and between platforms with minimal loss in content and functionality. Such standards include those pertaining to file formats, and compression and encoding techniques.
- Data documentation standards facilitate the management of data resources and their meaningful interchange between individuals and organisations. Such standards include MARC, Dublin Core, and the CIMI (Computer Interchange of Museum Information) standard.
- Controlled vocabularies and other standards help to ensure that data resources are comparable with other like resources. Such standards include the Anglo-American Cataloguing Rules (AACR2), the UK Registrar General's occupational classification scheme, and the Getty's Art and Architecture Thesaurus.
- Best practices are the constellation of technical, documentation, and data standards and of implementation strategies which promise to maximise a resource's intended usefulness while minimising the cost of its creation and subsequent management and use. They exist for data resources constructed for particular purposes (e.g. the practices documented by Anne Kenney for preservation-quality images of printed texts, and the Text Encoding Initiative's Guidelines for the use of SGML with electronic texts).
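To illustrate what a data documentation standard looks like in practice, the following is a minimal sketch of a Dublin Core description built with Python's standard library. The wrapper element name "record", the helper name dublin_core_record, and the item described are all assumptions invented for illustration; only the element names and the namespace come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Build a small XML record with one child per Dublin Core element."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical item, invented purely for illustration.
example = dublin_core_record({
    "title": "Parish register, volume 1",
    "type": "Image",
    "format": "image/tiff",
    "identifier": "reg_v001_p0001.tif",
})
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

A project would normally record many more elements (creator, date, rights, and so on) and would follow whatever record structure its chosen catalogue or delivery system expects.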
Where/how to store the data?
Data once created need to be managed on a day-to-day basis. How and where data are stored will be determined by how, and how frequently, they are intended for use. A number of storage/use scenarios exist and need to be considered in a project design phase. They include:
- data warehoused off-line on behalf of some third party and only "delivered" to that third party in the event that it suffers some unrecoverable corruption or data loss (typical of data warehouses)
- data stored off- or near-line and only distributed to users upon request, either via a pre-arranged network transmission procedure or on some portable medium (e.g. magnetic tape, diskette, CD-ROM, etc.)
- data stored on-line and distributed (via anonymous file transfer or the worldwide web) or browsed/analysed (e.g. via a Telnet connection or the worldwide web) in real time over some network
- mixed distribution scenarios involving some combination of those described above
How to find the data?
Data resources need somehow to be located in order that they may be used. What information is available will depend upon what documentation standards are adopted. How information is made available will depend upon users' resource discovery requirements and the tools selected to meet them. Amongst the tools that may be provided are:
- resource discovery agents such as Alta Vista or Yahoo which allow simple key-word searching across the contents of web-accessible documents;
- logically ordered web-accessible gateways which provide hypertext (Web) links to on-line data resources;
- on-line catalogues which allow users to run structured queries against comparably structured resource descriptions;
- mixed scenarios which integrate two or more of those described above.
How to get the data?
How data are delivered to and used by end users will be contingent upon how and why they were created or acquired, how they are stored (e.g. on-, near- or off-line), and upon what software and hardware is needed to access them. User scenarios may include:
- resources are accessed on-line using client/server technologies and the collection managers manage the server (e.g. resources accessed by standard web browsers, Telnet sessions, etc.);
- resources are accessed on-line using client/server technologies and the collection managers do not manage the server (e.g. resources which are included in a collection but served to users by a third-party under the terms of some service agreement);
- resources and appropriate software are both resident on a workstation to which the user has direct access (e.g. a plug-and-play CD-ROM product, a digital text or database mounted locally on a user's desktop and accessed via analytical software also mounted on that desktop);
- mixed scenarios combining two or more of those described above.
Long-term maintenance and use
Having created a digital resource, project managers will want to ensure it is used and maintained effectively. Data usage, support, and maintenance practices will be highly contingent upon why data were created in the first place and should be chosen to suit a digitisation project's aims. Accordingly, they need to be considered as part of the project design phase.
Overcoming obstacles to use
Technologies are arguably changing more rapidly than scholarly culture. Accordingly, some digital resources may remain under-utilised for a time after they are created. Obstacles to use that may need to be overcome include:
- lack of awareness about the existence of particular resources
- lack of awareness in general about how such resources may be exploited effectively for scholarly purposes
- lack of relevant IT skills and/or analytical methods
- lack of appropriate user support.
How to preserve the data?
Data resources are typically very expensive to develop. Investment, however, may be repaid if the data can be made available without content loss despite changes in hardware, software, and network technologies. Long-term preservation may be achieved by a number of means either in house or through deposit with some archive facility. However it is achieved, the prospects for and costs of long-term preservation will be determined to a large extent by decisions taken during a project's design phase. Strategies for preserving data include:
- preserving the data and the hardware and software platforms from which they are originally made accessible;
- refreshing data by copying them periodically onto new storage media;
- migrating data through changing technical regimes by rendering them into appropriate standard interchange formats;
- emulating the look and feel of the original data on successive generations of hardware and software platforms.
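As one illustration of the refreshing strategy listed above, the sketch below copies a file to new storage and compares checksums to confirm that the copy is bit-identical. The function names, the directory names and the choice of SHA-256 are assumptions made for this example; an archive may prescribe its own procedure and checksum algorithm.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source, destination):
    """Copy a file to new storage and confirm the copy is bit-identical."""
    source, destination = Path(source), Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also keeps timestamps where possible
    before, after = sha256_of(source), sha256_of(destination)
    if before != after:
        raise IOError(f"Checksum mismatch after copying {source}")
    return after


# Demonstration on a throwaway file; the paths are hypothetical.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "old_media" / "page_0001.tif"
    original.parent.mkdir()
    original.write_bytes(b"example image bytes")
    refreshed = Path(workdir) / "new_media" / "page_0001.tif"
    checksum = refresh_copy(original, refreshed)
    copy_matches = refreshed.read_bytes() == original.read_bytes()
```

Recording the returned checksum alongside the file makes it possible to repeat the comparison at each later refresh cycle.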
How to administer the data?
Managing a digital resource over the long term involves a degree of administration which needs to be planned from the outset. Consideration may need to be given to version control, order processing, and rights management and protection.
Distributing the data to recover their creation and maintenance costs
Owing to the costs involved in digitisation, whether and to what extent a data resource may be used to generate revenues are becoming key issues in project planning. How to design and implement cost recovery models is accordingly a concern in the long-term maintenance of any digitisation project.
Appendix. Estimating digital reformatting costs
Based on Research Libraries Group, Worksheet Estimating Digital Reformatting Costs (1997, revised May 1998)
Ten step programme
Selection of materials
- Identify materials
- Determine legal restrictions
- Investigate the availability of digital and other versions
- Eliminate items which are in poor condition or incomplete
- Determine appropriate conversion process (e.g. film then scan, disband originals, etc.)
- Calculate staff time for selection of materials = cost 1
Determine the size of the collection
- Count number of titles, volumes and pages to be imaged, from bound or unbound documents
- Count number of frames, fiche or reels of micro-images to be converted
- Count number of finding aids required
Prepare documents
- Retrieve documents from storage
- Remove documents from circulation
- Record physical condition of documents
- Collate and identify missing pages and damage
- Repair and replace missing or illegible pages
- Prepare intermediates (e.g. photocopies, transparencies)
- Disband originals (when required)
- Create documentation for bibliographic control, indexing, tagging and encoding information (when required)
- Calculate staff time for preparing documents = cost 2
Determine imaging requirements (benchmarking)
- Assess essential document attributes to determine scanning requirements (resolution, bit depth, enhancements, file format, compression)
- Confirm results by scanning a sample
- Perform inspection of sample on screen and in print
- Calculate staff time for benchmarking = cost 3
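The scanning requirements chosen at this step (resolution and bit depth in particular) largely determine how much storage each image needs before compression. The following is a rough sketch of that arithmetic: pixels across times pixels down times bits per pixel, divided by eight to give bytes. The 8 x 10 inch page, the two scanning settings and the 5,000-page collection are hypothetical figures chosen only to keep the numbers simple, and real files will differ once headers and compression are taken into account.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bit_depth):
    """Estimate the uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to a whole number of bytes.
    return (pixels * bit_depth + 7) // 8


# Hypothetical 8 x 10 inch page scanned two different ways.
bitonal_600 = uncompressed_size_bytes(8, 10, dpi=600, bit_depth=1)
colour_300 = uncompressed_size_bytes(8, 10, dpi=300, bit_depth=24)

# Hypothetical collection of 5,000 pages scanned as 600 dpi bitonal images.
collection_bytes = 5000 * bitonal_600

print(bitonal_600)       # 3600000
print(colour_300)        # 21600000
print(collection_bytes)  # 18000000000
```

Even this crude estimate shows why the choice between bitonal and colour capture should be confirmed on a sample before storage is purchased.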
Determine requirements for and create metadata
- Create catalogue entries for digital resources
- Determine file naming and structuring strategies (e.g. individual images vs. groups of images)
- Create additional indexes (e.g. index at article level for journal literature) or revise/enhance existing finding aids
- Calculate staff time for preparing metadata = cost 4
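A file naming strategy is easier to apply consistently if it is written down as a rule that software can follow. The sketch below shows one possible rule, assuming a short collection code followed by zero-padded volume and page numbers; the pattern, the function name and the "reg" collection code are all hypothetical and are not prescribed by the worksheet.

```python
import re


def image_file_name(collection, volume, page, extension="tif"):
    """Build a name such as 'reg_v001_p0001.tif' (the pattern is an assumption)."""
    if not re.fullmatch(r"[a-z0-9]+", collection):
        raise ValueError("collection code must be lower-case letters or digits")
    return f"{collection}_v{volume:03d}_p{page:04d}.{extension}"


names = [image_file_name("reg", 1, page) for page in (1, 2, 10)]
print(names)  # ['reg_v001_p0001.tif', 'reg_v001_p0002.tif', 'reg_v001_p0010.tif']
```

Zero-padding the numbers means that an ordinary alphabetical sort of the file names also puts the pages in reading order.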
Determine imaging costs
- Assess costs of external or internal service providers = cost 5
Determine text conversion costs
- Define nature and extent of text conversion (e.g. full-text of all or specific documents)
- Assess costs of external or internal service providers = cost 6
Determine SGML encoding costs
- Define nature and extent of coding and accuracy requirements
- Assess costs of external or internal service providers = cost 7
Determine Finding Aid Conversion costs
- Define nature and extent of finding aid conversion and encoding
- Assess costs of external or internal service providers = cost 8
Post-process quality checking
- Load digital files
- Conduct data integrity checks
- Perform on-screen and paper inspection
- Ascertain accuracy and consistency of file naming, structuring, text conversion and encoding
- Integrate corrections into the digital file sequence
- Create derivatives for network access
- Calculate staff time and non-personnel costs (e.g. hardware) for quality checking = cost 9
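Part of the checking described in this step can be automated. As a minimal sketch, assuming the project keeps a list of the file names it expects to receive, the function below reports files that are missing from a delivery and files that were not expected. The function name and the sample file names are hypothetical.

```python
def compare_to_manifest(expected_names, delivered_names):
    """Report expected files that are absent and delivered files not on the list."""
    expected, delivered = set(expected_names), set(delivered_names)
    return {
        "missing": sorted(expected - delivered),
        "unexpected": sorted(delivered - expected),
    }


# Hypothetical delivery: page 3 is absent and one stray file is present.
expected = [f"p{number:04d}.tif" for number in range(1, 6)]
delivered = ["p0001.tif", "p0002.tif", "p0004.tif", "p0005.tif", "p0005a.tif"]
report = compare_to_manifest(expected, delivered)
print(report)  # {'missing': ['p0003.tif'], 'unexpected': ['p0005a.tif']}
```

A check of this kind only confirms that the right files are present; on-screen and paper inspection is still needed to judge image quality.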
Estimate additional local costs
- Project management and tracking
- Programming and systems support
- Shipping and insurance
- Purchasing storage devices, media and software
- Total = cost 10
Total cost = sum of costs 1 to 10 + indirect costs
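The total can be worked through as a simple sum. In the sketch below every figure is invented for illustration, the currency is unspecified, and the indirect costs are passed in as a single amount because the worksheet summary above does not say how they should be calculated.

```python
def total_cost(step_costs, indirect_costs=0):
    """Add the ten step costs and the indirect costs."""
    if sorted(step_costs) != list(range(1, 11)):
        raise ValueError("expected one figure for each of costs 1 to 10")
    return sum(step_costs.values()) + indirect_costs


# Hypothetical figures, keyed by cost number as used in the steps above.
step_costs = {
    1: 500,    # selection of materials
    2: 1200,   # preparing documents
    3: 300,    # benchmarking
    4: 800,    # metadata
    5: 4000,   # imaging
    6: 1500,   # text conversion
    7: 1000,   # SGML encoding
    8: 600,    # finding aid conversion
    9: 700,    # quality checking
    10: 900,   # additional local costs
}
direct_total = total_cost(step_costs)
grand_total = total_cost(step_costs, indirect_costs=2300)
print(direct_total, grand_total)  # 11500 13800
```

Keeping the ten figures separate, rather than recording only the total, makes it easier to see which step dominates the budget and to revise a single estimate later.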