The printing press was developed and first used in 1439 by Johannes Gutenberg. It was quickly adopted across the world and laid the foundation for technological, socio-political and scientific revolutions in the following centuries. It has been estimated that since the time of Gutenberg some 130 million books have been published worldwide, and in the United States some 270,000 to 290,000 books continue to be published annually. However, with the advent of the internet and the introduction of digitization technologies, the ways books are published, marketed and distributed, and the ways people interact with them, have begun to transform.
The digitization of books has been heralded by proponents as having the potential to democratize knowledge and to have the kind of revolutionary impact that Gutenberg’s printing press did. These groups argue that digitization helps preserve knowledge by storing newer books, recovering rare and lost books, and making them easily accessible to the public. Others, however, take a more skeptical perspective, noting that the digitization of books may lead to some undesirable externalities, including taking publication rights out of the hands of authors and publishers and threatening user privacy. In this paper I examine what the digitization of books entails, the technologies currently being used to digitize books, the issues and complicating factors involved in digitization, and how some of these issues might be resolved.
“It is a press, certainly, but… it shall scatter the darkness of ignorance, and cause a light heretofore unknown to shine among men.”
This dramatized narrative of Gutenberg speaking to Conrad Saspach, a craftsman he’d contracted to construct a printing press, is recounted in the book “Memoirs of Celebrated Characters” (1854) by Alphonse de Lamartine. This book is one of thousands of rare and hard-to-find books that can be found today on digitized book databases such as Google Books.
The notion of digitizing books began in the mid-20th century, when the first government documents were digitized by manual typing. In 1971, Project Gutenberg began the first and oldest effort to digitize and archive popular books and cultural artifacts, and to store them on what would eventually become the internet. Since then, thousands of groups, such as the US Library of Congress, have concentrated their efforts on digitizing portions of their collections. The European Union has made some 12 million pertinent European government and societal documents available online. The Europeana Digital Archive is a massive continental digital library that currently archives documents, images, sounds and videos from European museums and libraries online. The project aims to make 10 million works available online by the end of 2010.
However, these selective efforts are considered distinct from “mass digitization” efforts, which seek to catalogue every book ever published and are led by private groups such as Google Books and the Open Content Alliance (OCA). Similar mass digitization efforts have also been undertaken by public groups, including Carnegie Mellon University’s Universal Digital Library (UDL) project, which has gathered nearly 1.5 million books in its collection, as well as the International Children’s Digital Library (ICDL), which has a collection of 2,500 children’s books in forty-one languages.
Converting a book on a shelf to a digitized book that can be accessed by users across the world involves several steps. These include acquiring print copies of the book, processing and storing images and textual content, quality control, placing content on a server, and building a user-friendly platform for retrieval of the digitized content. There are many stumbling blocks in the path of digitization efforts. These include, but are not limited to, procuring books, digitizing a large number of pages in a short period of time, extracting data from scanned pages, and organizing the data in a way that is user-friendly and accessible.
Collection and collaboration
Collaborations between digital libraries, libraries and volunteers have proven critical to digitization efforts. Non-commercial efforts often depend on volunteers to help with the task of compiling books and putting them into databases. According to Project Gutenberg’s founder, Michael Hart, digitizing books for the collection was primarily done by its 20,000 volunteer human transcribers until the 1990s. Now nearly 90% of the books are digitized via scanners by volunteers. Major digital libraries such as the Universal Digital Library (UDL) and Google Books have to rely on collections of books from major libraries in order to amass comprehensive databases of books. The 200-million-dollar “Google Print Library Project” began in 2004, when Google announced partnerships with five libraries (Harvard University Library, Stanford’s Green Library, Oxford’s Bodleian Library, the Library of the University of Michigan and the New York Public Library) to scan thousands of volumes of their books. Similar collaborations have been made between other groups and international libraries.
Scanning & digitization
It has been calculated that if it takes one second to digitize each page of an average book, it would require 150,000 gigabytes and take at least one hundred years to digitize a hundred million volumes. Prior to the 1990s, digitization was done by manual human transcription. Modern digital libraries, however, have relied on optical character recognition (OCR) technologies to deal with the immense scale of their projects. Without these tools, and given the sheer number of pages to be scanned and the amount of data to be processed, the task of digitizing would be impossible.
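To get a feel for the scale involved, the time estimate can be sketched as simple arithmetic. The snippet below is only an illustration: the function name, the figure of 300 pages per volume, and the idea of dividing the work among parallel scanners are assumptions for the example and are not taken from the cited calculation.

```python
# Rough scale estimate for mass digitization (illustrative sketch only).
# Assumption: 300 pages per volume is a hypothetical average, not a sourced figure.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def scanning_years(volumes, pages_per_volume, seconds_per_page=1.0, scanners=1):
    """Return the years needed to scan `volumes` books at a fixed per-page rate."""
    total_seconds = volumes * pages_per_volume * seconds_per_page
    return total_seconds / (scanners * SECONDS_PER_YEAR)


if __name__ == "__main__":
    single = scanning_years(100_000_000, 300)
    parallel = scanning_years(100_000_000, 300, scanners=40)
    print(f"one scanner:  {single:.0f} years")
    print(f"40 scanners:  {parallel:.0f} years")
```

Under these assumed inputs a single scanner working without pause would need roughly 950 years, which is consistent with the "at least one hundred years" figure and shows why parallel scanning stations and automated processing matter.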
In the process of scanning, if multiple copies of a book are available, one copy is typically disbound for efficient digitization. When fewer copies are available, an operator is required to flip the pages manually. However, new robotic scanners (e.g. Kirtas APT BookScan 2400RA) have been developed that can flip pages automatically. In the case of fragile documents and documents on non-traditional media, such as palm leaves or papyrus, digital camera technology is used to take pictures of pages and avoid any wear and tear that scanning may cause.
High-speed scanners can scan a letter-sized page in 2.4 seconds at 600 dpi resolution. Processed images can be stored in various formats but are typically stored as TIFF files. However, the Library of Norway, in its digitization efforts, chose the JPEG2000 format for preservation of files and found that it reduced file sizes by 50% compared to equal-quality TIFF files. Once documents have been scanned, OCR technology is used to convert the printed characters into text that is indexable and searchable. For large automated volumes there are several challenges to conducting OCR. These include language identification, chapter detection, topic and content identification, and metadata extraction. With many old volumes, unusual fonts and the poor quality or fragile state of documents present serious difficulties for automated scanning. Many OCR systems are not properly equipped to deal with calligraphy and non-Latin characters. This is an immense problem, as there are books in approximately 430 languages waiting to be digitized, yet there remains a bias in digital libraries towards languages written in Latin characters. In addition, commercially available OCR technology is not fully accurate and has a five-percent margin of error even in languages with Latin characters. It is therefore important to produce better tools and algorithms for character recognition. Another challenging issue has been separating visual content from textual content in books, especially in children’s books that mix these elements.
New technologies have been developed to improve the quality and readability of books. These include photo-annotation tools, tools for adding metadata to documents, and tools to improve the quality of OCR-generated texts. The International Children’s Digital Library, for instance, uses an interface called ClearText, which decouples images and visual backgrounds from the actual text on a page. In some projects, manual human review has been used for verification of digitized content and for parts of texts that cannot be accurately detected. However, this approach is impractical when dealing with large amounts of text.
reCAPTCHA is a tool developed at Carnegie Mellon University to aid with the Universal Digital Library (UDL) project. CAPTCHAs (the name stands for Completely Automated Public Turing test to tell Computers and Humans Apart) were scripts developed to reduce spam in internet submission forms, blogs, forums and other environments where user input is required. CAPTCHAs use distorted text, images or sounds in order to ensure that user input comes from a human and not from automated spamming software. However, unlike regular CAPTCHAs, reCAPTCHAs do not use random distorted text, but instead use words from scanned books, which humans recognize and transcribe. In this case books are scanned normally and then analyzed by two OCR programs. If the programs give different results, the scanned word is placed in a pool of CAPTCHA images and sent to multiple users on the web. Given that the target word’s identity is unknown, a known word is shown alongside it to control for the quality of transcription. Each identification by a human gives a candidate transcription a value of 1.0, and a matching OCR identification gives it a value of 0.5. Once a transcription has acquired 2.5 points, it is established in the database as the proper word.
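The scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the voting idea only, not the actual reCAPTCHA implementation: the function names, the data layout, and the handling of ties and failed control words are assumptions made for the example.

```python
from collections import defaultdict

# Weights and threshold as described in the text.
HUMAN_VOTE = 1.0
OCR_VOTE = 0.5
ACCEPT_THRESHOLD = 2.5


def needs_human_review(ocr_a, ocr_b):
    """A word goes to the CAPTCHA pool when the two OCR programs disagree."""
    return ocr_a != ocr_b


def resolve_word(ocr_guesses, human_answers, control_word):
    """Return the accepted transcription, or None if no candidate has enough points.

    ocr_guesses:   the readings produced by the OCR programs
    human_answers: (typed_control, typed_unknown) pairs, one per user
    control_word:  the known word shown alongside the unknown one
    """
    scores = defaultdict(float)
    for guess in ocr_guesses:
        scores[guess] += OCR_VOTE
    for typed_control, typed_unknown in human_answers:
        # Assumption: an answer counts only if the user typed the known word correctly.
        if typed_control != control_word:
            continue
        scores[typed_unknown] += HUMAN_VOTE
        if scores[typed_unknown] >= ACCEPT_THRESHOLD:
            return typed_unknown
    return None


if __name__ == "__main__":
    guesses = ["modern", "modem"]  # hypothetical disagreement between two OCR programs
    answers = [("upon", "modern"), ("upon", "modern")]
    if needs_human_review(*guesses):
        print(resolve_word(guesses, answers, control_word="upon"))
```

In this hypothetical run, one OCR reading (0.5) plus two agreeing human answers (1.0 each) reaches the 2.5-point threshold, so the word is accepted. If the human answers disagree, no candidate reaches the threshold and the word would stay in the pool for further users.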
Databases, search and navigation
The Universal Digital Library (UDL) project reports a peak digitization rate of approximately 15,000 books a month in each processing location, of which it has 40 set up across China and India [12, 15]. Meanwhile, the smaller Norway Digital Library digitizes 2,000-3,000 books each month. These books are generally stored on multiple private and public servers. The Library of Norway, for instance, stores large, high-resolution files on private storage servers, while more compressed copies of the files are stored for access on public servers.
Perhaps the single most important capability that the digitization of books has made possible is the ability to search through books in a matter of seconds. Given their existing search engines, Google.com and A9.com, Google and Amazon have been well positioned to develop strong search tools for digitized books. Currently Google Books offers four ways of accessing data on its website. The first, for books in the public domain, is a full view of the document with the option to download the file in various formats. The second is a limited preview of the document, provided for copyrighted works whose rights holders have opted in and are compensated by Google. The third is a snippet preview, and the last is no preview, for books whose authors have opted out. The Open Content Alliance (OCA) allows a full view of content on its site using the DjVu application, which users must install on their computers.
While the proponents of mass digitization efforts such as Google Books argue that digitization is critical to preserving our society’s collective knowledge in the face of destructive disasters like that which befell the Library of Alexandria, others dismiss these claims. They argue that digitization will render society’s collective knowledge vulnerable in new ways, and make it much easier for content holders to hold monopolies over knowledge and infringe on the rights of content producers and users.
Books and print
Some have expressed discomfort and concern with the concept of digitization. It has been argued that mass digitization and the availability of new content in digital form only may lead to a “Digital Dark Age” in the future. This kind of “Dark Age” may occur because old file formats stored in archives may not be readable using the software available at any given time. In addition, certain media formats (e.g. magnetic media) have very short life spans, which may lead to a loss of data if it is stored for long periods of time without conversion. There are also fears that the complete digitization of books and other resources may lead to the demise of libraries and deprive the public of the hands-on education they provide. Others worry that open and accessible digital libraries may lead to the decimation of the publishing industry.
Concerns about digitization extend beyond books themselves to concerns about reading and acquiring knowledge. Michael Gorman, the president of the American Library Association, has contended that digitization may change how people interact with books in a negative way. In recent years, various studies have shown that the brain deals with content on the internet and content in print in very different ways. Though there are currently no long-term studies examining this transformation, it has been suggested that neural wiring and cognitive responses may be modified by interaction with content on the internet. Printed materials, especially books, follow a “linear-sequential reading model”, whereas online, individuals can and do approach content non-linearly. These factors might also lead to a change in how books are ultimately written, perhaps leading to the elimination of the current linear narrative of books in favor of smaller chunks of text optimized for a non-linear approach to reading.
Copyright laws and compensation
Beyond theoretical debates about how digitization may affect the future of books, some publishers and authors view current digitization efforts as infringing on their rights. They worry about retaining rights to their published work and receiving compensation for works that are included in digitized collections. In a highly publicized case in 2005, the Authors Guild and the Association of American Publishers sued Google for including copyrighted works, without the necessary permissions, in its collection. In the trial Google argued that, given the scale of its operation, it was only feasible to scan every work without first tracing the copyright, and then permit authors to declare their rights to a work. It argued that when books are placed in its network, rather than causing a decline in sales, the ability to search leads to increased publicity and promotes purchase of the books [4, 19].
The Universal Copyright Convention (UCC), adopted in Geneva (1952), places copyrighted works in the public domain 25 years after publication; however, countries across the world take very diverse approaches to copyright. In the United States, which has very stringent copyright laws, books are considered to be in the public domain 70 years after the death of the author. One contentious issue in the Google Books lawsuits was so-called “orphan works”. These are works that remain under copyright protection but for which no copyright holders can be found. Europe is estimated to have some three million orphan books, which under its current copyright laws cannot be digitized. In the US, meanwhile, it has been estimated that 80% of books are orphan works, and it is difficult, if not impossible, to find copyright holders for these works. Google therefore argued that its digitization of orphan works fell under the fair use section of US Copyright Law [25, 27] and has continued to index these works with few repercussions.
In 2008 a $125-million settlement was announced between Google and authors’ and publishers’ groups. The terms of the settlement included an agreement under which Google pays $60 to copyright holders for every book scanned. Google Books also agreed, as part of the settlement, to allow libraries access to its complete collection for a fee, the revenue from which would be shared with publishers. One of the main criticisms of Google Books, however, has been precisely that copyright holders must opt out of the project, rather than opt in. This means not only that copyright holders would be indexed in Google Books by default, but also that they would receive no compensation from the company unless they registered. Authors’ and publishers’ groups have argued that this system is a poor substitute for actual licensing of works to be collected and stored digitally. In contrast to Google, the Open Content Alliance (OCA) follows a model that has been deemed more desirable by publishers. It only collects and distributes copyrighted content where the author has explicitly opted in.
There has been a movement on the part of governments to reform copyright laws in order to address issues arising in digital formats. For instance, there is currently a movement towards the establishment of continent-wide licensing agreements and annual royalties paid to copyright holders based on the number of pages made available online.
Control over digitized content
Google and Amazon’s capabilities as powerful companies with powerful search engines have given them dominance in the digitization of books. Critics have argued that while Google claims to be creating a “knowledge producing commons”, it is in fact building a commercial monopoly of knowledge based on works that were previously openly available in libraries. Richard Sarnoff, the chairman of the Association of American Publishers, has described Google and Amazon as a duopoly in the digital book market. Indeed, while Google Books began as a non-commercial project, it has moved in a commercial direction. In settlements with authors and publishers, Google effectively moved to set itself up as a book vendor, and it opened its store in December of 2010.
In addition to the monopolizing of knowledge, there are concerns about ownership of books. In the case of Google, once a book is purchased, it can only be accessed on the online “cloud”, and that access can be revoked. Amazon, on the other hand, allows download of eBooks from its servers. Nevertheless, in 2009, after realizing an error had been made with the listing of an edition of “1984” by George Orwell in its catalogue, Amazon remotely deleted from its mobile reading device, the Kindle, copies of the book that customers had already purchased.
One large but rarely discussed concern for readers using digital books has been the issue of privacy. In the United States, legal measures such as the USA PATRIOT Act have made it possible for library records to be subpoenaed. Accessing books online and in databases will mean that information about what users read and about their reading habits can be much more easily tracked and used for commercial purposes, or obtained through legal subpoenas.
In 2008 it was estimated that 23.9% of the world’s population, or some 1.7 billion people, had access to the internet, and that this number continues to grow exponentially. In developed countries, the figure rises to 75% of the population. We are standing in the midst of a revolution, the foundations of which have been brewing for nearly half a century. This revolution is digitized.
However, as global knowledge, in the form of books and other media, is digitized, it is important to be both cautious and optimistic. Mass digitization is a way by which we can bridge knowledge from the past to the future. We can also enhance and use information in ways that would not have been possible otherwise. The digitization of books may enable a young student to find and read a story about Johannes Gutenberg from a 19th-century publication that would otherwise have been stored on a dusty shelf in some remote corner of our great libraries. At the same time, it is critical that we retain the integrity of knowledge and let it be shared across our communities. Though it is difficult to foresee what the future holds, the following recommendations can be made:
- Great care must be taken when storing both print and digitized books to ensure that they will continue to be preserved for generations to come. This can be achieved by collecting and storing print books in protected libraries, constantly updating and converting storage and public digital databases, and preventing the revisionism that digital media make so easy.
- Libraries must be expanded and allotted the resources to take an active role in digitization, not only through the material they contribute, but also by creating repositories and independent collections. Like the libraries of the ancient world, libraries must continue to serve as centers for learning. They should not only provide individuals with knowledge and a contextualized understanding of individual communities, but also help individuals navigate through the oceans of information that will be available to them.
- Non-commercial collective projects and government digitization efforts must be supported to ensure that knowledge does not fall, through commoditization, victim to the Tragedy of the Commons.
- Commercial projects with technological proficiency should continue to play an important role by providing momentum and expertise as well as generating competition in the field. Commercial projects should continue to work with libraries and non-commercial entities. However, in return for the knowledge that they acquire from society, they must also act responsibly and contribute back to society in a constructive manner.
- Copyright laws should be tackled and revised to balance the needs of users, the ambitions of digital libraries which seek to preserve and share books, and the rights of authors and publishers.
- Privacy laws should be revised in a digital world to ensure that every citizen is safe to access and read works in a digital format without the fear of being taken advantage of or rebuked.
- Digitization combined with global internet access will create many possibilities for the democratization of data. Knowledge and information can serve as great equalizers and empower people around the world. However, we must ensure that information is truly democratized, and avoid the pitfalls and externalities that can carve even greater fault lines between the haves and the have-nots.
- Taycher, L. Books of the world, stand up and be counted! All 129,864,880 of you. Inside Google Books 2010 [cited 2010; Available from: http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html.
- Bowker, R.R. Bowker Reports U.S. Book Production Rebounded Slightly in 2006. 2007 [cited 2010; Available from: http://www.bowker.com/index.php/press-releases-2007/146.
- Timmer, J. Google book digitization prompts the EU to rethink copyright. Ars Technica 2009 [cited 2010; Available from: http://arstechnica.com/tech-policy/news/2009/10/google-book-digitization-prompts-the-eu-to-rethink-copyright.ars.
- von Bubnoff, A., Science in the web age: The real death of print. Nature, 2005. 438(7068): p. 550-552.
- De Lamartine, A., Memoirs of Celebrated Characters. Vol. 2. 1854, New York: Harper & Brothers Publishers. 287.
- Vara, V. Project Gutenberg Fears No Google. The Wall Street Journal 2005; Available from: http://online.wsj.com/public/article/SB113415403113218620-U_OqLOmApoaSvNpy5SjNwvhpW5w_20061209.html.
- BBC. EU’s New Online Library Opens. 2008 [cited 2010; Available from: http://news.bbc.co.uk/2/hi/entertainment/arts_and_culture/7798789.stm.
- Coyle, K., Mass Digitization of Books. The Journal of Academic Librarianship, 2006. 32(6): p. 641-645.
- The Universal Digital Library. Available from: http://www.ulib.org/.
- Chang, H., J.Q. Er, and B.B. Benjamin, Enhancing Readability of Scanned Picture Books. 2008.
- Google Inc. All booked up. The Official Google Blog 2004; Available from: http://googleblog.blogspot.com/2004/12/all-booked-up.html.
- Sankar, K., et al., Digitizing a Million Books: Challenges for Document Analysis, in Document Analysis Systems VII, H. Bunke and A. Spitz, Editors. 2006, Springer Berlin / Heidelberg. p. 425-436.
- Digitization of books in the National Library – methodology and lessons learned, National Library of Norway, Editor. 2007.
- Vincent, L. Google Book Search: Document Understanding on a Massive Scale. in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on. 2007.
- Greene, K. How to Digitize a Million Books. Technology Review 2006 [cited 2010; Available from: http://www.technologyreview.com/Infotech/16434/?a=f.
- Boschetti, F., et al., Improving OCR accuracy for classical critical editions, in Proceedings of the 13th European conference on Research and advanced technology for digital libraries. 2009, Springer-Verlag: Corfu, Greece.
- Grossman, L. Computer Literacy Tests: Are You Human? Time Magazine 2008 [cited 2010; Available from: http://www.time.com/time/magazine/article/0,9171,1812084,00.html.
- von Ahn, L., et al., reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 2008. 321(5895): p. 1465-1468.
- Brin, S. A Library to Last Forever. The New York Times 2009 [cited 2010; Available from: http://www.nytimes.com/2009/10/09/opinion/09brin.html.
- Kuny, T., A Digital Dark Ages? Challenges in the Preservation of Electronic Information. IFLA Conference Proceedings, 1997: p. 1-12.
- Clark, I. We still need libraries in the digital age. Guardian 2010 [cited 2010; Available from: http://www.guardian.co.uk/commentisfree/2010/jul/13/internet-age-still-need-libraries.
- Sutherland-Smith, W., Weaving the Literacy Web: Changes in Reading from Page to Screen. The Reading Teacher, 2002. 55(7): p. 662-669.
- Carr, N., Is Google Making Us Stupid? Yearbook of the National Society for the Study of Education, 2008. 107(2): p. 89-94.
- UNESCO, Universal Copyright Convention. 1952.
- U.S. Code, Duration of copyright: Works created on or after January 1, 1978, in 17. 1999.
- Nawotka, E. Will Europe’s Three Million Orphan Books Ever Be Digitized? Publishing Perspectives 2010 [cited 2010; Available from: http://publishingperspectives.com/2010/07/will-europes-three-million-orphan-books-ever-be-digitized/.
- Baksik, C., Fair Use or Exploitation? The Google Book Search Controversy. Libraries and the Academy, 2006. 6(4): p. 399-415.
- Noakes, S. Google’s digitization of books. CBC News 2009 [cited 2010; Available from: http://www.cbc.ca/arts/books/story/2009/11/09/f-google-digitization-books.html.
- Lee, T. Publisher speculates about Amazon/Google e-book “duopoly”. Ars Technica 2009 [cited 2010; Available from: http://arstechnica.com/tech-policy/news/2009/02/publisher-speculates-about-amazongoogle-e-book-duopoly.ars.
- Jones, B.M. and American Library Association, Office for Intellectual Freedom, Protecting intellectual freedom in your academic library: scenarios from the front lines. 2009, Chicago: American Library Association.
- Roush, W. Digitize This. Technology Review 2005 [cited 2010; Available from: http://www.technologyreview.com/web/14881/.
- Liu, Q., R. Safavi-Naini, and N.P. Sheppard, Digital rights management for content distribution, in Proceedings of the Australasian information security workshop conference on ACSW frontiers 2003 – Volume 21. 2003, Australian Computer Society, Inc.: Adelaide, Australia.
- Burke, T. Digital Search I: Google Poisons the Well. 2009 [cited 2010; Available from: http://weblogs.swarthmore.edu/burke/2009/10/13/digital-search-i-google-poisons-the-well/.
- Stone, B. Amazon Erases Orwell Books From Kindle. 2009 [cited 2010; Available from: http://www.nytimes.com/2009/07/18/technology/companies/18amazon.html.
- The World Bank. Internet users (per 100 people). 2008 [cited 2010; Available from: http://data.worldbank.org/indicator/IT.NET.USER.P2?cid=GPD_44.