When you publish a printed book, you usually run a test print, then do a final proofread, to make sure the reader gets the best reading experience.
The same should be done with an EPUB digital book. With EPUB, however, there are more complicated issues to consider; this is not just a proofread for correct punctuation, paragraph spacing and so on. In digital books, the final look depends on the actual reader application. Unfortunately, although the IDPF and W3C standards define how the book should look, in practice a book rarely looks the same in two different readers. Since there are so many readers out there, it is impossible to test the book on all of them. What can be done, however, is to check the book for common issues that are known to cause problems.
These checks are also crucial if you want to upload the EPUB to online book stores such as Apple's. Some stores have rules about what they expect in books in terms of image quality, positioning etc.
I call it EPUB quality assurance. In this article I will try to show a few of the possible problems and explain how to find and fix them. Note that this article shows only the main issues we look for when doing comprehensive quality assurance for an EPUB digital book. The IDPF has a great service and open source software, EPUBCheck, for validating EPUB files. The first step should always be to test the book using the validator. The only problem with the validator is that when a file fails the test, its error messages are very general and tend to be cryptic.
However, the fact that a file passed all validator tests does not by itself mean that the book will render well on all readers, or that the reading experience will be good and consistent across readers. This is why we must also do manual quality assurance for EPUB files.
Introduction to EPUB digital book structure
Before I can explain some of the potential issues and how to fix them, I will give a brief introduction to how a digital book is structured. Those who are familiar with the internal structure of an EPUB digital book can safely skip this section.
I will try to make it as simple as possible so even people with little technical background will understand.
An EPUB file is actually a compressed archive (a renamed ZIP file) containing several files and subdirectories.
The main components of an EPUB are:
- mimetype - a single file that must be the first file in the ZIP archive. It must not be compressed and must contain exactly the string "application/epub+zip". It identifies the archive as an EPUB file. Some readers will still open the book even if this file is not the first file, or even if it does not exist at all.
- META-INF - a subdirectory usually containing one file named container.xml. The container.xml file states the location of the package file that describes all the other files.
- OEBPS - content files, usually grouped in one or several subdirectories under the OEBPS directory. OEBPS is the traditional name for that directory. Some books use a different name; the actual name does not matter as long as the container.xml file points to the package file inside it.
- content.opf - the main package file. Again, this is the traditional name, but some books use a different one. This is the file that container.xml points to. OPF means Open Packaging Format; the file describes all files in the book and contains some metadata. It is probably the most important file in an EPUB publication and is the first one to be tested. It contains 3 major sections:
- metadata - book metadata such as title, publisher, author etc.
- manifest - a list of all files in the book, including CSS files, images, video files, audio files etc.
- spine - a list of the actual content files. Note that this is not a list of file names but a list of references to the item ids declared in the manifest section. The spine lists the files in the default reading order.
- toc.xhtml - the table of contents file. This is an XHTML file that can be both rendered as a table of contents and read by machine to extract the table of contents. It must contain links to the files comprising the actual chapters of the book, in the right reading order. Note that this file exists only in EPUB3.
- toc.ncx - the NCX file is needed only for compatibility with EPUB2. NCX stands for Navigation Control file for XML; its structure was originally designed by the DAISY Consortium for the Digital Talking Book. The EPUB3 standard has made this file obsolete, but many EPUB3 books still contain it in order to stay compatible with EPUB2 readers.
- Content files - content files are usually HTML files. This is why an EPUB file is sometimes referred to as an encapsulated web site.
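To make the structure above concrete, here is a minimal Python sketch that opens an EPUB as a ZIP archive, checks the mimetype entry, and follows container.xml to the package file. The function name inspect_container and the wording of the messages are my own choices for illustration; this is not a replacement for the validator.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def inspect_container(epub):
    """Return (problems, opf_path) for an EPUB given as a path or file object.

    inspect_container is a hypothetical helper name used for this sketch.
    """
    problems = []
    opf_path = None
    with zipfile.ZipFile(epub) as zf:
        infos = zf.infolist()
        names = zf.namelist()

        # The mimetype file should be the first entry and stored uncompressed.
        if not infos or infos[0].filename != "mimetype":
            problems.append("mimetype is not the first entry in the archive")
        elif infos[0].compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype entry is compressed")
        if "mimetype" in names:
            if zf.read("mimetype").strip() != b"application/epub+zip":
                problems.append("mimetype has unexpected content")

        # container.xml names the package (OPF) file.
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
            return problems, None
        try:
            root = ET.fromstring(zf.read("META-INF/container.xml"))
        except ET.ParseError:
            problems.append("META-INF/container.xml is not well-formed XML")
            return problems, None

        rootfile = root.find(".//c:rootfile", CONTAINER_NS)
        if rootfile is None or not rootfile.get("full-path"):
            problems.append("container.xml does not name a package file")
        else:
            opf_path = rootfile.get("full-path")
            if opf_path not in names:
                problems.append("package file %s is not in the archive" % opf_path)
    return problems, opf_path
```

Calling inspect_container("book.epub") on a healthy file should give an empty problem list and the path of the OPF file, which is the starting point for most of the checks that follow.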
Common issues we look for
Some of the following issues may seem obvious, but believe me, I have seen every one of them myself in many EPUB files.
Check all images for good quality and readability on different screen sizes. Some images may look good on a big screen but be almost unreadable on a smaller one.
Graphs and charts
Check that all graphs and charts are comprehensible on both small and large screens.
The best graphs are SVG images, since they scale nicely on all screen sizes; however, many EPUB readers don't have good support for SVG.
Empty pages are sometimes caused by leftovers from images that did not fit on one page, so that a single line of the image is displayed on the next page. This will only affect some readers, not all of them.
The table of contents file is new in EPUB3; nevertheless, a table of contents exists in all EPUB files. In EPUB2 it was kept in an NCX file.
Sometimes files are listed in the spine section of the OPF file but for some reason do not appear in the TOC or NCX file. This will cause most readers not to display these files.
The table of contents and/or NCX contains references to all files, but two entries have the same title. This will not cause problems when reading the book sequentially, but it will confuse readers navigating through the table of contents.
The same file appears twice in the TOC, with the same or a different title. This may cause problems for both sequential reading and navigation using the table of contents.
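The three table of contents issues above can be looked for with a short script. The following Python sketch works on the text of the OPF file and of the EPUB3 table of contents file. It assumes both files sit in the same directory, so that hrefs can be compared directly, and it treats every link in the TOC file as a table of contents entry. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
XHTML_A = "{http://www.w3.org/1999/xhtml}a"


def spine_hrefs(opf_xml):
    """Return the content file hrefs in spine order."""
    root = ET.fromstring(opf_xml)
    manifest = {item.get("id"): item.get("href")
                for item in root.findall("opf:manifest/opf:item", OPF_NS)}
    return [manifest[ref.get("idref")]
            for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
            if ref.get("idref") in manifest]


def toc_entries(nav_xhtml):
    """Return (title, href) pairs for every link in the TOC file."""
    root = ET.fromstring(nav_xhtml)
    return [("".join(a.itertext()).strip(), a.get("href"))
            for a in root.iter(XHTML_A) if a.get("href")]


def check_toc(opf_xml, nav_xhtml):
    """Report spine files missing from the TOC and duplicated TOC entries."""
    entries = toc_entries(nav_xhtml)
    # Ignore "#fragment" parts when deciding whether a file is linked at all.
    linked_files = {href.split("#")[0] for _, href in entries}
    title_counts = Counter(title for title, _ in entries)
    target_counts = Counter(href for _, href in entries)
    return {
        "missing_from_toc": [h for h in spine_hrefs(opf_xml)
                             if h not in linked_files],
        "duplicate_titles": sorted(t for t, n in title_counts.items() if n > 1),
        "duplicate_targets": sorted(h for h, n in target_counts.items() if n > 1),
    }
```

A duplicate title is not always a mistake (two chapters may really share a name), so I treat the output as a list of places to look at, not as a list of errors.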
A mimetype is a description of the file type. It is declared for every file in the manifest section of the package file (OPF). Readers use it to decide how to display the file or, in the case of video and audio, which codec to use. Usually the mimetype is obvious from the file extension (for example, MP3); however, readers usually don't use the file extension, they use the mimetype stated for the file. When the stated mimetype does not match the actual file, the reader may handle the file incorrectly. The automatic validator will not always find such errors. These are critical errors that can cause the reader to crash.
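A first pass over the manifest can be automated by comparing each declared mimetype with the file extension. The sketch below is only an illustration: the extension table is my own short list and is not complete, and an extension match does not prove that the file content really is of that type.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Mimetypes expected for some common extensions (illustrative, not complete).
EXPECTED = {
    ".xhtml": {"application/xhtml+xml"},
    ".html": {"application/xhtml+xml"},
    ".css": {"text/css"},
    ".jpg": {"image/jpeg"},
    ".jpeg": {"image/jpeg"},
    ".png": {"image/png"},
    ".gif": {"image/gif"},
    ".svg": {"image/svg+xml"},
    ".mp3": {"audio/mpeg"},
    ".mp4": {"video/mp4", "audio/mp4"},
    ".ncx": {"application/x-dtbncx+xml"},
}


def mismatched_media_types(opf_xml):
    """Return (href, declared mimetype) for manifest items that look wrong."""
    root = ET.fromstring(opf_xml)
    problems = []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        href = item.get("href", "")
        declared = item.get("media-type", "")
        extension = posixpath.splitext(href)[1].lower()
        expected = EXPECTED.get(extension)
        # Unknown extensions are skipped instead of being reported.
        if expected is not None and declared not in expected:
            problems.append((href, declared))
    return problems
```

Every item this reports still has to be opened and checked by hand, since the wrong part may be the extension and not the declared mimetype.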
There are two kinds of metadata: general metadata and file specific metadata. File specific metadata usually describes things such as encoding and scaling limitations. A scaling limitation can help force an image to stay readable; however, it may cause problems in several readers, so it is generally not recommended to use scaling limitations.
General metadata appears in the package file (OPF) and describes things such as book title, publisher, author and publishing date. This data is sometimes used for cataloging purposes, so although it is generally unseen, it is important to make sure it is correct and that nothing is missing.
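Missing general metadata is easy to detect automatically. Here is a small Python sketch that reads the metadata section of the OPF file and lists the Dublin Core elements that are absent or empty. The default list of required elements simply mirrors the fields mentioned above (the author is stored in the creator element); it is my assumption, not a rule taken from any store.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace


def missing_metadata(opf_xml,
                     required=("title", "creator", "publisher", "date")):
    """Return the names of required metadata elements that are absent or empty."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", OPF_NS)
    missing = []
    for name in required:
        elements = [] if metadata is None else metadata.findall(
            "{%s}%s" % (DC, name))
        # An element that exists but holds only whitespace counts as missing.
        if not any((element.text or "").strip() for element in elements):
            missing.append(name)
    return missing
```

This only tells you that a value is present. Whether the title or the author name is actually correct still has to be checked by a person.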
CSS is used to define graphical styles for headings, text and paragraphs. It is also used to define things such as line spacing and spacing between paragraphs. CSS files are declared in the manifest section of the package file and in the head of each HTML file.
There are two main issues we have noticed with CSS:
Multiple contradicting definitions
CSS defines font style, size and weight using classes. In some cases the same class appears twice with different definitions; in this case the reader should display the text according to the last definition. It should be noted, however, that sometimes the author wanted to use the first one and not the last one.
I have also seen cases where different CSS files were declared in the head of each HTML file, and each file had a different declaration for the same class name. This can also lead to the next type of issue.
Inconsistent usage of CSS definitions
I have seen several cases where a different CSS file was declared in each chapter, and each CSS file had different definitions for chapter headers etc. The result was inconsistent headers in the EPUB file.
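Both CSS issues come down to the same selector receiving different values for the same property, either inside one file or across several files. The Python sketch below looks for that with a deliberately naive regular expression parser: it ignores @media nesting and other advanced CSS syntax, so take it as a rough first filter under those assumptions and not as a real CSS parser. The function names are hypothetical.

```python
import re
from collections import defaultdict

RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")       # "selectors { declarations }"
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)    # CSS comments


def parse_rules(css_text):
    """Yield (selector, {property: value}) for each simple rule in css_text."""
    for selectors, body in RULE.findall(COMMENT.sub("", css_text)):
        declarations = {}
        for part in body.split(";"):
            if ":" in part:
                prop, value = part.split(":", 1)
                # Normalise case and whitespace so equal values compare equal.
                declarations[prop.strip().lower()] = " ".join(value.split())
        for selector in selectors.split(","):
            yield selector.strip(), declarations


def conflicting_properties(stylesheets):
    """stylesheets maps a file name to its CSS text.

    Returns {(selector, property): sorted list of differing values}.
    """
    values = defaultdict(set)
    for text in stylesheets.values():
        for selector, declarations in parse_rules(text):
            for prop, value in declarations.items():
                values[(selector, prop)].add(value)
    return {key: sorted(found) for key, found in values.items()
            if len(found) > 1}
```

For example, if one chapter's stylesheet sets font-size to 2em for h1.chapter and another sets it to 1.5em, the result contains the key ("h1.chapter", "font-size") with both values, which is exactly the kind of inconsistency described above.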
Links can either lead from one place to another inside the book, or simply be URLs of websites (in that case a browser should open to show the website). Links inside books are used especially in textbooks, for example to let the user read more about something. There are a few issues that can happen with links.
The obvious one is a link that leads to an invalid place in the book. One may think this is something the publisher should fix even before converting to EPUB; however, sometimes it is caused by the conversion itself, since the converter simply renamed one of the files without taking into account that there are links to this file from other places.
The next issue with links is the return link. Since most ebook readers, unlike browsers, do not have a back button (some of them do), publishers often add a link to go back. But sometimes this link either leads nowhere (similar to the previous issue) or, in many cases, leads to a totally different place in the book, which of course confuses the user.
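Links that lead nowhere can be found automatically. The following Python sketch takes the content files as a dictionary of archive path to XHTML text, collects every id in every file, and reports internal links whose target file or target id does not exist. It assumes well-formed XHTML and only knows about the files it is given, so a link to a file that is not in the dictionary is reported as broken. The function name broken_links is hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urldefrag, urlparse

XHTML_A = "{http://www.w3.org/1999/xhtml}a"


def broken_links(documents):
    """documents maps archive path -> XHTML text.

    Returns (source path, href) for internal links with no valid target.
    """
    ids = {}     # path -> set of id attributes found in that file
    links = []   # (source path, href) for every internal link
    for path, text in documents.items():
        root = ET.fromstring(text)
        ids[path] = {el.get("id") for el in root.iter() if el.get("id")}
        for anchor in root.iter(XHTML_A):
            href = anchor.get("href")
            # Links with a scheme (http:, mailto:, ...) point outside the book.
            if href and not urlparse(href).scheme:
                links.append((path, href))

    broken = []
    for source, href in links:
        target, fragment = urldefrag(href)
        if target:
            # Resolve the target relative to the directory of the source file.
            target_path = posixpath.normpath(
                posixpath.join(posixpath.dirname(source), unquote(target)))
        else:
            target_path = source  # "#id" links stay in the same file
        if target_path not in ids or (fragment and fragment not in ids[target_path]):
            broken.append((source, href))
    return broken
```

A return link that leads to the wrong but existing place will not be caught by this; those still have to be followed by hand, one by one.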
We are talking about functional errors that cause invalid operation.
This is very similar to standard software QA where we systematically test each function.
In order to find out what to test, we will first manually analyze the code and then formulate tests to verify it.
Another thing we will do is make sure the code does not conflict with the reader's internal code.