Creating an eBook with ePub

This is a first-round overview of the ePub format used for digital books, which I’m finding is needed more and more in my freelance work. I’ll go over both how to code one by hand and the tools I’ve used to translate real-world files into an ePub.

Download the demo file: fakeepub.epub

As a note, this does not include the picture book format recently produced by Apple. That seems to be a pdf wrapped in another delivery format; I’ll update this post as I find more information.

In brief, ePubs are used by just about every digital reader and a good deal of quality desktop software (my personal preference is Stanza, which runs on all desktops and iPhones, although its recent acquisition by Amazon might bury any future growth). It is a file type for Electronic PUBlications, open in nature and convenient for development: http://en.wikipedia.org/wiki/EPUB. ePub’s weaknesses are quirks between different readers (iBooks, for example, ignores embedded fonts), the lack of precise layout control (I consider this a strength, since a liquid layout is best for the reader, but it generally sucks for graphic-intensive designs), and a few others I don’t know about for sure. The DRM that Apple and a few other folks use is outside the ePub standard, and something of their own agenda.

For the most part, the foundations of ePub are XML and its variant, XHTML, making it relatively easy for those of us who can code websites. What’s interesting is that a few core files are required, and the method for compressing a folder into an ePub is rather fickle (and was the most confusing part of development for me). I’ll note these as I discuss it briefly below.

Make sure you validate all your code, and avoid tables unless the data is actually tabular. I find eReaders are not as forgiving as browsers, so it’s time to improve your craft.

Online resources for your arsenal

great online validator of your final epub http://www.threepress.org/document/epub-validate/
if you run your own server, you can host your own validator http://code.google.com/p/epubcheck/
an old school tutorial, but thorough http://www.hxa.name/articles/content/epub-guide_hxa7241_2007.html
an updated brief based off that tutorial http://sites.google.com/site/spontaneousderivation/an-epub-tutorial
The guy who wrote that tutorial @ArachneJericho
A ruby script to compress your files http://code.google.com/p/ruby-epub/source/browse/trunk/lib/epub/project.rb#106
an actionscript if you’d prefer a command line alternative http://www.mobileread.com/forums/showthread.php?t=55681

File Structure

Build your book in a folder; later it will be converted into the special ‘epub’ file (which is essentially a customized zip package). Here’s a screenshot of a complete ePub I’ve made, uncompressed, details below:

Content Folder

The Content Folder contains the various XHTML files I’ve made for each section of the book, my images inside an image folder, and my stylesheet. Although it’s important to put all these files in some folder, you do not need to name it content. Based on a few tutorials, I’ve chosen to do this for my own simplicity; you may see more complicated names or structures in other ePubs.

When writing an ePub, you could make one huge xhtml file with the whole book, but you’d limit the ePub reader’s ability to automatically number, link and find chapters, and you may overcomplicate your development. In the sample file, you’ll see I’ve made individual html files (that validate à la validator.w3.org, of course) per chapter, as well as for sections such as the title page, intro page, copyright page, and anything else I deemed appropriate. You will organize this information in a separate file later on; for now, simply know this area is where you put the meat and potatoes, as semantically rich as possible.

Organizing your images in a single image folder apparently helps keep things simple for some eReaders, and you may, of course, add a stylesheet as you would for any other webpage. Adding fonts can be done by including the font files and using the standard CSS @font-face, although this is not fully supported by iBooks (and at the moment only vaguely in Mobile Safari, where SVG fonts are the more stable option). I recommend using Font Squirrel if you want to test this road.

Regarding CSS in general, avoid a fixed layout with defined widths for the whole page, or you limit your users’ control. Reading is something personal, and they will hate your draconian design obsession. If you’re trying to define a section with a smaller width, consider following liquid and flexible layout rules: use margins or padding on the sides, set in percentages or ems. Regardless, if you do use points or pixels, my poor, unscientific testing shows eReaders will simply try to override them anyhow.

Here is a screenshot of a portion of the content folder (some tutorials referenced xhtml as the extension, but I see html working just fine, so not necessary from what I can tell at this juncture):

Pro Tip: if you don’t want to fret with validator.w3.org, don’t worry, you’re coding XML – open up any of those files in a Webkit or Gecko browser, and if you made a mistake, you should see the error listed front and center. As usual, Internet Explorer is an idiot, avoid.
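If you would rather check the whole folder in one pass than open each file in a browser, the same well-formedness test can be roughly sketched in Python. The function name find_malformed and the list of extensions are my own assumptions, and this only checks that the XML parses, not that it is valid XHTML. Note that Python's parser knows no HTML named entities, so something like &nbsp; will be reported as an error here even where a browser might accept it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Extensions treated as XML in this sketch (an assumption, adjust to taste).
XML_SUFFIXES = {".xhtml", ".html", ".xml", ".opf", ".ncx"}


def find_malformed(folder):
    """Return (path, error message) pairs for files that fail to parse as XML."""
    problems = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in XML_SUFFIXES:
            continue
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # The message includes the line and column of the first error.
            problems.append((str(path), str(err)))
    return problems
```

Calling find_malformed("path/to/your/book") returns an empty list when every file parses.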

META-INF Folder

A required folder, this guy should be in the root as shown. The file container.xml inside is also required. This file has only the following lines of code, and nothing more (I have yet to find clear information if there’s anything else that can be added later, will update if I do):

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
   <rootfiles>
      <rootfile full-path="metadata.opf" media-type="application/oebps-package+xml"/>
   </rootfiles>
</container>
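Since container.xml does nothing except point at the package file, a quick sanity check is to read its full-path value and confirm that file really exists. Here is a minimal sketch in Python; the function names rootfile_paths and missing_rootfiles are hypothetical, and it assumes the folder layout described in this post.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace copied from the container.xml listing above.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def rootfile_paths(book_folder):
    """Return the full-path attribute of every rootfile in META-INF/container.xml."""
    container = Path(book_folder) / "META-INF" / "container.xml"
    root = ET.parse(str(container)).getroot()
    return [el.get("full-path")
            for el in root.findall("c:rootfiles/c:rootfile", CONTAINER_NS)]


def missing_rootfiles(book_folder):
    """Return the listed rootfile paths that do not exist in the book folder."""
    return [p for p in rootfile_paths(book_folder)
            if not (Path(book_folder) / p).is_file()]
```

An empty list from missing_rootfiles means the pointer and the file agree.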

metadata.opf file

A required file with specialized ePub XML data (think of something similar to an Atom or RSS file), metadata.opf documents all the preferences and details needed for your eReader to render your book’s interior pages. Like all XML files, if it is created with errors, your ebook will fail to load: for example, iTunes will accept the file, send it to your iPad… but it won’t show in iBooks. No warning. Helpful.

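Oddly, every demo I found excluded the XML declaration (the <?xml ... ?> line, which is not a doctype) at the top of this document. XML 1.0 treats the declaration as optional, so leaving it out is still well-formed, but I have not thoroughly tested whether it is a point of error in readers. For now, I’m excluding it to match existing ePubs.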

The overall containing tag is package with various metadata:

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
</package>

Inside and at the top, you have metadata that calls out the title (required), the author (can also add illustrator and a slew of other options if you google), language, unique identifier (required, and really needs to be unique- consider adding your website and various characters if you need to make something up), and the legal rights:

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
	<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
		<dc:title>This is a fake ePub</dc:title>
		<dc:creator opf:file-as="Frey, Brady J." opf:role="aut">Brady J. Frey</dc:creator>
		<dc:language>en-US</dc:language>
		<dc:identifier id="bookid">ISBN: 11-11-11-11</dc:identifier>
		<dc:rights>Public Domain</dc:rights>
	</metadata>
</package>

After metadata, you have manifest, which must list every single xhtml file, image, font, and stylesheet you’re using in the book. Pay attention to the media-type, which should match the file; order is irrelevant. You’ll also notice there is a special ncx file type listed here too. That’s the table of contents, and we’ll get to it in a moment. Unique IDs are required for each entry. I’ve shortened the example here compared to the demo file:

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
	<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
		<dc:title>This is a fake ePub</dc:title>
		<dc:creator opf:file-as="Frey, Brady J." opf:role="aut">Brady J. Frey</dc:creator>
		<dc:language>en-US</dc:language>
		<dc:identifier id="bookid">ISBN: 11-11-11-11</dc:identifier>
		<dc:rights>Public Domain</dc:rights>
	</metadata>
	<manifest>
		<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
		<item id="style" href="content/stylesheet.css" media-type="text/css"/>
		<item id="intropage" href="content/intropage.xhtml" media-type="application/xhtml+xml" />
		<item id="title_page" href="content/title_page.xhtml" media-type="application/xhtml+xml" />
		<item id="image_1" href="content/images/1.jpg" media-type="image/jpeg" />
		<item id="image_2" href="content/images/2.jpg" media-type="image/jpeg" />
	</manifest>
</package>
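Typing every manifest entry by hand gets tedious once a book has dozens of images, so here is a rough Python sketch that prints item lines for a content folder. Everything named here is an assumption for illustration: the manifest_items function, the extension table, and the ID scheme (the file path with punctuation flattened to underscores, which could collide for very similar file names). It maps .html to application/xhtml+xml on the assumption that your html files are really XHTML, as in this post.

```python
import re
from pathlib import Path
from xml.sax.saxutils import quoteattr

# Extension-to-media-type table; extend it for fonts or other assets you ship.
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".html": "application/xhtml+xml",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
}


def manifest_items(book_folder, content_dir="content"):
    """Return one <item .../> line per recognized file under content_dir."""
    root = Path(book_folder)
    lines = []
    for path in sorted((root / content_dir).rglob("*")):
        media_type = MEDIA_TYPES.get(path.suffix.lower())
        if not path.is_file() or media_type is None:
            continue  # skip folders and file types not in the table
        href = path.relative_to(root).as_posix()
        # Hypothetical ID scheme: the path with punctuation turned into underscores.
        item_id = "item_" + re.sub(r"[^A-Za-z0-9]+", "_", href)
        lines.append(
            f"<item id={quoteattr(item_id)} href={quoteattr(href)} "
            f"media-type={quoteattr(media_type)}/>"
        )
    return lines
```

Paste the returned lines inside manifest, then add the toc.ncx entry yourself, since it lives outside the content folder.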

Finally, we have the spine, which defines the reading order of the files by ID and references that table of contents we’ll talk about later (the toc attribute points at the manifest ID of the ncx file). Here, order matters, and you only list your xhtml documents. After that, the file is done:

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
	<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
		<dc:title>This is a fake ePub</dc:title>
		<dc:creator opf:file-as="Frey, Brady J." opf:role="aut">Brady J. Frey</dc:creator>
		<dc:language>en-US</dc:language>
		<dc:identifier id="bookid">ISBN: 11-11-11-11</dc:identifier>
		<dc:rights>Public Domain</dc:rights>
	</metadata>
	<manifest>
		<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
		<item id="style" href="content/stylesheet.css" media-type="text/css"/>
		<item id="intropage" href="content/intropage.xhtml" media-type="application/xhtml+xml" />
		<item id="title_page" href="content/title_page.xhtml" media-type="application/xhtml+xml" />
		<item id="image_1" href="content/images/1.jpg" media-type="image/jpeg" />
		<item id="image_2" href="content/images/2.jpg" media-type="image/jpeg" />
	</manifest>
	<spine toc="ncx">
		<itemref idref="intropage" />
		<itemref idref="title_page" />
	</spine>
</package>
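Because a reader may simply refuse the book without saying why, it can help to cross-check the spine against the manifest before packaging. The following is a minimal Python sketch under the rules described above (every idref must be a manifest ID, the toc attribute must be a manifest ID, and spine entries should be xhtml documents). The function name check_opf is my own, and this is not a substitute for a real validator.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the package element above.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_opf(opf_text):
    """Return a list of human-readable problems found in an OPF document string."""
    root = ET.fromstring(opf_text)
    problems = []
    items = {}  # manifest id -> media-type
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        item_id = item.get("id")
        if item_id in items:
            problems.append(f"duplicate manifest id {item_id!r}")
        items[item_id] = item.get("media-type")
    spine = root.find("opf:spine", OPF_NS)
    if spine is None:
        problems.append("no spine element")
        return problems
    if spine.get("toc") not in items:
        problems.append(f"spine toc {spine.get('toc')!r} is not in the manifest")
    for ref in spine.findall("opf:itemref", OPF_NS):
        idref = ref.get("idref")
        if idref not in items:
            problems.append(f"spine idref {idref!r} is not in the manifest")
        elif items[idref] != "application/xhtml+xml":
            problems.append(f"spine idref {idref!r} is not an XHTML document")
    return problems
```

Read metadata.opf into a string and pass it in; an empty list means the spine and manifest agree.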

mimetype file

Required in the root, named exactly mimetype with no file extension, and containing one simple line with nothing before or after it (not even a trailing line break):

application/epub+zip
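Many text editors quietly add a line break at the end of a file, which, as far as I understand the container rules, is not allowed here. One way around that is to write the file from a script. This is a small Python sketch, with write_mimetype and mimetype_is_clean as hypothetical helper names.

```python
from pathlib import Path

# The exact bytes the file should hold: ASCII text, no newline, no byte order mark.
MIMETYPE = b"application/epub+zip"


def write_mimetype(book_folder):
    """Write the mimetype file into the root of the book folder."""
    path = Path(book_folder) / "mimetype"
    path.write_bytes(MIMETYPE)
    return path


def mimetype_is_clean(book_folder):
    """Return True if the existing mimetype file holds exactly the expected bytes."""
    return (Path(book_folder) / "mimetype").read_bytes() == MIMETYPE
```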

toc.ncx file

Your required table of contents. This file has a specialized doctype, a uid number (required, and if it doesn’t match your metadata file identifier tag, it will fail in the reader), a depth listing (you can do sub chapters in chapters, for example), and a total/max page count (set to zero means as much as you want).

Similar to an xhtml document, the ncx is the containing tag like html, with a head tag. Oddly, compared to xhtml, the title of the book is outside the head. Various meta tags are available to allow extra features to your table of contents – Google each name for further details:

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
<head>
	<meta name="dtb:uid" content="ISBN: 11-11-11-11"/>
	<meta name="dtb:depth" content="1"/>
	<meta name="dtb:totalPageCount" content="0"/>
	<meta name="dtb:maxPageNumber" content="0"/>
</head>
<docTitle>
		<text>This is a fake ePub</text>
</docTitle>
</ncx>

Beneath the docTitle you have the navMap, which lists each chapter as a navPoint. navPoints require a unique ID and a numerical playOrder. Inside each navPoint is a navLabel with the text you want to show up in your table of contents, and a self-closing content tag that links to the corresponding chapter file. Various online references question whether the content tag is needed if the ID matches the spine you created in metadata.opf, but it doesn’t hurt to have it:

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
<head>
	<meta name="dtb:uid" content="ISBN: 11-11-11-11"/>
	<meta name="dtb:depth" content="1"/>
	<meta name="dtb:totalPageCount" content="0"/>
	<meta name="dtb:maxPageNumber" content="0"/>
</head>
<docTitle>
		<text>This is a fake ePub</text>
</docTitle>
<navMap>
	<navPoint id="intropage" playOrder="1">
		<navLabel>
			<text>Introduction</text>
		</navLabel>
		<content src="content/intropage.xhtml"/>
	</navPoint>
	<navPoint id="title_page" playOrder="2">
		<navLabel>
			<text>Title Page</text>
		</navLabel>
		<content src="content/title_page.xhtml"/>
	</navPoint>
</navMap>
</ncx>
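The uid mismatch mentioned earlier is easy to introduce when you edit one file and forget the other, so here is a hedged Python sketch that compares the two values. The function names opf_identifier, ncx_uid and uids_match are assumptions of mine; the namespaces are copied from the listings in this post.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
}


def opf_identifier(opf_text):
    """Return the text of the dc:identifier that unique-identifier points at."""
    root = ET.fromstring(opf_text)
    wanted_id = root.get("unique-identifier")
    for el in root.findall("opf:metadata/dc:identifier", NS):
        if el.get("id") == wanted_id:
            return (el.text or "").strip()
    return None


def ncx_uid(ncx_text):
    """Return the content of the dtb:uid meta tag in the ncx head."""
    root = ET.fromstring(ncx_text)
    for meta in root.findall("ncx:head/ncx:meta", NS):
        if meta.get("name") == "dtb:uid":
            return meta.get("content")
    return None


def uids_match(opf_text, ncx_text):
    """Return True only if both identifiers exist and are the same string."""
    identifier = opf_identifier(opf_text)
    return identifier is not None and identifier == ncx_uid(ncx_text)
```

The comparison is exact, so a stray space inside the ncx content attribute counts as a mismatch.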

And that’s it – if all your XML validates, you’ve completed your code… now it’s time to create an ePub.

Turning your folder into an ePub

The hard part is done; now you have to create your self-contained ePub file, which can be tricky. An ePub is a zip file for the most part, but it requires the mimetype file to be the first entry in the archive, stored without compression and without certain, well, zip metadata (extra fields), while the rest of the files can be compressed as usual! After a bit of head scratching, I found:

http://code.google.com/p/ruby-epub/source/browse/trunk/lib/epub/project.rb#106 a ruby script
But I would prefer to do this command line, so I also found this tutorial with an Actionscript for download: http://www.mobileread.com/forums/showthread.php?t=55681

I would follow those directions to a T, and Mac users should be sure to exclude the dreaded, hidden .DS_Store file, which will do more damage than good. Just zipping your folder and renaming it to epub will cause the book to fail. By now, if you’ve done everything right, go ahead and run that freshly compressed epub file here:

http://www.threepress.org/document/epub-validate/

For those of you who downloaded the demo file I included at the beginning of this post: uncompress it, and run that folder through any of the above scripts (or the command line) to make your own ePub file. Run it through the validator for a quick check!
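For anyone who would rather not depend on the linked scripts, the packaging step in this section can also be sketched with Python's standard zipfile module. This is not the method those scripts use, just one way to get the same result: mimetype goes in first and uncompressed, everything else is compressed afterwards, and .DS_Store files are skipped. The function name build_epub is hypothetical.

```python
import zipfile
from pathlib import Path

# Files that should never end up in the book.
SKIP_NAMES = {".DS_Store"}


def build_epub(book_folder, epub_path):
    """Zip book_folder into epub_path with mimetype first and uncompressed."""
    root = Path(book_folder)
    target = Path(epub_path).resolve()
    with zipfile.ZipFile(str(epub_path), "w") as archive:
        # The mimetype entry must come first and must be stored, not deflated.
        archive.write(str(root / "mimetype"), "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.name in SKIP_NAMES:
                continue
            if path.resolve() == target:
                continue  # do not zip the output file into itself
            arcname = path.relative_to(root).as_posix()
            if arcname == "mimetype":
                continue  # already added above
            archive.write(str(path), arcname,
                          compress_type=zipfile.ZIP_DEFLATED)
    return epub_path
```

Something like build_epub("fakeepub", "fakeepub.epub") should produce a file you can hand to the validator, which remains the final word on whether the package is correct.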

How to convert a physical book into an ePub

It is, of course, illegal to convert a physical book into an ePub if it is copyrighted. However, you may find some older books not yet converted and outside of copyright, which are fair game (or like me, you do it for a test and for non-commercial use, although I’m still legally liable for such a creation, so not recommended), and lord knows those old ‘books’ are just taking space on your ‘book shelf’… why not convert them for Kindle, Sony Reader or iPad?

Doing so can be daunting, depending on your software, and gloriously destructive. Below I’ll list out my brief process:

Cut the spine off of my book
Scan all the pages as a pdf (may I recommend the awesome Fujitsu that feeds multiple pages like a fax machine, and scans front to back in real time)
Since Adobe Acrobat Pro has decent but largely inaccurate OCR, run it through ABBYY FineReader (the Mac version is http://www.abbyy.com/finereader_for_mac/ and the full version for Windows is considered an industry standard for archivists, though others I’ve known swear by http://www.nuance.com/) with OCR pdf output. You can, if you choose, output as rich text (no images though) or html, but I found its html so bad it wasn’t worth cleaning up. After your first scan, ABBYY will launch a window that allows you to mask out sections of a page as either an image, a table, or text, which was useful in my book’s case. When you’re done, you can export your optimized file.
If you’ve chosen pdf, open it back up in Acrobat, and export as HTML. Its code sucks, but it’s mostly divs and a few misplaced spans, which are easier to clean up.
Use your editor to mass remove the junk and change it to standard, semantic code… not to mention style sheets or anything else you need.
Follow my directions above to code your book.

There is one program that does a good job converting your eBooks into various formats, including ePub, that may be useful, called Calibre:

http://calibre-ebook.com

My tests showed a bit of inconsistency with the output unless your html when added was pristine (cough, garbage in garbage out). Since that was the case, I figured it’d be faster to code everything by hand as I had to redo all the html pages anyhow. Your mileage may vary.

That’s it for now – I’m still experimenting with cover pages, they seem to be different per reader. If anyone has any questions, please feel free to ask!