How to Edit EPUB Books Manually
At one time or another, you may find yourself cornered into modifying EPUB files manually. This may be the result of certain validation errors that can’t be fixed using your favourite authoring software. In those cases, it’s best to work at a low level and get your hands dirty. There is nothing to fear, though: you can wash your hands afterwards.
Our experience with Kotobee showed us many forms of validation errors that are nearly impossible to correct using software. No matter how much effort we put into Kotobee Author development, some types of errors are inevitable because they originate from the user. For example, those who create their ebook by copying and pasting from external PDF sources can bring along a number of hidden, meaningless characters with their content, invisible to the eye but visible (and prohibited) to the EPUB validator. That is a problem.
In this brief article, we will explain step by step how to accomplish a few goals, by editing an EPUB file ourselves. If you are not yet very familiar with the EPUB format, a strongly recommended read can be found here: Alice in EPUB-land: Understanding the EPUB Format.
Just to recap: an EPUB file is a compressed ZIP archive file. Nothing more, nothing less. Let’s go over a few simple operations:
To extract the contents, rename the extension to .zip. Use any Zip extraction tool, and grab the files inside the folder. I personally use WinRAR, which lets me right-click on the file and choose Extract to ‘folder name’.
After completing your edits, select all the files at the root level (the ones you extracted), and compress them back into a .zip file. On Windows, this can be done by right-clicking and selecting Send to > Compressed (zipped) folder from the context menu. Rename the extension to .epub.
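For those who prefer a script over right-clicking, the same two operations can be sketched in Python with the standard zipfile module. This is a minimal sketch, not a definitive tool: the function names extract_epub and build_epub are made up for illustration. One detail it handles that a generic Zip tool may not guarantee is that validators expect the mimetype file to be the first entry in the archive and to be stored uncompressed.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, out_dir):
    """Unpack an EPUB (a ZIP archive) into out_dir; no renaming to .zip needed."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)


def build_epub(src_dir, epub_path):
    """Zip the contents of src_dir (not the folder itself) into an EPUB.

    epub_path should live outside src_dir, or it would be packed into itself.
    """
    src = Path(src_dir)
    mimetype = src / "mimetype"
    with zipfile.ZipFile(epub_path, "w") as zf:
        if mimetype.is_file():
            # Written first and without compression, as validators expect.
            zf.write(mimetype, "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            if path.is_file() and path != mimetype:
                # Archive names are relative to the root and use forward slashes.
                zf.write(path, path.relative_to(src).as_posix(),
                         compress_type=zipfile.ZIP_DEFLATED)
```

A typical round trip would be extract_epub("book.epub", "book"), then your manual edits, then build_epub("book", "book-fixed.epub"). The file and folder names here are placeholders.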
The most important file you should be after is the package file, which describes the entire book. The package file does not have a fixed location. Instead, its location is indicated inside the container.xml file, which you’ll find inside the META-INF folder. As a reminder, for further understanding of the EPUB anatomy, this is a recommended read: Alice in EPUB-land: Understanding the EPUB Format.
The following steps all require that you have the package file open in front of you, using any text editor. I prefer Notepad++.
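If you would rather not hunt for the package file by hand, the lookup can be sketched in a few lines of Python. This assumes the EPUB has already been extracted to a folder and that container.xml uses the standard OCF container namespace; find_package_file is a name chosen for this example only.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_file(epub_dir):
    """Return the path of the package file named in META-INF/container.xml."""
    root_dir = Path(epub_dir)
    tree = ET.parse(root_dir / "META-INF" / "container.xml")
    rootfile = tree.getroot().find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    # full-path is relative to the root of the extracted EPUB.
    return root_dir / rootfile.get("full-path")
```

If a container lists more than one rootfile, this sketch simply returns the first one.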
Editing chapter content
- From the package file, search for <manifest>.
- Directly under it you will see a long list of items, all the way until the closing tag </manifest>. This is an index of all the assets (chapters, images) existing inside the EPUB. The path to each asset is explicitly defined.
- To distinguish chapters from the other types of assets, chapters have a media-type value of “application/xhtml+xml”.
- There isn’t a way to preview the content (or even title) of each chapter unless you open it. So use the path defined for each chapter item, and open the document in your text editor.
- Apply the edits you need. Once you’re done, simply save. That’s all there is to it!
Changing the chapter order
- From the package file, search for <spine>.
- The list found underneath (all the way until the closing tag </spine>) defines the navigation order of chapters. Each item is a chapter by itself. The idref attribute holds the ID of the chapter, which links to a unique item in the manifest list through its id attribute. To change the order, move the itemref lines up or down within the list.
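To see which file sits behind each spine entry without cross-referencing IDs by eye, the manifest and spine can be read together. The following is a minimal Python sketch, assuming the package file uses the standard OPF namespace; chapters_in_reading_order is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by elements in the package file.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def chapters_in_reading_order(package_xml):
    """Return (id, href) pairs for XHTML items, ordered as in the spine."""
    root = ET.fromstring(package_xml)
    # Map manifest ids to hrefs, keeping chapters (XHTML documents) only.
    chapters = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
        if item.get("media-type") == "application/xhtml+xml"
    }
    # The spine lists idrefs in navigation order.
    order = [ref.get("idref")
             for ref in root.findall("opf:spine/opf:itemref", OPF_NS)]
    return [(idref, chapters[idref]) for idref in order if idref in chapters]
```

Pass it the contents of the package file, read as bytes or text. The hrefs it returns are relative to the package file, exactly as they appear in the manifest.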
Removing a chapter
- Find the chapter that you want to remove. Finding it will require you to open the chapters one by one, as the package file does not describe their content. But the order in the spine should give you a hint. If the chapter is one of the last chapters, then look near the bottom of the spine list. You get the point.
- You will need to remove three things: the document file, the spine item, and the manifest item. Missing any one of these will give you an EPUB validation error.
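Since forgetting one of the three pieces is easy, a quick consistency check before running the validator can help. The sketch below is one possible approach in Python, not a replacement for a real EPUB validator: it only reports spine items with no manifest item, and manifest items whose file is missing on disk. The name check_package is hypothetical, and the standard OPF namespace is assumed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import unquote

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_package(package_path):
    """List spine entries and manifest items that point at nothing."""
    package_path = Path(package_path)
    root = ET.parse(package_path).getroot()
    items = root.findall("opf:manifest/opf:item", OPF_NS)
    manifest_ids = {item.get("id") for item in items}
    problems = []

    for ref in root.findall("opf:spine/opf:itemref", OPF_NS):
        idref = ref.get("idref")
        if idref not in manifest_ids:
            problems.append(f"spine item '{idref}' has no manifest item")

    for item in items:
        item_id = item.get("id")
        href = item.get("href", "")
        if "://" in href:
            continue  # remote resource, nothing to check on disk
        # hrefs are relative to the folder holding the package file.
        target = package_path.parent / unquote(href)
        if not target.is_file():
            problems.append(f"manifest item '{item_id}' points to a missing file: {href}")

    return problems
```

An empty list means none of these particular leftovers were found. It does not mean the book is valid.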
Adding a chapter
- Create a new HTML file anywhere, preferably beside the other HTML files. To avoid basic errors, it’s better to copy one of the existing chapters and modify it as needed.
- Inside the package file, under the manifest list, add a new item similar to any of the other existing chapters in the list (the order does not matter). Enter the new path of the chapter into the href attribute. The path must be relative to the package file. Also, assign a unique ID to the id attribute.
- Under the spine list, add a new item as such: <itemref idref="YOUR_UNIQUE_ID"/>. Here order does matter. So make sure you’ve placed it between the appropriate items.
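The part that tends to go wrong here is the href, since it has to be relative to the package file rather than to the EPUB root. As an illustration, the two lines to paste can be generated in Python. This is only a sketch: chapter_entries is a hypothetical helper, and it assumes the new chapter is an XHTML document.

```python
import os
from xml.sax.saxutils import quoteattr


def chapter_entries(chapter_path, package_path, item_id):
    """Build the manifest <item> and spine <itemref> lines for a new chapter."""
    # The href must be relative to the folder that holds the package file.
    package_dir = os.path.dirname(package_path) or "."
    href = os.path.relpath(chapter_path, package_dir)
    href = href.replace(os.sep, "/")  # EPUB paths use forward slashes
    item = (f"<item id={quoteattr(item_id)} href={quoteattr(href)} "
            'media-type="application/xhtml+xml"/>')
    itemref = f"<itemref idref={quoteattr(item_id)}/>"
    return item, itemref
```

For example, with the paths "EPUB/xhtml/ch9.html" and "EPUB/package.opf", the computed href is "xhtml/ch9.html". You still paste the item anywhere in the manifest and the itemref at the right position in the spine.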
Adding a property to an item
Each asset inside the book can have one or more properties assigned to describe it. For example, you can assign a cover-image property to an image, to indicate it’s the book cover. Or you can assign nav to an HTML document to indicate it’s the table of contents. For a list of all supported properties, see the manifest properties vocabulary in the EPUB specification.
- From the package file, search for <manifest>.
- From the item list underneath, search for the item in question.
- If the item does not have a properties attribute, you can add one, as follows:
<item id="chapter1.html" href="xhtml/ch1.html" properties="YOUR_PROPERTY" media-type="application/xhtml+xml"/>
- If the properties attribute exists, add your property name to the existing one(s), separating it with a space.
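The rule in the last two steps (create the attribute if it is missing, otherwise append with a space) is small enough to express as a function. A minimal Python sketch, with add_property as a made-up name, that also avoids adding the same property twice:

```python
def add_property(existing, new_property):
    """Return the properties value with new_property appended once.

    existing is the current attribute value, or None if the item has none.
    """
    values = (existing or "").split()
    if new_property not in values:
        values.append(new_property)
    # Properties are separated by single spaces.
    return " ".join(values)
```

The returned string is what goes inside properties="...".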
Find a bad (hidden) character
If you get a complaint from the EPUB validator about an unsupported character, then most probably you have copied a bad (hidden) character from an external source. The EPUB validator should tell you the path to the offending file. Here is an example of an error:
<message>RSC-016, FATAL, [Fatal Error while parsing file 'An invalid XML character (Unicode: 0x1f) was found in the element content of the document.'.], EPUB/xhtml/spptsa.html (22-307)</message>
- Locate the file with the error.
- Open it in Notepad++ (other advanced text editors may work as well, but avoid the basic Windows Notepad).
- In the menu, select View > Show Symbol > Show All Characters.
- You will start seeing new symbols in the document, such as LF (line feed) and CR (carriage return). Don’t worry, they are not a problem.
- Scan around the document. You should see some strange symbol in the middle of the document, clearly in the wrong place.
- Delete that symbol.
There’s a high chance that this error was repeated elsewhere in your book. So you would want a certain process to remove all instances of that character in bulk. That brings us to the next section. Instead of deleting the strange symbol, copy it instead. Let’s fight fire with fire.
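If scanning by eye proves tedious, the hunt can also be scripted. The sketch below, in Python, reports every character that falls outside the ranges XML 1.0 allows, which covers control characters such as the 0x1f in the error above. The name find_invalid_chars is chosen for this example, and the line and column it reports are not guaranteed to match the numbers printed by your validator.

```python
import re

# Anything outside the character ranges that XML 1.0 permits in a document.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Yield (line, column, code point) for each character XML 1.0 forbids."""
    # Split on "\n" only: str.splitlines() would also split on some of the
    # very control characters we are looking for and hide them.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            yield line_no, match.start() + 1, f"0x{ord(match.group()):x}"
```

Feed it the contents of the file named in the error message, read as UTF-8 text, and loop over the results to see where the offenders are.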
Finding and replacing in bulk
For books consisting of several dozen chapters, you would want a more automated way to apply your edits throughout the whole book. The popular Find and Replace function is just what you need. However, you’ll need to use it in a tool that supports searching within folders instead of a single file. Again, this is where Notepad++ comes in handy.
- Open Notepad++
- From the menu, select Search > Find in Files...
- Enter what you are searching for in the Find what field.
- Enter what you want to replace it with in the Replace with field.
- In the Directory field, paste the path to the extracted EPUB folder. Do not forget this step! You can go deeper than that if all the chapters are collected in a specific folder you know of.
- Warning: This is considered a dangerous operation, since it can change literally hundreds if not thousands of files without an undo operation. Verify your entered values carefully. It is strongly recommended to click on Find All first, to make sure you’re getting what you intended.
- Click on Replace in Files. And you just saved yourself a day’s worth of work!
You can use this technique to get rid of strange symbols scattered across multiple documents. Once you find the strange symbol, using the steps outlined in the previous section, copy it to your clipboard and paste it into the Find what field. Nothing will appear, as it is a hidden character. But trust me, it is there. That sneaky son of a glitch. Leave the Replace with field empty. And now you’re set. Clicking on Replace in Files will delete this character everywhere else in your EPUB.
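For completeness, the same bulk clean-up can be sketched as a Python script. It is an alternative to the Notepad++ route, not what Notepad++ does internally. It assumes the chapters are UTF-8 files ending in .html or .xhtml, and strip_invalid_chars is a hypothetical name. Like Find All, it defaults to a dry run so you can see what would change before anything is written.

```python
import re
from pathlib import Path

# Anything outside the character ranges that XML 1.0 permits in a document.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_chars(folder, suffixes=(".html", ".xhtml"), dry_run=True):
    """Remove XML-invalid characters from matching files; return {path: count}."""
    changed = {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # newline="" keeps the original line endings untouched.
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
        cleaned, count = INVALID_XML_CHARS.subn("", text)
        if count:
            changed[path] = count
            if not dry_run:
                # There is no undo here either, so keep a backup of the folder.
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(cleaned)
    return changed
```

Run it once with the default dry_run=True to review the counts per file, then again with dry_run=False to apply the change.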
Once you know the insides of an EPUB file, you will feel capable of doing anything to it. You’re not restricted to a particular piece of software, nor do you need to pay for over-valued ebook services for what you would discover to be simple changes. Validation error messages are usually clear enough to indicate exactly what the problem is. From our side, we will continue to enhance Kotobee Author, to ultimately find ways to catch and correct user-generated errors that are tricky and extremely difficult to catch otherwise. And three simple words before we close the curtains...
Never underestimate Notepad++