Extract pages from EPUB files

scott.hollows · 15 April 2022 05:50

I need to extract pages from ePub files

Does anyone know how to do this?

A Delphi solution is preferred, but a DOS Batch method that involves calling other programs might also work

Geoff · 15 April 2022 07:14

I don’t know that EPUB has pages as such. An ePub file is a zip file that contains xhtml files. There are protected versions of them as well, though I’m not sure how the protection is done.

Didier · 15 April 2022 14:05

As Geoff mentions, the concept of pages in epub is ambiguous. While references to the printed book’s page numbers can optionally be retained alongside the text, an epub page otherwise describes the amount of content that fits on the reading device’s screen, which of course varies with the device. Epub publications are only chunked by meaningful divisions, for example chapters, sections, paragraphs, etc…

I presume, though, that you are referring to the textual content of the book. That content is held in one or more xml (xhtml) files inside the epub file, which is itself a zip-format container for all the files making up the publication. You can simply open the file with 7zip, winzip, etc.
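
For anyone who would rather script this than open the archive by hand, here is a minimal sketch in Python (not Delphi, sorry) that relies only on the standard zipfile module. The function names are my own for illustration, and it assumes an unprotected epub.

```python
import zipfile


def list_epub_entries(epub_path):
    """Return the names of all files stored inside the epub (zip) container."""
    with zipfile.ZipFile(epub_path) as archive:
        return archive.namelist()


def extract_epub(epub_path, target_dir):
    """Unpack the whole container, keeping its internal folder structure."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(target_dir)
```

Calling list_epub_entries("book.epub") on a hypothetical file should show the xhtml files, stylesheets, images and the META-INF folder, and extract_epub("book.epub", "book_out") writes them all to disk.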

To extract the contents faithfully, since an epub contains every presentation, navigation and formatting detail, you need a thorough xml parser and a solid knowledge of the epub specification, which is very extensive, especially if the publication conforms to the later versions of the spec. There are a myriad of ways to implement the specification when producing a publication, so unless you know roughly what range of it a given publisher uses, your parser has to expect the full gamut, and that’s a massive task. Or you could simply extract the text from the tags in the xml files and line up the results, hoping for an intelligible output.
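
As a rough illustration of that last "just pull out the text" option, the sketch below uses Python’s standard html.parser module, which tends to be more forgiving than a strict xml parser. The class and function names and the list of block-level tags are my own assumptions, and all formatting is deliberately thrown away.

```python
from html.parser import HTMLParser

# Tags whose content is not book text, and tags that should end a line.
# Both sets are illustrative choices, not anything the epub spec defines.
SKIPPED = {"head", "script", "style"}
BLOCKS = {"p", "div", "br", "li", "tr", "section",
          "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collects visible text, inserting a line break after block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCKS and not self.skip_depth:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def xhtml_to_text(markup):
    """Return the plain text of one xhtml content document."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(collector.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Feed it the decoded contents of each xhtml file in turn. Tables, footnotes and anything laid out by the stylesheet will come out flattened, so treat the result as a best effort.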

If your goal is only to display the content as best you can for a minimum amount of work, I’d suggest extracting all the xml files and their corresponding stylesheets, keeping the original folder structure, then opening the xml file(s) with a web browser component - this works well enough most of the time. If there is more than one xml file in the set, you will have to determine the reading order - that info is contained in the opf file, whose spine element lists the content files in reading order.
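
Here is a hedged sketch of how the reading order could be read out, again in Python with standard modules only. It assumes a well-formed, unprotected epub: META-INF/container.xml points at the opf file, the opf manifest maps ids to file paths, and the spine lists those ids in order. The function name reading_order is my own.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(epub_path):
    """Return the content file paths (inside the zip) in spine order."""
    with zipfile.ZipFile(epub_path) as archive:
        # container.xml tells us where the opf package file lives.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(archive.read(opf_path))

    # The manifest maps item ids to hrefs relative to the opf file.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    opf_dir = posixpath.dirname(opf_path)
    ordered = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href is not None:
            ordered.append(posixpath.normpath(posixpath.join(opf_dir, unquote(href))))
    return ordered
```

The returned paths are the names inside the zip, so they can be opened one after another from the extracted folder. Error handling for a missing container.xml or a malformed opf is left out to keep the sketch short.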

As to protected epubs, they have to be unlocked before any of this, but that’s another story and the publication might be copyrighted.

Hope this helps,

Didier