Poor Man’s TOC Generation

Introduction

The following aims to be a quick method to prepare digitized documents for online publication.

As for tools, we only require a PDF editor such as Adobe Acrobat or the excellent PDFStudio, which is also available on Linux.

The document to be published is either scanned and saved directly as a PDF, or the PDF editor is used to combine the individual scans and save them as a single PDF.

The Metadata

As for the metadata, we use the already existing provisions for PDF: title, author and date fit into the standard fields of the PDF document properties. More detailed information, including (where available) a reference and a link to the OPAC, can be saved inside the IPTC container, which is an international exchange standard.
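
To check that the standard fields were actually filled in, one option (not part of the workflow described here) is poppler's pdfinfo tool, which prints the document properties as "Key: value" lines. The following is a minimal Python sketch under that assumption; the function names and the sample text are hypothetical and only serve as illustration.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn the 'Key: value' lines printed by pdfinfo into a dict."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, so dates such as 12:00:00 stay intact.
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def read_properties(pdf_path):
    """Run pdfinfo (assumed to be installed) on a PDF and parse its output."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Hypothetical sample of pdfinfo output, for illustration only.
sample = (
    "Title:          Arithmeticae libri duo\n"
    "Author:         P. Ramus\n"
    "Pages:          30\n"
)
properties = parse_pdfinfo(sample)

# List the standard fields that are still empty or absent.
missing = [name for name in ("Title", "Author") if not properties.get(name)]
print(missing)
```

On a real file, read_properties("document.pdf") would return the same kind of dictionary, provided pdfinfo is on the PATH.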

The TOC

In order to achieve a structured document, i.e. a document containing a table of contents (TOC), we have to manually create bookmarks which link the headers of books, chapters, sections and sub-sections to their specific pages.

In a second step we will then order these bookmarks in a hierarchical (structured) manner: a chapter is inside a book, a section is inside a chapter and so on.

What we want to achieve looks like this:

Arithmeticae libri duo
  | P. Ramus lectori
        | Errata corrige
  | P. Rami arithmeticae Liber I
        | Cap. I De notis arithmeticis
        | Cap. II De additione
        | Cap. III De subductione
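
For readers who think in data structures, the outline above is simply a tree: every bookmark has a title and a list of child bookmarks. A small Python sketch of that idea follows; the Bookmark class and the render function are hypothetical names chosen for illustration, not part of any PDF tool.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """One TOC entry with its nested entries."""
    title: str
    children: list = field(default_factory=list)


def render(bookmark, depth=0):
    """Return the outline as indented lines, one per bookmark."""
    lines = ["  " * depth + bookmark.title]
    for child in bookmark.children:
        lines.extend(render(child, depth + 1))
    return lines


# The example structure from the text: a book contains chapters,
# and a chapter contains sections.
toc = Bookmark("Arithmeticae libri duo", [
    Bookmark("P. Ramus lectori", [Bookmark("Errata corrige")]),
    Bookmark("P. Rami arithmeticae Liber I", [
        Bookmark("Cap. I De notis arithmeticis"),
        Bookmark("Cap. II De additione"),
        Bookmark("Cap. III De subductione"),
    ]),
])
print("\n".join(render(toc)))
```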

Creating the Bookmarks

To create the bookmarks, we use the existing tools of the PDF editor.

Normally, you choose the page you want to bookmark (e.g. the beginning of a chapter), choose “Add Bookmark” from the menu and type in the text of the title.

This can usually be done quite rapidly.

Click here for a video.

Re-Ordering the Bookmarks into a TOC

As we already said, in order to get a structured, hierarchical TOC the bookmarks have to be re-ordered.

Normally you do this by selecting an entry, holding down the mouse button and dragging the entry up or down and left or right until the outline resembles the structure of the book.

This takes some time to get used to, but even a long document should be finished in a couple of minutes.

Click here for a video.

Here is an example: link.

That’s it on the user side – the document is now ready to be used in the system!

Generating Images and Webpages

The following steps are for the administrator of the web site only.

PDF Tools

On Linux, we can use the pdftoppm and pdftohtml tools (both part of poppler) to generate the images and the web page.

pdftohtml -s -xml -i document.pdf

gives the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.20.4"/>
<page number="1" height="1246" width="910"/>
<page number="2" height="1245" width="988"/>
<page number="3" height="1275" width="951"/>
<page number="4" height="1270" width="987"/>
<page number="5" height="1281" width="1002"/>
<page number="6" height="1286" width="967"/>
<page number="7" height="1283" width="1005"/>
<page number="8" height="1270" width="944"/>
<page number="9" height="1271" width="989"/>
<page number="10" height="1283" width="1005"/>
<page number="11" height="1280" width="959"/>
<page number="12" height="1271" width="989"/>
<page number="13" height="1270" width="945"/>
<page number="14" height="1270" width="988"/>
<page number="15" height="1270" width="945"/>
<page number="16" height="1270" width="988"/>
<page number="17" height="1281" width="960"/>
<page number="18" height="1283" width="1005"/>
<page number="19" height="1272" width="948"/>
<page number="20" height="1271" width="989"/>
<page number="21" height="1278" width="955"/>
<page number="22" height="1283" width="1005"/>
<page number="23" height="1280" width="957"/>
<page number="24" height="1285" width="1007"/>
<page number="25" height="1270" width="944"/>
<page number="26" height="1283" width="1005"/>
<page number="27" height="1276" width="953"/>
<page number="28" height="1272" width="991"/>
<page number="29" height="1287" width="967"/>
<page number="30" height="1273" width="992"/>
<outline>
 <item page="4">Arithmeticae libri duo</item>
 <outline>
  <item page="5">P.Ramus lectori</item>
  <outline>
   <item page="9">Errata geometriae et corrigito</item>
  </outline>
  <item page="10">P. Rami arithmeticae liber I</item>
  <outline>
   <item page="10">Cap. I. De notis arithmeticis</item>
   <item page="10">Cap. II. De additione</item>
   <item page="12">Cap. III. De subdictione</item>
  </outline>
 </outline>
</outline>

The file contains everything we need to build a webpage: the numbered scan list, the scan dimensions, and the corresponding TOC including the references to the scans.

A customized XSLT stylesheet can now generate the actual (X)HTML.
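
For readers who prefer a script to a stylesheet, the same transformation could also be sketched in Python. This is an alternative to the XSLT approach, not the method used here. It assumes a well-formed XML file in which the children of a bookmark sit in an outline element directly after the bookmark's item element. The link pattern page-{page}.html and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def outline_to_html(outline, link="page-{page}.html"):
    """Render an <outline> element as a nested HTML list.

    A nested <outline> belongs to the <item> that precedes it, so it is
    placed inside that item's <li> before the <li> is closed.
    """
    parts = ["<ul>"]
    open_item = False
    for node in outline:
        if node.tag == "item":
            if open_item:
                parts.append("</li>")
            href = escape(link.format(page=node.get("page")))
            title = escape("".join(node.itertext()).strip())
            parts.append(f'<li><a href="{href}">{title}</a>')
            open_item = True
        elif node.tag == "outline":
            if not open_item:  # nested list without a preceding title
                parts.append("<li>")
                open_item = True
            parts.append(outline_to_html(node, link))
    if open_item:
        parts.append("</li>")
    parts.append("</ul>")
    return "".join(parts)


def page_sizes(root):
    """Map each page number to its (width, height) in pixels."""
    return {
        int(page.get("number")): (int(page.get("width")), int(page.get("height")))
        for page in root.iter("page")
    }


# Shortened, well-formed sample in the shape of the output shown above.
sample = """<pdf2xml producer="poppler" version="0.20.4">
<page number="1" height="1246" width="910"/>
<outline>
<item page="4">Arithmeticae libri duo</item>
<outline>
<item page="5">P.Ramus lectori</item>
<item page="10">P. Rami arithmeticae liber I</item>
</outline>
</outline>
</pdf2xml>"""

root = ET.fromstring(sample)
toc_html = outline_to_html(root.find("outline"))
sizes = page_sizes(root)
print(toc_html)
```

For a real document, ET.parse("document.xml").getroot() would replace the sample string. The page dimensions from page_sizes can be used for the width and height attributes of the image tags.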

The Images

The images, in turn, are generated by the following command:

pdftoppm -scale-to 1300 -jpeg document.pdf d001

The output is one JPEG per page, each scaled to fit within 1300 × 1300 px.
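
To connect the images with the page numbers in the XML file, a script needs to know which file belongs to which page. The Python sketch below assumes that pdftoppm names its files prefix-number.jpg. Because the amount of zero padding in the number may vary, the files are matched by pattern rather than guessed. The function names are hypothetical.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def render_pages(pdf_path, prefix, size=1300):
    """Run the pdftoppm command from the text (pdftoppm must be installed)."""
    subprocess.run(
        ["pdftoppm", "-scale-to", str(size), "-jpeg", str(pdf_path), str(prefix)],
        check=True,
    )


def collect_pages(directory, prefix):
    """Map page numbers to the image files found for the given prefix."""
    pattern = re.compile(re.escape(prefix) + r"-(\d+)\.jpg$")
    pages = {}
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            pages[int(match.group(1))] = path
    return dict(sorted(pages.items()))


# Demonstration with empty placeholder files, so the sketch runs without poppler.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("d001-01.jpg", "d001-02.jpg", "d001-10.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = collect_pages(tmp, "d001")
    page_numbers = list(found)
    print(page_numbers)
```

In practice one would call render_pages("document.pdf", "d001") first and then collect_pages(".", "d001") on the directory that holds the results.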

Alternatively, one could first output PNGs and then convert them to WebP for even faster page loading and lower server load. Unfortunately, at the time of writing, WebP is only supported in Chrome.
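
The two-step variant could look roughly like the following Python sketch. It only builds the command lines: pdftoppm with its -png option for the first step and Google's cwebp encoder for the second. The use of cwebp, the quality value of 80 and the function names are assumptions made for illustration. They are not part of the setup described here.

```python
from pathlib import Path


def png_command(pdf_path, prefix, size=1300):
    """Step 1: the pdftoppm call from the text, with PNG instead of JPEG output."""
    return ["pdftoppm", "-scale-to", str(size), "-png", str(pdf_path), str(prefix)]


def webp_command(png_path, quality=80):
    """Step 2: a cwebp call that writes a .webp file next to the PNG."""
    png_path = Path(png_path)
    target = png_path.with_suffix(".webp")
    return ["cwebp", "-q", str(quality), str(png_path), "-o", str(target)]


first_step = png_command("document.pdf", "d001")
second_step = webp_command("d001-01.png")
print(" ".join(first_step))
print(" ".join(second_step))
```

Each list can be passed to subprocess.run(..., check=True) once both tools are installed.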
