Poor Man’s TOC Generation


The following aims to be a quick method to prepare digitized documents for online publication.

As for tools, we only require a PDF editor like Adobe Acrobat or the excellent PDFStudio, which is also available on Linux.

The document to be published is either scanned and saved directly as a PDF, or the PDF editor is used to combine the individual scans and save them as a single PDF.

The Metadata

As for the metadata, we use the already existing provisions for PDF: title, author and date fit into the standard fields of the PDF document properties. More detailed information, including (where available) a reference and a link to the OPAC, can be saved inside the IPTC container, which is an international exchange standard.


In order to achieve a structured document, i.e. a document containing a table of contents (TOC), we have to manually create bookmarks which link the headers of books, chapters, sections and sub-sections to their specific pages.

In a second step we will then order these bookmarks in a hierarchical (structured) manner: a chapter is inside a book, a section is inside a chapter and so on.

What we want to achieve looks like this:

Arithmeticae libri duo
  | P. Ramus lectori
        | Errata corrige
  | P. Rami arithmeticae Liber I
        | Cap. I De notis arithmeticis
        | Cap. II De additione
        | Cap. III De subductione
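
The target structure above can also be written down as data. The following is only a minimal sketch in Python, assuming a hypothetical list of (depth, title) pairs; it is not something the PDF editor produces, it merely illustrates what "a section is inside a chapter" means for the bookmark tree.

```python
# Hypothetical in-memory model of the outline shown above:
# each entry is (depth, title), with depth 0 being the book itself.
OUTLINE = [
    (0, "Arithmeticae libri duo"),
    (1, "P. Ramus lectori"),
    (2, "Errata corrige"),
    (1, "P. Rami arithmeticae Liber I"),
    (2, "Cap. I De notis arithmeticis"),
    (2, "Cap. II De additione"),
    (2, "Cap. III De subductione"),
]


def parent_of(outline, index):
    """Return the index of the nearest preceding entry with a smaller depth, or None."""
    depth = outline[index][0]
    for i in range(index - 1, -1, -1):
        if outline[i][0] < depth:
            return i
    return None


def render(outline):
    """Render the outline as indented text, four spaces per level."""
    return "\n".join("    " * depth + title for depth, title in outline)


print(render(OUTLINE))
```

Moving a bookmark left or right in the editor corresponds to changing its depth here; moving it up or down corresponds to changing its position in the list.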

Creating the Bookmarks

To create the bookmarks, we use the existing tools of the PDF editor.

Normally, you choose the page you want to bookmark (e.g. the beginning of a chapter), choose “Add Bookmark” from the menu and type in the text of the title.

This can usually be done quite rapidly.

Click here for a video.

Re-Ordering the Bookmarks into a TOC

As we already said, in order to get a structured, hierarchical TOC the bookmarks have to be re-ordered.

Normally you do this by choosing an entry, holding down the mouse button and dragging the entry up and left or right until the outline resembles the book structure.

This takes some time to get used to, but even a long document should be finished in a couple of minutes.

Click here for a video.

Here is an example: link.

That’s it on the user side – the document is now ready to be used in the system!

Generating Images and Webpages

The following steps are for the administrator of the web site only.

PDF Tools

On Linux, we can use the pdftoppm and pdftohtml tools to generate the images and the web page.

pdftohtml -s -xml -i document.pdf

gives the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.20.4"/>
<page number="1" height="1246" width="910"/>
<page number="2" height="1245" width="988"/>
<page number="3" height="1275" width="951"/>
<page number="4" height="1270" width="987"/>
<page number="5" height="1281" width="1002"/>
<page number="6" height="1286" width="967"/>
<page number="7" height="1283" width="1005"/>
<page number="8" height="1270" width="944"/>
<page number="9" height="1271" width="989"/>
<page number="10" height="1283" width="1005"/>
<page number="11" height="1280" width="959"/>
<page number="12" height="1271" width="989"/>
<page number="13" height="1270" width="945"/>
<page number="14" height="1270" width="988"/>
<page number="15" height="1270" width="945"/>
<page number="16" height="1270" width="988"/>
<page number="17" height="1281" width="960"/>
<page number="18" height="1283" width="1005"/>
<page number="19" height="1272" width="948"/>
<page number="20" height="1271" width="989"/>
<page number="21" height="1278" width="955"/>
<page number="22" height="1283" width="1005"/>
<page number="23" height="1280" width="957"/>
<page number="24" height="1285" width="1007"/>
<page number="25" height="1270" width="944"/>
<page number="26" height="1283" width="1005"/>
<page number="27" height="1276" width="953"/>
<page number="28" height="1272" width="991"/>
<page number="29" height="1287" width="967"/>
<page number="30" height="1273" width="992"/>
 <item page="4">Arithmeticae libri duo</item>
  <item page="5">P.Ramus lectori</item>
   <item page="9">Errata geometriae et corrigito</item>
 <item page="10">P. Rami arithmeticae liber I</item>
   <item page="10">Cap. I. De notis arithmeticis</item>
   <item page="10">Cap. II. De additione</item>
   <item page="12">Cap. III. De subdictione</item>

The file contains everything we need to build a webpage: the numbered scan list, the scan dimensions, and the corresponding TOC including the references to the scans.

A customized XSLT stylesheet can now generate the actual (X)HTML.
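
For readers who prefer a script to a stylesheet, the same idea can be sketched in Python instead of XSLT. This is only an illustration under assumptions: the sample XML is a shortened, assumed layout of the pdftohtml output, the function name `toc_to_html` is hypothetical, and the image file pattern `d001-{page:02d}.jpg` is a guess at how the page images are named. The sketch flattens the TOC into a single list and ignores the hierarchy.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened, assumed sample of the pdftohtml XML output (not the full file).
SAMPLE = """<pdf2xml producer="poppler" version="0.20.4">
<page number="4" height="1270" width="987"/>
<page number="5" height="1281" width="1002"/>
<outline>
<item page="4">Arithmeticae libri duo</item>
<outline>
<item page="5">P.Ramus lectori</item>
</outline>
</outline>
</pdf2xml>"""


def toc_to_html(xml_text, image_pattern="d001-{page:02d}.jpg"):
    """Build a flat HTML list linking each TOC entry to its page image."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of the elements does not matter.
    pages = {int(p.get("number")) for p in root.iter("page")}
    rows = []
    for item in root.iter("item"):
        page = int(item.get("page"))
        if page not in pages:
            continue  # skip entries that point to a page without a scan
        href = image_pattern.format(page=page)
        rows.append('<li><a href="%s">%s</a></li>' % (escape(href), escape(item.text or "")))
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


print(toc_to_html(SAMPLE))
```

A real stylesheet or script would additionally keep the nesting, so that chapters appear inside their book.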

The Images

The images, in turn, are generated by the following command:

pdftoppm -scale-to 1300 -jpeg document.pdf d001

The output is a set of JPEGs, one per page, with a maximum height/width of 1300px.

Alternatively, one could output PNGs in a first step and convert them to WebP in a second step, for even faster page loading and lower server load. As of now, unfortunately, WebP is only supported in Chrome.
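
The two-step variant could be scripted roughly as follows. This is a sketch under assumptions, not a tested pipeline: it assumes that the `cwebp` encoder from Google's WebP tools is installed (it is not mentioned above), that pdftoppm names its output files `<prefix>-<page>.png`, and the quality value of 80 is an arbitrary choice. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def png_command(pdf, prefix, size=1300):
    """Command line for step one: render every page as a PNG."""
    return ["pdftoppm", "-scale-to", str(size), "-png", pdf, prefix]


def webp_command(png_path, quality=80):
    """Command line for step two: convert one PNG to WebP (assumes cwebp is installed)."""
    png_path = Path(png_path)
    return ["cwebp", "-q", str(quality), str(png_path), "-o", str(png_path.with_suffix(".webp"))]


def convert(pdf, prefix):
    """Run both steps in the current directory; raises if a tool fails or is missing."""
    subprocess.run(png_command(pdf, prefix), check=True)
    for png in sorted(Path(".").glob(prefix + "-*.png")):
        subprocess.run(webp_command(png), check=True)


# Example call (not executed here): convert("document.pdf", "d001")
```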

Posted in Annotation | Comments Off on Poor Man’s TOC Generation

MAB Namespace

Because the DNB blurs the difference between namespace and validation, and has moreover moved its servers without redirecting at least the most important validation files to the new addresses, here is a modified (but technically valid) header of a MAB XML file with a validation file that actually exists at the stated location:

<?xml version="1.0" encoding="UTF-8"?>
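<datei xmlns="http://www.ddb.de/professionell/mabxml/mabxml-1.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.ddb.de/professionell/mabxml/mabxml-1.xsd
       http://files.dnb.de/standards/formate/mabxml-1.xsd">
<datensatz typ="h" status="n" mabVersion="M2.0">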
<feld ind=" " nr="001">BV037887150</feld>
<feld ind=" " nr="100">Gruner, Ludwig</feld>
<feld ind=" " nr="331">&lt;&lt;The&gt;&gt;
   Caryatides from the "Stanza dell'Eliodoro" in the Vatican,
   designed by Raffaelle d'Urbino</feld>
<feld ind=" " nr="410">London</feld>
<feld ind=" " nr="412">Gruner</feld>
<feld ind=" " nr="425">1852</feld>
<feld ind=" " nr="433">19 Bl.</feld>
<feld ind=" " nr="434">Ill.</feld>

If needed, for instance if the DNB once again moves the validation file (http://files.dnb.de/standards/formate/mabxml-1.xsd) out of reach, the copy stored here can be used instead: http://dlc-tei.net/mabxml/mabxml-1.xsd

More details on the MAB-XML schema here: link.

But beware: the validation file given in the examples and schema publications there is invalid! Nothing exists any longer at the address given there for the physical validation file, xsi:schemaLocation="http://www.ddb.de/professionell/mabxml/mabxml-1.xsd http://www.ddb.de/professionell/mabxml/mabxml-1.xsd". I have therefore placed a copy on http://dlc-tei.net and link to it.

On the botched namespace, which (unintentionally?) looks like a file name, and on top of that like the name of the validation file that no longer exists, see also the discussion here.
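
To make the distinction concrete: the namespace is only an identifier, while the second half of each xsi:schemaLocation pair is the address a validator actually fetches. The following Python sketch separates the two. It assumes a shortened header that is closed right away so that it parses, and the helper name `schema_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened header, closed immediately so that the parser accepts it.
HEADER = """<datei xmlns="http://www.ddb.de/professionell/mabxml/mabxml-1.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.ddb.de/professionell/mabxml/mabxml-1.xsd
                           http://files.dnb.de/standards/formate/mabxml-1.xsd">
</datei>"""


def schema_pairs(xml_text):
    """Return the root namespace and the (namespace, location) pairs of xsi:schemaLocation."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified names as "{namespace}localname".
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    tokens = (root.get("{%s}schemaLocation" % XSI) or "").split()
    return namespace, list(zip(tokens[0::2], tokens[1::2]))


namespace, pairs = schema_pairs(HEADER)
print("namespace (identifier only):", namespace)
for ns, location in pairs:
    print("schema file to fetch:", location)
```

Only the location has to point to an existing file; the namespace string must stay unchanged, because it is what ties the elements to the schema.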


Annotations and Notes (and Comments, too)

In the following text we want to present some of the different concepts for annotations, notes and comments which are present on the web.

We will briefly analyze the different setups, scopes and possibilities each solution offers.

We will also add some use cases, using the example of a fictional professor Brunelleschi who uses these tools at work in his institute and for himself.

But first we need to set some things straight, starting with the terminology. In fact, there are three fundamentally different ways available today to relate a piece of information to a given resource:

Annotating, which is adding a note to a line of text
Commenting, which is adding a note to a website (mostly a post)
Note taking, which is taking a note about a website

This meta-information can have different scopes (or endpoints):

Site-Specific, when notes, annotations and comments can relate to one specific web site only
Any Web Page, if Provided by Web Site Owner, when notes, annotations and comments can relate to all websites whose owner has installed a specific script
Any Web Page, when notes, annotations and comments can relate to any website

And finally, there are different places where this meta-information is stored:

Server provided by Web Site, when notes, annotations and comments reside on a server provided by the owner of the web site
Server provided by Third-Party, when notes, annotations and comments reside on a third-party server, probably chosen by the user (free/open source or commercial offering)

We will now describe the most common cases in more detail.

1. Site-Specific Commenting

Our professor Brunelleschi is browsing the web in search of a place to spend the holidays – it’s August, after all – when he comes upon a museum’s website where one of the artefacts is falsely presented as the work of one of his colleagues. And even as the work of one of his arch enemies!

Brunelleschi is really angry, but a bit undecided about what he should do: he could contact the museum staff, but right now, in mid August, there is zero probability that someone – let alone a knowledgeable person – will answer his email. And he doesn’t want to end up dealing with disgruntled staff telling him he should try again later.

But then he sees that it is possible to leave comments on the web site: he decides to write a comment about the falsely attributed authorship. But wait – should he use his real name? Probably not; better to avoid giving his enemies a pretext for denouncing him as petty and arrogant.

At the end Brunelleschi writes his comment using the pseudonym »Giovannino«.


Overview: Commenting is for leaving short messages regarding a blog post or web site. It might even be anonymous, even if you are forced to register with an email. The comment is not necessarily traceable to yourself, nor will it be part of your work material.

Target: The comment function works only on specially prepared web sites, like blogs.

Functions: Sharing the comment is only indirectly possible, by sending others a link to the whole website. Sometimes a function like “Inform me if someone replies to my post” is available. Replying to a comment is definitely possible. Tagging is mostly available.

Requirements: On the user side, some kind of login is mostly required to keep out automated spam. Web site owners, for their part, need to set up their site to allow commenting (and sometimes need to approve comments).

Storage: Comments are stored at the specific web site. There is no export functionality.

Reuse: Reuse is not foreseen.

Available services: “Comment” functions on web sites and most blogs, like WordPress, TypePad, Blogsmith, MovableType &c. Normally the comment relates to the whole post only, but there is also an exception: the Digress.it plugin for WordPress, which lets you comment on single paragraphs.

End User Perspective: Very convenient to leave a comment or start a discussion. On longer threads, the missing export function becomes a nuisance: I’ve seen more serious threads copied into personal blogs by their authors just to preserve them.

2. General Web Note Taking

But the same museum’s website also has some interesting stuff Brunelleschi has never seen before: artefacts which show stylistic and technical properties which can only be ascribed to foreign artisans, most probably coming from the Orient.

Obviously, the museum staff does not recognize the importance of these artefacts and even describes them as early modern provincial works.

No need to tell them, though! Brunelleschi opens his electronic notebook – he paid for it by credit card, but that was worth it – and starts to take notes.

Most of the time he »clips« the relevant webpage – that is, he grabs only the text and the relevant images, without the usual garbage in the header and footer of the page – into his electronic notebook and then adds, on a new line, his personal remarks.

This is quite fast and very handy, and much better than mere bookmarking because when he looks up his notes – or runs a search – he quickly finds all the texts he has clipped inside his notebook.

He is well aware of the importance of his electronic notebook, and therefore he’s happy he’s found a commercial provider with a good reputation – not one who might be gone tomorrow – and with the possibility to export and archive all his notes.

Brunelleschi’s notebook is even protected with seemingly good encryption, to make sure none of his colleagues can snoop in. There is only one exception: he has given a certain young researcher who deals with oriental artists access to all the notes relating to oriental stuff. It’s enough to »tag« (another one of these catch words!) his notes with a relevant tag like »Oriental« and both of them have access to the same notes. By sharing their notes, both profit from the other’s expertise.


Overview: Note taking allows for rapidly collecting ideas from different web sites and storing them in an electronic notebook. Depending on the software, you can also copy text fragments from the web page into your notes or »clip« the whole story.

Target: All web sites

Functions: Some note taking apps allow sharing notes (via URL), and sometimes even following, where you automatically see the other’s notes. Tagging is mostly available, but I haven’t yet heard of a reply function.

Requirements: On the user side, a subscription to one of the note taking services is necessary. The service can then be used for all websites. Web site owners instead need not do anything.

Storage: Notes are stored at the service provider, who might give you the possibility to export them into compatible formats (mostly XML).

Reuse: Reuse of the collected material is one of the main selling points of these systems, so tools are provided on-site or even as separate applications for your desktop.

Available services: Electronic notebooks such as Evernote, Simple Note, Springpad etc.

End user perspective: Very common among junior researchers for its ease of use and ubiquitous availability, including mobile devices. Highly regarded for collecting and ordering (using »tags«) research material.

3. Site-Specific Annotations

Brunelleschi’s real daytime work, though, is the administration of a study group for his research institute, which has grown to 5 full-time and 2 part-time workers. It takes a lot of time and energy to guide his people … and to keep them from chasing phantoms!

Fortunately, Brunelleschi has taken care to install a commenting system which allows him to look at all the writings his workers are preparing and to leave a note: most of the time it’s a simple piece of advice or a positive note; sometimes he takes care to have a longer discussion.

It’s also kind of an incubator, because all discussions – his notes and his workers’ replies – are behind a firewall and cannot be seen from outside. But for the study group, these notes and replies, which always bear the author’s name and a time stamp, are permanently visible and, in fact, end up being part of the project.

At the end of the project, all texts – and all the notes – will be permanently archived in the institute’s long term archival system, where future researchers will find them.


Overview: A site-wide annotation system allows annotating large texts, or groups of texts, within a controlled environment. Its users are mostly employees who automatically have access to the system. Such a system is mostly used for revising documents (e.g. manuals).

Target: Unfortunately, the system works only on specially prepared websites.

Functions: Sharing annotations is automatically part of the process, as a whole group of people will be assigned to the task. But you will be able to share annotations only within the work group. Replying, and maybe even tagging, will certainly be possible, but a “following” function is rare.

Requirements: On the user side, a subscription to the service is necessary – if not automatically provided company-wide. On the provider side, these systems are mostly deployed by larger corporations, research institutions, and organizations like Europeana which can justify the complex and costly installations.

Storage: Comments are stored at the company or institute only, which might – or might not – give you the possibility to export them.

Reuse: Most systems are closed in the sense that they don’t allow for (or foresee) output and reuse outside the system itself.

Available services: The only working(!) system I know of is Highlighter (link) which is a commercial service mostly used in educational context.

End user perspective: The end user will mostly see the system as an employee of a company or institute – adding notes to the document pool will be part of his normal work, but the result will mostly not be his own “intellectual property”.

4. General Web Annotations

For his personal studies on Dante’s »Divina Commedia« and its influence on early Renaissance artists, our professor Brunelleschi uses yet another instrument, which allows him to comment on texts line by line and can be used on all web pages, not only on texts which reside on his institute’s server.

In this case it is necessary to annotate texts residing on foreign servers, mostly in the US but also in Italy and some in Germany, and to add annotations without these servers even knowing about it.

At the same time, Brunelleschi can virtually work together with his colleague in L’Aquila, Italy, because they both see the same texts and there is a mechanism to reply to the other’s annotations. On a normal day, it takes his colleague in Italy only a couple of minutes to respond to one of his annotations!

Brunelleschi can display the texts, the layered annotations and even a special page where all text fragments and annotations are collected, which serves him well when he has to write an article.


Overview: Annotating texts line by line is a very demanding task for any application, especially if it’s supposed to work on any website. Although extremely useful for intensive scientific research, until spring 2012 there was no working(!) application available for the common user.

Target: All web sites

Functions: Sharing annotations, following other users, tagging and replying are available.

Requirements: On the user side, a subscription to (the only) one of the commercial annotating services is necessary. The service can then be used for all websites. Web site owners instead need not do anything.

Storage: Annotations are stored at the commercial service provider, who gives you the possibility to export them into compatible formats (mostly XML).

Reuse: Reuse is at the forefront of all services, with possibilities to re-arrange material, and re-tag it, up to export functionality as MSWord.

Available services: As of now, I only know of Annotary link. A kind of similar service is Scrible which copies (clips) a website and then lets you annotate it (the clipped website, not the original one!).

End user perspective: Great for scientific work. Very young service provider, so no long time experience available. Export functions – as of now – are only print/pdf, but more (MSWord) is promised.

5. General Web / Public Commenting

At the end of the day, and before leaving for home, Brunelleschi finds some time to dedicate to his hobby: in fact, few know that he is a dedicated fan of a special kind of small hand-built boats named »gozzi« which are typical of the Amalfi and Sorrento peninsulas. In the last few days there was some talk about a new boat by the famous builder Aprea, and Brunelleschi immerses himself in the discussion.

Finally he finds the first images of the new boat – an open 7.50m boat – and puts a comment on the web page. At the same time, he shares the comment with his friends by using the »Circles« function, which allows him to fine-tune who exactly will see his post: only his close friends, only the Aprea aficionados or the general public. His choices depend on the subject matter, but for today he shares his comment with everyone. Soon he sees the first responses, and some further links to other images.

The comment and response system is quite rudimentary, nothing scientific here, it’s more like a marketplace where everyone shares his comments, replies to the others’ comments and then moves on. The exact membership of each circle is hard to know, because people come and go, or might only be silent onlookers.


Overview: Public commenting systems are mostly used for rapidly sharing news with a larger, sometimes undefined audience. The content of the comment is mostly not much more than a simple “Hey, look here!” but its ease of use makes up for its shortcomings.

Target: Specially prepared websites

Functions: Sharing, following, tagging and replying are all implemented.

Requirements: On the user side, a subscription to one of the social services is necessary. The service can then be used for all participating websites. Web site owners instead mostly need to install only a small script on their site.

Storage: Comments are stored at the social service provider, who only rarely gives you the possibility to export them (as of now, I only know of Google’s Takeout service).

Reuse: Only few services allow export of social comments, and thus a possible reuse – although such a reuse would be quite meaningless without the accompanying platform (and the other users’ comments).

Available services: Social comment systems like Google+, FB Comments &c.

End user perspective: Useful for quickly and easily spreading news.
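
The five cases above can be placed along the three dimensions introduced at the beginning (kind of note, scope, storage). The following Python sketch is only one possible reading of the descriptions; the class and field names are hypothetical, and the assignment of each case to a scope and a storage place is an interpretation of the "Target" and "Storage" entries, not a formal classification.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    ANNOTATION = "annotating"
    COMMENT = "commenting"
    NOTE = "note taking"


class Scope(Enum):
    SITE_SPECIFIC = "site-specific"
    OWNER_ENABLED = "any web page, if provided by web site owner"
    ANY_PAGE = "any web page"


class Storage(Enum):
    WEB_SITE = "server provided by web site"
    THIRD_PARTY = "server provided by third party"


@dataclass(frozen=True)
class Case:
    number: int
    kind: Kind
    scope: Scope
    storage: Storage


# One reading of the five cases described in this post.
CASES = [
    Case(1, Kind.COMMENT, Scope.SITE_SPECIFIC, Storage.WEB_SITE),
    Case(2, Kind.NOTE, Scope.ANY_PAGE, Storage.THIRD_PARTY),
    Case(3, Kind.ANNOTATION, Scope.SITE_SPECIFIC, Storage.WEB_SITE),
    Case(4, Kind.ANNOTATION, Scope.ANY_PAGE, Storage.THIRD_PARTY),
    Case(5, Kind.COMMENT, Scope.OWNER_ENABLED, Storage.THIRD_PARTY),
]


def usable_without_site_owner(cases):
    """Cases that work on any page without the web site owner doing anything."""
    return [c.number for c in cases if c.scope is Scope.ANY_PAGE]


print(usable_without_site_owner(CASES))
```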


Add: RE-USE (e.g. Word.Doc export for own work, etc.)

Add: screenshots!

Add: Annotator & Yuma
