Sunday, August 17, 2014

Back-of-the-book Indexes and CSS

An index in the context of printed books is a list of topics with associated page numbers, usually printed at or near the end of the book.

The plural is indexes, unlike the mathematical indices, which are different beasts that happen to share the same singular form, index.

To make an index for a digital document, you:
  1. Mark the targets, the topics, in the main book itself;
  2. Extract the list of topics, or index headings as they are called, and sort them;
  3. For each topic, collect the list of appropriate page numbers;
  4. Collapse runs of adjacent page numbers into ranges.
All of this can easily be done declaratively. There are some minor complications to resolve, but they are not too hard:
  1. Sorting multilingual headings, symbols, and other items;
  2. Multi-level indexes;
  3. Mixtures of references of differing types;
  4. References to footnotes;
  5. Balanced column formatting.
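
Leaving the complications aside for a moment, steps 2 and 3 of the basic recipe amount to grouping and sorting. Here is a minimal Python sketch, assuming the marked targets have already been extracted as (topic, page) pairs; the function name and the sample data are made up for illustration, and the sort is a plain case-insensitive one rather than a real collation.

```python
from collections import defaultdict

def build_index(markers):
    """Group (topic, page) markers into {topic: sorted unique pages}."""
    pages_by_topic = defaultdict(set)
    for topic, page in markers:
        pages_by_topic[topic].add(page)  # a set drops repeated pages
    # Sort the headings case-insensitively; dicts keep insertion order.
    ordered = sorted(pages_by_topic.items(), key=lambda item: item[0].casefold())
    return {topic: sorted(pages) for topic, pages in ordered}

# Hypothetical markers, in document order.
markers = [("socks", 24), ("argyle", 24), ("socks", 25), ("Boots", 3), ("socks", 24)]
index = build_index(markers)
print(index)
```

Collapsing runs of pages into ranges (step 4) depends on the final page numbers, which is why it is the part that has to happen in the formatter.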
 

Sorting

Books might well have multiple scripts in use, such as Latin for English and Spanish and Devanagari for Hindi. It may be necessary to use different collations for different parts of the index, for example to get an index of Spanish names to sort names starting with Ch properly without affecting the index of French names on the next page. It may be appropriate to provide a sort key for an individual index heading to get it to go where you (as author) want it.

Even a moderately sized book can have hundreds or thousands of index entries, so sorting them should obviously be done automatically. However, that can be done outside of CSS, for example with JavaScript or XSLT, before formatting. A minimal proposal for supporting indexes in CSS therefore does not need to address sorting the index.
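
As an illustration of sorting before formatting, here is a small Python sketch with a default collation key, a deliberately simplified "traditional Spanish" key that treats Ch as a letter after C, and a per-heading sort key override. All the names and sample headings are assumptions made for this example; a production pipeline would more likely use a proper collation library.

```python
import unicodedata

def base_key(text):
    """Accent- and case-insensitive key, so 'Chávez' sorts as 'chavez'."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()

def spanish_traditional_key(text):
    """Simplified: make 'ch' sort after every other 'c' + letter pair."""
    return base_key(text).replace("ch", "c\uffff")

def sort_headings(entries, collation_key=base_key):
    """Sort entries; an explicit 'sort_key' overrides the visible heading."""
    return sorted(entries, key=lambda e: collation_key(e.get("sort_key", e["heading"])))

names = [{"heading": n} for n in ["Cuervo", "Chávez", "Cervantes", "Díaz"]]
default_order = [e["heading"] for e in sort_headings(names)]
spanish_order = [e["heading"] for e in sort_headings(names, spanish_traditional_key)]

# The author wants "3D knitting" filed under T rather than among the digits.
mixed = [{"heading": "socks"}, {"heading": "3D knitting", "sort_key": "three-d knitting"}]
override_order = [e["heading"] for e in sort_headings(mixed)]

print(default_order, spanish_order, override_order)
```

Passing a different key function for each part of the index is one way to keep the Spanish rules from leaking into the French names on the next page.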

Multi-level indexes

It’s not uncommon to see sub-topics collated together in an index:
    Socks, argyle, 24-56, 72; black, 28, 73; white, 24

This is done partly to save space in the index, partly to make using the index more efficient (everything to do with socks is in one place, instead of under A for argyle, B for black and so forth), and partly to avoid ugly and potentially confusing repetition of the main head (socks) for each sub-entry.

The consequence for software is just that you need a multi-level sort.

As with sorting the index itself, collation and sorting of sub-entries can be done before formatting the document and doesn’t need to be specified with CSS. We only need to bother ourselves about the page numbers, which must of course be supplied by the formatter.
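
A multi-level sort can be sketched in Python by sorting on a tuple of keys and then grouping on the main head. The (head, subhead) pair representation and the function name are assumptions made for this example.

```python
from itertools import groupby

def group_subentries(entries):
    """Sort (head, subhead) pairs on both levels, then group them by head."""
    ordered = sorted(entries, key=lambda e: (e[0].casefold(), e[1].casefold()))
    return [
        (head, [sub for _, sub in group])
        for head, group in groupby(ordered, key=lambda e: e[0])
    ]

entries = [("Socks", "black"), ("Boots", "leather"), ("Socks", "white"), ("Socks", "argyle")]
grouped = group_subentries(entries)
print(grouped)
```

Each main head then appears once, followed by its sub-entries in order, which is what avoids repeating "Socks" for every sub-entry.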

Mixtures of references of differing types

It’s common for an index to indicate not just the page number but also what the reader might find there: a discussion, a definition of a term, a figure, a table, a mention in passing. Note that the actual index heading might not appear in the text; a discussion of argyle socks might not mention the word “sock” but only talk about the pattern, or perhaps use the word “stocking;” this is part of why a well-crafted index can be more valuable than automated searching for some applications.

The usual ways to distinguish the different types of reference are to use bold page numbers for definitions, italic for figures, and perhaps to prefix the page number with a symbol such as an asterisk (*) or dagger (†) for a table.

This means that a figure that occurs in the middle of a discussion is listed separately from the range that contains it (here the 26, which would be set in italic):
    Socks, argyle, 24-28, 26, 72

References to footnotes and other page areas

A reference to a footnote usually has an annotation such as 46 note 1 so that the reader has warning both of where to look on the page and that the reference might not be part of the main discussion. If there are references to multiple footnotes on the same page they could be listed:
    Socks, argyle, 24-28, 26, 72 note 1, note 3.
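
One way to produce that kind of list is to drop the page number when a footnote reference follows another footnote reference on the same page. This is only a sketch in Python; the (page, note) tuples and the function name are invented for the example, and ranges are left out to keep it short.

```python
def format_page_refs(refs):
    """refs: (page, note) pairs in page order; note is None for main text."""
    parts = []
    previous_note_page = None
    for page, note in refs:
        if note is None:
            parts.append(str(page))
            previous_note_page = None
        elif page == previous_note_page:
            # Another footnote on the same page: do not repeat the page number.
            parts.append(f"note {note}")
        else:
            parts.append(f"{page} note {note}")
            previous_note_page = page
    return ", ".join(parts)

formatted = format_page_refs([(26, None), (72, 1), (72, 3)])
print(formatted)
```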

Other areas that might have specific annotations in an index include sidebars, examples, tables, side-notes (marginalia) and other secondary content. Books with large, dense pages such as the Encyclopædia Britannica might divide the page into areas and have a reference like 46C to mean the bottom left quadrant of page 46 but these are specialized books, often produced with highly customized software.


Balanced Column Formatting

I’ll mention this for completeness, because the index-specific processing is all about the references. Indexes are usually printed in relatively narrow columns, three or more to a page. On the last page of the index the columns will probably all be balanced so they are the same height (give or take one or two lines in the last column).
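
For a rough feel of what balancing means, here is a naive Python sketch that fills columns in order to a common height, so that any shortfall ends up in the trailing column. The function name is made up, and real formatters also have to respect rules such as keeping sub-entries with their heading, which this ignores.

```python
import math

def balance_columns(lines, column_count):
    """Fill columns in order to a common height; trailing columns may be short."""
    if not lines:
        return [[] for _ in range(column_count)]
    height = math.ceil(len(lines) / column_count)
    return [lines[i * height:(i + 1) * height] for i in range(column_count)]

# Ten leftover index lines set in three columns.
columns = balance_columns([f"entry {n}" for n in range(1, 11)], 3)
heights = [len(column) for column in columns]
print(heights)
```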

It can be important with the back matter in a book to remember that the book must usually be a multiple of 16, 32 or 64 pages to be printed economically by folding large sheets of paper. So if you end up with two lines of index on the last page you’ll want to set a shorter page if that avoids having 63 blank pages after it!
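
The arithmetic is simple enough to check with a couple of lines of Python. The page counts here are made-up figures chosen to reproduce the 63 blank pages mentioned above.

```python
def blank_pages(page_count, signature_size):
    """Blank pages needed to pad page_count up to a multiple of signature_size."""
    return -page_count % signature_size

# An index that spills two lines onto page 449 of a book printed in 64-page sheets:
wasted = blank_pages(449, 64)
# Setting the previous page slightly longer brings the book back to 448 pages:
tidy = blank_pages(448, 64)
print(wasted, tidy)
```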


Formatting with CSS


CSS doesn’t yet support index generation, but XSL-FO extended CSS 2.1 to add properties for indexes, and this has been implemented and is widely used in production today. So maybe we can do something similar.

You can also refer to the XSL-FO 2.0 draft for a much more detailed explanation than I am going to give; we don’t need as much because CSS already has ways to do most of what we need. I think there may also be a mistake in the summary of generated page numbers as page 14 doesn’t appear anywhere; if so, sorry.

Let’s suppose for now that we have an HTML (or XHTML) document containing a sorted index ready to format. It might have markup like this, to generate
    Socks, argyle, 24-28, 26, 72

<li class="index-entry">
  <p>
    <span class="head">Socks</span>
    <span class="subhead">argyle</span>
    <a href="#ie41" class="ie"></a>
    <a href="#ie42" class="ie"></a>
    <a href="#ie46" class="ie"></a>
    <a href="#ie50" class="ie"></a>
    <a href="#fig17" class="ie figure-ref"></a>
    <a href="#ie51" class="ie"></a>
    <a href="#ie52" class="ie"></a>
    <a href="#ie53" class="ie"></a>
    <a href="#ie141" class="ie"></a>
    <a href="#ie142" class="ie"></a>
  </p>
</li>

We don’t know the page numbers in advance so we can’t put them there. They’ll be supplied by the content property.

Suppose when the document is formatted the ten targets referred to end up being on pages 24, 24, 25, 26, 26 (the figure), 26, 27, 28, 72 and 72.

The formatter will need to process this list:
  1. separate out the figure reference
  2. collapse duplicate numbers in the list
  3. collapse ranges into a start and end with a separator
  4. merge the separate streams (here, references to text and to figures) by sorting on the first number in each range.
A complication is that the page number used for sorting will presumably be the value of the built-in page counter, but the value printed will be as it might appear in the page’s running header or footer, so that an introduction page might get roman numerals (Socks, white: xxiii) and an appendix might get a preceding letter, known as a folio prefix, such as B for Appendix B (Socks, internet proxy: B-42). This is the same as for general cross-references within a book, of course.
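
The split between the value used for sorting and the value that gets printed can be sketched in Python as follows. The field names ("counter" for the running page counter, "number" for the number shown in that part of the book, "style" and "prefix") are hypothetical, as are the sample values.

```python
def to_roman(number):
    """Lower-case roman numeral for a positive integer below 4000."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)

def format_folio(ref):
    """Render the number as it would appear in the running header or footer."""
    text = to_roman(ref["number"]) if ref["style"] == "roman" else str(ref["number"])
    return f"{ref['prefix']}-{text}" if ref.get("prefix") else text

refs = [
    {"counter": 412, "number": 42, "style": "decimal", "prefix": "B"},
    {"counter": 23, "number": 23, "style": "roman"},
    {"counter": 54, "number": 24, "style": "decimal"},
]
# Sort on the running counter, print the formatted folio.
printed = [format_folio(r) for r in sorted(refs, key=lambda r: r["counter"])]
print(printed)
```

Sorting on the printed strings instead would put "24" before "B-42" before "xxiii", which is not the order in which the pages appear in the book.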

When eliminating duplicates, the value of the index-list property (proposed below) determines whether a sequence of three or more consecutive page numbers is merged to form a range or kept as separate numbers.

A second complication is that if there were ranges in the example instead of just individual references, the formatter would typically expand the ranges into a list of individual pages first (in practice this will be a short list) between steps one and two in the process, because that simplifies the process of eliminating duplicates.

In addition, items with different index-class values are normally kept separate, but index-list can be set to all so that the figure reference in the example would vanish, as it’s contained within the 24-28 page range.
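
The four processing steps, together with the choice of whether to merge across classes, can be sketched in Python as below. This is an illustration and not a specification: the function names are invented, the mode argument borrows the no-merge, same-index-class and all values proposed for index-list later in this post, the input uses page numbers like those in the example, and explicitly marked ranges are left out for brevity.

```python
def collapse(pages, min_run=3):
    """Turn sorted unique pages into (start, end) pairs.
    Runs shorter than min_run stay as individual pages."""
    ranges = []
    i = 0
    while i < len(pages):
        j = i
        while j + 1 < len(pages) and pages[j + 1] == pages[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            ranges.append((pages[i], pages[j]))
        else:
            ranges.extend((p, p) for p in pages[i:j + 1])
        i = j + 1
    return ranges

def process(refs, mode="same-index-class"):
    """refs: (page, index_class) pairs. Returns (start, end, index_class) items."""
    streams = {}
    for page, index_class in refs:
        # Step 1: one stream per class, or a single stream when classes are ignored.
        key = "" if mode == "all" else index_class
        # Step 2: a set drops duplicate page numbers.
        streams.setdefault(key, set()).add(page)
    min_run = float("inf") if mode == "no-merge" else 3
    merged = []
    for index_class, pages in streams.items():
        # Step 3: collapse runs of consecutive pages into ranges.
        for start, end in collapse(sorted(pages), min_run):
            merged.append((start, end, index_class))
    # Step 4: merge the streams by sorting on the first page of each item.
    return sorted(merged, key=lambda item: item[0])

def render(items):
    return ", ".join(str(s) if s == e else f"{s}-{e}" for s, e, _ in items)

refs = [(24, "text"), (24, "text"), (25, "text"), (26, "text"), (26, "figure"),
        (26, "text"), (27, "text"), (28, "text"), (72, "text"), (72, "text")]
by_class = render(process(refs))
ignoring_classes = render(process(refs, mode="all"))
unmerged = render(process(refs, mode="no-merge"))
print(by_class, "|", ignoring_classes, "|", unmerged)
```

With the default mode the ten references come out as three items, 24-28, 26, 72; when classes are ignored the figure reference is absorbed into the range.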

After handling page ranges the index entries are formatted normally, with the content property and :before and :after being available to support the various page number formats, and to allow marking index references to colour plates (say) in [square brackets], just as for a normal cross-reference. However, the number of “a” elements/nodes may in general be fewer than in the original document before index processing. It can never be more: references within a range that were expanded to simplify processing in an implementation do not result in new elements being created.

Issue: should the transformation from original list to sorted, merged list use the shadow DOM?

Issue: how best to supply the comma, en dash or “ and ” between entries? XSL-FO just inserts a text node with no standard way to change the en dash or comma and space.


Properties (proposed)


index-class: a user-defined “ASCII-lower-case” string; entries with different index-class values do not by default get merged together, so that you can keep figure references even when they fall within a range of text pages, or keep a definition even if it, too, falls within, say, a range of figures.

index-item (none, begin, end, span, point): set this on the “a” element. It does not inherit. “span” means that all of the pages generated by the target element are added to the index; otherwise the page on which the element starts is used. The begin and end values are used in pairs to mark index ranges and point is used for a stand-alone page reference. None turns off index processing on the given element.

index-list (none, no-merge, same-index-class, all): use this on the containing element to indicate that its children are to be processed as index citations; it would be set on the “p” element in the sample markup above. A value of none means no index processing is done. A value of same-index-class means page ranges are merged only when they have the same associated index-class value. A value of all means index-class values are ignored for the purpose of merging. A value of no-merge enables index processing but not merging of adjacent pages into a range. In all cases page numbers that are explicitly marked as being a range using index-item begin and end are kept as a range.

Issue: is this enough? I’ve proposed merging XSL-FO merge-pages-across-index-key-references and merge-ranges-across-index-key-references into the values of index-list but I have lost the ability to merge between index classes and not within a class; I think that combination never made sense. The page-number-prefix and suffix can be done with :before and :after.

Other items

Indexes might contain cross-references such as “Stockings: see under socks” which do not need special formatting. One might make the text be a link, of course.

Column balancing is outside the scope of index processing.

Formatters are not expected to support index entries that point back into the index itself.

If you reset the global page counter, the index facility proposed here will give you strange results.

Comments?



5 comments:

Unknown said...

See docbook for a good example of index entry markup Liam? Variants as you say on the index entry, level (primary, secondary, tertiary) which all make for easier sorting.

Liam Quin said...

@unknown (Dave?) - yes, DocBook has support for building an index, which can end up being formatted using XSL-FO, DSSSL (older) or in theory CSS...

Norman Walsh's Test Weblog said...

I wasn't aware that "indexes" vs "indices" was contextual, I thought it was just the author's prerogative.

O'Reilly preferred Indexes and Appendixes to Indices and Appendices so that's what I got used to.

Unknown said...

I was noting you make no mention of index entry syntax, which is the basis (IMHO) for collating indices as you say above? Are you assuming that?

Re specifying properties, how might CSS do as docbook does, collect a whole suite of 'options' which cover indexing? Docbook uses a kind of config file, not available in CSS?

Liam Quin said...

@Unknown - I haven't looked at the DocBook options file; however, I'm assuming right now that the index is included in the HTML, e.g. generated with XSLT as per DocBook or with JavaScript in node.js or in a browser.

The properties I propose here are what's needed from the formatter.

I'll look at the DocBook stuff in more detail, but is there anything specific I've missed?

Balanced columns, special hyphenation rules for breaking after a dash and not before, some numbers in italic and some in bold, this sort of thing can already be done by CSS.

Thanks for commenting!