Thursday, December 3, 2015

Declarative Index Proposal for Printing with CSS

This proposal is about formatting an index with HTML and CSS. It’s something you can’t do today in a standard, cross-implementation way.

The part you can’t do today is getting the lists of page numbers right. The same problem occurs in an index and in cross references. There are vendor extensions to do it, but they are not compatible, so it’s a good candidate for standardization.

A back of the book index looks like this:




We’ll just work with a representative sample and focus on the ranges of page numbers in the index entries. We’ll address them in a way that still allows the entries to be formatted, of course.

In a typical index there are three or four kinds of reference (the numbers):
  1. To definitions or primary entries, shown in the example as bold page numbers;
  2. To discussions, marked in the text as ranges (once at the start and once at the end), for example under Fortunes (astrological) in the picture;
  3. To illustrations, with page numbers in italics (not shown in this picture);
  4. To mentions, shown in a regular type: this represents most of the index.
In all cases, consecutive page numbers in the index for the same kind of index entry must not repeat: “see pages 17, 17 and 17” just looks stupid and makes people wonder if the numbers are right; then they start wondering if the text is right.

Here’s some sample HTML markup. I’ve used the HTML class attribute to distinguish the different kinds of index reference, because we know that information when we make the index:

<dl class="index">
  <dt>Forehead, skin of</dt>
  <dd>
    <a href="#book1-chap-xv-para5">.</a>
    <a href="#book1-chap-xlii-para5">.</a>
  </dd>
  <dt>Fortunes (astrological)</dt>
  <dd>
    <a href="#book2-chap-vi-para-16">.</a>
    <a href="#book2-chap-l-para-2" class="range-start">.</a>
    <a href="#book2-chap-l-para-3">.</a>
    <a href="#book2-chap-l-para-7" class="range-end">.</a>
    <a href="#book2-chap-l-note-5" class="to-note">.</a>
  </dd>
</dl>


The content of the “a” elements will be replaced by the formatter, because when we author the document we don’t know the page numbers. For use with formatters that don’t support page numbers, we could perhaps put section titles there instead.

Now, in our CSS we can propose this, translating the class HTML attribute into an index-class so that the renderer can treat sequences of different classes of reference differently:

  dl.index dd {
    index-collapse: ranges;
    index-range-separator: "-";
    index-number-separator: ", ";
  }

  dl.index dd a.range-start {
    index-class: range-start;
    index-collapse: within-class;
  }

  dl.index dd a.range-end {
    index-class: range-end;
  }

  dl.index dd a {
    index-entry: number;
  }

  dl.index dd a.to-note {
    index-class: note;
  }


If, in chapter fifty, the referenced paragraphs end up on the same page, we might end up with
  Fortunes (astrological) 250, 402 403 403, 404
This looks really stupid in a book, with the repeated 403; in more common cases there might be dozens of repeated numbers and the index will quickly become difficult to use.

So let’s solve that.

Since we said that ranges are to be collapsed, the range
    402 403 403
generated from the actual page numbers turns into
    402-403
using the index-range-separator between the values.

The 404 is coming from an entry with a different index-class CSS property value; I used index-collapse: within-class, to prevent values like this from being absorbed into the range. Typically this is done where page 404 has an illustration or, in the case of this book, a footnote, important because it’s a critical edition and the footnotes contain the translator’s comments on the text.
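As a rough illustration of the collapsing just described, here is a minimal Python sketch. It is not part of the proposal: the function name collapse_entries and the exact rules are assumptions made for the example (a range runs from a range-start entry to the next range-end entry, and outside a range only identical neighbours of the same class are merged).

```python
def collapse_entries(entries, range_sep="-", number_sep=", "):
    """Collapse (page, index_class) pairs into index text.

    Assumed semantics, not defined by the proposal: everything from a
    'range-start' entry up to the next 'range-end' entry becomes one range;
    entries outside a range are merged only with an identical neighbour
    (same page, same class).
    """
    parts = []          # finished pieces of output text
    range_first = None  # page of the currently open range-start, if any
    last_single = None  # (page, index_class) of the last standalone entry
    for page, index_class in entries:
        if index_class == "range-start":
            range_first = page
            last_single = None
        elif index_class == "range-end" and range_first is not None:
            if page == range_first:
                parts.append(str(page))  # whole range fell on one page
            else:
                parts.append(f"{range_first}{range_sep}{page}")
            range_first = None
            last_single = None
        elif range_first is not None:
            continue  # interior of a marked range: absorbed
        elif (page, index_class) != last_single:
            parts.append(str(page))
            last_single = (page, index_class)
    if range_first is not None:
        parts.append(str(range_first))  # unterminated range: keep its start
    return number_sep.join(parts)


fortunes = [
    (250, None),
    (402, "range-start"),
    (403, None),
    (403, "range-end"),
    (404, "note"),  # different class, so not absorbed into the range
]
print(collapse_entries(fortunes))  # prints: 250, 402-403, 404
```

The separators are ordinary parameters here, mirroring the role that index-range-separator and index-number-separator play in the CSS.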

The need to merge index entries into ranges is why the process has to involve a formatter, so that the actual page numbers can be used.

Note: Most XSL-FO formatters already support merging index entries in the declarative way I have described here. In this proposal I’m extending the properties slightly to let users choose characters other than the hyphen and comma, for better internationalization.

It would be possible to extend the approach further and allow merging of arbitrary ranges expressed in markup, but for now it’s sufficient to use page numbers.

Sometimes page numbers will be formatted differently, such as Roman numerals (i, ii, iii...) at the beginning of a book, or A-1, A-2, B-1, B-2 to number each appendix; this is handled using the existing content CSS property. The merging happens on the value of the page counter at the location referenced by the anchor.
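To make the split between merging and display concrete, here is a small Python sketch under stated assumptions. The helper names to_lower_roman and merge_runs are hypothetical, the input is assumed to be page counter values in document order, and the merging rule (equal or consecutive values join a run) is the simple “merge everything” case rather than anything the proposal specifies in detail.

```python
def to_lower_roman(n):
    """Format a positive integer as lower-case Roman numerals."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def merge_runs(counters, fmt=str, range_sep="-", number_sep=", "):
    """Merge equal or consecutive counter values, then format the endpoints.

    Merging looks only at the integer counter values; fmt is applied
    afterwards, so the display style never affects what gets merged.
    """
    runs = []  # list of [first, last] counter values
    for value in counters:
        if runs and value - runs[-1][1] in (0, 1):
            runs[-1][1] = value  # same page or next page: extend the run
        else:
            runs.append([value, value])
    return number_sep.join(
        fmt(a) if a == b else f"{fmt(a)}{range_sep}{fmt(b)}" for a, b in runs
    )


print(merge_runs([1, 2, 2, 3, 7], fmt=to_lower_roman))  # prints: i-iii, vii
```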

Extending the approach might involve supplying a different value to be used for merging, but I don’t know that such extensions would be needed and I am not proposing them.

Here are the proposed new properties; there are no changes to existing properties.

index-collapse

index-collapse: none | ranges | all

The index-collapse property, when set to a value of ranges or all, indicates that an element’s immediate child elements are to be processed with index merging.

The value all indicates that all sequences of successive numbers are to be merged into ranges.

The value ranges restricts merging to values marked as belonging to a range.

This property does not inherit.

index-range-separator


index-range-separator: none | initial | inherit | string

The index-range-separator property gives a string to be used to separate the start and end points in a range; the default is an en dash. Note that OpenType font processing may substitute a numeric dash where appropriate.

This property inherits (to make index sub-entries easier).


index-number-separator


index-number-separator: none | initial | inherit | string

The index-number-separator property gives a string to be used between numbers that are not merged into a range.

This property inherits. Initial value: ", ".


Issue: I think this property can be replaced using the existing content property on an a element, with a different value being supplied for the last entry.


index-class

index-class: none | initial | inherit | string

The purpose of this property is to be able to separate out references to tables, figures, definitions and discussions. If a selector-based approach could be used this would just be a general class attribute value.
Issue: An alternative worth exploring would be to use an ARIA role here.

This property inherits. Initial value: none
 

After index processing, then, our DOM tree might look like this:

<dl class="index">
  <dt>Forehead, skin of</dt>
  <dd>
    <a href="#book1-chap-xv-para5">71</a>
    <a href="#book1-chap-xlii-para5">204</a>
  </dd>
  <dt>Fortunes (astrological)</dt>
  <dd>
    <a href="#book2-chap-vi-para-16">250</a>
    <a href="#book2-chap-l-para-2" class="range-start">402</a>
    <a href="#book2-chap-l-para-3">403</a>
    <a href="#book2-chap-l-para-7" class="range-end">403</a>
    <a href="#book2-chap-l-note-5" class="to-note">404</a>
  </dd>
</dl>


We want this to appear as,
Forehead, skin of, 71, 204.
Fortunes (astrological), 250, 402-403, 404.

We want to do this with as little magic as possible.

One way would be if we could select first, middle and last items in a range. Then we could hide the middle entries and use ::after on the first to insert the dash. But we can't select elements based on CSS properties, and we don’t know in advance which elements will be part of a range, let alone which will be first or last. It could change based on reflowing the document (e.g. for a different paper size or window size).
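The dependence on layout can be seen in a short Python sketch. The function range_roles and both sets of page numbers are made up for illustration; the point is only that the same four references get different first, middle and last roles once the pages change.

```python
def range_roles(pages):
    """Label each reference as 'single', 'first', 'middle' or 'last'.

    Two neighbouring references are treated as part of the same run when
    they fall on the same page or on consecutive pages.
    """
    roles = []
    for i, page in enumerate(pages):
        joins_prev = i > 0 and page - pages[i - 1] in (0, 1)
        joins_next = i + 1 < len(pages) and pages[i + 1] - page in (0, 1)
        if joins_prev and joins_next:
            roles.append("middle")
        elif joins_prev:
            roles.append("last")
        elif joins_next:
            roles.append("first")
        else:
            roles.append("single")
    return roles


# The same four references on two hypothetical paper sizes.
large_pages = [250, 402, 403, 403]
small_pages = [301, 480, 482, 483]

print(range_roles(large_pages))  # ['single', 'first', 'middle', 'last']
print(range_roles(small_pages))  # ['single', 'single', 'first', 'last']
```

A stylesheet written before layout cannot know which of these lists it will face, which is why the roles cannot be expressed as fixed selectors.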

So this proposal follows XSL-FO in having the browser or formatter do that automatically, even though it’s a little bit magic.

The proposal is declarative and there is also implementation experience in XSL-FO of a very similar system.

