Based on our current v0.3: tkbr2icmlv0.3.xsl
The following is, hopefully, enough background to attack the HTML-IDML script. The big picture is simply that this script was prototyped by a couple of people with absolutely no xslt experience. So, while it is functional, it is a mess of spaghetti—I am certain that there must be a better way of saying “when we see an <em> element, regardless of whether it's in a paragraph, a table cell, or a footnote (etc, etc.), do X with it”—but I don’t have a very good sense of how to do this properly. Another e.g.: we wrote this script specifying “xhtml:” for each selector, because it only worked that way... it cuts down on readablity, though; do we need it, perhaps by declaring differently? Such are the limits of my knowledge...
The second issue is that this needs to be a script which I can be maintain—tweak, modify, and add to—over time. The editorial workflow that this script supports is still being prototyped, and I’m certain that there will be lots of little tweaks to how exactly things get marked up, and how exactly we treat things in InDesign. So cleanliness and some degree of modularity are probably critical—we need to design for future changes, to the degree that this is possible.
Our prototype script used xslt v1.0, but Liza indicated that xslt v2.0 would be a better result for handling mixed content models. The caveat is that we want to make the execution environment as minimal as possible. I’m currently running my transforms in Tod Ditchendorf’s XSLPalette, which handily includes both libxslt and Saxon 8.9 (saving me the hassle of having to install Java components and set environment variables). More importantly, I wouldn’t want our target audience—small book and journal publishers—to have to wrestle with installing development environments. So simplicity is a major virtue.
Ideally, in the future I’d like to run this script from within InDesign (as in, FILE>Open Web Page)—but that’s outside the scope of the current project—and I believe that CS4 only supports xslt v1.0. So that can remain a wish-list item for the time being.
The other environment component is TinyMCE (v3.2), which is what our editors are using. If you have any question about what kind of XHTML to expect, the short answer is: that which TinyMCE produces.
This entire project is made possible by Adobe’s new InDesign file format, IDML, which exposes the entire guts of an InDesign file as XML. The spec is robust, nicely modularized, and well (even “nicely”) documented. The 1200-page reference is here:
Some samples and quick-start material can be found here:
and general info is here:
IDML represents the totality of an InDesign file. That’s a LOT more complexity than we needed. For our purposes—which is simply to move web-based content into pre-existing InDesign templates—all we need is the “Story” part of an InDesign file. Helpfully, Adobe thought of this already: Adobe’s InCopy software (originally designed for simultaneous editor/designer interaction in a newsroom) uses a subset of IDML called ICML (as of the CS4 version). So, to make a long story short, what we are really producing here are ICML files, which can then be “placed” in an InDesign template (which has master pages, styles, and so on pre-defined). Or, to put it differently, we’re effectively replacing InCopy with the web.
IDML’s content structures (which make up the bulk of ICML) are really simple:
ParagraphStyleRange elements define chunks of text that inherit a particular named style. Within those are CharacterStyleRange elements, which do the same thing. They’re like divs and spans, except that all content needs to be in a CharacterStyleRange, even though it may be unstyled. It’s verbose as hell, but it’s exactly two layers deep: not complicated.
Oh... and line breaks are done manually, which explains the <br /> in the middle of each handler.
The only ‘magic’ ICML requires is a couple of boilerplate PIs and that all the named Styles are declared at the top of the ICML file. They don't have to be defined, just declared.
Images are relatively straightforward. The only hitch is that they have a size/co-ordinates system which must be converted from width and height attributes—a little arithmetic, which you’ll see in our script.
Tables are more hairy, because the InDesign table model is completely different from the HTML table model: column and row counts are declared up front, and then everything is done cell-by-cell. This requires some processing, and in our prototype, you can see a barely functional version cribbed more or less straight from Adobe’s IDML cookbook. Making this perfectly intellegent and robust is not a high priority, but I'm very interested to hear what you think could be done with it.
Beginning with an HTML document (this one, for instance), we merely save a clean XHTML copy out of the browser (the "Clean XHTML export" link on the left does the work for you). Firefox's "Save as Web Page - HTML Only" produces an XHTML-declared version.
Run the resulting HTML against the XSLT, and give the result a .icml extension. Then, open up a template file in InDesign CS4. In InDesign, File->Place... the .icml file, and shift-click to “auto-flow” (generating additional pages as needed).
You’ll notice in the prototype that there’s a simple mechanism for distinguishing between initial paragraphs and following paragraphs, so that these can be styled distinctly in InDesign (flush and indented, respectively). I did this as a bit of a show-off (see what you can do with XML!), but also because it is a mark of typographic excellence, and that’s what we're after.
Footnote support is pretty rudimentary. You'll see in here a table-based footnote structure that comes from the reStructuredText system in ZWiki -- which I have since abandoned in favour of TinyMCE, which does a passable job of making functional end-notes out of content pasted in from MSWord (and since our journal publishers especially really like this), this probably needs to be rewritten in the future to support whatever our editors want to with footnotes (currently they don't know). The bottom line is: for the short term however, it’s not a high priority, but this is one of those areas where I will need to make changes to the script to adapt it to editorial needs.
Quick-and-dirty metadata is handled by assigning ‘semantic’ class names to particular paragraphs (e.g.; <p class="dc-creator">John Maxwell</p>). This is simple enough, and also imperfect enough that what it really needs is flexibility. This is one of the key areas of this script where I can imagine going in and changing the behaviour over time: how the metadata is represented in the HTML, how it maps to InDesign styles, etc. What I had imagined was a fair number of possibilities for p classes, all assembled in such a way as to make quick script modifications easy. Open to suggestions about how to handle this...
Do I have to say, in 2009, that we just assume Unicode throughout, and not bother at all with those nasty character entities? I probably do. OK, there, I said it.
My intent is to release this under GPL. The only reason I haven’t already is that there are chunks of this script cribbed straight from Adobe’s cookbook, and I didn’t want to go GPLing something that wasn’t mine. Any distance you can take us toward this would be appreciated.
The overall agenda is to release this into the wild as a usable proof-of-concept (right now it is a not-quite-usable proof of concept) that people can pick up and work with. I have no intention of selling it or otherwise managing it on a commercial basis—apart from using it in a consulting context. I’m more interested in demonstrating to publishers that this way of working is possible and easily acheivable. Of course, the script and basic approach could be taken miles farther.