John Maxwell
This document describes an XML-based production environment prototyped by the CCSP in 2009. The system is built from free and/or widely used tools in the publishing world, and aims to map out an integrated web + print editorial and production workflow.
The system was designed principally to handle books and monographs, but has at least as much applicability in journal publishing; in fact, the Journal of Scholarly and Research Communication (SRC) is currently providing a testbed for single-source article production.
The difficulty of actually implementing XML-based production systems is a long-standing problem, for reasons of complexity, scale, and lack of accessible tools—especially where traditional publishing staff are the target users. While a well-funded research group can make investments in fairly heavy software and IT infrastructure (e.g., the Humanities Computing group at UVic), this approach is largely out of reach for small publications. Furthermore, complex XML production systems tend to trade traditional production staff for IT/programming staff, rather than building on existing competencies and resources.
The CCSP’s approach has been to leverage existing, well-known, and well-used tools, strategically re-orienting them so as to achieve the benefits of sophisticated XML workflows without taking on the heavy and often expensive complexity of XML software. Our design goals were as follows:
We began with web technology as a basic entry point to XML. The Web is fundamentally an XML-based system, arguably the largest, most successful, and longest-running XML application ever. It is ubiquitous, free, and used in some capacity by nearly everyone already. We recognized that modern, robust web-based content-management systems (CMS) offer a substantial set of functionality paralleling that of expensive XML content systems: access controls, versioning and tracking, granular management of content, separation of content from display, and so on.
But web CMS are only really good at web publishing. Where publishers have difficulty is in getting from existing print editing and production processes to web content in the first place. There are any number of software tools and systems in use to try to add web publishing onto existing print production workflows. These remain problematic for a variety of reasons:
This situation is unlikely to improve because of the basic nature of the processes and tools involved. Our solution has been to turn this workflow upside-down and go from web-based content to printed output rather than the other way around; in other words, to be able to confidently move web content into, not out of, professional page layout software. In 2009, we achieved this using Adobe’s InDesign CS4, software which is rapidly becoming the tool of choice among print designers and production people.
The resulting workflow, which we have now prototyped and begun testing with a variety of content, looks like this:
Key to the practical success of this arrangement is that step 5 and/or 6 happen quickly and seamlessly enough to be thought of merely as ‘output’ stages. Any editing that needs to happen at this point happens in the CMS, and online and print “outputs” are re-run (in a matter of seconds). Layout files are thus ephemeral; they are not archival, but only exist to get the content printed.
Web CMS: Our proof-of-concept uses a simple wiki; a free, general purpose CMS platform like Drupal or WordPress would provide more than enough functionality (minimally: access control, version control, editorial stages, ability to dynamically apply stylesheets for different users).
Editing environment: We used TinyMCE (v3.2), which, when integrated with the CMS’ stylesheets, not only looks and feels like a real editing environment but can also facilitate a good amount of ‘semantic’ tagging via HTML class attributes, simply through its drop-down menu interface. TinyMCE produces valid XHTML markup, allowing the content to integrate with systems beyond the web.
Print output: We developed an XSLT-based transformation script to convert XHTML content to IDML, the XML file format in InDesign CS4. We set up an InDesign template with master pages and stylesheets in place so that the web content can be imported automatically and the resulting layouts are nearly perfect. At this point, proofreaders and/or production staff can take over, working in a familiar DTP environment. Not surprisingly, the hardest part of this conversion is table-based content; our prototype is capable of building complete tables in InDesign, but they have so far required proofing and layout adjustments, which are easy tasks in InDesign.
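The kind of mapping such a conversion performs can be sketched as follows. This is not the CCSP’s XSLT script; it is a minimal Python illustration, assuming that each XHTML heading or paragraph becomes one paragraph-style range in an IDML-style story. The style names in STYLE_MAP, the story identifier, and the sample document are all hypothetical, inline formatting is flattened to plain text, and tables are ignored. A real IDML file is also a zipped package containing further files, which this sketch does not produce.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from (element, class) to InDesign paragraph style names.
# A real template would define its own styles.
STYLE_MAP = {
    ("h1", None): "ParagraphStyle/Heading1",
    ("p", None): "ParagraphStyle/Body",
    ("p", "dc-creator"): "ParagraphStyle/Author",
}

# Made-up sample content for illustration only.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Title</h1>"
    '<p class="dc-creator">Jane Doe</p>'
    "<p>Some <em>body</em> text.</p>"
    "</body></html>"
)


def xhtml_to_story(xhtml: str) -> ET.Element:
    """Build an IDML-style <Story> element from XHTML headings and paragraphs."""
    root = ET.fromstring(xhtml)
    story = ET.Element("Story", {"Self": "story1"})  # hypothetical story id
    for el in root.iter():
        tag = el.tag.replace(XHTML_NS, "")
        if tag not in ("h1", "p"):
            continue
        # Prefer a class-specific style, fall back to the element default.
        style = STYLE_MAP.get((tag, el.get("class")), STYLE_MAP[(tag, None)])
        para = ET.SubElement(
            story, "ParagraphStyleRange", {"AppliedParagraphStyle": style}
        )
        chars = ET.SubElement(
            para,
            "CharacterStyleRange",
            {"AppliedCharacterStyle": "CharacterStyle/$ID/[No character style]"},
        )
        content = ET.SubElement(chars, "Content")
        # Flatten inline markup (em, strong, ...) to plain text in this sketch.
        content.text = "".join(el.itertext()).strip()
        ET.SubElement(chars, "Br")  # paragraph break
    return story


print(ET.tostring(xhtml_to_story(SAMPLE), encoding="unicode"))
```

The design point is that the class attribute applied in the editor is what selects the print style, so semantic tagging done once in the CMS carries through to layout.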
E-publishing: Our prototyping has so far involved:
All edited content stays in the Web CMS, as neutral, transparent XHTML. This content is reasonably ‘future-proof’ because it is in a simple, open, and ubiquitous format.
The use of free software for the CMS platform almost eliminates concerns over future incompatibilities or software lock-in. The XSLT conversion script—based on an open standard—is distributed as free software.
Future publishing platforms and/or formats are entirely achievable, as there is no lock-in with the current technologies (recall that the InDesign step is an ephemeral one: after going to print, the InDesign file can be discarded). Chances are that future digital publishing formats will be based on HTML (or evolved from it) anyway.
Because the system is based on either free software or tools that publishers are likely to already own, the investment that users make is largely one of learning how to do it. So the value proposition is not dissimilar to that of Open Journal Systems (OJS) itself (although the overall complexity is even less).
The point this system makes is that it is entirely possible to implement a practical XML-based production workflow without needing expensive or complex software and skills. Rather, it is already achievable using free and commonplace tools, by strategically re-orienting these tools and the way they work together.
XHTML is real XML, even though the vast majority of websites don’t treat it as such. XHTML defines a generic set of structures for marking up prose text—indeed, it is good enough to be the content core for a variety of other document schemas, like ePub and ONIX.
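A small illustration of what “real XML” means in practice: a stock XML parser, with no HTML-specific logic, can read well-formed XHTML, while typical tag-soup HTML is rejected. The sketch below uses Python’s standard library; both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every element is closed, including <br/>.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample article</title></head>
  <body>
    <h1>Sample article</h1>
    <p>First paragraph.<br/>Second line.</p>
  </body>
</html>"""

# Tag soup that browsers tolerate but an XML parser does not:
# <br> and <p> are never closed.
TAG_SOUP = "<html><body><p>First paragraph.<br>Second line.</body></html>"


def is_well_formed(markup: str) -> bool:
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def element_names(markup: str) -> list:
    """List element names in document order, with the namespace stripped."""
    root = ET.fromstring(markup)
    return [el.tag.split("}")[-1] for el in root.iter()]


print(is_well_formed(XHTML), is_well_formed(TAG_SOUP))
print(element_names(XHTML))
```

Because the content passes a generic parser, any XML tooling (XSLT, XPath, schema validation) can be applied to it without a separate conversion step.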
The argument for “semantic richness” must be made in the light of production costs and ROI: for example, the Text Encoding Initiative (TEI) defines a markup vocabulary that is stunning in its semantic richness. But outside of bibliographers and other such scholars, it isn’t worth anyone’s while to do the markup. The return just isn’t worth the investment.
XHTML can probably provide 80% of the real-world utility of more specialized DTDs and schemas. And, given ubiquitous, free, easily understood tools like the TinyMCE editor, that 80% is achievable in 20% of the time.
Our approach to things like basic journal metadata (author names, institutional affiliations, etc.) is to capture these straightforwardly within the article text, as visible content. They are marked up in XHTML via class attributes (e.g., <p class="dc-creator">). TinyMCE makes this kind of capture as easy as applying formatting styles.
Since the resulting markup is simple yet unambiguous, metadata captured in this way can later be easily transformed into other DTDs for interchange or harvesting if needed. Our system focuses instead on making the content preparation as straightforward as possible on the editorial end.
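As a rough sketch of how such class-based metadata could be harvested and re-expressed, the following Python example collects every element whose class begins with dc- and emits Dublin Core elements. Only the dc-creator class comes from the example above; the dc-title class, the wrapping metadata element, and the sample article are assumptions made for illustration, not part of the prototype.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up sample article; "dc-title" is a hypothetical class name.
ARTICLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1 class="dc-title">Single-Source Publishing</h1>
<p class="dc-creator">Jane Doe</p>
<p class="dc-creator">Sam Roe</p>
<p>Body text of the article.</p>
</body></html>"""


def extract_dc(xhtml: str) -> list:
    """Return (name, value) pairs for every element with a dc-* class."""
    root = ET.fromstring(xhtml)
    pairs = []
    for el in root.iter():
        for cls in (el.get("class") or "").split():
            if cls.startswith("dc-"):
                value = "".join(el.itertext()).strip()
                pairs.append((cls[len("dc-"):], value))
    return pairs


def to_dublin_core(pairs: list) -> str:
    """Serialize the pairs as Dublin Core elements in a hypothetical wrapper."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in pairs:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")


print(to_dublin_core(extract_dc(ARTICLE)))
```

The metadata stays visible in the article text for editors, while a harvesting step like this one can run at any later point without changes to the source content.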
There isn’t any software integration yet. Our prototyping with OJS has been done manually: copying the resulting HTML into OJS as galleys. OJS’ default assumption is that content is an attached file. In the future, OJS might be able to refer to content simply as a URL—making integration much simpler.
Authors still write in Word; they probably will for a long time to come. We’ve experimented with ways of converting .doc files to XML, and there is no perfect solution. One of the quickest, easiest methods is simply to cut and paste from Word directly into TinyMCE (which has a nice feature for doing just this) and then immediately proof the result.
It seems advantageous to get the content out of Word and into the CMS at the earliest opportunity. But if for workflow reasons authors and editors prefer to pass Word files back and forth with “track changes,” then the editor could import the content into the CMS at a later point.
L8X does a good job of recognizing basic article metadata and bibliographic references (it does a terrific job with the latter, especially). But it doesn’t really offer much new functionality for the conversion of basic article content.
Ideally, the XML production environment could be integrated with a tool like L8X, where the WYSIWYG editing happens immediately in TinyMCE, and the L8X components help sort out metadata and bibliographic details. The online content is then maintained as both the master editorial source and the archive, and can be output to print or other digital forms at will.