Project Brief
by jmax. Last edited by jmax @ 2009/12/3
http://thinkubator.ccsp.sfu.ca/wikis/xmlProduction/ProjectBrief

Elements of a Web-first XML Production Framework

John Maxwell

This document describes an XML-based production environment prototyped by the CCSP in 2009. The system is built from free and/or widely used tools in the publishing world, and aims to map out an integrated web + print editorial and production workflow.

The system was designed principally to handle books and monographs, but has at least as much applicability in journal publishing; in fact, the Journal of Scholarly and Research Communication (SRC) is currently providing a testbed for single-source article production.

Starting principles, design goals

The difficulty of actually implementing XML-based production systems is a long-standing problem, for reasons of complexity, scale, and lack of accessible tools—especially where traditional publishing staff are the target users. While a well-funded research group can make investments in fairly heavy software and IT infrastructure (e.g., the Humanities Computing group at UVic), this approach is largely out of reach for small publications. Furthermore, complex XML production systems tend to trade traditional production staff for IT/programming staff, rather than building on existing competencies and resources.

The CCSP’s approach has been to leverage existing, well-known, and well-used tools, strategically re-orienting them so as to achieve the benefits of sophisticated XML workflows without taking on the heavy and often expensive complexity of XML software. Our design goals were as follows:

  • prefer existing and/or ubiquitous tools over specialized ones wherever possible;
  • prefer free software over proprietary systems where possible;
  • prefer simple tools controlled and coordinated by human beings over fully automated (and therefore complex) systems;
  • play to strengths: use web software for storing and managing content, use well-known DTP tools for layout, keep editors and production people in charge of their own domains.

We began with web technology as a basic entry point to XML. The Web is fundamentally an XML-based system, arguably the largest, most successful, and longest-running XML application ever. It is ubiquitous, free, and already used in some capacity by nearly everyone. We recognized that modern, robust Web-based content-management systems (CMS) offer a substantial set of functionality paralleling that of expensive XML content systems: access controls, versioning and tracking, granular management of content, separation of content from display, and so on.

But web CMS are only really good at web publishing. Where publishers have difficulty is in getting from existing print editing and production processes to web content in the first place. There are any number of software tools and systems in use to try to add web publishing onto existing print production workflows. These remain problematic for a variety of reasons:

  • file conversion (print layouts to web markup) is an imprecise and cumbersome process;
  • there is no robust system for revision management beyond the usual “track changes” approach or out-of-band file naming conventions (such as OJS uses);
  • archival formats tend to be either MS Word or DTP file formats, resulting in long-term problems.

This situation is unlikely to improve because of the basic nature of the processes and tools involved. Our solution has been to turn this workflow upside-down and go from web-based content to printed output rather than the other way around; in other words, to be able to confidently move web content into, not out of, professional page layout software. In 2009, we achieved this using Adobe’s InDesign CS4, software which is rapidly becoming the tool of choice among print designers and production people.

The resulting workflow, which we have now prototyped and begun testing with a variety of content, looks like this:

  1. authors submit content, usually in MS Word;
  2. editors import the Word files into the web CMS using generic WYSIWYG editing tools, then proof the result;
  3. editors tag basic structural text components and specialized semantics (author name, etc.) via WYSIWYG web editing tools;
  4. editorial workflow, access control, and revision history are managed online via the web CMS; all editing happens online, so the canonical version of the content remains in the CMS;
  5. publishing online is done either by moving the HTML content into a public-facing place, or simply by creating a public view/stylesheet for the article content, leaving it in the CMS;
  6. publishing to print or page-oriented PDF is accomplished by transforming the web content to Adobe InDesign CS4 (via the IDML file format), resulting in fully-formed InDesign layouts that can then be proofed and assembled by print production staff.

Key to the practical success of this arrangement is that steps 5 and 6 happen quickly and seamlessly enough to be thought of merely as ‘output’ stages. Any editing that needs to happen at this point happens in the CMS, and the online and print “outputs” are re-run (in a matter of seconds). Layout files are thus ephemeral; they are not archival, but only exist to get the content printed.

Components of our prototype system

Web CMS: Our proof-of-concept uses a simple wiki; a free, general purpose CMS platform like Drupal or WordPress would provide more than enough functionality (minimally: access control, version control, editorial stages, ability to dynamically apply stylesheets for different users).

Editing environment: We used TinyMCE (v3.2), which, when integrated with the CMS’ stylesheets, not only looks and feels like a real editing environment but can also facilitate a good amount of ‘semantic’ tagging via HTML class attributes, simply through its drop-down menu interface. TinyMCE produces valid XHTML markup, allowing the content to integrate with systems beyond the web.

Print output: We developed an XSLT-based transformation script to convert XHTML content to IDML, the XML file format in InDesign CS4. We set up an InDesign template with master pages and stylesheets in place so that the web content can be imported automatically and the resulting layouts are nearly perfect. At this point, proofreaders and/or production staff can take over, working in a familiar DTP environment. Not surprisingly, the hardest part of this conversion is table-based content; our prototype is capable of building complete tables in InDesign, but they have so far required proofing and layout adjustments, which are easy tasks in InDesign.
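
To make the idea of the XHTML-to-IDML mapping concrete, here is a minimal sketch in Python. It is not the CCSP's XSLT script, which is not reproduced in this brief; it only illustrates one plausible mapping, in which each block-level XHTML element becomes a paragraph range whose style name is taken from the element's class attribute (or its tag name when no class is set). The element names follow IDML story markup as we understand it; the function name, the sample content, and the style-naming rule are assumptions for illustration. A real IDML file is a zip package with many more parts (spreads, styles, a design map), and inline formatting and tables are ignored here.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"  # namespace prefix as ElementTree writes it

# Hypothetical sample content, standing in for an article held in the CMS.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1>A Title</h1>
<p class="dc-creator">John Maxwell</p>
<p>First paragraph of body text.</p>
</body></html>"""


def xhtml_to_story(xhtml):
    """Map each child of <body> to a simplified IDML-style paragraph range."""
    body = ET.fromstring(xhtml).find(XHTML + "body")
    story = ET.Element("Story", {"Self": "story1"})  # "story1" is a made-up id
    for block in body:
        tag = block.tag.replace(XHTML, "")
        # Assumed rule: the class attribute names the paragraph style,
        # falling back to the tag name (h1, p, ...).
        style = block.get("class") or tag
        para = ET.SubElement(
            story, "ParagraphStyleRange",
            {"AppliedParagraphStyle": "ParagraphStyle/" + style})
        chars = ET.SubElement(
            para, "CharacterStyleRange",
            {"AppliedCharacterStyle": "CharacterStyle/$ID/[No character style]"})
        # Flatten inline markup to plain text; a fuller version would map
        # <em>, <strong>, etc. to their own character style ranges.
        ET.SubElement(chars, "Content").text = "".join(block.itertext()).strip()
        ET.SubElement(chars, "Br")  # paragraph break
    return story


if __name__ == "__main__":
    print(ET.tostring(xhtml_to_story(SAMPLE), encoding="unicode"))
```

The point of the design is that the InDesign template only needs paragraph styles whose names match the class names used in the CMS; the transformation itself carries no layout decisions.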

E-publishing: Our prototyping has so far involved:

  • styling web content in place for public access;
  • exporting the HTML and re-importing into OJS as galleys;
  • converting HTML to .ePub files (a merely mechanical conversion) to create ebooks.
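
As an illustration of why the .ePub conversion is largely mechanical, the following Python sketch wraps a single XHTML article in a minimal EPUB 2 container using only the standard library. This is not the tooling used in our prototype; the function name, the OEBPS folder and file names, and the identifier passed in are assumptions for illustration, and the output has not been checked against a validator here.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{identifier}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="article" href="article.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="article"/>
  </spine>
</package>"""

NCX = """<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{identifier}"/></head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>{title}</text></navLabel>
      <content src="article.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""


def build_epub(xhtml, title, identifier):
    """Return the bytes of a one-article EPUB 2 file built from an XHTML string."""
    fields = {
        "title": escape(title),
        "identifier": escape(identifier, {'"': "&quot;"}),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and be stored uncompressed.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("OEBPS/content.opf", OPF.format(**fields))
        z.writestr("OEBPS/toc.ncx", NCX.format(**fields))
        z.writestr("OEBPS/article.xhtml", xhtml)
    return buf.getvalue()
```

Because the article is already valid XHTML in the CMS, it goes into the package unchanged; everything else is fixed scaffolding plus a title and an identifier.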

Planning for Sustainability

All edited content stays in the Web CMS, as neutral, transparent XHTML. This content is reasonably ‘future-proof’ because it is in a simple, open, and ubiquitous format.

The use of free software for the CMS platform almost eliminates concerns over future incompatibilities or software lock-in. The XSLT conversion script—based on an open standard—is distributed as free software.

Support for future publishing platforms and/or formats is entirely achievable, as there is no lock-in with the current technologies (recall that the InDesign step is an ephemeral one: after going to print, the InDesign file can be discarded). Chances are that future digital publishing formats will be based on HTML (or evolved from it) anyway.

Because the system is based on either free software or tools that publishers are likely to already own, the investment that users make is largely one of learning how to do it. So the value proposition is not dissimilar to that of OJS itself (although the overall complexity is even less).

The point this system makes is that it is entirely possible to implement a practical XML-based production workflow without needing expensive or complex software and skills. Rather, it is already achievable using free and commonplace tools, by strategically re-orienting these tools and the way they work together.

Some Q&A:

HTML isn’t a real XML document type; how can it provide semantically rich markup such as journal articles require? Shouldn’t you be using NLM’s Journal Publishing DTD?

XHTML is real XML, even though the vast majority of websites don’t treat it as such. XHTML defines a generic set of structures for marking up prose text—indeed, it is good enough to be the content core for a variety of other document schemas, like ePub and ONIX.

The argument for “semantic richness” must be made in the light of production costs and ROI: for example, the Text Encoding Initiative (TEI) defines a markup vocabulary that is stunning in its semantic richness. But outside of bibliographers and other such scholars, it isn’t worth anyone’s while to do the markup. The return just isn’t worth the investment.

XHTML can probably provide 80% of the real-world utility of more specialized DTDs and schemas. And, given ubiquitous, free, easily understood tools like the TinyMCE editor, that 80% is achievable in 20% of the time.

What about the metadata?

Our approach to things like basic journal metadata (author names, institutional affiliations, etc.) is to capture these straightforwardly within the article text, as visible content. They are marked up in XHTML via class attributes (e.g., <p class="dc-creator">). TinyMCE makes this kind of capture as easy as applying formatting styles.

Since the resulting markup is simple yet unambiguous, metadata captured in this way can later be easily transformed into other DTDs for interchange or harvesting if needed. Our system focuses instead on making content preparation as straightforward as possible on the editorial end.
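
As a rough illustration of how class-tagged metadata could be harvested later, the Python sketch below collects the text of every element whose class begins with "dc-". Only the dc-creator class comes from our own markup as described above; the dc-title class, the affiliation class, the sample document, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample article; only class="dc-creator" is taken from the brief.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1 class="dc-title">Elements of a Web-first XML Production Framework</h1>
    <p class="dc-creator">John Maxwell</p>
    <p class="affiliation">Simon Fraser University</p>
    <p>Body text starts here.</p>
  </body>
</html>"""


def extract_metadata(xhtml, prefix="dc-"):
    """Collect the text of elements whose class starts with the given prefix."""
    root = ET.fromstring(xhtml)
    found = defaultdict(list)
    for element in root.iter():
        # An element may carry several space-separated classes.
        for cls in element.get("class", "").split():
            if cls.startswith(prefix):
                text = "".join(element.itertext()).strip()
                found[cls[len(prefix):]].append(text)
    return dict(found)


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

Each field maps to a list because an article may have several authors; from a structure like this, producing Dublin Core or another interchange format is a matter of templating.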

How does this fit into OJS and PKP?

There isn’t any software integration yet. Our prototyping with OJS has been done manually: copying the resulting HTML into OJS as galleys. OJS’ default assumption is that content is an attached file. In the future, OJS might be able to refer to content simply as a URL—making integration much simpler.

What about Word?

Authors still write in Word; they probably will for a long time to come. We’ve experimented with ways of converting .doc files to XML, and there is no perfect solution. One of the quickest, easiest methods is simply to cut and paste from Word directly into TinyMCE (which has a nice feature for doing just this) and then immediately proof the result.

It seems advantageous to get the content out of Word and into the CMS at the earliest opportunity. But if for workflow reasons authors and editors prefer to pass Word files back and forth with “track changes,” then the editor could import the content into the CMS at a later point.

How does this relate to Lemon8XML?

L8X does a good job of recognizing basic article metadata and bibliographic references (it does a terrific job with the latter, especially). But it doesn’t really offer much new functionality for the conversion of basic article content.

Ideally, the XML production environment could be integrated with a tool like L8X, where the WYSIWYG editing happens immediately in TinyMCE, and the L8X components help sort out metadata and bibliographic details. The online content is then maintained as the master editorial source, the archive, and can be output to print or other digital forms at will.

 

 
