Editor: jmax
Time: 2009/11/10 23:39:19 GMT-8
For PubWest 2009
Tucson AZ
Nov 12–14, 2009
XML 101: Paring it Down to Essentials
John Maxwell
Canadian Centre for Studies in Publishing
Simon Fraser University
Vancouver BC
Welcome to all attendees.
How many of you have dealt with XML already?
Agenda
The point of XML: Data Neutrality
The point of this talk: Keep it Simple, Stanley!
Three parts:
- Essential components of XML
- Practical advice on XML tools and strategies
- Outline of a "web-first" XML workflow
The point of XML: Data neutrality: how do you prepare content in such a way that it isn't tied to a particular device (printer, software application, hardware)? So that it can be dealt with appropriately across multiple environments and systems without having to do everything twice (or more); without having to go through painful conversion steps?
Most of what we are used to doing with computers and publishing software is device-dependent. Think of Word, or InDesign: in the first place, they are software-specific. In the second, they are entirely about producing ink on paper.
The point of this talk: It isn't rocket science. Well-meaning developers, industry consortia, and standards bodies (read: committees) have over the past decade made XML into an enormous snake's nest of complex details, capable of being applied to all kinds of different problems (like getting completely different databases to talk to each other over the Internet). But you don't need all that complexity to make this technology work for you. My agenda is to give you the 20% of XML that does 80% of the work, and maybe even better numbers than that!
Going to talk about 3 things:
Markup language?
Embedding codes in the text so it becomes machine-readable as well as human-readable. Not AI, just markers (“tags”) to identify the structural parts of your document. These are handles that let the software do things with the text:
- format it;
- convert it;
- fold, spindle, mutilate...
How to achieve data neutrality? The solution is markup languages—the idea dates back to the 1970s at IBM, where they were producing software supports for computer phototypesetting equipment, and facing the same device-dependence problem. Their innovation: a “Generalized Markup Language.”
The key idea is that the markup is about the semantics of the document, not the format. If the semantic structure is clear, you can use that to format it. This is called “separation of concerns.”
In the 1980s, the technology was formalized (by ISO) as Standard Generalized Markup Language (SGML); in the late 1990s, a "version 2.0" was released with the web in mind, called Extensible Markup Language. Same basic technology. Nothing new here.
Markup is what editors do anyway. XML is a set of conventions for doing markup with computers. XML is a set of completely open standards for markup—not dependent on any particular software or system.
<tag>content</tag>
In XML, these “pointy brackets” are the magic symbols that separate the markup from the content.
(This is not rocket science...)
Here's what it looks like:
Pointy brackets define the metadata, the tags, that surround (in pairs) and describe the content.
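A minimal sketch of how software sees that one-line example, assuming Python and its standard xml.etree.ElementTree module (the language choice is only for illustration; any XML parser does the same job):

```python
import xml.etree.ElementTree as ET

# Parse the one-line example from the slide.
element = ET.fromstring("<tag>content</tag>")

# The parser hands back the markup and the content separately.
print(element.tag)   # the name between the pointy brackets
print(element.text)  # the content the pair of tags surrounds
```

The point is only that the brackets give software an unambiguous way to tell the label from the thing being labelled.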
But you've seen this before... you've seen this in web pages, if you've had any opportunity to look at web pages in the past two decades. Because web pages are actually XML. More about that later...
But there's a little more to it than pointy brackets.
A hierarchy of elements
Pairs of tags, nested within other tags
= a hierarchical document structure.
Quality control: “Validity” vs “Well-formedness”
The more important thing is that the elements (tags—in pairs—surrounding content) are nested in a neat hierarchy that describes overall document structure.
So, technically speaking, if the markup is neatly nested, and the tags come in tidy pairs, the document is said to be “well formed”—and therefore the parsing software can make easy sense of it.
Further—and optionally—if the set of tags and their nesting sequence follows a set of formal rules (a “document type” or “schema”) the document can be said to be “valid,” according to that set of rules, giving the parser even more confidence.
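Well-formedness is something a parser can check mechanically. A small sketch, again assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the parser can read the markup without complaint."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Tags in tidy pairs, neatly nested.
good = "<Chapter><Section><Paragraph>Text</Paragraph></Section></Chapter>"
# Section is closed before the Paragraph inside it: overlapping, not nested.
bad = "<Chapter><Section><Paragraph>Text</Section></Paragraph></Chapter>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Note that this checks well-formedness only. Checking validity needs a set of rules to check against, which the parser above knows nothing about.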
<Book>
  <Chapter>
    <Section>
      <Paragraph>The quick brown fox jumped over...</Paragraph>
    </Section>
  </Chapter>
  <Chapter>...
Here's an example that will be familiar to anyone who's ever been to school.
That's about it. That's the essentials. What we have here is a rudimentary framework—a syntax—for marking up text with enough extra information to make it do what we want.
The rudiments of a data structure that can drive typesetting and formatting, and indexing, and so on.
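To see the hierarchy as a data structure, here is a sketch that reads the book example and prints its outline. It assumes Python's standard xml.etree.ElementTree module. The slide leaves the second chapter unfinished, so the version below closes it off with placeholder content purely so that it parses:

```python
import xml.etree.ElementTree as ET

# The slide's example, closed off so it parses. The second chapter's
# content is a placeholder, since the slide elides it.
BOOK = """
<Book>
  <Chapter>
    <Section>
      <Paragraph>The quick brown fox jumped over...</Paragraph>
    </Section>
  </Chapter>
  <Chapter>
    <Section>
      <Paragraph>(placeholder)</Paragraph>
    </Section>
  </Chapter>
</Book>
"""

def outline(element, depth=0):
    """Return one indented line per element, following the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(BOOK)
print("\n".join(outline(root)))
```

The same walk that prints an outline could just as easily apply a heading style to each Chapter or collect entries for a table of contents.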
Semantics and Granularity
Elements and attributes are standardized in “Document Types.”
Document Types have tended to be industry-specific.
More granularity means more semantic richness;
...also more work.
Now, we haven't yet talked about the tags themselves. The names of the elements, and the possible attributes they carry are going to define what we can meaningfully express in our markup.
Historically, especially in the days of SGML, when markup languages were industrial concerns administered by industrial IT people, Markup Languages (MLs) were defined for particular industrial applications: aerospace, pharmaceuticals, auto parts, philology and paleology, and so forth.
Granularity refers to how deep the markup goes.
In the book chapter example above, we didn't extend the markup beyond paragraphs, down to sentences or sub-sentence structures. But you could, if there was a good reason for capturing structure at that level. But that would mean a lot more work. Granularity implies trade-offs: the more semantic richness captured in the markup, the more you can do with the text. But someone has to do all that markup.
Different Document Types have different limits to granularity and semantic richness. They have different business cases.
The issue is not whether or not to use XML, but what do you want XML to do for you?
What can you do once you have it?
Typesetting & formatting – the original application;
Web pages – all made of XML;
Ebooks – ePub is just web pages in a wrapper (a binding?);
Searching, indexing, browsing;
Re-use, re-packaging, re-purposing;
Fold, spindle, mutilate...
Once you have marked up content, you have material which can be used with an unlimited number of different applications.
Typesetting was the original SGML application, way back in the 1970s and 80s.
The next great wave of markup came with the WWW in 1990—designed originally as an SGML application, it has been more or less XML based for almost a decade now. Which is to say, you have already read thousands of pages of XML content.
The coming of the ebook is about XML, too. Almost all ebook formats are XML-based, especially the “reflowable” ones like ePub (which is literally made up of a collection of web pages in a wrapper—perhaps we should think of it as a binding.)
Semantically rich, fine-grained XML content can be indexed and searched with precision. Anyone who has ever painstakingly compiled the index to a book can probably imagine the advantages of using explicit text markup through the book to track the occurrence of terms, and the ease of maintaining the index once you have the markup.
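A rough sketch of that indexing idea, assuming Python's standard library. The element name term and the chapter-number attribute n are hypothetical, invented for this illustration rather than taken from any particular Document Type:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# <term> and the n attribute are hypothetical names used only here.
BOOK = """
<Book>
  <Chapter n="1">
    <Paragraph>Early <term>typesetting</term> relied on <term>markup</term>.</Paragraph>
  </Chapter>
  <Chapter n="2">
    <Paragraph>The web made <term>markup</term> ubiquitous.</Paragraph>
  </Chapter>
</Book>
"""

def build_index(book_xml: str) -> dict:
    """Map each marked-up term to the sorted chapter numbers where it occurs."""
    found = defaultdict(set)
    root = ET.fromstring(book_xml)
    for chapter in root.iter("Chapter"):
        for term in chapter.iter("term"):
            found[term.text].add(int(chapter.get("n")))
    return {term: sorted(chapters) for term, chapters in sorted(found.items())}

index = build_index(BOOK)
for term, chapters in index.items():
    print(term, chapters)
```

Because the index is regenerated from the markup, moving a paragraph to another chapter updates the index on the next run with no extra work.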
And re-use and re-packaging are what the current wave of XML is all about: producing the same content in print AND online; creating the full-screen version AND the mobile-phone version. The slideshow presentation AND the notes handout, from a single source.
Consider the slides you're looking at right now. They are produced by an XML application that pulls them right from the context of the notes.
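The sketch below shows the single-source idea in miniature. It is not the application used for these slides; the markup convention (a div with class "slide" followed by paragraphs of notes) is an assumption made up for the example, and it uses Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# Hypothetical source: each <div class="slide"> holds the slide text,
# and the <p> elements that follow it carry the speaker notes.
SOURCE = """
<body>
  <div class="slide"><h2>Agenda</h2></div>
  <p>The point of XML: data neutrality.</p>
  <div class="slide"><h2>Markup language?</h2></div>
  <p>Embedding codes in the text so it becomes machine-readable.</p>
</body>
"""

root = ET.fromstring(SOURCE)

# Output 1: the slideshow, built from the slide blocks only.
slides = [div.findtext("h2") for div in root.findall("div[@class='slide']")]

# Output 2: the handout, with slide titles and notes in document order.
handout = []
for child in root:
    if child.tag == "div":
        handout.append(child.findtext("h2").upper())
    else:
        handout.append(child.text)

print(slides)
print("\n".join(handout))
```

Both outputs come from the same file, so a correction made once shows up in the slideshow and in the handout.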
OK, How do I create XML?
The simple answer: by hand coding it.
The real answer: document conversion via software or cheap labour
The expensive answer: dedicated editing/processing tools
The smart answer: use web tools.
We need to talk about practical matters now.
One of the design goals for XML was that it should be human-readable as well as machine-readable. So you can mark up text by hand, using a text editor, keying in all the tags and attributes manually. But while many geeks still do this, you shouldn't, so let's move on.
Document conversion is the industrial answer to this question. If you have existing content, in Word or Quark or PDF or just books in the warehouse, there are document conversion service providers that will do this job for you, for a fee.
While the promise of effective software conversion (e.g., clean XML markup from old MS Word files) always seems like it should be possible, it generally isn't—because there just isn't enough explicit structure in those files. So document conversion services often involve cheap labour—often in India.
There are, of course, expensive software tools for writing, editing, and maintaining XML. If you've got content marked up in one of those heavy, industrial Document Types (say you work for Boeing), there are heavy, industrial software tools that will lighten your load. These tend to be designed for technical writers and long, complex, highly-structured documentation. Literary editors may not see themselves in this picture.
The smart answer, I argue, is to use web tools.
If you have a blog, a wiki, or a web content management system, you actually have some reasonably sophisticated XML tools already. But you probably think of these as tools you use to create your website, not your published content.
These web tools, though, have been around for years; they're usually free (open source), and they often have big teams of developers maintaining and improving them. As a result, they have evolved into powerful XML tools (for writing, editing, categorizing, searching, and so on) without anyone really thinking of them that way.
TinyMCE
A WYSIWYG editor for the web.
I'll show you one such tool, called TinyMCE, which is a piece of free software, absolutely ubiquitous online (you have almost certainly used it yourself, though you may not have stopped to think about it). TinyMCE is a little embedded WYSIWYG editing environment for web pages. You see it in your blogging tool; you fill out comment fields with it. It looks like a word processor, and it painlessly generates valid XML code.
It's actually pretty robust. We've been using it over the past year at SFU, and it works great. So far we've run three whole books through it, as well as the production for a new scholarly journal, and we use it for all of our online activities.
Here are regular old web tools pressed into service as XML production tools. Does this make sense?
To properly assess this, I want to return briefly to talking about Document Types.
Making sense of Document Types
Which DTD should I use?
(er, what was a DTD again?)
(Or should that be “schema”?)
- DocBook, ePub, XHTML, DITA, OOxml, and the rest of the alphabet...
We don't mark up content in “XML”—this is a meaningless thing to say. We mark up content in a particular XML markup language, a particular Document Type.
Remember that a “valid” document is one which conforms to the set of rules for a particular Document Type (DTD is Document Type Definition). Where XML in general gives us the syntax of markup, the DTD gives us the vocabulary and grammar for making sense with it.
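To make the syntax-versus-grammar distinction concrete, here is a toy validity check, assuming Python's standard library. The rule table stands in for a Document Type and is invented for the book example; real DTDs and schemas are written in their own languages and can express far more:

```python
import xml.etree.ElementTree as ET

# A toy "document type": which child elements each element may contain.
# This rule table is made up for the Book example.
ALLOWED_CHILDREN = {
    "Book": {"Chapter"},
    "Chapter": {"Section"},
    "Section": {"Paragraph"},
    "Paragraph": set(),
}

def validation_errors(element):
    """Return a list of rule violations found at or below this element."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    errors = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            errors.extend(validation_errors(child))
    return errors

valid = ET.fromstring(
    "<Book><Chapter><Section><Paragraph>Hi</Paragraph></Section></Chapter></Book>"
)
# Well formed, but a Paragraph directly inside a Book breaks the rules.
invalid = ET.fromstring("<Book><Paragraph>Hi</Paragraph></Book>")

print(validation_errors(valid))
print(validation_errors(invalid))
```

Both documents are well formed; only the first is valid against this particular set of rules.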
Different industrial applications have specified different things in their DTDs. The aerospace industry needs document structures like maintenance steps, safety checks, part numbers, and so on. The pharmaceutical industry needs ingredient lists and contraindications and disclaimers and the like. More generic documentation-oriented DTDs like TEI and DocBook specify structures like lists and subsections and quotations and so on.
The web's native markup language—HTML—contains a set of generic markup structures for prose text: paragraphs, headings, lists, tables. It also contains a structure for cross-reference links across the Internet. That's what the HT refers to: HyperText.
While the upper-level semantics of different DTDs vary widely, the basic encoding for the building blocks of prose is actually pretty generic. For this reason, many DTDs created over the past decade have simply incorporated HTML, in its more formal XML version (XHTML), directly. ePub is a good example: it incorporates XHTML wholesale, merely adding an organizational wrapper around the content.
The implication is this: for the vast majority of your text-markup needs, HTML is enough. It's not the most semantically rich markup out there, but recall that semantic richness is expensive, so you have to carefully consider the ROI for going beyond it. For most trade applications, which revolve around content management and multi-mode output, the generic structures in XHTML are enough.
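The "wrapper" can be sketched in a few lines, assuming Python's standard zipfile module. This is an illustration of the container's shape only: the file names chapter1.xhtml and content.opf, the title, and the identifier are made up, and the sketch leaves out the navigation file that a complete ePub also carries, so it should not be expected to pass an ePub validator:

```python
import io
import zipfile

# The content itself: an ordinary XHTML page.
CHAPTER = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><h1>Chapter 1</h1><p>The quick brown fox jumped over...</p></body>
</html>
"""

# Wrapper file 1: tells the reading system where the package file is.
CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Wrapper file 2: metadata, a manifest of the files, and the reading order.
PACKAGE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="id">example-0001</dc:identifier>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
"""

def build_epub() -> bytes:
    """Zip the XHTML page together with the wrapper files, in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as epub:
        # The mimetype entry comes first and is stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        for name, text in [("META-INF/container.xml", CONTAINER),
                           ("content.opf", PACKAGE),
                           ("chapter1.xhtml", CHAPTER)]:
            epub.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

print(zipfile.ZipFile(io.BytesIO(build_epub())).namelist())
```

Everything a reader actually sees lives in the XHTML file; the other entries are the binding.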
The huge benefit in using XHTML is the ubiquity of free, robust tools that are capable of handling it. Tools you are probably already using!
The Quick Way to XML
XHTML – the XML the web is made of.
It's real XML, even if the web is a mess!
It can be semantically rich, valid, powerful.
ePub is XHTML...
This is a bit iconoclastic. XML purists have always looked down on the web and HTML. I used to be one of those people. This is because the web is an absolute mess of invalid, malformed, abused, and misused markup.
But that's not the point. The XML technologies that underlie the web and web tools are actually perfectly useable if you take them seriously.
It just requires a bit of a re-orientation.
SFU Publishing’s R&D work
A web-first workflow
Using webCMS tools for content management, access, versioning
Integration with Adobe CS4 for print output
Over the past year at the Canadian Centre for Studies in Publishing, we've been fleshing out an end-to-end editorial and production workflow based entirely on web technologies.
It uses generic web Content Management tools for storage, access, versioning, and so on. We're using a wiki, but you could potentially use Drupal or WordPress in the same way.
We use the TinyMCE editing environment, made a little bit smarter by virtue of a CSS stylesheet with some "semantic" elements defined.
We've prototyped a pathway to transform (that's a fancy word for “convert”) the XHTML content to Adobe CS4's open IDML file format. That means we go from the web to InDesign, rather than the other way around—this works about 400% better than the usual way people do it, because you're going from a clean source to an output format.
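The general shape of such a transformation can be sketched as follows, assuming Python's standard xml.etree.ElementTree module. This is not the prototype described above and the output is not real IDML: the Story and Paragraph element names, the style names, and the "epigraph" class are simplified stand-ins invented for illustration:

```python
import xml.etree.ElementTree as ET

# XHTML as a web editor might save it. The class name is a made-up
# example of a "semantic" style defined in a stylesheet.
XHTML = """
<body>
  <h1>Chapter One</h1>
  <p class="epigraph">It was a dark and stormy night.</p>
  <p>The quick brown fox jumped over...</p>
</body>
"""

# Hypothetical mapping from (tag, class) to a print paragraph style name.
STYLE_MAP = {
    ("h1", None): "ChapterTitle",
    ("p", "epigraph"): "Epigraph",
    ("p", None): "BodyText",
}

def transform(xhtml: str) -> ET.Element:
    """Convert each XHTML block into a styled paragraph for a layout tool."""
    story = ET.Element("Story")
    for block in ET.fromstring(xhtml):
        style = STYLE_MAP[(block.tag, block.get("class"))]
        paragraph = ET.SubElement(story, "Paragraph", style=style)
        paragraph.text = block.text
    return story

story = transform(XHTML)
print(ET.tostring(story, encoding="unicode"))
```

The mapping table is where the editorial decisions live: each structural element in the clean source is assigned exactly one print style, which is why going from web to print is the easy direction.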
The Take-aways:
XML doesn't have to be complicated
XML doesn't have to be expensive
The web is already XML, which means you're already doing it.
XML is not rocket science. Everyone is talking about it these days, but there's very little practical advice going around. My intent is to give publishers—especially small and medium-sized firms—some practical advice on at least getting started with it.
Contact me!
John Maxwell
Canadian Centre for Studies in Publishing
Simon Fraser University
http://thinkubator.ccsp.sfu.ca/wikis/xmlProduction
jmax @ sfu.ca
twitter: @jmaxsfu