SGML to XML migration of the doc tree

I’m the type of guy who doesn’t like talking much and prefers to work on something silently instead. I think that is a pity, because I have some interesting projects in my queue, although with little free time I’m progressing slowly. So I’ve decided to blog more about the projects that I work on or am interested in. This is the first entry in this series.

It has been quite some time since I started converting the doc tree from SGML to XML, but because of (1) a lack of free time, (2) the doc repo being a fast-moving target and (3) the lack of a good VCS that supports moves, renames and good branching, it has never been completed. I have always kept it at a nearly complete stage while trying to keep up with merging upstream changes. Now that (3) is resolved, it will be easier to create a development branch and keep it in sync until the work is finished and can be merged back. Also, more people have become interested, which motivates me to dedicate more time to finally finishing this, and they may help out as well. In the following, I’ll try to summarize what this migration consists of.

First, I would like to emphasize that this change won’t make a big difference to doc committers: XML is actually a subset of SGML, so there will only be minor changes in the markup, and the change will mostly affect the toolchain and the generated output. Let’s look at the characteristics of SGML and XML to better understand what this means.

SGML
  • SGML stands for Standard Generalized Markup Language and it is the parent of XML. It is quite old, and when it was introduced, (1) document size mattered and (2) there was little experience in the field to rely on. As a consequence, SGML is a real beast: it supports many more features than XML, but those features do not matter for today’s uses. For example, SGML allows start and end tags to use a syntax other than the usual <> brackets, sometimes allows end tags to be omitted, and lets us abbreviate <foo>bar</foo> as <foo>bar</> or even <foo/bar/. Nowadays, these extra bytes are not expensive, and the doc tree mostly uses the commonly known syntax even though we could still abbreviate.
  • SGML, being so complex with its great many features, requires complex processing software. Because of this complexity, there are few open source choices out there and their development has been discontinued.
  • The DocBook schema that we use was originally developed for SGML, but the SGML version is discontinued and newer versions use XML technologies.
  • Rendering SGML documents is done with the DSSSL standard, which is also very old and complex. The open source choices are limited here as well, namely Jade and its fork OpenJade. DocBook provided DSSSL stylesheets but, not surprisingly, they are also discontinued.
  • As a result of using this old software with old stylesheets, we have several unresolved rendering issues in the documentation. Sometimes parts are missing from the RTF version or lines run off the page in PDF. Such PRs have been sitting in GNATS for a long time.
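
To make the markup shorthands mentioned above a bit more concrete, here is a minimal sketch. It assumes Python and its standard xml.etree.ElementTree parser, neither of which is part of the doc toolchain; I only use them because an XML parser is easy to get hold of. The omitted-end-tag sample and the para element name are made up for illustration. The fully written-out form parses as XML, while the SGML-only abbreviations are rejected, which is why the migration needs some minor markup clean-up.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Illustrative samples only; the shorthand forms are the ones named in the list above.
samples = {
    "full": "<foo>bar</foo>",                      # ordinary start and end tags
    "empty_end_tag": "<foo>bar</>",                # SGML empty end tag
    "null_end_tag": "<foo/bar/",                   # SGML null end tag
    "omitted_end_tag": "<para>first<para>second",  # end tags left out
}

results = {name: is_well_formed_xml(text) for name, text in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'} by the XML parser")
```

Only the first sample is accepted. An SGML parser with a suitable declaration and DTD could accept the others, but an XML parser does not need to know about any of those rules.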

XML
  • XML stands for eXtensible Markup Language and was created to simplify SGML so that its processing could also be simplified. It places stricter requirements on the document to keep this simplicity. However, an XML document is still perfectly valid SGML.
  • There are various XML parsers out there in several programming languages, which gives us more choices. These pieces of software are still being developed and XML is still widely used.
  • Recent versions of DocBook are based on XML and there are different schema standards that can be used for validation.
  • Transforming XML documents to plain text, HTML or other XML documents can practically be done with the XSLT standard. As opposed to DSSSL, it is a widely supported standard with various XSLT processors out there. However, XSLT is more limited than DSSSL because it is actually a transformation language, not a rendering standard.
  • DocBook has excellent XSLT stylesheets that are developed together with the schema and support more output formats than are currently available to us. For example, they support the popular EPUB format and there is beta support for HTML5.
  • Because XSLT is a transformation language, it cannot directly produce e.g. PDF or RTF output. That has to be done with another standard: XSL-FO. It is an XML-based typesetting language that can be rendered into various formats, depending on the capabilities of the chosen XSL-FO processor. First, the XSL-FO document is generated from the source document with an XSLT stylesheet, and then it is further processed by an XSL-FO processor. Unfortunately, there are only two open source processor choices out there: xmlroff and Apache FOP. Apache FOP generates excellent quality output but it is Java-based, while xmlroff is written in C but lacks quite a few features. This problem still needs to be solved, or as a last resort we can keep using the DSSSL stylesheets with (Open)Jade until a better solution emerges. But even if we don’t build Apache FOP PDFs officially, it would be nice to have the option to generate them, since they are much better looking and more readable than what the current solution produces.
  • XML has many related standards, like XLink, which is already used in DocBook 5.0 to support advanced linking. The rest are not planned to be used for any feature yet, but there are many open opportunities.
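
The two-step PDF route described in the list above (XSLT first, then an XSL-FO processor) can be sketched roughly as follows. This is only an illustration in Python, not how the doc build will actually be wired up. It assumes xsltproc as the XSLT processor and Apache FOP as the XSL-FO processor, and the file names book.xml, book.pdf and the stylesheet path fo/docbook.xsl are placeholders. The sketch only builds and prints the two command lines; actually running them requires both tools to be installed.

```python
import shlex
from pathlib import Path
from typing import List


def fo_pipeline_commands(source: str, fo_stylesheet: str, pdf: str) -> List[List[str]]:
    """Build the two commands of an XML -> XSL-FO -> PDF pipeline.

    Assumes xsltproc and Apache FOP; all paths are supplied by the caller.
    """
    # The intermediate XSL-FO file sits next to the PDF, with a .fo suffix.
    fo_file = str(Path(pdf).with_suffix(".fo"))
    return [
        # Step 1: XSLT turns the source document into an XSL-FO document.
        ["xsltproc", "--output", fo_file, fo_stylesheet, source],
        # Step 2: the XSL-FO processor renders the XSL-FO document as PDF.
        ["fop", "-fo", fo_file, "-pdf", pdf],
    ]


# Placeholder file names, for illustration only.
commands = fo_pipeline_commands("book.xml", "fo/docbook.xsl", "book.pdf")

for command in commands:
    print(shlex.join(command))
```

Swapping in a different XSL-FO processor would only change the second command, which is the nice part of keeping the transformation and the rendering as separate steps.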

I hope I haven’t left anything important out. Comments and questions are welcome.