MathML Workflows
in STM Publishing

Paul Topping,
Design Science, Inc.
http://www.dessci.com
Updated February 2006

Introduction

One of the chief characteristics of STM (Scientific, Technical, Medical) content is the presence of mathematical equations. Traditionally, equations have been represented in workflows using a variety of formats: TeX, MathType, PowerMath, and the math parts of various DTDs, to name a few. Now that many publishers are moving to a production system built around a central XML-based repository, describing mathematics using MathML, the XML-based language for mathematics, is a better choice. In addition, support for MathML is virtually a requirement for targeting the new HTML+MathML delivery medium for publishing web content containing math. (To learn more about HTML+MathML, please see our companion paper, "MathML Adds Value to STM Publishing" [1].)

Once MathML has been chosen as the format-of-choice for the central repository, various MathML tools come into play at different processing stages:

This paper discusses what tools are available now and in the future for MathML processing in production systems built around a central XML-based repository. A few points should be made regarding the use of terminology in this paper. What we refer to as the "central XML repository" can be a content management system (CMS), a database, or some other kind of storage system. And although XML is the most desirable form for content in the repository, it will likely also contain other data formats such as sound, video, and graphics.

The Case for MathML

XML favors structure and meaning over formatting

Today, many publishers are moving toward representing as much of their content in XML as possible. There are several advantages to putting XML at the center of the publishing enterprise:

The importance and advantage of XML in publishing workflows has been widely accepted in the publishing community, especially with STM publishers.

MathML is simply math in XML

The World Wide Web Consortium (W3C) [2] sets most of the standards for the web and XML technology. In 1997, the W3C's Math Working Group finished the MathML 1.0 Specification (superseded in 2001 by MathML 2.0 [3]). To learn more about MathML, visit the W3C's math home page [4] or see our articles, "MathML for Math and Science Communication" [5] and "A Gentle Introduction to MathML" [6].

Like all XML-based languages, MathML consists of tagged text. For example, the fraction, x/2, is represented in MathML as:

<math>
  <mfrac>
    <mi>x</mi>
    <mn>2</mn>
  </mfrac>
</math>

Note that this example does not include formatting — no font or point size is specified for 'x' or '2'. Although attributes can be set in MathML to specify formatting, they would normally be omitted for MathML sitting in the content repository. Formatting attributes are normally unnecessary even in final output as the font and point size of math defaults to that of the surrounding text in accordance with technical publishing convention.
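As a minimal sketch of how a repository might enforce this convention, the following Python fragment parses the fraction above and reports any formatting attributes it finds. The attribute names are a small subset of those defined by MathML 2.0; the function name and the choice of which attributes to flag are our own assumptions for illustration, not part of any particular product.

```python
import xml.etree.ElementTree as ET

# The fraction x/2 from the text, with no formatting attributes.
FRACTION = """
<math>
  <mfrac>
    <mi>x</mi>
    <mn>2</mn>
  </mfrac>
</math>
"""

# Illustrative subset of MathML 2.0 presentation attributes that a
# repository policy might choose to flag (assumption, not a standard list).
FORMATTING_ATTRS = {"mathsize", "mathvariant", "mathcolor", "mathbackground"}


def formatting_attributes(root):
    """Return (tag, attribute) pairs for every flagged attribute found."""
    found = []
    for elem in root.iter():
        for name in elem.attrib:
            if name in FORMATTING_ATTRS:
                found.append((elem.tag, name))
    return found


plain = ET.fromstring(FRACTION)
styled = ET.fromstring('<math><mi mathvariant="bold">x</mi></math>')

print(formatting_attributes(plain))   # nothing to report for the plain fraction
print(formatting_attributes(styled))  # the bold variable is flagged
```

A check of this kind would typically run when content enters the repository, so that formatting decisions stay in the back-end stylesheets.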

MathML vs. other math formats

STM publishers have been dealing with math for a long time and many computer representations of mathematics have been invented and are still in use within publishing workflows. However, these formats do not fit well into the XML-based publishing scenario. Here are some reasons why MathML is a better choice:

Tools for working with MathML

Authoring Tools

Although there may be other ways of authoring documents containing mathematical notation, some tools are preferred when MathML is required. Here are some details on the available tools:

Microsoft Word + Design Science MathType

In a survey we did a few years ago, we discovered that about 70% of manuscript submissions to STM publishers were in Microsoft Word's document format with equations produced with Design Science's MathType [7] (or the junior version of MathType, called Equation Editor, that comes in the box with Word). This surprises some in the TeX community, but TeX is really dominant only in mathematics and physics research. In other fields of engineering, science, and education, TeX is not so prevalent.

MathType 5 has the ability to convert equations to MathML. This can be done one equation at a time, but usually an entire Word document is converted to HTML+MathML using MathType's MathPage feature. A web page produced in this manner can be published directly to the web, but in an XML-based production system a transformation could be applied to the page to get its content into the repository. MathType's conversion capabilities are also available through a scripting API, which can be used to automate conversion of equations to MathML in a workflow setting.
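One possible first step in such a transformation is to pull the MathML islands out of the converted page. The sketch below assumes the page is well-formed XHTML with equations in the standard MathML namespace; the sample page and the function name are hypothetical, and real converter output may need cleanup before it parses this way.

```python
import copy
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
XHTML_NS = "http://www.w3.org/1999/xhtml"
MATH_TAG = f"{{{MATHML_NS}}}math"

# Hypothetical stand-in for a converted HTML+MathML page.
PAGE = f"""<html xmlns="{XHTML_NS}">
  <body>
    <p>Half of x is
      <math xmlns="{MATHML_NS}"><mfrac><mi>x</mi><mn>2</mn></mfrac></math>.
    </p>
  </body>
</html>"""


def extract_equations(xhtml_text):
    """Return a standalone copy of every MathML <math> element in the page."""
    root = ET.fromstring(xhtml_text)
    equations = []
    for math in root.iter(MATH_TAG):
        eq = copy.deepcopy(math)
        eq.tail = None  # drop the page text that followed the equation
        equations.append(eq)
    return equations


equations = extract_equations(PAGE)

# Serialize with MathML as the default namespace so no prefixes appear.
ET.register_namespace("", MATHML_NS)
for eq in equations:
    print(ET.tostring(eq, encoding="unicode"))
```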

Alternatively, another product, such as eXtyles (see below), can be used to simultaneously convert the Word document's text to XML and convert the MathType equations to MathML. The conversion to XML of the non-math portion of the document involves mapping Word styles to elements in a DTD.

HTML and XML editors + Design Science MathFlow or WebEQ Editors

If document input is to be done by in-house personnel or by a service bureau, rather than by scientists or engineers, it probably makes sense to use an XML editor, such as JustSystem's XMetaL [8] or PTC's Arbortext Editor [9], with a MathML editor, such as Design Science's MathFlow Editor [10], for the math. The MathFlow Editor integrates with XMetaL and Arbortext Editor, so that one can click on an Insert Equation toolbar icon (or use a keystroke shortcut) to open a new MathFlow Editor window that will save its results directly to the XML document. Similarly, if the user double-clicks on an existing equation in the XML document, a MathFlow window will open to allow the equation to be edited. The user experience is similar to that of using MathType with Microsoft Word.

For other XML editors, such as Altova's XMLSpy [11], the user must currently develop MathML in a stand-alone MathML editor, such as Design Science's WebEQ Editor [12], and then copy it to the clipboard (or save it in a file), switch to the XML editor, and then paste (or import the file). However, we are continuing to develop closer interfaces between the Design Science editors and other popular XML and HTML editors, in addition to XMetaL and Arbortext Editor.

Which editor is best for working with MathML: MathFlow, WebEQ or MathType?

We at Design Science often get asked this question. Many people are familiar with MathType's user interface due to its popularity and association with Microsoft Word. However, MathType is not a true MathML editor as it contains features that have no analog in MathML. MathType produces MathML as a translation process and, like most translations, the results are not 100% equivalent to the original. MathType is also only capable of producing Presentation MathML, rather than Content MathML.

The MathFlow and WebEQ Editors, on the other hand, are MathML editors from the start — they represent the equation internally in MathML and read and save in MathML. Although they are not as well known as MathType and their user interfaces are not quite as powerful, we believe they are the right editors to use when MathML is a primary requirement. We will be enhancing the WebEQ and MathFlow user interfaces in future versions to bring them more in line with MathType's.

TeX/LaTeX

Most TeX users author in their favorite text editor. As TeX must always be converted to MathML, there's not much to say about the authoring process for TeX except perhaps that such authors should consider MacKichan's Scientific Word and Workplace as these are TeX-oriented authoring systems that can save as MathML.

The mathematics representation of both TeX and LaTeX is equivalent to Presentation MathML. Except for trivial mathematics, conversion to Content MathML is problematic.
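To illustrate why only trivial cases convert cleanly, here is a minimal sketch that turns a Presentation MathML fraction of two simple tokens into Content MathML (`apply`, `divide`, `ci`, `cn`). The function names are our own, and the sketch deliberately refuses anything more complex: a fraction such as dy/dx has the same `mfrac` layout but denotes a derivative rather than a division, which cannot be decided from the notation alone.

```python
import xml.etree.ElementTree as ET

# Presentation token -> Content token, valid only for simple leaves.
TOKEN_MAP = {"mi": "ci", "mn": "cn"}


def token_to_content(elem):
    """Map <mi> to <ci> and <mn> to <cn>; reject anything else."""
    if elem.tag not in TOKEN_MAP or len(elem) != 0:
        raise ValueError(f"cannot interpret <{elem.tag}> without more context")
    out = ET.Element(TOKEN_MAP[elem.tag])
    out.text = (elem.text or "").strip()
    return out


def fraction_to_content(mfrac):
    """Interpret an <mfrac> of two tokens as a division."""
    if mfrac.tag != "mfrac" or len(mfrac) != 2:
        raise ValueError("expected an <mfrac> with exactly two children")
    apply_elem = ET.Element("apply")
    ET.SubElement(apply_elem, "divide")
    apply_elem.append(token_to_content(mfrac[0]))
    apply_elem.append(token_to_content(mfrac[1]))
    return apply_elem


presentation = ET.fromstring("<mfrac><mi>x</mi><mn>2</mn></mfrac>")
content = fraction_to_content(presentation)
print(ET.tostring(content, encoding="unicode"))
```

For x/2 the result is an `apply` element containing `divide`, `ci`, and `cn` in that order. Passing a fraction whose numerator is an `mrow` raises an error, which is the honest outcome when the meaning is ambiguous.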

MacKichan's Scientific Word and Workplace [13]

These products are comprehensive systems for authoring scientific and technical documents. As of version 4.0, users can output an HTML page with MathML equations.

Computer Algebra Systems

Computer algebra systems, like Waterloo Maple's Maple [14] and Wolfram Research's Mathematica [15], are built around a "notebook" interface and contain conversion to and from MathML. This should make it possible for a scientist or engineer to do their basic research and documentation within the interactive environment of these products. They can then submit such a notebook document to the publisher for conversion and integration into the XML repository. In fact, Mathematica version 4.2's notebook files are XML-based so, in theory, an XML-to-XML transformation is all that would be needed to incorporate a notebook into a publisher's XML repository.

Front-end Tools

Front-end tools are responsible for preparing and converting documents from their authored format into XML and placing them into the central repository. Conversion is a complex subject for which there are many tools.

XSLT processors and other text-to-text conversion tools

In the XML world, XSLT processors [16] are an important tool for converting one form of HTML or XML to another. Many other tools can also be part of the toolbox, such as Perl scripts and Stilo's OmniMark [17] programs.

Inera's eXtyles [18]

This product converts Microsoft Word documents to XML. It uses a set of rules to insert XML tags into the original document as well as applying cleanup processing. Although there are many tools that will aid in conversion of Word documents to XML, we mention this one because it makes use of MathType to convert equations in the document to MathML.

TeX4ht [19]

TeX4ht is a free software package for converting TeX and LaTeX documents into various HTML and XML formats. It is highly configurable and supports conversion of equations into MathML. A number of other projects are also working to develop conversion software for TeX and LaTeX.

Maintenance Tools

Once MathML instances are placed in the central XML repository, it is sometimes necessary to make changes to them in order to fix bugs and to enforce design rules. Our MathFlow Editor is the right tool to use for such tasks.

Validation is an important task that checks XML against the DTD (Document Type Definition) that defines the elements and attributes allowed. This is done using XML parsing and validation tools. Since the MathML language is defined by the MathML DTD [20], MathML instances can be validated using the same techniques.
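The following sketch shows the general shape of such a check. Note the assumptions: Python's standard-library parser tests well-formedness only and does not validate against a DTD, so the element whitelist below is a simplified stand-in that covers a handful of Presentation MathML elements. A production workflow would use a validating parser with the actual MathML DTD.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of Presentation MathML element names.
# The real DTD defines many more elements, plus attributes and content models.
ALLOWED = {"math", "mrow", "mi", "mn", "mo", "mfrac", "msup", "msub", "msqrt"}


def check_instance(xml_text):
    """Return a list of problems; an empty list means this simplified check passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    for elem in root.iter():
        if elem.tag not in ALLOWED:
            problems.append(f"unexpected element <{elem.tag}>")
    return problems


print(check_instance("<math><mfrac><mi>x</mi><mn>2</mn></mfrac></math>"))
print(check_instance("<math><frac><mi>x</mi><mn>2</mn></frac></math>"))
print(check_instance("<math><mi>x</math>"))
```

The first call reports no problems, the second flags the misspelled `frac` element, and the third reports that the input is not well-formed.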

Back-end Tools

Preparing content for a particular publication is called "back-end processing". It involves one or more of the following tasks:

Output to HTML+MathML

If the output medium for the publication is HTML+MathML, the conversion is fairly simple. XSLT can be used to transform the repository XML items into HTML elements. The MathML items can be left untouched as they will be rendered by the web browser. (To learn more about HTML+MathML, please see Online Delivery, below, and our companion paper, "MathML Adds Value to STM Publishing" [1].)
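The idea can be sketched in a few lines. We use Python's ElementTree here rather than XSLT so that the example is self-contained; the repository vocabulary (`article`, `title`, `para`) and its mapping to HTML are hypothetical. The point to notice is that the MathML subtree is copied through unchanged.

```python
import copy
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
MATH_TAG = f"{{{MATHML_NS}}}math"

# Hypothetical repository element names mapped to HTML element names.
TAG_MAP = {"article": "body", "title": "h1", "para": "p"}

REPOSITORY_ITEM = f"""<article>
  <title>Halving</title>
  <para>Half of x is <math xmlns="{MATHML_NS}"><mfrac><mi>x</mi><mn>2</mn></mfrac></math>.</para>
</article>"""


def to_html(elem):
    """Rename repository elements recursively; copy MathML subtrees as they are."""
    if elem.tag == MATH_TAG:
        return copy.deepcopy(elem)  # the copy keeps the text that follows the equation
    out = ET.Element(TAG_MAP[elem.tag])  # unknown elements raise KeyError in this sketch
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_html(child))
    return out


html = ET.Element("html")
html.append(to_html(ET.fromstring(REPOSITORY_ITEM)))

# Give MathML a readable prefix when serializing the mixed document.
ET.register_namespace("m", MATHML_NS)
print(ET.tostring(html, encoding="unicode"))
```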

Output to print or PDF

If the output medium is PDF or PostScript for print, the process is a bit more involved. One route is with XyEnterprise's XPP product [21], which combines a content management system with PDF and print output processing and MathML support. Another route is to use PTC's Arbortext Advanced Print Publisher - Desktop software. MathFlow adds MathML composition capability to Arbortext Advanced Print Publisher - Desktop.

XSL-FO Renderers + Design Science MathPlayer technology

Another possible route to print or PDF is by converting the repository XML to XSL-FO, using XSLT or other text-to-text conversion tools. XSL-FO (FO = Formatting Objects) [22] is an XML-based language for describing the layout of paginated documents. Several tools are available that will convert an XSL-FO description of a document to PostScript or PDF, including XEP from RenderX [23] and XSL Formatter from Antenna House [24]. Unfortunately, XSL-FO itself does not support mathematical notation. However, Design Science is working with vendors of XSL-FO formatters to combine our MathPlayer MathML formatting technology with their products to create a comprehensive output solution.
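XSL-FO does provide a general hook for content in other XML vocabularies, the `fo:instream-foreign-object` element. The sketch below wraps a MathML equation in that element while turning a hypothetical repository `para` into an `fo:block`. It produces a fragment only (no `fo:root` or page masters), and whether the embedded MathML is actually rendered depends entirely on the formatter in use, so treat this as an illustration of the structure rather than a working print pipeline.

```python
import copy
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
MATH_TAG = f"{{{MATHML_NS}}}math"

ET.register_namespace("fo", FO_NS)
ET.register_namespace("m", MATHML_NS)

# Hypothetical repository paragraph containing one equation.
PARA = f"""<para>Half of x is <math xmlns="{MATHML_NS}"><mfrac><mi>x</mi><mn>2</mn></mfrac></math>.</para>"""


def fo(name):
    """Build a namespace-qualified XSL-FO element name."""
    return f"{{{FO_NS}}}{name}"


def paragraph_to_block(para):
    """Turn a <para> into an fo:block, wrapping each equation as a foreign object."""
    block = ET.Element(fo("block"))
    block.text = para.text
    for child in para:
        if child.tag != MATH_TAG:
            raise ValueError(f"this sketch only handles math children, got <{child.tag}>")
        wrapper = ET.SubElement(block, fo("instream-foreign-object"))
        math = copy.deepcopy(child)
        # Text after the equation belongs to the block, not inside the wrapper.
        wrapper.tail, math.tail = math.tail, None
        wrapper.append(math)
    return block


block = paragraph_to_block(ET.fromstring(PARA))
print(ET.tostring(block, encoding="unicode"))
```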

Online Delivery

As we explained in our companion paper, "MathML Adds Value to STM Publishing" [1], recent advances in web browser technology have enabled MathML to be part of an online delivery solution. In that paper, we explain how using HTML+MathML as an online delivery medium allows an STM publisher to add value to their products by making them more useful to scientists, engineers, and educators.

We will briefly list the online delivery options involving MathML here:

Microsoft Internet Explorer + Design Science MathPlayer

MathPlayer [25] is Design Science's free MathML display engine for Microsoft's Internet Explorer (IE) [26] web browser. It currently requires IE 6.0 for Windows or later. Design Science is committed to making MathPlayer available free-of-charge as a service to the community. Several hundred thousand copies of MathPlayer have been downloaded at the time of this writing. The current version, 2.0, adds sophisticated accessibility capabilities for mathematics, as described in "MathML Adds Value to STM Publishing" [1].

Netscape 7 / Firefox / Mozilla

Netscape 7 and related browsers include native MathML display support. As of this writing, Netscape 7.1 [27] is available for free download. This browser is based on the Mozilla [28] open-source browser project.

Design Science WebEQ Input Control

Most STM publications do not currently involve dynamic math, where the reader interacts with the web page. As STM publishers think more about adding value to their online content, however, we expect publications to become more dynamic. Our WebEQ Input Control [29] is a simplified version of our WebEQ Editor specifically designed to be embedded in a web page so that the reader can enter mathematical formulas within an interactive page. Applications include web pages that demonstrate specific concepts by letting the student manipulate, graph, and calculate, as well as online evaluation and testing.

Conclusions

Although some of the pieces are still in development, MathML is already a viable choice for representing mathematics in STM publishing. MathML is the natural choice of math representation when working with XML-based workflows. 

References

  1. "MathML Adds Value to STM Publishing", http://www.dessci.com/en/reference/white_papers/mathml_adds_value.htm
  2. World Wide Web Consortium (W3C), http://www.w3.org
  3. MathML 2.0 Specification, http://www.w3.org/TR/2001/REC-MathML2-20010221
  4. W3C's math home page, http://www.w3.org/Math
  5. "MathML for Math and Science Communication", http://www.dessci.com/en/reference/webmath/tech/mathml.htm
  6. "A Gentle Introduction to MathML", http://www.dessci.com/en/reference/mathml/default.htm
  7. Design Science's MathType, http://www.dessci.com/en/products/mathtype
  8. JustSystem's XMetaL, http://na.justsystems.com/content-xmetal
  9. PTC's Arbortext Editor, http://www.ptc.com
  10. Design Science's MathFlow Editor, http://www.dessci.com/en/products/mathflow/
  11. Altova's XMLSpy, http://www.xmlspy.com
  12. Design Science's WebEQ Editor, formerly http://www.dessci.com/en/products/webeq/ (since replaced by MathFlow Components)
  13. MacKichan's Scientific Word and Workplace, http://www.mackichan.com
  14. Waterloo Maple's Maple, http://www.maplesoft.com
  15. Wolfram Research's Mathematica, http://www.wolfram.com
  16. XSLT, http://www.w3.org/Style/XSL/
  17. Stilo's OmniMark, http://www.stilo.com
  18. Inera's eXtyles, http://www.inera.com
  19. TeX4ht, http://www.cse.ohio-state.edu/~gurari/TeX4ht/
  20. MathML DTD, http://www.w3.org/TR/MathML2/appendixa.html
  21. XyEnterprise's XPP, http://www.xyenterprise.com/products/xpp.html
  22. XSL-FO (FO = Formatting Objects), http://www.w3.org/Style/XSL/
  23. RenderX's XEP, http://www.renderx.com/
  24. Antenna House's XSL Formatter, http://www.antennahouse.com/
  25. Design Science's MathPlayer, http://www.dessci.com/en/products/mathplayer/
  26. Microsoft's Internet Explorer, http://www.microsoft.com/windows/ie/default.asp
  27. Netscape 7, http://channels.netscape.com/ns/browsers/default.jsp
  28. The Mozilla Project, http://www.mozilla.org
  29. Design Science's WebEQ Input Control, formerly http://www.dessci.com/en/products/webeq/ (since replaced by MathFlow Components)