in STM Publishing

Paul Topping,

Design Science, Inc.

http://www.dessci.com

Updated February 2006

One of the chief characteristics of STM (Scientific, Technical, Medical) content is the presence of mathematical equations. Traditionally, equations have been represented in workflows using a variety of formats: TeX, MathType, PowerMath, and the math parts of various DTDs, to name a few. Now that many publishers are moving to a production system built around a central XML-based repository, describing mathematics using MathML, the XML-based language for mathematics, is a better choice. Moreover, MathML support is effectively a requirement for targeting HTML+MathML, the new delivery medium for publishing web content containing math. (To learn more about HTML+MathML, please see our companion paper, "MathML Adds Value to STM Publishing" [1].)

Once MathML has been chosen as the format-of-choice for the central repository, various MathML tools come into play at different processing stages:

- **Authoring:** Although STM publishers will probably have to deal with a variety of document formats for the foreseeable future, authoring tools already exist that allow MathML to be entered directly or contain built-in conversions to MathML, thereby simplifying the front-end processing needed to get such content into the central repository.
- **Front-end processing:** Authors submit documents in various formats that, in general, contain mathematical equations. Conversion tools must be used to convert the document into XML in a form suitable for the central repository. The most common formats for mathematical equations are TeX/LaTeX and Design Science MathType/Equation Editor equations that appear in Microsoft Word and other word processing documents. Tools are required to convert these math formats into MathML.
- **Maintenance:** Maintenance of the central repository involves the use of XML Editors, such as JustSystem's XMetaL and PTC's Arbortext Editor. For MathML items, this requires the use of MathML editors, such as Design Science's MathFlow products, which add MathML editing capability to XMetaL and Arbortext Editor.
- **Back-end processing:** When a particular publication is to be produced, content items must be chosen from the central repository, style rules applied, and layout and pagination added. For MathML, tools are required to format the mathematics.
- **Online Delivery:** The choice of HTML+MathML for web publication allows the MathML formatting step to occur in the reader's browser.

This paper discusses what tools are available now and in the future for MathML processing in production systems built around a central XML-based repository. A few points should be made regarding the use of terminology in this paper. What we refer to as the "central XML repository" can be a content management system (CMS), a database, or some other kind of storage system. And although XML is the most desirable form for content in the repository, it will likely also contain other data formats such as sound, video, and graphics.

Today, many publishers are moving toward representing as much of their content in XML as possible. There are several advantages to putting XML at the center of the publishing enterprise:

- Publishers want to accept authors' work in many formats and, simultaneously, want to deliver content in several forms. If there are N input forms and M output forms, this potentially requires N x M conversions. By choosing to create a central repository using a single format, only N + M conversions are needed.
- Part of the XML philosophy is to retain the meaning and structure of content in the central repository, rather than its formatting. Layout and formatting are applied when extracting content from the central repository for delivery in a particular medium. This makes every content item in the repository available for use in any publishing project.
- Because almost all content is represented using a single, relatively simple, format, fewer tools are required for creating, maintaining, and editing the content. This reduces the cost of tool acquisition and the training to use those tools.
- The use of XML to represent all text-based content items makes searching and categorizing content much simpler. Such metadata applications are increasing in importance as the marketplace demands that electronic documents add value over print documents.
- XML is a safe archival format as it reduces dependency on proprietary formats and software tools whose lifetime may be shorter than that of the content.
- Because XML text uses the Unicode encoding, translation, localization, and internationalization are made simpler and more reliable.
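The conversion-count argument in the first bullet above can be checked with a little arithmetic. The following is a minimal sketch; the format names are hypothetical and chosen only for illustration, not taken from any particular publisher's workflow.

```python
from itertools import product

# Hypothetical input and output formats, chosen only for illustration.
inputs = ["Word+MathType", "LaTeX", "Mathematica notebook"]
outputs = ["HTML+MathML", "PDF", "PostScript", "plain HTML"]

# Without a central format, every input needs its own converter to every output.
direct = list(product(inputs, outputs))

# With a central XML repository, each input is converted once into XML
# and each output is produced once from XML.
via_hub = [(src, "XML") for src in inputs] + [("XML", dst) for dst in outputs]

print(f"direct converters: {len(direct)}")    # N x M
print(f"via repository:    {len(via_hub)}")   # N + M
```

With three input forms and four output forms, the direct approach needs 12 converters and the repository approach needs 7. The gap widens as either N or M grows.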

The importance and advantages of XML in publishing workflows have been widely accepted in the publishing community, especially among STM publishers.

The World Wide Web Consortium (W3C) [2] sets most of the standards for the web and XML technology. In 1997, the W3C's Math Working Group finished the MathML 1.0 Specification (superseded in 2001 by MathML 2.0 [3]). To learn more about MathML, visit the W3C's math home page [4] or see our articles, "MathML for Math and Science Communication" [5] and "A Gentle Introduction to MathML" [6].

Like all XML-based languages, MathML consists of tagged text. For example, the fraction, *x*/2, is represented in MathML as:

<math> <mfrac> <mi>x</mi> <mn>2</mn> </mfrac> </math>

Note that this example does not include formatting — no font or point size is specified for 'x' or '2'. Although attributes can be set in MathML to specify formatting, they would normally be omitted for MathML sitting in the content repository. Formatting attributes are normally unnecessary even in final output as the font and point size of math defaults to that of the surrounding text in accordance with technical publishing convention.
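Because MathML is ordinary tagged text, any XML parser can read it. As a minimal sketch, the snippet below parses the fraction example with Python's standard `xml.etree.ElementTree` module and lists each element with its attributes. The variable names are ours, and the MathML namespace declaration is omitted to match the example as printed above.

```python
import xml.etree.ElementTree as ET

# The fraction x/2 exactly as shown in the text (no namespace declaration).
mathml = "<math> <mfrac> <mi>x</mi> <mn>2</mn> </mfrac> </math>"
root = ET.fromstring(mathml)

# Collect (tag, attributes, text) for every element in document order.
elements = [
    (el.tag, dict(el.attrib), (el.text or "").strip())
    for el in root.iter()
]
for tag, attrs, text in elements:
    print(tag, attrs, repr(text))

# True only if some element carries an attribute, for example a formatting
# attribute such as mathsize or mathvariant.
has_attributes = any(attrs for _, attrs, _ in elements)
print("any attributes:", has_attributes)
```

Every attribute dictionary is empty here, which is the point made above: the repository copy records the structure of the fraction, not its font or size.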

STM publishers have been dealing with math for a long time and many computer representations of mathematics have been invented and are still in use within publishing workflows. However, these formats do not fit well into the XML-based publishing scenario. Here are some reasons why MathML is a better choice:

- Legacy math representations favor formatting over mathematical meaning and structure. On the other hand, MathML (especially Content MathML) represents the structure and meaning of the mathematics. This means that mathematical meaning is available in the repository, allowing it to be searched in a meaningful way. It also means that formatting decisions are not expressed in the repository but are applied during the generation of each publishing job.
- The use of XML-based languages, like MathML, wherever possible, minimizes the number of tools that need to be acquired and learned and maximizes the utility of each tool.
- When math is represented using a single format, layout rules and constraints become easier to implement, which makes output more consistent.
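To illustrate the first point in the list above, the sketch below writes the fraction *x*/2 in both Presentation MathML (`mfrac`) and Content MathML (`apply` with `divide`), then searches each for an explicit division operator. The helper name `uses_division` is hypothetical, and namespaces are again omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Presentation MathML: describes a fraction layout (numerator over denominator).
presentation = "<math><mfrac><mi>x</mi><mn>2</mn></mfrac></math>"

# Content MathML: states that the operation is division applied to x and 2.
content = "<math><apply><divide/><ci>x</ci><cn>2</cn></apply></math>"

def uses_division(mathml: str) -> bool:
    """Return True if the markup contains an explicit Content MathML <divide/>."""
    return ET.fromstring(mathml).find(".//divide") is not None

print(uses_division(presentation))
print(uses_division(content))
```

Only the Content MathML version answers the query directly. A search over Presentation MathML has to infer the operation from the layout, which is one reason the text singles out Content MathML for meaningful searching.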

Although there may be other ways of authoring documents containing mathematical notation, some tools are preferred when MathML is required. Here are some details on the available tools:

In a survey we did a few years ago, we discovered that about 70% of manuscript submissions to STM publishers were in Microsoft Word's document format with equations produced with Design Science's MathType [7] (or the junior version of MathType, called Equation Editor, that comes in the box with Word). This surprises some in the TeX community but TeX is really only dominant in mathematics and physics research. In other fields of engineering, science, and education, TeX is not so prevalent.

MathType 5 has the ability to convert equations to MathML. This can be done one equation at a time, but usually an entire Word document is converted to HTML+MathML using MathType's MathPage feature. A web page produced in this manner can be published directly to the web but in an XML-based production system, a transformation could be done on the page to get it into the repository. MathType's conversion capabilities are also available through a scripting API, which can be used to automate conversion of equations to MathML in a workflow setting.

Alternatively, another product, such as exTyles (see below), can be used to simultaneously convert the Word document's text to XML and convert the MathType equations to MathML. The conversion to XML of the non-math portion of the document involves an assignment of Word styles to elements in a DTD.

If document input is to be done by in-house personnel or by a service bureau, rather than by scientists or engineers, it probably makes sense to use an XML editor, such as JustSystem's XMetaL [8] or PTC's Arbortext Editor [9], with a MathML editor, such as Design Science's MathFlow Editor [10], for the math. The MathFlow Editor integrates with XMetaL and Arbortext Editor, so that one can click on an Insert Equation toolbar icon (or use a keystroke shortcut) to open a new MathFlow Editor window that will save its results directly to the XML document. Similarly, if the user double-clicks on an existing equation in the XML document, a MathFlow window will open to allow the equation to be edited. The user experience is similar to that of using MathType with Microsoft Word.

For other XML editors, such as Altova's XMLSpy [11], the user must currently develop MathML in a stand-alone MathML editor, such as Design Science's WebEQ Editor [12], then copy it to the clipboard (or save it in a file), switch to the XML editor, and paste it (or import the file). However, we are continuing to develop closer interfaces between the Design Science editors and other popular XML and HTML editors, in addition to XMetaL and Arbortext Editor.

## Which editor is best for working with MathML: MathFlow, WebEQ or MathType?

We at Design Science often get asked this question. Many people are familiar with MathType's user interface due to its popularity and association with Microsoft Word. However, MathType is not a true MathML editor as it contains features that have no analog in MathML. MathType produces MathML as a translation process and, like most translations, the results are not 100% equivalent to the original. MathType is also only capable of producing Presentation MathML, rather than Content MathML. The MathFlow and WebEQ Editors, on the other hand, are MathML editors from the start: they represent the equation internally in MathML and read and save in MathML. Although they are not as well known as MathType and their user interfaces are not quite as powerful, we believe they are the right editors to use when MathML is a primary requirement. We will be enhancing the WebEQ and MathFlow user interfaces in future versions to bring them more in line with MathType's.

Most TeX users author in their favorite text editor. As TeX must always be converted to MathML, there's not much to say about the authoring process for TeX, except perhaps that such authors should consider MacKichan's Scientific Word and Workplace [13], as these are TeX-oriented authoring systems that can save as MathML.

The mathematics representation of both TeX and LaTeX is equivalent to Presentation MathML. Except for trivial mathematics, conversion to Content MathML is problematic.

MacKichan's Scientific Word and Workplace are comprehensive systems for authoring scientific and technical documents. As of version 4.0, users can output an HTML page with MathML equations.

Computer algebra systems, like Waterloo Maple's Maple [14] and Wolfram Research's Mathematica [15], are built around a "notebook" interface and contain conversion to and from MathML. This should make it possible for a scientist or engineer to do their basic research and documentation within the interactive environment of these products. They can then submit such a notebook document to the publisher for conversion and integration into the XML repository. In fact, Mathematica version 4.2's notebook files are XML-based so, in theory, an XML-to-XML transformation is all that would be needed to incorporate a notebook into a publisher's XML repository.

Front-end tools are responsible for preparing and converting documents from their authored format into XML and placing them into the central repository. Conversion is a complex subject for which there are many tools.

In the XML world, XSLT processors [16] are an important conversion tool for converting one form of HTML or XML to another. Many other tools can also be part of the toolbox, such as Perl scripts and Stilo's OmniMark [17] programs.

Inera's exTyles [18] converts Microsoft Word documents to XML. It uses a set of rules to insert XML tags into the original document and also applies cleanup processing. Although there are many tools that will aid in conversion of Word documents to XML, we mention this one because it makes use of MathType to convert equations in the document to MathML.

TeX4ht [19] is a free software package for converting TeX and LaTeX documents into various HTML and XML formats. It is highly configurable and supports conversion of equations into MathML. A number of other projects are also working to develop conversion software for TeX and LaTeX.

Validation is an important task that checks XML against the DTD (Document Type Definition) that defines the elements and attributes allowed. This is done using XML parsing and validation tools. Since the MathML language is defined by the MathML DTD [20], MathML instances can be validated using the same techniques.
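Python's standard library does not validate against a DTD, so the sketch below is a much weaker stand-in for the validation described above. It checks only that a MathML fragment is well-formed and that every element name appears in a small allow-list. The allow-list and the `check` helper are assumptions made for illustration; a production workflow would use a validating parser with the actual MathML DTD [20].

```python
import xml.etree.ElementTree as ET

# A tiny, deliberately incomplete allow-list (an assumption for this sketch);
# the real MathML DTD defines many more elements plus attribute rules.
ALLOWED = {"math", "mrow", "mfrac", "mi", "mn", "mo"}

def check(mathml: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    try:
        root = ET.fromstring(mathml)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    return [
        f"unexpected element <{el.tag}>"
        for el in root.iter()
        if el.tag not in ALLOWED
    ]

print(check("<math><mfrac><mi>x</mi><mn>2</mn></mfrac></math>"))
print(check("<math><mfrac><mi>x</mi></math>"))
print(check("<math><fraction/></math>"))
```

Unlike a DTD, this check says nothing about nesting rules or required children. For example, it would accept an `mfrac` with only one child, which true validation would reject.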

Preparing content for a particular publication is called "back-end processing". It involves one or more of the following tasks:

- Selecting content items from the central repository;
- Assembling the content items into the publication;
- Converting the structural XML used in the repository to XML in which formatting is specified;
- Transforming the XML with formatting specifications into the delivery medium.

If the output medium for the publication is HTML+MathML, the conversion is fairly simple. XSLT can be used to transform the repository XML items into HTML elements. The MathML items can be left untouched as they will be rendered by the web browser. (To learn more about HTML+MathML, please see Online Delivery, below, and our companion paper, "MathML Adds Value to STM Publishing" [1]).
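The HTML+MathML case can be sketched as follows. In place of an XSLT stylesheet, this example uses Python's standard `xml.etree.ElementTree` module to rename structural elements to HTML elements while leaving each `math` subtree untouched. The repository element names (`article`, `title`, `para`) and the tag mapping are hypothetical, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository markup: <article>, <title> and <para> are invented names.
repository_xml = """<article>
  <title>Halving</title>
  <para>Half of x is <math><mfrac><mi>x</mi><mn>2</mn></mfrac></math>.</para>
</article>"""

# Hypothetical mapping from structural elements to HTML elements.
TAG_MAP = {"article": "body", "title": "h1", "para": "p"}

def to_html(element: ET.Element) -> None:
    """Rename mapped elements in place; leave <math> subtrees untouched."""
    if element.tag == "math":
        return  # the browser (or a plug-in) renders the MathML as-is
    element.tag = TAG_MAP.get(element.tag, element.tag)
    for child in element:
        to_html(child)

root = ET.fromstring(repository_xml)
to_html(root)
html = ET.tostring(root, encoding="unicode")
print(html)
```

An XSLT stylesheet would express the same idea with one template per structural element and an identity template for MathML, so the math passes through the transformation unchanged.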

If the output medium is PDF or PostScript for print, the process is a bit more involved. One route is with XyEnterprise's XPP product [21] which combines a content management system with PDF and print output processing and MathML support. Another route is to use PTC's Arbortext Advanced Print Publisher - Desktop software. MathFlow adds MathML composition capability to Arbortext Advanced Print Publisher - Desktop.

Another possible route to print or PDF is to convert the repository XML to XSL-FO, using XSLT or other text-to-text conversion tools. XSL-FO (FO = Formatting Objects) [22] is an XML-based language for describing the layout of paginated documents. Several tools are available that will convert an XSL-FO description of a document to PostScript or PDF, including XEP from RenderX [23] and XSL Formatter from Antenna House [24]. Unfortunately, XSL-FO does not support mathematical notation. However, Design Science is working with vendors of XSL-FO formatters to combine our MathPlayer MathML formatting technology with their products to create a comprehensive output solution.

As we explained in our companion paper, "MathML Adds Value to STM Publishing" [1], recent advances in web browser technology have enabled MathML to be part of an online delivery solution. That paper explains how using HTML+MathML as an online delivery medium allows an STM publisher to add value to their products by making them more useful to scientists, engineers, and educators.

We will briefly list the online delivery options involving MathML here:

MathPlayer [25] is Design Science's free MathML display engine for Microsoft's Internet Explorer (IE) [26] web browser. It currently requires IE 6.0 for Windows or later. Design Science is committed to making MathPlayer available free-of-charge as a service to the community. Several hundred thousand copies of MathPlayer have been downloaded at the time of this writing. The current version, 2.0, adds sophisticated accessibility capabilities for mathematics, as described in "MathML Adds Value to STM Publishing" [1].

Netscape 7 and related browsers include native MathML display support. As of this writing, Netscape 7.1 [27] is available for free download. This browser is based on the Mozilla [28] open-source browser project.

Most STM publications do not currently involve dynamic math, where the reader interacts with the web page. However, as STM publishers think more about adding value to their online content, we expect publications to become more dynamic. Our WebEQ Input Control [29] is a simplified version of our WebEQ Editor, specifically designed to be embedded in a web page so that the reader can enter mathematical formulas within an interactive page. Applications include web pages that demonstrate specific concepts by letting the student manipulate, graph, and calculate, as well as online evaluation and testing.

Although some of the pieces are still in development, MathML is already a viable choice for representing mathematics in STM publishing. MathML is the natural choice of math representation when working with XML-based workflows.

1. "MathML Adds Value to STM Publishing", http://www.dessci.com/en/reference/white_papers/mathml_adds_value.htm
2. World Wide Web Consortium (W3C), http://www.w3.org
3. MathML 2.0 Specification, http://www.w3.org/TR/2001/REC-MathML2-20010221
4. W3C's math home page, http://www.w3.org/Math
5. "MathML for Math and Science Communication", http://www.dessci.com/en/reference/webmath/tech/mathml.htm
6. "A Gentle Introduction to MathML", http://www.dessci.com/en/reference/mathml/default.htm
7. Design Science's MathType, http://www.dessci.com/en/products/mathtype
8. JustSystem's XMetaL, http://na.justsystems.com/content-xmetal
9. PTC's Arbortext Editor, http://www.ptc.com
10. Design Science's MathFlow Editor, http://www.dessci.com/en/products/mathflow/
11. Altova's XMLSpy, http://www.xmlspy.com
12. Design Science's WebEQ Editor (formerly http://www.dessci.com/en/products/webeq/, since replaced by MathFlow Components)
13. MacKichan's Scientific Word and Workplace, http://www.mackichan.com
14. Waterloo Maple's Maple, http://www.maplesoft.com
15. Wolfram Research's Mathematica, http://www.wolfram.com
16. XSLT, http://www.w3.org/Style/XSL/
17. Stilo's OmniMark, http://www.stilo.com
18. Inera's exTyles, http://www.inera.com
19. TeX4ht, http://www.cse.ohio-state.edu/~gurari/TeX4ht/
20. MathML DTD, http://www.w3.org/TR/MathML2/appendixa.html
21. XyEnterprise's XPP, http://www.xyenterprise.com/products/xpp.html
22. XSL-FO (FO = Formatting Objects), http://www.w3.org/Style/XSL/
23. RenderX's XEP, http://www.renderx.com/
24. Antenna House's XSL Formatter, http://www.antennahouse.com/
25. Design Science's MathPlayer, http://www.dessci.com/en/products/mathplayer/
26. Microsoft's Internet Explorer, http://www.microsoft.com/windows/ie/default.asp
27. Netscape 7, http://channels.netscape.com/ns/browsers/default.jsp
28. The Mozilla Project, http://www.mozilla.org
29. Design Science's WebEQ Input Control (formerly http://www.dessci.com/en/products/webeq/, since replaced by MathFlow Components)