Enhancing the Searching of Mathematics
A position paper based on the proceedings of the
Enhancing
the Searching of Mathematics Workshop
Robert Miner
Design Science, Inc.
June 8, 2004
- Introduction
- Trends in Scientific Collections
- Metadata and its Management
- Math-aware Searching
- An Agenda for Progress
Introduction
The rise of the World Wide Web has made information available to
knowledge workers on an unparalleled scale. As the availability of
information has increased, the challenge of finding relevant
information has become more central and more difficult. The success
of text-based Web search engines such as Google has dramatically
illustrated the potential of effective information retrieval to impact
the way people conduct research and think about information
management. However, current text-based search paradigms have a number of
limitations, particularly for science, mathematics, engineering and
technology (SMET) literature.
One challenge is that the majority of this literature is held in
collections where the dominant economic model for sustainability
involves collecting fees for access. Cataloging information or
federating searches involving these collections requires negotiating
separately with each such organization. Another challenge is that a
significant portion of the information in SMET documents is
non-textual, consisting of figures, tables and especially mathematical
notation. This problem takes an extreme form for the large and
rapidly growing class of documents that were "born in print" and
subsequently digitized, where even textual information must somehow
be recovered from a scanned image.
At the same time, a window of opportunity for enhancing the
searching of SMET literature is opening. For some years, the demand
for cross-media publishing and the concomitant need for more effective
information reuse and management in large collections of electronic
documents have been moving the publishing industry toward XML-based
workflows. In particular, SMET publishers have begun moving to
XML-based workflows using MathML, a standard for representing
mathematical expressions in XML. As a data format, XML offers
enticing possibilities for enhanced searching. Since SMET workflows
are still transitioning to XML, this is the ideal time to introduce
best practices that will facilitate better searching in the future.
To explore the possibilities for enhancing searching of SMET
literature in general, and mathematics in particular, a workshop was
held in April of 2004 at the Institute for Mathematics and its
Applications at the University of Minnesota. The workshop was funded
by the National Science Foundation through a National Science Digital
Library grant awarded to Design Science, Inc. The remainder of this
document describes the issues and areas of consensus identified by
that workshop, and lays out an agenda for action.
Trends in Scientific Collections
The workshop brought together a wide variety of SMET content providers,
researchers, software vendors and library scientists.
Several academic society publishers were represented, as well as
commercial publishers, abstracting and indexing services, digitization
and archiving projects, and an assortment of academic and
educationally-focused collections. The level of interest and
participation served to reinforce perceptions that questions of
information management, workflow and added functionality such as
searching are both timely and relevant to the community. However, it was
also clear that enhanced searching of mathematics meant
widely-divergent things to different organizations, and during the
workshop, several distinct user groups began to emerge.
One natural grouping consisted of academic society publishers from
highly mathematical fields. Representatives from SIAM, AMS, CMS, The
Albert Einstein Institute (Max Planck Institute for Gravitational
Physics) and the IEEE attended, as well as from the abstracting and
indexing services Math Reviews and Zentralblatt MATH, which work
closely with society publishers. They all publish documents
containing many involved and complicated mathematical expressions.
Their users are primarily academic researchers, whose interests are
strongly biased toward the apparatus of scholarship:
access to the peer-reviewed literature, citations, abstracts, and
bibliographies. The material is very dense, and more suited to
reading in print than on the screen, so PDF is widely used for
electronic publication.
Another important characteristic of this community is that many of
its authors use TeX and LaTeX. Dr. Michael Doob of the University of
Manitoba reports that author submissions to the Canadian Math Society
went from around 10% TeX in 1990, with a large proportion of
submissions as manuscripts, to 100% LaTeX submissions today.
Other publishers of research mathematics and theoretical physics show
similar trends. Consequently, many of these publishers share the
distinctive characteristic of having TeX-based production workflows.
In other disciplines, most production workflows have been based on
XML, SGML or proprietary page layout software such as Quark Xpress.
For example, a survey of commercial SMET publishers in areas other than
mathematics suggests that overall, perhaps 80% of submissions are in Word
format, with equations rendered in Equation Editor or MathType™
format, and that the use of LaTeX is relatively rare.
Some math-intensive society publishers are planning to migrate or
are in the process of migrating to XML-based workflows using MathML, a
standard XML-based markup language used to encode mathematical
expressions. However, even among those publishers that were not
planning to change their production workflows in the immediate future,
many are looking at XML+MathML for information management purposes at
some point. MathML is a highly-structured, more information-rich
representation of mathematics as compared to LaTeX. It contains
vocabularies both for describing the visual presentation of
expressions and for indicating their semantic content. However, it
is also a low-level representation, more akin to PostScript than TeX.
Thus, MathML is almost always generated and processed via software,
and is not suitable for hand authoring. For automatic processing, the
regularity, structure and ability to represent both presentation and
semantic aspects of expressions are highly appealing. Consequently,
there was strong interest in LaTeX to XML+MathML conversion strategies
at the workshop, as was evidenced by the attention given to the
presentation on the Hermes project, a part of the larger MowGLI
project funded by the European Union.
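As a small illustration of the two vocabularies, the expression x^2 can
be encoded either in presentation markup, which describes how the
expression is laid out, or in content markup, which describes the
operation being applied. The fragments below are minimal, hand-written
examples rather than the output of any particular conversion tool.

    <!-- Presentation MathML: a base with a superscript -->
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <msup>
        <mi>x</mi>
        <mn>2</mn>
      </msup>
    </math>

    <!-- Content MathML: the same expression as "apply the power operator to x and 2" -->
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply>
        <power/>
        <ci>x</ci>
        <cn>2</cn>
      </apply>
    </math>

The content form is generally the more useful one for searching and
computation, but it is also the harder one to recover automatically from
visually oriented sources such as TeX.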
Another group that was well-represented, and which has become
increasingly significant in recent years, consists of
"retro-digitization" projects. Projects such as JSTOR, NUMDAM,
ERAM and others have made tremendous strides toward scanning and
digitizing research literature which originally appeared in print,
stretching back into the 19th century in some cases. For these
groups, creating and maintaining accurate cataloging and abstract
information is important.
A number of projects also employ OCR and other techniques to
extract text for full-text searching. Several are also looking into
at least limited use of XML+MathML for some of their holdings, such as
abstracts. However, the challenge of augmenting traditional OCR
processes so that they can infer the information needed to mark up
document structure, equations, tables and other non-textual data is
substantial. A presentation by Dr. Masakazu Suzuki of Kyushu
University on his INFTY system for mathematical formula recognition
was therefore both significant and encouraging. The system, which is
already commercially available, appears to yield impressive results
with low error rates.
One final group worthy of note consisted of those content providers
who publish to the web in HTML or XML as opposed to PDF. This
includes many educational content providers and content providers
targeting broader audiences, where graphic design, interactivity, and
seamless integration with other Web content are important. For this
group, XML+MathML is also an appealing format, though most
organizations are planning to use XML+MathML as a source format and
to generate HTML (or XHTML) for publication, for example using XSLT.
There was a strong consensus that XML and
MathML will be important for SMET publishing, especially for
information management purposes. The corollary is that effective
conversion software is a prerequisite to migrating large bodies of
existing material to that format. In fact, the process has begun.
The Grainger Library at UIUC has converted around 200,000 pages of
SMET research literature into XML+MathML under a Digital Library
Initiative grant, and over time, documents will increasingly be
generated in XML+MathML from the outset. Some estimates suggest
as many as 100,000 pages a year of research literature will be
published, processed or archived in that format as early as 2005.
Metadata and its Management
A major theme that emerged from the workshop was metadata and its
management. Scholarly publishing has long relied on bibliographic
cataloging information for accessing and referencing literature. In
the electronic world, cataloging information such as author, title, and
date of publication is regarded as metadata. While precise
definitions of metadata are disputed by experts, broadly speaking,
metadata consists of assertions made by third parties describing the
content of an item of information. In practice, most metadata records
apply at the document level, as opposed to describing items within a
document, such as a picture, but such "fine-grained" metadata is not
unheard of. Another useful distinction is that between objective and
subjective metadata. Bibliographic cataloging information and
intellectual property rights are examples of objective metadata, while
grade-level, subject area, and overall quality are examples of
subjective metadata.
Properly speaking, metadata records are not part of a document, but
are instead stored, maintained and shared separately. For pragmatic
reasons, certain kinds of objective metadata, such as the author, are
often embedded within the documents to which they apply, usually in
some sort of special header field. However, even in these cases, such
metadata is often extracted into separate records during production.
The point is that to be useful, cards belong in the card catalog and
not in the books on the shelves, to use a pre-digital metaphor.
For digital libraries, the role of the card catalog is filled by
databases of metadata records. Such databases have obvious utility
for searching and information retrieval. One frequent kind of
question knowledge workers ask is "what are all the articles by author
X" and a metadata database is an ideal for answering such questions.
However, even this simple example illustrates two major challenges
that must be addressed when using metadata for information retrieval.
The first difficulty is that not all the articles by X may
be in the same database. The second difficulty is that in practice,
there is wide variation in how metadata is recorded. As an extreme
example, there are dozens of spellings for the name of the Russian
mathematician Chebychev, or Chebeshev, or Tschebyscheff, or ...
To attack the first issue, the Open Archives Initiative (OAI) has
developed a protocol for sharing metadata so that unified metadata
databases can be developed. When libraries were exclusively physical
places, it was reasonable that their card catalogs reflected their own
holdings. But in digital libraries, boundaries between collections
are artificial, and from the point of view of the searcher, what is
needed is a single unified metadata store. The basic idea of OAI is
that content providers set up OAI servers which return XML metadata
records in response to standardized queries. Aggregators then
periodically "harvest" metadata from multiple collections in order to
create unified databases, and offer search services built on top of
them.
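To make the mechanism concrete, the following sketch shows roughly what
a harvesting request and a returned record look like under the OAI
Protocol for Metadata Harvesting; the repository address, identifier
and bibliographic details are hypothetical, and the response is
abbreviated to the record itself. A harvester issues an ordinary HTTP
request such as

    http://archive.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2004-01-01

and receives, for each item, an XML record of roughly this form:

    <record>
      <header>
        <identifier>oai:archive.example.org:article-1234</identifier>
        <datestamp>2004-03-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>On a Class of Orthogonal Polynomials</dc:title>
          <dc:creator>Author, A. N.</dc:creator>
          <dc:date>2003</dc:date>
          <dc:identifier>http://archive.example.org/article-1234</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>

Because the request is a plain HTTP query returning XML, a harvester can
be implemented with very modest software infrastructure.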
Note that only metadata records are shared, so that content
providers who restrict access to their holdings can be included in
search services without losing control over their content. A searcher
without proper credentials following a link to an item in a protected
collection is typically redirected to a page with information about
obtaining access to the collection. Of course, for some organizations
such as indexing and abstracting services, or reviewing services such
as the Eisenhower National Clearinghouse, metadata is itself important
intellectual property that carries value. However, even in these
cases, it is typically possible to share some basic metadata, such as
a title, merely to indicate that an item is in a collection, and that
authorized users can obtain further information about it. This is
precisely the strategy that most of these organizations have adopted
to date.
OAI has gathered substantial momentum in the digital library
community. The protocol specification has gone through several
versions and is now relatively mature and stable. A variety of open
source software is available for setting up an OAI server, and for
small collections, a mechanism called the OAI Static Repository has
been devised whereby metadata can be shared by placing an XML file on
a web site and registering it with a gateway. At the workshop, most
content providers had either already set up OAI servers or were
considering doing so in the near future, and there was strong
consensus that OAI would become increasingly dominant as a means of
metadata sharing.
The second broad metadata problem, lack of uniformity, has more
facets and is therefore less amenable to systematic solutions. One such
facet, touched on above, is the identification of equivalent variations
in names, spellings, abbreviations of journal titles, and so on. This
is an old problem in library science, and is merely exacerbated by the
scale of metadata sharing in digital libraries. The principal line of
attack is the compilation of name authority databases. For example,
the US Library of Congress and the "Deutsche Bibliothek" maintain
large name authority databases, and both institutions participate
in the Virtual International Authority File (VIAF) project, which aims
to explore virtually combining the name authority files of both
institutions into a single name authority service.
A facet more specific to digital libraries is lack of uniformity in
electronic formats for metadata. There have been a number of attempts
to standardize metadata formats within different communities. Perhaps
the most notable is the Dublin Core standard developed in 1995 under
the auspices of NCSA and OCLC. Dublin Core defines a basic set of
about a dozen metadata elements such as title, creator, publisher,
description and so on. Dublin Core is by far the most widely used
metadata standard in the publishing and digital library communities.
However, as its name suggests, many if not most organizations using
it have extended it in incompatible ways to record additional
metadata specific to their collections. Extensions have proliferated
to the extent that the European Committee for Standardization (CEN) has
formed a working group that tries to keep these extensions to Dublin
Core organized; the group has produced an agreement on the presentation
of Dublin Core-based application profiles and is now working on a
machine-readable version of it. Further, while Dublin Core specifies a set
of metadata elements, it does not specify a particular electronic
format for recording this data, and there are a number in use,
including the W3C RDF format and a direct XML encoding. Consequently,
creating "crosswalks" or translation guidelines among varying metadata
formats and element sets used by various SMET document collections
remains a major issue.
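As a concrete, simplified illustration of what a crosswalk contains, a
fragment of a mapping from traditional MARC cataloging fields to Dublin
Core elements might read:

    MARC 245 (Title Statement)               ->  dc:title
    MARC 100 (Main Entry - Personal Name)    ->  dc:creator
    MARC 260 (Publication, Distribution)     ->  dc:publisher, dc:date
    MARC 650 (Subject Added Entry - Topical) ->  dc:subject

Real crosswalks must also specify how to handle fields with no
counterpart, repeated fields, and differences in controlled
vocabularies, which is where much of the difficulty lies.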
The topic of improving the quality, interoperability and management of
metadata generated much discussion at the workshop. While there was
broad consensus on the momentum and value of OAI, there was little
sense that the problems of metadata interoperability had easy or short
term solutions. Major content providers have been working toward
convergence of metadata formats and crosswalks for interoperability,
both on a case-by-case basis, and under the umbrella of various
standards organizations. While progress has been slow and difficult, it has
been relatively steady. For example, the major math abstracting
services Math Reviews and Zentralblatt are nearing an agreement on a
common format for their metadata records. Consequently, raising
awareness of the issues and continued incremental improvement are
likely the best that can be done, given the economic considerations
that constrain content providers with very large collections
stretching back for many years into the past. In areas such as
mathematics where the useful lifespan of research literature routinely
runs to decades, such economic constraints cannot be ignored.
Historically, the collection and management of metadata has been
expensive, as it generally involves a great deal of hand work on the
part of subject experts. Consequently, another focus in the workshop
discussions was techniques for lowering the cost of metadata without
sacrificing quality. One approach is to enlist the aid of authors in
identifying and/or checking at least objective bibliographic metadata,
for example, by incorporating identification and verification into
either the submission process or the proofing process associated with
scholarly publication. IEEE and the arXiv both use electronic
submission processes that take some steps in this direction. Another
avenue is to try to incorporate metadata creation into authoring
tools. However, this must be done with care. If metadata creation is
purely optional, it will likely be ignored, while at the same time
authors will resent intrusive techniques. Furthermore, many kinds of
technical documents are edited by multiple people over time. This
makes metadata management more complex. Nonetheless, the
prevailing view was that there is a good deal of potential for
facilitating easy identification of basic metadata such as author and
title as part of the authoring process. For example, a standardized
set of LaTeX macros for title, author, abstract, MSC classification,
date, and so on might be used.
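The AMS document classes already provide commands along these lines; the
preamble below is purely a sketch using amsart-style command names (the
title, author and classification are illustrative), and shows how little
would be required of authors.

    \documentclass{amsart}
    \title{A Note on Orthogonal Polynomials}  % illustrative title
    \author{A. N. Author}
    \date{June 8, 2004}
    \subjclass[2000]{33C45}   % MSC classification
    \keywords{orthogonal polynomials, recurrence relations}
    \begin{document}
    \begin{abstract}
    A short abstract from which bibliographic and subject metadata
    could be extracted mechanically.
    \end{abstract}
    \maketitle
    % body of the article
    \end{document}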
Another area of active research is automatic metadata generation.
Because human-generated metadata is so expensive, even minimally
adequate automatic generation of metadata can have an appealing
cost-benefit profile. However, studies such as the "Breaking the
Metadata Bottleneck" NSDL project conducted by Dr. Elizabeth Liddy of
Syracuse University have shown that the quality of human-generated metadata
varies widely, and in the case of certain kinds of well-defined,
objective metadata, automatic algorithms can actually perform better.
Another reason automatic algorithms are interesting is that they might
be able to make much finer-grained metadata economically viable, for
example adding metadata at the individual equation level, as opposed
to the document level. Whether or not such fine-grained metadata
would be generally useful, however, is a question which leads us into
the topic of the next section.
Math-aware Searching
While it is relatively clear to everyone that mathematical notation
is not accessible to traditional text-based search to any meaningful
degree, there is much less agreement on what a math-aware search ought
to look like. Discussions at the workshop covered ideas ranging from
fine-grained per-equation metadata, to combined text and equation
search, to data mining of specialized databases of mathematical
objects. The discussion is further complicated by the paucity of
functional, real-world examples of math-aware search. As a result,
the need for test bed collections, development of use cases, and
further research and usability testing of possible search techniques
were the areas where strongest consensus emerged.
The most obvious notion of math-aware search is simply to extend
the keyword search models used by virtually all popular Web search
engines. That is, one would type in a collection of text keywords and
mathematical expressions, and the search engine would return a list of
documents in which they occur. This is the model used by a math-aware
search engine developed by Dr. Abdou Youssef of George Washington
University for the Digital Library of Mathematical Functions (DLMF).
The DLMF documents are coded in LaTeX using special macro packages.
For searching purposes, the mathematical expressions are first
converted to structured text, e.g. "x begin_superscript 2
end_superscript," which is then indexed by a conventional text search
engine. Queries are entered in a special linear query language
similar to LaTeX. To perform the search, the query is also converted
into text, and the resulting text search is performed against the
index. While preliminary indications are that this yields
surprisingly effective math searching, the DLMF search engine is not yet
online, so there is not yet a substantial body of usage data.
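As an illustration of the idea (the token names follow the single
example above and are illustrative rather than the DLMF's actual
vocabulary), the source expression and the query are flattened into the
same kind of token stream before matching:

    LaTeX source:    $x^2 + y^2$
    Indexed text:    x begin_superscript 2 end_superscript + y begin_superscript 2 end_superscript
    Query:           x^2
    Query as text:   x begin_superscript 2 end_superscript

A hit is then simply an ordinary phrase match between the textualized
query and the textualized source.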
The DLMF technique of piggy-backing on a text-search engine has
some interesting side effects. A common capability of text search
engines is control of proximity searching, where search terms must
appear within a certain distance of each other, for example within 5
words. Using this capability, it is possible to form queries such as
"find all expressions of the form (x^2 + ?)" where ? represents any
single term. On the other hand, a query such as "find all expressions
matching $a^3 + $a^2" is not possible, where both $a's denote the same
variable name. Of course, these examples merely scratch the surface
of how one might specify abstractions and constraints in a math search
query. Setting aside the non-trivial questions of devising an
effective query language or graphical query editor, a more basic
question is how sophisticated queries need to be in order to be
effective. Research indicates that most users of
search engines don't bother to inform themselves about the subtleties
of the query language, nor do they use even basic features very often
if they can't be learned quickly and remembered easily. Consequently,
real world usability data is clearly important in determining the
success of any math search paradigm that targets a general audience.
Another aspect unique to the DLMF model is that it takes advantage of
exceptionally regular notation across the collection and specialized
source code macros. In general, however, TeX and LaTeX with their
powerful macro mechanisms and ability to redefine basic language
constructs on the fly, are notorious for lack of regularity in
source code. Consequently, a natural question to ask is
whether the analogous strategy of piggy-backing on an XML-based
XQuery search engine indexing XML+MathML source would perform better.
In particular, MathML enforces a certain regularity of structure, and
admits a fair degree of normalization. Also, the XQuery language is
more powerful than text-based query languages in its ability to define
constraints and relationships between terms.
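A rough sketch of such a query, written against presentation MathML and
making no claims about any particular engine (the collection name is
hypothetical), might look like this:

    declare namespace m = "http://www.w3.org/1998/Math/MathML";
    (: find superscript expressions whose base is x and whose exponent is 2;
       a more careful version would also check which child is the base :)
    for $e in collection("articles")//m:msup[m:mi = "x" and m:mn = "2"]
    return $e

Because the query operates on explicit tree structure rather than a
linearized token stream, constraints such as "both occurrences are the
same variable" become expressible, though at the cost of requiring
MathML sources.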
In both these models, there is substantial appeal in leveraging
existing technology which incorporates high quality text search.
However, another intriguing math-search model takes a different
approach entirely. This model is exemplified by the On-Line
Encyclopedia of Integer Sequences, created by Neil Sloane of
AT&T Research. Using a web interface, a user enters a sequence of
integers. If the sequence matches an entry in the database, a page
is returned detailing the basic properties of the sequence, known
relations to other sequences, and references to instances of the
sequence in the literature. In a talk at the Future of Mathematical
Communication II conference at MSRI, mathematician Rob Corless told of
solving a problem in dynamical systems using results from an article
on combinatorics which he located using the Online Encyclopedia. The
point here is that Corless thought it very unlikely that he would have
found the information using text-based keyword searching; he wouldn't
have known any of the appropriate keywords, since he was unfamiliar
with the specific area of combinatorics, and unaware of the
connection.
The model of a specialized database of mathematical objects is also
suggestive of the special functions web site created by Wolfram
Research. In that case, the database of special functions contains
representations both in the Mathematica language as well as MathML and
LaTeX. Using Mathematica, the entries in the database can be formally
manipulated, and in theory, could be algorithmically mined for
relationships and interconnections. Other groups are also working on
searching within formal systems, such as the Coq theorem proving
environment, as well as in less structured contexts such as raw
MathML.
In this vision of math-aware searching, specialized databases of
mathematical objects become powerful research tools, as algorithmic
searching reveals interconnections between mathematical objects
themselves, as well as the research areas in which instances of the
objects occur. One possible architecture for such databases would
involve automatically extracting non-trivial equations from documents
into databases, and piggy-backing on a metadata sharing system such
as OAI to allow aggregators to harvest equations into unified
collections. Aggregators could then offer specialized search
services, and algorithmically enhance the collection by searching for
inter-relationships using more sophisticated but still computationally
tractable forms of mathematical equivalence. For example, by analogy
with the integer sequences and series, one might have databases of
polynomials, matrices, continued fractions, and many other classes of
mathematical objects with normal or semi-normal forms, or that can be
linearly ordered. In this context, it may be useful to differentiate
between searching in or for published literature and the building of
specialized services such as online encyclopedias or expert
systems that combine searching and computation.
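As a purely illustrative convention, a database of polynomials might
store each entry under its coefficient sequence, so that syntactically
different but mathematically identical inputs lead to the same key:

    3x^2 - x + 5      ->  key (5, -1, 3)
    5 - x + 3x^2      ->  key (5, -1, 3)
    (3x - 1)x + 5     ->  key (5, -1, 3)   (after expansion)

This mirrors lookup by integer sequence, and any class of objects with a
computable normal form admits the same treatment.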
The methods and examples of math searching discussed so far are
strongly biased toward the research literature. However, the
educational community is a far larger group of users. In educational
contexts at the lower levels, a third notion of math-aware searching
returns to the idea of fine-grained, per-equation metadata. A
motivating use case might be a student searching for "a^2 + b^2 =
c^2". Even using Google today, this search returns many thousands of
hits. The problem in this case is picking out the one, for example,
that occurs with worked examples of the Pythagorean theorem
appropriate for grades 6-8. The point is that the student's
information need revolves more closely around the metadata assertions
than the actual formula.
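Purely as a sketch (the element names outside of MathML and Dublin Core
are hypothetical, not drawn from any existing schema), one could imagine
attaching a small metadata record to the equation itself:

    <equation id="eq-pythagoras">
      <m:math xmlns:m="http://www.w3.org/1998/Math/MathML">
        <m:mrow>
          <m:msup><m:mi>a</m:mi><m:mn>2</m:mn></m:msup>
          <m:mo>+</m:mo>
          <m:msup><m:mi>b</m:mi><m:mn>2</m:mn></m:msup>
          <m:mo>=</m:mo>
          <m:msup><m:mi>c</m:mi><m:mn>2</m:mn></m:msup>
        </m:mrow>
      </m:math>
      <dc:subject xmlns:dc="http://purl.org/dc/elements/1.1/">Pythagorean theorem</dc:subject>
      <gradeLevel>6-8</gradeLevel>
      <resourceType>worked example</resourceType>
    </equation>

A search service could then filter ordinary formula matches by the
attached assertions, which is exactly the kind of constraint the student
in this example needs.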
A number of research groups, such as the NSDL MetaTest project,
have begun investigating the effectiveness of search systems based on
human and automatically generated metadata, especially in comparison
with full-text search systems. However, in general metadata devoted
to classifying and describing subject matter is expensive to produce
and suffers from lack of uniformity and objectivity. This suggests
that metadata search systems based on these kinds of data are unlikely
to outperform full-text search systems. By contrast, experience from
successful, education-oriented Web sites such as the Eisenhower
National Clearinghouse and the Math Forum at Drexel, indicates that
end users do perceive significant value in augmenting text searching
with metadata-based constraints involving criteria such as educational
function, grade level and the like. Similarly, there has been some
work done indicating that metadata on physical units might be used to
good effect, particularly in engineering and the sciences.
Consequently, it seems clear there is a role for faceted
metadata-based searching in many contexts, provided metadata can be
created and managed in a cost effective way.
Another part of the appeal of a fine-grained metadata approach is
that it is well-suited to automatic creation via authoring tools. In
languages like LaTeX, a natural approach would be to use specialized
macro packages, where the macros would contain metadata labels in some
fashion. This would facilitate, for example, tagging an equation as
containing a Riemann tensor by using a \Riemann macro. Alternatively,
and perhaps more importantly in the educational sphere, WYSIWYG
authoring tools such as Design Science's MathType editor could
facilitate labeling of significant equations via a simple user
interface such as pulldown menus, etc. A more sophisticated approach
might be to use the local context around an equation together with a
clustering algorithm to guess appropriate metadata, which a user could
then accept, modify or ignore.
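A minimal sketch of the macro-package idea, using standard TeX
file-writing primitives (the output file name and the label vocabulary
are hypothetical):

    \newwrite\metadatafile
    \immediate\openout\metadatafile=equation-metadata.txt
    % Typeset a Riemann curvature tensor and, as a side effect,
    % record a metadata label for the surrounding equation.
    \newcommand{\Riemann}[4]{%
      R^{#1}{}_{#2#3#4}%
      \immediate\write\metadatafile{contains: Riemann tensor}}

A production system would of course tie the written labels back to
equation numbers or identifiers, but the essential point is that the
author does nothing beyond using the macro.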
However, as noted at the beginning of the section, all of these
visions of math-aware searching must be considered speculative until
there is more validation using relatively large collections and
representative user groups. There is a growing number of research
projects aimed at examining or implementing several of the techniques
touched on here, but very few have undergone any degree of real-world
use or usability testing. Consequently, there was strong consensus
that the availability of test bed document collections and cooperation
in conducting pilot studies are essential if research is to advance
quickly to concrete, deployable solutions.
An Agenda for Progress
In a round-table discussion, workshop participants summarized their
perceptions of the areas of consensus that had emerged in a
twelve-point agenda for progress. These fall under five headings:
Metadata
- Content providers should expose metadata via OAI.
- Content providers should work towards convergence of metadata
formats.
Information Management
- Content providers should consider XML and MathML for information
management purposes, and more generally work toward structured data
with semantic content.
- Research and development on conversion technologies should be a
priority.
Search / Discovery / Identification
- Content providers should make data available for test bed purposes
either via open collections or by arranging appropriate consent.
- The community should work to develop use cases, e.g. exact matching,
search by property, searching formulae, etc.
- Search service providers should expose standardized interfaces
through Web services.
Enriching Content
- Metadata workflow should be managed.
- Tool vendors should work toward generation of metadata through
the authoring process.
- Content publishers should work toward validation and generation of
metadata as part of the publication process.
SMET community outreach
- Researchers and developers should inform the community
about the potential and possibilities of math-aware searching.
- Researchers and developers should better inform themselves of SMET
community requirements at all levels from elementary education through
advanced research.
This document is intended as a first step toward SMET community
outreach. To further that goal, a web site has also been created as
part of the Enhancing the Searching of Mathematics project funded by
the National Science Foundation through its National Science Digital
Library program. For further information about the Agenda for
Progress, what other organizations are doing to implement it, and ways
you can contribute, please visit http://www.dessci.com/searching.