Math markup marked down

Sun’s Erwin Tenhumberg fights some FUD about ODF and in passing provides a link that is worth a few more words. It appears that Science, the journal of the American Association for the Advancement of Science (AAAS), itself the largest scientific society in the world, has updated its authoring guidelines to include advice for Office 2007 users. The news is not good.

Because of changes Microsoft has made in its recent Word release that are incompatible with our internal workflow, which was built around previous versions of the software, Science cannot at present accept any files in the new .docx format produced through Microsoft Word 2007, either for initial submission or for revision. Users of this release of Word should convert these files to a format compatible with Word 2003 or Word for Macintosh 2004 (or, for initial submission, to a PDF file) before submitting to Science.

Well, so much for 100% compatibility, eh? That is what I’ve been talking about. Whether you move to OOXML or ODF you will be making a change that will break compatibility with your past document processing systems. You will need to change over the next couple of years and you will need to examine your choices carefully. But don’t get suckered into thinking that the choice of OOXML is magically painless. The 100% compatibility claims don’t hold water.

More bad news:

Users of Word 2007 should also be aware that equations created with the default equation editor included in Microsoft Word 2007 will be unacceptable in revision, even if the file is converted to a format compatible with earlier versions of Word; this is because conversion will render equations as graphics and prevent electronic printing of equations, and because the default equation editor packaged with Word 2007 — for reasons that, quite frankly, utterly baffle us — was not designed to be compatible with MathML. Regrettably, we will be forced to return any revised manuscript created with the Word 2007 default equation editor to authors for re-editing. To get around this, please use the MathType equation editor or the equation editor included in previous versions of Microsoft Word.

Uh oh. Not only can you not submit files in OOXML format, but you can’t even use Office 2007 and save in the old binary formats. Down conversion or using the Compatibility Pack won’t help. Microsoft’s decision to push a new “Office Math Markup Language” (OMML) rather than use the well-established MathML standard appears to be a serious flaw.

Nature appears to have the same problem:


We currently cannot accept files saved in Microsoft Office 2007 formats. Equations and special characters (for example, Greek letters) cannot be edited and are incompatible with Nature’s own editing and typesetting programs.

Of course, when targeting final publication of a paper, a PDF file is fine. But when engaging in collaboration with another researcher, or an editor, you need to agree on a standard format in which you both can work.

Reuse of existing standards is important. When you reuse a standard, you are reusing more than a piece of paper. You are reusing the experience and effort that went into creating and reviewing that standard. You are reusing the experience gathered by those who have already implemented the standard. You are reusing the books and training materials already written for that standard. You are reusing the interfaces for other technologies that have already integrated with that standard or can produce or consume output that conforms to that standard.

Isaac Newton wrote, “If I have seen further it is by standing on the shoulders of giants”. When you reuse standards you reuse the accumulated wisdom of an industry and assume the vision and powers of giants. But when you ignore all precedents and go forth on your own, well, let’s just say the outcome is more variable in that case. You may be the next Einstein, or you may be the next fool.

If Science and Nature need to update their templates, then I’d suggest they take a look at ODF. Not only does it use MathML for equations, but it is an open standard, an ISO standard, a platform- and application-neutral standard that has many implementations, including several good open source ones. If they need to update their processing, then they might want to make the smart choice now, the choice that increases their choices and flexibility going forward.


18 June 2007 Update

A response from Nature and one of their vendors, explaining the complexity of migrating their publishing ecosystem to a new file format. Quoting a letter to Microsoft from Bruce Rosenblum of Inera:

Had the conversion from DOCX to DOC provided a conversion from OMML to Equation Editor format, it would have provided the necessary backwards compatibility for publishers to upgrade one system at a time. But because this compatibility is not available, it’s created the need for a “big bang” upgrade, or a delay until the ecosystem of inter-dependent systems is deliberately updated over time. In the environment of scholarly publishing, such substantive upgrades often take years, not months.

Creative Commons License
This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

{ 13 comments }

  • bjelkeman 2007/04/26, 12:08

    It seems like they forgot not to “embrace, extend and extinguish” their own product lines.

    It really is time to start using non-proprietary document exchange formats.

  • David 2007/04/29, 18:24

    I’m not really convinced by the argument that there is any technical difference between storing MathML and storing something transformable to MathML, given that a transform is provided. I give a larger worked example of extracting MathML from OpenOffice.org in my blog.

  • PolR 2007/04/29, 20:38

    David, the transform you refer to must satisfy a number of conditions to be really equivalent to storing MathML.

    1- It must work both ways (i.e., transform to MathML and from MathML).

    2- The round trip from MathML to the alternative format and back to MathML must have 100% fidelity.

    3- The round trip from the alternative format to MathML and back to the alternative format must also have 100% fidelity.

    If you don’t have both forms of 100% fidelity round trips, there would be no way to edit the text in one format and store it back in the other format while being confident the changes are properly preserved.

    and the last condition:

    4- Both formats must evolve their new versions simultaneously, in lockstep, to ensure conditions 1 to 3 are preserved across the changes.

    This one is the killer. Trying to have two formats permanently synchronized this way is a maintenance nightmare, especially when we discuss standards with multiple implementations maintained by different organizations.
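Conditions 1 to 3 above can be sketched as a small fidelity check. This is only an illustration in Python: the two "formats" are toy stand-ins (a made-up `<var>` element in place of MathML's `<mi>`), and every function name here is hypothetical rather than part of any real converter or stylesheet.

```python
from typing import Callable, Iterable, List


def round_trip_failures(
    forward: Callable[[str], str],
    backward: Callable[[str], str],
    samples: Iterable[str],
) -> List[str]:
    """Return the samples that do NOT come back unchanged after
    forward() followed by backward()."""
    return [s for s in samples if backward(forward(s)) != s]


# Hypothetical toy converters: the "alternative format" simply
# renames <mi> to <var> and leaves everything else alone.
def mathml_to_alt(doc: str) -> str:
    return doc.replace("<mi>", "<var>").replace("</mi>", "</var>")


def alt_to_mathml(doc: str) -> str:
    return doc.replace("<var>", "<mi>").replace("</var>", "</mi>")


# A lossy variant: it also folds numbers (<mn>) into <var>, so the
# identifier/number distinction cannot be recovered on the way back.
def lossy_mathml_to_alt(doc: str) -> str:
    return mathml_to_alt(doc).replace("<mn>", "<var>").replace("</mn>", "</var>")


mathml_samples = ["<mi>x</mi>", "<mn>2</mn>"]
alt_samples = ["<var>x</var>"]

if __name__ == "__main__":
    # Condition 2: MathML -> alternative -> MathML
    print(round_trip_failures(mathml_to_alt, alt_to_mathml, mathml_samples))
    # Condition 3: alternative -> MathML -> alternative
    print(round_trip_failures(alt_to_mathml, mathml_to_alt, alt_samples))
    # The lossy converter fails condition 2 for the <mn> sample
    print(round_trip_failures(lossy_mathml_to_alt, alt_to_mathml, mathml_samples))
```

A check like this only covers the samples it is given. Condition 4 is the part no test can settle once and for all, since the check would have to be re-run against every new version of either format.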

  • Rob 2007/04/29, 20:42

    David, thanks for writing.

    You could use that same reasoning to argue that there is no difference between the proprietary Office binary formats and the new XML standard because the binary can be transformed into the XML, and a transform is provided in the form of Office 2007.

    But I think there is a big difference between a world with a single standard and a world with multiple standards and a set of transformations. This is the difference that we all see when we travel internationally with a bag of adapters and transformers. Transformation for no purpose other than to paper over vendor intransigence introduces additional code that does no useful work. It will slow down processing, introduce complexity and bugs, and increase costs.

    Take a look for example at Microsoft’s ODF Add-in for Word. This is their parallel argument, that they do not need to support ODF natively because they provide a translator to do this. However this translator takes 30+ seconds to translate a one-page document, making it useless in any workflow.

    Note also that the OMML2MML.XSL that you are modifying is an XSLT stylesheet that ships with Office 2007 and therefore implicitly has a Microsoft copyright and license. Or do you see something that grants permission to modify and redistribute?

  • David 2007/04/30, 04:31

    Rob,

    You could use that same reasoning to argue that there is no difference between the proprietary Office binary formats

    Not really, that’s the thing about XML, it allows the documents to be inspected/transformed, whatever. A binary blob of data is just that.

    As for licences: note I neither modify nor redistribute the stylesheet that comes with Office, just run it. The version of the stylesheet on Brian Jones’ blog site (which is more or less the same) is rather strangely not covered by any licence at all as far as I can see (so the legal position of whether it is usable probably depends on which jurisdiction you are in). But in any case these (like the discussion of whether or not it’s a good thing for ISO to standardise an XML version of Word’s internal data structures) are essentially political/legal/commercial issues that are important, but essentially separate from the technical question of “can users generate re-usable standard MathML out of Word?” It feels really weird to say this as a (very) long time TeX user (and maintainer of the LaTeX system) who has avoided Word (and all similar WYSIWYG systems) like the plague for over 20 years, but Word does have, out of the box, MathML cut and paste to the clipboard (something not in OO.org as far as I can see) and does have usable (but under-documented) mechanisms for accessing the MathML from other APIs as well. Personally I’d rather use emacs, but I’ve come to realise that not everyone feels the same way about emacs :-)

    PolR,

    Round tripping is important; I’m not sure how well the two MS-supplied stylesheets round-trip. (There are limits to how far I’m prepared, in a free project, to debug the code of an organisation that has, I’m sure, an army of QA testers to do that kind of thing.) The same issue will come up with any system that isn’t actually using the MathML element structure in its editor internals. This includes OO.org, which also explicitly maintains the mathematics in a linear StarOffice (eqn-ish) format, which is what’s actually used to enter/edit the mathematics. The proposition in this post is not that Word (or OO.org) should use MathML internally in its editing structures; it is that Word (like OO.org) should generate MathML on output and store it in the zip file. My point is that technically there is no difference in information content between storing an input XML and a stylesheet and storing the result XML.

  • Rob 2007/04/30, 08:31

    David,

    And a binary format cannot be inspected or transformed? Come on, we’ve been doing this for years. XML makes some things easier, and has allowed the creation of some standards-based tools, but the hard part of doing real work with Microsoft documents is just as difficult now as it was in the binary days. Remember the old saying, “You can write FORTRAN in any language”. The same thing can be said of OOXML, “You can write an opaque file format even in XML”.

    XML is useful, but it isn’t magic.

    In any case, I’d be interested in reading some time how well this XSLT transform handles the conversions. I haven’t seen anyone list what constructs are supported and which are not, what the limitations are and what the performance is like. A formal profile that maps between the two would be interesting.

    From a practical perspective, for this to be useful, the conversion would need to be flawless, have a well-defined and useful feature set that it supports, and be sufficiently fast that it can process a document with tens to hundreds of equations quickly, where “quickly” is relative to user expectations for loading Office documents, which is typically less than 5 seconds.

    In any case, I think I agree with your underlying premise, that given an adequate transform between the two (though I make no claim on whether the current stylesheet is adequate) and placing that transform at all places in the universe where someone is expecting MathML, then OMML could be substituted for MathML.

    And when we have access to all control points in the universe then we can at the same time also add transforms for VML to SVG, XPS to PDF, HD Photo to PNG, OOXML to ODF, .NET to Mono, etc. When this is done we can look around, admire the universe full of adapters and transformers, doing nothing but wasting CPU cycles, and congratulate ourselves on having wasted years chasing after Microsoft interfaces, while Microsoft has moved on and cemented new monopolies in other areas.

    So I’ll agree with you that it can be done. I just don’t think it should be done. Wouldn’t it have been better, for example, if Microsoft had worked with the MathML WG to help evolve the standard?

  • David 2007/04/30, 10:24

    And a binary format cannot be inspected or transformed? Come on, we’ve been doing this for years. XML makes some things easier, and has allowed the creation of some standards based tools, but the hard part of doing real work with Microsoft documents is just as difficult now as it was in the binary days.

    Judging difficulty is a personal thing, but I wrote a stylesheet to get from the output that Word optimistically calls HTML to valid (on my one test document) XHTML+MathML in half an hour or so, plus another hour or so debugging their stylesheet for them, and a similar amount of time (and less debugging) for the OO.org version.
    I could do it quickly because I was using the tools I use every day and I could use those tools because the systems are generating XML. If they were writing .doc format it might not have been _much_ more difficult, but it would probably have tipped the balance to the point where I wouldn’t even have tried (actually, I probably wouldn’t have installed the office suites either).

    Wouldn’t it have been better, for example, if Microsoft had worked with the MathML WG to help evolve the standard?

    Ah, well, now there’s an interesting point. Microsoft are a member of the MathML3 WG and were a member of the MathML2 one as well (see the list of group members). Neither Sun nor IBM have chosen to join, although IBM was a member of the earlier WGs of course. Personally I’d be really happy to see someone who works closely with an ODF implementation join the WG, but it’s up to individual W3C member organisations to decide for themselves whether to join.

  • Anonymous 2007/06/02, 22:38

    Two words: plain text.

  • Anonymous 2007/06/02, 23:27

    Word 2007 uses MathML internally and on the clipboard for equation editing: http://blogs.msdn.com/microsoft_office_word/archive/2006/10/04/Equations-in-Word-2007.aspx

    That breaks compatibility with previous versions of Word (<=2003) that used a version of proprietary and inferior MathType. Now that Microsoft has moved to an open standard, they should be applauded, not criticized for this change. I hope it won’t be long before these publications’ review workflows catch up to this open standard.

  • Anonymous 2007/06/03, 04:52

    I really wish that the new equation editor format was completely compatible, but this seems to be a bit overblown. You can just save the document as Word 97-2003 compatible and use the MathML editor for work that is going to be published.

    The actual look of the equations made with the new version of Office is really nice; maybe with a few updates it will be more supportable.

  • Anonymous 2007/06/03, 07:41

    Speaking as a working researcher trying to use open tools for preparing manuscripts:

    i) The big missing feature in OpenOffice from the scientist’s point of view is a reference manager (the equivalent of EndNote/ReferenceManager etc). It is unthinkable these days to write a manuscript without one. For OpenOffice, the best I know of is BIBUS. Then of course the journals have to accept those manuscripts…

    ii) I wish more journals in biological sciences would accept LaTeX.

  • Rob 2007/06/03, 10:23

    There seems to be some misunderstanding here. Word 2007 does not use MathML internally. Their file format uses a new math markup called Office Math Markup Language (OMML), developed internally by Microsoft rather than going through the peer review and standards development process of the W3C’s MathML Activity, where MathML 3.0 is currently being drafted.

    OMML is not supported by legacy versions of Office, so saving your document as Office 2003 format, or using the OOXML Compatibility Pack in Office 2003, won’t help you. If you do that your formulas will be converted to images and then cannot be further edited.

    It isn’t clear to me why they don’t just convert the formulas back to the pre-Office 2007 MathType format when saving to the legacy formats.

    If you use Microsoft’s ODF Add-in for Word, the formulas will simply be dropped altogether.

    So it comes down to this: if you use the equation editor in Office 2007 you can only collaborate with other Office 2007 users. Colleagues using older versions of Office, or Office for the Mac, or open source alternatives like OpenOffice, none of them will be able to collaborate with you.

  • Anonymous 2007/06/03, 21:09

    Office suites are fine for writing office documents, but you must learn LaTeX if you need to explain anything mathematical. Equation editors are all tedious and slow. But writing equations in LaTeX is as smooth as typing.
