OOXML’s (Out of) Control Characters

2008/03/24 By Rob 14 Comments

Let’s start with the concepts of “lexical” and “value” spaces in XML, as well as the mechanism of “derivation by restriction” in XML Schema. Any engineer can understand the basics here, even if you don’t eat and drink XML for breakfast.

The value space for an XML data item comprises the set of all allowed values. So the value space for the “float” data type would be all floating point numbers, such as 12.34 or 43.21. The lexical space comprises all ways of expressing these values in the character stream of an XML document. So lexical representations of the value 12.34 include “12.34”, “12.340” and “1.234E1”. For ease of illustration I will indicate value space items in bold, and lexical space items in quotes. In general there are multiple lexical representations that may represent the same value.
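To make the distinction concrete, here is a minimal sketch that uses Python's built-in float parser as a stand-in for the xsd:float lexical-to-value mapping. That substitution is an assumption for illustration only (Python floats are 64-bit, and xsd:float has its own lexical rules), but the idea carries over: several strings, one value.

```python
# Three lexical representations taken from the paragraph above.
lexical_forms = ["12.34", "12.340", "1.234E1"]

# Map each lexical form into the value space by parsing it.
values = [float(text) for text in lexical_forms]

# The strings differ as text, yet they all denote the same value.
distinct_lexical = len(set(lexical_forms))
distinct_values = len(set(values))

print(distinct_lexical, distinct_values)  # prints "3 1"
```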

Character data in XML also permits more than one lexical representation of the same value. For example, “A” and “&#x41;” both represent the value A. The “numerical character reference” approach allows an XML author to easily encode the occasional Unicode character which is not part of the author’s native editing environment, e.g., adding the copyright character or occasional foreign character. The value space allowed by XML includes most of Unicode, including all of the major writing systems of the world, current and historical.
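A quick way to see this for yourself, sketched with the XML parser from Python's standard library (the element name t is just a placeholder of mine): a literal character and its decimal or hexadecimal character reference all arrive in the application as the same value.

```python
import xml.etree.ElementTree as ET

# The same value written three ways: literally, as a decimal
# character reference, and as a hexadecimal character reference.
literal = ET.fromstring("<t>A</t>").text
decimal_ref = ET.fromstring("<t>&#65;</t>").text
hex_ref = ET.fromstring("<t>&#x41;</t>").text

# A character that may be awkward to type directly: COPYRIGHT SIGN.
copyright_ref = ET.fromstring("<t>&#xA9;</t>").text

print(literal == decimal_ref == hex_ref)  # prints "True"
print(copyright_ref == "\u00a9")          # prints "True"
```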

My concern with DIS 29500 is Ecma’s introduction of an ST_Xstring (Escaped String) datatype. This new type is defined via the following XML Schema definition:

This uses the “derivation by restriction” facility of XML Schema to define a new type, derived from the standard xsd:string schema type. The xsd:string type is defined to allow only character values that are also allowed in the XML standard.

The use of derivation by restriction implies a clear relationship between the ST_Xstring type and the base type xsd:string. This is stated in XML Schema Part 1, clause 2.2.1.1:

A type definition whose declarations or facets are in a one-to-one relation with those of another specified type definition, with each in turn restricting the possibilities of the one it corresponds to, is said to be a restriction.

The specific restrictions might include narrowed ranges or reduced alternatives. Members of a type, A, whose definition is a restriction of the definition of another type, B, are always members of type B as well.

The last sentence can be taken as a restatement of the Liskov Substitution Principle, a fundamental principle of interface design, that a subtype should be usable (substitutable) wherever a base type is usable. It is this principle that ensures interoperability. A type derived by restriction limits, restricts, constrains, reduces the permitted value space of its base type, but it cannot increase the value space beyond that permitted by its base type.

So, with that background, let’s now look at how OOXML defines the semantics of its ST_Xstring type:

ST_Xstring (Escaped String)

String of characters with support for escaped invalid-XML characters.

For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character’s value. [Example: The Unicode character 8 is invalid in an XML 1.0 document, so it shall be escaped as _x0008_. end example]

This simple type’s contents are a restriction of the XML Schema string datatype.

In other words, although ST_Xstring is declared to be a restriction of xsd:string it is, via a proprietary escape notation, in fact expanding the semantics of xsd:string to create a value space that includes additional characters, including characters that are invalid in XML.
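Here is a rough sketch of what that expansion looks like in practice. The decoder is my own reading of the _xHHHH_ description quoted above, not code from the specification, and the function names are made up for illustration. The character test follows the Char production of XML 1.0.

```python
import re

# Assumed reading of the escape format: "_x", four hex digits, "_".
ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

def decode_xstring(text: str) -> str:
    """Replace each _xHHHH_ escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def is_xml10_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def in_xsd_string_value_space(text: str) -> bool:
    """An xsd:string value may contain only XML 1.0 characters."""
    return all(is_xml10_char(ch) for ch in text)

lexical = "A_x0008_BC"
value = decode_xstring(lexical)

print(in_xsd_string_value_space(lexical))  # prints "True"
print(in_xsd_string_value_space(value))    # prints "False"
```

The escaped text is a perfectly good xsd:string. The value it is meant to stand for is not, which is exactly what a restriction is not supposed to allow.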

Let’s review some of the problems it introduces.

First, the semantics of XML strings that contain invalid XML characters are undefined by this or any other standard. For example, OOXML uses ST_Xstring in Part 4, Clause 3.3.1.30 to store the error message which should be displayed when a data validation formula fails. But what should an OOXML-supporting application do when given a display string which contains control characters from the C0 control range, characters forbidden in XML 1.0?

U+0004 END OF TRANSMISSION
U+0006 ACKNOWLEDGE
U+0007 BELL
U+0008 BACKSPACE
U+0016 SYNCHRONOUS IDLE

How should these characters be displayed?

There is a reason XML excludes these dumb terminal control codes. They are neither desired nor necessary in XML.
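You can confirm the prohibition with any conforming parser. The sketch below uses the expat-based parser in Python's standard library and tries a character reference for every C0 code point. The helper name and the t element are mine.

```python
import xml.etree.ElementTree as ET

def parses(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Try a character reference for each code point from U+0000 to U+001F.
accepted = {cp for cp in range(0x20) if parses(f"<t>&#x{cp:X};</t>")}

# Only tab, line feed and carriage return survive.
print(sorted(accepted))  # prints "[9, 10, 13]"
```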

Elliotte Rusty Harold explains the rationale for this prohibition in his book Effective XML:

The first 32 Unicode characters with code points 0 to 31 are known as the C0 controls. They were originally defined in ASCII to control teletypes and other monospace dumb terminals. Aside from the tab, carriage return, and line feed they have no obvious meaning in text. Since XML is text, it does not include binary characters such as NULL (#x00), BEL (#x07), DC1 (#x11) through DC4 (#x14), and so forth. These noncharacters are historic relics. XML 1.0 does not allow them.

This is a good thing. Although dumb terminals and binary-hostile gateways are far less common today than they were twenty years ago, they are still used, and passing these characters through equipment that expects to see plain text can have nasty consequences, including disabling the screen.

Further, since these characters are undefined in XML, they are unlikely to work well with existing accessibility interfaces and devices. At best these characters will be ignored and introduce subtle errors. For example, what does “$10,[BS]000” become if one system processes the backspace and another does not? Worst case, the accessibility interface expecting a certain range of characters as defined by the xsd:string type will crash when presented with values beyond the expected range.
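To illustrate the backspace example, here is a small sketch of two plausible consumers. Neither behavior is specified anywhere, which is the point; both functions are hypothetical.

```python
def apply_backspaces(text: str) -> str:
    """Consumer A: treat U+0008 like a terminal would, erasing the
    previous character."""
    out = []
    for ch in text:
        if ch == "\b":
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)

def drop_backspaces(text: str) -> str:
    """Consumer B: silently ignore U+0008."""
    return text.replace("\b", "")

amount = "$10,\b000"

print(apply_backspaces(amount))  # prints "$10000"
print(drop_backspaces(amount))   # prints "$10,000"
```

Same input, two different answers, and no standard to say which one is right.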

Interfaces with existing programming languages are also harmed by ST_Xstring. How does a C or C++ XML parser deal with XML that now can allow a U+0000 (NULL) character in the middle of a string, when those languages conventionally treat that character as the end of the string?
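The effect is easy to demonstrate without writing any C, using Python's ctypes module to view a buffer the way NUL-terminated string code would. This is only a sketch of the general hazard; the sample string is invented.

```python
import ctypes

# The value an OOXML-aware decoder would produce for "AB_x0000_CD".
decoded = "AB\x00CD"
encoded = decoded.encode("utf-8")

# Copy the bytes into a C char array (ctypes appends a terminating NUL).
buffer = ctypes.create_string_buffer(encoded)

# .raw is the whole array; .value reads it as a C string, stopping at
# the first NUL byte, just as strlen() or strcpy() would.
as_c_string = buffer.value

print(len(encoded))   # prints "5"
print(as_c_string)    # prints "b'AB'"
```

Everything after the embedded NUL is silently lost to any code that treats the buffer as an ordinary C string.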

What about XML database interfaces that take XML data and store it in relational tables? If they are schema-aware and see that ST_Xstring is merely a restriction of xsd:string, they will assume the normal range of characters can be stored wherever an xsd:string can be stored. But since the value space is expanded, there is no guarantee that this will still be true. These characters may cause validation errors in the database.

By now, the observant reader may be accusing me of pulling a fast one. “But Rob, none of the above is a problem if the application simply leaves the ST_Xstring encoded and does not try to decode or interpret the non-XML character,” you might say.

OK. Fair enough. Let’s follow that approach and see where it leads us.

Let’s look at interoperability with other XML-based standards. Imagine you do a DOM parse of an OOXML document that contains “strings” of type ST_Xstring. Either your parser/application is OOXML-aware, or it isn’t. In other words, either it is able to interpret the non-standard _xHHHH_ instructions, or it isn’t.

If it doesn’t understand them, then any other code that operates on the DOM nodes with ST_Xstring data is at risk of returning the wrong answer. For example, what is the length of the string “ABC”? Three characters, of course. But what is the length of the string “_x0041_BC”? These two strings both have the same value according to OOXML. But an XML application might return 9 or return 3, depending on whether it is OOXML-aware or not. Since most (perhaps all) XML parsers are unaware of the non-standard escape mechanism proposed by OOXML, they will typically calculate things such as string lengths, string comparisons, string sorting, etc., incorrectly.
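A short sketch of the disagreement, again using my own assumed decoder for the _xHHHH_ notation (the names here are illustrative, not anything defined by OOXML):

```python
import re

ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

def decode_xstring(text: str) -> str:
    """Assumed decoder: replace each _xHHHH_ escape with its character."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

plain = "ABC"
escaped = "_x0041_BC"

# Length: an unaware application counts the escape's nine characters.
unaware_length = len(escaped)
aware_length = len(decode_xstring(escaped))

# Equality: the two strings only match after decoding.
unaware_equal = plain == escaped
aware_equal = plain == decode_xstring(escaped)

# Sorting: "_" sorts after "B", but the "A" it stands for sorts before.
names = ["B", "_x0041_"]
unaware_sorted = sorted(names)
aware_sorted = sorted(names, key=decode_xstring)

print(unaware_length, aware_length)  # prints "9 3"
print(unaware_equal, aware_equal)    # prints "False True"
print(unaware_sorted, aware_sorted)
```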

But suppose the parser/application is OOXML-aware and correctly decodes these character references into the correct Unicode values, then what? Assuming the host language doesn’t crash from the existence of these control characters, we then are presented with problems at the interface with any other code that operates on the DOM. Suppose we try to transform the DOM via XSLT to XHTML. Will the XSLT engine properly handle the existence of these forbidden character values? The XSLT engine may just crash. But suppose it doesn’t. How does it write out these control characters into XHTML? It can’t. These values are not permitted in XHTML. Dead end. What about DocBook? DITA? OpenDocument Format? Not possible. Since these characters are not permitted in XML 1.0 at all, they will be forbidden in all other markup languages that are based on XML 1.0, or even XML 1.1 for that matter (XML 1.1 allows some but not all of these characters, in particular the NULL character is excluded).
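The dead end can be sketched with Python's standard library. Its ElementTree serializer happens not to check character validity on output, so it will write the decoded control character as-is. Other toolkits may refuse earlier; this is a sketch of one toolkit's behavior, not a claim about all of them. The p element is a placeholder.

```python
import xml.etree.ElementTree as ET

# Pretend an OOXML-aware step has already decoded "A_x0008_BC".
element = ET.Element("p")
element.text = "A\x08BC"

# The serializer writes the raw U+0008 into the output unchanged.
serialized = ET.tostring(element)

# The next consumer in the pipeline cannot read the result.
try:
    ET.fromstring(serialized)
    reparsed = True
except ET.ParseError:
    reparsed = False

print(b"\x08" in serialized)  # prints "True"
print(reparsed)               # prints "False"
```

The producer ran without complaint, and the failure only shows up downstream.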

Note further that with XML pipelining and with mashups, the application that writes XML output typically does not have direct knowledge of the application that originally produced the XML values. This decoupling of producers and consumers is an essential aspect of modern systems integration, including Web Services. By corrupting XML string values in the way that it does, DIS 29500 breaks the ability to have loosely coupled systems. Once the value space is polluted by these aberrant control characters, every application, every process that touches this data must be aware of their non-standard idiosyncrasies lest they crash or return incorrect answers. In this way, one standard perverts the entire XML universe, forcing them all to contend with the poor hygiene of a single vendor.

The reader might think that I exaggerate the importance of this, that surely ST_Xstring is only used in OOXML in edge cases, in rare, compatibility modes. We wish that this were true. However, a look at DIS 29500 shows that ST_Xstring is pervasive, and in fact is the predominant data type in SpreadsheetML, used to express the vast majority of spreadsheet content, including cell contents, headers, footers, display strings, error strings, tooltip help, range names, etc. Any application that operates on an OOXML spreadsheet will need to deal with this mess.

For example, here are some uses of ST_Xstring in DIS 29500, Part 4:

Clause 3.2.3 for the name of a custom view in a spreadsheet
Clause 3.2.5 for the name of a spreadsheet named range, for the descriptive comment, for the name description, for the help topic, the keyboard shortcut, the status bar text and for the menu item text
Clause 3.2.14 for the name of a spreadsheet function group
Clause 3.2.19 for the name of a sheet in a workbook
Clause 3.2.22 for the name of a smart tag as well as for the URL of a smart tag.
Clause 3.2.25 for the destination file name and title when publishing spreadsheet to the web.
Clause 3.3.1.10 for the value of a conditional formatting object, e.g., a gradient
Clause 3.3.1.20 for the name of a custom property
Clause 3.3.1.28 for sheet and range names
Clause 3.3.1.30 for error message string, error message title, prompt string and prompt title in a spreadsheet data validation definition.
Clause 3.3.1.35 for the value of a footer for even numbered pages.
Clause 3.3.1.36 for the value of a header for even numbered pages.
Clause 3.3.1.38 for the content of the first page footer
Clause 3.3.1.39 for the content of the first page header
Clause 3.3.1.44 for the display string for a hyperlink, the tooltip help for the link, also the anchor target if the hyperlink is to an HTML page
Clause 3.3.1.49 for values of input cells in a scenario
Clause 3.3.1.50 for cell inline text values
Clause 3.3.1.55 for the value of a footer for odd numbered pages.
Clause 3.3.1.56 for the value of a header for odd numbered pages.
Clause 3.3.1.73, in scenarios for the comment text, the scenario name and the name of the person who last changed the scenario.
Clause 3.3.1.88 when defining a sort condition, for the values of the custom sort list
Clause 3.3.1.93 for the value contained within a cell
Clause 3.3.1.94 for information associated with items published to the web, including the destination file and the title of the output HTML file
Clause 3.3.2.2 for expressing the criteria values in a filter
Clause 3.3.15 for the key/values for smart tag properties
Clause 3.4.4 for expressing the contents of a rich text run
Clause 3.4.5 for expressing the name of a font
Clause 3.4.6 for expressing the text of a phonetic hint for East Asian text
Clause 3.4.8 for expressing a text item in the shared string table
Clause 3.4.12 for the text content shown as part of a string
Clause 3.5.1.2 for a table, expressing a textual comment, a display name as well as style names.
Clause 3.5.1.3 for a table column, expressing cell and row style names, column name
Clause 3.5.1.7 for column properties created from an XML mapping, for expressing the associated XPath.
Clause 3.5.2.4 for the XPath associated with column properties for XML tables
Clause 3.7.1-3.7.6 for specifying content of tracked comments, including the text of the comments as well as the authors of the comments
Clause 3.8.29 expressing the name of a font

There are hundreds of additional uses. A search of DIS 29500 Part 4 for “ST_Xstring” returns 467 hits. OOXML also defines two additional types, “lpstr” (7.4.2.8) and “bstr” (7.4.2.4) that have the same flaw as ST_Xstring.

The reader might further argue that, although the type allows characters that are forbidden by XML, the actual occurrence of these values in real legacy documents is likely to be rare. This might be true, but this is cause for even greater concern. If every document contained these control characters, then we would immediately be aware of any interoperability problems when integrating OOXML data with other systems. But if these characters are permitted, but occur rarely and randomly, then the integration errors will also occur rarely and randomly, allowing data corruption and other problems to occur and propagate further before detection.

In summary, we are concerned that the ST_Xstring type in OOXML opens us up to problems such as:

Introducing accessibility problems
Breaking unaware C/C++ XML parsers
Breaking XML databases
Breaking interoperability with other XML languages
Breaking application logic related to string searching, sorting, comparisons, etc.
Introducing errors that will be hard to detect and resolve

Possible remedies include:

Use xsd:string uniformly instead of ST_Xstring, with no use of forbidden XML characters. This would require that applications that read legacy binary documents containing such characters eliminate them at this point, perhaps replacing them with licit characters or with whitespace. No application will be more able to determine the original meaning and intent of these characters than the original vendor. So they should be responsible for cleaning up these strings to make them XML-ready.
Use a non-string type such as the binary xsd:hexBinary or xsd:base64Binary to represent these data items.
Use a mixed content encoding, where the licit characters are represented by xsd:string data, and the forbidden characters are denoted by specially-defined elements. So “A_x0008_BC” would become: <text>A<backspace/>BC</text>. In this case the semantics of the <backspace> element would need to be documented in the DIS 29500 specification, including its effect on searching, sorting, length calculations, etc.
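As a rough sketch of the mixed content remedy: the converter below turns an escaped string into ordinary text plus marker elements. Only the backspace element name comes from the example above. The fallback control element with its code attribute, and all the function names, are hypothetical placeholders of mine.

```python
import re
import xml.etree.ElementTree as ET

ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

# Only "backspace" appears in the text above; other names would need
# to be defined by the specification.
ELEMENT_NAMES = {0x08: "backspace"}

def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the Char production of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

def append_text(root: ET.Element, text: str) -> None:
    """Add character data after the last child, or as leading text."""
    if not text:
        return
    if len(root):
        root[-1].tail = (root[-1].tail or "") + text
    else:
        root.text = (root.text or "") + text

def to_mixed_content(xstring: str) -> ET.Element:
    """Convert an escaped string into text plus marker elements."""
    root = ET.Element("text")
    pos = 0
    for match in ESCAPE.finditer(xstring):
        append_text(root, xstring[pos:match.start()])
        cp = int(match.group(1), 16)
        if is_xml10_char(cp):
            append_text(root, chr(cp))  # licit character: plain text
        elif cp in ELEMENT_NAMES:
            ET.SubElement(root, ELEMENT_NAMES[cp])
        else:
            ET.SubElement(root, "control", code=f"{cp:04X}")  # hypothetical
        pos = match.end()
    append_text(root, xstring[pos:])
    return root

markup = ET.tostring(to_mixed_content("A_x0008_BC"), encoding="unicode")
print(markup)  # prints "<text>A<backspace />BC</text>"
```

The output is plain XML 1.0, so any standard parser, XSLT engine or database can carry it, and only applications that care about the marker elements need to know what they mean.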

Comments

Anonymous says

2008/03/24 at 3:58 pm

By corrupting XML string values in the way that it does, DIS 29500 breaks the ability to have loosely coupled systems. Once the value space is polluted by these aberrant control characters, every application, every process that touches this data must be aware of their non-standard idiosyncrasies lest they crash or return incorrect answers.

That’s not a bug, Rob, it’s a feature.

Anonymous says

2008/03/24 at 4:12 pm

4. define entities for OOXML like &backspace; etc

Then again, aren’t we beyond the stage of improving this POS?

You’d think the combined might of Microsoft and ECMA would be able to produce something with fewer errors than normal specs, not the opposite…

jd

steve_l says

2008/03/24 at 4:27 pm

One problem with any escaping rule is how to handle double escapes, or when to unescape them.

Any XSL engine will still be able to handle escaped values in the text, just not unescape them. Sometimes that is good, sometimes it will be hopelessly bad. It depends entirely on the use.

Rob says

2008/03/24 at 4:50 pm

@steve,

I raised a similar issue in the US NB review of OOXML last summer. We did register one comment, US-0162, that pointed out that the “bstr” type lacked a mechanism to “escape the escape”, i.e., encode a literal value of _x0008_. This was addressed by Ecma Response 118 at the BRM. But it left untouched the other types with the same issue, like ST_Xstring and lpstr.

I would have reported ST_Xstring as a problem during the initial review, but we ran out of time.

Anonymous says

2008/03/24 at 8:43 pm

OOXML also defines two additional types, “lpstr” (7.4.2.8) and “bstr” (7.4.2.4)

I never read the OOXML documentation, but the name of these types gave me a bad feeling. Aren’t LPSTR and BSTR two Windows API string types? (BSTR being the COM UTF-16 counted string type and LPSTR being the C 8-bit “ANSI” zero-terminated string type.)

It wouldn’t surprise me if the source of the issues with these two string types is due to them attempting to serialize their native representation. In particular, BSTR can have embedded NULLs. Either way, it sounds like an implementation detail “leaking” into the specification.

Vexorian says

2008/03/25 at 10:58 am

I thought OOXML stood for “optionally open XML” but it looks to me it actually is a recursive acronym:

OOXML Obviously ain’t XML

billposer says

2008/03/25 at 3:41 pm

This is soooo typical of Microsoft. Having become somewhat of a collector of ascii escapes through the continued expansion of my programs uni2ascii/ascii2uni, which currently support 29 escape mechanisms, I thought I had seen them all. I was wrong. Microsoft has come up with yet another non-standard escape! They couldn’t use U+XXXX like the Unicode consortium, or &#xXXXX; as in HTML, or \uXXXX as in several programming languages? They just had to add yet another one?!

Anonymous says

2008/03/25 at 6:28 pm

MS is trying to outdo Houdini in escape artistry. OOXML should just be binary blobs between an open and close tag and spare us this nonsense. They ARE blobs as this post shows.

Michael S Collins says

2008/03/25 at 8:56 pm

Rob,
Once again you’ve done a great job of highlighting the technical deficiencies of OOXML. This is so much more effective than the pro/anti OOXML zealotry that is pervasive in both camps. The holes in the DIS 29500 spec that you’ve brought out in the open ought to encourage all the NBs to disapprove this proposed standard until such time as MS/ECMA fix these glaring flaws.
I used to wonder why MS was not content to let OOXML stand or fall on its technical merits – no more! Without the lobbying, ballot-stuffing, and subverting of the ISO processes, this “standard” wouldn’t have made it out of Redmond.
Keep up the good work!
-MC

Anonymous says

2008/03/26 at 5:50 pm

I was wrong. Microsoft has come up with yet another non-standard escape!
There! You see how we innovate!

Aren’t some of these C0 characters printing characters in some fonts? I know they used to be in the original PCs. Maybe it’s not so far-fetched that someone would use \u0008 in a document.

Anonymous says

2008/03/27 at 3:00 am

OOXML Obviously ain’t XML

Indeed, it should be called OpenBIFF.

Winter

cwitty says

2008/03/27 at 3:20 pm

@anonymous: Aren’t some of these C0 characters printing characters in some fonts? I know they used to be in the original PCs. Maybe it’s not so far-fetched that someone would use \u0008 in a document.

If somebody has a document with an 08 in it, where the 08 is supposed to represent “INVERSE BULLET”, then according to this table, this should be mapped to the Unicode character U+25d8.

Anonymous says

2008/05/06 at 7:48 am

I wish you wouldnt repeat propaganda lines like the following:

“There is a reason XML excludes these dumb terminal control codes. They are neither desired nor necessary in XML.”

The inability of XML at a native level to store ALL characters (and NO CDATA sections dont solve the problem) is a serious serious weakness in XML that has resulted in a) lower adoption than it should have, b) multiple other standards being used instead, and c) multiple reimplementations of differing and incompatible ways to encode an arbitrary binary data packet at the application layer so it can be stored in XML. This is an absolute nightmare for a serialization and data exchange format.

One of the most common frustrations i see in the development community relating to XML is that XML is mindboggling hard to use with completely arbitrary data packets. CDATA doesnt allow nesting, and XML as a whole doesnt allow a range of highly useful and important characters to be used in any way.

Now while i agree that the way MS has done this is pretty brain damaged, the claim that XML neither needs nor wants to support arbitrary chars is just BS that does not in any way match what I hear from developers wanting to use XML.

An example: a simple website wants to produce an XML feed of notes entered by their users. Since these notes are user entered they can contain pretty much anything. There is NO safe way to deliver this data faithfully using XML without additional application logic to handle encoding special content.

On this level XML is broken. And I think you undermine your basic argument by repeating such nonsense. And your basic argument is valid. So undermining it isnt all that clever.

And YES, i do know the arguments against leaving them out. They were specious then and are specious now.

Rob says

2008/05/06 at 8:52 am

A word of advice. If you find that using XML is “mindboggling hard to use with completely arbitrary data packets”, then maybe you shouldn’t be using XML for that task.

Just a thought.
