Let’s start with the concepts of “lexical” and “value” spaces in XML, as well as the mechanism of “derivation by restriction” in XML Schema. Any engineer can understand the basics here, even if you don’t eat and drink XML for breakfast.
The value space for an XML data item comprises the set of all allowed values. So the value space for the “float” data type would be all floating point numbers, such as 12.34 or 43.21. The lexical space comprises all ways of expressing these values in the character stream of an XML document. So lexical representations of the value 12.34 include “12.34”, “12.340” and “1.234E1”. For ease of illustration I will indicate value space items in bold, and lexical space items in quotes. In general there are multiple lexical representations that may represent the same value.
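The lexical/value distinction can be sketched in a few lines of Python. This is only an illustration: Python's built-in `float` stands in for the XML Schema “float” type, and the variable names are mine.

```python
# Three different lexical representations (character strings) ...
lexical_forms = ["12.34", "12.340", "1.234E1"]

# ... all map to one and the same item in the value space.
values = {float(s) for s in lexical_forms}

print(len(set(lexical_forms)))  # 3 distinct lexical items
print(len(values))              # 1 distinct value
```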
Character data in XML also permits more than one lexical representation of the same value. For example, “A” and “&#x41;” both represent the value A. The “numerical character reference” approach allows an XML author to easily encode the occasional Unicode character which is not part of the author’s native editing environment, e.g., adding the copyright character or occasional foreign character. The value space allowed by XML includes most of Unicode, including all of the major writing systems of the world, current and historical.
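A quick way to see this is to let a standard XML parser resolve the references. The sketch below assumes Python's standard `xml.etree.ElementTree`; the element name `v` is arbitrary.

```python
import xml.etree.ElementTree as ET

# The literal character and two numeric character references for it.
literal = ET.fromstring("<v>A</v>").text
hex_ref = ET.fromstring("<v>&#x41;</v>").text
dec_ref = ET.fromstring("<v>&#65;</v>").text

# The parser hands the application the value, not the lexical form.
print(literal == hex_ref == dec_ref)  # True
```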
My concern with DIS 29500 is Ecma’s introduction of an ST_Xstring (Escaped String) datatype. This new type is defined via the following XML Schema definition:
This uses the “derivation by restriction” facility of XML Schema to define a new type, derived from the standard xsd:string schema type. The xsd:string type is defined to allow only character values that are also allowed in the XML standard.
The use of derivation by restriction implies a clear relationship between the ST_Xstring type and the base type xsd:string. This is stated in XML Schema Part 1, clause 2.2.1.1:
A type definition whose declarations or facets are in a one-to-one relation with those of another specified type definition, with each in turn restricting the possibilities of the one it corresponds to, is said to be a restriction.
The specific restrictions might include narrowed ranges or reduced alternatives. Members of a type, A, whose definition is a restriction of the definition of another type, B, are always members of type B as well.
The last sentence can be taken as a restatement of the Liskov Substitution Principle, a fundamental principle of interface design, that a subtype should be usable (substitutable) wherever a base type is usable. It is this principle that ensures interoperability. A type derived by restriction limits, restricts, constrains, reduces the permitted value space of its base type, but it cannot increase the value space beyond that permitted by its base type.
So, with that background, let’s now look at how OOXML defines the semantics of its ST_Xstring type:
ST_Xstring (Escaped String)
String of characters with support for escaped invalid-XML characters.
For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character’s value. [Example: The Unicode character 8 is invalid in an XML 1.0 document, so it shall be escaped as _x0008_. end example]
This simple type’s contents are a restriction of the XML Schema string datatype.
In other words, although ST_Xstring is declared to be a restriction of xsd:string, it in fact uses a proprietary escape notation to expand the semantics of xsd:string, creating a value space that includes additional characters, among them characters that are invalid in XML.
Let’s review some of the problems it introduces.
First, the semantics of XML strings that contain invalid XML characters are undefined by this or any other standard. For example, OOXML uses ST_Xstring in Part 4, Clause 188.8.131.52 to store the error message which should be displayed when a data validation formula fails. But what should an OOXML-supporting application do when given a display string which contains control characters from the C0 control range, characters forbidden in XML 1.0?
- U+0004 END OF TRANSMISSION
- U+0006 ACKNOWLEDGE
- U+0007 BELL
- U+0008 BACKSPACE
- U+0016 SYNCHRONOUS IDLE
How should these characters be displayed?
There is a reason XML excludes these dumb terminal control codes. They are neither desired nor necessary in XML.
Elliotte Rusty Harold explains the rationale for this prohibition in his book Effective XML:
The first 32 Unicode characters with code points 0 to 31 are known as the C0 controls. They were originally defined in ASCII to control teletypes and other monospace dumb terminals. Aside from the tab, carriage return, and line feed they have no obvious meaning in text. Since XML is text, it does not include binary characters such as NULL (#x00), BEL (#x07), DC1 (#x11) through DC4 (#x14), and so forth. These noncharacters are historic relics. XML 1.0 does not allow them.
This is a good thing. Although dumb terminals and binary-hostile gateways are far less common today than they were twenty years ago, they are still used, and passing these characters through equipment that expects to see plain text can have nasty consequences, including disabling the screen.
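Ordinary XML parsers enforce this prohibition. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` (which is backed by expat); the helper name `parses` is my own.

```python
import xml.etree.ElementTree as ET

def parses(doc: str) -> bool:
    """Return True if the document is accepted as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(parses("<v>A&#x9;B</v>"))    # tab is one of the permitted C0 characters
print(parses("<v>A&#x8;B</v>"))    # a reference to BACKSPACE is rejected
print(parses("<v>A\x08B</v>"))     # a raw BACKSPACE is rejected as well
print(parses("<v>A_x0008_B</v>"))  # the OOXML escape is just nine ordinary characters
```

The last line is the crux: the `_xHHHH_` notation passes through the parser only because, as far as XML is concerned, it is plain text with no special meaning.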
Further, since these characters are undefined in XML, they are unlikely to work well with existing accessibility interfaces and devices. At best these characters will be ignored and introduce subtle errors. For example, what does “$10,[BS]000” become if one system processes the backspace and another does not? Worst case, the accessibility interface expecting a certain range of characters as defined by the xsd:string type will crash when presented with values beyond the expected range.
Interfaces with existing programming languages are also harmed by ST_Xstring. How does a C or C++ XML parser deal with XML that can now contain a U+0000 (NULL) character in the middle of a string, something that the null-terminated strings conventional in those languages cannot represent?
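The truncation hazard can be reproduced without writing any C. This sketch uses Python's standard `ctypes` module to view a buffer the way C string functions would; the sample bytes are made up for illustration.

```python
import ctypes

# Five bytes of "text" with a NUL in the middle, as a decoded _x0000_ would yield.
decoded = b"AB\x00CD"

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(decoded)

# .raw is the whole buffer; .value stops at the first NUL, as strlen() would.
print(len(buf.raw))  # 6
print(buf.value)     # b'AB'
```

Everything after the embedded NUL is silently lost to any code that treats the buffer as a C string.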
What about XML database interfaces that take XML data and store it in relational tables? If they are schema-aware and see that ST_Xstring is merely a restriction of xsd:string, they will assume the normal range of characters can be stored wherever an xsd:string can be stored. But since the value space is expanded, there is no guarantee that this will still be true. These characters may cause validation errors in the database.
By now, the observant reader may be accusing me of pulling a fast one. “But Rob, none of the above is a problem if the application simply leaves the ST_Xstring encoded and does not try to decode or interpret the non-XML character,” you might say.
OK. Fair enough. Let’s follow that approach and see where it leads us.
Let’s look at interoperability with other XML-based standards. Imagine you do a DOM parse of an OOXML document that contains “strings” of type ST_Xstring. Either your parser/application is OOXML-aware, or it isn’t. In other words, either it is able to interpret the non-standard _xHHHH_ instructions, or it isn’t.
If it doesn’t understand them, then any other code that operates on the DOM nodes with ST_Xstring data is at risk of returning the wrong answer. For example, what is the length of the string “ABC”? Three characters, of course. But what is the length of the string “_x0041_BC”? These two strings have the same value according to OOXML. But an XML application might return 9 or return 3, depending on whether it is OOXML-aware or not. Since most (all) XML parsers are unaware of the non-standard escape mechanism proposed by OOXML, they will typically calculate things such as string lengths, string comparisons, string sorting, etc., incorrectly.
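To make the discrepancy concrete, here is a minimal sketch of an `_xHHHH_` decoder in Python. The function name `decode_xstring` and the regular expression are my own assumptions based on the format quoted above, not the normative OOXML algorithm, and corner cases such as escaping a literal underscore are ignored.

```python
import re

# Assumed escape format: _xHHHH_ with exactly four hexadecimal digits.
ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

def decode_xstring(lexical: str) -> str:
    """Replace each _xHHHH_ escape with the character it denotes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), lexical)

raw = "_x0041_BC"
print(len(raw))                      # 9: what an unaware application sees
print(len(decode_xstring(raw)))      # 3: what an OOXML-aware application sees
print(raw == "ABC")                  # False
print(decode_xstring(raw) == "ABC")  # True

# Sorting diverges too: "_" sorts after "B", but the decoded "A" sorts before it.
names = ["_x0041_", "B"]
print(sorted(names))                      # ['B', '_x0041_']
print(sorted(names, key=decode_xstring))  # ['_x0041_', 'B']
```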
But suppose the parser/application is OOXML-aware and correctly decodes these character references into the correct Unicode values, then what? Assuming the host language doesn’t crash from the existence of these control characters, we are then presented with problems at the interface with any other code that operates on the DOM. Suppose we try to transform the DOM via XSLT to XHTML. Will the XSLT engine properly handle the existence of these forbidden character values? The XSLT engine may just crash. But suppose it doesn’t. How does it write out these control characters into XHTML? It can’t. These values are not permitted in XHTML. Dead end. What about DocBook? DITA? OpenDocument Format? Not possible. Since these characters are not permitted in XML 1.0 at all, they will be forbidden in all other markup languages that are based on XML 1.0, or even XML 1.1 for that matter (XML 1.1 allows some but not all of these characters; in particular, the NULL character is excluded).
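The dead end can be demonstrated with a round trip. This is a sketch that assumes CPython's standard `xml.etree.ElementTree`, whose serializer (at the time of writing) escapes markup characters but passes control characters through untouched; other toolkits may instead refuse to serialize at all. The element name `p` stands in for any target vocabulary.

```python
import xml.etree.ElementTree as ET

# What an OOXML-aware reader would hold in memory for "A_x0008_BC".
decoded = "A\x08BC"

# Hand the decoded value to a generic XML writer.
p = ET.Element("p")
p.text = decoded
serialized = ET.tostring(p, encoding="unicode")

# The writer emitted the raw control character, so the output is not XML.
try:
    ET.fromstring(serialized)
    reparsed = True
except ET.ParseError:
    reparsed = False

print("\x08" in serialized)  # True
print(reparsed)              # False
```

The producer has no legal way to write the value, and the consumer has no way to read what was written.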
Note further that with XML pipelining and with mashups, the application that writes XML output typically does not have direct knowledge of the application that originally produced the XML values. This decoupling of producers and consumers is an essential aspect of modern systems integration, including Web Services. By corrupting XML string values in the way that it does, DIS 29500 breaks the ability to have loosely coupled systems. Once the value space is polluted by these aberrant control characters, every application, every process that touches this data must be aware of their non-standard idiosyncrasies lest they crash or return incorrect answers. In this way, one standard perverts the entire XML universe, forcing every other XML application to contend with the poor hygiene of a single vendor.
The reader might think that I exaggerate the importance of this, that surely ST_Xstring is only used in OOXML in edge cases, in rare compatibility modes. I wish that this were true. However, a look at DIS 29500 shows that ST_Xstring is pervasive, and in fact is the predominant data type in SpreadsheetML, used to express the vast majority of spreadsheet content, including cell contents, headers, footers, display strings, error strings, tooltip help, range names, etc. Any application that operates on an OOXML spreadsheet will need to deal with this mess.
For example, here are some uses of ST_Xstring in DIS 29500, Part 4:
- Clause 3.2.3 for the name of a custom view in a spreadsheet
- Clause 3.2.5 for the name of a spreadsheet named range, for the descriptive comment, for the name description, for the help topic, the keyboard shortcut, the status bar text and for the menu item text
- Clause 3.2.14 for the name of a spreadsheet function group
- Clause 3.2.19 for the name of a sheet in a workbook
- Clause 3.2.22 for the name of a smart tag as well as for the URL of a smart tag.
- Clause 3.2.25 for the destination file name and title when publishing a spreadsheet to the web.
- Clause 184.108.40.206 for the value of a conditional formatting object, e.g., a gradient
- Clause 220.127.116.11 for the name of a custom property
- Clause 18.104.22.168 for sheet and range names
- Clause 22.214.171.124 for error message string, error message title, prompt string and prompt title in a spreadsheet data validation definition.
- Clause 126.96.36.199 for the value of a footer for even numbered pages.
- Clause 188.8.131.52 for the value of a header for even numbered pages.
- Clause 184.108.40.206 for the content of the first page footer
- Clause 220.127.116.11 for the content of the first page header
- Clause 18.104.22.168 for the display string for a hyperlink, the tooltip help for the link, also the anchor target if the hyperlink is to an HTML page
- Clause 22.214.171.124 for values of input cells in a scenario
- Clause 126.96.36.199 for cell inline text values
- Clause 188.8.131.52 for the value of a footer for odd numbered pages.
- Clause 184.108.40.206 for the value of a header for odd numbered pages.
- Clause 220.127.116.11, in scenarios for the comment text, the scenario name and the name of the person who last changed the scenario.
- Clause 18.104.22.168 when defining a sort condition, for the values of the custom sort list
- Clause 22.214.171.124 for the value contained within a cell
- Clause 126.96.36.199 for information associated with items published to the web, including the destination file and the title of the output HTML file
- Clause 188.8.131.52 for expressing the criteria values in a filter
- Clause 3.3.15 for the key/values for smart tag properties
- Clause 3.4.4 for expressing the contents of a rich text run
- Clause 3.4.5 for expressing the name of a font
- Clause 3.4.6 for expressing the text of a phonetic hint for East Asian text
- Clause 3.4.8 for expressing a text item in the shared string table
- Clause 3.4.12 for the text content shown as part of a string
- Clause 184.108.40.206 for a table, expressing a textual comment, a display name as well as style names.
- Clause 220.127.116.11 for a table column, expressing cell and row style names, column name
- Clause 18.104.22.168 for column properties created from an XML mapping, for expressing the associated XPath.
- Clause 22.214.171.124 for the XPath associated with column properties for XML tables
- Clause 3.7.1-3.7.6 for specifying content of tracked comments, including the text of the comments as well as the authors of the comments
- Clause 3.8.29 expressing the name of a font
There are hundreds of additional uses. A search of DIS 29500 Part 4 for “ST_Xstring” returns 467 hits. OOXML also defines two additional types, “lpstr” (126.96.36.199) and “bstr” (188.8.131.52), that have the same flaw as ST_Xstring.
The reader might further argue that, although the type allows characters that are forbidden by XML, the actual occurrence of these values in real legacy documents is likely to be rare. This might be true, but this is cause for even greater concern. If every document contained these control characters, then we would immediately be aware of any interoperability problems when integrating OOXML data with other systems. But if these characters are permitted, but occur rarely and randomly, then the integration errors will also occur rarely and randomly, allowing data corruption and other problems to occur and propagate further before detection.
In summary, we are concerned that the ST_Xstring type in OOXML opens us up to problems such as:
- Introducing accessibility problems
- Breaking unaware C/C++ XML parsers
- Breaking XML databases
- Breaking interoperability with other XML languages
- Breaking application logic related to string searching, sorting, comparisons, etc.
- Introducing errors that will be hard to detect and resolve
Possible remedies include:
- Use xsd:string uniformly instead of ST_Xstring, with no use of forbidden XML characters. This would require that applications that read legacy binary documents containing such characters eliminate them at that point, perhaps replacing them with licit characters or with whitespace. No application will be better able to divine the original meaning and intent of these characters than the original vendor's, so that vendor should be responsible for cleaning up these strings to make them XML-ready.
- Use a non-string type such as the binary xsd:hexBinary or xsd:base64Binary to represent these data items.
- Use a mixed content encoding, where the licit characters are represented by xsd:string data, and the forbidden characters are denoted by specially-defined elements. So “A_x0008_BC” would become: <text>A<backspace/>BC</text>. In this case the semantics of the <backspace> element would need to be documented in the DIS 29500 specification, including its effect on searching, sorting, length calculations, etc.
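
The mixed content remedy is straightforward to implement. Below is a rough sketch in Python, assuming the `_xHHHH_` format described earlier; the function name `to_mixed_content`, the mapping table and the element names other than <backspace> are hypothetical, since no such vocabulary is defined in DIS 29500.

```python
import re
import xml.etree.ElementTree as ET

ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

# Hypothetical mapping from forbidden code points to element names.
CONTROL_ELEMENTS = {0x0007: "bell", 0x0008: "backspace"}

def to_mixed_content(escaped: str) -> ET.Element:
    """Build <text> mixed content with one empty element per mapped escape."""
    root = ET.Element("text")
    root.text = ""
    last = None  # most recently appended child element, if any
    pos = 0
    for m in ESCAPE.finditer(escaped):
        name = CONTROL_ELEMENTS.get(int(m.group(1), 16))
        if name is None:
            continue  # unmapped escapes stay as literal text in this sketch
        literal = escaped[pos:m.start()]
        if last is None:
            root.text += literal
        else:
            last.tail = literal
        last = ET.SubElement(root, name)
        pos = m.end()
    # Whatever follows the final mapped escape.
    if last is None:
        root.text += escaped[pos:]
    else:
        last.tail = escaped[pos:]
    return root

elem = to_mixed_content("A_x0008_BC")
xml_out = ET.tostring(elem, encoding="unicode")
ET.fromstring(xml_out)  # the result is well-formed XML and parses cleanly
```

The output contains only characters that XML permits, so every downstream tool can at least read it, and the meaning of each control element becomes something the specification has to define explicitly.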