
An Antic Disposition


OOXML

Release the OOXML final DIS text now!

2008/05/04 By Rob 24 Comments

The JTC1 Directives [pdf] are quite clear on this point. After a Ballot Resolution Meeting (BRM), if the text is approved, the edited, final version of the text is to be distributed to NB’s within 1 month. This requirement is in the Fast Track part of JTC1 Directives, specifically in 13.12:

13.12 The time period for post ballot activities by the respective responsible parties shall be as follows:
…

  • In not more than one month after the ballot resolution group meeting the SC Secretariat shall distribute the final report of the meeting and final DIS text in case of acceptance.

The OOXML BRM ended on February 29th. One month after February 29th, if my course work in scientific computing does not fail me, is… let’s see, carry the 3, multiply, convert to sidereal time, account for proper nutation of the solar mean, subtract the perihelion distance at first point of Aries, OK. Got it. Simple. One month later is approximately March 29th +/- 3 days.

So the SC34 Secretariat should have distributed the “final DIS text” by March 29th, or at the very least, when the final ballot results on OOXML were known a few days later.

But that didn’t happen. Nothing. Silence. What is the hang up? I note that when NB’s said that the Fast Track schedule did not give sufficient time to review OOXML, the response from ISO/IEC was “There is nothing we can do. The Directives only permit 5 months”. And when NB’s protested at the arbitrary 5 day length of the OOXML BRM, the response was similarly dismissive. But when Microsoft needs more time to edit OOXML, well that appears to be something entirely different. “Directives, Schmerectives. You don’t worry yourself about no stinkin’ Directives. Take whatever time you need, Sir.”

It makes you wonder whom the ISO/IEC bureaucracy is working for. The rights and prerogatives of NB’s? Or of large corporations? Almost every decision they made in the OOXML processing was to the detriment of NB prerogatives.

This delay has practical implications as well. Consider the following:

  1. We are currently approaching a two month period where NB’s can lodge an appeal against OOXML. Ordinarily, one of the grounds for appeal would be if the Project Editor did not faithfully carry out the editing instructions approved at the BRM. For example, if he failed to make approved changes, made changes that were not authorized, or introduced new errors when applying the approved changes. But with no final DIS text, the NB’s are unable to make any appeals on those grounds. By delaying the release of the final DIS text, JTC1 is preventing NB’s from exercising their rights.
  2. Law suits, such as the recent one in the UK, are alleging process irregularities, including (if I read it correctly) that BSI approved OOXML without seeing the final text. I imagine that having the final DIS text in hand and being able to point to particular flaws in that text that should have justified disapproval would bolster their case. But if JTC1 withholds the text, then they cannot make that point as effectively.
  3. There are obvious anti-competitive effects at play here. Microsoft has the final DIS version of the ISO/IEC 29500:2008 standard, and by JTC1 delaying release to NB’s, Microsoft is able to have 2+ extra months, free of competition, to produce a fix pack to bring their products in line with the final standard, while other competitors like Sun or Corel are left behind. So much for transparency. So much for open standards. How can this be considered open if some competitors are given a significant time and access advantage?

Note that I’m not talking about the publication of the IS here. I’m talking about the requirements of 13.12 and the release of the final DIS text. Obviously ITTF will have a lot of work to do prepping OOXML for publication. For ODF it took 6 months. For OOXML I would expect it to take at least that long. But that does not prevent adherence to the Directives, in particular the requirement to distribute the final DIS text.

JTC1/SC34, noticing the delay in the release of this text, adopted the following Resolution at their Plenary in early April:

Resolution 8: Distribution of Final text of DIS 29500

SC 34 requests the ITTF and the SC34 secretariat to distribute the already received final text of DIS 29500 to the SC 34 members in accordance with JTC 1 directives section 13.12 as soon as possible, but not later than May 1st 2008. Access to this document is important for the success of various ISO/IEC 29500 maintenance activities.

This indicates that the final DIS text had already been received by SC34 (but not distributed) as of that date (April 9th).

Well, here we are, May 4th, over two months since the final DIS text was due, and past the date requested by the SC34 Plenary (who, by the way, have no authority to extend the deadline required by the JTC1 Directives, but that is another story). We have nothing.

So, I’ll make my own personal appeal. JTC1 has the text. The Directives are clear. The delay is unnecessary and harmful in the ways I outlined above. Release the final DIS text now. Not next month. Not next week. Release it now.

Filed Under: OOXML

Sinclair’s Syndrome

2008/04/17 By Rob 10 Comments

A curious FAQ put up by an unnamed ISO staffer on MS-OOXML. Question #1 expresses concerns about Fast Tracking a 6,000 page specification, a concern which a large number of NB’s also expressed during the DIS process. Rather than deal honestly with this question, the ISO FAQ says:

The number of pages of a document is not a criterion cited in the JTC 1 Directives for refusal. It should be noted that it is not unusual for IT standards to run to several hundred, or even several thousand pages.

Now certainly there are standards that are several thousand pages long. For example, Microsoft likes to bring up the example of ISO 14496, MPEG 4, at over 4,000 pages in length. But that wasn’t a Fast Track. And as Arnaud Lehors reminded us earlier, MPEG 4 was standardized in 17 parts over 6 years.

So any answer in the FAQ which attempts to consider what is usual and what is unusual must take account of past practice in JTC1 Fast Track submissions. That, after all, was the question the FAQ purports to address.

Ecma claims (PowerPoint presentation here) that there have been around 300 Fast Tracked standards since 1987 and Ecma has done around 80% of them. So looking at Ecma Fast Tracks is a reasonable sample. Luckily Ecma has posted all of their standards, from 1991 at least, in a nice table that allows us to examine this question more closely. Since we’re only concerned with JTC1 Fast Tracks, not ISO Fast Tracks or standards that received no approval beyond Ecma, we should look at only those which have ISO/IEC designations. “ISO/IEC” indicates that the standard was approved by JTC1.

So where did things stand on the eve of Microsoft’s submission of OOXML to Ecma?

At that point there had been 187 JTC1 Fast Tracks from Ecma since 1991, with basic descriptive statistics as follows:

  • mean = 103 pages
  • median = 82 pages
  • min = 12 pages
  • max = 767 pages
  • standard deviation = 102 pages

A histogram of the page lengths looks like this:

So the ISO statement that “it is not unusual for IT standards to run to several hundred, or even several thousand pages” does not seem to ring true in the case of JTC1 Fast Tracks. A good question to ask anyone who says otherwise is, “In the time since JTC1 was founded, how many JTC1 Fast Tracks have been submitted greater than 1,000 pages in length?” Let me know if you get a straight answer.

Let’s look at one more chart. This shows the length of Ecma Fast Tracks over time, from the 28-page Ecma-6 in 1991 to the 6,045 page Ecma-376 in 2006.

Let’s consider the question of usual and unusual again, the question that ISO is trying to inform the public on. Do you see anything unusual in the above chart? Take a few minutes. It is a little tricky to spot at first, but with some study you will see that one of the standards plotted in the above chart is atypical. Keep looking for it. Focus on the center of the chart, let your eyes relax, clear your mind of extraneous thoughts.

If you don’t see it after 10 minutes or so, don’t feel bad. Some people and even whole companies are not capable of seeing this anomaly. As best as I can tell it is a novel cognitive disorder caused by taking money from Microsoft. I call it “Sinclair’s Syndrome” after Upton Sinclair who gave an early description of the condition, writing in 1935: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.”

To put it in more approachable terms, observe that Ecma-376, OOXML, at 6,045 pages in length, was 58 standard deviations above the mean for Ecma Fast Tracks. Consider also that the average adult American male is 5′ 9″ (175 cm) tall, with a standard deviation of 3″ (8 cm). For a man to be as tall, relative to the average height, as OOXML is to the average Fast Track, he would need to be 20′ 3″ (6.2 m) tall!
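For anyone who wants to check my arithmetic, here is the calculation as a quick Python sketch, using only the figures quoted above:

```python
# Check the "58 standard deviations" analogy using the figures from this post.
mean_pages, sd_pages = 103, 102        # Ecma JTC1 Fast Tracks, 1991-2006
ooxml_pages = 6045                     # Ecma-376

z = (ooxml_pages - mean_pages) / sd_pages
print(f"OOXML is {z:.1f} standard deviations above the mean")

# Translate the same z-score into adult male height (mean 69 in, s.d. 3 in).
mean_height_in, sd_height_in = 69, 3
equiv_height_in = mean_height_in + z * sd_height_in
feet, inches = divmod(equiv_height_in, 12)
print(f"Equivalent height: about {int(feet)} feet ({equiv_height_in * 2.54 / 100:.1f} m)")
```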

For ISO, in a public relations pitch, to blithely suggest that several thousand page Fast Tracks are “not unusual” shows an audacious disregard for the truth and a lack of respect for a public that is looking for ISO to correct its errors, not blow smoke at them in a revisionist attempt to portray the DIS 29500 approval process as normal, acceptable or even legitimate. We should expect better from ISO and we should express disappointment in them when they let us down in our reasonable expectations of honesty. We don’t expect this from Ecma. We don’t expect this from Microsoft. But we should expect this from ISO.

Filed Under: OOXML, Standards

OOXML’s (Out of) Control Characters

2008/03/24 By Rob 14 Comments

Let’s start with the concepts of “lexical” and “value” spaces in XML, as well as the mechanism of “derivation by restriction” in XML Schema. Any engineer can understand the basics here, even if you don’t eat and drink XML for breakfast.

The value space for an XML data item comprises the set of all allowed values. So the value space for the “float” data type would be all floating point numbers, such as 12.34 or 43.21. The lexical space comprises all ways of expressing these values in the character stream of an XML document. So lexical representations of the value 12.34 include “12.34”, “12.340” and “1.234E1”. For ease of illustration I will indicate value space items in bold, and lexical space items in quotes. In general there are multiple lexical representations that may represent the same value.
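A quick way to see the distinction is to let Python’s float parser stand in for an XML Schema processor (an illustration only; the actual lexical-to-value mapping rules live in XML Schema Part 2):

```python
# Three different lexical representations, one value.
lexical_forms = ["12.34", "12.340", "1.234E1"]
values = [float(s) for s in lexical_forms]
assert values[0] == values[1] == values[2]   # all map to the value 12.34
print(values)
```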

Character data in XML also permits more than one lexical representation of the same value. For example, “&#x41;” and “A” both represent the value A. The “numerical character reference” approach allows an XML author to easily encode the occasional Unicode character which is not part of the author’s native editing environment, e.g., adding the copyright character or occasional foreign character. The value space allowed by XML includes most of Unicode, including all of the major writing systems of the world, current and historical.
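Any off-the-shelf XML parser demonstrates both points. This sketch uses Python’s xml.etree, and also shows that a numerical character reference cannot be used to smuggle in a character outside the permitted value space:

```python
import xml.etree.ElementTree as ET

# "&#x41;" and "A" are two lexical representations of the same character value.
a = ET.fromstring("<t>&#x41;</t>").text
b = ET.fromstring("<t>A</t>").text
assert a == b == "A"

# A character reference cannot escape the value space: U+0008 (BACKSPACE)
# is not a legal XML 1.0 character, even written as &#x08;.
try:
    ET.fromstring("<t>&#x08;</t>")
    rejected = False
except ET.ParseError:
    rejected = True
print("parser rejects &#x08;:", rejected)
```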

The concern I have with DIS 29500 is Ecma’s introduction of an ST_Xstring (Escaped String) datatype. This new type is defined via the following XML Schema definition:

<xsd:simpleType name="ST_Xstring">
  <xsd:restriction base="xsd:string"/>
</xsd:simpleType>

This uses the “derivation by restriction” facility of XML Schema to define a new type, derived from the standard xsd:string schema type. The xsd:string type is defined to allow only character values that are also allowed in the XML standard.

The use of derivation by restriction implies a clear relationship between the ST_Xstring type and the base type xsd:string. This is stated in XML Schema Part 1, clause 2.2.1.1:

A type definition whose declarations or facets are in a one-to-one relation with those of another specified type definition, with each in turn restricting the possibilities of the one it corresponds to, is said to be a restriction.

The specific restrictions might include narrowed ranges or reduced alternatives. Members of a type, A, whose definition is a restriction of the definition of another type, B, are always members of type B as well.

The last sentence can be taken as a restatement of the Liskov Substitution Principle, a fundamental principle of interface design: a subtype should be usable (substitutable) wherever a base type is usable. It is this principle that ensures interoperability. A type derived by restriction limits, restricts, constrains, reduces the permitted value space of its base type, but it cannot increase the value space beyond that permitted by its base type.
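For contrast, here is what a genuine derivation by restriction looks like. This is a hypothetical type invented purely for illustration, not one from DIS 29500: every value it admits is also a valid xsd:string, so substitutability holds.

```xml
<xsd:simpleType name="ST_SheetNameExample">
  <xsd:restriction base="xsd:string">
    <!-- Narrow the value space: at most 31 characters,
         none of them \ / ? * [ or ] -->
    <xsd:maxLength value="31"/>
    <xsd:pattern value="[^\\/?*\[\]]*"/>
  </xsd:restriction>
</xsd:simpleType>
```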

So, with that background, let’s now look at how OOXML defines the semantics of its ST_Xstring type:

ST_Xstring (Escaped String)

String of characters with support for escaped invalid-XML characters.

For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character’s value. [Example: The Unicode character 8 is invalid in an XML 1.0 document, so it shall be escaped as _x0008_. end example]

This simple type’s contents are a restriction of the XML Schema string datatype.

In other words, although ST_Xstring is declared to be a restriction of xsd:string it is, via a proprietary escape notation, in fact expanding the semantics of xsd:string to create a value space that includes additional characters, including characters that are invalid in XML.
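A minimal sketch of the escape mechanism being described, with hypothetical function names (the real rules in DIS 29500 include details omitted here, such as escaping a literal “_x” sequence):

```python
import re

def xstring_escape(s):
    """Escape characters forbidden in XML 1.0 using the _xHHHH_ notation."""
    def forbidden(ch):
        cp = ord(ch)
        return cp < 0x20 and ch not in "\t\n\r"   # C0 controls minus tab/CR/LF
    return "".join(f"_x{ord(ch):04X}_" if forbidden(ch) else ch for ch in s)

def xstring_unescape(s):
    """Decode _xHHHH_ escapes back into raw character values."""
    return re.sub(r"_x([0-9A-Fa-f]{4})_", lambda m: chr(int(m.group(1), 16)), s)

escaped = xstring_escape("A\x08BC")
print(escaped)                               # _x0008_ in place of the backspace
assert xstring_unescape(escaped) == "A\x08BC"
```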

Let’s review some of the problems it introduces.

First, the semantics of XML strings that contain invalid XML characters are undefined by this or any other standard. For example, OOXML uses ST_Xstring in Part 4, Clause 3.3.1.30 to store the error message which should be displayed when a data validation formula fails. But what should an OOXML-supporting application do when given a display string which contains control characters from the C0 control range, characters forbidden in XML 1.0?

  • U+0004 END OF TRANSMISSION
  • U+0006 ACKNOWLEDGE
  • U+0007 BELL
  • U+0008 BACKSPACE
  • U+0017 END OF TRANSMISSION BLOCK

How should these characters be displayed?

There is a reason XML excludes these dumb terminal control codes. They are neither desired nor necessary in XML.

Elliotte Rusty Harold explains the rationale for this prohibition in his book Effective XML:

The first 32 Unicode characters with code points 0 to 31 are known as the C0 controls. They were originally defined in ASCII to control teletypes and other monospace dumb terminals. Aside from the tab, carriage return, and line feed they have no obvious meaning in text. Since XML is text, it does not include binary characters such as NULL (#x00), BEL (#x07), DC1 (#x11) through DC4 (#x14), and so forth. These noncharacters are historic relics. XML 1.0 does not allow them.

This is a good thing. Although dumb terminals and binary-hostile gateways are far less common today than they were twenty years ago, they are still used, and passing these characters through equipment that expects to see plain text can have nasty consequences, including disabling the screen.

Further, since these characters are undefined in XML, they are unlikely to work well with existing accessibility interfaces and devices. At best these characters will be ignored and introduce subtle errors. For example, what does “$10,[BS]000” become if one system processes the backspace and another does not? Worst case, the accessibility interface expecting a certain range of characters as defined by the xsd:string type will crash when presented with values beyond the expected range.

Interfaces with existing programming languages are also harmed by ST_Xstring. How does a C or C++ XML parser deal with XML that now can allow a U+0000 (NULL) character in the middle of a string, something which is illegal in that programming language?

What about XML database interfaces that take XML data and store it in relational tables? If they are schema-aware and see that ST_Xstring is merely a restriction of xsd:string, they will assume the normal range of characters can be stored wherever an xsd:string can be stored. But since the value space is expanded, there is no guarantee that this will still be true. These characters may cause validation errors in the database.

By now, the observant reader may be accusing me of pulling a fast one. “But Rob, none of the above is a problem if the application simply leaves the ST_Xstring encoded and does not try to decode or interpret the non-XML character,” you might say.

OK. Fair enough. Let’s follow that approach and see where it leads us.

Let’s look at interoperability with other XML-based standards. Imagine you do a DOM parse of an OOXML document that contains “strings” of type ST_Xstring. Either your parser/application is OOXML-aware, or it isn’t. In other words, either it is able to interpret the non-standard _xHHHH_ instructions, or it isn’t.

If it doesn’t understand them, then any other code that operates on the DOM nodes with ST_Xstring data is at risk of returning the wrong answer. For example, what is the length of the string “ABC”? Three characters, of course. But what is the length of the string “_x0041_BC”? These two strings both have the same value according to OOXML. But an XML application might return 9 or return 3, depending on whether it is OOXML-aware or not. Since most (all) XML parsers are unaware of the non-standard escape mechanism proposed by OOXML, they will typically calculate things such as string lengths, string comparisons, string sorting, etc., incorrectly.
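The discrepancy is easy to demonstrate with plain Python string operations standing in for an escape-unaware XML consumer (the decoder one-liner is a simplified sketch that ignores the spec’s “_x005F_” subtlety):

```python
import re

raw = "_x0041_BC"        # OOXML-escaped form of the three-character string "ABC"

# An escape-unaware consumer (any stock XML parser, XPath engine, database)
# sees nine characters:
print(len(raw))          # 9

# An OOXML-aware consumer must first decode the proprietary escapes:
decoded = re.sub(r"_x([0-9A-Fa-f]{4})_", lambda m: chr(int(m.group(1), 16)), raw)
print(len(decoded))      # 3
assert decoded == "ABC"
```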

But suppose the parser/application is OOXML-aware and correctly decodes these character references into the correct Unicode values, then what? Assuming the host language doesn’t crash from the existence of these control characters, we are then presented with problems at the interface with any other code that operates on the DOM. Suppose we try to transform the DOM via XSLT to XHTML. Will the XSLT engine properly handle the existence of these forbidden character values? The XSLT engine may just crash. But suppose it doesn’t. How does it write out these control characters into XHTML? It can’t. These values are not permitted in XHTML. Dead end. What about DocBook? DITA? OpenDocument Format? Not possible. Since these characters are not permitted in XML 1.0 at all, they will be forbidden in all other markup languages that are based on XML 1.0, or even XML 1.1 for that matter (XML 1.1 allows some but not all of these characters; in particular, the NULL character is excluded).

Note further that with XML pipelining and with mashups, the application that writes XML output typically does not have direct knowledge of the application that originally produced the XML values. This decoupling of producers and consumers is an essential aspect of modern systems integration, including Web Services. By corrupting XML string values in the way that it does, DIS 29500 breaks the ability to have loosely coupled systems. Once the value space is polluted by these aberrant control characters, every application, every process that touches this data must be aware of their non-standard idiosyncrasies lest they crash or return incorrect answers. In this way, one standard perverts the entire XML universe, forcing every application to contend with the poor hygiene of a single vendor.

The reader might think that I exaggerate the importance of this, that surely ST_Xstring is only used in OOXML in edge cases, in rare, compatibility modes. We wish that this were true. However, a look at the DIS 29500 text shows that ST_Xstring is pervasive, and in fact is the predominant data type in SpreadsheetML, used to express the vast majority of spreadsheet content, including cell contents, headers, footers, display strings, error strings, tooltip help, range names, etc. Any application that operates on an OOXML spreadsheet will need to deal with this mess.

For example, here are some uses of ST_Xstring in DIS 29500, Part 4:

  • Clause 3.2.3 for the name of a custom view in a spreadsheet
  • Clause 3.2.5 for the name of a spreadsheet named range, for the descriptive comment, for the name description, for the help topic, the keyboard shortcut, the status bar text and for the menu item text
  • Clause 3.2.14 for the name of a spreadsheet function group
  • Clause 3.2.19 for the name of a sheet in a workbook
  • Clause 3.2.22 for the name of a smart tag as well as for the URL of a smart tag.
  • Clause 3.2.25 for the destination file name and title when publishing spreadsheet to the web.
  • Clause 3.3.1.10 for the value of a conditional formatting object, e.g., a gradient
  • Clause 3.3.1.20 for the name of a custom property
  • Clause 3.3.1.28 for sheet and range names
  • Clause 3.3.1.30 for error message string, error message title, prompt string and prompt title in a spreadsheet data validation definition.
  • Clause 3.3.1.35 for the value of a footer for even numbered pages.
  • Clause 3.3.1.36 for the value of a header for even numbered pages.
  • Clause 3.3.1.38 for the content of the first page footer
  • Clause 3.3.1.39 for the content of the first page header
  • Clause 3.3.1.44 for the display string for a hyperlink, the tooltip help for the link, also the anchor target if the hyperlink is to an HTML page
  • Clause 3.3.1.49 for values of input cells in a scenario
  • Clause 3.3.1.50 for cell inline text values
  • Clause 3.3.1.55 for the value of a footer for odd numbered pages.
  • Clause 3.3.1.56 for the value of a header for odd numbered pages.
  • Clause 3.3.1.73, in scenarios for the comment text, the scenario name and the name of the person who last changed the scenario.
  • Clause 3.3.1.88 when defining sort conditions, for the values of the custom sort list
  • Clause 3.3.1.93 for the value contained within a cell
  • Clause 3.3.1.94 for information associated with items published to the web, including the destination file and the title of the output HTML file
  • Clause 3.3.2.2 for expressing the criteria values in a filter
  • Clause 3.3.15 for the key/values for smart tag properties
  • Clause 3.4.4 for expressing the contents of a rich text run
  • Clause 3.4.5 for expressing the name of a font
  • Clause 3.4.6 for expressing the text of a phonetic hint for East Asian text
  • Clause 3.4.8 for expressing a text item in the shared string table
  • Clause 3.4.12 for the text content shown as part of a string
  • Clause 3.5.1.2 for a table, expressing a textual comment, a display name as well as style names.
  • Clause 3.5.1.3 for a table column, expressing cell and row style names, column name
  • Clause 3.5.1.7 for column properties created from an XML mapping, for expressing the associated XPath.
  • Clause 3.5.2.4 for the XPath associated with column properties for XML tables
  • Clause 3.7.1-3.7.6 for specifying content of tracked comments, including the text of the comments as well as the authors of the comments
  • Clause 3.8.29 expressing the name of a font

There are hundreds of additional uses. A search of DIS 29500 Part 4 for “ST_Xstring” returns 467 hits. OOXML also defines two additional types, “lpstr” (7.4.2.8) and “bstr” (7.4.2.4), that have the same flaw as ST_Xstring.

The reader might further argue that, although the type allows characters that are forbidden by XML, the actual occurrence of these values in real legacy documents is likely to be rare. This might be true, but this is cause for even greater concern. If every document contained these control characters, then we would immediately be aware of any interoperability problems when integrating OOXML data with other systems. But if these characters are permitted, but occur rarely and randomly, then the integration errors will also occur rarely and randomly, allowing data corruption and other problems to occur and propagate further before detection.

In summary, we are concerned that the ST_Xstring type in OOXML opens us up to problems such as:

  1. Introducing accessibility problems
  2. Breaking unaware C/C++ XML parsers
  3. Breaking XML databases
  4. Breaking interoperability with other XML languages
  5. Breaking application logic related to string searching, sorting, comparisons, etc.
  6. Introducing errors that will be hard to detect and resolve

Possible remedies include:

  1. Use xsd:string uniformly instead of ST_Xstring, with no use of forbidden XML characters. This would require that applications that read legacy binary documents containing such characters eliminate them at that point, perhaps replacing them with licit characters or with whitespace. No application will be more able to divine the original meaning and intent of these characters than the original vendor. So they should be responsible for cleaning up these strings to make them XML-ready.
  2. Use a non-string type such as the binary xsd:hexBinary or xsd:base64Binary to represent these data items.
  3. Use a mixed content encoding, where the licit characters are represented by xsd:string data, and the forbidden characters are denoted by specially-defined elements. So “A_x0008_BC” would become: <text>A<backspace/>BC</text>. In this case the semantics of the <backspace> element would need to be documented in the DIS 29500 specification, including its effect on searching, sorting, length calculations, etc.
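Remedy 3 could be sketched roughly as follows. This is an illustration with hypothetical element names, not a worked-out proposal:

```python
import re
import xml.etree.ElementTree as ET

# Map a few forbidden C0 code points to hypothetical element names.
NAMES = {0x00: "null", 0x07: "bell", 0x08: "backspace"}

def to_mixed_content(s):
    """Convert an _xHHHH_-escaped string into a <text> element whose
    child elements stand in for the forbidden characters."""
    root = ET.Element("text")
    last = None      # the most recently added child element, if any
    pos = 0
    for m in re.finditer(r"_x([0-9A-Fa-f]{4})_", s):
        chunk = s[pos:m.start()]
        if last is None:
            root.text = (root.text or "") + chunk
        else:
            last.tail = (last.tail or "") + chunk
        cp = int(m.group(1), 16)
        last = ET.SubElement(root, NAMES.get(cp, f"u{cp:04X}"))
        pos = m.end()
    tail = s[pos:]
    if last is None:
        root.text = (root.text or "") + tail
    else:
        last.tail = (last.tail or "") + tail
    return root

elem = to_mixed_content("A_x0008_BC")
print(ET.tostring(elem, encoding="unicode"))
```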

Filed Under: OOXML

Five (Bad) Reasons to Approve OOXML

2008/03/24 By Rob 7 Comments

  1. If you don’t approve OOXML, Microsoft will walk away, and you’ll never hear from them again. Forget the fact that OOXML is already an Ecma standard (Ecma-376), and cannot be taken away. Forget the fact that Microsoft has other formats lined up for ISO approval in the near future, like XPS or HD Photo. Microsoft wants you to think that if you don’t give them exactly what they want, now, they will walk away from ISO and you will be the worse from it. We need to encourage Microsoft for their abuse of the standardization process, in hopes that their participation will evolve in line with our hopes, and not our fears, that they will improve on the standardization side, while curbing the abuse side. Of course, the encouragement could be misinterpreted to mean the opposite, and we could get more abuse, and even lower quality standards. I guess that’s the risk we’ll just need to take. By similar abuses of logic small children hold their breath until their faces turn blue, thinking they can scare adults into giving them what they want. It doesn’t work there either.
  2. If you approve OOXML, you can have the privilege of spending the next 5 years in the glorious work of fixing thousands of defects in the text. You can get a seat at the table, fixing bugs that should have been fixed in Ecma before OOXML was even submitted to JTC1. Forget the fact that maintenance in JTC1 is a ponderous, time consuming activity, where individual defects are enumerated, changes proposed, discussed, voted on, etc. Forget the fact that the recent BRM showed that you can’t really get through more than 60 defects in a week-long meeting. Forget the fact that fixing defects in Ecma, not JTC1, would be far faster and easier due to the lighter-weight process Ecma imposes on their TC’s. Forget that Fast Track is intended for mature, adopted standards not for ones that will require a “Perpetual BRM”. Forget all that. You want a seat at the bug fixing table? You got it.
  3. Billions and Billions of legacy documents. Well, actually these legacy documents are not in OOXML format; they are in the legacy binary format. And no mapping has been provided from the legacy formats to OOXML. But there are billions and billions of these legacy documents. That must be important. So vote Yes for OOXML because there are billions and billions of documents in some other format that is nebulously related to it.
  4. More standards are better. More standards means more choice, means more decisions, means more consultants, means more money paid to XML experts. You’ll sooner find the American Dairy Council recommending less milk consumption than a standards professional calling for fewer standards. So ignore quality, maturity and need. More standards are a good thing. Like Blu-ray and HD DVD.
  5. ODF will be better if OOXML is approved. In OASIS we’re too stupid to look up legacy features or Excel spreadsheet formulas in Ecma-376. We would have never thought of that. We believe the only way to make ODF better is to make it more like OOXML. That is why we would like to encourage nice little JTC1 countries like Kazakhstan to vote YES for OOXML. As soon as OOXML is approved, then magically, it becomes useful to us. But the exact same text, not approved by Kazakhstan and JTC1, is not useful to us at all. It is all or nothing. There is nothing in the middle. Rather than taking a useful, high quality text, and approving it on its merits, we are asked to approve a specification with thousands of defects, and by our approval we transform it into something useful to ODF.

Filed Under: OOXML

How many defects remain in OOXML?

2008/03/18 By Rob 54 Comments

DIS 29500, Office Open XML, was submitted for Fast Track review by Ecma as a 6,045-page specification. (After the BRM, it is now longer, maybe 7,500 pages or so. We don’t know for sure, since the post-BRM text is not yet available for inspection.) Based on the original 6,045-page length, a 5-month review by JTC1 NB’s led to 48 defect reports by NB’s, reporting a total of 3,522 defects. Ecma responded to these defect reports with 1,027 proposals, which the recent BRM, mainly through the actions of one big overnight ballot, approved.

So what was the initial quality of OOXML, coming into JTC1? One measure is the defect density, which we can say is at least one defect for every 6045/1027 ≈ 5.9 pages. I say “at least” because this is the lower bound. If we believed that the 5-month review represented a complete review of the text of DIS 29500, by those with relevant subject matter expertise, then we would have some confidence that all, or at least most, defects were detected, reported and repaired. But I don’t know anyone who really thinks the 5-month review was sufficient for a technical review of 6,045 pages. Further, we know that Microsoft worked actively to suppress the reporting of defects by NB’s. So the actual defect density is potentially quite a bit higher than the reported defect density.

But how much higher? This is the important question. It doesn’t matter how many defects were fixed. What matters is how many remain.

There are several approaches to answering this question. One approach is to look at defect “find rates”, the number of defects found per unit of time spent reviewing, fit that to a model, typically an S-curve (sigmoid), and use that model to predict the number of defects remaining. However, we have no time/effort data for the DIS 29500 review, so we don’t have enough data to create that model. Another approach is to randomly sample the post-BRM text and statistically estimate the defect density by this sample.

Are there any other good approaches?

Here is the plan. I will use the second approach. Since I do not actually have the post-BRM text, I need to make some adjustments. I’ll start with the original text, in particular Part 4, the XML reference section, at 5,220 pages, where the meat of the standard is. I’ll then create a spreadsheet and generate 200 random page numbers between 1 and 5,220. For each random page I will review the clause associated with that page and note the technical and editorial errors I find. I will then check these errors to see if any of them were addressed by BRM resolutions.
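The sampling step is simple to reproduce. A sketch (I sample without replacement, since reviewing the same page twice tells us nothing new; a spreadsheet RAND() approach could yield duplicates):

```python
import random

TOTAL_PAGES = 5220      # DIS 29500 Part 4, the XML reference section
SAMPLE_SIZE = 200

random.seed(29500)      # any fixed seed makes the sample reproducible
pages = sorted(random.sample(range(1, TOTAL_PAGES + 1), SAMPLE_SIZE))

assert len(set(pages)) == SAMPLE_SIZE       # sampling without replacement
print(pages[:5], "...")
```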

Based on the above, I will be able to estimate two numbers:

  • The defect density of the text, both pre and post BRM
  • The fraction of defects which were detected by the Fast Track review.

So if I find N defects, and 0.9N of those issues were already found during the Fast Track review and were addressed by the BRM, then we can say that the Fast Track procedure was 90% effective in finding and removing errors. Some practitioners would call that the defect removal “yield” of the process. But if we find that only 0.1N of the errors were reported and addressed by the BRM, then we’ll have a different opinion on the sufficiency of the Fast Track review.
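The yield calculation is just a ratio, but stating it as code makes the bookkeeping explicit. A minimal sketch (the function name is mine, not standard terminology in any tool):

```python
def removal_yield(defects_found, defects_fixed_at_brm):
    """Fraction of sampled defects that were already reported during
    the Fast Track review and addressed by the BRM -- the process's
    defect-removal yield."""
    if defects_found == 0:
        raise ValueError("no defects in sample; yield is undefined")
    return defects_fixed_at_brm / defects_found

# If 9 of 10 sampled defects were already addressed by the BRM,
# the Fast Track process had a 90% defect-removal yield.
print(removal_yield(10, 9))  # -> 0.9
```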

Clear enough? Microsoft is claiming something like 99% of all issues were resolved at the BRM. So let’s see if we get anything close.

I’m not done with this study yet. I’m finding so many defects that recording them is taking more time than finding them. But since this is topical, I will report what I have found so far, based on the first 25 random pages, or 1/8th completion of my target 200. I’ve found 64 technical flaws. None of the 64 flaws were addressed by the BRM. Among the defects are some rather serious ones such as:

  • storage of plain text passwords in database connection strings
  • Undefined mappings between CSS and DrawingML
  • Errors in XML Schema definitions
  • Dependencies on proprietary Microsoft Internet Explorer features
  • Spreadsheet functions that break with non-Latin characters
  • Dependencies on Microsoft OLE method calls
  • Numerous undefined terms and features

As I said, this study is still underway. I’ll list the defects I’ve found so far, and add to it as I complete the task over the next few days.

  1. Page 692, Section 2.7.3.13 — no errors found
  2. Page 1457, Section 2.15.3.45 — This is a compatibility setting which creates needless complexity for implementers who now must deal with two different ways of handling a page break, one in which a page break ends the current paragraph, and another where it does not. This is not a general need and expresses only a single vendor’s legacy setting.
  3. Page 490, Section 2.4.72 — This defines the ST_TblWidth type, used to express the width of a table column, cell spacing, margins, etc. The allowed values of this type express the measurement units to be used: Auto, Twentieths of a point, Nil (no width), Fiftieths of a percent. I find these choices to be capricious and not based on any sound engineering principle. It also mixes units with width values (Nil) and modes (Auto). This should be changed to allow measurements in natural units, such as mm, inches, points, or picas, as allowed in XSL-FO or CSS2. Also, do not mix units, values and modes in the same attribute. Nil is best represented by the value 0, and Auto should be its own Boolean attribute.
  4. Page 328, Section 2.4.17 — The frame attribute description says it “Specifies whether the specified border should be modified to create a frame effect by reversing the border’s appearance from the edge nearest the text to the edge furthest from the text.” This is not clear. What does it mean to reverse a border’s appearance? Are we doing color inversions? Flipping along the Y-axis? What exactly? Also a typographical error: “For the right and top borders, this is accomplished by moving the order down and to the right of its original location.” Should be “moving the border down…” Also, it is not stated how far the border should be moved.
  5. Page 1073, Section 2.14.8 — This feature is described as: “This element specifies the connection string used to reconnect to an external data source. The string within this element’s val attribute shall contain the connection string that the hosting application shall pass to a external data source access application to enable the WordprocessingML document to be reconnected to the specified external data source.” Since connection to external data typically requires a user ID and a password, the lack of any security mechanism on this feature is alarming. The example given in the text itself hardcodes a plain-text password in the connection string.
  6. Page 4387, Section 6.1.2.3 — For the “class” attribute it says “Specifies a reference to the definition of a CSS style.” The example implies that some sort of mapping will occur between CSS attributes and DrawingML. But no such mapping is defined in OOXML. The “doubleclicknotify” attribute implies some sort of event model that is undefined in OOXML. How do you send a message for doubleclicknotify? Why do we describe organization chart layouts here when it is not applicable to a Bezier curve? What happens if this shape is declared to be a horizontal rule or bullet or OLE object? The text allows you to label it as one of these, but assigns no meaning or behavior to this. Why do we have a spid as well as an id attribute? The “target” attribute refers to Microsoft-specific I.E. features such as “_media”. Although the text says that control points have default values, the schema fragment does not show this.
  7. Page 3164, Section 4.6.88 — This and the following two elements are all called “To” but this seems to be a naming error. 4.6.89 is essentially undefined. What does “The element specifies the certain attribute of a time node after an animation effect” mean? It doesn’t seem to really signify anything. Ditto for 4.6.90.
  8. Page 5098, Section 7.1.2.124 — The example does not illustrate what the text claims it does. The example doesn’t even use the element defined by this clause.
  9. Page 4492, Section 6.1.2.11 — The “althref” attribute is described as “Defines an alternate reference for an image in Macintosh PICT format”. Why is this necessary for only Mac PICT files? Why would “bilevel” necessarily lead to 8 colors? We’re well beyond 8-bit color these days. The “blacklevel” attribute is defined as “Specifies the image brightness. Default is 0.” What is the scale here? This needs to be defined. Is it 0-1.0, 0-255 or what? And what is “image brightness” in terms of the art? Is this luminosity? Opacity? Is this setting the level of the black point? For “cropleft”, etc. — what units are allowed? (implies %) How does “detectmouseclick” work when no event model is defined? “emboss effect” is not defined. “gain” has the same problem as “blacklevel” — no scale is defined. This element has two different id attributes in two different namespaces, with two different types. The “movie” attribute is described as “Specifies a pointer to a movie image. This is a data block that contains a pointer to a pointer to movie data”. Excuse me? “A pointer to a pointer to movie data”? This is useless. The “recolortarget” example appears to contradict the description. It shows blue recolored to red, not black. The “src” attribute is said to be a URL, yet is typed to xsd:string. This should be xsd:anyURI.
  10. Page 1431, Section 2.15.3.30 — no errors noted
  11. Page 3405, Section 5.1.5.2.7 — The conflict resolution algorithm should be normative, not merely in a note.
  12. Page 875, Section 2.11.21 — Instead of saying that the footnote “pos” element should be ignored if present at the section level, the schema should be defined so as to not allow it at the section level. In other words, this should be expressed as a syntax constraint.
  13. Page 1955, Section 3.3.1.20 — This facility for adding “arbitrary” binary data to spreadsheets is said to be for “legacy third-party document components”. No documentation or mapping for such legacy components has been provided, so interoperability with this legacy data cannot be achieved. Why isn’t this expressed using the extension mechanisms of Part 5 of the DIS?
  14. Page 4526, Section 6.1.2.13 — The “allowoverlap” attribute is not sufficiently defined. In particular, what determines whether the object shifts to the right or left? ST_BWMode is not adequately defined. For example, one option is “Use light shades of gray only”. How light? And what is the difference between “hide” and “undrawn”? Also, the concept of “wrapping polygon” is not sufficiently defined. For example, what is the wrapping polygon for an oval? The purpose of “dgmlayoutmru” is obscure. Wouldn’t the most-recently-used layout option be the one which is actually in use, “dgmlayout”? The “dgmnodekind” attribute is undefined, said to be “application-specific”. Is interoperability not allowed? The text seems to imply that applications must use application-specific values. The “href” attribute is given a string schema type. Shouldn’t this be xsd:anyURI? The “id” attribute is said to be a “unique identifier”. Unique in what domain? Among shapes of this type? Among all shapes? All shapes on this page? Among all ID’s in the document? The “preferrelative” attribute is not sufficiently defined. Where is the original size stored? After what reformatting? This appears to be a specification for runtime behavior, not a storage artifact. But it is not clear what is required. For the “regroupid”, where is the list of these possible id’s? The hyperlink targets _media and _search are Internet Explorer proprietary features.
  15. Page 1193, Section 2.15.1.39 — no errors noted
  16. Page 1459, Section 2.15.3.46 — no errors noted
  17. Page 2671, Section 3.17.7.150 — no errors noted
  18. Page 2347, Section 3.10.1.69 — An “AutoShow” filter is not defined in this standard, though it is called for in several places of this section. “Average” aggregation function is not defined. In fact, none of these aggregation functions are defined. Although some have common mathematical definitions, in a spreadsheet context it is critical to make an explicit statement on treatment of strings, blanks, empty cells, etc. For dataSourceSort, what type of sort is required? Lexical or locale-sensitive? This element seems to mix field-specific settings, like dragToCol with pivotTable-wide settings like hiddenLevel. This will result in large data redundancy as settings like hiddenLevel are stored multiple times, once for each pivotField. “Inclusive Mode” is not defined. “Measure based filter” is not defined. “AutoSort” mode is not defined. The resolution of pivot table versus cell styles is ambiguous. “If the two formats differ, the cell-level formatting takes precedence.” Is this negotiation done at the level of the entire text style? Style ID? Or at the attribute level? “Outline form” is not defined. “server-based page field” is not defined. (what is a page field?) “member caption” is undefined.
  19. Page 2885, Section 3.18.51 — The values of the given type (ST_OleUpdate) are explicitly tied to the Microsoft Windows OLE2 technology via the two method calls IOleObject::Update or IOleLink::Update.
  20. Page 3951, Section 5.5.3.4 — The base values “margin” and “edge” are ambiguous. Is it specifying positioning from the left or right page edge?
  21. Page 2710, Section 3.17.7.200 — The description of “lookup-vector” is insufficient. It seems to be saying that the range should be sorted. Is this really correct? Spreadsheet functions typically do not have side effects. Also, the sorting procedure is explicitly defined only for the Latin alphabet. What about the rest of the allowed Unicode characters, including the C0 control characters which are allowed in SpreadsheetML cell contents? Where are they sorted?
  22. Page 934, Section 2.13.5.5 — The “id” attribute is required to be unique, but it is not specified over what domain it must be unique.
  23. Page 607, Section 2.6.2 — What does “reversing the border’s appearance” mean? How much offset is required for a shadow?
  24. Page 201, Section 2.3.2.19 — This feature allows the suppressing of both spell and grammar checking for a text run. These should be two different settings, one for spelling and one for grammar proofing. There are many cases where it is important to check one, but not the other, such as content composed of sentence fragments, which are not grammatically complete, but where correct spelling is desired.
  25. Page 1240, Section 2.15.1.74 — This setting specifies that the document should be saved into an undefined XML format. But it is not stated how an XSLT transform can be applied to an OOXML document, since OOXML is a Zip file containing many XML documents. So what exactly is the specified XSLT applied to?

That’s as far as I’ve gone. But this doesn’t look good, does it? Not only am I finding numerous errors, these errors appear to be new ones, ones not detected by the NB 5-month review, and as such were not addressed in Geneva. Since I have not come across any error that actually was fixed at the BRM, the current estimate of the defect removal effectiveness of the Fast Track process is < 1/64 or 1.5%. That is the upper bound. (Confidence interval? I’ll need to check on this, but I’m thinking this would be based on the standard error of a proportion, where SE = sqrt(p(1-p)/N), making our confidence interval 1.5% ± 3%.) Of course, this value will need to be adjusted as my study continues. However, it is starting to look like the Fast Track review was very shallow and detected only a small percentage of the errors in the DIS.
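For readers who want to check the arithmetic, the standard error and the resulting (approximate, normal-theory) 95% confidence interval can be computed as follows. This is a sketch of the textbook formula; with a proportion this close to zero an exact binomial interval would be more defensible, but the normal approximation is what the ± 3% figure above comes from.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Standard error of a proportion and its approximate 95% CI,
    using the normal approximation SE = sqrt(p(1-p)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return se, (p_hat - z * se, p_hat + z * se)

# Upper-bound estimate so far: at most 1 of 64 sampled defects
# was addressed at the BRM.
p = 1 / 64                      # ~1.5%
se, (low, high) = proportion_ci(p, 64)
# se is about 0.0155, so the interval is roughly 1.5% +/- 3%
```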

[20 March Update]

As one commenter noted, the page numbers I’m using above are PDF page numbers, not the page numbers at the bottom of each page. If I used the printed pages then I would need to deal with all the Roman numeral front matter pages as an exception. Simpler to just use the one large domain of PDF page numbers.

PDF Page Number = Printed Page Number + 7

I will continue to report new defects, according to the original random number list I generated. I’ll update the statistics after every 25 pages.

Here’s some more for today:

  1. Page 4192, Section 5.8.2.20 — “fPublished” attribute is defined as “Specifies whether the shape shall be published with the worksheet when sent to the spreadsheet server. This is for use when interfacing with a document server.” What worksheet? This section is in the DrawingML reference material. Charts could appear in presentations as well. This should not be limited to worksheets. Also what is a “spreadsheet server”? No such technology has been defined in this standard. Also no protocol has been defined for publishing to a spreadsheet server. Is this some proprietary hook for SharePoint? The “macro” attribute allows the storage of application-defined scripts. We are told that the macro “should be ignored if not understood.” However there is no mechanism for determining what language the script is in. How do we know if we understand the macro? Content sniffing? Attempt to execute it and see if we get a runtime error? But by that time, once we find out that we do not understand it, it is too late to ignore the macro. We may have already triggered runtime side effects. What we really need here is some way to declare what scripting language is being used, via a namespace or an additional attribute like “lang”.
  2. Page 3526, Section 5.1.5.4.21 — The “algn” attribute specifies the text alignment. Allowed values include left, right, center, justified, etc. However, what is lacking is “start” and “end” alignment, which are sensitive to writing direction and are part of internationalization best practices, for example in XSL-FO. When translating a document between RTL and LTR systems, the approach used by OOXML will be harder to deal with and more expensive to translate, since the translator will need to manually adjust styles rather than just perform a semi-automated translation.

[End Update]

I’ll continue to review the remaining 173 pages of my random sample and update the numbers and the defect list as I go. If you want to play along at home, the upcoming random page numbers will be:

  • 1039
  • 4933
  • 3334
  • 1993
  • 1632
  • 4787
  • 460
  • 481
  • 4497
  • 310
  • 282
  • 2383
  • 1793
  • 2451
  • 3310
  • 3716
  • 1261
  • 1077
  • 2219
  • 4236
  • 285
  • 3090
  • 737
  • 2370
  • 741
  • 164
  • 5044
  • 364
  • 2272
  • 1377
  • 4512
  • 1410
  • 964
  • 5079
  • 5030
  • 4110
  • 3620
  • 3588
  • 2301
  • 3222
  • 4485
  • 5082
  • 193
  • 3632
  • 985
  • 1593
  • 5155
  • 1054
  • 3371
  • 3717
  • 5015
  • 1071
  • 2965
  • 2294
  • 1809
  • 161
  • 4922
  • 5219
  • 1719
  • 1040
  • 4259
  • 3134
  • 1195
  • 4232
  • 4444
  • 3931
  • 2302
  • 2788
  • 3584
  • 8
  • 5092
  • 2580
  • 1080
  • 1239
  • 1415
  • 1170
  • 1501
  • 151
  • 148
  • 4754
  • 1350
  • 3714
  • 1895
  • 3926
  • 4833
  • 2886
  • 2983
  • 1439
  • 3622
  • 4960
  • 2000
  • 2555
  • 671
  • 2388
  • 352
  • 222
  • 1630
  • 3033
  • 4994
  • 3346
  • 531
  • 2393
  • 482
  • 207
  • 2252
  • 4074
  • 3302
  • 2459
  • 751
  • 1891
  • 1635
  • 3120
  • 2226
  • 1119
  • 810
  • 1728
  • 837
  • 4570
  • 4474
  • 1072
  • 3901
  • 300
  • 4895
  • 1764
  • 2332
  • 619
  • 4392
  • 2112
  • 1653
  • 4339
  • 2384
  • 4566
  • 4085
  • 1171
  • 2238
  • 5144
  • 1399
  • 4157
  • 1352
  • 27
  • 4118
  • 4167
  • 5046
  • 4460
  • 4053
  • 1258
  • 4252
  • 922
  • 3748
  • 1742
  • 458
  • 4448
  • 963
  • 2227
  • 1404
  • 593
  • 4140
  • 1739
  • 1102
  • 1611
  • 3016
  • 2646
  • 3083
  • 5105
  • 747
  • 1142
  • 2596
  • 845
  • 626
  • 4047
  • 1415
  • 5143
  • 3997

Filed Under: OOXML

Copyright © 2006-2026 Rob Weir · Site Policies