An Antic Disposition


Archives for May 2008

Fractured YEARFRAC and Discounted DISC

2008/05/19 By Rob 48 Comments

There are a few things that a spreadsheet can get wrong. It can be hard to use. It can be ugly. It can be slow. It can even crash occasionally. These are all annoyances. With a monopoly, annoyances are a way of life. However, there is one place where a spreadsheet can never be wrong, and that is in the area of calculations, especially financial calculations. A business uses spreadsheets to model and track its business. It makes decisions based on these spreadsheets. If the spreadsheet gives the wrong answer, then the business makes the wrong decision. And a business that makes the wrong decisions does not stay in business for long.

So it was with much lament last year that I reported that OOXML, then under ballot in ISO/IEC JTC1, had many egregious errors in its spreadsheet formula definitions. In addition to enumerating these errors on my blog, I submitted them for consideration by INCITS V1, the US SC34 mirror committee, and these became part of the bundle of comments that the US submitted with its ballot.

Although the ballot wasn’t due until September 2nd, the extravagant “sit and do nothing” provisions in INCITS led to our technical review being cut off in July. Because of the lack of review time, I was not able to do an in-depth review of all the spreadsheet functions, only a cursory one. But the existence of such errors as I did identify, in the already-approved Ecma Standard, was disquieting. It should have led Ecma TC45 to conduct a more thorough review of the spreadsheet functions. But this did not appear to happen.

When we received the response from Ecma to the NB comments, in January, INCITS V1 was asked to go through all the responses (over 1,000 of them) to determine whether they were acceptable. This review period was again insufficient.

As for the BRM in February, this was a travesty, as I and others have noted.

I have a theory concerning committees. A committee may have different states, like water has gas, liquid, and solid phases, depending on temperature and pressure. The same committee, depending on external circumstances of time and pressure, will enter well-defined states that determine its effectiveness. If a committee works in a deliberative mode, where issues are freely discussed, objections heard, and consensus is sought, then the committee will make slow progress, but the decisions of the committee will collectively be smarter than its smartest member. However, if a committee refuses to deliberate and instead merely votes on things without discussion, then it will be as dumb as its dumbest members. Voting dulls the edge of expertise. But discussion among experts socializes that expertise. This should be obvious. If you put a bunch of smart people in a room and don’t let them think or talk, then don’t expect smart things to happen, as if the mere exhalation of their breath brings forth improvements to the standard.

So the BRM ended a different committee than it started, and the mode of operation it was led into caused it to act like a very stupid committee indeed. I don’t say this to be accusatory. I’m just making an observation about crowd behavior. When a committee of experts ceases to be a deliberative committee, then you will achieve subpar results.

One of the ways the BRM was stupid is that it approved changes to OOXML that have totally broken SpreadsheetML’s financial calculations, rendering the resulting calculations both mathematically incorrect and inconsistent with what Excel actually calculates. More about this later.

Here be dragons.

One of the persistent problems with OOXML has been in the area of “day count conventions” as used in SpreadsheetML.

Why should day counting be complicated? Just count how many days between the two dates and you are done, right? Indeed, everywhere but in finance it is simple. Some of the complications are for historical reasons, to try to make hand calculations easier in the pre-computer era. Also, these conventions made the calendar more regular, so financial instruments were less distorted by calendar irregularities like leap years or variable length months. If you assume a year is exactly 360 days in 12 months with 30 days each, then some things in life are simpler. Of course, this makes other things in life more complicated, including defining spreadsheet functions.

The devil is in the details. You must get these count conventions right. Being wrong by just one day in a year may only be 0.3%, but in a million dollar transaction, that is $3000. Not many people can afford to routinely ignore a $3000 error.

Different financial organizations have developed their own day count conventions. We have the Banker’s Rule, the LIBOR rule, the NASD rule, the ISDA rule, the ICMA rule, the SIFMA rule, etc.

In Excel, these conventions are defined by a function parameter called the “basis” and this basis is used by many financial spreadsheet functions. In fact, none of the following functions are adequately defined unless the day count basis values are adequately defined:

  • ACCRINT()
  • ACCRINTM()
  • AMORDEGRC()
  • AMORLINC()
  • COUPDAYBS()
  • COUPDAYS()
  • COUPDAYSNC()
  • COUPNCD()
  • COUPNUM()
  • COUPPCD()
  • DISC()
  • DURATION()
  • INTRATE()
  • MDURATION()
  • ODDFPRICE()
  • ODDFYIELD()
  • ODDLPRICE()
  • ODDLYIELD()
  • PRICE()
  • PRICEDISC()
  • PRICEMAT()
  • RECEIVED()
  • YEARFRAC()
  • YIELD()
  • YIELDDISC()
  • YIELDMAT()

There are five basis conventions defined in OOXML, with values 0-4:

  • Basis 0 = US 30/360
  • Basis 1 = Actual/Actual
  • Basis 2 = Actual/360
  • Basis 3 = Actual/365
  • Basis 4 = European 30/360

In this convention, X/Y indicates that X is the assumption for month length and Y is the assumption for year length. Note that there is no external authority given as a reference for any of these conventions. So it is not clear whether the Actual/Actual convention follows, for example, ICMA rules or ISDA rules. In fact, it may follow neither. Lacking any external authority, all we have to go on are the definitions provided in the specification itself. OOXML is its own authority for day count conventions.
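To make the flavor of these conventions concrete, here is a minimal Python sketch of the two 30/360 day counts, under one common reading. Treat it as an illustration only: the handling of month-end dates (February in particular) varies between rule sets, and that variation is precisely the ambiguity at issue here.

    from datetime import date

    def days360_us(start, end):
        # One common reading of US (NASD) 30/360, ignoring the
        # end-of-February adjustments that some variants add.
        d1, d2 = start.day, end.day
        if d1 == 31:
            d1 = 30
        if d2 == 31 and d1 == 30:
            d2 = 30
        return ((end.year - start.year) * 360
                + (end.month - start.month) * 30
                + (d2 - d1))

    def days360_european(start, end):
        # European 30/360: clamp both days-of-month to 30 unconditionally.
        d1, d2 = min(start.day, 30), min(end.day, 30)
        return ((end.year - start.year) * 360
                + (end.month - start.month) * 30
                + (d2 - d1))

    # Example: both rules count 2008-01-31 to 2008-03-31 as 60 days.
    print(days360_us(date(2008, 1, 31), date(2008, 3, 31)))        # 60
    print(days360_european(date(2008, 1, 31), date(2008, 3, 31)))  # 60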

As I reported last July, the definitions provided by OOXML did not sufficiently define the behavior of the conventions that lie behind most of the financial functions in Excel. Unfortunately, Microsoft/Ecma failed to fix this problem in their proposed Resolutions, and in fact made it worse. Further, the DIS 29500 BRM, in its negligent bulk approval of Ecma’s responses, merely advanced these serious errors into the text that was then approved as an International Standard. So essentially, my work in pointing out errors in the spreadsheet language was for naught. Microsoft just shoved it all through anyway.

This puts me in a delicate situation. On the one hand, the ODF TC really would like to finish its work on ODF 1.2, and part of that is completing the OpenFormula work. A key remaining part of OpenFormula is to ensure that our financial functions synch up with how Excel works. What OOXML says is irrelevant, except to the extent that it accurately tells how Excel defines these functions.

However, inquiries to Ecma on these day count conventions, inquiries made months ago, have received no response. Also, the final DIS text of OOXML has not been made available, not even to SC34 members and liaisons. And there is no mechanism in place yet in SC34 for collecting defect reports on OOXML. So we’re stuck.

Or maybe not.

David Wheeler, Chair of the OpenFormula subcommittee of the OASIS ODF TC, has been trying to nail down the behavior of Excel’s spreadsheet functions for over a year now. One of the last remaining pieces is to nail down the day count conventions. After waiting and waiting for this to be clarified in OOXML, David took matters into his own hands and decided to solve the problem by brute force, enumerating millions of test cases, indeed a comprehensive set of date pairs over a six-year period, to try to determine exactly how the date bases in Excel work. You can read David’s conclusions on his blog.

What strikes me in David’s report is that not only are the OOXML definitions incomplete and inconsistent, but they do not accurately reflect what Excel actually calculates. So either Excel is wrong, or the OOXML standard is wrong when calculating almost every financial spreadsheet function. This is quite an embarrassment for an ISO standard, and an unnecessary one, since I have been talking about how poorly defined these functions are for almost a year now.

Let’s work through some of David Wheeler’s test cases by hand, to get a better feel for how OOXML is broken.

Let’s take the YEARFRAC() function as the simplest example. YEARFRAC() takes two dates and a basis value as inputs, and returns the fraction of a year that is between those two dates. So logically, the calculation is like:

YEARFRAC = (interval between start-date and end-date) / length of year.

That is the logical definition. The only complication is the 5 date bases, and what exactly the year length is. This last point is something that stuck out in my mind when I first reviewed the draft of OOXML last summer. You might think the year length is 365 days. But what about leap years? And what about date ranges that straddle normal years and leap years? This is the key fact that this function requires in its definition. This is the problem I reported on my blog last July. This is the problem INCITS V1 submitted with our ballot comments last September. This is the problem that Ecma responded to in a severely flawed way in January. This is the problem that the BRM refused to discuss, merely agreeing with Ecma’s flawed changes. And this is the problem that is now in the final DIS version of OOXML.

Take a look for yourself in this brief extract from the final DIS version [pdf] of OOXML, provided for purposes of critical review and commentary.

(Yes, I now have a complete copy of the final DIS version of OOXML. If you think that this is unfair — and I would agree with you on that — then maybe you should ask ITTF why I was able to get a copy of the final DIS, but no one else in SC34 was.)

Let’s take a look at the ISO OOXML definitions and try some test calculations to reproduce some of David’s findings.

First, let’s look at date basis 1, Actual/Actual, since that is the easiest. But we immediately run into a problem. The standard says two different things. In the “description” table it says:

Actual/actual. The actual number of days between the two dates are counted. If the date range includes the date 29 February, the year is 366 days; otherwise it is 365 days.

However, later, when defining the return value, the standard says this:

If the Actual/actual basis is used, the year length used is the average length of the years that the range crosses, regardless of where start-date and end-date fall in their respective years.

There is absolutely no way in which these can both be correct. This would have been easily fixable at the BRM, if the BRM had been allowed to do its job. But I wasn’t even allowed to open my mouth and point out this problem. So now this fatal ambiguity sits in the text of OOXML, as authorized by the BRM experts and approved by the JTC1 NBs. Gotta love it.

But let’s forge on and assume that we really have two algorithms: the first (from the description), which we will call basis 1, and the second (from the return value section), which we will call basis 1′ (“one-prime”). We’ll calculate each example both ways, starting with a sketch of the two readings.
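Here is a minimal Python sketch of the two readings, taking each text at its literal word and computing a single quotient of days over year length. This is my own interpretation, of course — the text does not even say how a range spanning several years should be decomposed, which is part of the problem.

    from datetime import date
    import calendar

    def includes_feb29(start, end):
        # True if any February 29 falls in the range [start, end).
        for year in range(start.year, end.year + 1):
            if calendar.isleap(year) and start <= date(year, 2, 29) < end:
                return True
        return False

    def yearfrac_basis1(start, end):
        # The "description" reading: the year is 366 days if the range
        # includes a February 29, otherwise 365 days.
        days = (end - start).days
        return days / (366 if includes_feb29(start, end) else 365)

    def yearfrac_basis1_prime(start, end):
        # The "return value" reading: divide by the average length of
        # the calendar years that the range touches.
        days = (end - start).days
        lengths = [366 if calendar.isleap(y) else 365
                   for y in range(start.year, end.year + 1)]
        return days / (sum(lengths) / len(lengths))

    print(yearfrac_basis1(date(2000, 1, 1), date(2001, 1, 1)))        # 1.0
    print(yearfrac_basis1_prime(date(2000, 1, 1), date(2001, 1, 1)))  # 1.00137...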

Let’s calculate YEARFRAC(DATE(2000,1,1),DATE(2001,1,1),1)

With basis 1, this is simple. 2000 was a leap year (a century year divisible by 400), so the interval between the two dates is 366 days. Similarly, since the date range includes a February 29th, the length of the year used in the calculation is 366. So the value returned by YEARFRAC should be 366/366, or 1.0.

With basis 1′, this is also simple. The date interval is still 366 days. But the length of the year is now the average of the year lengths crossed by the date range, that is, the average of 366 (for 2000) and 365 (for 2001), or 365.5. So using basis 1′, YEARFRAC should return 366/365.5, or 1.00137.

Let’s try another, with basis 1. What is YEARFRAC(DATE(2000,1,1),DATE(2002,1,1),1) ?

In basis 1, the interval is 731 days (366 days in 2000 plus 365 days in 2001). The year length is 366, since the interval includes a February 29th. So YEARFRAC should return 731/366 = 1.9972.

With basis 1′, the interval is also 731 days, but the year length is the average of 366 (for 2000), 365 (for 2001) and 365 (for 2002), or 365.33. So YEARFRAC should return 731/365.33 = 2.0009.

Just to make sure we have this down, let’s try another example:

What is YEARFRAC(DATE(2000,1,1),DATE(2000,1,2),1) ?

A one day interval? Yes, please humor me.

OK. With basis 1, the interval is one day. Since it does not cross February 29th, the year length is 365 days. So it will return 1/365 or 0.0028.

With basis 1′, the interval is also one day, and the year length is 366. So it will return 1/366, or 0.0027.

OK. That was too easy. One more example, to make sure that you have it down.

What is YEARFRAC(DATE(2000,1,1),DATE(2004,1,31),1) ?

Hmm…. this one will require more thought. I might have to take off my shoes and count using my toes as well.

With basis 1, the interval is 1491 days = 366 + 365 + 365 + 365 + 30.

The year lengths are:

2000=366 (since the range crosses February 29th)
2001=365
2002=365
2003=365
2004=365 (since the date range does not cross February 29th)

So we have 4 full years plus 30 days of a 365-day year. YEARFRAC should return 4 + 30/365 = 4.0822.

With basis 1′, we treat 2004 as having 366 days and average the years in the interval, so the average year length = (366 + 365 + 365 + 365 + 366)/5 = 365.4. So YEARFRAC should return 4 + 30/365.4 = 4.0821.

Now that we’re done with the examples, we can throw them into a table, and compare them to what Excel 2007 calculates for these same parameters:

start-date    end-date      basis   ISO Value   Excel’s Value   Excel Correct?
2000-01-01    2001-01-01    1       1.0000      1.0000          Yes
2000-01-01    2001-01-01    1′      1.00137     1.0000          No
2000-01-01    2002-01-01    1       1.9972      2.0009          No
2000-01-01    2002-01-01    1′      2.0009      2.0009          Yes
2000-01-01    2000-01-02    1       0.0028      0.0027          No
2000-01-01    2000-01-02    1′      0.0027      0.0027          Yes
2000-01-01    2004-01-31    1       4.0822      4.0805          No
2000-01-01    2004-01-31    1′      4.0821      4.0805          No

Let your mind linger on these results for a bit. Let it sink in. Look at this table until you recognize its significance and cringe in disgust. In some cases, Excel seems to be using the first definition of date basis 1. In other cases it is using the second definition. And in one case it is using neither definition. In other words, this is a lot more screwed up than it first appears. This is not just a simple ambiguity. The OOXML definition of date basis 1 is totally wrong.

That is just the Actual/Actual date basis. The other four conventions are more complicated. I encourage you to read David’s write-up in full to see how 3 of the 5 basis conventions defined in OOXML differ from what Excel actually calculates. David also shows how he believes Excel really calculates these day count conventions, based on his extensive tests.

Now if this were just a matter of one function in Excel, just YEARFRAC(), then this would not be a big deal. But this flaw is inherited by most of the financial functions in OOXML. Let’s take an example at random, DISC(), which calculates the discount rate for a security, given settlement and maturity dates, as well as par and redemption values. You can read the definition of DISC() from the final DIS text here [pdf].

You don’t need to be a Wall Street quantitative analyst to see some obvious problems here. First, in the formula given, the second term is divided by “SM”. What is SM? There is no “SM” defined here. There is a “DSM” defined, however. Is that what is meant? Let’s assume so.

We can try a test calculation, using the example given in the text of OOXML:

What is the value of: DISC(DATE(2007,1,25),DATE(2007,6,15),97.975,100,1) ?

  • B = 365, since the date range does not include February 29th. (Note that DISC has a single definition for year length in basis 1, not the two conflicting definitions we saw in YEARFRAC)
  • DSM = 141 days = 6 in January + 28 in February + 31 in March + 30 in April + 31 in May + 15 in June
  • par = 97.975, which was our input parameter
  • redemption = 100, which is another one of our input parameters

You can do the calculation given in the text and find that the return value of DISC should be 5.35037%. Plug the example into Excel and you will get the answer 5.24202%.

So Excel is off by 2% or so. Do we really care? It’s just money, right?

The problem is that the function in OOXML is defined incorrectly, from the financial perspective. The discount rate is the discount from the redemption value, not the discount from the purchase price. So the first term in the formula should be (redemption-par)/redemption, not (redemption-par)/par. If you make this change, then the calculated value matches the value Excel gives.
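To make the difference concrete, here is a minimal Python sketch of both versions of the formula, using the variable names from the standard (and assuming “SM” was indeed meant to be DSM). The day counts B and DSM are passed in directly, since computing them is exactly the contested part:

    def disc_as_written(par, redemption, B, DSM):
        # The formula as printed in the DIS text: discount measured
        # against the purchase price.
        return (redemption - par) / par * (B / DSM)

    def disc_corrected(par, redemption, B, DSM):
        # The financially correct formula: discount measured against
        # the redemption value.
        return (redemption - par) / redemption * (B / DSM)

    # The example from the OOXML text: B = 365, DSM = 141.
    print(disc_as_written(97.975, 100, 365, 141))  # 0.0535037 -> 5.35037%
    print(disc_corrected(97.975, 100, 365, 141))   # 0.0524202 -> 5.24202%, Excel's answer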

Does anything strike you as odd in the above? Do you have a chill running down your spine? Do you have a renewed sense of dread? You should, because I just illustrated another grave problem with the OOXML standard: the spreadsheet examples in OOXML are a fraud.

You might have mistakenly been reassured by the numerous examples in the spreadsheet formula definitions, thinking that they actually had some relationship to the standard, that they were examples of how the calculations should be done, that they were evidence of some sort of quality assurance, that they might even be of assistance to an implementor checking whether a function was implemented correctly.

But they aren’t.

The normal practice would be to take the definitions as given in the OOXML text, calculate values according to those definitions, and then compare the resulting values with what Excel returns. That would provide a useful double-check of the definitions in the text. But OOXML doesn’t do that. The examples here are mere fluff.

The discrepancy here also indicates that no one has actually reviewed these formulas for accuracy. Errors like this are immediately evident, but only if you look. The fact that things like this have escaped the notice of Microsoft, Ecma TC45, their Project Editors, 80 NB reviews, the BRM experts, and the eagle eyes of ITTF should raise considerable concerns over the sufficiency of the Fast Track review and approval process.

Let’s try one more example before we wrap this up.

What is the value of: DISC(DATE(2000,1,1),DATE(2004,1,31),97.975,100,1) ?

  • B = 366, since the date range includes February 29th
  • DSM = 1491 days = 366 + 365 + 365 + 365 + 30
  • par = 97.975, which was our input parameter
  • redemption = 100, which is another one of our input parameters

So, using the stated formula in the OOXML text, the answer would be 0.50736%. Using the “correct” formula, in the financial sense, the answer would be 0.49708%. What does Excel say the value is? Well, 0.49627%, of course.
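Reusing the two functions from the sketch above, the numbers for this example fall out the same way:

    print(disc_as_written(97.975, 100, 366, 1491))  # 0.0050736 -> 0.50736%
    print(disc_corrected(97.975, 100, 366, 1491))   # 0.0049708 -> 0.49708%
    # Excel itself returns 0.49627%, differing yet again because of
    # its day count handling.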

So with the DISC() function we found:

  1. The given text provided a formula that referred to a non-existent “SM” variable. This appears to be a cut & paste error.
  2. After accounting for that, we found that the formula was incorrect according to recognized financial standards. Securities are discounted from their redemption values, not from their purchase prices.
  3. Even correcting for the formula errors, we find that the given definition of DISC() does not match what Excel returns, due to errors and ambiguities in the day count conventions, errors that David Wheeler delves into more deeply in his report.
  4. The examples given in the standard are bogus. They are not actually examples of the defined function.
  5. Excel does not implement OOXML.

Filed Under: OOXML

Spreadsheet file format performance

2008/05/13 By Rob 19 Comments

I’ve been doing some performance timings of file format support, comparing MS Office and OpenOffice. Most of the results are as expected, but some are surprising, and one in particular is quite disappointing.

But first, a few details of my setup. All timings, done by stopwatch, were from Office 2003 and OpenOffice 2.4.0 running on Windows XP, with all current service packs and patches. The machine is a Lenovo T60p, dual-core Intel 2.16 GHz with 2 GB of RAM. I took all the standard precautions: the disk was defragmented, and the test files were confirmed as defragmented using contig. No other applications were running, and background tasks were all shut down.

For test files, I went back to an old favorite, George Ou’s (at the time with ZDNet) monster 50MB XLS file from his series of tests back in 2005. This file, although very large, is very simple. There are no formulas, indeed no formatting or styles. It is just text and numbers, treating a spreadsheet like a giant data table. So tests of this file will emphasize the raw throughput of the applications. Real-world spreadsheets will typically perform worse than this, due to the additional overhead of processing styles, formulas, etc.

A test of a single file is not really that interesting. We want to see trends, see patterns. So I made a set of variations on George’s original file, converting it into ODF, XLS and OOXML formats, as well as making scaled-down versions of it. In total I made 12 different-sized subsets of the original file, ranging down to a 437KB version, and created each file in all three formats. I then tested how long it took to load each file in each of the applications. In the case of MS Office, I installed the current versions of the translators for those formats: the Compatibility Pack for OOXML, and the ODF Add-in for ODF support.

I find it convenient to report numbers per 100,000 spreadsheet cells. You could equally well normalize by the original XLS spreadsheet size, the number of rows of data, or any other correlated variable, but values per 100K cells are simple for anyone to understand.

I’ll spare you all the pretty pictures. If you want to make some, here is the raw data (CSV format). But I will give some summary observations.

For document sizes, the results are as follows:

  • Binary XLS format = 1,503 KB per 100K cells
  • OOXML format = 491 KB per 100K cells
  • ODF format = 117 KB per 100K cells

So the XML formats are far smaller than the legacy binary format. This is due to the Zip compression that both XML formats use. Also, note that the ODF files are significantly smaller than the OOXML files, less than 1/4 the size on average. Upon further examination, the XML document representing the ODF content is larger than the corresponding XML in OOXML, as expected, due to its use of longer, more descriptive markup tags. However, the ODF XML compresses far better than the OOXML version, enough to overcome its greater verbosity and result in files smaller than OOXML. The compression ratio (original/zipped) for ODF’s content.xml is 87, whereas the compression ratio for OOXML’s sheet1.xml is only 12. We could just mumble something about entropy and walk away, but I think this area could bear further investigation.
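If you want to poke at this yourself, per-member compression statistics are easy to pull out of either package with Python’s zipfile module. A minimal sketch (the file name is a placeholder):

    import zipfile

    def report_ratios(path):
        # Print uncompressed size, compressed size, and compression
        # ratio for every member of a Zip-based package (ODF or OOXML).
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                if info.compress_size > 0:
                    ratio = info.file_size / info.compress_size
                    print(f"{info.filename}: {info.file_size} -> "
                          f"{info.compress_size} bytes (ratio {ratio:.1f})")

    report_ratios("monster.ods")  # hypothetical test file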

Any ideas?

For load time, the times for processing the binary XLS files were:

  • Microsoft Office 2003 = 0.03 seconds per 100K cells
  • OpenOffice 2.4.0 = 0.4 seconds per 100K cells

Not too surprising. These binary formats are optimized for the guts of MS Office. We would expect them to load faster in their native application.

So what about the new XML formats? There has been recent talk about the “Angle Bracket Tax” for XML formats. How bad is it?

  • Microsoft Office 2003 with OOXML = 1.5 seconds per 100K cells
  • OpenOffice 2.4.0 with ODF = 2.7 seconds per 100K cells

For typical-sized documents, you probably will not notice the difference. However, with the largest documents, like the 16-page, 3-million-cell monster sheet, the OOXML document took 40 seconds to load in Office, the ODF sheet took 90 seconds to load in OpenOffice, whereas the XLS binary took less than 2 seconds to load in MS Office.

OK. So what are we missing? Ah, yes, ODF format in MS Office, using their ODF Add-in.

  • Microsoft Office 2003 with ODF, using the ODF Add-in = 74.6 seconds per 100K cells

Yup. You read that right. To put this in perspective, let’s look at a single test file, a 600K cells file, as we load it in the various formats and editors:

  • Microsoft Office 2003 in XLS format = 0.75 seconds
  • OpenOffice 2.4.0 in XLS format = 3.03 seconds
  • Microsoft Office 2003 in OOXML format = 8.28 seconds
  • OpenOffice 2.4.0 in ODF format = 14.09 seconds
  • Microsoft Office 2003 in ODF format = 515.60 seconds

Can someone explain to me why Microsoft Office needs almost 10 minutes to load an ODF file that OpenOffice can load in 14 seconds?

(I was not able to test files larger than this using the ODF Add-in, since they all crashed.)

(Update: Since it is the question everyone wants answered, the beta version of OpenOffice 3.0 opens the OOXML version of that file in 49.4 seconds, and Sun’s ODF Plugin for Microsoft Office loads this file in 30.03 seconds.)

This is one reason why I think file format translation is a poor engineering approach to interoperability. When OpenOffice wants to read a legacy XLS file, it does not approach the problem by translating the XLS into an ODF document and then loading the ODF file. Instead, it simply loads the XLS file, via a file filter, into the internal memory model of OpenOffice.

What is a file filter? It is like half of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate on directly. This is far more efficient than translation. This is the untold truth that the layperson does not know. But this is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is how Novell is doing it now, since they discovered that the Microsoft approach is doomed to performance hell.

So it is with some amusement that I watch Microsoft and others propose translation as a solution to interoperability, creating reports about translation, even a proposal for a new work item in JTC1/SC34 concerning file format translation, when the single concrete attempt at translation is such an abysmal failure. It may look great on paper, but it is an engineering disaster. What customers need is direct, internal support for ODF in MS Office, via native code, in a file filter, not a translator that takes 10 minutes to load a file.

The astute engineer will agree with the above, but will also feel some discomfort at the numbers. There is more here than can be explained simply by the use of translators versus import filters. That choice might explain a 2x difference in performance. A particularly poor implementation might explain a 5x difference. But none of this explains why MS Office is almost 40x slower in processing ODF files. Being that much slower is hard to do accidentally. Other forces must be at play.

Any ideas?


Filed Under: ODF, OOXML, Performance

Berry Good, Berry Bad

2008/05/08 By Rob 5 Comments

It has been an interesting Spring here in Westford, weather-wise. April dipped below freezing on the 3rd, 8th, 9th, 15th and 16th. Then we got a warm spell, a week of days that reached 75 °F (23 °C) and even one day that reached 87.4 °F (30.8 °C) (April 23rd). Then it struck, on May 1st, an overnight low of 28.7 °F (-1.8 °C).

The vulnerability, when a late frost like this occurs, is in bud development. If warm sunny days have tricked the plant into bud development, and a freeze then occurs, the bud will be injured or killed. Strawberries are particularly prone to this problem.

Because of interactions of thermal inversions at the ground, humidity levels, etc., a simple temperature reading is not an accurate indicator of whether damage actually occurred. For example, if humidity is high, the temperature can dip to freezing, but in the act of freezing water vapor (creating frost), energy is released (the latent heat of fusion, as the chemists call it). So you have a few degrees of tolerance if humidity is high, if you have a fog, etc. In fact, commercial strawberry growers handle this problem by running sprinklers when a freeze threatens, to increase the amount of water around the plants available to freeze, as a buffer to protect the plants. Every degree helps.

But I wasn’t so lucky. The extent of damage was not clear until the strawberry plants started blooming this week. Here’s what I am seeing.

A healthy strawberry blossom

Above is an example of a normal, healthy strawberry blossom. You see the ring of stamens, the male organs of the flower, each with a filament stalk tipped with an anther containing the pollen. In the center is the receptacle with the many carpels, which are the female side of the equation.

But in the picture below, we see a blossom from my garden that shows injury. Although the plant is sound, and it did flower, the carpels are dead. This blossom will not yield a berry.

From the looks of it, 40-50% of buds are damaged in this way. So no strawberry wine this year. I’ll only have enough for fresh eating and ice cream.

A strawberry flower that has suffered frost damage


Filed Under: Gardening Tagged With: Strawberries, Weather

Achieving the impossible

2008/05/07 By Rob Leave a Comment

Unadulterated copy of James Clark’s Relax NG validator jing. Unadulterated copy of Kohsuke Kawaguchi’s Sun Multi-Schema Validator msv. Unadulterated copy of the ODF 1.0 Relax NG schema. Unadulterated copy of the ODF 1.0 Standard, in ODF format.

No errors from either validator.

msv is so good as to tell us “the document is valid”. But jing indicates success with only silence. So will I.


Filed Under: ODF

Standards Words

2008/05/06 By Rob 7 Comments

Introduction

There are several words, more widely used than understood, that recur frequently when discussing standards. Specification and standardization require us to describe technology precisely, in such a way that practitioners in the field can achieve the goals set out in the standard. But this precision is only perfectly intelligible to those who share the same code words. What follows is a handful of the more important ones, what they mean, and how they are unintentionally confused or intentionally misused. You are at a distinct disadvantage when reading (or writing) a news article or blog post, or when evaluating an argument, if you do not know the correct meaning of the following words.

Standard

Take the definition from ISO/IEC Guide 2:2004, definition 3.2:

[A] document, established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context.

NOTE Standards should be based on the consolidated results of science, technology and experience, and aimed at the promotion of optimum community benefits.

So, it is a document, a written description, not an embodiment in the form of a product, that is standardized. Its aims are the “achievement of optimum degree of order” and “promotion of optimum community benefits”, and it is achieved through consensus and consolidation.

international standard

According to ISO/IEC Guide 2:2004, definition 3.2.1.1:

[A] standard that is adopted by an international standardizing/standards organization and made available to the public.

International Standard

[An] international standard where the international standards organization is ISO or IEC

Note the distinction. With capital letters, only ISO and IEC standards qualify. With lowercase, other standards are included. This is a bit self-serving. ISO and IEC Standards are the only International Standards, because ISO says so. Sorry ITU, sorry CEN, sorry W3C.

So think of “International Standards” as a controlled mark of ISO, like “parmigiano reggiano” is a controlled mark of the Northern Italian Cheese Consorzio.

Normative

The normative parts of a standard are those which set out the scope and provisions of the standard. See ISO Directives, Part 2, section 3.8.

Provisions

The provisions of a standard consist of:

  1. Requirements that must be met for conformance
  2. Recommendations
  3. Statements of permissible, possible or capable actions

See ISO Directives, Part 2, section 3.12.

Note that standards have specific words which denote and distinguish requirements, recommendations and capabilities. Different standards organizations have different vocabulary for this, so a W3C Recommendation, an IETF RFC and an ISO Standard may have different ways of stating the same provision. For ISO Standards, the conventions are:

  • “shall” and “shall not” are the normal terms for expressing requirements.
  • “should” and “should not” are the normal terms for expressing recommendations.
  • “may” and “need not” are the normal terms for expressing permission.
  • “can” and “cannot” are the normal terms for expressing possibility and capability.

This is necessary because of the extreme ambiguity of the English language in the area of modality. Consider the following sentences, using the word “must”:

  • (On hearing the doorbell ring), “Oh, that must be the mailman!” [expressing likelihood]
  • (To a misbehaving child) “You must obey your mother” [expressing obligation]

Or the following exchange with a teenage daughter:

  • Teen: “I shall return by 11pm” [simple future]
  • Parent: “No, you shall return by 10pm” [expressing a command]

We can be loose and still be understood, in context, in normal conversation, but in standards work we try to be precise and uniform in the use of our controlled vocabulary.

Conformance

This simply is a question of whether something meets the requirements of the standard. However, for many standards, there are multiple levels, perhaps even multiple classes of conformance. So you need to be very specific about what you are saying.

For example, you should not ask “Does Excel 2007 conform to OOXML?” You should ask “Is Excel 2007 a conforming transitional class SpreadsheetML Producer?” If you count it all up, OOXML probably has at least 18 distinct conformance classes, by various combinations of applications, documents, readers/writers and transitional/strict conformance classes.

Note in particular that conformance does not mean that an application implements the entire standard.

[My definition above is not very satisfactory. Anyone have something better? Is there an ISO definition of conformance?]

Compliance

This is not a typical standards term. The more typical term is “conformance”. Best to avoid it unless you are speaking in a regulatory or legal context. See ISO Directives, Part 2, section 6.6.1.1:

A document does not in itself impose any obligation upon anyone to follow it. However, such an obligation may be imposed, for example, by legislation or by a contract. In order to be able to claim compliance with a document, the user needs to be able to identify the requirements he/she is obliged to satisfy. The user also needs to be able to distinguish these requirements from other provisions where there is a certain freedom of choice.

Validity

This is an XML term, referring to the relationship between an XML document instance (an XML file) and a schema (the definition of the syntax of the markup language). Generally, an XML document instance is valid if it adheres to the constraints defined in the schema. The precise definition of validity will depend on the schema definition language used.
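For example, checking whether an XML instance is valid against a Relax NG schema takes only a few lines with Python’s lxml library. A minimal sketch, with placeholder file names:

    from lxml import etree

    # Load the schema (say, the ODF 1.0 Relax NG schema) and the
    # instance to check (say, content.xml from an ODF package).
    # Both file names here are hypothetical.
    schema = etree.RelaxNG(etree.parse("OpenDocument-schema-v1.0.rng"))
    doc = etree.parse("content.xml")

    print(schema.validate(doc))  # True if the instance is valid
    print(schema.error_log)      # any constraint violations found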

I’d welcome any suggestions for other words or definitions that should be included here.


Filed Under: Standards
