Monday, January 08, 2007

The Formats of Excel 2007

I've installed the new Office 2007. This isn't my preferred platform. In fact I find I'm not using heavy-weight editors of any variety much. For every page I compose in a dedicated word processor I author perhaps 50 pages in emails, blogs or wiki's. However, since I do have a license for Office 2007, and I am curious, I decided to take it for a spin. If you want to be a film critic, you've got to see the movies...

Here is a quick survey of what I saw in Excel 2007, concentrating on the file format support, my particular area of interest.

First, let's look at the "Save As" dialog. As you can see from this screen capture, we have some new options:




The Default

The first choice saves in the default format. This is configurable under "Excel Options", but by default this saves in the new Office Open XML (OOXML) format, with an "xlsx" file extension.

With Macros

The "Excel Macro-Enabled Workbook" option saves as an "xlsxm" extension. It is OOXML plus proprietary Microsoft extensions. These extensions, in the form of binary blob called vbaProject.bin, represent the source code of the macros. This part of the format is not described in the OOXML specification. It does not appear to be a compiled version of the macro. I could reload the document in Excel and restore the original text of my macro, including whitespace and comments. So source code appears to be stored, but in an opaque format that defied my attempts at deciphering it.

(What's so hard about storing a macro, guys? It's frickin' text. How could you you screw it up? )

This has some interesting consequences. It is effectively a container for source code that not only requires Office to run it, but requires Office to even read it. So you could have your intellectual property in the form of extensive macros that you have written, and if Microsoft one day decides that your copy of Office is not "genuine" you could effectively be locked out of your own source code.

New Style Binary

The "Excel Binary Workbook" option caught me by surprise. This is not the legacy binary formats. This is not the new OOXML. This is a new binary format, with an "xlsb" extension. Similar to OOXML it has a Zip container file (the so-called Open Packaging Conventions container file format), but the payload consists (aside from a manifest) entirely binary files.

I can't tell if they are some proprietary binary mapping of the OOXML XML, or whether this is an entirely new binary format unrelated to the XML format. In any case this format is entirely undocumented and is unreadable to anyone by Microsoft.

It is also interesting that Microsoft is positioning this format as the preferred one for performance and interoperability. The online help for Excel 2007 says:
In addition to the new XML-based file formats, Office Excel 2007 also introduces a binary version of the segmented compressed file format for large or complex workbooks. This file format, the Office Excel 2007 Binary (or BIFF12) file format (.xls), can be used for optimal performance and backward compatibility.

Old Style Binary

The Excel 97-2003 option provides the legacy binary "xls" formats, the familiar BIFF format from earlier versions of Office.

Find add-ins

This takes you to a page where you can download the "Microsoft Save as PDF or XPS" Add-in. Note that you are prompted to download an Add-in that provides support for both PDF and XPS. But if you hunt around a bit you can find another page where you can download just one format or the other, which is what I did, installing just the PDF support. This added a new option, "PDF" to the Save As dialog.

Other Formats

This brings up a dialog where you can choose from the previously mentioned formats as well as the several legacy export formats, including:


Summary

My overall impression was soured a bit by the large number of crashes I experienced. Indeed Excel crashed on exit on almost every session. This was dozens of crashes over the course of an afternoon. This will need to be fixed before I would trust it with my data.

Another curiosity was a legacy binary document that gave the following error message whenever I tried to save it to the new OOXML format:




It did not get this message when I saved it back to the binary format. So evidently I'm losing something when moving to OOXML, whatever "Line Print settings" are. So much for the claims of 100% backwards compatibility...

My examination also put to rest any lingering hope I had that Microsoft had fundamentally changed their position on proprietary file formats and has decided to follow in the paths of openness. The new proprietary binary format and the undocumented ways that macros are encoded put any hope of that to rest.



1/22/07, A quick update: Microsoft's Doug Mahugh helped track down and fix the crash problem I had earlier reported when exiting Excel. This is a bug in the"Send to Bluetooth" COM Add-in that Excel was loading at startup. After disabling that Add-in, I'm no longer crashing.

Labels: ,

Thursday, October 12, 2006

A Leap Back

1/23/2007 — A translation of this post, in Spanish has been provided by a reader. You can find it in the Los Trylobytes blog.

I've also taken this opportunity to update page and section references to refer to the final approved version of the Ecma Office Open XML specification, as well as providing a link to the final specification.


Early civilizations tried to rationalize the motions of the heavenly bodies. The sun rises and sets and they called that length of time a “day”. The moon changes phases and they called a complete cycle a “month”. And the sun moves through the signs of the zodiac and they called that a “year”. Unfortunately, these various lengths of time are not nice integral multiples of each other. A lunar month is not exactly 30 days. A solar year is not exactly 12 lunar months.

To work around these problems, civil calendars were introduced — some of the world's first international standards — to provide a common understanding of date reckoning, without which commerce, justice and science would remain stunted.

In 45 B.C., Julius Caesar directed that an extra day be added to February every four years. (Interestingly, this extra day was not a February 29th as we have today in leap years, but by making February 24th last for two days.) This Julian System was in use for a long time, though even it has slight errors. By having a leap year every four years, we had 100 leap years every 400 years. However, to keep the seasons aligned properly with church feasts, etc., (who wants to celebrate Easter in Winter?) it was necessary to have only 97 leap years every 400 years.

So, in 1582 Pope Gregory XIII promulgated a new way of calculating leap years, saying that years divisible by 100 would be leap years only if they were also evenly divisible by 400. So, the year 1600 and 2000 were leap years, but 1700, 1800 and 1900 were not leap years. This Gregorian calendar was initial adopted by Catholic nations like Spain, Italy, France, etc. Protestant nations pretty much had adopted it by 1752, and Orthodox countries later, Russia after their 1918 revolution and Greece in 1923.

So, for most of the world, the Gregorian calendar has been the law for 250-425 years. That's a well-established standard by anyone's definition. Who would possibly ignore it or get it wrong at this point?

If you guessed “Microsoft”, you may advance to the head of the class.

Datetimes in Excel are represented as date serial numbers, where dates are counted from an origin, sometimes called an epoch, of January 1st, 1900. The problem is that from the earliest implementations Excel got it wrong. It thinks that 1900 was a leap year, when clearly it isn't, under Gregorian rules since it is not divisible by 400. This error causes functions like the WEEKDAY() spreadsheet function to return incorrect values in some cases. See the Microsoft support article on this issue.

Now I have no problems with that bug remaining in Excel for backwards compatibility reasons. That's an issue between Microsoft and their customers and not my concern. However, I am quite distressed to see this bug promoted into a requirement in the Ecma Office Open XML (OOXML) specification. From Section 3.17.41 of SpreadsheetML Reference Material, page 3305 of the OOXML specification (warning 49MB PDF download!) , “Date Representation”:

For legacy reasons, an implementation using the 1900 date base system shall treat 1900 as though it was a leap year. [Note: That is, serial value 59 corresponds to February 28, and serial value 61 corresponds to March 1, the next day, allowing the (nonexistent) date February 29 to have the serial value 60. end note] A consequence of this is that for dates between January 1 and February 28, WEEKDAY shall return a value for the day immediately prior to the correct day, so that the (nonexistent) date February 29 has a day-of-the-week that immediately follows that of February 28, and immediately precedes that of March 1.

So the new OOXML standard now contradicts 400 years of civil calendar practice, encodes nonexistent dates and returns the incorrect value for WEEKDAY()? And this is the mandated normative behavior? Is this some sort of joke?

The “legacy reasons” argument is entirely bogus. Microsoft could have easily have defined the XML format to require correct dates and managed the compatibility issues when loading/saving files in Excel. A file format is not required to be identical to an application's internal representation.

Here is how I would have done it. Define the OOXML specification to encode dates using serial numbers that respect the Gregorian leap year calculations used by 100% of the nations on the planet. Then, if Microsoft desires to maintain this bug in their product, then have Excel add 1 to every date serial number of 60 or greater when loading, and subtract 1 from every such date when saving an OOXML file. This is not rocket science. In any case, don't mandate the bug for every other processor of OOXML. And certainly don't require that every person who wants the correct day of the week in 1900 to perform an extra calculation.

Sure this requires extra code to be added to Excel. Excel has a bug. Of course it will require code to fix a bug. Deal with it. I think the alternative of forcing the rest of the world to a adopt a new calendar system is the ultimate in chutzpah. The burden of a bug should fall on the product that has the bug, not with everyone else in the world.

Further, I'd note that section 3.2.28 (page 2693) defines a workbookPr (Workbook Properties) element with several attributes including the following flag:

date1904 (Date 1904)

Specifies a boolean value that indicates whether the date systems used in the workbook starts in 1904.

A value of on, 1, or true indicates the date system starts in 1904.
A value of off, 0, or false indicates the workbook uses the 1900 date system, where 1/1/1900 is the first day in the system.

The default value for this attribute is false.

What is so special about 1904 you might ask? This is another legacy problem with Excel, that implementations of Excel on the Mac, for reasons unknown to me, had an internal date origin of January 1st, 1904 rather than January 1st, 1900. This is unfortunate for Microsoft's Mac Business Unit, and has likely been a source of frustration for them, needing to maintain these two date origins in their internal code.

But why is this my problem? Why should a standard XML format care about what Excel does on the Mac? Why should it care about any vendor's quirks? If RobOffice (a fictional example) wants to internally use a date origin of March 15th, 1903 then that is my business. In my implementation I can do whatever I want. But when it comes to writing a file format standard, then the caprices of my implementation should not become a requirement for all other users of the file format. Further, if I cannot make up my mind and choose a single date origin then my indecisions should not cause other implementations to require extra code because of my indecision.

So there you have it, two ways in which Microsoft has created a needlessly complicated file format, and made your life more difficult if you are trying to work with this format, all to the exclusive advantage of their implementation. I wish I could assure you that this is an isolated example of this approach in OOXML But sadly, it is the rule, not the exception.

Labels: , ,

This page is powered by Blogger. Isn't yours?