ODF

An Invitation to the Apache ODF Toolkit

2011/08/15 By Rob 3 Comments

Perhaps overlooked in all the excitement generated by the move of OpenOffice.org to Apache was the fact that a parallel move is occurring with the ODF Toolkit. A few weeks ago we submitted a proposal to Apache to start a new project based on the Java components that were until then hosted by the ODF Toolkit Union. This was done after consulting with the ODF Toolkit community and getting approval from the ODF Toolkit Union’s Steering Committee. This proposal was recently reviewed, voted on and approved by Apache. So now we have the Apache ODF Toolkit project in the Apache Incubator.

So what is this project and what is it good for?

This project consists of Java libraries and tools for working with ODF documents. Not editors, not viewers, not anything with a user interface. These are not end-user tools. These are tools for developers who need to write programs that read, write or manipulate ODF documents. These tools do not require that you have any ODF editor installed. They operate directly on the files. So they are ideal for running on a server, for things like report generation, information extraction, document validation, conversion, etc. We have a page of demos that gives a good idea of the range of things possible with the ODF Toolkit.

The ODF Toolkit is important because it enables innovation on top of ODF. By analogy, look at HTML. At one point, the web consisted mainly of hand-authored documents at a handful of academic and government websites. If that was all there was to the web, it would not have been very interesting. What made the web the platform it is today has been the technologies that enable server-side generation of web pages from database queries, or services that analyze web pages and extract and aggregate information. Google was made possible because HTML was an open standard that could be programmatically understood. PHP was possible because HTML was an open standard that could be written.

ODF, unlike the previous generation of binary document formats, is also an open standard. You can read and write ODF documents freely. But writing the code to understand the nitty-gritty of the ODF format is a considerable task. The ODF Toolkit makes this easy for Java programmers. How easy? Here is a “hello world” text document:

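import org.odftoolkit.simple.TextDocument;

// Create an empty text document, add one paragraph, and save it as ODT.
TextDocument doc = TextDocument.newTextDocument();
doc.addParagraph("Hello world!");
doc.save("hello.odt");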

Other tasks, like changing styles, combining presentation slide decks, searching and replacing text in a document, and extracting text from a document, are also simple. More examples that give a flavor of the ODF Toolkit are in the “cookbook”.

But along with the “Simple API” the ODF Toolkit has the ODFDOM layer. This layer allows you to get to every part of an ODF document, at the finest-grained level. Some tools out there give you only a high-level API but then leave you hanging if you want to do something more complicated. Not so with the ODF Toolkit. If you want to drill down and adjust the line spacing of a bullet list in a footnote, then you can do it.
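For a sense of what both layers sit on top of, here is a minimal sketch that uses only the JDK and no part of the ODF Toolkit. It relies on two facts about the format: an ODF document is a ZIP package, and the body text lives in an entry named content.xml as text:p elements. The class name RawOdfSketch and its methods are made up for illustration, and the package it writes is deliberately stripped down (no mimetype entry, no manifest), so treat it as a demonstration of the plumbing rather than a way to produce conformant files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

final class RawOdfSketch {
    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build a stripped-down package holding only content.xml.
    // Assumption: good enough for this demo, but NOT a conformant ODF file.
    static byte[] buildPackage(String... paragraphs) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        xml.append("<office:document-content xmlns:office=\"").append(OFFICE_NS)
           .append("\" xmlns:text=\"").append(TEXT_NS).append("\">")
           .append("<office:body><office:text>");
        for (String p : paragraphs) {
            xml.append("<text:p>").append(escape(p)).append("</text:p>");
        }
        xml.append("</office:text></office:body></office:document-content>");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Find content.xml inside the ZIP and collect the text of every text:p element.
    static List<String> readParagraphs(byte[] odfPackage) {
        byte[] contentXml = null;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(odfPackage))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (entry.getName().equals("content.xml")) {
                    contentXml = zip.readAllBytes(); // reads the current entry only
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (contentXml == null) {
            throw new IllegalArgumentException("no content.xml entry found");
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to match elements by namespace URI
            NodeList nodes = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(contentXml))
                    .getElementsByTagNameNS(TEXT_NS, "p");
            List<String> paragraphs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                paragraphs.add(nodes.item(i).getTextContent());
            }
            return paragraphs;
        } catch (Exception e) {
            throw new IllegalStateException("content.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        byte[] pkg = buildPackage("Hello world!", "Fish & chips");
        for (String p : readParagraphs(pkg)) {
            System.out.println(p);
        }
    }
}
```

Even this toy version needs ZIP handling, namespaces and escaping just to round-trip two paragraphs, and it ignores styles, metadata and the manifest entirely. That gap is what the toolkit is there to fill.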

These components enable innovation on top of ODF, innovation that thinks “outside the editors” and “beyond office”.

So how do you get involved? If you want to help with the project then I invite you to sign up on the project’s development mailing list. And if you have questions about using the ODF Toolkit, but don’t want the additional email traffic from the dev list, then you can sign up for the users list. Of course, I’ve signed up for both lists. I hope I’ll see you there!

ODF Plugfest, Berlin

2011/08/09 By Rob 2 Comments

I attended the 6th ODF Plugfest, which took place in Berlin a few weeks ago, hosted by the German Federal Ministry of the Interior (BMI) and the Ministry of Economics and Technology (BMWi). It followed the pattern of previous events: a two-day event, with the first day dedicated to technical interop activities among implementors, followed by a day of public presentations. The full agenda is here.

(I’m the jet-lagged figure in the lower left of the above group photo taken on the first day of the Plugfest).

We tested a range of interoperability scenarios, including preservation of RDF metadata, conditional formatting in spreadsheets and advanced text layout scenarios (“hard cases”). There was also a variety of presentations, from vendors, academics and government. I’ve uploaded my presentations if you are interested. I’ll draw your attention as well to Ross Gardler’s talk on The Apache Way and OpenOffice.org. He covered a lot of good material. (A recording of the talk is also online.)

Finally, it was also announced that the next ODF Plugfest will occur in the Netherlands, in Gouda, in November. And another one is anticipated in Brussels in March 2012.

Gwennel Doc: A Small and Fast ODF Text Editor

2011/05/26 By Rob 9 Comments

Today I look at Gwennel Doc, an ODF-based text editor for Microsoft Windows. An interesting attribute of Gwennel is its small size and speed. It can load and display the 792-page ODF 1.2, Part 1 specification in around 2 seconds, using an executable that is around 1/4 the size of that document. Something interesting is going on here that needs investigation. I contacted the author of Gwennel Doc, Marc Kerbiquet, who consented to the following email interview. Enjoy!

Could you tell us a little bit about yourself, where you live and what you do for work? Are you a professional programmer? Or a hobbyist?

I live in France, I work as a professional programmer to pay the bills and I write programs as a hobbyist. Gwennel Doc is a hobby program.

What got you interested in writing a text editor? How did you pick the name “Gwennel”?

I first wrote a folding text editor for programmers (Code Browser), then I wrote an ODF viewer (Woodrat Reader), so an ODF editor was a natural continuation of this :-)

The initial goal was to make a folding/outlining editor like Code Browser with rich text. But it would have required a hybrid format to handle folding directives.

“Gwennel” means “swallow” in the Breton language, a small and fast bird. Breton is a language spoken in Brittany, a region in the north-west of France.

You call your tool a “WYSIWYM” (What You See is What you Mean) editor. How is this different than other editors, and how does Gwennel Doc support this style of editing?

It is different from all the lightweight rich-text editors that edit RTF documents or equivalent formats because it supports styles. Styles allow you to separate presentation from content: you can tag a word as “menu item” or “keyword” instead of “Bold” or “Color-Red” and change later how it should be displayed.

Word and OpenOffice allow WYSIWYM but they promote the WYSIWYG (What You See is What You Get) paradigm.
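To make the distinction concrete, here is a small sketch of the idea in Java. It is not how Gwennel Doc or ODF represents styles. The class NamedStyleSketch, its two toy properties (bold and color) and the bracketed output markup are all invented for illustration. The point is only that the text is tagged with a style name, so redefining the style changes every tagged run without touching the content.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NamedStyleSketch {
    // Presentation: how a named style looks (two toy properties, chosen for illustration).
    static final class Look {
        final boolean bold;
        final String color;
        Look(boolean bold, String color) { this.bold = bold; this.color = color; }
    }

    // Content: a run of text tagged with what it means (styleName is null for plain text).
    private static final class Span {
        final String text;
        final String styleName;
        Span(String text, String styleName) { this.text = text; this.styleName = styleName; }
    }

    private final Map<String, Look> styles = new HashMap<>();
    private final List<Span> spans = new ArrayList<>();

    // Define or redefine what a style name looks like.
    void defineStyle(String name, Look look) { styles.put(name, look); }

    void addPlain(String text) { spans.add(new Span(text, null)); }

    void addTagged(String text, String styleName) { spans.add(new Span(text, styleName)); }

    // Render to a made-up bracket markup so the effect of a style change is visible.
    String render() {
        StringBuilder out = new StringBuilder();
        for (Span span : spans) {
            Look look = span.styleName == null ? null : styles.get(span.styleName);
            if (look == null) {
                out.append(span.text); // untagged, or tagged with an undefined style
                continue;
            }
            out.append('[').append(look.bold ? "bold " : "").append(look.color).append(']')
               .append(span.text).append("[/]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        NamedStyleSketch doc = new NamedStyleSketch();
        doc.defineStyle("menu item", new Look(true, "black"));
        doc.addPlain("Choose ");
        doc.addTagged("File", "menu item");
        doc.addPlain(" to begin.");
        System.out.println(doc.render());

        // Change the presentation later; the content is untouched.
        doc.defineStyle("menu item", new Look(false, "red"));
        System.out.println(doc.render());
    }
}
```

With direct formatting (“Bold”, “Color-Red”) the second step would mean finding and editing every affected run by hand.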

As your website describes, your intent was to make a text editor, not a full word processor. How do you define the boundary between these two? What features did you decide to omit?

The goal of a word processor is to produce a printed document. The paper is an important aspect:

the page format

the header and footer

how paragraphs and tables must be split when the end of the page is reached

footnotes

the table of contents

the index

Gwennel Doc is more of a note-taking program intended for on-screen reading, so it does not have to deal with all these features.

The Print command is not implemented yet, but it will be very basic.

What made you choose ODF as a document format?

I already worked with ODF before (Woodrat Reader)

I didn’t want to create a new format.

As far as I know, there is no other open standard designed for editing and supporting styles:

RTF: no support for styles

HTML + CSS: not intended for editing

OOXML: just a political standard, too complicated anyway

Interoperability, even if limited:

it can partially read documents from other word processors (unsupported elements are just ignored)

Use OpenOffice for all missing features (print, export as PDF, …)

Gwennel Doc documents can be read without Gwennel Doc

How hard was it to support ODF in Gwennel? What was the hardest part?

Gwennel was designed from the start to work with ODF, so the application model, apart from the table styles, fits very well with ODF. Easy to load, easy to save.

A difficult part was understanding the ODF specification, but I had already done that when writing Woodrat Reader.

The hardest part was finding a way to implement table styles in Gwennel, as ODF has no support for table styles. I finally found a solution that keeps compatibility with ODF and a minimum of interoperability. Unfortunately, table styles are lost when a Gwennel document is edited with OpenOffice.

One thing that strikes the user is how small and fast Gwennel is. It is less than 150KB in size and requires no install. Compared to other word processors, it is amazingly fast. How did you accomplish this? Can you give some details on your approach, such as what programming language you used, what ZIP and XML libraries you used, etc.? What is the secret to making a small, fast editor?

I take a lot of care over speed. Almost everything should be instantaneous on a computer that can execute billions of instructions per second.

Now, here is the secret :-)

First, the executable is 270K, not 150K.

I cheated: I used UPX to compress the executable.

For curious people, here is the detail of what’s inside:

50K – the zip library (zlib)

20K – the XML parser (AsmXML)

40K – core library (memory management, strings, lists, GUI layer)

70K – the rich edit component

24K – the (partial) ODT schema

64K – the main application code and resources (text, menu, icons)

The operating system (Windows XP and later) provides all the remaining stuff (GDI for font and image rendering, GDI+ for image manipulation, …).

There is even unused code that I could remove to save a few kilobytes.

On the other hand, I plan to add a lot of pictures to better show the role of properties in styles, which should increase the size of the executable by 100K.

The program is written entirely in assembly (except for the zlib library)

It would take too long to explain why, but I didn’t choose assembly for speed (that may seem crazy, as any programmer would say that the only reason to use assembly is speed). Woodrat Reader is a bit faster than Gwennel Doc at loading a document, and it is written in C++.

The most visible benefit of assembly is the size of the application, not the speed.

I use common optimization techniques:

choosing the right algorithm and data model in the critical parts (e.g. a hash table instead of a simple list)

optimizing access to memory

caching data (to save computation)

I don’t think that Gwennel is fast; it’s the other word processors that are slow.

Some reasons are common to software bloat found in most software:

long history of development

marketing considerations (spending more time developing new features than optimizing)

Other reasons are more specific to word processing as word processors support a lot more features than Gwennel, for instance:

font kerning makes the computation of the layout more complicated,

computing the paging in realtime requires a lot of CPU (Gwennel uses just one infinite-length page)

Gwennel is written for modern machines with a modern OS. Font and image rendering is done entirely by the operating system, so it can take advantage of hardware acceleration. On the other hand, it is limited to the capabilities of the system library, and some features cannot be implemented (e.g. no control over spacing between characters, no outline effect).

Loading is fast but it could be faster. The time to load the file, unzip it, parse the XML and build the model is almost immediate even with big documents (1000 pages); most of the time is spent laying out the text by asking Windows for the width and height of each word. Windows is very good at this, but it could be optimized: for instance, it shouldn’t be necessary to make a system call for each occurrence of the word “the” in Times New Roman 12pt, because the result will always be the same.
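The optimization Marc describes at the end of that answer is a cache keyed on font, size and word. Here is a minimal sketch of the idea in Java (Gwennel itself is written in assembly, and this is not its code). The Measurer interface stands in for the expensive operating-system call and is hypothetical, as are the class and method names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class WordWidthCacheSketch {
    // Stand-in for the expensive call into the OS text-measuring API (hypothetical).
    interface Measurer {
        int measure(String font, int pointSize, String word);
    }

    // Cache key: the same word in the same font and size always has the same width.
    private static final class Key {
        final String font;
        final int pointSize;
        final String word;

        Key(String font, int pointSize, String word) {
            this.font = font;
            this.pointSize = pointSize;
            this.word = word;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return pointSize == k.pointSize && font.equals(k.font) && word.equals(k.word);
        }

        @Override
        public int hashCode() {
            return Objects.hash(font, pointSize, word);
        }
    }

    private final Measurer measurer;
    private final Map<Key, Integer> widths = new HashMap<>();
    private int expensiveCalls = 0;

    WordWidthCacheSketch(Measurer measurer) {
        this.measurer = measurer;
    }

    // Return the cached width if present; otherwise measure once and remember the result.
    int width(String font, int pointSize, String word) {
        Key key = new Key(font, pointSize, word);
        Integer cached = widths.get(key);
        if (cached == null) {
            expensiveCalls++;
            cached = measurer.measure(font, pointSize, word);
            widths.put(key, cached);
        }
        return cached;
    }

    // How many times the underlying measurer was actually invoked.
    int expensiveCalls() {
        return expensiveCalls;
    }

    public static void main(String[] args) {
        // Fake measurer for the demo: width depends only on word length and point size.
        WordWidthCacheSketch cache =
                new WordWidthCacheSketch((font, size, word) -> word.length() * size / 2);
        String[] words = "the cat and the dog and the bird".split(" ");
        for (String word : words) {
            cache.width("Times New Roman", 12, word);
        }
        System.out.println(words.length + " words laid out with "
                + cache.expensiveCalls() + " measuring calls");
    }
}
```

Real text has far more repetition than this eight-word sample, so the saving grows with document length. The hash table lookup also matches the first item in Marc’s list of techniques above.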

Do you have any future plans for Gwennel?

The future plan for Gwennel Doc is to make it a ‘finished’ application:

a Print command,

a Find command,

and minor goodies one can expect such as opening recent documents.

There is no plan to support more elements of ODF, but compliance still has to be improved (online ODF validators are not very happy with documents created by Gwennel).

Ten Things You Didn’t Know About ODF 1.2

2011/05/05 By Rob 6 Comments

Some little known facts, all of them true, but only some of them amusing, and even then only just so, about ODF 1.2, recently approved as a Committee Specification by the OASIS ODF TC:

In producing OASIS ODF 1.2, we had 184 Technical Committee meetings, not including the numerous subcommittee meetings.
During the development of ODF 1.2, the active TC membership grew by 78%.
The ODF TC, during the ODF 1.2 work, had 76 members, from 17 countries, representing 23 companies or organizations, as well as 17 individual members. The sun never sets on the ODF TC.
ODF TC members received 14,655 emails from the TC’s email list while working on ODF 1.2, including 474 notes with a post-script (PS), 113 with a post-post-script (PPS) and 13 with a post-post-post-script (PPPS), suggesting a new phrase for derangement: “going postscript”.
ODF 1.2 has been out for public review a total of 210 days.
The ODF TC resolved 1,822 public comments while working on ODF 1.2. We read every one of them.
ODF 1.2 says “shall” 628 times, but says “please” only 14 times, making it one of the most discourteous specifications around.
ODF 1.2 has 72 external normative references and 16 external non-normative references.
If you printed out all of ODF 1.2 and laid the pages end-to-end, it would be approximately 20% taller than the Eiffel Tower. You would also probably be arrested.
ODF 1.2’s OpenFormula knows how many imperial pints will fill a cubic light year. But please, drink only in moderation.

OASIS ODF 1.2 Committee Specification Approved

2011/03/25 By Rob 3 Comments

A few quick ODF updates. We have a number of projects moving forward at multiple levels.

First, just last week the OASIS ODF TC approved the ODF 1.2 Committee Specification. This is the highest level of approval we can give to the specification in the technical committee.

As some of you probably know, most standards bodies have a two-level approval process, where work originates in a technical committee (in some organizations called a working group) where the specification is written, reviewed and approved by specialists, before being passed on to a “consensus body” for approval by a wider group of interests. We see this in ISO/IEC JTC1, with work first approved at the WG/SC level, and then final approval given by JTC1.

An OASIS Committee Specification requires 2/3 approval of the TC, with no more than 25% disapproving. ODF 1.2’s ballot ended last week with 17 Yes votes, 100%.
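For those who like to see the arithmetic, the two thresholds can be written as a small check. This is only a sketch of the rule as summarized above, not the wording of the OASIS process. It assumes both fractions are measured against the number of eligible voting members, and that “100%” means all 17 eligible members voted yes. The class and method names are made up.

```java
final class BallotSketch {
    // Passes if at least 2/3 of eligible voters said yes and no more than 1/4 said no.
    // Integer arithmetic avoids rounding problems at the boundaries.
    // Assumption: both fractions use the same denominator (eligible voters).
    static boolean passes(int yes, int no, int eligible) {
        if (eligible <= 0 || yes < 0 || no < 0 || yes + no > eligible) {
            throw new IllegalArgumentException("inconsistent vote counts");
        }
        return 3 * yes >= 2 * eligible && 4 * no <= eligible;
    }

    public static void main(String[] args) {
        System.out.println("17 yes, 0 no, 17 eligible: " + passes(17, 0, 17));
        System.out.println("11 yes, 0 no, 17 eligible: " + passes(11, 0, 17));
        System.out.println("12 yes, 5 no, 17 eligible: " + passes(12, 5, 17));
    }
}
```

Under those assumptions a 17-member ballot needs at least 12 yes votes and can tolerate at most 4 no votes, so a unanimous 17 clears both bars comfortably.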

The TC’s work on ODF 1.2 is now done. There are some administrative tasks remaining, and we need to go through the review/approval by the general OASIS membership, but the technical work is now done. We now move on to ODF 1.3, as well as some maintenance-related activities on ODF 1.1.

And speaking of maintenance, we have two ballots related to IS 26300 underway in ISO/IEC JTC1:

A DCOR ballot to approve technical corrigenda for ISO ODF, mainly correction of typographical errors reported by the UK and Japan. This ballot will end April 25th.
An FPDAM ballot to approve an amendment to ISO ODF. The effect of this amendment will be to make ISO ODF equivalent to OASIS ODF 1.1. This ballot will end June 8th.

I’d urge NB members to review these documents carefully and cast a vote in these ballots.

On the ODF-Next side, the discussion that is getting the most attention right now is related to change tracking. The Advanced Document Collaboration subcommittee is now reviewing two proposals, one contributed by DeltaXML and another contributed by Microsoft. We’ll be having a series of meetings in April to discuss these two proposals. Hopefully we’ll reach a consensus, possibly a compromise. If necessary, as a last resort, we’ll vote.