
Apache OpenOffice 2013 Mailing List Review

2013/12/18 By Rob 1 Comment

I did a quick study of the 2013 mailing list traffic for the Apache OpenOffice project. I looked at all project mailing lists, including native language lists. I omitted the purely transactional mailing lists, the ones that merely echo code check-ins and bug reports. Altogether 14 mailing lists were included in this study.

In 2013 the OpenOffice community mailing lists saw 24,423 posts from 2,211 unique posters, in 4,819 threads.

A word cloud of the most frequent words in post titles (thanks to Jonathan Feinberg’s Wordle app) follows. As you can see, the terms used in the Propose/Approve/Code/Test/Release workflow rise to the top. That shows the project’s focus.

I thought it would also be interesting to look at this from a social network perspective, looking at the atomic unit of collaboration on a mailing list: responding to a post. Of course, not all posts involve a response. It is common for someone to post information, not requiring or expecting a response. But there are many responses. As mentioned above, there were 24,423 posts in 4,819 threads, so an average of about 4 responses per thread. We can represent this as a directed graph, with each poster treated as a node, and a directed arc from the node of the original post author to the node of each responder. (This might seem backwards, and you could argue for reversing the arcs, but in general in mailing lists the responder is providing value to the original poster, so the centrality of the responder will be more relevant. Consider, for example, the questions coming from random users, and the experienced project members who answer them.)

Forming a graph in this way gives us a giant component (representing 98.84% of the whole graph) with 1,955 nodes and 7,069 arcs. Average degree (number of collaboration partners for each person) is 3.6. 46 people responded to more than 50 other people. Maximum degree is 714 (Apache OpenOffice V.P. Andrea Pescetti). A visualization of this graph (made with the open source Gephi) follows. You can click on the image for a larger version. Nodes have been scaled to reflect betweenness centrality (a measure of the degree to which a node helps connect others into the graph) and colored via a modularity algorithm which finds sets of nodes that have a high degree of interconnection.
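For anyone who wants to try this on their own list, here is a minimal sketch of the graph construction in Python. The reply pairs and the function names are made up for illustration, and collapsing repeated exchanges between the same two people into a single arc is my assumption here, not necessarily what the study did. It uses only the standard library, so betweenness centrality and modularity (which Gephi computed) are left out.

```python
from collections import defaultdict

# Hypothetical (original_author, responder) pairs. Real input would come
# from parsing the mailing list archives.
replies = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("dave", "bob"),
    ("bob", "carol"),
    ("alice", "bob"),  # repeated exchange, collapsed into one arc (assumption)
]

def build_reply_graph(pairs):
    """Return {author: set of responders}, one arc per distinct pair."""
    arcs = defaultdict(set)
    for author, responder in pairs:
        if author != responder:  # ignore people replying to themselves
            arcs[author].add(responder)
    return arcs

def people_responded_to(arcs):
    """In-degree of each responder: how many distinct people they answered."""
    counts = defaultdict(int)
    for responders in arcs.values():
        for responder in responders:
            counts[responder] += 1
    return dict(counts)

graph = build_reply_graph(replies)
nodes = set(graph) | {r for responders in graph.values() for r in responders}
arc_count = sum(len(responders) for responders in graph.values())
answered = people_responded_to(graph)

print(f"{len(nodes)} nodes, {arc_count} arcs, "
      f"{arc_count / len(nodes):.1f} arcs per node")
print(answered)
```

Dividing arcs by nodes in the same way for the real data gives 7,069 / 1,955 ≈ 3.6, which matches the average degree quoted above.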


What a marvelous, large and complex project we have in Apache OpenOffice!

Drowning in Data

2009/02/09 By Rob 12 Comments

Bob Congdon writes on something we’re all living through: the decline of “hard media” (paper, LPs, even CDs, etc.) and the prevalence of digital media.

From a green perspective, getting rid of all of this hard media is a good thing. Why print out documents when you can read them on your computer? Why should publishers print hundreds of thousands of copies of a newspaper each day to be read once and tossed out? The same with weekly and monthly magazines. Why produce millions of CDs that just end up in landfills?

I agree that the trend is here to stay, but I, personally, am scared to death. I think we’re headed for disaster. The problem is that few of us have an adequate backup regime for all of this data. When disaster hits, and a single disk drive holds all of our downloaded commercial software, our e-books, our electronic documents, our financial data, our music, our photographs, etc., then we’ve lost everything. So what used to require a devastating house fire now will hit unprepared users every time their hard drive fails. We tend to have all our eggs in one basket now, and a single failure now has a greater impact.

Sure, we could back everything up. I used to do that. Floppies, ZIP drives, tapes, CD-ROM, DVD-ROM, external drives, online backup services, I’ve done them all over the years. The problem is that my data needs keep increasing. Back in 1990 my entire data needs, a few dozen WordPerfect files, could fit on a single floppy disk. Today, a single photograph, in RAW format, can take 10x that amount of storage. Add to this music files (at high bit rates), video files (now in High Definition, of course), and so on, and I’m nearing a terabyte of data at home. Forget about backing up to 125 double-sided DVD-Rs. Forget about online backups: the latency would make a backup take a month. We’re not going to change the speed of light so that option will never scale. All I can really do is archive to a portable hard drive, and even then I have only space for the most recent snapshot, not a history of recent backups. This is fine for recovering from a system failure, but I’d be in trouble if I suffered serious data or file corruption and that made it into my backups before I noticed.
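A rough back-of-the-envelope check of those numbers, as a sketch only. The 3 Mbit/s sustained uplink is an assumed figure that is not from the post, and the month-long estimate is easiest to reproduce as a throughput calculation.

```python
TERABYTE = 10**12            # bytes, decimal terabyte
DISCS_IN_POST = 125          # the disc count quoted above
SECONDS_PER_DAY = 86_400

# Capacity per disc implied by "1 TB on 125 discs".
implied_disc_gb = TERABYTE / DISCS_IN_POST / 10**9

def upload_days(total_bytes, uplink_bits_per_sec):
    """Days needed to push total_bytes through a sustained uplink."""
    seconds = total_bytes * 8 / uplink_bits_per_sec
    return seconds / SECONDS_PER_DAY

ASSUMED_UPLINK = 3_000_000   # 3 Mbit/s, hypothetical home connection
days = upload_days(TERABYTE, ASSUMED_UPLINK)

print(f"{implied_disc_gb:.1f} GB per disc, {days:.1f} days to upload")
```

Under that assumed uplink the upload runs to roughly a month, in line with the estimate above.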

So, yes, we use less paper. But my unread ebook folder gets larger and larger. My unlistened-to playlist is longer and longer. My unwatched shows on TiVo continue to accumulate. I have no assurance that I will catch up before data disaster hits. I know I should be feeling green, but instead I’m feeling blue. I could sure use some quantum storage right around now.

Ten Resolutions for 2009

2009/01/03 By Rob 9 Comments

Here are my obligatory New Year’s Resolutions. These are my personal ones, what I’m doing for myself. I’ll also have a set of professional resolutions, what we at IBM call “Personal Business Commitments” or PBCs. I still need to develop those for 2009.

Exercise at least 30 minutes every day. Unlike every other January when I said this, this time I’ll actually do it. Really. Honest.
Eat better: less saturated fat, more fiber, more veggies, more whole grains, etc.
Suffer fools gladly, or at least a bit more gladly than I did in 2008.
Learn how to use R well. I dabble with it today, but I don’t really know how to use its full power.
Learn to program Python well. Today I can write programs to do simple administrative tasks, parsing data, downloading web pages, etc. But Python is here to stay and I feel my life would be simpler if I knew it well.
Spend more time in the garden. With global warming, fuel prices, chaos in the Middle East, etc., it is absurd to eat tomatoes that have been shipped 2,000+ miles to my dinner table. I’m going to try to grow a substantial portion of the fresh vegetables that I eat.
Read Ulysses, maybe twice.
Enjoy turning 40, but not too much.
Study a new language, something cool like Sumerian or Dutch.
Spend more time when taking pictures and less time in Photoshop trying to fix them.

The biggest media launch of all time?

2007/09/27 By Rob 13 Comments

The news from all directions is that Halo 3 had a big day, with “first day” sales of $170 million, which actually includes advance sales as well. Let’s take the report from the XBox.com web site as the canonical version of the tale:

Microsoft today announced that Halo® 3 has officially become the biggest entertainment launch in history, garnering an estimated $170 million in sales in the United States alone in the first 24 hours. The Xbox 360™ title beat previous records set by blockbuster theatrical releases like Spider-Man 3 and novels such as Harry Potter and the Deathly Hallows.

I’m not sure who determines whether this is true or not “officially,” but before the boys at Guinness update their book, let’s examine.

Halo 3 is a video game. Spider-Man 3 is a film. Harry Potter is a book. These have very different sales models, so it is odd to compare them and declare one of them the “biggest entertainment launch in history”. But if you want to compare different media, then by what objective criterion can you exclude television? Certainly, TV is entertainment, right? Although the sales revenue in broadcast television comes from advertisers, not from the viewers, these are booked as sales nonetheless.

So, let’s take the Super Bowl, television’s annual blockbuster. In 2007, estimates are that CBS took in $162.5 million for in-game advertisements, and a further $78.1 million in pre-game and post-game show advertisements. Local network affiliates took in an additional $42.2 million in local spots. This gives a total for Super Bowl XLI advertising sales of $233.8 million. We also need to factor in ticket sales. At $600/ticket (for legitimate tickets; let’s ignore the inflated secondary market) and with Dolphin Stadium having a capacity of 76,600, this comes out to an additional $46 million. So the total of tickets plus advertising for this one-day media event was $279.8 million, or 65% more than Halo 3’s first-day sales. Sorry, Master Chief.
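The ticket and comparison arithmetic can be checked with a few lines of Python. This is just a sketch that takes the stated advertising total of $233.8 million and the $170 million Halo 3 figure as given. The variable names are mine.

```python
AD_SALES_MILLIONS = 233.8     # stated Super Bowl XLI advertising total
TICKET_PRICE = 600            # dollars, face value
STADIUM_CAPACITY = 76_600     # Dolphin Stadium
HALO3_MILLIONS = 170.0        # Halo 3 first-day sales

ticket_sales_millions = TICKET_PRICE * STADIUM_CAPACITY / 1_000_000
super_bowl_total = AD_SALES_MILLIONS + ticket_sales_millions
percent_more = (super_bowl_total / HALO3_MILLIONS - 1) * 100

print(f"Tickets: ${ticket_sales_millions:.2f} million")
print(f"Total:   ${super_bowl_total:.1f} million")
print(f"That is {percent_more:.0f}% more than Halo 3's first day")
```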

So the claim that Halo 3 has “officially become the biggest entertainment launch in history” is unsubstantiated, in my opinion. The sales of Halo 3 are undoubtedly strong, but let’s drop the hype and give the gridiron its due.

The World Ends on May 1st, 2010

2007/02/13 By Rob 5 Comments

Actually, at 6:45AM by my calculations.

According to ZDNet’s Dan Farber, quoting an IBM whitepaper, by 2010, “the world’s information base will be doubling in size every 11 hours.”

Every 11 hours? That’s quite a statement. Let’s see what this means. The largest storage system in the universe is the universe. (Let that sink in for a moment). When I grew up, I was taught that there were approximately 10^79 electrons in the universe. Let’s use them all! 10^79 bits of storage, stored using the spin state of the electrons, in a giant quantum computer.

I have no idea how much data we will have on January 1st, 2010, so let’s assume, for sake of argument, that a virus wipes out all the data in the world on New Year’s Eve, and we start the year with only 1 bit of data, and it doubles every 11 hours. So after 11 hours, we have 2 bits, after 22 hours 4 bits, and then after 33 hours, less than a day and a half, we get our first byte (8 bits). This isn’t too bad, is it?

The equation is: 2^x=10^79. Solve for x, a simple exercise in logarithms, giving the answer 262.43. We can only double that many times before hitting the universal limit and we exhaust all of the storage in the entire universe on May 1st at 6:45AM. Of course, maybe we’ll just Zip it all up and last until dinner time?
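For the skeptical, here is a small Python sketch of that calculation. It assumes the clock starts at midnight on January 1st, 2010 with a single bit, as in the scenario above. The variable names are mine.

```python
import math
from datetime import datetime, timedelta

ELECTRON_EXPONENT = 79        # 10**79 electrons, one bit each
DOUBLING_HOURS = 11           # doubling period from the quote
start = datetime(2010, 1, 1)  # midnight, one bit of data

# Solve 2**x = 10**79 for x: x = 79 * log2(10)
doublings = ELECTRON_EXPONENT * math.log2(10)

# Moment at which the universe's storage is exhausted
full_at = start + timedelta(hours=doublings * DOUBLING_HOURS)

print(f"{doublings:.2f} doublings")
print(full_at.strftime("%B %d, %Y at %I:%M %p"))
```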

I think I’ll call in sick that day.

But seriously, I wonder if this “every 11 hours” figure is a typo? Doubling “every 11 months” would be easier to imagine and would give us until 2250.
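The 2250 figure falls out of the same doubling count with months in place of hours. A quick sketch, again starting from one bit at the beginning of 2010 (the function name is mine):

```python
import math

START_YEAR = 2010
doublings = 79 * math.log2(10)   # doublings needed to reach 10**79 bits

def year_storage_runs_out(months_per_doubling):
    """Fractional calendar year at which 10**79 bits is reached."""
    years = doublings * months_per_doubling / 12
    return START_YEAR + years

print(int(year_storage_runs_out(11)))
```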