
An Antic Disposition


Archives for 2011

Photographing Waterfalls: 1000-fold Exposure Range

2011/02/27 By Rob Leave a Comment

When photographing a waterfall (or other forms of moving, turbulent water), the choice of shutter speed determines whether you get a stop-action view of every droplet in motion, or whether you get a smooth, time-averaged view of the currents.  What is the best shutter speed to use with waterfalls?   I had the opportunity last year to do an experiment, an attempt to answer that question.  Of course, there is no single “best” shutter speed.  It is a matter of taste and creative vision.  But the experiment was useful nonetheless, since it resulted in a visual catalog of how a waterfall looks over a thousand-fold range of shutter speeds.

These pictures were all taken at Niagara Falls, with a Pentax K-7 DSLR, with a Tamron 28-75mm zoom lens at 28mm.   There are 21 shots, taken at 1/2 stop shutter speed increments, from 1/6000th of a second to 1/6th of a second.  In order to ensure the same exposures while varying the shutter speed I made compensatory adjustments to the aperture and the ISO settings.  (With a neutral density filter one could extend this to shutter speeds two stops slower.) Post-processing was limited to white balance and cropping.
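The bookkeeping behind those compensatory adjustments can be sketched in a few lines.  This is my own illustrative helper, not anything the camera provides: it assumes the extra light from a slower shutter is absorbed first by stopping down the aperture (up to an assumed f/22 limit) and any remainder by lowering the ISO.

```python
import math

def compensate(base_shutter, base_fstop, base_iso, new_shutter, max_fstop=22.0):
    """Return an (f-number, ISO) pair giving the same total exposure at
    new_shutter as the base settings.  Shutter speeds are in seconds.
    Aperture is adjusted first; ISO absorbs whatever the aperture cannot."""
    stops = math.log2(new_shutter / base_shutter)   # positive = more light
    # Each stop of compensation multiplies the f-number by sqrt(2).
    ideal_fstop = base_fstop * 2 ** (stops / 2)
    new_fstop = min(ideal_fstop, max_fstop)
    # Stops still uncompensated once the aperture hits its limit.
    leftover = 2 * math.log2(ideal_fstop / new_fstop)
    new_iso = base_iso / 2 ** leftover              # halve ISO per stop
    return round(new_fstop, 1), round(new_iso)

# Hypothetical starting point: 1/6000 s at f/2.8, ISO 400, slowed to 1/60 s.
# The aperture caps out at f/22 and the rest of the compensation goes to ISO.
settings = compensate(1/6000, 2.8, 400, 1/60)
```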

So what did this tell me?  With shutter speed down to around 1/60th of a second, the waterfall looks pretty much the same.  From there down to 1/20th of a second you get gradual softening, and at the slowest shutter speeds you get effects ranging from painterly to otherworldly.  My favorite is 1/10th of a second.

I suspect the important factors here are:

  1. The speed of the water
  2. The distance of the camera from the water
  3. The focal length of the lens

These combine to determine, from the perspective of the camera sensor, the speed at which the water’s image moves, in pixels/second.  If the image is moving at 1 pixel/second, then shutter speeds faster than 1 second will show no blur.  But a shutter speed of 15 seconds will show a blur over 15 pixels.  I bet if I took the original 1/6000 second shot, loaded it in Photoshop and applied a motion blur until it looked as close as possible to the water in the 1/6 second exposure, I could work backwards to determine the speed of the water.
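That back-of-the-envelope reasoning can be turned into a rough formula.  The sketch below uses the thin-lens magnification approximation; the function name and the sample numbers (water at 5 m/s, 100 m away, a ~5 µm pixel pitch, which is roughly K-7 territory) are my own assumptions for illustration, not measurements.

```python
def blur_pixels(water_speed_mps, distance_m, focal_length_mm,
                pixel_pitch_um, shutter_s):
    """Approximate motion blur, in pixels, for a subject moving across the
    frame.  Uses thin-lens magnification m = f / (d - f), so the image of
    the subject moves across the sensor at (subject speed * m)."""
    focal_length_m = focal_length_mm / 1000.0
    pixel_pitch_m = pixel_pitch_um / 1e6
    magnification = focal_length_m / (distance_m - focal_length_m)
    image_speed = water_speed_mps * magnification   # m/s at the sensor
    return image_speed * shutter_s / pixel_pitch_m  # blur length in pixels

# blur_pixels(5, 100, 28, 5, 0.1) -> roughly 28 pixels of blur at 1/10 s
```

Running the same hypothetical numbers at 1/6000 s gives well under a pixel of blur, which matches the observation that the fast shutter speeds all look alike.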

[Table: the 21 photographs, one per shutter speed (1/6000, 1/4000, 1/3000, 1/2000, 1/1500, 1/1000, 1/750, 1/500, 1/350, 1/250, 1/180, 1/125, 1/90, 1/60, 1/45, 1/30, 1/20, 1/15, 1/10, 1/8 and 1/6 second), are not reproduced here.]

Filed Under: Photography

Jeopardy x 10

2011/02/16 By Rob 1 Comment

I’ve been reading a lot of interesting discussions about the amazing Jeopardy match between human champions and IBM’s Watson computer.  Although I’m in no way involved in this project, it is very exciting to watch this event unfold.  Some of the discussions I’ve read concern the impact of reaction time, who can get to the buzzer fastest.  I recommend this analysis of how the buzzers in Jeopardy work, which explains what Watson does and what the best human players do.

But I’d ask you to try this thought experiment to see how reaction time really doesn’t matter in the larger sense.

First, let’s imagine a trivial version of Jeopardy, where the categories are “Multiplication tables up to 12×12” and similar.  Questions so easy (for average adults) that winning or losing is pretty much 100% determined by skill and timing with the buzzer.  This would not be a very interesting computing challenge.

But then let’s imagine playing something like Jeopardy, but where the questions are 10x more difficult.  So the questions would be more obscure, require more calculation and greater recall, and in general demand more significant thinking.  In this case, no one is jumping to the buzzer, because everyone is digging deep for the answer.   Essentially, reaction time is taken out of the equation.

Some examples:

  • “The first African contestants in the modern Olympic Games represented this African republic”
  • “According to the Book of Genesis, he was the father of Mathusael”
  • “This George I’s 2nd wife was Euphemia of Kuyavia”
  • “In Shakespeare’s A Midsummer Night’s Dream, he was the Master of Revels”

Get the idea?  Ratchet up the difficulty level.

Who wins now?

I think in that case, Watson would be the clear winner.  This plays to the machine’s strengths, based on its ability to process huge amounts of data, far more and far faster than any human.

The underlying technology of Watson is not best used to give instant answers to liberal arts questions that would be covered in any freshman survey course on European history or classical music.    Those topics are used only because Jeopardy is targeted at mainstream television and needs questions of limited difficulty, sufficient to make it an exciting game for humans to play and watch.   Sure, Jeopardy is more difficult than other U.S. game shows (but not as difficult as some British quiz shows), but it is still “human-scaled”.  In essence, questions on Jeopardy are “dumbed-down” to match human capabilities and the format of the show. That is why I’m so excited about the future: not entertainment uses of this technology, but uses that attack and solve much harder problems, problems with greater impact in fields like medical diagnosis, law enforcement, etc.

The real excitement is not what we can do on a TV game show, but how we can scale up this technology to change the world.

BTW, if you are wondering, here are the answers to the above questions:

  • What is the Orange Free State?
  • Who was Mehujael?
  • Who was George I, King of Galicia-Volhynia?
  • Who is Philostrate?

Filed Under: IBM

The Versions of ODF

2011/02/10 By Rob 8 Comments

It has been a few months now since the OASIS ODF TC last did substantive technical work on ODF 1.2.  We had a 60-day public review last summer, a 15-day public review last December, and another (hopefully final) 15-day public review starts this week.   Every time we make a change to the specification in response to public comments we are required to have another 15-day review of the changes.  This is all necessary procedural work, to make sure all stakeholders have the opportunity to comment.  But it is not very exciting.

However, as the ODF 1.2 specification goes through the remainder of its review/approval process in OASIS, we’ve increasingly turned our attention to ODF-Next.   Tentatively (and we should have a TC vote on this work plan in the next few weeks), we’re looking at a two-year schedule for ODF 1.3, with four intermediate drafts (Committee Specification Drafts or CSDs).  The first CSD would appear in September, 2011.  We have not yet defined what features will be in ODF 1.3.  So this is a great time to join the ODF TC, to “get in on the ground floor” for defining the next release.

While we await approval of ODF 1.2 and start work on ODF 1.3, we continue to maintain ODF 1.0 and ODF 1.1, the previous versions of ODF.  And by “maintain” I mean we receive and track defect reports and publish corrections to the specification.  So effectively, the OASIS ODF TC is working on four versions of ODF.

Since the progression from ODF 1.0 –> ODF 1.1 –> ODF 1.2 –> ODF 1.3 is designed to be compatible, the average user will not notice a difference.  Your ODF 1.0 documents should load just fine in your ODF 1.2 or ODF 1.3 editor.  We try very hard not to introduce “breaking changes” that would cause trouble with older documents.   Of course, the application vendor has a responsibility here as well, to pay attention to version compatibility issues.  But from the perspective of the standard I do not believe that we’ve done anything that would prevent an editor from being (at the same time) a conforming ODF 1.0, ODF 1.1 and ODF 1.2 application.  In fact, I’d expect most ODF editors today to be able to read any version of ODF, though they might only save in the most-current version, or maybe the two most-recent versions.

An additional complexity is that we have ODF standards in OASIS and ISO.  I’ve heard that some are confused by this, especially how these different versions correspond.  I hope I can make this clearer.

First, flash back to the 1990’s.   After decades of success with standardizing nuts and bolts and shipping containers and the various aspects of the physical world relevant to international trade, ISO was at a crossroads.  There wasn’t much more left for them to standardize in that physical world.  They were seeing success with management standards, which would soon become a major part of their work, e.g., ISO 9001, quality management.  But ISO was not doing that well with technology standards.  Their OSI reference network model was a flop.  C++ was laboring on, six years in committee.  And then competition emerged from new, more agile standards consortia, like the IETF and W3C.  They were rocking the industry with highly relevant specifications that essentially created the web.  Almost every core technology of the internet, including TCP/IP, HTTP, HTML, XML, JavaScript, SMTP, MIME, POP3, IMAP, etc., was developed outside of the ISO system.

You can be quite sure that this new competition did not escape notice in Geneva.  As they say, “If you can’t beat them, join them”.  Or in this case, get them to join you.  One of the ways in which ISO/IEC JTC1 (the ISO committee that controls tech standards) responded was to introduce the Publicly Available Specification (PAS) transposition process.  The idea here was to allow recognized standards consortia (and there is a formal ISO process to gain such recognition) to submit already-approved market relevant standards to ISO/IEC JTC1 for accelerated processing and approval as an International Standard.  Essentially, such PAS submissions skip over the ISO Working Group and Subcommittee stages of work, and advance directly to a final approval ballot.   This is a win-win situation.  ISO has more relevant standards in its catalog, and consortia can continue to produce their work at a more nimble pace.

So when we look at the versions of ODF, we have more than just ODF 1.0, 1.1, 1.2 and 1.3.  For each of these we have an OASIS and an ISO version.  And for each numbered version we have published corrections, and these are reflected both in the OASIS and the ISO catalogs.  It sounds messy at first, but the important thing to note is that OASIS and ISO/IEC JTC1 have agreed to keep their corresponding versions of ODF “technically equivalent”.  This was agreed to in a Memorandum of Understanding.   This means that you should be able to use the OASIS or the ISO version according to your needs and have confidence that they are compatible.  If you require an ISO version, then you can use that.  If you want the very latest version, then use the OASIS version, since the ISO version typically lags by a year or more.

I hope the above diagram clarifies which versions of ODF are technically equivalent.  Note that this is not a time line.  The actual order that the various versions were published in is more complicated, since corrections to older versions of ODF can (and do) come after publication of newer editions.  But this diagram shows the correspondence of “technically equivalent”  OASIS and ISO versions of ODF.  The big rounded blocks are published standards, the indented smaller ovals are published corrections (“Errata” in OASIS and “Corrigenda” in ISO), and the indented rectangle on the ISO side is an amendment.

In particular, note:

  • OASIS ODF 1.0 corresponds to ISO/IEC 26300:2006
  • OASIS has published two Errata documents for ODF 1.0, and both have corresponding Corrigenda in ISO, the first one already approved, the second one currently under ballot.
  • OASIS ODF 1.1 + Errata 01  corresponds to ISO/IEC 26300:2006 + Corr.1 + Corr.2 + Amd. 1.  This is a more complicated case, since we’re rolling up several corrigenda as well as the changes from OASIS ODF 1.1.  But the net result is that after Amd. 1 is approved (and the ISO ballot is now underway) we will have an ISO version of ODF 1.1.
  • The plan is to submit OASIS ODF 1.2 to ISO/IEC JTC1 under PAS transposition rules.  I expect that we will receive defect reports on ODF 1.2, and these would be addressed as Errata in OASIS and Corrigenda in ISO, to maintain technical equivalence.
  • Ditto for ODF 1.3.  Once approved by OASIS, we submit for PAS transposition and maintain to preserve technical equivalence.

So this isn’t really all that complicated.  We have a series of compatible ODF versions over several years.  The technical work is done in OASIS, in a technical committee.  Once approved by the OASIS membership the OASIS version of ODF is submitted under PAS rules to JTC1.  Once approved by ISO, the OASIS ODF committee and the ISO ODF committee (called ISO/IEC JTC1 SC34/WG6) meet regularly to ensure that the two versions remain aligned, with specific attention to ensuring that we’re both looking at the same set of defect reports and keeping corrections in sync.

Filed Under: ODF

Will Microsoft Remove DOC Format Support?

2011/01/11 By Rob 17 Comments

I noticed a curious argument in Jonathan Corbet’s LWN article “Supporting OOXML in LibreOffice” (behind a pay wall).  Why should we support OOXML?

…as has been pointed out in the discussion, Microsoft will, someday, phase out support for its (equally proprietary) DOC format, leaving OOXML as the only real option for document interchange. There appears to be little hope that Microsoft’s ODF support will be sufficient to make ODF a viable alternative. So any office productivity suite which aspires to millions of users, and which does not support OOXML, will find itself scrambling to add that support when DOC is no longer an option. It seems better to maintain (and improve) that support now than to be rushing to merge a substandard implementation in the future.

Really?  The same company that is unable to fix a leap-year calculation bug from 20 years ago because of fears it might break backwards compatibility is going to remove support for their binary formats?  Seriously, is that what people are saying?  This sounds like something Microsoft would say to scare people into migrating.

But don’t listen to my opinions.  Let’s look at the numbers.  I’ve been tracking document counts via Google for almost four years now, looking at the relative distribution of document types, across OOXML, ODF, Legacy Binary, PDF, XPS, etc.  Because the size of the web is growing, one cannot fairly compare the absolute numbers of documents from week to week.  But the distribution of documents over time is something worth noting.

The following chart shows the percentage of documents on the web that are in OOXML format, as a percentage of all MS Office documents.  Note carefully the scale of the chart.  It is peaking at less than 3%.  So 97+% of the Microsoft Office documents on the web today are in the legacy binary formats, even four years after Office 2007 was released.

Of course, for any given organization these numbers may vary.  Some are 100% on the XML formats.  Some are 0% on them.   If you look at just “gov” internet domains, the percentage today is only 0.7%.  If you look at only “edu” domains, the number is 4.5%.  No doubt, within organizations, non-public work documents might have a different distribution.  But clearly the large number of existing legacy binary documents on government web sites alone is sufficient to prove my point.  DOC is not going away.

I call “FUD” on this one.

Filed Under: FUD, OOXML

Twitter 2010 by the Numbers

2011/01/02 By Rob 1 Comment

Throughout 2010, I recorded Twitter messages (tweets) from Twitter’s “public timeline”.  I took these snapshots every two minutes, around the clock.  At the end of the year I now have 4,973,728 tweets.
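As a sanity check on that total, a snapshot every two minutes with 20 tweets per snapshot implies an expected sample size we can compute directly.  The shortfall explanation in the comment is my own speculation, not something confirmed in the data.

```python
# Expected sample size for the collection scheme described above:
# 20 tweets per snapshot, one snapshot every 2 minutes, all of 2010.
snapshots_per_year = 365 * 24 * 60 // 2        # 262,800 snapshots
expected_tweets = snapshots_per_year * 20      # 5,256,000 tweets

# The 4,973,728 actually collected is about 95% of this figure, which is
# plausibly explained by occasional missed snapshots or duplicate removal.
coverage = 4_973_728 / expected_tweets
```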

The first message, at 12:00:08 EST January 1st, 2010, was: “2009 will be remembered at [sic] the year the music died.  Michael left us a powerful mantra for 2010”.

And the ominous last message, at 11:58:52 EST, December 31st, was “2011 chegou… 2012 o mondo acaba” (“2011 has arrived… in 2012 the world ends”).

I collected the data without any advance thought on what I could do with it.  The general thought was that I’d look for something interesting.  Now that 2010 is over, I’ve started some analysis.  I’d like to share what I’ve found so far.  More will follow.

It is important to have a proper sense of caution when using this data.  How random is it really?  I don’t know how Twitter produces the “public timeline”.  Certainly it excludes those whose tweets are private.  Also, the time-sampled approach, of taking a snapshot of 20 tweets every two minutes, leaves open the possibility of missing short-term phenomena.  For example, if a million users all tweet something on a particular topic at exactly noon on January 4th, and I don’t take a snapshot until 12:04, then I will miss that topic entirely.  At the very least it will be under-represented.  But so long as we’re willing to acknowledge that we might be missing interesting behavior that occurs at shorter time scales, we can fairly make some more general observations.

First up I wanted to look at the distribution of tweet lengths.  Are people keeping it short?  Or are they running into the 140-character limit?  The answer, as seen in the following chart, is a clear “Yes” on both counts.    There are two clear peaks, one in the 20-25 character range, and another pushing at the limit, in the 139-140 character range.

So what is going on here?  One explanation could be that the shorter tweets are coming from mobile clients, while longer tweets are coming from web and desktop twitter clients. At first I suspected that the longer tweets might indicate users bumping into the 140 character limit, and this might show itself as truncated content, greater use of abbreviations and general frustration.  But when I took a closer look I saw that most of the maximal length tweets were machine-generated, intentionally targeted to that length,  like:  “New story posted ‘This title is truncated so total tweet is 140 characters long…’ http://bit.ly/foobar”.
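Tabulating that length distribution is a one-liner with a counter.  This is my own minimal sketch of the tabulation, not the actual script used for the chart.

```python
from collections import Counter

def length_distribution(tweets):
    """Count tweets by character length, so the two modes described
    above (short conversational tweets vs. machine-generated tweets
    padded out to exactly 140 characters) show up directly."""
    return Counter(len(t) for t in tweets)

# Illustrative usage with made-up tweets:
dist = length_distribution(["short one", "x" * 140, "y" * 140])
# dist[140] == 2 and dist[9] == 1
```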

Next, I took a look at hash tags, and tabulated those by frequency.  A word cloud (made in Wordle) of the top 100 tags is:

And here are the top 20 hash tags as a list:

  1. #nowplaying
  2. #ff
  3. #jobs
  4. #np
  5. #fb
  6. #tinychat
  7. #teamfollowback
  8. #followmejp
  9. #news
  10. #fail
  11. #shoutout
  12. #tcot
  13. #worldcup
  14. #sougofollow
  15. #follow
  16. #nicovideo
  17. #job
  18. #tweetmyjobs
  19. #iphone
  20. #quote

Although we might wonder what Twitter’s revenue model is in the long term, it seems that other companies like Facebook and Google are using Twitter to their own advantage.  There were very few hash tags at the top that were actually topical, like #worldcup.  This is clearly at odds with what Twitter reported on their “Top Twitter Trends in 2010” page.  They list #rememberwhen in first place.  It is 925th on my list.  It is odd that Twitter’s official list contains no non-English hash tags as well.  And no competing social sites, like Facebook.  Obviously they are using a different methodology in putting together their numbers.

Next I looked at the top Twitter accounts targeted in a tweet:

Justin Bieber is the clear winner here.  The top 20 were:

  1. @justinbieber
  2. @addthis
  3. @youtube
  4. @foursquare
  5. @nickjonas
  6. @joejonas
  7. @ihatequotes
  8. @soalcinta
  9. @luansantanaevc
  10. @ladygaga
  11. @detikcom
  12. @ddlovato
  13. @revrunwisdom
  14. @soalbowbow
  15. @zodiacfacts
  16. @thelovestories
  17. @adriesubono
  18. @dff_clickbokin
  19. @eduardosurita
  20. @nickiminaj

Next I looked at the content of the tweets, to find what 5-word strings were the most common.  The top 20 phrases are:

  1. just joined a video chat
  2. i favorited a youtube video
  3. i liked a youtube video
  4. i uploaded a youtube video
  5. check this video out —
  6. photos on facebook in the
  7. you need to check out
  8. add a #twibbon to your
  9. i just became the mayor
  10. if you want more followers
  11. way of getting 100 free
  12. for a chance to win
  13. get 100 free more twitter
  14. in a live video chat
  15. just snapped a new picture
  16. click the link to join
  17. a live video chat with
  18. want more followers check out
  19. you should check out this
  20. should check out this site

Rather depressing, isn’t it? Spam, spam, spam, spam…

Last up, something I haven’t seen before: a look at the most-common numbers that occurred in the text of tweets, truly “Twitter by the Numbers”:

The top numbers were:

  • 2 (likely in first place because it is also used as a short form of “two”, “too” and “to”)
  • 1
  • 4 (also used as short form of “for”)
  • 3
  • 5
  • 2010
  • 10
  • 7
  • 6
  • 100
  • 8
  • 20
  • 9
  • 12
  • 30
  • 15
  • 11
  • 50
  • 0
  • 24

(I also have a list of the top URLs included in tweets.  But they are almost entirely via URL shorteners (bit.ly, etc.) where the link no longer works, taken down by the service, no doubt for being used in spam.)

If anyone has any further ideas for analysis, let me know.  One thing I’d like to do is chart the percentage of tweets that come from various Twitter clients over time, to track market share statistics.  If I can find a good language classifier (one that can work even on very short text fragments) I can do a breakdown of those.

Filed Under: Blogging/Social


Copyright © 2006-2026 Rob Weir · Site Policies