We’ve all seen tag clouds by now, the visualization technique that shows the importance (however defined, but typically by prevalence) of a word by assigning a proportionately sized font.
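As a rough illustration of that sizing idea, here is a minimal sketch that counts words and scales font sizes linearly between a minimum and a maximum. The function name `font_sizes`, the point-size range, and the linear scaling are all assumptions made for illustration; real tag cloud tools may weight words quite differently.

```python
import re
from collections import Counter


def font_sizes(text, min_pt=10, max_pt=48, top_n=5):
    """Map the top_n most frequent words to font sizes by linear scaling.

    Hypothetical helper: the size range and the linear mapping are
    illustrative assumptions, not how any particular tool works.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    # Avoid dividing by zero when every word is equally frequent.
    span = (highest - lowest) or 1
    return {
        word: min_pt + (count - lowest) * (max_pt - min_pt) / span
        for word, count in counts
    }


sizes = font_sizes("whale whale whale sea sea ship")
for word, size in sizes.items():
    print(f"{word}: {size:.0f}pt")
```

The most frequent word gets the largest size and the least frequent of the selected words gets the smallest, with everything else interpolated in between.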
But now along comes a tool that treats these clouds as art. Wordle’s “Beautiful Word Clouds” is quite addictive, allowing you to enter the raw text and then play around with layout algorithms, fonts and coloring schemes to produce some very nice-looking clouds. The author, Jonathan Feinberg, works here at IBM, a fact I did not discover until I had already wasted hours playing with the tool. So maybe I can count this as work now?
Here are a few examples of word clouds formed by analyzing three different texts. Can you guess the identity of the three texts?
Some of my wish-list items are:
- Apply a stemming algorithm to conflate words with the same root. So in the last example, “standard” and “standards” are counted separately, when they are probably best counted as the same word.
- Auto generate an image map associated with the cloud
- Export to PNG (even if just written temporarily to server, I can download it from there)
- I’d love to read a paper on how the layout algorithms work
- What would happen if you combined Kohonen self-organizing maps with word clouds? Arrange the words so their proximity in the cloud was correlated with co-occurrence in the text.
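To make the last wish-list item a little more concrete, here is a minimal sketch of the co-occurrence counts such a layout might start from. It only counts how often two words share a sentence; it is not a self-organizing map, and nothing here reflects how Wordle works. The function name `cooccurrence` and the choice of the sentence as the co-occurrence window are assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text):
    """Count how often each unordered pair of words shares a sentence.

    Hypothetical helper: using the sentence as the co-occurrence window
    is an illustrative assumption; a sliding word window is another option.
    """
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Sort the distinct words so each pair has one canonical order.
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        pairs.update(combinations(words, 2))
    return pairs


sample = "Open standards matter. Open formats matter. Standards evolve."
for (first, second), count in cooccurrence(sample).most_common(3):
    print(first, second, count)
```

A layout step could then try to place pairs with high counts near each other, which is where something like a Kohonen map would come in.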
The word “open” really should be larger in that last cloud.
The first cloud is obviously from “Moby Dick”; the second one appears to be from a collection of Shakespeare’s sonnets. The third appears to be from an OOXML-vs-ODF rant of some kind.
I see Moby Dick and the third is probably this blog. But what’s the 2nd one? I want to say the Bible, but it’s missing some obvious words that seem to rule that out.
Number 2 looks like Shakespeare sonnets, or possibly John Donne to me.
Number 2: William Shakespeare, Romeo and Juliet?
If I had to guess on the second one, I would say Song of Solomon.
The first is *obviously* Moby Dick.
The second, I think, is Shakespeare’s sonnets.
And I agree the third is quite possibly Rob’s blog.
Is the second one Shakespeare?
The 2nd one looks like Shakespeare to me. Since there are no visible character names, I’m guessing that’s the collected sonnets.
Welcome back, Rob. We missed you.
Shakespeare, isn’t it?
The answer is:
1) Moby Dick
2) Shakespeare’s Sonnets (here I increased the words shown to 1000, resulting in the denser cloud)
3) This blog
I see trouble with a stemming algorithm – would it think “XML” and “OOXML” were the same word? Scary thought… :)
Stemming algorithms usually have enough language smarts to avoid things like that. They are looking for grammatical endings like -ing, -ly, etc., and conflating words with these suffixes with their roots.
For quick and dirty processing, I’ve always used the Porter Stemmer, which I see now is online.
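A minimal sketch of that suffix-stripping idea is below. It is a deliberately naive stand-in, not the Porter Stemmer: the suffix list, the `naive_stem` and `stemmed_counts` names, and the minimum stem length are all assumptions, and it will mangle plenty of real words. It does show why “XML” and “OOXML” stay separate, though: only endings are touched, never prefixes.

```python
import re
from collections import Counter

# Illustrative suffix list only; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ly", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip one known suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stemmed_counts(text):
    """Count words after conflating them with their naive stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(word) for word in words)


counts = stemmed_counts("standard standards XML OOXML")
print(counts)
```

Here “standard” and “standards” end up under one entry, while “xml” and “ooxml” are still counted separately.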
These Wordle summaries are cool. I like the way they summarize the concepts in a large document and provide a high-level overview that’s often pretty accurate.
For example, if you look at the words in the “Moby Dick” image and that sounds interesting to you at first glance, there’s a good chance you’ll find the book interesting on some level. If not, you probably won’t.
It’s not as advanced as Wordle, but you might find http://wispy.me fun to play with as well.