
Beautiful Word Clouds

We’ve all seen tag clouds by now, the visualization technique that shows the importance (however defined, but typically by prevalence) of a word by assigning a proportionately sized font.
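To make the idea concrete, here is a minimal sketch of prevalence-based sizing. It assumes importance is plain word count and that font size scales linearly between a minimum and a maximum; the function name `font_sizes` and the point sizes are made up for illustration and are not how Wordle or any particular tool does it.

```python
import re
from collections import Counter

def font_sizes(text, min_pt=10, max_pt=48):
    """Map each word to a font size that scales linearly with its count.

    min_pt and max_pt are hypothetical defaults chosen for illustration.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_pt + (count - lo) * (max_pt - min_pt) / span
        for word, count in counts.items()
    }

sizes = font_sizes("the whale and the sea and the ship")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.0f}pt")
```

The most frequent word gets the largest size and the rarest gets the smallest. Real tools often use a logarithmic or square-root scale instead, so that a few very common words do not dwarf everything else.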

But now along comes a tool that treats these clouds as art. Wordle’s “Beautiful Word Clouds” is quite addictive: you enter raw text and then play around with layout algorithms, fonts and coloring schemes to produce some very nice-looking clouds. The author, Jonathan Feinberg, works here at IBM, a fact I did not discover until I had already wasted hours playing with the tool. So maybe I can count this as work now?

Here are a few examples of word clouds formed by analyzing three different texts. Can you guess the identity of the three texts?

Some of my wish-list items are:

  • Apply a stemming algorithm to conflate words with the same root. So in the last example, “standard” and “standards” are counted separately, when they are probably best counted as the same word.
  • Auto-generate an image map associated with the cloud
  • Export to PNG (even if just written temporarily to server, I can download it from there)
  • I’d love to read a paper on how the layout algorithms work
  • What would happen if you combined Kohonen self-organizing maps with word clouds? You could arrange the words so that their proximity in the cloud correlates with their co-occurrence in the text.
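
The stemming item on the list above can be sketched roughly as follows. This is a deliberately naive suffix stripper, not the Porter algorithm or anything Wordle uses; the suffix list and the names `naive_stem` and `stemmed_counts` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list; a real stemmer has many more rules.
SUFFIXES = ("ing", "ly", "s")

def naive_stem(word):
    """Strip one known suffix, keeping at least three letters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_counts(text):
    """Count words after conflating each one with its naive stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(word) for word in words)

counts = stemmed_counts("Open standards need a standard. XML is not OOXML.")
print(counts["standard"], counts["xml"], counts["ooxml"])
```

Because only endings are stripped, “standard” and “standards” are counted together while “XML” and “OOXML” stay separate. A rule set this small will also mangle words such as “class”, which is why a proper stemmer is the better choice for real text.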
{ 16 comments }
  • Matthew Raymond 2008/06/26, 8:11 am

    The word “open” really should be larger in that last cloud.

  • trader.name 2008/06/26, 5:29 pm

    The first cloud is obviously from “Moby Dick”; the second one appears to be from a collection of Shakespeare’s sonnets. The third appears to be from an OOXML-vs-ODF rant of some kind.

  • Anonymous 2008/06/26, 8:54 pm

    I see Moby Dick and the third is probably this blog. But what’s the 2nd one? I want to say the Bible, but it’s missing some obvious words that seem to rule that out.

  • Peter 2008/06/26, 11:21 pm

    Number 2 looks like Shakespeare sonnets, or possibly John Donne to me.

  • Anonymous 2008/06/26, 11:50 pm

    Number 2: William Shakespeare, Romeo and Juliet?

  • Anonymous 2008/06/27, 2:23 am

    If I had to guess on the second one, I would say Song of Solomon.

  • Anonymous 2008/06/27, 3:33 am

    The first is *obviously* Moby Dick.

    The second, I think, is Shakespeare’s sonnets.

    And I agree the third is quite possibly Rob’s blog.

  • Anonymous 2008/06/27, 3:44 am

    Is the second one Shakespeare?

  • Nate 2008/06/27, 5:34 am

    The 2nd one looks like Shakespeare to me. Since there are no visible character names, I’m guessing that’s the collected sonnets.

    Welcome back, Rob. We missed you.

  • Konrad 2008/06/27, 6:36 am

    Shakespeare, isn’t it?

  • Rob 2008/06/27, 8:48 am

    The answer is:

    1) Moby Dick

    2) Shakespeare’s Sonnets (here I increased the words shown to 1000, resulting in the denser cloud)

    3) This blog

  • Anonymous 2008/06/27, 6:00 pm

    I see trouble with a stemming algorithm – would it think “XML” and “OOXML” were the same word? Scary thought… :)

  • Rob 2008/06/27, 8:25 pm

    Stemming algorithms usually have enough language smarts to avoid things like that. They look for grammatical endings like -ing, -ly, etc., and conflate words that carry these suffixes with their roots.

    For quick and dirty processing, I’ve always used the Porter Stemmer, which I see now is online.

  • Doug Mahugh 2008/07/02, 12:55 am

    These Wordle summaries are cool. I like the way they summarize the concepts in a large document and provide a high-level overview that’s often pretty accurate.

    For example, if you look at the words in the “Moby Dick” image and they sound interesting to you at first glance, there’s a good chance you’ll find the book interesting on some level. If not, you probably won’t.

  • Sean O'Donnell 2011/06/15, 5:42 am

    It’s not as advanced as Wordle, but you might find http://wispy.me fun to play with as well.
