Throughout 2010, I recorded Twitter messages (tweets) from Twitter’s “public timeline”. I took these snapshots every two minutes, around the clock. At the end of the year I now have 4,973,728 tweets.
The first message, at 12:00:08 EST January 1st, 2010, was: “2009 will be remembered at [sic] the year the music died. Michael left us a powerful mantra for 2010”.
And the ominous last message was “2011 chegou… 2012 o mondo acaba” at 11:58:52 EST, December 31st. (“2011 has arrived… in 2012 the world ends.”)
I collected the data without any advance thought on what I could do with it. The general idea was that I’d look for something interesting. Now that 2010 is over, I’ve started some analysis. I’d like to share what I’ve found so far. More will follow.
It is important to have a proper sense of caution when using this data. How random is it really? I don’t know how Twitter produces the “public timeline”. Certainly it excludes those whose tweets are private. Also, the time-sampled approach, of taking a snapshot of 20 tweets every two minutes, leaves open the possibility of missing short-term phenomena. For example, if a million users all tweet something on a particular topic at exactly noon on January 4th, and I don’t take a snapshot until 12:04, then I might miss that topic entirely. At the very least it will be under-represented. But so long as we’re willing to acknowledge that we might be missing interesting behavior that occurs at shorter time scales, we can fairly make some more general observations.
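A rough sanity check on coverage: the sketch below computes the most tweets this sampling scheme could have collected in 2010. It assumes every snapshot returned exactly 20 new tweets with no duplicates and no missed snapshots, which is an idealization rather than something I have verified.

```python
# Upper bound on tweets collected in 2010, ASSUMING each snapshot
# returns exactly 20 new tweets and no snapshot is ever missed.
TWEETS_PER_SNAPSHOT = 20
SNAPSHOT_INTERVAL_MIN = 2
DAYS_IN_2010 = 365

snapshots = DAYS_IN_2010 * 24 * 60 // SNAPSHOT_INTERVAL_MIN
max_tweets = snapshots * TWEETS_PER_SNAPSHOT
collected = 4_973_728  # the total reported above

print(snapshots)                        # 262800
print(max_tweets)                       # 5256000
print(f"{collected / max_tweets:.1%}")  # 94.6%
```

Under those assumptions the collected total is close to the ceiling, so gaps in the recording itself look small compared with the sampling limits described above.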
First up I wanted to look at the distribution of tweet lengths. Are people keeping it short? Or are they running into the 140-character limit? The answer, as seen in the following chart, is a clear “Yes” on both counts. There are two clear peaks, one in the 20-25 character range, and another pushing at the limit, at 139-140 characters.
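A length distribution like this can be tabulated in a few lines. This is a minimal sketch, not the exact script behind the chart: the function name and the 5-character bin width are illustrative choices, and `len()` counts Unicode code points, which may not match how Twitter counts characters.

```python
from collections import Counter

def length_histogram(tweets, bin_width=5):
    """Count tweets per length bin; each key is the lower edge of its bin."""
    counts = Counter()
    for text in tweets:
        counts[(len(text) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Tiny made-up sample, for illustration only.
sample = ["good morning everyone", "x" * 140, "y" * 139, "lol"]
print(length_histogram(sample))
```

With the real data, the two peaks would show up as large counts in the bins starting at 20 and at 135-140.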
So what is going on here? One explanation could be that the shorter tweets are coming from mobile clients, while longer tweets are coming from web and desktop Twitter clients. At first I suspected that the longer tweets might indicate users bumping into the 140-character limit, and this might show itself as truncated content, greater use of abbreviations and general frustration. But when I took a closer look I saw that most of the maximal-length tweets were machine-generated, intentionally targeted to that length, like: “New story posted ‘This title is truncated so total tweet is 140 characters long…’ http://bit.ly/foobar”.
Next, I took a look at hash tags, and tabulated those by frequency. A word cloud (made in Wordle) of the top 100 tags is:
And here are the top 20 hash tags as a list:
Although we might wonder what Twitter’s revenue model is in the long term, it seems that other companies like Facebook and Google are using Twitter to their own advantage. There were very few hash tags at the top that were actually topical, like #worldcup. This is clearly at odds with what Twitter reported on their “Top Twitter Trends in 2010” page. They list #rememberwhen in first place. It is 925th on my list. It is also odd that Twitter’s official list contains no non-English hash tags, and no competing social sites like Facebook. Obviously they are using a different methodology in putting together their numbers.
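For reference, a frequency table of hash tags can be built roughly as follows. This is a sketch under assumptions: the helper name `top_tagged` is made up, tags are folded to lowercase, and a tag is taken to be `#` followed by word characters, so a `#fragment` inside a URL would be counted too.

```python
import re
from collections import Counter

def top_tagged(tweets, prefix="#", n=20):
    """Return the n most common tokens that start with the given prefix."""
    # \w is Unicode-aware for str patterns, so non-English tags match too.
    pattern = re.compile(re.escape(prefix) + r"(\w+)")
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in pattern.findall(text))
    return counts.most_common(n)

# Made-up tweets, for illustration only.
sample = ["#FF to all", "go #ff #nowplaying", "no tags here"]
print(top_tagged(sample, n=2))  # [('ff', 2), ('nowplaying', 1)]
```

Passing `prefix="@"` counts mentioned accounts in the same way.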
Next I looked at the top Twitter accounts targeted in a tweet:
Next I looked at the content of the tweets, to find what 5-word strings were the most common. The top 20 phrases are:
- just joined a video chat
- i favorited a youtube video
- i liked a youtube video
- i uploaded a youtube video
- check this video out —
- photos on facebook in the
- you need to check out
- add a #twibbon to your
- i just became the mayor
- if you want more followers
- way of getting 100 free
- for a chance to win
- get 100 free more twitter
- in a live video chat
- just snapped a new picture
- click the link to join
- a live video chat with
- want more followers check out
- you should check out this
- should check out this site
Rather depressing, isn’t it? Spam, spam, spam, spam…
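Counting 5-word phrases comes down to sliding a window over each tweet’s words. The sketch below assumes lowercasing and splitting on whitespace only, which appears consistent with the list above (a dash and “#twibbon” each count as a word), but the exact tokenization is an assumption and `top_ngrams` is a hypothetical name.

```python
from collections import Counter

def top_ngrams(tweets, n=5, k=20):
    """Return the k most common n-word phrases across all tweets."""
    counts = Counter()
    for text in tweets:
        words = text.lower().split()
        # Slide a window of n words across the tweet.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)

# Made-up tweets, for illustration only.
sample = ["I liked a YouTube video today", "i liked a youtube video"]
print(top_ngrams(sample, k=1))  # [('i liked a youtube video', 2)]
```

Because the windows overlap, one popular template can occupy several slots in the ranking, as with “you should check out this” and “should check out this site”.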
Last up, something I haven’t seen before, a look at the most common numbers that occurred in the text of tweets, truly “Twitter by the Numbers”:
- 2 (likely in first place because it is also used as short form of “two”, “too” and “to”)
- 4 (also used as short form of “for”)
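A tally like this can be produced with a regular expression. In this sketch a “number” is assumed to be a run of digits standing on its own, so “2” in “2 good” counts but the digits in “gr8” or “b4” do not. A decimal such as “3.5” would be split into two numbers, and digits in URLs may be picked up.

```python
import re
from collections import Counter

# A run of digits with a word boundary on each side.
NUMBER_RE = re.compile(r"\b\d+\b")

def top_numbers(tweets, n=20):
    """Return the n most common standalone numbers in the tweet text."""
    counts = Counter()
    for text in tweets:
        counts.update(NUMBER_RE.findall(text))
    return counts.most_common(n)

# Made-up tweets, for illustration only.
sample = ["2 good 2 be true", "this is 4 you", "happy 2010", "gr8 b4"]
print(top_numbers(sample, n=2))  # [('2', 2), ('4', 1)]
```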
(I also have a list of the top URLs included in tweets. But they are almost entirely via URL shorteners (bit.ly, etc.) where the link no longer works, taken down by the service no doubt for being used in spam.)
If anyone has any further ideas for analysis, let me know. One thing I’d like to do is chart the percentage of tweets that come from various Twitter clients over time, to track market share statistics. And if I can find a good language classifier (one that can work even on very short text fragments) I can do a breakdown of tweets by language.
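The client market share idea could start from something like the sketch below. It assumes each tweet has already been reduced to a (month, client) pair. Both the record layout and the function name are hypothetical, and how the client name gets extracted from the recorded data is left open.

```python
from collections import Counter, defaultdict

def client_share_by_month(records):
    """records: iterable of (month, client) pairs, e.g. ("2010-01", "web").

    Returns {month: {client: percentage of that month's tweets}}.
    """
    per_month = defaultdict(Counter)
    for month, client in records:
        per_month[month][client] += 1
    shares = {}
    for month, counts in sorted(per_month.items()):
        total = sum(counts.values())
        shares[month] = {c: 100.0 * k / total for c, k in counts.items()}
    return shares

# Made-up records, for illustration only.
records = [("2010-01", "web"), ("2010-01", "web"),
           ("2010-01", "mobile"), ("2010-02", "mobile")]
print(client_share_by_month(records))
```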