Mapping Twitter's Firehose - More stories from your data

Last week the journal First Monday (firstmonday.org) published a paper titled "Mapping the global Twitter Heartbeat: The geography of Twitter". The stories and visualisations drawn from its data offer some fascinating insights.

The study analysed 1.5 billion tweets sent by 70 million users in one month to work out where people were. On average, it found, people who mentioned or retweeted each other's messages were 750 miles apart.

From 12:01AM 23 October 2012 through 11:59PM 30 November 2012, the Twitter Decahose from GNIP streamed 1,535,929,521 tweets from 71,273,997 unique users, averaging 38 million tweets from 13.7 million users each day.
All Exact Location coordinates in the Twitter Firehose 23 October 2012 to 30 November 2012.

The average tweet is 74 characters long and consists of 9.4 words. In all, this dataset encompasses just over 0.9 percent of all tweets sent since the debut of Twitter and 35.6 percent of all active users as of December 2012.

Remember that old 80/20 rule?

Twitter’s content stream is dominated by a small number of users. The top 15 percent of users account for 85 percent of all tweets, while the top five percent of all users account for 48 percent of all tweets and the top one percent of all users (just 720,365) account for 20 percent of all tweets.

A very small number of core users thus drive the majority of Twitter’s traffic. A quarter of users active during this period tweeted just once, while half tweeted between one and four times. Roughly 30 percent of users were active a single day (sending one or more tweets that day), while half were active one–three days, and 75 percent of users were active 10 days or less. The top 10 percent of users were active 24–39 days, with about one percent of users active all 39 days.
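
For readers who want to run the same kind of concentration check on their own data, here is a minimal sketch. It assumes you already have a list of per-user tweet counts. The function name `top_share` and the toy numbers are made up for illustration and are not taken from the paper.

```python
def top_share(tweet_counts, top_fraction):
    """Share of all tweets sent by the most active `top_fraction` of users."""
    if not tweet_counts:
        raise ValueError("tweet_counts must not be empty")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Rank users from most to least active.
    ranked = sorted(tweet_counts, reverse=True)
    # Number of users in the top group, rounded to the nearest user (at least one).
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical toy data: 20 users, 900 tweets in total.
counts = [400, 150, 100] + [50] * 2 + [20] * 5 + [5] * 10

for fraction in (0.05, 0.15):
    print(f"top {fraction:.0%} of users: {top_share(counts, fraction):.1%} of tweets")
```

On real data the same function can be called with 0.01, 0.05 and 0.15 to compare against the one, five and 15 percent figures quoted above.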

The strong presence of Twitter in the United States is reflected in the fact that six of the top 20 cities are in the United States. Jakarta alone accounts for nearly three percent of all georeferenced tweets, illustrating Indonesia’s outsized presence on Twitter, while New York City and São Paulo are nearly tied for second. Texas stands out in that two of its cities, Dallas and Houston, make the top 20 list, and a third, San Antonio, ranks 42nd with 0.32 percent.

English is by far the most common language on Twitter, accounting for 38.25 percent of all tweets and 41.57 percent of georeferenced tweets. Yet just 2.17 percent of all English tweets are georeferenced, indicating that the vast majority of tweets in the language do not carry native geographic information. Spanish is the second most popular language overall, at just a quarter of English's volume, but among georeferenced tweets it is tied with Japanese.

In all, there were 485,941,182 links to 223,712,255 distinct URLs from 4,816,802 different Web sites (tweets can contain multiple links). The top six domains with the most links are twitter.com (16.8 percent), instagram.com (13.3 percent), facebook.com (11.9 percent), youtube.com (6.2 percent), ask.fm (3.2 percent), and tmblr.co (2.9 percent).

Looking just at georeferenced tweets, there were a total of 8,943,092 links to 7,331,672 distinct URLs from 113,389 Web sites. The top domains were foursquare.com (45.5 percent), instagram.com (17.5 percent), twitter.com (15.3 percent), myloc.me (3.5 percent), path.com (2.2 percent), and youtube.com (1.8 percent).

What's the impact of location on shared content and engagement?

The average minimum distance between a user and the geographic focus of the article across all 18,650 news stories was 1,151 statute miles, in keeping with the large average distances seen in retweet and reference pairings among users. Examining the distances more closely, just over a quarter of all links (26 percent) were to stories about the same city the user was located in, 37 percent were to events within a 100-mile radius of the user, and 47 percent were within 300 miles. At the same time, a nearly identical proportion (46 percent) were to stories about events more than 600 miles away, meaning that tweeted news stories were nearly evenly split between events near the user and those far away. This indicates that users not only show no preference for communicating with users physically near them over those far away, but also discuss nearby and distant events at equal levels. This suggests that geography may play an even lesser role in social media than previously thought.
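
To make the distance figures concrete, here is a rough sketch of how a user-to-story distance in statute miles could be computed and bucketed. It uses the standard haversine great-circle formula. That choice is an assumption, since the excerpt above does not say how the paper measured distance, and the function names and sample coordinates are illustrative only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in statute miles (approximate)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in statute miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def min_distance_miles(user, story_locations):
    """Smallest distance from the user to any location a story mentions."""
    return min(haversine_miles(user[0], user[1], lat, lon) for lat, lon in story_locations)


def share_within(distances, radius_miles):
    """Fraction of distances that fall within the given radius."""
    return sum(d <= radius_miles for d in distances) / len(distances)


# Hypothetical example: a user in New York and a story mentioning London and Boston.
new_york = (40.7128, -74.0060)
story = [(51.5074, -0.1278), (42.3601, -71.0589)]
print(round(min_distance_miles(new_york, story)), "miles to the nearest story location")
```

Collecting `min_distance_miles` for every shared link and passing the results to `share_within` with radii of 100 and 300 miles would give shares comparable to the percentages above.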

What about Klout scores? Based on data from georeferenced tweets, we can see that Indonesia has the highest concentration of users in the mid-to-high Klout range [37-57]. I'm not sure why the report left out the top Klout brackets.

There is massive variation by hour in the percentage of tweets matched by the geocoding system. From a peak of 68.9 percent of all tweets geocoded at 1AM PST to a low of 15.9 percent of tweets at 7PM PST, the textual geographic density of Twitter changes by 53 percentage points over the course of each day. This has enormous ramifications for the use of Twitter as a global monitoring system, as it suggests that the representativeness of geographic tweets changes considerably depending on time of day.
Percent of Twitter Decahose tweets geocoded by hour 23 October 2012 to 30 November 2012 (PST).
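
The size of that daily swing is easiest to read as a gap in percentage points alongside a peak-to-low ratio. A quick check, using only the two figures quoted above (the function names are illustrative):

```python
PEAK_PCT = 68.9  # share of tweets geocoded at 1AM PST, as quoted above
LOW_PCT = 15.9   # share of tweets geocoded at 7PM PST, as quoted above


def spread_in_points(peak, low):
    """Absolute gap between two percentages, in percentage points."""
    return round(peak - low, 1)


def peak_to_low_ratio(peak, low):
    """How many times larger the peak share is than the low share."""
    return peak / low


print(spread_in_points(PEAK_PCT, LOW_PCT), "percentage points")
print(round(peak_to_low_ratio(PEAK_PCT, LOW_PCT), 1), "times the low")
```

In other words, the geocoded share at the daily peak is a little over four times the share at the daily low.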

Check out the rest of the report here; it's full of interesting scientific analysis.