Visualizing the Twitter social graph, Part 1: Getting the data
This series is a technical writeup of how we created visualizations of Twitter social graphs like this:

I will attempt to cover the entire process, from prototyping to release. This first part will cover retrieving the initial data necessary to build a prototype.
Let’s get started
Technically, the first thing to do when designing a visualization is to ask yourself, “What questions am I trying to answer about this data?” Asking yourself this question will shape everything you do about the project, so it’s important that you take it seriously - otherwise you’re just creating visualizations for the sake of creating visualizations. Unfortunately, this is out of the scope of this article, but I highly recommend you check out Visualize This, by Nathan Yau and Mining the Social Web, by Matthew A. Russell. We will be assuming you have already decided what you’re building - a graph of your Twitter network.
First, I created an application with Twitter. This is actually a good idea even if you already have an application key with them, because you can dedicate all of your REST API calls for retrieving the data that you need for your prototype. It’s also a good way to secure the name for your project/website/startup, as Twitter application names must be unique.
Next, I defined what data I wanted to get. In this case, I wanted a list of all my friends and followers and I wanted lists of all their friends and followers. In my opinion, it’s better to start off ambitious about what data you want and slowly refine that to what data you actually need.
Perusing the Twitter API docs, we come across these two API methods:
These API methods return user IDs of the friends and followers of a specific user. Perfect, right?
The next step is to actually test what we get back from these API methods. One of the great things about the Twitter API is that Twitter has one of the best API consoles I’ve ever seen, it’s hidden in the Twitter for Mac client (check out the Developer Tab in Preferences):

First, I use GET users/show to figure out my Twitter user ID from my screen_name. Then I pass that into GET friends/ids and take a look at the response:
{
"previous_cursor": 0,
"next_cursor_str": "0",
"ids": [
20015311,
35097545,
30923,
.
.
.
14572071,
755936,
1058011
],
"previous_cursor_str": "0",
"next_cursor": 0
}
I do the same for GET followers/ids and we’re already half-way to our goal - we now have a list of all my friends and followers.
It’s worth mentioning that, as of this writing, I only have 207 friends and 167 followers on Twitter. Astute readers will note that these numbers are considerably smaller than the amount that we can get back from either of these two API calls - let’s talk about what we would need to do if we had a much larger number of friends and/or followers.
If we re-read the API documentation for GET followers/ids, we will see that the maximum number of user IDs from this call is 5000. This means that if we have more than 5000 followers, we will need to execute multiple API calls to get all the followers. Let’s see how this works in pseudo-code:
Call GET followers/ids with our user_id and no additional params.
Store all the user_ids somewhere.
Check the response, if next_cursor is greater than 0, call GET followers/ids again with our user_id and the cursor set to next_cursor.
Store all the user_ids somewhere.
Keep doing this until next_cursor is 0.
In practice, my actual PHP code does not look that different:
$twitter_user_id = 6588972;
$cursor = -1;
$followers = array();
do {
try {
$ret = $twitter->getFollowers(array(
'user_id' => $twitter_user_id,
'cursor' => $cursor,
));
} catch (TwitterException $e) {
return $this->handleTwitterException($e);
}
if ($ret['success']) {
foreach ($ret['data']->ids as $id) {
$followers[] = $id;
}
$cursor = $ret['data']->next_cursor_str;
}
} while (!$ret['success'] || $cursor > 0);
In this case $twitter is my homegrown Twitter class that talks to the Twitter API, but even without seeing that, you can tell there are two weird things going on here. First of all, it appears that we’re potentially throwing exceptions as a result of fetching followers and it seems that the same API call can get repeated over and over again. There are two important caveats with the Twitter API that are being abstracted here.
Twitter Caveats
Believe it or not, the Twitter API will not always return you a successful response. Sometimes it will tell you that the data you’re requesting does not exist (HTTP code 404), sometimes you don’t have permission to receive it (HTTP code 401 or 403), and sometimes it’s just not in the mood to give you what you want (HTTP code 503). Every time you get a response from Twitter, you must check the HTTP code and handle the situation appropriately. In the case of my Twitter class, I consider codes 401, 403, and 404 to be fatal exceptions, but I consider code 503 to be a side-product of the cost of doing business with Twitter. I also set a threshold for the number of times that we can fail to successfully fetch content so we don’t get stuck in an endless loop of fail whale - when it exceeds that threshold, it throws an exception.
The other thing to consider about the Twitter REST API is rate-limiting. Here is their documentation about rate-limiting, but a quick summary is that if you do un-authenticated API calls, you can do 150 requests per hour, per IP address. If you do OAuth calls, you can do 350 requests per hour, per access token. Thus, when we move this prototype to production, you should never do un-authenticated API calls because your application simply won’t scale, unless you have an amazing pool of free public IP addresses. For the prototype though, you can actually combine these two to get 500 API calls per hour. The way that this affects you in terms of your code is that you should never let your client hit the Twitter API after you’ve been rate-limited. This is an abuse of Twitter’s API and goodwill - you must track your usage on your side. Fortunately, Twitter makes this easy, they return headers that tell you exactly how many API calls you have left, it’s simply a matter of monitoring these headers after each call and updating the count on your side. The ones of particular importance to you are:
X-FeatureRateLimit-Remaining This tells you how many API calls you have remaining before you get rate-limited.
X-FeatureRateLimit-Reset This tells you when the rate limit will be reset.
The way I do is it that I set a threshold where I rate-limit myself. For example, let’s say the threshold is 2 API calls remaining. After every call, I check X-FeatureRateLimit-Remaining against my threshold and if it is the same or lower, I have the program sleep until the time specified in X-FeatureRateLimit-Reset. This ensures that we never send Twitter a request after we’ve been rate-limited.
Last thing to mention about rate-limiting before we move on - if there is an equivalent way to get the data you need through the Streaming API, you should do so, as the Streaming API has practically no limits. Also, if you’re using User Streams and you’re planning on moving on from prototype to production, you should take a look at the Site Streams documentation. User Streams are fine for prototyping but in production, you must move on to Site Streams which requires applying for beta access.
What’s the conclusion that we should draw from these two caveats? Either ensure the Twitter library you’re using handles these issues, or write the client yourself.
Grabbing the data
Now that we understand how the Twitter API works and how to properly talk to it, we can get back to the problem at hand - how do we get all the data? Assuming that we’re starting with my user (207 friends, 167 followers), let’s do some quick napkin math. We have to loop over approximately 400 users and fetch friends and followers for each of them. Most of those users will have less than 5000 friends and followers each, but let’s assume a handful will be Internet celebrities and have lots of followers. Let’s say 5% will have more than 5000 followers but less than 25000 followers. If we assume worst-case scenario:
400 * 0.95 * 2 = 760
400 * 0.05 * (5 + 1) = 120
760 + 120 = 880
880 / 350 = 2.5
What does this mean? Well, for 95% of the users, we have to make two API calls each, one to fetch followers and one to fetch friends. For 5% of the users, we have to make 6 API calls each, 5 to fetch up to 25000 followers, and one to fetch friends. This totals 880 API calls. We can only make 350 OAuth calls an hour, so it will take us 2.5 hours to fetch all this data.
This napkin math is all well and good, but we have to consider what I like to call the Barack Obama issue.
When I worked at Flickr and was writing code that had to scale, I would often sanity check my code by considering, what happens when Barack Obama gets involved? For many tasks, I would use a specific photograph to test code, which was this one:

http://www.flickr.com/photos/whitehouse/5680724572/
The Situation Room photo has over 2.6 million views on Flickr and over 6,000 favorites, well out of the league of most Flickr photos. If you were writing code that had to paginate those favorites, or loop over them checking for something, you could test it against that photo and it would tell you how well your algorithm would hold up in the worst case scenario.
In the case of Twitter, I use Barack Obama himself @BarackObama (680,337 friends, 13,712,222 followers).
It is not an edge case that I would follow @BarackObama, in fact it’s a perfectly reasonable assumption that I might. So your code has to deal with this, but perhaps not in the way that you think.
In order to fetch over 20,000,000 twitter user ids to get both his friends and followers, we would have to execute 4000 API calls, which would take 11.4 hours. Yeah - that’s not going to happen. The obvious thing to do is to exclude @BarackObama from the social graph. How useful could his network be anyways, if he follows that many people or that many people follow him? But here lies the question - how do we programatically know to exclude him?
GET users/show will return the extended information of a user, including their friends and followers counts, but GET users/lookup will return the extended information for up to 100 users.
Thus, if we have 400 users, we will only need to make 4 additional API calls to determine which users not to fetch friend and follower IDs for. So our rough estimate is 884 calls and it will take approximately 2.5 hours to fetch the data.
The code for this should be pretty obvious, but let’s say that you have three functions:
get_followers($twitter_user_id)
get_friends($twitter_user_id)
lookup_users($user_ids)
Here’s some PHP that would bring it all together:
<?php
$twitter_user_id = 6588972;
$followers = get_followers($twitter_user_id);
$friends = get_followers($twitter_user_id);
file_put_contents($twitter_user_id . '_followers.csv', implode(',', $followers));
file_put_contents($twitter_user_id . '_friends.csv', implode(',', $friends));
// There's probably a better way to do this, but this abuses PHP's hash table abilities and is relatively fast
$users = array();
foreach ($followers as $user_id) {
$users[$user_id] = 1;
}
foreach ($friends as $user_id) {
if (!isset($users[$user_id])) {
$users[$user_id] = 1;
}
}
$user_ids = array_keys($users);
if (count($user_ids) > 0) {
do {
$set = array_splice($user_ids, 0, 100);
$extended_users = lookup_users($set);
foreach ($extended_users as $extended_user) {
if ($extended_user->followers_count > 25000 || $extended_user->friends_count > 25000) {
unset($users[$extended_user->id]);
}
}
} while (count($user_ids) > 0);
foreach ($users as $user_id => $nothing) {
$followers = get_followers($twitter_user_id);
$friends = get_followers($twitter_user_id);
file_put_contents($user_id . '_followers.csv', implode(',', $followers));
file_put_contents($user_id . '_friends.csv', implode(',', $friends));
}
}
?>
Now that we have all the data in CSV files, we must process the data so that it will be accepted by our graphing program. We will cover that in the next part of this series, Visualizing the Twitter social graph, Part 2: Processing the data.
Bertrand Fan, co-founder
Solving Instagram’s Unshredder with Mechanical Turk and $0.50
By now you’ve probably seen the Instagram Engineering Challenge: The Unshredder and a few solutions including one written purely in Canvas and Javascript. I don’t know any company code blogs that solve other company code blogs’ engineering challenges but that sounds meta enough to be awesome, so let’s do it.
The Solution
First I took the source image and sliced it into 20 slices:
#!/bin/bash
for i in {0..19}
do
offset=$[$i*32]
index=$[$i+1]
convert -crop "32x359+$offset" TokyoPanoramaShredded.png "$index.jpg"
done
Then I made a simple page that lets someone drag-and-drop the slices into the correct order:

This uses some simple jquery UI to make the slices draggable. We bind an event after each slice is moved that saves the order of the images to a hidden input and then we POST that data when they click the button. PHP then takes that order and saves it into a text file like this:
9.jpg 11.jpg 15.jpg 17.jpg 19.jpg 14.jpg 8.jpg 4.jpg 3.jpg 12.jpg 5.jpg 20.jpg 18.jpg 13.jpg 7.jpg 16.jpg 2.jpg 6.jpg 1.jpg 10.jpg
ImageMagick can then reconstruct the image using these params:
params=`cat out/order.txt`
convert +append $params unshredded.jpg
Next I downloaded Amazon Mechanical Turk Command Line Tools and installed them. I cloned the helloworld example using hits/makeTemplate.sh and modified it to create a HIT template with these parameters:
unshred.properties
title:Move strips of an image around until the original image is formed
description:An image will be sliced into 20 vertical slices and randomized, you will need to put them back in the right order
keywords:puzzle image slices
reward:0.50
assignments:1
unshred.question
<?xml version="1.0" encoding="UTF-8"?>
<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
<Question>
<QuestionIdentifier>answer</QuestionIdentifier>
<QuestionContent>
<FormattedContent><![CDATA[<h1>Move strips of an image around until the original image is formed</h1>
<p>An image will be sliced into 20 vertical slices and randomized, you will need to put them back in the right order.</p>
<p>Here is an example of an image you will need to solve: </p>
<p><img alt="" src="http://unshred.recollect.com/example.jpg" /></p>
<p>When you're ready, go this this url:</p>
<h1><a href="http://unshred.recollect.com" target="_blank">http://unshred.recollect.com</a></h1>
<p>and drag the slices around until the original image is formed, then click the <b>I'm done</b> button. </p>
<p> </p>
<p>Enter your solution ID in the box below:</p>]]></FormattedContent>
</QuestionContent>
<AnswerSpecification>
<FreeTextAnswer/>
</AnswerSpecification>
</Question>
</QuestionForm>
When we run the script using run.sh, this submits a task to Mechanical Turk which looks like this to the end user:

If they want to make a whopping $0.50, they can accept the task of unshredding our image.
Then I wrote a bash script that brings it all together:
#!/bin/bash
if [ -z "$1" ]
then
echo "Usage: $0 file_to_process"
exit
fi
for i in {0..19}
do
offset=$[$i*32]
index=$[$i+1]
convert -crop "32x359+$offset" "$1" "$index.jpg"
done
FILE=out/order.txt
rm -f $FILE
rm -f unshredded.jpg
START=$(date +%s)
cd /var/www/recollect/unshred.recollect.com/mturk/aws-mturk-clt-1.3.0/hits/unshred/
./run.sh
cd /var/www/recollect/unshred.recollect.com/
echo "Waiting for someone to solve the puzzzle..."
while [ ! -f "$FILE" ]
do
inotifywait -t 86400 -e create -e move_to "$(dirname $FILE)" &>/dev/null
done
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Puzzle solved in $DIFF seconds."
sleep 2
params=`cat $FILE`
convert +append $params unshredded.jpg
Does it work? Yes, yes it does.
Starting with the source image:

We run our script:

Then if we open unshredded.jpg, we get:

This also works on other images:
Source image (flickr):

We run our script:

And open unshredded.jpg:

That’s it! It’s blazingly fast, taking about 5 to 10 minutes to run. Check out the landing page or browse the code on Github.
Bertrand Fan, co-founder
Queue within a queue
This article is about queues. If you’re not familiar with queues, here’s a good introduction to them.
The Problem
Less than 36 hours after we launched our splash page, I noticed our queue wasn’t draining.
Sometimes queues don’t drain as fast as you would like them to but since we’re running on Amazon EC2, we have a great contingency plan for this scenario: Spin up more boxes. We can spin up a few High-CPU instances, put the worker code on them and drain the queue until it goes down to 0 and then shutdown the boxes. It costs us less than a cup of coffee. If it kept happening, we might increase the number of permanent dedicated Offline Task boxes or even write a script that monitors the size of the queue and automatically spin up and shut down High-CPU instances to handle the load. Unfortunately, the problem we were having was that the queue wasn’t draining at all.
We use gearman at Recollect. I’ve seen a few queueing systems over the years, including two iterations of Flickr’s homegrown system and Leonard Lin’s system for Barack Obama’s 2008 campaign, which had a nice Amazon SQS fallback because apparently when you process contributions for a political campaign it’s a good idea to have a redundant queue in case your main queue goes down. I’d also rolled my own for Coca Cola. It seems to me that there’s a point where every engineer wants to write their own blogging system and those engineers eventually graduate to wanting to write their own queueing system. This isn’t a bad thing, it’s good to know how the internals of such things work, but I no longer have that desire. I had experimented with Gearman in the past with good results, so we went with it.
I SSH’d into our Offline Task box and noticed that the gearmand process was consuming between 70-90% of the CPU. The workers were all running but not receiving any new tasks and there were over 60,000 tasks in the system. I checked the gearman and worker logs and then ran strace on the processes. Then I shutdown the workers, restarted gearman, and started them up again. They started to process tasks and I monitored it for a few minutes and sure enough, after a while the workers stopped receiving tasks and the gearmand process gradually took over the whole CPU. The logs and straces were similar. Over the next few hours, I repeated this process a few times.
We use a persistent storage layer for Gearman so that if you shutdown the daemon and start it again, it still has the same tasks in the system that it had before. I knew that if I deleted all the tasks and started it again, it would be fine, but then we would just be rolling the dice to when this would happen again.
The Diagnosis
At first I got side-tracked thinking this was the bug that I was encountering, so I dutifully upgraded Gearman to the newest version but had the same results. I also compared my Gearman worker code to similar implementations and watched most of this presentation at Etsy on the off chance that someone mentioned this issue. Finally I shutdown the gearman server and examined the tasks themselves and noticed an interesting characteristic, over 5000 of them were tasks that had future timestamps.
Many of our offline tasks interact with the Twitter REST API and if you’ve done any work with that API, you know that Twitter rate-limits you to 350 OAuth calls per hour, per authed user. Once you reach your limit for the hour, you must throttle your requests to the Twitter API and wait out the remainder of the hour. Twitter even facilitates this by sending you headers with how many API calls you have remaining and if you have reached the limit, the Unix timestamp when you can resume hammering their API.
When we built Gearman into our system, I researched ways for it to handle this rate-limiting and found that recent versions of Gearman support a command called SUBMIT_JOB_EPOCH, which lets you specify that a job will not run before a certain timestamp. This sounded like exactly what we needed - when a task that worked with the Twitter API got a response from Twitter that we had been rate-limited, we could simply re-insert the same task into the system with SUBMIT_JOB_EPOCH using the same Unix timestamp that Twitter had provided for when to resume the API call. I found that neither the PECL extension nor the PEAR Net::Gearman libraries supported this new command, so I patched it into the PEAR library and we started using it in production. It worked under the small loads that we put upon it.
I decided to run an experiment. While the gearman server was shutdown, I modified the tasks in our persistent store to all run immediately and when they re-inserted due to rate-limiting, to not use SUBMIT_JOB_EPOCH but instead immediately re-insert the task for processing. This would mean that we were trying to run tasks that we knew could not be run because Twitter had already rate-limited them but we could at least test the possibility that SUBMIT_JOB_EPOCH was inefficient and crashing gearmand. Sure enough, after I restarted the server, it started processing tasks and eventually drained all 60,000 tasks.
Now that I had identified that SUBMIT_JOB_EPOCH almost certainly had something to do with gearmand hanging, I did some more research on it and found its origin story. Looking over it’s code, I realized that the process was likely iterating through a linked list of 5000+ future jobs over and over again which might explain why the gearmand CPU skyrocketed. Technically I could just move the gearmand process to a beefier server, but logically we would just have the same problem once we reached 10,000 or 25,000 future jobs. No, as per Unix philosophy, we would just have to provide that functionality somewhere else.
Here’s the problem: we have a bunch of jobs, each with a timestamp that specifies when it needs to be run in the future. If we do the “Stupidest Thing That Works ™”, then we insert these into a mySQL table and have a cronjob that retrieves and deletes rows that are earlier or equal to the current time and insert these all into Gearman. At first glance, this seems appealing, but part of the reason we’re using Gearman in the first place is so we don’t have a crappy mySQL queue that has to be polled. Then I thought, we could do it in memory - we could have an array hashed by timestamp and then check if the first timestamp is earlier than the current time and if not, have the system sleep until that time. But then we have a problem, how do we handle inserting new tasks into the system if the system is sleeping? If we don’t sleep, do we poll? How often do we poll and if it’s too often, isn’t that exactly what Gearman is doing in the first place?
Usually when I can’t figure out how to code something, I sleep on it and without fail halfway through my shower the next day it occurs to me. In this case, I had lunch instead and halfway through lunch I realized that I had been approaching it all wrong - why not just use a system with events already built into it?
The Code
60 lines of node.js flavored Javascript later (newest version here):
var gearman = require("gearman"),
gearman_client = gearman.createClient(4730, "localhost");
var http = require('http');
http.createServer(function (req, res) {
req.setEncoding('utf8');
var fullBody = '';
req.on('data', function(chunk) {
fullBody += chunk.toString();
});
req.on('end', function() {
res.writeHead(200, {'Content-Type': 'text/plain'});
res.end('SUCCESS\n');
var data = JSON.parse(fullBody);
var recv_string = 'Received Job: ' + data.job.name + ' - ' + data.job.workload;
var now = Math.round(new Date().getTime() / 1000);
if ((data.when_to_run - now) <= 0) {
console.log(recv_string + ' - executing now!');
job = data.job;
gearman_client.submitJob(job.name, job.workload, {
background: true,
priority: 'normal',
encoding: 'utf8',
uniqid: data.uniqid
});
} else {
console.log(recv_string + ' - executing in ' + (data.when_to_run - now) + ' seconds.');
setTimeout(function(job) {
console.log('Executing Job: ' + job.name + ' - ' + job.workload);
gearman_client.submitJob(job.name, job.workload, {
background: true,
priority: 'normal',
encoding: 'utf8',
uniqid: data.uniqid
});
}, (data.when_to_run - now) * 1000, data.job);
}
});
}).listen(10101, "127.0.0.1");
In our worker code, I just make a POST request to this server with a JSON object containing the job and when the job should run. The server leverages setTimeout to delay the Gearman insert until the epoch specified. Since node.js is single threaded, it is not guaranteed that the Gearman insert will happen exactly at the epoch, but this isn’t a requirement for our rate-limiting case - we just need it to happen some time after the epoch. We run it through forever to daemonize it and that’s about it - it took me longer to compile node.js on our EC2 instance than it did to write the program. Gearman continues doing what it does best, sending tasks to workers without blinking, and this does what it does best, firing events more or less when it’s supposed to fire them.
Some Final Thoughts
Earlier in the post, I talked about how I didn’t have a desire to write a queue and then I did, which just goes to show you that engineers write queues even when they don’t want to write them - our brains are just wired that way.
Last but not least, I like to name our daemons and tools by describing what they do. This program delivers messages to the future. At the end of Back to the Future Part II, Doc Brown is living in 1885 and needs to send a message to Marty McFly in 1955, so he leaves a letter with a Western Union. In this one case, Western Union did manage to successfully deliver a message to the future, so I named it westernunion. There’s also an episode of Quantum Leap where Sam needs to get a message to Ziggy to open a chamber door in 1999 but they’re in 1945, so they mail a letter to his dad’s lawyer. I think you will agree that westernunion sounded like a better delivery mechanism than dadslawyer.
Bertrand Fan, co-founder
