
Accurate Twitter Statistics vs Efficiency

I wrote about Twitter Statistics earlier today, but got to thinking about the actual processing time of generating these.

The more tweets/followers you analyze, the more accurate the results are going to be for a time period.

The problem is that the more tweets/followers you analyze, the longer it takes.

Through the API, tweets come back at a maximum of two hundred per page.  Analyzing 8 pages through the API yields about 1407 tweets, not the full 1600 (don’t ask ;P).

The time it takes to crawl through those 8 pages is surprising, though.  Can you imagine the processing time it takes to crawl through even a short period of the lifetime of an active Twitter account?
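As a rough back-of-the-envelope sketch, the crawl cost grows with the number of pages you have to request. The 200-per-page figure is the one from above; the 1.5 seconds per request is purely an assumed number for illustration, so measure your own before trusting it.

```python
import math

TWEETS_PER_PAGE = 200  # the per-page maximum mentioned above


def pages_needed(tweet_count, per_page=TWEETS_PER_PAGE):
    """How many API pages it takes to cover tweet_count tweets."""
    return math.ceil(tweet_count / per_page)


def estimated_crawl_seconds(tweet_count, seconds_per_request=1.5):
    """Rough crawl time; seconds_per_request is an assumed average."""
    return pages_needed(tweet_count) * seconds_per_request


print(pages_needed(1407))              # 8
print(estimated_crawl_seconds(20000))  # 150.0
```

So an account with 20,000 tweets would already need 100 requests, which is where the waiting comes from.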

In all of my programming classes, I’ve learned to keep the steps separate and not to stray from my programming cycle.

1. Input (declarations before this)
2. Processing
3. Output
4. Storage

But when it comes to Twitter statistics, this can work a little differently.  Maybe this applies to statistics in general?

1. Input (username to begin the analyzing at)
2. Crawling.  This could technically count as input too, because you’re not calculating yet, you’re gathering the required information.  So this adds another step to the programming cycle, which I call “Retrieving”.
3. Storage (to reduce the calls on the Twitter server, you need to cache the data for at least a short period.  Store it so we can calculate from it later.  I’m becoming fonder of pipe-separated values (for dates, etc.) to store the fields).
4. Calculation (calculate the actual statistics you are working on.)
5. Output (display the statistics)
6. Storage (this is optional, but if you want to save some strain on your server, you can also cache the output along with the data you already cached from Twitter, to save having to calculate it again).
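Here is a minimal sketch of how the six steps above could hang together. Everything in it is an assumption for illustration: `fetch_page` is a hypothetical function you would supply to talk to the API, the caches are plain in-memory dictionaries, the cache lifetimes are made up, and the two statistics are toy ones. The raw tweets are stored as pipe-separated records, as mentioned in step 3.

```python
import time

RAW_CACHE_SECONDS = 15 * 60   # assumed lifetime for the crawled data
STATS_CACHE_SECONDS = 5 * 60  # assumed lifetime for the calculated output

raw_cache = {}    # username -> (stored_at, list of "created_at|text" records)
stats_cache = {}  # username -> (stored_at, stats dict)


def to_record(tweet):
    """Pipe-separated record; assumes created_at contains no pipe."""
    return f"{tweet['created_at']}|{tweet['text']}"


def from_record(record):
    created_at, text = record.split("|", 1)  # the text itself may contain pipes
    return {"created_at": created_at, "text": text}


def is_fresh(entry, now, lifetime):
    return entry is not None and now - entry[0] < lifetime


def retrieve(username, fetch_page, max_pages=8):
    """Step 2: crawl page by page, stopping at the first empty page."""
    tweets = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(username, page)
        if not batch:
            break
        tweets.extend(batch)
    return tweets


def calculate(tweets):
    """Step 4: two toy statistics."""
    total = len(tweets)
    average = sum(len(t["text"]) for t in tweets) / total if total else 0.0
    return {"total": total, "average_length": average}


def statistics_for(username, fetch_page, now=None):
    """Step 1 is the username; step 5 (output) is left to the caller."""
    now = time.time() if now is None else now
    if is_fresh(stats_cache.get(username), now, STATS_CACHE_SECONDS):
        return stats_cache[username][1]              # step 6 paying off
    if not is_fresh(raw_cache.get(username), now, RAW_CACHE_SECONDS):
        tweets = retrieve(username, fetch_page)      # step 2
        raw_cache[username] = (now, [to_record(t) for t in tweets])  # step 3
    tweets = [from_record(r) for r in raw_cache[username][1]]
    stats = calculate(tweets)                        # step 4
    stats_cache[username] = (now, stats)             # step 6
    return stats
```

Calling `statistics_for("someone", my_fetch_page)` twice in a row should only hit Twitter the first time; the second call comes straight out of the cache.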

In summary, generating accurate Twitter statistics does cost some time: it takes time to crawl, time to calculate, and time to make things accurate.

That’s why some of these things are not a great fit for “instant” applications (meaning the output has to come back quickly).  Your program can always be tweaked for better efficiency, though, and it’s not always your code that costs the time.

It can be the latency between your server and Twitter’s, the load on Twitter (and the API) at the moment, or downtime at Twitter (which there seems to have been quite a bit of the past little while... status.twitter.com doesn’t accurately display it either).
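One way to tell whether the time is going to your own code or to the trip to Twitter is to time each stage separately. This is only a sketch: `fake_crawl` and `fake_calculate` are stand-ins I made up so it runs on its own, and the sleep is just pretending to be network latency.

```python
import time


def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-ins so the sketch runs on its own; swap in your real functions.
def fake_crawl():
    time.sleep(0.05)  # pretend this is network latency
    return ["tweet"] * 200


def fake_calculate(tweets):
    return len(tweets)


tweets, crawl_seconds = timed("crawl", fake_crawl)
total, calc_seconds = timed("calculate", fake_calculate, tweets)
```

If the crawl number dwarfs the calculate number, tweaking your calculation code won’t buy you much; caching will.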

Sorry for the long blog post! Just rambling on :]

lovingly posted at 9:59pm Thursday, 5th March 2009 with comments