Twitter Analysis – Part 3: Scraping tweets and a minor detour

Finally! The time came to scrape the tweets for UFC 197. Everything was set up, from the script to the virtual machine.

The schedule for the event was as follows:

[UFC 197 event poster]

  • 23:30 BST – UFC Fight Pass Prelims
  • 01:00 BST – Prelims
  • 03:00 BST – Main Card

I started the script when the UFC Fight Pass Prelims began and stopped it at 05:37 BST, approximately half an hour after the main event finished. The run went smoothly; I think the test run on the UFC on Fox event gave me the confidence that this one would too. In the end I obtained a JSON file 637MB in size.

Now that I had my dataset, I looked through it and realised that some of the tweets were retweets. I knew almost immediately that I didn't want them in the dataset, as they could skew the analysis: certain words would carry more weight simply because they had been retweeted many times. So I set about extracting the original tweets from the UFC 197 JSON file.

“Cleaning” the original dataset

This was uncharted territory for me, as the guide I was following (mentioned in Part 1 of this series) scraped tweets from the user's own timeline.

In order to get only original tweets, I thought I'd do some initial analysis to find the total number of tweets, the number of retweets and the number of original tweets. If the retweets and original tweets added up to the total, I'd know I was on the right track and could extract the original tweets only.
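As a rough sketch of what that sanity check might look like (the filename is a placeholder, and I'm assuming the file holds one JSON object per line, as the streaming script wrote it; note the blank-line guard, which ties into the problem described in the next section):

```python
import json

total = retweets = originals = 0

with open('ufc197_tweets.json') as f:  # placeholder filename
    for line in f:
        if not line.strip():
            continue  # skip the blank separator lines (see next section)
        tweet = json.loads(line)
        # The Streaming API embeds the original tweet under the
        # 'retweeted_status' key whenever a tweet is a retweet
        if 'retweeted_status' in tweet:
            retweets += 1
        else:
            originals += 1
        total += 1

print('Total: %d, Retweets: %d, Originals: %d' % (total, retweets, originals))
```

If retweets plus originals equals the total, the retweet check is catching everything.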

Another way to boost my confidence was to use the dataset I obtained from the UFC on Fox event. This would serve as a test bed to see if my code was doing what it was supposed to, and since it was a smaller file compared to the UFC 197 one, I could carry out the checks relatively quickly.

Running into problems

While I was trying to read each tweet line by line, I kept running into an error about the JSON file not being read correctly. I tried debugging it myself but had no luck, so I posted a question on Stack Overflow. It turned out I wasn't accounting for the blank lines in my JSON file that separated each tweet. I had assumed my code would let me "plug-n-play" my data, so I never considered that the blank lines being read were the source of the error.
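In code, the fix amounts to skipping whitespace-only lines before handing anything to the parser; something along these lines (a sketch, not my exact notebook code):

```python
import json

def load_tweets(path):
    """Parse one tweet per line, ignoring the blank separator lines."""
    tweets = []
    with open(path) as f:
        for line in f:
            # json.loads('') raises a ValueError, the likely culprit
            # behind the "file not being read correctly" error
            if line.strip():
                tweets.append(json.loads(line))
    return tweets
```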

I have encountered problems like this before, where I don't consider all the possibilities because I've spent too long on the problem and got flustered, and therefore resort to Stack Overflow. But I hope that by doing more projects I'll keep improving my debugging skills.

Another issue I ran into was extracting the attribute I wanted from each tweet (a single JSON object). This was easy to solve: I just had to read the Twitter documentation to see how the JSON object is structured.
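For example, once you know the structure, pulling out a field is plain dictionary access. The field names below (text, user.screen_name, created_at) come from the Tweet object described in the Twitter docs; the helper itself is just for illustration:

```python
def summarise(tweet):
    """Pull out the attributes I care about from one parsed tweet."""
    return {
        'text': tweet['text'],                        # the tweet body
        'screen_name': tweet['user']['screen_name'],  # the author
        'created_at': tweet['created_at'],            # timestamp string
    }
```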

Click here to be taken to the IPython Notebook which extracted the original tweets.
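The gist of the notebook is a filter along these lines (a sketch with placeholder filenames, writing each original tweet straight back out so the output stays line-delimited):

```python
import json

with open('ufc197_tweets.json') as infile, \
     open('ufc197_originals.json', 'w') as outfile:
    for line in infile:
        if not line.strip():
            continue  # skip the blank separator lines
        tweet = json.loads(line)
        if 'retweeted_status' not in tweet:  # keep original tweets only
            outfile.write(line)
```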

As a result of filtering my dataset, the file size went from 637MB to 212MB. I knew the file would be somewhat smaller than the original, but I didn't think it would shrink by more than half. Concerned by this, I decided to do the good ol' {CTRL + F} trick in Notepad++ to find out how many retweets and original tweets were in the dataset. The numbers matched those produced by my code. This was a relief, and I gladly moved on.

In the next part I finally get to produce the word cloud…woop woop!
