For my first steps into data analysis I wanted to work on something related to my interests, to make it enjoyable and informative. Since I like a wide variety of sports, I decided I would do some analysis on a UFC (Ultimate Fighting Championship) event. These events last approximately 6 hours, depending on the length of the bouts, which is a perfect time frame for me: I would get a large amount of data from a single event without having to wait days or weeks.
After some brief research I decided I would do the analysis for the upcoming event, UFC 197: Jones vs. Saint Preux. However, I first wanted to do a test run of scraping tweets from the event before it, UFC on Fox: Teixeira vs. Evans. The reason for this is that I wanted to make sure everything would go well when scraping tweets for UFC 197, and the UFC on Fox event would be a perfect test as both events have the same structure.
Now that I had set some objectives, I needed to learn how to scrape tweets. From the beginning I wanted to use Python as my language of choice, since it's one of the main programming languages used for data analysis thanks to the tools and libraries available for it. Also, I didn't have a Twitter account before this, so I set one up and did all the things you need to do before scraping tweets (i.e. getting the consumer keys, access tokens, etc.).
After some googling I found a series of guides on mining Twitter data with Python. Funnily enough, the author of these guides, Marco Bonzanini, was a teaching assistant of mine for a module in my undergraduate degree (small world, eh!). After reading through the first few parts, I felt this was a suitable source of information to begin scraping tweets.
So I started off by following Part 1: Collecting data. I used all the code he provided, as I felt there was no point reinventing the wheel when it comes to scraping tweets. However, I made sure I understood the code to get a better idea of what was going on, and I added in some comments for myself.
Click here to see part 1 of my IPython Notebook.
For example, there was one particular line of code I didn't understand:
What is filter()? What does it do?
filter() lets you match incoming tweets against a 'filter' you apply. If a tweet matches your filter, you can do whatever you want with it — in my case, store the tweet in a JSON file.
The filters you can apply are:
- follow – returns statuses posted by the specified user IDs
- track – returns tweets containing the specified keywords or phrases
- locations – returns tweets based on the location they were tweeted from
Now I understood that, to return tweets containing a certain hashtag, filter() must be used with the track parameter. This is perfect for me, as UFC 197 has an associated hashtag (#UFC197) which will help me return tweets related to the event.
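To make this concrete, here is a minimal sketch of the tweet-storing step — my own hypothetical reconstruction, not the exact code from the guide. The JSON-writing logic is pulled out into a plain class (the name `TweetWriter` and the file path are mine) so it runs without Twitter credentials; the tweepy hookup with the track filter is only indicated in comments.

```python
# A minimal sketch (assumed, not the guide's exact code) of the "store each
# matching tweet in a JSON file" step. Each tweet arrives from the stream as
# one raw JSON string; we append it to a file, one tweet per line.
class TweetWriter:
    def __init__(self, path):
        self.path = path

    def on_data(self, raw_tweet):
        try:
            with open(self.path, "a") as f:
                f.write(raw_tweet.rstrip("\n") + "\n")
            return True  # keep the stream alive
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True

# With tweepy and valid API credentials, this would be hooked up roughly as:
#   stream = tweepy.Stream(auth, listener)   # listener delegates to TweetWriter
#   stream.filter(track=['#UFC197'])         # track: tweets with the hashtag
```

Storing one JSON object per line means the file can later be read back tweet by tweet, calling `json.loads` on each line for the analysis stage.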
In the next part I will discuss how the test run went.