This data wrangling project is the second project of the ALX-T Data Analyst Nanodegree programme on Udacity. It focuses on wrangling and analysing data from the @WeRateDogs Twitter account.
This involved:
- gathering data from multiple sources (downloading the @WeRateDogs Twitter archive dataset and the image predictions data using the Requests library, and querying the Twitter API using tweepy for additional data),
- assessing data (visually and programmatically),
- cleaning and merging the datasets, and then
- performing analysis on the tweets to extract insights.
The project used the following Python libraries:
- Pandas: for storing and manipulating structured data.
- NumPy: for multi-dimensional array and matrix data structures, and for performing mathematical operations.
- Matplotlib: for all visualizations (including maps and graphs).
- Seaborn: a data visualization library based on Matplotlib that provides a high-level interface for drawing statistical graphics.
- tweepy: an open-source, easy-to-use library for accessing the Twitter API from a Python application.
- requests: for sending HTTP requests from Python.
- os: for interacting with the operating system.
- json: for parsing JSON into Python dictionaries or lists, and for converting Python dictionaries or lists into JSON strings.
- re: regular expression facilities for checking whether a string matches a pattern at its start (using the match function) or contains the pattern anywhere (using the search function).
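The difference between the match and search functions of re can be seen in a small example. This is a minimal sketch: the tweet text and the "numerator/denominator" rating pattern are hypothetical and serve only as an illustration, not as the pattern used in the project.

```python
import re

# Hypothetical tweet text, used only for illustration.
text = "This is Bella. She loves the park. 13/10"

# Assumed pattern: a rating written as "numerator/denominator".
rating = re.compile(r"(\d+)/(\d+)")

# match() only succeeds if the pattern matches at the start of the string.
at_start = rating.match(text)

# search() scans the whole string and returns the first match found.
anywhere = rating.search(text)

# Convert the two captured groups to integers.
numerator, denominator = (int(group) for group in anywhere.groups())
```

Here match finds nothing because the text does not begin with a rating, while search locates the rating at the end of the text.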
The main steps for this project are as follows:
- Data Wrangling:
  - Data Gathering
  - Data Assessment
  - Data Cleaning
- Analysis and Visualisation
- Conclusions/Results
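As an illustration of the gathering step, the sketch below parses tweet JSON into flat records ready for cleaning and merging. It assumes the API responses were saved one JSON object per line; the function name parse_tweet_lines and the sample data are hypothetical. The fields id, retweet_count and favorite_count are those of a standard Twitter API tweet object.

```python
import io
import json


def parse_tweet_lines(lines):
    """Return one dict per tweet with the fields needed for analysis."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Hypothetical sample standing in for a file of saved API responses.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 40}\n'
    '\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 12}\n'
)
records = parse_tweet_lines(sample)
```

A list of dictionaries like this can be passed directly to the pandas DataFrame constructor and then merged with the other datasets on tweet_id.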
Based on the data and analysis carried out, I found that:
- There's a strong positive correlation (r = 0.93) between a tweet's retweet count and its favourite count.
- Tweets of the doggo-puppo category overwhelmingly outperformed the rest in retweets and likes. @WeRateDogs might focus on this category for future tweets, given its apparent popularity with the followers.
- Tweets sent from a web browser gathered more retweets on average, although other factors (such as tweet content) are likely contributing here.
- 84% (1561/1851) of the dogs didn't have a category. This means conclusions about dog stage were drawn from a very small portion of the observations.
- The tweet source "Vine - Make a Scene" wasn't part of the final analysis. I suspect it was lost when I dropped retweet rows and tweets with nonstandard dog names.
- I couldn't query additional data for 29 tweets: 28 returned a "No status found with that ID" error, and one returned a "Sorry, you are not authorized to see this status" error.
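The correlation between retweets and favourite counts reported in the findings is presumably a Pearson coefficient, which is what the pandas corr method computes by default. The sketch below shows that calculation with the standard library only; the function name pearson_r and the sample counts are made up for illustration and are not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up counts, not taken from the project's dataset.
retweets = [10, 20, 30, 40]
favourites = [35, 62, 95, 118]
r = pearson_r(retweets, favourites)
```

A value of r close to 1 means the two counts rise together almost linearly. It does not by itself show that one causes the other.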
Below are some of the websites I consulted for this project:
- Pandas Combine Two Columns of Text in DataFrame, SparkByExamples
- How to Convert Text Data from Requests object to DataFrame, StackOverflow
- BeautifulSoup: Extract Text from Anchor Tag, StackOverflow
- Authenticate the Twitter API with Python (Tweepy), JC Chouinard
- How to Make a Table in Jupyter Notebook, CodeGrepper