Project 5¶

Act Report/Findings¶

We are wrangling, analyzing, and visualizing the tweet archive of the Twitter account @dog_rates, aka WeRateDogs.

'WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.'

Analyze and Visualize¶

There are 3 questions we want to answer:¶

1. What are the most popular dog breeds (in terms of favorites)?¶

2. What are the most popular dog breeds (in terms of retweets)?¶

Is there a correlation between most favorites and most retweets?

3. Which breeds are the most highly rated, on average?¶
Do favorites or retweets have a relationship with the most highly rated breeds? If so, which relationship is stronger?

1. What are the most popular dog breeds (in terms of favorites)?¶

french bulldog
golden retriever
pembroke
samoyed
labrador retriever

2. What are the most popular dog breeds (in terms of retweets)?¶

French_bulldog
Pembroke
Samoyed
golden_retriever
chow

Was there a correlation between likes and retweets?¶

Favorites and retweets had a correlation of 0.98, which indicates a strong positive relationship: the more favorites, the more retweets.
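A correlation of this kind can be computed with pandas. The sketch below is a minimal illustration, assuming a DataFrame with `favorite_count` and `retweet_count` columns; the numbers are made-up values, not the project's data.

```python
import pandas as pd

# Illustrative values only; not taken from the WeRateDogs dataset.
tweets = pd.DataFrame({
    "favorite_count": [100, 250, 400, 800, 1600],
    "retweet_count": [30, 80, 120, 260, 500],
})

# Series.corr uses the Pearson correlation coefficient by default.
corr = tweets["favorite_count"].corr(tweets["retweet_count"])
print(round(corr, 2))
```

A value close to 1 means the two counts rise together; it does not say that one causes the other.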

3. Which breeds are the most highly rated on average?¶

Visualizing dog breeds with the highest rating on average.

Notes to consider:

- only using data from predictions with >50% confidence
- only using data for 10 most common dog breeds
- the 'rating' will be normalized by: rating_numerator/rating_denominator
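The notes above can be sketched as follows. This is one possible implementation, not necessarily the one used for the report: the column name `p1` (the top predicted breed) and the helper name `mean_rating_by_breed` are assumptions, and the demo rows are made up.

```python
import pandas as pd

def mean_rating_by_breed(df, min_conf=0.5, top_n=10):
    """Average normalized rating per breed for the top_n most common breeds."""
    # Keep only predictions with confidence strictly above min_conf.
    confident = df[df["p1_conf"] > min_conf].copy()
    # Restrict to the most common breeds among the confident rows.
    common = confident["p1"].value_counts().head(top_n).index
    confident = confident[confident["p1"].isin(common)]
    # Normalize the rating: numerator divided by denominator.
    confident["rating"] = (
        confident["rating_numerator"] / confident["rating_denominator"]
    )
    return confident.groupby("p1")["rating"].mean().sort_values(ascending=False)

# Made-up rows for illustration; "p1" is an assumed column name.
demo = pd.DataFrame({
    "p1": ["Samoyed", "Samoyed", "chow", "pug"],
    "p1_conf": [0.9, 0.8, 0.7, 0.3],
    "rating_numerator": [13, 12, 11, 10],
    "rating_denominator": [10, 10, 10, 10],
})
result = mean_rating_by_breed(demo)
print(result)
```

In the demo, the pug row is dropped because its prediction confidence is below 50%.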

Samoyed
chow
Pembroke
golden_retriever
toy_poodle

Bolded breeds are in the top lists for both most favs and most retweets. The toy poodle is the only breed that is in the upper half (top 5) of neither the favs list nor the retweets list.

The correlation between Top 10 by Ratings and Top 10 by Favs was 0.89

The correlation between Top 10 by Ratings and Top 10 by Retweets was 0.86

Wrangle, Gather, Assess, Clean¶

Software Used

The following packages (libraries) need to be installed. You can install them via conda or pip. Please revisit our Anaconda tutorial earlier in the Nanodegree program for package installation instructions.

pandas
NumPy
requests
tweepy
json (part of the Python standard library, so no installation is needed)

Project Motivation¶

Context¶

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

Key Points¶

You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
Cleaning includes merging individual pieces of data according to the rules of tidy data.
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

Project Details¶

Data Wrangling:
- Gathering data
- Assessing Data
- Cleaning Data
Storing, analyzing, and visualizing your wrangled data
Reporting on
- your wrangling efforts
- data analyses and visualizations

Gathering Data¶

3 Sources:

.CSV containing preliminary data from WeRateDogs tweets
.TSV containing images; neural network analysis of dog breeds
Twitter API with additional data

We are now going to access the Twitter API data. You need to set up a Twitter developer account and obtain the following credentials:

consumer_key

consumer_secret

access_token

access_secret

wait_on_rate_limit and wait_on_rate_limit_notify are set to True because:

Twitter puts limits on number of requests
we're gathering data from thousands of tweets
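A minimal sketch of the client setup, assuming Tweepy with OAuth 1a credentials. The helper name `make_api` is hypothetical. To our knowledge, Tweepy 4.x removed `wait_on_rate_limit_notify`, so the sketch passes only `wait_on_rate_limit`; on Tweepy 3.x both flags can be passed.

```python
def make_api(consumer_key, consumer_secret, access_token, access_secret):
    """Return an authenticated tweepy.API client that waits out rate limits."""
    # Third-party import kept local so this module loads without Tweepy.
    import tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # With wait_on_rate_limit=True, Tweepy sleeps until the rate-limit
    # window resets instead of raising an error.
    return tweepy.API(auth, wait_on_rate_limit=True)
```

Keep the four credentials out of the notebook itself, for example by reading them from environment variables.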

Tweet information should be accessed through 'get_status'

each tweet's JSON data is written to the 'tweet_json.txt' file using 'dump()' (from the json module)
this will repeat for every tweet_id in the 'archive' df
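The gathering loop described above might look like the sketch below. It assumes `api` is a Tweepy client exposing `get_status` and that each returned status carries its raw payload in `_json`; the helper name `save_tweets` and the choice of one JSON object per line are assumptions.

```python
import json

def save_tweets(api, tweet_ids, path="tweet_json.txt"):
    """Write one JSON object per line for each tweet; return the IDs that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id)
            except Exception:
                # For example, the tweet was deleted or is protected.
                failed.append(tweet_id)
                continue
            # Tweepy status objects keep the raw API payload in `_json`.
            json.dump(status._json, out)
            out.write("\n")
    return failed
```

Collecting the failed IDs makes it easy to see later which tweets in the archive have no API data.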

The large file we just created contains the JSON data for every tweet. We now need to create a pandas df with the info needed:

we can build this df from a list of dictionaries ('listofdicts')
create a list and append one dictionary (the parsed JSON data) per tweet
'loads()' from the json module parses each line of our text file

We are focusing on

ID
retweet_count
favorite_count
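A sketch of this step, assuming 'tweet_json.txt' holds one JSON object per line and that the three fields appear under the keys `id`, `retweet_count` and `favorite_count`. The helper name `load_tweet_counts` is hypothetical.

```python
import json
import pandas as pd

def load_tweet_counts(path="tweet_json.txt"):
    """Build a DataFrame of id, retweet_count and favorite_count."""
    list_of_dicts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            # Keep only the fields we are focusing on.
            list_of_dicts.append({
                "id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(
        list_of_dicts, columns=["id", "retweet_count", "favorite_count"]
    )
```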

Assessing Data¶

.CSV containing preliminary data from WeRateDogs tweets --> stored as archive
.TSV containing images; neural network analysis of dog breeds --> stored as img
Twitter API with additional data --> stored as json

We are now going to assess, clean, drop, and make these tidy
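For the programmatic part of the assessment, a small helper like the one below can be run on each of the three DFs. It is only a sketch: the helper name `assess` and the idea of summarizing into a dictionary are assumptions, and visual inspection of the data is still needed alongside it.

```python
import pandas as pd

def assess(df, key="tweet_id"):
    """Summarize row count, nulls, dtypes and duplicate keys for one DF."""
    return {
        "rows": len(df),
        "null_counts": df.isnull().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Duplicate keys are only counted if the key column exists.
        "duplicate_keys": (
            int(df[key].duplicated().sum()) if key in df.columns else None
        ),
    }
```

For example, `assess(archive)` would show whether `timestamp` is stored as a plain string column rather than as a datetime.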

1. Assess 'archive'¶

2. Assess 'img'¶

3. Assess 'json'¶

Issues and Conclusions¶

DQ - data quality issue
DT - data tidiness issue

archive¶

there are retweets and we are only looking at non-retweet data (DT)
there are 4 columns (doggo, floofer, pupper, and puppo) that need to collapse into one column: category (DT)
timestamp is not stored as a datetime format (DQ)
there are rows where missing values are stored as 'None' rather than as null values (DQ)
source column (archive) is not consistent (DQ)
there is no photo or rating for: 835246439529840640 (DQ)
there is an incorrect rating for: 666287406224695296, 810984652412424192 (DQ)

Img¶

some columns are not very descriptive (e.g. 'p1_conf') (DQ)

json¶

to match column titles in other 2 DFs, 'id' should be 'tweet_id' (DQ)

Overall¶

Looking across all 3 DFs:

there are records that do not match between the DFs (DQ)
tweet data is spread across multiple DFs and needs to be combined (DT)

Cleaning Data¶

Steps:

Copy and Combine
Drop retweets
Descriptive Columns for 'img'
Fix Lowercases in names
Timestamp Fix
Incorrect Rating Fix
HTML column cleanup
Dog Stages

1. Copy and Combine¶

We are going to make a 'c' (for 'clean') copy of each of the 3 DFs.
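One way to sketch this step is shown below. The variable names (`archive_c`, `img_c`, `json_c`) and the choice of an inner join, which drops records that do not appear in all three DFs, are assumptions rather than the notebook's exact code.

```python
import pandas as pd

def copy_and_combine(archive, img, json_df):
    """Copy the three DFs and merge them into one DF keyed on tweet_id."""
    archive_c = archive.copy()
    img_c = img.copy()
    # rename returns a new DF, so the original API DF is left untouched.
    json_c = json_df.rename(columns={"id": "tweet_id"})
    # Inner joins keep only the tweets present in all three DFs.
    tweet = archive_c.merge(img_c, on="tweet_id", how="inner")
    tweet = tweet.merge(json_c, on="tweet_id", how="inner")
    return tweet
```

Working on copies keeps the originals available for comparison if a cleaning step goes wrong.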

2. Dropping retweets¶
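A possible sketch of this step. It assumes retweets are marked by a non-null `retweeted_status_id` column and that all retweet-related columns share the `retweeted_status` prefix; both are assumptions about the archive's layout, and the helper name `drop_retweets` is hypothetical.

```python
import pandas as pd

def drop_retweets(df, retweet_col="retweeted_status_id"):
    """Keep original tweets only and drop the retweet-related columns."""
    # A null retweet ID means the row is an original tweet.
    originals = df[df[retweet_col].isnull()]
    retweet_cols = [c for c in df.columns if c.startswith("retweeted_status")]
    return originals.drop(columns=retweet_cols)
```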

3. Fixing 'img' DF; renaming to more descriptive columns¶

4. Fixing lowercase issue in overall DF (tweet)¶

5. Timestamp¶
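Converting the column can be sketched with `pd.to_datetime`. The sample strings below are an assumed format (date, time and a UTC offset); the real archive values may differ.

```python
import pandas as pd

# Assumed timestamp format, for illustration only.
archive_c = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# utc=True yields a timezone-aware datetime column in UTC.
archive_c["timestamp"] = pd.to_datetime(archive_c["timestamp"], utc=True)
print(archive_c["timestamp"].dtype)
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.month` become available for analysis.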

6. Incorrect Rating Fix¶

The incorrect ratings are corrected from the fractions in the tweet text.
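Pulling fractions out of tweet text can be sketched with a regular expression. The pattern and the helper name `extract_ratings` are assumptions, and the sample sentences are made up. When a tweet contains more than one fraction, the correct rating still has to be picked by inspection.

```python
import re

# A numerator (optionally with a decimal part), a slash, then a denominator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def extract_ratings(text):
    """Return every (numerator, denominator) pair found in the text, in order."""
    return [(float(n), int(d)) for n, d in RATING_RE.findall(text)]

print(extract_ratings("She's a 13.5/10"))
print(extract_ratings("Open 24/7, still 12/10"))
```

The second sample shows why a naive "first match" extraction can produce an incorrect rating.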

7. HTML column cleanup¶
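A sketch of the cleanup, assuming the 'source' values are HTML anchor strings of the form `<a href="...">Twitter for iPhone</a>`. That format, the pattern and the helper name `strip_anchor` are assumptions.

```python
import re

# Capture the text between the opening tag's '>' and the next '<'.
ANCHOR_TEXT_RE = re.compile(r">([^<]*)<")

def strip_anchor(html):
    """Return the visible text of an <a ...>text</a> string, else the input."""
    match = ANCHOR_TEXT_RE.search(html)
    return match.group(1) if match else html

sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(strip_anchor(sample))
```

With pandas, the helper can be applied to the whole column, for example `tweet["source"] = tweet["source"].map(strip_anchor)`.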

8. Dog Stages¶

The 4 columns 'doggo', 'floofer', 'pupper', and 'puppo' can be merged into ONE column: dog_stage

Steps for this clean-up:

create a dictionary in which every tweet_id has a corresponding 'dog stage'; the two lists should be of equal length
derive a dog_class DF from the dictionary and merge it into 'tweet' (the master DF)
drop the 4 original dog category columns
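The steps above can be sketched as follows. The sketch assumes each of the 4 columns holds its own stage name when the stage applies and some other value (such as 'None') otherwise, and that tweet_id is unique in 'tweet'. Joining multiple stages with a comma is also an assumption.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

def collapse_dog_stages(tweet):
    """Merge the 4 stage columns into a single dog_stage column."""
    stage_dict = {"tweet_id": [], "dog_stage": []}
    columns = [tweet[s] for s in STAGES]
    for tweet_id, *flags in zip(tweet["tweet_id"], *columns):
        # A stage applies when the cell contains the stage's own name.
        found = [s for s, flag in zip(STAGES, flags) if flag == s]
        stage_dict["tweet_id"].append(tweet_id)
        stage_dict["dog_stage"].append(", ".join(found) if found else None)
    # Both lists must line up one-to-one before building the DF.
    assert len(stage_dict["tweet_id"]) == len(stage_dict["dog_stage"])
    dog_class = pd.DataFrame(stage_dict)
    merged = tweet.merge(dog_class, on="tweet_id", how="left")
    return merged.drop(columns=STAGES)
```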

Re-arranging column (optional)¶

for easier reading/digesting

Exporting Data¶

The cleaned tweet DF now needs to be exported to a CSV file, stored as 'twitter_archive_master.csv'.
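A minimal export sketch using `DataFrame.to_csv`. The helper name `export_master` is hypothetical, and leaving out the index is a design choice so that re-reading the file does not add an extra unnamed column.

```python
import pandas as pd

def export_master(tweet, path="twitter_archive_master.csv"):
    """Write the cleaned DF to CSV without the positional index."""
    tweet.to_csv(path, index=False, encoding="utf-8")
    return path
```

Note that a datetime column is written as text, so it needs `parse_dates` (or another `pd.to_datetime` call) when the file is read back.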
