Project 5

Act Report/Findings

We are wrangling, analyzing, and visualizing the tweet archive of the Twitter account @dog_rates, aka WeRateDogs.

'WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.'

Analyze and Visualize

There are 3 questions we want to answer:

1. Is there a correlation between most favorites and most retweets?

2. Which breeds are the most highly rated, on average?

3. Do favs or retweets have a relationship with the most highly rated breeds? If so, which of the two relationships is stronger?

  1. french bulldog
  2. golden retriever
  3. pembroke
  4. samoyed
  5. labrador retriever

  1. French_bulldog
  2. Pembroke
  3. Samoyed
  4. golden_retriever
  5. chow

Was there a correlation between likes and retweets?

Favs and retweets had a correlation of 0.98, which shows a strong positive relationship: tweets with more favs also tend to have more retweets.
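A correlation like this can be computed directly with pandas. The following is only a minimal sketch: the sample numbers are made up, and the column names favorite_count and retweet_count are assumed to match the cleaned DF.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned master DF (values are invented).
tweets = pd.DataFrame({
    "favorite_count": [100, 250, 400, 800, 1600],
    "retweet_count": [30, 80, 120, 260, 500],
})

# Series.corr() defaults to the Pearson correlation coefficient.
fav_rt_corr = tweets["favorite_count"].corr(tweets["retweet_count"])
print(round(fav_rt_corr, 2))
```

A value close to 1 means the two counts rise together; it does not say that one causes the other.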

[Figure: favs vs. retweets]

Which breeds are the most highly rated on average?

Visualizing dog breeds with the highest rating on average.

Notes to consider:

- only using data from predictions with >50% confidence
- only using data for the 10 most common dog breeds
- the 'rating' will be normalized by: rating_numerator/rating_denominator

  1. Samoyed
  2. chow
  3. Pembroke
  4. golden_retriever
  5. toy_poodle
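The three notes above can be sketched as a short pandas pipeline. The rows below are invented, and the column names breed and confidence are assumptions standing in for whatever the master DF actually uses.

```python
import pandas as pd

# Hypothetical rows; 'breed' and 'confidence' are assumed column names.
tweets = pd.DataFrame({
    "breed": ["Samoyed", "Samoyed", "chow", "chow", "pug", "pug"],
    "confidence": [0.9, 0.8, 0.7, 0.4, 0.95, 0.6],
    "rating_numerator": [13, 12, 12, 5, 10, 11],
    "rating_denominator": [10, 10, 10, 10, 10, 10],
})

# Keep only predictions with more than 50% confidence.
confident = tweets[tweets["confidence"] > 0.5].copy()

# Normalize the rating so different denominators are comparable.
confident["rating"] = confident["rating_numerator"] / confident["rating_denominator"]

# Restrict to the 10 most common breeds, then average the rating per breed.
top_breeds = confident["breed"].value_counts().head(10).index
mean_rating = (
    confident[confident["breed"].isin(top_breeds)]
    .groupby("breed")["rating"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_rating)
```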

Bolded breeds are in the top list for both most favs and most retweets. The toy poodle is the only breed that isn't in the upper half (top 5) of the favs and retweets lists.

The correlation between Top 10 by Ratings and Top 10 by Favs was 0.89.

[Figure: Top 10 by Ratings vs. Top 10 by Favs]

The correlation between Top 10 by Ratings and Top 10 by Retweets was 0.86.

[Figure: Top 10 by Ratings vs. Top 10 by Retweets]
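One possible reading of these two numbers is a breed-level comparison: average the rating, favs, and retweets per breed, then correlate the averages. The sketch below follows that reading with invented data; it is an assumption about the method, not a record of it.

```python
import pandas as pd

# Hypothetical per-tweet rows with an already normalized 'rating' column.
tweets = pd.DataFrame({
    "breed": ["Samoyed", "Samoyed", "chow", "chow", "pug", "pug"],
    "rating": [1.3, 1.2, 1.2, 1.1, 1.0, 1.1],
    "favorite_count": [9000, 8000, 7000, 6500, 4000, 5000],
    "retweet_count": [3000, 2500, 2600, 2000, 1200, 1500],
})

# One row per breed, holding the mean of each measure.
breed_means = tweets.groupby("breed")[["rating", "favorite_count", "retweet_count"]].mean()

# Pearson correlation of mean rating against mean favs and mean retweets.
rating_vs_favs = breed_means["rating"].corr(breed_means["favorite_count"])
rating_vs_rts = breed_means["rating"].corr(breed_means["retweet_count"])
print(round(rating_vs_favs, 2), round(rating_vs_rts, 2))
```

Comparing the two coefficients side by side is what answers "which one is stronger".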

wrangle_report

Project 5

Wrangle, Gather, Assess, Clean

We are wrangling, analyzing, and visualizing the tweet archive of the Twitter account @dog_rates, aka WeRateDogs.

'WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.'

Software Used

The following packages (libraries) need to be installed. You can install these packages via conda or pip. Please revisit our Anaconda tutorial earlier in the Nanodegree program for package installation instructions.

  • pandas
  • NumPy
  • requests
  • tweepy
  • json (part of the Python standard library, so no separate installation is needed)

Project Motivation

Context

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

Key Points

  • You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
  • Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
  • Cleaning includes merging individual pieces of data according to the rules of tidy data.
  • The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
  • You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

Project Details

  • Data Wrangling:
    • Gathering data
    • Assessing Data
    • Cleaning Data
  • Storing, analyzing, and visualizing your wrangled data
  • Reporting on
    • your wrangling efforts
    • data analyses and visualizations

Gathering Data

3 Sources:

  • .CSV containing preliminary data from WeRateDogs tweets
  • .TSV containing images; neural network analysis of dog breeds
  • Twitter API with additional data
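The two flat files can be read with pandas. This is a minimal sketch that uses tiny in-memory stand-ins, because the real file names are not given here; the column names in the samples are assumptions as well.

```python
import io
import pandas as pd

# In-memory stand-ins for the .CSV and .TSV files (contents are invented).
archive_csv = io.StringIO(
    "tweet_id,text,rating_numerator,rating_denominator\n"
    "1,Good dog 13/10,13,10\n"
)
img_tsv = io.StringIO(
    "tweet_id\tjpg_url\tp1\tp1_conf\n"
    "1\thttps://example.com/a.jpg\tSamoyed\t0.9\n"
)

# read_csv handles both formats; a TSV only needs the tab separator.
archive = pd.read_csv(archive_csv)
img = pd.read_csv(img_tsv, sep="\t")
print(archive.shape, img.shape)
```

With real files, the StringIO objects would be replaced by the file paths.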

We are now going to access the Twitter API data. You need to set up a Twitter developer account and get the following:

  • consumer_key
  • consumer_secret
  • access_token
  • access_secret

wait_on_rate_limit and wait_on_rate_limit_notify are set to True because:

  • Twitter puts limits on number of requests
  • we're gathering data from thousands of tweets

Tweet information should be accessed through 'get_status':

  • the JSON data for each tweet is written to the 'tweet_json.txt' file using 'dump()' (from the json module)
  • this will repeat for every tweet_id in the 'archive' df
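The gathering loop described above might look like the sketch below. The helper name dump_tweets is hypothetical, and the function takes an already authenticated API object so it does not depend on a particular tweepy version. The commented setup lines follow the tweepy 3.x style; wait_on_rate_limit_notify appears to have been removed in tweepy 4, so treat those lines as an assumption about the installed version.

```python
import json

# Assumed tweepy 3.x setup (not executed here):
# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)
# api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


def dump_tweets(api, tweet_ids, path="tweet_json.txt"):
    """Write one JSON object per line for every tweet that can be fetched.

    Returns the list of tweet ids that could not be retrieved.
    """
    failed = []
    with open(path, "w", encoding="utf-8") as outfile:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception:
                # Deleted or protected tweets raise an API error; skip them.
                failed.append(tweet_id)
                continue
            # status._json holds the raw dictionary returned by the API.
            json.dump(status._json, outfile)
            outfile.write("\n")
    return failed
```

Calling dump_tweets(api, archive["tweet_id"]) would then produce the text file, and the returned list shows which tweets no longer exist.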

The large file we just wrote contains the JSON data for every tweet. We now need to create a pandas df with the info needed:

  • read the text file line by line; 'loads()' (from the json module) parses each line into a dictionary
  • append each dictionary to a list
  • create the df from this list of dictionaries

We are focusing on:

  • ID
  • retweet_count
  • favorite_count
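Reading the file back into a df could be sketched as follows. The helper name tweets_to_df is hypothetical, and the sample lines are invented; only the three fields listed above are kept.

```python
import io
import json
import pandas as pd


def tweets_to_df(lines):
    """Build a DataFrame from an iterable of JSON lines (one tweet per line)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["id", "retweet_count", "favorite_count"])


# Invented sample standing in for 'tweet_json.txt'.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 9, "favorite_count": 40, "lang": "en"}\n'
)
json_df = tweets_to_df(sample)
print(json_df)
```

With the real file, the call would be wrapped in `with open("tweet_json.txt") as f:` and passed `f` instead of the sample.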

Assessing Data

  • .CSV containing preliminary data from WeRateDogs tweets --> stored as archive
  • .TSV containing images; neural network analysis of dog breeds --> stored as img
  • Twitter API with additional data --> stored as json

We are now going to assess and clean these DFs, drop what is not needed, and make them tidy.

1. Assess 'archive'

2. Assess 'img'

3. Assess 'json'
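A programmatic assessment of each DF usually relies on a handful of pandas calls. The sketch below runs them on an invented three-row stand-in for 'archive'; the same calls would apply to 'img' and 'json'.

```python
import pandas as pd

# Hypothetical stand-in for the 'archive' DF.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "name": ["Phineas", "None", "None"],
})

archive.info()                                     # dtypes and non-null counts
missing_per_column = archive.isnull().sum()        # true nulls per column
duplicate_ids = archive["tweet_id"].duplicated().sum()  # repeated tweet ids
name_counts = archive["name"].value_counts()       # reveals placeholder values

print(missing_per_column, duplicate_ids, name_counts, sep="\n")
```

Note that isnull() does not count the string 'None', which is why value_counts() is useful alongside it.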

Issues and Conclusions

DQ - data quality issue
DT - data tidiness issue

archive

  • there are retweets, but we are only looking at non-retweet data (DT)
  • there are 4 columns (doggo, floofer, pupper, and puppo) that need to collapse into one column: category (DT)
  • timestamp is not stored in a datetime format (DQ)
  • there are rows where missing values are stored as 'None' rather than as null values (DQ)
  • source column (archive) is not consistent (DQ)
  • there is no photo or rating for: 835246439529840640 (DQ)
  • there is an incorrect rating for: 666287406224695296, 810984652412424192 (DQ)

img

  • some columns are not very descriptive (i.e. 'p1_conf') (DQ)

json

  • to match column titles in other 2 DFs, 'id' should be 'tweet_id' (DQ)

Overall

Looking across all 3 DFs and between them:

  • there are records that do not match between the DFs (DQ)
  • tweet data is unorganized and spread across multiple DFs, which need to be combined (DT)

Cleaning Data

Steps:

  1. Copy and Combine
  2. Drop retweets
  3. Descriptive Columns for 'img'
  4. Fix Lowercases in names
  5. Timestamp Fix
  6. Incorrect Rating Fix
  7. HTML column cleanup
  8. Dog Stages

1. Copy and Combine

We are going to make a copy of each of the 3 DFs, marked with 'c' for 'clean'.
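Copying and combining could be sketched as below. The '_c' suffix, the name json_df (used instead of json to avoid shadowing the json module), and the inner joins are all assumptions; an inner join is one way to drop the records that do not match between the DFs.

```python
import pandas as pd

# Invented stand-ins for the three gathered DFs.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
img = pd.DataFrame({"tweet_id": [1, 2], "p1": ["Samoyed", "chow"]})
json_df = pd.DataFrame({"id": [1, 3], "retweet_count": [5, 7], "favorite_count": [20, 30]})

# Work on copies so the originals stay available for comparison.
archive_c = archive.copy()
img_c = img.copy()
json_c = json_df.copy().rename(columns={"id": "tweet_id"})  # match the other DFs

# Inner joins keep only tweets that appear in all three DFs.
tweet = (
    archive_c
    .merge(img_c, on="tweet_id", how="inner")
    .merge(json_c, on="tweet_id", how="inner")
)
print(tweet)
```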

2. Dropping retweets
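One common way to drop retweets is to keep only rows where the retweet marker is null. The column name retweeted_status_id is an assumption about the archive's schema, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; a non-null retweeted_status_id marks a retweet.
archive_c = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 111.0, np.nan],
})

# Keep original tweets only, then drop the now-empty marker column.
archive_c = archive_c[archive_c["retweeted_status_id"].isnull()]
archive_c = archive_c.drop(columns=["retweeted_status_id"])
print(archive_c)
```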

3. Fixing 'img' DF; renaming to more descriptive columns
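Renaming can be done with a single mapping. Only 'p1_conf' is named in this report; the other original columns and all of the new names below are hypothetical.

```python
import pandas as pd

# Hypothetical one-row 'img' copy; 'p1' and 'p1_dog' are assumed column names.
img_c = pd.DataFrame({
    "tweet_id": [1],
    "p1": ["Samoyed"],
    "p1_conf": [0.9],
    "p1_dog": [True],
})

# Map terse names to descriptive ones (the new names are illustrative).
img_c = img_c.rename(columns={
    "p1": "prediction_1",
    "p1_conf": "prediction_1_confidence",
    "p1_dog": "prediction_1_is_dog",
})
print(list(img_c.columns))
```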

4. Fixing lowercase issue in overall DF (tweet)

5. Timestamp
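Converting the column is a one-line call to pd.to_datetime. The timestamp format in the sample rows is an assumption about how the archive stores it.

```python
import pandas as pd

# Hypothetical timestamps stored as plain strings.
tweet = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# utc=True yields a timezone-aware datetime column.
tweet["timestamp"] = pd.to_datetime(tweet["timestamp"], utc=True)
print(tweet.dtypes)
```

Once converted, the .dt accessor gives access to year, month, weekday, and so on.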

6. Incorrect Rating Fix

The ratings are re-extracted from the fractions in the tweet text.
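Pulling the rating out of the text can be sketched with a regular expression. The sample tweets are invented, and the pattern is only one possible choice: it accepts decimal numerators and takes the first fraction it finds, so tweets with several fractions would still need a manual check.

```python
import pandas as pd

# Hypothetical tweets whose stored ratings disagree with the text.
tweet = pd.DataFrame({
    "text": ["This is Bella. 13.5/10 would pet", "Good boy 12/10"],
    "rating_numerator": [5, 2],
    "rating_denominator": [10, 10],
})

# Capture "<number>/<number>", allowing a decimal point in the numerator.
fractions = tweet["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")

tweet["rating_numerator"] = fractions[0].astype(float)
tweet["rating_denominator"] = fractions[1].astype(float)
print(tweet[["rating_numerator", "rating_denominator"]])
```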

7. HTML column cleanup
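If the HTML in question is an anchor tag in the source column, the visible label can be extracted with a regular expression. Both the affected column and the sample values below are assumptions.

```python
import pandas as pd

# Hypothetical 'source' values wrapped in HTML anchor tags.
tweet = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
})

# Keep only the text between the opening and closing anchor tags.
tweet["source"] = tweet["source"].str.extract(r">([^<]+)</a>", expand=False)
print(tweet["source"].tolist())
```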

8. Dog Stages

The 4 columns 'doggo', 'floofer', 'pupper', and 'puppo' can be merged into ONE column: dog_stage

Steps for this clean-up:

  • create dictionary: every tweet_id should have a corresponding 'dog stage', so the two lists should be equal in length
  • dog_class DF derived from dictionary to be merged into 'tweet' (master DF)
  • drop 4 original dog category columns
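The three steps above can be sketched as follows. The sample rows are invented, and the sketch assumes a missing stage is stored as the string 'None'; tweets with more than one stage are joined with a comma, which is a design choice rather than something this report specifies.

```python
import pandas as pd

# Hypothetical master DF with the four stage columns.
tweet = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Dictionary of two equal-length lists: one stage entry per tweet_id.
stages = {"tweet_id": [], "dog_stage": []}
for _, row in tweet.iterrows():
    found = [col for col in stage_cols if row[col] == col]
    stages["tweet_id"].append(row["tweet_id"])
    stages["dog_stage"].append(", ".join(found) if found else None)

# Build dog_class from the dictionary, merge it in, drop the old columns.
dog_class = pd.DataFrame(stages)
tweet = tweet.merge(dog_class, on="tweet_id").drop(columns=stage_cols)
print(tweet)
```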

Re-arranging columns (optional)

This makes the DF easier to read and digest.

Exporting Data

The cleaned tweet DF now needs to be exported to a csv file. It is stored as 'twitter_archive_master.csv'.
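The export itself is a single to_csv call. The one-row DF below is only a placeholder for the cleaned master DF.

```python
import pandas as pd

# Placeholder for the cleaned master DF.
tweet = pd.DataFrame({"tweet_id": [1], "dog_stage": ["doggo"]})

# index=False avoids writing the row index as an extra unnamed column.
tweet.to_csv("twitter_archive_master.csv", index=False, encoding="utf-8")

# Reading the file back is a quick sanity check on the export.
restored = pd.read_csv("twitter_archive_master.csv")
print(restored.shape)
```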

Analyze and Visualize

There are 3 questions we want to answer:

1. Is there a correlation between most favorites and most retweets?

2. Which breeds are the most highly rated, on average?