In this data analysis, we look at information about roughly 10,000 movies from The Movie Database (TMDb). We examine which genres were most popular from year to year and explore the relationship between a film's popularity and its vote average score.
Dataset analyzed: TMDb Data
Questions to explore: Which genres were most popular throughout the years? Is there a correlation between popularity and vote average score of a film?
# Import the packages used in this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# The %matplotlib inline magic renders plots inline with the notebook. See:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline
# load the TMDb dataset
df = pd.read_csv('tmdb_movies.csv', sep=',')
df.head()
df.info()
#drop columns not needed
df.drop(['imdb_id', 'id', 'budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies', 'release_date'], axis=1, inplace=True)
#check that this is correct
df.head()
df.info()
#drop all 'missing values' rows
df.dropna(inplace=True)
df.info()
df.describe()
#all unique values in genres
df.genres.unique()
#create new DF from the series with original_title as index; splitting up genres sep by pipes
new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()
# We now want to get rid of the secondary index
# To do this, we will make original_title as a column (it can't be an index since the values will be duplicate)
new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)
#combine the new_df with the original df
genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)
#drop the old genres column and check
genres_df.drop(['genres'], axis=1, inplace=True)
genres_df.head(5)
genres_df.plot(x='release_year',y='popularity', kind='scatter' );
plt.title('Popularity by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Popularity')
genres_df.plot(x='release_year',y='vote_average', kind='scatter');
plt.title('Vote Average by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Vote Average')
#check datatype
type(genres_df['popularity'][0])
# convert popularity from float to int
genres_df['popularity'] = genres_df['popularity'].astype(int)
#check datatype
type(genres_df['popularity'][0])
# convert vote_average from float to int
genres_df['vote_average'] = genres_df['vote_average'].astype(int)
#check datatype
type(genres_df['vote_average'][0])
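Note that `astype(int)` truncates toward zero rather than rounding, so any popularity value below 1 becomes 0. The sketch below uses a few made-up values (not taken from the dataset) to show the difference; keeping the columns as floats, or rounding first, may preserve more of the signal for the later correlation step.

```python
import pandas as pd

# Made-up popularity values for illustration only (not from the TMDb data).
popularity = pd.Series([0.35, 0.99, 1.01, 2.75])

# astype(int) drops the fractional part: 0.99 becomes 0, 2.75 becomes 2.
truncated = popularity.astype(int)

# Rounding before the cast keeps values closer to the originals.
rounded = popularity.round().astype(int)

print(pd.DataFrame({'original': popularity,
                    'truncated': truncated,
                    'rounded': rounded}))
```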
#find the mean popularity score of each genre type with groupby
#(select the column first so that non-numeric columns are not averaged)
genres_df.groupby('mgenres')['popularity'].mean()
#find the 25%, 50%, 75%, and max popularity values with Pandas describe
genres_df.describe().popularity
#top films grouped by genres and popularity means, sorting by top 5
topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)
topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top 5 Genres by Mean Popularity')
plt.xlabel('Genres')
plt.ylabel('Popularity')
#mean popularity grouped by vote average, sorted ascending
rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df
genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')
np.corrcoef(genres_df.vote_average, genres_df.popularity)
Create a new DataFrame from the genres series with original_title as the index, splitting the genres on the pipe separator
new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()
We now want to get rid of the secondary index. To do this, we turn original_title into a column (it can't stay as the index, since its values would be duplicated)
new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)
Combine new_df with the original df
genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)
Drop the old genres column and check
genres_df.drop(['genres'], axis=1, inplace=True)
genres_df.head(5)
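One thing to watch for in the merge above: joining on `original_title` assumes that titles are unique. If two different films share a title, the merge pairs every genre row for that title with every film of that title, which inflates the row count. The sketch below illustrates this on a small hypothetical frame (the rows are invented for illustration) and compares it with `DataFrame.explode`, available in pandas 0.25 and later, which keeps each genre attached to its own row. It is an alternative to consider, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df: two different films share the same title.
toy = pd.DataFrame({
    'original_title': ['Hercules', 'Hercules', 'Up'],
    'release_year': [1997, 2014, 2009],
    'genres': ['Animation|Family', 'Action|Adventure', 'Animation|Comedy'],
})

# Split-then-merge on the title, as in the cells above.
long = (toy.set_index('original_title')['genres']
           .str.split('|')
           .explode()
           .rename('mgenres')
           .reset_index())
merged = toy.merge(long, on='original_title')

# Alternative: split into lists and emit one row per list element.
exploded = (toy.assign(mgenres=toy['genres'].str.split('|'))
               .explode('mgenres')
               .drop(columns='genres')
               .reset_index(drop=True))

print(len(merged), 'rows after merge;', len(exploded), 'rows after explode')
```

With the merge, each "Hercules" film picks up the genres of the other one as well; with `explode`, each film keeps only its own genres.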
Find the mean popularity score of each genre with groupby (selecting the popularity column first so that non-numeric columns are not averaged)
genres_df.groupby('mgenres')['popularity'].mean()
Mean popularity grouped by genre, sorted and limited to the top 5
topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)
Plot the top 5 genres by popularity score
topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top 5 Genres by Mean Popularity')
plt.xlabel('Genres')
plt.ylabel('Popularity')
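The bar chart above averages popularity over all release years. Since the first research question asks about popularity from year to year, a per-year breakdown may also be useful. The following is a minimal sketch on a made-up frame shaped like `genres_df` (the values are invented, and `top_per_year` is a name introduced here for illustration): it averages popularity per year and genre, then keeps the highest-scoring genre in each year.

```python
import pandas as pd

# Made-up rows in the same shape as genres_df (one row per film and genre).
toy_genres = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001, 2001],
    'mgenres': ['Action', 'Drama', 'Action', 'Drama', 'Comedy'],
    'popularity': [1.0, 3.0, 2.0, 0.5, 0.9],
})

# Mean popularity for every (year, genre) pair, as a flat frame.
yearly = toy_genres.groupby(['release_year', 'mgenres'],
                            as_index=False)['popularity'].mean()

# For each year, keep the row whose mean popularity is highest.
top_per_year = yearly.loc[yearly.groupby('release_year')['popularity'].idxmax()]

print(top_per_year)
```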
Mean popularity grouped by vote average, sorted ascending
rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df
Plot the relationship between Vote Average and Popularity
genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')
Find the correlation between vote average and popularity
np.corrcoef(genres_df.vote_average, genres_df.popularity)
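`np.corrcoef` returns a 2x2 matrix; the off-diagonal entry is the Pearson correlation coefficient. As a hedged sketch on made-up numbers (not the TMDb values), the example below pulls out that single coefficient and also computes a rank-based (Spearman) correlation as Pearson on the ranks. A rank-based measure may be worth checking if popularity turns out to be heavily skewed, which is an assumption to verify against the real data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'vote_average': [5.0, 6.0, 6.5, 7.0, 8.0],
    'popularity': [0.2, 0.4, 0.3, 1.5, 9.0],
})

# Full 2x2 correlation matrix, as in the cell above.
matrix = np.corrcoef(toy['vote_average'], toy['popularity'])

# The same Pearson coefficient as a single number.
pearson = toy['vote_average'].corr(toy['popularity'])

# Spearman correlation: Pearson correlation computed on the ranks.
spearman = toy['vote_average'].rank().corr(toy['popularity'].rank())

print(f'pearson={pearson:.3f}, spearman={spearman:.3f}')
```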
These conclusions come with some limitations that may make the findings inconclusive.
1) Missing data
All rows with missing data in Cast, Director, or Genres were dropped
2) Vote counts were, for the most part, low, which likely skewed the results
Older titles had far fewer votes, since TMDb was not as widely used (or did not yet exist) when they were released
Titles from more recent years had much more data
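One way to check the second limitation would be to compare vote counts across release decades. The sketch below does this on a made-up frame with the `release_year` and `vote_count` columns (the numbers are invented for illustration, and `votes_by_decade` is a name introduced here); running the same grouping on the real data would show whether older titles do have fewer votes.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    'release_year': [1965, 1968, 1984, 1999, 2011, 2014],
    'vote_count': [12, 30, 45, 210, 900, 3000],
})

# Map each year to the start of its decade, e.g. 1968 -> 1960.
decade = (toy['release_year'] // 10) * 10

# Median vote count per decade; the median is less sensitive to a few
# very heavily voted films than the mean.
votes_by_decade = toy.groupby(decade)['vote_count'].median()

print(votes_by_decade)
```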
The top 5 genres over the years are:
1) Adventure
2) Science Fiction
3) Fantasy
4) Action
5) Animation
There was a weak positive correlation between a film's popularity and its vote average score.