Project: TMDb Data Analysis


Introduction

In this data analysis, we look at information about roughly 10,000 movies from The Movie Database (TMDb). We examine which genres were most popular from year to year and explore the relationship between a film's popularity and its vote average score.

Dataset analyzed: TMDb Data

Questions to explore: Which genres were most popular throughout the years? Is there a correlation between popularity and vote average score of a film?

# Import the packages used in this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# IPython magic so that plots render inline in the notebook. See:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline

Data Wrangling

In this section of the report, we load the data, check it for cleanliness, and then trim and clean the dataset for analysis.

General Properties

df= pd.read_csv('tmdb_movies.csv', sep=',')
df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
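
The non-null counts in the `df.info()` output above (for example, only 2936 of 10866 rows have a `homepage`) can also be read directly as per-column missing-value counts. The following is a minimal sketch on a small invented frame; `sample` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the TMDb frame; all values are invented.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", np.nan, "z", "w"],
    "genres": ["Drama", "Comedy", np.nan, np.nan],
})

# Count missing entries per column, largest first.
missing_counts = sample.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

On the real data, the equivalent call would be `df.isnull().sum()`.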

#drop columns not needed
df.drop(['imdb_id', 'id', 'budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies', 'release_date'], axis=1, inplace=True)

#check that this is correct
df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 11 columns):
popularity        10866 non-null float64
original_title    10866 non-null object
cast              10790 non-null object
director          10822 non-null object
runtime           10866 non-null int64
genres            10843 non-null object
vote_count        10866 non-null int64
vote_average      10866 non-null float64
release_year      10866 non-null int64
budget_adj        10866 non-null float64
revenue_adj       10866 non-null float64
dtypes: float64(4), int64(3), object(4)
memory usage: 933.9+ KB

#drop all 'missing values' rows
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10732 entries, 0 to 10865
Data columns (total 11 columns):
popularity        10732 non-null float64
original_title    10732 non-null object
cast              10732 non-null object
director          10732 non-null object
runtime           10732 non-null int64
genres            10732 non-null object
vote_count        10732 non-null int64
vote_average      10732 non-null float64
release_year      10732 non-null int64
budget_adj        10732 non-null float64
revenue_adj       10732 non-null float64
dtypes: float64(4), int64(3), object(4)
memory usage: 1006.1+ KB
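
After the column drop, only `cast`, `director`, and `genres` contain missing values, so the plain `dropna()` above removes rows lacking any of those three (10866 rows before, 10732 after). A sketch of making that choice explicit with the `subset` argument is shown below; the frame `sample` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: only "A" is complete in all three columns of interest.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", np.nan, "z", "w"],
    "director": ["d1", "d2", "d3", np.nan],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
})

rows_before = len(sample)
# Drop only rows that are missing one of the named columns.
cleaned = sample.dropna(subset=["cast", "director", "genres"])
rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

Naming the columns keeps the cleaning decision visible if further columns with missing values are added later.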

df.describe()

#all unique values in genres
df.genres.unique()

array(['Action|Adventure|Science Fiction|Thriller',
       'Adventure|Science Fiction|Thriller',
       'Action|Adventure|Science Fiction|Fantasy', ...,
       'Adventure|Drama|Action|Family|Foreign',
       'Comedy|Family|Mystery|Romance',
       'Mystery|Science Fiction|Thriller|Drama'], dtype=object)

#create new DF from the series with original_title as index; splitting up genres sep by pipes
new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()

# We now want to get rid of the secondary index
# To do this, we will make original_title as a column (it can't be an index since the values will be duplicate)
new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)
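
As a possible alternative to the split/stack/reset steps above, pandas offers `DataFrame.explode`, which produces one row per list element while keeping the other columns. This is a sketch only, assuming pandas 0.25 or later; the frame `films` and its values are invented.

```python
import pandas as pd

# Invented two-film frame with pipe-delimited genres.
films = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "genres": ["Action|Adventure", "Drama"],
    "popularity": [2.5, 0.4],
})

# Split each genres string into a list, then emit one row per list item.
exploded = (
    films.assign(mgenres=films["genres"].str.split("|"))
         .explode("mgenres")
         .drop(columns="genres")
         .reset_index(drop=True)
)
print(exploded)
```

Because the other columns travel with each exploded row, this route needs no later merge back onto the original frame.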

#combine the new_df with the original df

genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)
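
One caveat worth checking with this merge: joining on `original_title` pairs every film row with every genre row that carries the same title, so distinct films sharing a title would be multiplied. How often titles repeat in this dataset is not shown in the notebook; the sketch below uses invented data to illustrate the effect and a quick duplicate check.

```python
import pandas as pd

# Two distinct (hypothetical) films that happen to share a title.
films = pd.DataFrame({
    "original_title": ["Same Title", "Same Title"],
    "release_year": [1990, 2000],
    "genres": ["Drama", "Drama|Romance"],
})

# Long table of (title, genre) pairs: 1 genre for the first film, 2 for the second.
title_genres = pd.DataFrame({
    "original_title": ["Same Title", "Same Title", "Same Title"],
    "mgenres": ["Drama", "Drama", "Romance"],
})

# Count titles that repeat an earlier row's title.
dup_titles = films["original_title"].duplicated().sum()

# The merge yields 2 x 3 = 6 rows, although only 3 film-genre pairs exist.
merged = pd.merge(films, title_genres, on="original_title")
print(dup_titles, len(merged))
```

If `df['original_title'].duplicated().sum()` is nonzero on the real data, merging on a unique key instead of the title would avoid the inflation.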

#drop the old genres column and check

genres_df.drop(['genres'], axis=1, inplace=True)

genres_df.head(5)

genres_df.plot(x='release_year',y='popularity', kind='scatter' );
plt.title('Popularity by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Popularity')

Text(0, 0.5, 'Popularity')

genres_df.plot(x='release_year',y='vote_average', kind='scatter');
plt.title('Vote Average by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Vote Average')

Text(0, 0.5, 'Vote Average')

#check datatype
type(genres_df['popularity'][0])

numpy.float64

# convert popularity from float to int
genres_df['popularity'] = genres_df['popularity'].astype(int)

#check datatype
type(genres_df['popularity'][0])

numpy.int32

# convert vote_average from float to int
genres_df['vote_average'] = genres_df['vote_average'].astype(int)

#check datatype
type(genres_df['vote_average'][0])

numpy.int32
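
Note that `astype(int)` truncates toward zero rather than rounding, so any popularity below 1 becomes 0. The sketch below applies it to the popularity quartiles and maximum reported by `df.describe()` (0.210766, 0.387136, 0.720621, 32.985763); the variable names are illustrative.

```python
import pandas as pd

# Popularity 25%, 50%, 75%, and max values from the df.describe() output.
scores = pd.Series([0.210766, 0.387136, 0.720621, 32.985763])

# Integer conversion drops the fractional part.
as_int = scores.astype(int)

# Rounding to one decimal keeps most of the information instead.
rounded = scores.round(1)

print(as_int.tolist())   # [0, 0, 0, 32]
print(rounded.tolist())
```

This is consistent with the later `describe()` output for the converted column, where the quartiles are all 0 and the maximum is 32.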

#find the mean popularity score of each genre type with groupby
genres_df.groupby('mgenres')['popularity'].mean()

mgenres
Action             0.522476
Adventure          0.724888
Animation          0.450147
Comedy             0.222933
Crime              0.323856
Documentary        0.012422
Drama              0.214612
Family             0.401550
Fantasy            0.602866
Foreign            0.005263
History            0.187320
Horror             0.145810
Music              0.170616
Mystery            0.307868
Romance            0.217295
Science Fiction    0.614319
TV Movie           0.039326
Thriller           0.338231
War                0.336918
Western            0.236686
Name: popularity, dtype: float64

#find the 25%, 50%, 75%, and max popularity values with Pandas describe
genres_df.describe().popularity

count    28403.000000
mean         0.329402
std          1.068200
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         32.000000
Name: popularity, dtype: float64

#top films grouped by genres and popularity means, sorting by top 5
topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)

topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top Genres over Time')
plt.xlabel('Genres')
plt.ylabel('Popularity')

Text(0, 0.5, 'Popularity')

#mean popularity grouped by vote average, sorted ascending
rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df

vote_average
1    0.000000
2    0.000000
9    0.000000
3    0.049751
4    0.064151
5    0.177095
6    0.350739
7    0.893360
8    1.881988
Name: popularity, dtype: float64

genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')

Text(0, 0.5, 'Popularity')

np.corrcoef(genres_df.vote_average, genres_df.popularity)

array([[1.        , 0.21209355],
       [0.21209355, 1.        ]])

Exploratory Data Analysis

With the data trimmed and cleaned, we move on to exploration: computing statistics and creating visualizations that address the research questions posed in the Introduction.

Research Question 1: Which genres were most popular over the years?

1. Separate the genres from the genres column into a new column, mgenres

Split the pipe-separated genres into one entry per genre, indexed by original_title:

new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()

We now want to get rid of the secondary index. To do this, we turn original_title into a column (it can't stay as the index because its values are duplicated).

new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)

Combine new_df with the original df:

genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)

Drop the old genres column and check the result:

genres_df.drop(['genres'], axis=1, inplace=True)

genres_df.head(5)

Find the mean popularity score of each genre with groupby:

genres_df.groupby('mgenres')['popularity'].mean()

Rank the genres by mean popularity and keep the top 5:

topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)

Plot the top 5 genres by mean popularity score:

topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top Genres over Time')
plt.xlabel('Genres')
plt.ylabel('Popularity')

Text(0, 0.5, 'Popularity')
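
The bar chart above averages popularity across all years. To look at the question year by year, one option is to group by both year and genre and then keep the top genre per year. The following is a sketch under assumptions: `long_df` is an invented stand-in for `genres_df`, and `top_by_year` is a hypothetical name.

```python
import pandas as pd

# Invented long-format data: one row per (film, genre) pair.
long_df = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "mgenres": ["Action", "Drama", "Action", "Drama", "Action", "Drama"],
    "popularity": [1.0, 0.5, 3.0, 2.0, 1.5, 4.0],
})

# Mean popularity for each (year, genre) pair, as a flat DataFrame.
yearly = long_df.groupby(["release_year", "mgenres"], as_index=False)["popularity"].mean()

# Sort so the most popular genre comes first, then keep one row per year.
top_by_year = (
    yearly.sort_values("popularity", ascending=False)
          .drop_duplicates("release_year")
          .sort_values("release_year")
          .set_index("release_year")["mgenres"]
)
print(top_by_year)
```

With this toy data, the result maps 2014 to Action and 2015 to Drama.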

Research Question 2: Does the popularity of a movie correlate with its vote average?

Mean popularity grouped by vote average:

rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df

Plot the relationship between Vote Average and Popularity

genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')

Text(0, 0.5, 'Popularity')

Find the correlation

np.corrcoef(genres_df.vote_average, genres_df.popularity)

array([[1.        , 0.21209355],
       [0.21209355, 1.        ]])
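
`np.corrcoef` returns a 2x2 matrix: the diagonal entries are each variable's correlation with itself (always 1), and the off-diagonal entry is the Pearson coefficient r between the two variables. The sketch below computes r by hand on invented values and compares it with the NumPy result; `x` and `y` are stand-ins, not data from the notebook.

```python
import numpy as np

x = np.array([5.0, 6.0, 6.5, 7.0, 8.0])   # stand-in for vote_average
y = np.array([0.2, 0.4, 0.3, 0.9, 1.8])   # stand-in for popularity

# Pearson r: sum of co-deviations over the root of the product of squared deviations.
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The off-diagonal entry of the correlation matrix is the same quantity.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Pearson r measures linear association only, so a value near 0.21 indicates a weak linear relationship rather than the absence of any relationship.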

Conclusions

These findings should be treated as tentative because of the following limitations:

1) Missing data
    All rows with missing values in cast, director, or genres were dropped (134 of 10,866 rows)
2) Vote counts were mostly low (a median of 39 votes per film), which likely skewed results
    Older titles had far fewer votes, since TMDb was not as widely used (or did not yet exist) when they were released
    Titles from more recent years had much more data

The top 5 genres by mean popularity score across all years were:

1) Adventure
2) Science Fiction
3) Fantasy
4) Action
5) Animation

There was a weak positive correlation (r ≈ 0.21) between a film's popularity and its vote average.

Tabular cell outputs, in order: df.head() on the raw data, df.head() after dropping unused columns, df.describe(), and genres_df.head(5) before and after dropping the genres column.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	popularity	original_title	cast	director	runtime	genres	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09
1	28.419936	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	George Miller	120	Action\|Adventure\|Science Fiction\|Thriller	6185	7.1	2015	1.379999e+08	3.481613e+08
2	13.112507	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	Robert Schwentke	119	Adventure\|Science Fiction\|Thriller	2480	6.3	2015	1.012000e+08	2.716190e+08
3	11.173104	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	J.J. Abrams	136	Action\|Adventure\|Science Fiction\|Fantasy	5292	7.5	2015	1.839999e+08	1.902723e+09
4	9.335014	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	James Wan	137	Action\|Crime\|Thriller	2947	7.3	2015	1.747999e+08	1.385749e+09

	popularity	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10732.000000	10732.000000	10732.000000	10732.000000	10732.000000	1.073200e+04	1.073200e+04
mean	0.652609	102.467853	219.802739	5.964620	2001.260436	1.776644e+07	5.200147e+07
std	1.004757	30.492619	578.789325	0.930286	12.819831	3.446490e+07	1.454192e+08
min	0.000188	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	0.210766	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	0.387136	99.000000	39.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	0.720621	112.000000	148.000000	6.600000	2011.000000	2.111556e+07	3.470526e+07
max	32.985763	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	popularity	original_title	cast	director	runtime	genres	vote_count	vote_average	release_year	budget_adj	revenue_adj	mgenres
0	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09	Action
1	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09	Adventure
2	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09	Science Fiction
3	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09	Thriller
4	28.419936	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	George Miller	120	Action\|Adventure\|Science Fiction\|Thriller	6185	7.1	2015	1.379999e+08	3.481613e+08	Action

	popularity	original_title	cast	director	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj	mgenres
0	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	5562	6.5	2015	1.379999e+08	1.392446e+09	Action
1	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	5562	6.5	2015	1.379999e+08	1.392446e+09	Adventure
2	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	5562	6.5	2015	1.379999e+08	1.392446e+09	Science Fiction
3	32.985763	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	5562	6.5	2015	1.379999e+08	1.392446e+09	Thriller
4	28.419936	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	George Miller	120	6185	7.1	2015	1.379999e+08	3.481613e+08	Action