Project: TMDb Data Analysis

Table of Contents

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

    Introduction

    In this analysis, we look at information about roughly 10,000 movies from The Movie Database (TMDb). We examine which genres were most popular from year to year and explore the relationship between a film's popularity and its vote average score.

    Dataset analyzed: TMDb Data

    Questions to explore: Which genres were most popular throughout the years? Is there a correlation between popularity and vote average score of a film?

    In [1]:
    # Import the packages used in this analysis.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Magic command so that plots render inline in the notebook. See:
    #   http://ipython.readthedocs.io/en/stable/interactive/magics.html
    %matplotlib inline
    

    Data Wrangling

    Tip: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

    General Properties

    In [2]:
    # load the TMDb dataset (comma-separated, the read_csv default)
    df = pd.read_csv('tmdb_movies.csv')
    df.head()
    
    Out[2]:
    id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
    1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
    2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
    3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
    4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

    5 rows × 21 columns

    In [3]:
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10866 entries, 0 to 10865
    Data columns (total 21 columns):
    id                      10866 non-null int64
    imdb_id                 10856 non-null object
    popularity              10866 non-null float64
    budget                  10866 non-null int64
    revenue                 10866 non-null int64
    original_title          10866 non-null object
    cast                    10790 non-null object
    homepage                2936 non-null object
    director                10822 non-null object
    tagline                 8042 non-null object
    keywords                9373 non-null object
    overview                10862 non-null object
    runtime                 10866 non-null int64
    genres                  10843 non-null object
    production_companies    9836 non-null object
    release_date            10866 non-null object
    vote_count              10866 non-null int64
    vote_average            10866 non-null float64
    release_year            10866 non-null int64
    budget_adj              10866 non-null float64
    revenue_adj             10866 non-null float64
    dtypes: float64(4), int64(6), object(11)
    memory usage: 1.7+ MB
    
    In [4]:
    #drop columns not needed
    df.drop(['imdb_id', 'id', 'budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies', 'release_date'], axis=1, inplace=True)
    
    In [5]:
    #check that this is correct
    df.head()
    
    Out[5]:
    popularity original_title cast director runtime genres vote_count vote_average release_year budget_adj revenue_adj
    0 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09
    1 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller 120 Action|Adventure|Science Fiction|Thriller 6185 7.1 2015 1.379999e+08 3.481613e+08
    2 13.112507 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke 119 Adventure|Science Fiction|Thriller 2480 6.3 2015 1.012000e+08 2.716190e+08
    3 11.173104 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams 136 Action|Adventure|Science Fiction|Fantasy 5292 7.5 2015 1.839999e+08 1.902723e+09
    4 9.335014 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan 137 Action|Crime|Thriller 2947 7.3 2015 1.747999e+08 1.385749e+09
    In [6]:
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10866 entries, 0 to 10865
    Data columns (total 11 columns):
    popularity        10866 non-null float64
    original_title    10866 non-null object
    cast              10790 non-null object
    director          10822 non-null object
    runtime           10866 non-null int64
    genres            10843 non-null object
    vote_count        10866 non-null int64
    vote_average      10866 non-null float64
    release_year      10866 non-null int64
    budget_adj        10866 non-null float64
    revenue_adj       10866 non-null float64
    dtypes: float64(4), int64(3), object(4)
    memory usage: 933.9+ KB
    
    In [7]:
    #drop all 'missing values' rows
    df.dropna(inplace=True)
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10732 entries, 0 to 10865
    Data columns (total 11 columns):
    popularity        10732 non-null float64
    original_title    10732 non-null object
    cast              10732 non-null object
    director          10732 non-null object
    runtime           10732 non-null int64
    genres            10732 non-null object
    vote_count        10732 non-null int64
    vote_average      10732 non-null float64
    release_year      10732 non-null int64
    budget_adj        10732 non-null float64
    revenue_adj       10732 non-null float64
    dtypes: float64(4), int64(3), object(4)
    memory usage: 1006.1+ KB
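
As a minimal sketch of the cleaning step above, missing values can be counted per column before any rows are dropped, and dropna can be limited to the columns the analysis depends on. The `sample` frame and its values are made up for illustration; they are not taken from the TMDb file.

```python
import pandas as pd

# Hypothetical stand-in for the TMDb frame; values are invented.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", None, "z", "w"],
    "director": ["d1", "d2", None, "d4"],
    "genres": ["Action|Drama", "Comedy", "Drama", None],
    "popularity": [1.2, 0.4, 0.9, 0.1],
})

# Count missing values per column before deciding what to drop.
missing_per_column = sample.isna().sum()

# Drop only rows that lack a value in the columns the analysis relies on.
cleaned = sample.dropna(subset=["cast", "director", "genres"])
rows_dropped = len(sample) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Recording the number of dropped rows makes it easier to justify the cleaning decision later in the report.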
    
    In [8]:
    df.describe()
    
    Out[8]:
             popularity       runtime    vote_count  vote_average  release_year    budget_adj   revenue_adj
    count  10732.000000  10732.000000  10732.000000  10732.000000  10732.000000  1.073200e+04  1.073200e+04
    mean       0.652609    102.467853    219.802739      5.964620   2001.260436  1.776644e+07  5.200147e+07
    std        1.004757     30.492619    578.789325      0.930286     12.819831  3.446490e+07  1.454192e+08
    min        0.000188      0.000000     10.000000      1.500000   1960.000000  0.000000e+00  0.000000e+00
    25%        0.210766     90.000000     17.000000      5.400000   1995.000000  0.000000e+00  0.000000e+00
    50%        0.387136     99.000000     39.000000      6.000000   2006.000000  0.000000e+00  0.000000e+00
    75%        0.720621    112.000000    148.000000      6.600000   2011.000000  2.111556e+07  3.470526e+07
    max       32.985763    900.000000   9767.000000      9.200000   2015.000000  4.250000e+08  2.827124e+09
    In [9]:
    #all unique values in genres
    df.genres.unique()
    
    Out[9]:
    array(['Action|Adventure|Science Fiction|Thriller',
           'Adventure|Science Fiction|Thriller',
           'Action|Adventure|Science Fiction|Fantasy', ...,
           'Adventure|Drama|Action|Family|Foreign',
           'Comedy|Family|Mystery|Romance',
           'Mystery|Science Fiction|Thriller|Drama'], dtype=object)
    In [10]:
    # split the pipe-separated genres into one entry per genre, indexed by original_title
    new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()
    
    In [11]:
    # Remove the secondary index level.
    # To do this, turn original_title back into a column (it cannot stay as the index because its values are duplicated)
    new_df = new_df.reset_index([0, 'original_title'])
    new_df.columns = ['original_title', 'mgenres']
    new_df.head(5)
    
    Out[11]:
       original_title      mgenres
    0  Jurassic World      Action
    1  Jurassic World      Adventure
    2  Jurassic World      Science Fiction
    3  Jurassic World      Thriller
    4  Mad Max: Fury Road  Action
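
An alternative sketch for the genre split, assuming pandas 0.25 or later, where `DataFrame.explode` is available. It keeps every column attached to its own row, so no merge on original_title is needed and films that happen to share a title are not cross-matched. The `sample` frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical rows; "Film A" appears twice on purpose (two different films).
sample = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film A"],
    "release_year": [2015, 2015, 1990],
    "popularity": [3.0, 1.0, 0.5],
    "genres": ["Action|Adventure", "Drama", "Comedy"],
})

# Split the pipe-separated string into a list, then emit one row per genre.
genres_long = (
    sample.assign(mgenres=sample["genres"].str.split("|"))
    .explode("mgenres")
    .drop(columns="genres")
    .reset_index(drop=True)
)

print(genres_long)
```

Each film contributes exactly as many rows as it has genres, regardless of whether its title is unique.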
    In [12]:
    #combine the new_df with the original df
    
    genres_df= pd.merge(df, new_df, on='original_title')
    genres_df.head(5)
    
    Out[12]:
    popularity original_title cast director runtime genres vote_count vote_average release_year budget_adj revenue_adj mgenres
    0 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09 Action
    1 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09 Adventure
    2 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09 Science Fiction
    3 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09 Thriller
    4 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller 120 Action|Adventure|Science Fiction|Thriller 6185 7.1 2015 1.379999e+08 3.481613e+08 Action
    In [13]:
    #drop the old genres column and check
    
    genres_df.drop(['genres'], axis=1, inplace=True)
    
    genres_df.head(5)
    
    Out[13]:
    popularity original_title cast director runtime vote_count vote_average release_year budget_adj revenue_adj mgenres
    0 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 5562 6.5 2015 1.379999e+08 1.392446e+09 Action
    1 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 5562 6.5 2015 1.379999e+08 1.392446e+09 Adventure
    2 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 5562 6.5 2015 1.379999e+08 1.392446e+09 Science Fiction
    3 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 5562 6.5 2015 1.379999e+08 1.392446e+09 Thriller
    4 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller 120 6185 7.1 2015 1.379999e+08 3.481613e+08 Action
    In [14]:
    genres_df.plot(x='release_year',y='popularity', kind='scatter' );
    plt.title('Popularity by Release Year')
    plt.xlabel('Release Year')
    plt.ylabel('Popularity')
    
    Out[14]:
    Text(0, 0.5, 'Popularity')
    In [15]:
    genres_df.plot(x='release_year',y='vote_average', kind='scatter');
    plt.title('Vote Average by Release Year')
    plt.xlabel('Release Year')
    plt.ylabel('Vote Average')
    
    Out[15]:
    Text(0, 0.5, 'Vote Average')
    In [16]:
    #check datatype
    type(genres_df['popularity'][0])
    
    Out[16]:
    numpy.float64
    In [17]:
    # convert popularity from float to int
    genres_df['popularity'] = genres_df['popularity'].astype(int)
    
    #check datatype
    type(genres_df['popularity'][0])
    
    Out[17]:
    numpy.int32
    In [18]:
    # convert vote_average from float to int
    genres_df['vote_average'] = genres_df['vote_average'].astype(int)
    
    #check datatype
    type(genres_df['vote_average'][0])
    
    Out[18]:
    numpy.int32
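
One thing to keep in mind with the conversions above: `astype(int)` discards the fractional part rather than rounding. The sketch below uses a few hand-picked values (not a sample drawn from the dataset) to show the effect next to a rounded float alternative.

```python
import pandas as pd

# Hand-picked popularity-like scores for illustration.
popularity = pd.Series([0.387136, 0.720621, 1.95, 32.985763])

# Casting to int truncates toward zero, so every score below 1 becomes 0.
truncated = popularity.astype(int)

# Rounding keeps the values as floats and preserves the differences between them.
rounded = popularity.round(2)

print(truncated.tolist())
print(rounded.tolist())
```

Because most popularity scores in this dataset fall below 1, the quartiles reported by describe() after the conversion are all 0.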
    In [19]:
    #find the mean popularity score of each genre type with groupby
    genres_df.groupby('mgenres')['popularity'].mean()
    
    Out[19]:
    mgenres
    Action             0.522476
    Adventure          0.724888
    Animation          0.450147
    Comedy             0.222933
    Crime              0.323856
    Documentary        0.012422
    Drama              0.214612
    Family             0.401550
    Fantasy            0.602866
    Foreign            0.005263
    History            0.187320
    Horror             0.145810
    Music              0.170616
    Mystery            0.307868
    Romance            0.217295
    Science Fiction    0.614319
    TV Movie           0.039326
    Thriller           0.338231
    War                0.336918
    Western            0.236686
    Name: popularity, dtype: float64
    In [20]:
    #find the 25%, 50%, 75%, and max popularity values with Pandas describe
    genres_df.describe().popularity
    
    Out[20]:
    count    28403.000000
    mean         0.329402
    std          1.068200
    min          0.000000
    25%          0.000000
    50%          0.000000
    75%          0.000000
    max         32.000000
    Name: popularity, dtype: float64
    In [21]:
    #mean popularity per genre, sorted ascending and keeping the 5 highest
    topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)
    
    In [22]:
    topfilms_df.plot(kind= 'bar', color='#3caea3')
    plt.title('Top Genres over Time')
    plt.xlabel('Genres')
    plt.ylabel('Popularity')
    
    Out[22]:
    Text(0, 0.5, 'Popularity')
    In [23]:
    #mean popularity for each (integer) vote average, sorted ascending
    rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
    rated_df
    
    Out[23]:
    vote_average
    1    0.000000
    2    0.000000
    9    0.000000
    3    0.049751
    4    0.064151
    5    0.177095
    6    0.350739
    7    0.893360
    8    1.881988
    Name: popularity, dtype: float64
    In [24]:
    genres_df.plot(x='vote_average',y='popularity', kind='scatter');
    plt.title('Vote Average Correlation with Popularity')
    plt.xlabel('Vote Average')
    plt.ylabel('Popularity')
    
    Out[24]:
    Text(0, 0.5, 'Popularity')
    In [25]:
    np.corrcoef(genres_df.vote_average, genres_df.popularity)
    
    Out[25]:
    array([[1.        , 0.21209355],
           [0.21209355, 1.        ]])

    Exploratory Data Analysis

    Tip: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

    1. Separate the genres from the genres column into a new column, mgenres

    Create a new DataFrame from the genres series, indexed by original_title, splitting the pipe-separated genres:

    new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()

    Next, remove the secondary index level. To do this, turn original_title back into a column (it cannot stay as the index because its values are duplicated):

    new_df = new_df.reset_index([0, 'original_title'])
    new_df.columns = ['original_title', 'mgenres']
    new_df.head(5)

    Combine new_df with the original df:

    genres_df= pd.merge(df, new_df, on='original_title')
    genres_df.head(5)

    Drop the old genres column and check:

    genres_df.drop(['genres'], axis=1, inplace=True)
    
    genres_df.head(5)

    Find the mean popularity score of each genre with groupby:

    genres_df.groupby('mgenres')['popularity'].mean()

    Find the five genres with the highest mean popularity:

    topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)

    Plot the top 5 genres by popularity score:

    In [26]:
    topfilms_df.plot(kind= 'bar', color='#3caea3')
    plt.title('Top Genres over Time')
    plt.xlabel('Genres')
    plt.ylabel('Popularity')
    
    Out[26]:
    Text(0, 0.5, 'Popularity')
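
The first research question asks about popularity from year to year, which an overall bar chart does not break out. A possible sketch of a per-year view follows, assuming a long-format frame with release_year, mgenres, and popularity columns like genres_df. The rows below are invented for illustration, and `genres_long`, `yearly_means`, and `top_per_year` are hypothetical names.

```python
import pandas as pd

# Invented long-format rows: one row per (film, genre).
genres_long = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "mgenres": ["Action", "Drama", "Action", "Drama", "Adventure", "Adventure"],
    "popularity": [2.0, 1.0, 4.0, 0.5, 3.0, 5.0],
})

# Mean popularity for every (year, genre) pair.
yearly_means = (
    genres_long.groupby(["release_year", "mgenres"])["popularity"]
    .mean()
    .reset_index()
)

# For each year, pick the row whose mean popularity is highest.
top_rows = yearly_means.groupby("release_year")["popularity"].idxmax()
top_per_year = yearly_means.loc[top_rows].set_index("release_year")["mgenres"]

print(top_per_year)
```

The resulting series maps each release year to its most popular genre, which could then be tabulated or plotted.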

    Research Question 2- Does the Popularity of a movie correlate with the Vote Score Average?

    Mean popularity for each vote average, sorted ascending:

    rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
    rated_df

    Plot the relationship between Vote Average and Popularity

    genres_df.plot(x='vote_average',y='popularity', kind='scatter');
    In [27]:
    genres_df.plot(x='vote_average',y='popularity', kind='scatter');
    plt.title('Vote Average Correlation with Popularity')
    plt.xlabel('Vote Average')
    plt.ylabel('Popularity')
    
    Out[27]:
    Text(0, 0.5, 'Popularity')

    Find the correlation

    In [28]:
    np.corrcoef(genres_df.vote_average, genres_df.popularity)
    
    Out[28]:
    array([[1.        , 0.21209355],
           [0.21209355, 1.        ]])
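
As a cross-check, the same Pearson coefficient can be read from pandas directly instead of from the 2x2 matrix that np.corrcoef returns. This is a minimal sketch on invented values; `films` is a hypothetical stand-in for a frame with one row per film and float scores.

```python
import numpy as np
import pandas as pd

# Invented scores for illustration only.
films = pd.DataFrame({
    "vote_average": [5.0, 6.0, 6.5, 7.1, 7.5],
    "popularity": [0.2, 0.4, 0.9, 2.8, 3.3],
})

# Series.corr defaults to the Pearson correlation coefficient.
r_pandas = films["vote_average"].corr(films["popularity"])

# np.corrcoef returns a matrix; the off-diagonal entry is the same coefficient.
r_numpy = np.corrcoef(films["vote_average"], films["popularity"])[0, 1]

print(round(r_pandas, 4), round(r_numpy, 4))
```

Computing the coefficient on one row per film, rather than on the genre-expanded frame, avoids counting multi-genre films several times.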

    Conclusions

    These findings have some limitations and should not be treated as conclusive.

    1) Missing data
        All rows with missing values in cast, director, or genres were dropped (134 of 10,866 rows)
    2) Most vote counts were low (median of 39 votes per film), which likely skewed the results
        Older titles had far fewer votes, since TMDb was not as widely used (or did not yet exist)
        Titles from more recent years had much more data

    The top 5 genres over the years, ranked by mean popularity, are:

    1) Adventure
    2) Science Fiction
    3) Fantasy
    4) Action
    5) Animation

    There was a weak positive correlation (coefficient of about 0.21) between a film's popularity and its vote average score.