Communicate Data Findings

Ford GoBike Data

Dataset

Ford GoBike trip data from 2018/01 through 2019/07, downloaded from https://www.fordgobike.com/system-data. The data covers each ride's start and end date, station information and location, and the rider's member type, gender, and age.

Structure of data:

  • After cleanup, 2.97 million entries remain, restricted to:
    • members 60 years old or younger
    • members labeled as M or F
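
The cleanup code itself is not shown here. A minimal sketch of the two filters listed above, run on a made-up sample (the rows and the name `tbikedata_sample` are assumptions for illustration only), might look like this:

```python
import pandas as pd

# Hypothetical mini-sample; the real notebook works on the Ford GoBike trip files.
raw = pd.DataFrame({
    "member_age": [25, 34, 61, 47, 80, 60],
    "member_gender": ["M", "F", "M", "O", "F", "F"],
    "duration_min": [8.5, 12.0, 30.2, 5.1, 15.0, 9.9],
})

# Keep riders aged 60 or younger whose gender is recorded as M or F.
mask = (raw["member_age"] <= 60) & raw["member_gender"].isin(["M", "F"])
tbikedata_sample = raw.loc[mask].reset_index(drop=True)

print(len(raw), "->", len(tbikedata_sample), "rows after cleanup")
```

Combining both conditions into one boolean mask keeps the cleanup in a single pass over the data.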

Focus of Analysis:

  • Ride duration in minutes (duration_min)
  • User Type (user_type)
  • Gender (member_gender)
  • Age (member_age)

Age Distribution

The main age groups fall between 20 and 60. We cut off all users over 60 and break the remaining ages into bins.
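
The binning step is not shown in this notebook. One way to sketch it is with `pd.cut`; the edges, labels, and the choice of which bin an edge value such as 30 falls into are assumptions here, not the notebook's actual settings:

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 41, 59, 60], name="member_age")

# Assumed edges and labels; the notebook's actual binning code is not shown.
edges = [10, 20, 30, 40, 50, 60]
labels = ["10-20", "20-30", "30-40", "40-50", "50-60"]

# right=True puts an age equal to an upper edge in the lower bin (30 -> "20-30").
# include_lowest=True makes the first bin include its lower edge (10).
member_age_bins = pd.cut(ages, bins=edges, labels=labels, right=True, include_lowest=True)
print(member_age_bins.tolist())
```

Ages outside the edges become missing values, so this kind of binning pairs naturally with the cutoff at 60.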

In [3]:
# age distribution as a horizontal boxplot
plt.figure(figsize=(10,4))
sb.boxplot(x='member_age', data=tbikedata, palette='Blues', orient='h')
plt.title("Age Distribution", fontsize=16, y=1)
plt.xlabel("member_age", fontsize=12, labelpad=15)
plt.ylabel("");  # a single boxplot has no count axis, so the y label is left empty

Univariate Exploration

We will look at bike ride trends in terms of:

  • age groups (member_age_bins)
  • genders (member_gender)
  • weekday (st_weekday, et_weekday)
  • and hours of the day (st_hour, et_hour)
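
The time columns listed above are derived in cells not shown here. A minimal sketch of how such columns could be built from a ride's start timestamp follows; the column name `start_time` and the `YY-MM` format are assumptions:

```python
import pandas as pd

# Hypothetical rides; `start_time` is an assumed column name.
rides = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-01-01 08:15:00", "2019-07-06 17:40:00"]),
})

rides["st_hour"] = rides["start_time"].dt.hour            # 0-23
rides["st_weekday"] = rides["start_time"].dt.day_name()   # e.g. "Monday"
rides["YYMM"] = rides["start_time"].dt.strftime("%y-%m")  # e.g. "18-01"
print(rides)
```

The same accessors applied to an end timestamp would give `et_hour` and `et_weekday`.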

Percentage of Rides by Age Groups

Most rides were taken by users in the 20-30 and 30-40 age groups.

In [4]:
# separate DF to look into and analyze age groups/bins
age_df = tbikedata.groupby('member_age_bins').agg({'bike_id': 'count'})

#percentage
age_df['perc'] = (age_df['bike_id'] / age_df['bike_id'].sum())*100

#plot
age_df['perc'].plot(kind='bar', figsize=(8,5))
plt.title('Percentage of Rides by Age Groups', fontsize=20, y=1)
plt.xlabel('Member Age Group', labelpad=15)
plt.ylabel('Percentage % ', labelpad=15)
plt.xticks(rotation=360);
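
The count-then-percentage pattern in the cell above recurs for gender, hour, and weekday. As a sketch, it could be wrapped in a small helper; `percent_by` is a hypothetical name, and it uses `value_counts(normalize=True)` rather than the `groupby`/`agg` approach used in this notebook:

```python
import pandas as pd

def percent_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return each category's share of rows in percent (hypothetical helper)."""
    # value_counts skips missing values in `column` by default.
    return df[column].value_counts(normalize=True).sort_index() * 100

# Made-up sample, not the real data.
sample = pd.DataFrame({"member_gender": ["M", "M", "M", "F"]})
shares = percent_by(sample, "member_gender")
print(shares)
```

The result can be plotted the same way as the `perc` column, for example with `shares.plot(kind='bar')`.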

Percentage of Rides by Genders

There are more male (M) riders than female (F) riders.

In [5]:
# separate DF to look into and analyze genders
gender_df = tbikedata.groupby('member_gender').agg({'bike_id':'count'})

#percentage
gender_df['perc'] = (gender_df['bike_id'] / gender_df['bike_id'].sum())*100

# plot
gender_df['perc'].plot(kind='barh', figsize=(8,5))
plt.title('Percentage of Bike Rides by Gender', fontsize=16, y=1)
plt.xlabel('Percentage %', labelpad=15)
plt.ylabel('Gender', labelpad=15)
plt.xticks(rotation=360)
plt.xlim(0,100);

Percentage of Rides by Hours of the Day

Bike rides peaked between 7AM - 9AM and 4PM - 7PM (hours 16-19).

In [6]:
# separate DF to look into and analyze hours
hour_df = tbikedata.groupby('st_hour').agg({'bike_id':'count'}).reset_index()

#percentage
hour_df['perc'] = (hour_df['bike_id'] / hour_df['bike_id'].sum())*100

# plot
plt.figure(figsize=(8,5))
sb.pointplot(data=hour_df, x='st_hour', y='perc', scale=0.6)
plt.title('Percentage of Bike Rides By Hour of the Day', fontsize=16, y=1)
plt.xlabel('Hour of the Day', labelpad=15)
plt.ylabel('Percentage % ', labelpad=15)
plt.xticks(rotation=360);

Percentage of Rides by Weekday

Rides were fairly even across weekdays and dropped off on weekends.

In [9]:
# separate DF to look into and analyze weekdays
weekday_df = tbikedata.groupby('st_weekday').agg({'bike_id':'count'})

# percentage
weekday_df['perc'] = (weekday_df['bike_id'] / weekday_df['bike_id'].sum())*100

# format of plot
base_color = sb.color_palette()[4]
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# plot
weekday_df.reindex(day_order)['perc'].plot(kind='bar', color=base_color, figsize=(10,5))
plt.title('Percentage of Bike Rides by Day of the Week', fontsize=16, y=1)
plt.xlabel('Weekday', labelpad=15)
plt.ylabel('Percentage %', labelpad=15)
plt.xticks(rotation=360);

Univariate Findings

  • Time and Date
    • (7AM - 9AM) and (4PM - 7PM) are peak hours
    • Rides dip on the weekends
  • Member Groups
    • The 20-30 and 30-40 age bins had the most riders
    • There were many more M riders than F riders
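
To put a number on the peak-hour finding, the share of rides starting in the peak windows can be computed directly. This is a sketch on made-up start hours; treating 7AM - 9AM as start hours 7-8 and 4PM - 7PM as start hours 16-18 is an assumption about how the windows are meant:

```python
import pandas as pd

# Hypothetical start hours (0-23), one entry per ride.
st_hour = pd.Series([8, 8, 7, 12, 17, 18, 16, 23, 8, 17])

# Assumed windows: rides starting in hours 7-8 (7AM-9AM) or 16-18 (4PM-7PM).
# Series.between is inclusive on both ends by default.
is_peak = st_hour.between(7, 8) | st_hour.between(16, 18)
peak_share = is_peak.mean() * 100
print(f"{peak_share:.0f}% of sample rides start in peak hours")
```

Taking the mean of a boolean mask gives the fraction of rows where the condition holds.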

Bivariate Exploration

We will look at bike ride trends in terms of:

  • user_type and member_age_bins
  • user_type and duration_min

Plot of Bike Rides for Subscribers

20-30 and 30-40 were the largest age bins for subscribers.

In [10]:
# Create a data frame for calculating bike-ride counts of subscribers per age group over year-month.
subscriber_df = tbikedata[tbikedata['user_type'] == 'Subscriber'].groupby(['YYMM', 'member_age_bins']).agg({'bike_id':'count'}).reset_index()

# Create a data frame for calculating bike-ride counts of customers per age group over year-month.
customer_df = tbikedata[tbikedata['user_type'] == 'Customer'].groupby(['YYMM', 'member_age_bins']).agg({'bike_id':'count'}).reset_index()

# plot- trend of bike rides for subscribers
plt.figure(figsize=(12,5))
ax = sb.pointplot(data=subscriber_df, x='YYMM', y='bike_id', scale=0.4, hue='member_age_bins')
plt.title("Monthly Trend of Rides by Subscribers' Age Group", fontsize=16, y=1)
plt.xlabel('YY-MM', labelpad=15)
plt.ylabel('Bike Rides', labelpad=15)
plt.xticks(rotation=360)
legend = ax.legend()
legend.set_title('Member Age Group');

Plot of Bike Rides for Customers

20-30 and 30-40 were the largest age bins for customers as well.

In [11]:
# plot- trend of bike rides for customers
plt.figure(figsize=(12,5))
ax = sb.pointplot(data=customer_df, x='YYMM', y='bike_id', scale=0.4, hue='member_age_bins')
plt.title("Monthly Trend of Rides by Customers' Age Group", fontsize=16, y=1)
plt.xlabel('year-month', labelpad=15)
plt.ylabel('Bike Rides', labelpad=15)
plt.xticks(rotation=360)
legend = ax.legend()
legend.set_title('Member Age Group');

User Types and Their Trip Duration in Minutes

Customers, however, took longer rides on average than subscribers.

In [12]:
tbikedata.groupby('user_type')['duration_min'].mean().plot(kind='barh', figsize=(8,5))
plt.title('Avg Trip Duration by User Type', fontsize=16, y=1)
plt.xlabel('Avg Trip Duration (mins)', labelpad=15)
plt.ylabel('User Type', labelpad=15)
plt.xticks(rotation=360);
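
The mean used above is sensitive to a few very long rides. As a sketch on made-up durations (not the real data), reporting the median and the ride count alongside the mean shows how much an outlier can move the average:

```python
import pandas as pd

# Made-up durations; one very long customer ride acts as an outlier.
sample = pd.DataFrame({
    "user_type": ["Customer"] * 4 + ["Subscriber"] * 4,
    "duration_min": [10.0, 20.0, 30.0, 300.0, 5.0, 8.0, 10.0, 13.0],
})

# Mean, median, and number of rides per user type.
summary = sample.groupby("user_type")["duration_min"].agg(["mean", "median", "count"])
print(summary)
```

If the median gap between user types is also large on the real data, the "customers ride longer" finding does not hinge on outliers.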

Customers vs Subscribers

The width of each violin indicates how common rides of a given duration are: the wider the violin at a point on the y-axis (duration in minutes), the larger the share of that user type's rides at that duration.

Subscribers took shorter rides overall, but took more rides than customers.

In [13]:
# DF for user_type and duration = user_type_duration_df
user_type_duration_df = tbikedata.loc[:,['user_type', 'duration_min']]

# plot -
user_type_duration_df_60 = user_type_duration_df[user_type_duration_df['duration_min'] <= 60]
sb.violinplot(data=user_type_duration_df_60, x='user_type', y='duration_min');

# Add title and format it
plt.title('Distribution of Trip Durations by User Type'.title(),
               fontsize = 14, weight = "bold")
# Add x label and format it
plt.xlabel('User Types'.title(),
               fontsize = 12, weight = "bold")
# Add y label and format it
plt.ylabel('Duration in Minutes'.title(),
               fontsize = 12, weight = "bold");

Trend of Rides

Subscriber rides grew over time, with more pronounced peaks and dips, and stayed well above Customer rides throughout. Customer rides remained mostly steady across the months.
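
The plot for this section reads from `n_user_type_YYMM` with `y=0`, and the cell that builds that frame is not shown. A plausible sketch, assuming it comes from a `groupby(...).size()` call (which would explain the column named `0`), on made-up rows:

```python
import pandas as pd

# Made-up rides; only the two grouping columns matter here.
sample = pd.DataFrame({
    "YYMM": ["18-01", "18-01", "18-01", "18-02", "18-02"],
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"],
})

# size() returns an unnamed Series, so reset_index() labels the count column 0.
n_user_type_YYMM = sample.groupby(["YYMM", "user_type"]).size().reset_index()
print(n_user_type_YYMM)
```

Passing `name='rides'` to `reset_index` would give the count column a readable name instead of `0`.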

In [15]:
# plot - 
plt.figure(figsize=(12,4))
palette = {'Subscriber': 'green', 'Customer': 'blue'}
ax = sb.pointplot(data= n_user_type_YYMM, x='YYMM', y=0, hue='user_type', palette=palette, scale=0.3)
plt.title('Monthly Trend of Rides by User Type', fontsize=16, y=1)
plt.xlabel('YYMM', labelpad=15)
plt.ylabel('Rides', labelpad=15)
legend = ax.legend()
legend.set_title('User Type');

Bivariate Findings

  • Our user base breaks down as follows:
    • 89% subscribers
    • 11% customers
  • Subscribers took more rides than Customers, but Customers generally took longer rides
  • Across all age groups, bike rides increased over the months
    • Age groups 20-30 and 30-40 saw strong peaks
    • Age group 10-20 was the steadiest over time, changing very little compared to other groups
  • Subscriber rides grew strongly over time with notable peaks and dips, whereas Customer rides remained mostly steady

Multivariate Exploration

  1. Make new DFs
    • one for each age group of subscribers
  2. Create 2 new columns
    • percentage column
    • rank column
  3. Create a pivot for each visual, which will display
    • st_hour
    • weekday
    • rank
  4. Create the visuals: 4 heatmaps, one for each age group from step 1
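
The cells that build the pivots are not shown. Steps 1-3 can be sketched as follows for a single age group, on made-up rows; ranking by percentage with rank 1 as the busiest slot is an assumption, as are the names `sub_sample` and `sub_pivot`:

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical subscriber rides for one age group.
sub_sample = pd.DataFrame({
    "st_weekday": ["Monday", "Monday", "Monday", "Monday", "Saturday", "Monday"],
    "st_hour":    [8, 8, 8, 17, 11, 17],
    "bike_id":    [1, 2, 3, 4, 5, 6],
})

# Steps 1-2: count rides per (weekday, hour), then add percentage and rank columns.
counts = sub_sample.groupby(["st_weekday", "st_hour"]).agg({"bike_id": "count"}).reset_index()
counts["perc"] = counts["bike_id"] / counts["bike_id"].sum() * 100
# Rank 1 is the busiest slot; method="first" breaks ties so ranks stay unique integers.
counts["rank"] = counts["perc"].rank(ascending=False, method="first").astype(int)

# Step 3: weekday rows by hour columns, holding the rank.
sub_pivot = counts.pivot(index="st_weekday", columns="st_hour", values="rank")
# Order the rows Monday to Sunday and drop days with no rides in this tiny sample.
sub_pivot = sub_pivot.reindex(day_order).dropna(how="all")
print(sub_pivot)
```

In this tiny sample some weekday/hour cells are empty, so the pivot holds floats with missing values. A pivot with every cell filled can be cast to `int`, which the `fmt='d'` annotations in the heatmap cell require.
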
In [21]:
# visualize using heatmaps; should produce 4 maps

plt.figure(figsize=(14.70,10.27))
plt.subplot(221)
plt.suptitle('Bike Rides by Age Group, Weekday, and Hour of the Day', fontsize=16, y=1)
sb.heatmap(sub_1_pivot, fmt='d', annot=True, cmap='YlGnBu', annot_kws={'size': 4})
plt.title('20-30 yr old subscribers')
plt.xlabel('Hour of the Day', labelpad=5)
plt.ylabel('Day of the Week', labelpad=10)
plt.yticks(rotation=360)

plt.subplot(222)
sb.heatmap(sub_2_pivot, fmt='d', annot=True, cmap='YlGnBu', annot_kws={'size': 4})
plt.title('30-40 yr old subscribers')
plt.xlabel('Hour of the Day', labelpad=5)
plt.ylabel('Day of the Week', labelpad=10)
plt.yticks(rotation=360)

plt.subplot(223)
sb.heatmap(sub_3_pivot, fmt='d', annot=True, cmap='YlGnBu', annot_kws={'size': 4})
plt.title('40-50 yr old subscribers')
plt.xlabel('Hour of the Day', labelpad=5)
plt.ylabel('Day of the Week', labelpad=10)
plt.yticks(rotation=360)

plt.subplot(224)
sb.heatmap(sub_4_pivot, fmt='d', annot=True, cmap='YlGnBu', annot_kws={'size': 4})
plt.title('50-60 yr old subscribers')
plt.xlabel('Hour of the Day', labelpad=5)
plt.ylabel('Day of the Week', labelpad=10)
plt.yticks(rotation=360);

Multivariate Findings

  • All age groups showed similar ride patterns
    • rides cluster between 7AM - 9AM and 4PM - 7PM (shown as 16 - 19), Monday through Friday
    • the 40-50 and 50-60 age groups both peaked slightly earlier in the afternoon
      • 3PM - 6PM (shown as 15 - 18)
    • weekend cells are uniformly darker blue, indicating far fewer rides than during the rest of the week
  • The weekday rush-hour peaks suggest that most subscribers ride for their work commute
  • The 40-50 and 50-60 age groups showed higher midday activity (10 - 16) than their younger counterparts