The following will contain my case study on the fictional bike-share company, Cyclystic. This case study is in relation to the Google Data Analytics Professional Certificate. I received the data from this program through this link. However, the data has been made available Motivate International Inc. under this license. The CSV files for this project are very large, especially when put together. That being said, I will be using R and RStudio to clean, analyze, and visualize this data as these tools are better suited for large datasets.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
library(ggplot2)
library(tidyverse)
library(here)
library(janitor)
library(skimr)
library(readr)
library(plyr)
library(lubridate)
library(maptools)
library(mapproj)
library(mapdata)
library(ggmap)
all_trip_data <- list.files(path = "./CSV", pattern = "*.csv", full.names = TRUE) %>%
lapply(read_csv, na.omit) %>%
bind_rows
In the following sections, I will detail my process of cleaning the data.
all_trips_df <- data.frame(all_trip_data)
head(all_trips_df)
## X1 X2 X3 X4
## 1 ride_id rideable_type started_at ended_at
## 2 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 3 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 4 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 5 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 6 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## X5 X6 X7
## 1 start_station_name start_station_id end_station_name
## 2 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 3 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 4 Stony Island Ave & 67th St 102 University Ave & 57th St
## 5 Clark St & Grace St 165 Broadway & Sheridan Rd
## 6 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## X8 X9 X10 X11
## 1 end_station_id start_lat start_lng end_lat
## 2 125 41.92610083333334 -87.63897733333333 41.89034516666667
## 3 260 41.94817233333333 -87.663911 41.929527666666665
## 4 423 41.77346216666667 -87.58537166666666 41.79145466666667
## 5 256 41.95085483333333 -87.65924433333333 41.952809
## 6 185 41.928856833333334 -87.663962 41.917776833333335
## X12 X13
## 1 end_lng member_casual
## 2 -87.626068 casual
## 3 -87.70781783333334 casual
## 4 -87.60004966666666 casual
## 5 -87.65010316666667 casual
## 6 -87.69143366666667 casual
colnames(all_trips_df)
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
## [13] "X13"
Looks like we’re going to have to change the column names before we proceed.
names(all_trips_df) [1:13] <- c("ride_id", "rideable_type", "started_at", "ended_at", "start_station_name",
"start_station_id", "end_station_name", "end_station_id", "start_lat", "start_lng",
"end_lat", "end_lng", "member_casual")
We will need to remove the first row of the df as it contains what was supposed to be the column names
all_trips_df <- all_trips_df[-c(1),]
Let’s check out what our df looks like now…
head(all_trips_df)
## ride_id rideable_type started_at ended_at
## 2 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 3 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 4 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 5 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 6 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## 7 DA6C3759660133DA electric_bike 2020-10-29 17:38:04 2020-10-29 17:45:43
## start_station_name start_station_id end_station_name
## 2 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 3 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 4 Stony Island Ave & 67th St 102 University Ave & 57th St
## 5 Clark St & Grace St 165 Broadway & Sheridan Rd
## 6 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## 7 Larrabee St & Division St 359 Wells St & Huron St
## end_station_id start_lat start_lng end_lat
## 2 125 41.92610083333334 -87.63897733333333 41.89034516666667
## 3 260 41.94817233333333 -87.663911 41.929527666666665
## 4 423 41.77346216666667 -87.58537166666666 41.79145466666667
## 5 256 41.95085483333333 -87.65924433333333 41.952809
## 6 185 41.928856833333334 -87.663962 41.917776833333335
## 7 53 41.90353233333333 -87.64334533333333 41.894397166666664
## end_lng member_casual
## 2 -87.626068 casual
## 3 -87.70781783333334 casual
## 4 -87.60004966666666 casual
## 5 -87.65010316666667 casual
## 6 -87.69143366666667 casual
## 7 -87.63431216666666 casual
Before proceeding, I dropped all rows that contained an NA value.
all_trips_df <- all_trips_df %>%
drop_na() # drops rows with NA
Let’s take a look at what we have now.
head(all_trips_df)
## ride_id rideable_type started_at ended_at
## 1 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 2 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 3 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 4 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 5 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## 6 DA6C3759660133DA electric_bike 2020-10-29 17:38:04 2020-10-29 17:45:43
## start_station_name start_station_id end_station_name
## 1 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 2 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 3 Stony Island Ave & 67th St 102 University Ave & 57th St
## 4 Clark St & Grace St 165 Broadway & Sheridan Rd
## 5 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## 6 Larrabee St & Division St 359 Wells St & Huron St
## end_station_id start_lat start_lng end_lat
## 1 125 41.92610083333334 -87.63897733333333 41.89034516666667
## 2 260 41.94817233333333 -87.663911 41.929527666666665
## 3 423 41.77346216666667 -87.58537166666666 41.79145466666667
## 4 256 41.95085483333333 -87.65924433333333 41.952809
## 5 185 41.928856833333334 -87.663962 41.917776833333335
## 6 53 41.90353233333333 -87.64334533333333 41.894397166666664
## end_lng member_casual
## 1 -87.626068 casual
## 2 -87.70781783333334 casual
## 3 -87.60004966666666 casual
## 4 -87.65010316666667 casual
## 5 -87.69143366666667 casual
## 6 -87.63431216666666 casual
Upon further inspection, I realized that I needed to change the data types of the columns for later analysis.
all_trips_df <- all_trips_df %>%
mutate(started_at = ymd_hms(started_at), # formats date as ymd hms
ended_at = ymd_hms(ended_at),
start_station_id = as.numeric(start_station_id),
end_station_id = as.numeric(end_station_id),
start_lat = as.numeric(start_lat),
start_lng = as.numeric(start_lng),
end_lat = as.numeric(end_lat),
end_lng = as.numeric(end_lng),
ride_length_min = round(as.numeric(ended_at-started_at), digits = 2), # returns the diff in time in sec
day_of_week = weekdays(started_at)) # retrieves day of the week that the ride started on
## Warning: 12 failed to parse.
## Warning: 12 failed to parse.
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
## Warning in eval(cols[[col]], .data, parent.frame()): NAs introduced by coercion
# change ride_length_min to min instead of sec
all_trips_df <- all_trips_df %>%
mutate(ride_length_min = round(as.numeric((ended_at-started_at)/60), digits = 2)) # returns diff in minutes
head(all_trips_df)
## ride_id rideable_type started_at ended_at
## 1 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 2 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 3 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 4 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 5 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## 6 DA6C3759660133DA electric_bike 2020-10-29 17:38:04 2020-10-29 17:45:43
## start_station_name start_station_id end_station_name
## 1 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 2 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 3 Stony Island Ave & 67th St 102 University Ave & 57th St
## 4 Clark St & Grace St 165 Broadway & Sheridan Rd
## 5 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## 6 Larrabee St & Division St 359 Wells St & Huron St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 125 41.92610 -87.63898 41.89035 -87.62607 casual
## 2 260 41.94817 -87.66391 41.92953 -87.70782 casual
## 3 423 41.77346 -87.58537 41.79145 -87.60005 casual
## 4 256 41.95085 -87.65924 41.95281 -87.65010 casual
## 5 185 41.92886 -87.66396 41.91778 -87.69143 casual
## 6 53 41.90353 -87.64335 41.89440 -87.63431 casual
## ride_length_min day_of_week
## 1 17.48 Saturday
## 2 14.13 Saturday
## 3 8.35 Saturday
## 4 2.87 Saturday
## 5 16.22 Saturday
## 6 7.65 Thursday
all_trips_df <- all_trips_df %>%
filter(start_station_name != "WATSON TESTING - DIVVY") # seemed to be a test location for the company, decided to filter out
all_trips_df <- all_trips_df %>%
filter(ride_length_min > 0) # filter out rides less than 0 minutes
casual_trips_df <- all_trips_df %>% # data frame of rides for casual riders
filter(member_casual == "casual",
ride_length_min < 1500, # filter out rides longer than 24hrs
year(started_at) > "2020") %>% # filter out 2020 data
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), # create a hierarchy of weekdays
month = month(started_at)) # create column representing the month the ride was in
annual_member_trips_df <- all_trips_df %>% # data frame of rides for annual members
filter(member_casual == "member",
ride_length_min < 1500, # filter out rides longer than 24hrs
year(started_at) > "2020") %>% # filter out 2020 data
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), # create a hierarchy of weekdays
month = month(started_at)) # create column representing the month the ride was in
summary(casual_trips_df)
## ride_id rideable_type started_at
## Length:1932167 Length:1932167 Min. :2021-01-01 00:12:27
## Class :character Class :character 1st Qu.:2021-06-05 18:41:49
## Mode :character Mode :character Median :2021-07-20 18:11:13
## Mean :2021-07-15 22:34:36
## 3rd Qu.:2021-08-31 19:40:18
## Max. :2021-10-31 23:57:33
##
## ended_at start_station_name start_station_id
## Min. :2021-01-01 00:12:30 Length:1932167 Min. : 20
## 1st Qu.:2021-06-05 19:15:45 Class :character 1st Qu.: 13033
## Median :2021-07-20 18:33:07 Mode :character Median : 13156
## Mean :2021-07-15 23:03:22 Mean : 12547
## 3rd Qu.:2021-08-31 19:57:59 3rd Qu.: 13303
## Max. :2021-11-01 10:58:41 Max. :202480
## NA's :1032809
## end_station_name end_station_id start_lat start_lng
## Length:1932167 Min. : 20 Min. :41.65 Min. :-87.83
## Class :character 1st Qu.: 13033 1st Qu.:41.88 1st Qu.:-87.65
## Mode :character Median : 13154 Median :41.90 Median :-87.64
## Mean : 12575 Mean :41.90 Mean :-87.64
## 3rd Qu.: 13303 3rd Qu.:41.93 3rd Qu.:-87.62
## Max. :202480 Max. :42.06 Max. :-87.53
## NA's :1029974
## end_lat end_lng member_casual ride_length_min
## Min. :41.65 Min. :-87.83 Length:1932167 Min. : 0.02
## 1st Qu.:41.88 1st Qu.:-87.65 Class :character 1st Qu.: 9.70
## Median :41.90 Median :-87.64 Mode :character Median : 16.98
## Mean :41.90 Mean :-87.64 Mean : 28.76
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 31.18
## Max. :42.17 Max. :-87.52 Max. :1499.85
##
## day_of_week month
## Monday :214482 Min. : 1.000
## Tuesday :200828 1st Qu.: 6.000
## Wednesday:202655 Median : 7.000
## Thursday :209040 Mean : 6.981
## Friday :273331 3rd Qu.: 8.000
## Saturday :446123 Max. :10.000
## Sunday :385708
min(casual_trips_df$ride_length_min)
## [1] 0.02
mean(casual_trips_df$ride_length_min)
## [1] 28.75529
max(casual_trips_df$ride_length_min)
## [1] 1499.85
summary(annual_member_trips_df)
## ride_id rideable_type started_at
## Length:2222600 Length:2222600 Min. :2021-01-01 00:02:24
## Class :character Class :character 1st Qu.:2021-05-21 08:37:00
## Mode :character Mode :character Median :2021-07-17 14:20:31
## Mean :2021-07-09 06:38:52
## 3rd Qu.:2021-09-06 20:17:45
## Max. :2021-10-31 23:58:15
##
## ended_at start_station_name start_station_id
## Min. :2021-01-01 00:08:39 Length:2222600 Min. : 20
## 1st Qu.:2021-05-21 08:49:06 Class :character 1st Qu.: 13050
## Median :2021-07-17 14:39:18 Mode :character Median : 13179
## Mean :2021-07-09 06:52:23 Mean : 12402
## 3rd Qu.:2021-09-06 20:32:34 3rd Qu.: 13323
## Max. :2021-11-01 07:42:23 Max. :202480
## NA's :1276813
## end_station_name end_station_id start_lat start_lng
## Length:2222600 Min. : 20 Min. :41.65 Min. :-87.83
## Class :character 1st Qu.: 13053 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Median : 13179 Median :41.90 Median :-87.64
## Mean : 12437 Mean :41.90 Mean :-87.65
## 3rd Qu.: 13323 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :202480 Max. :42.06 Max. :-87.53
## NA's :1280120
## end_lat end_lng member_casual ride_length_min
## Min. :41.65 Min. :-87.83 Length:2222600 Min. : 0.02
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 5.88
## Median :41.90 Median :-87.64 Mode :character Median : 10.02
## Mean :41.90 Mean :-87.65 Mean : 13.52
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 17.00
## Max. :42.07 Max. :-87.53 Max. :1495.63
##
## day_of_week month
## Monday :295823 Min. : 1.00
## Tuesday :332869 1st Qu.: 5.00
## Wednesday:342877 Median : 7.00
## Thursday :324804 Mean : 6.75
## Friday :322785 3rd Qu.: 9.00
## Saturday :322414 Max. :10.00
## Sunday :281028
min(annual_member_trips_df$ride_length_min)
## [1] 0.02
mean(annual_member_trips_df$ride_length_min)
## [1] 13.52448
max(annual_member_trips_df$ride_length_min)
## [1] 1495.63
After a quick look at the summary statistics, we can see that casual riders spend more time on average per each ride than annual members. Casual riders spend and average of 28.7552883 minutes per ride, whereas annual members spend an average of 13.5244758 minutes.
casual_weekly_plot <- ggplot(data = casual_trips_df, mapping = aes(x=day_of_week)) +
geom_bar(aes(fill = day_of_week)) + # color for each weekday
labs(
title = "Number of Weekly Casual Rides",
subtitle = "January 2021 - October 2021",
x = "Day of the Week",
y = "Number of Rides",
fill = "Day of the Week"
) +
theme(axis.text.x = element_text(angle = 90)) + # turns xaxis labels 90 degrees counterclockwise
facet_wrap(~month) # create a weekly plot for each month
casual_weekly_plot
member_weekly_plot <- ggplot(data = annual_member_trips_df, mapping = aes(x=day_of_week)) +
geom_bar(aes(fill = day_of_week)) + # color for each weekday
labs(
title = "Number of Weekly Member Rides",
subtitle = "January 2021 - October 2021",
x = "Day of the Week",
y = "Number of Rides",
fill = "Day of the Week"
) +
theme(axis.text.x = element_text(angle = 90)) + # turns xaxis labels 90 degrees counterclockwise
facet_wrap(~month) # create a weekly plot for each month
member_weekly_plot
###
## create map of chicago from https://cfss.uchicago.edu/notes/raster-maps-with-ggmap/
#
chi_bb <- c( # store bounding box coordinates
left = -87.936287,
bottom = 41.679835,
right = -87.447052,
top = 42.000835
)
chicago_stamen <- get_stamenmap( # retrieve map from Stamen Maps
bbox = chi_bb,
zoom = 11
)
chicago_map <- chicago_stamen
ggmap(chicago_map) # use ggmap to view
casual_ride_density <- ggmap(chicago_map) + # map to show yearly ride density
stat_density2d(data = casual_trips_df, aes(x=start_lng, y=start_lat), color="Deep Pink", fill=stat(nrow(casual_trips_df)), alpha=I(0.5), geom = "polygon") +
labs(
title = "Ride Density for Casual Riders",
subtitle = "January 2021 - October 2021",
x = "Longitude",
y = "Latitude",
)
casual_ride_density
## Warning: Removed 26981 rows containing non-finite values (stat_density2d).
member_ride_density <- ggmap(chicago_map) + # map to show ride yearly ride density
stat_density2d(data = annual_member_trips_df, aes(x=start_lng, y=start_lat), color="Deep Pink", fill=stat(nrow(annual_member_trips_df)), alpha=I(0.5), geom = "polygon") +
labs(
title = "Ride Density for Members",
subtitle = "January 2021 - October 2021",
x = "Longitude",
y = "Latitude",
)
member_ride_density
## Warning: Removed 33910 rows containing non-finite values (stat_density2d).
To sum up everthing that has been shown so far, we loaded the necessary packages and data, cleaned and filtered the data, manipulated the data frame to fit our needs, analyzed the data, and built some beautiful plots. Now it’s time to circle back to our business problem. We had two questions to conquerer.
We found that for both casual and annual riders, the winter months (November, December, January, February, and most of March) are exceedingly lower in terms of the number of rides. And this makes sense, the winter months are typically cold, rainy, windy, and/or snowy in Chicago. Those conditions are not favorable for biking, even for the most dedicated users. However, the Spring, Summer, and early Fall months are the peak of Cyclystic’s users’ activity. The weather is fair, if not beautiful, allowing for an optimal environment for biking.
We also found that there are differences in how casual and annual members use this service. Casual riders have high activity on the weekend days in comparison the weekdays. Annual members have a different spread of activity. These users utilize the bikes more equally throughout the week as a whole with a slight emphasis on weekdays rather than weekends. This could indicate that these users are using this service to commute to and from work or to run to get lunch during a busy day.
Another interesting find was that most rides for both casual and annual riders where located in downtown Chicago. In addition, a lot of rides start up and down the water front area. We could infer that this is due to the city’s contract with Cyclystic in terms of where bikes are allowed to be located to avoid having them littering the streets. We could also infer that it would be an optimal area for users to get around and see what the city has to offer in a more efficient way that walking or Ubering everywhere.
Moving forward, there are some things I think Cyclystic should do if they want to convert casual riders into annual members.