The goal of this project is to build a machine learning model to predict the Formula One World Constructors’ Championship Standings for the upcoming 2023 season.
Formula One, or F1 for short, is open-wheel, single-seater car racing at the highest level. There are ten constructors, or racing teams, each tasked with building an F1 car before the annual racing season. During each season, the ten constructors compete with their newly launched cars for a position from first to tenth in the World Constructors’ Championship Standings. Each constructor employs two drivers, who in turn compete for a place from first, i.e. World Champion, to twentieth in the World Drivers’ Championship Standings. The sport is overseen and regulated by the Fédération Internationale de l’Automobile (FIA).
Racing occurs over a weekend: Fridays are for free practice, Saturdays are for qualifying, and Sundays are race days. During a race, finishing positions first through tenth earn 25, 18, 15, 12, 10, 8, 6, 4, 2, and 1 points respectively. The number of races, or grands prix, varies from year to year. The upcoming 2023 season has 23 races on the calendar, compared to 22 on the 2022 calendar. Some races are new, and some drivers are new, which may pose some difficulty later on, so it is best we mention it now.
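As a quick reference, the points system is easy to encode directly; here is a small sketch (the example finishing positions are made up):
# points awarded for finishing positions 1 through 10 (0 otherwise)
f1_points <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)
points_for_position <- function(pos) {
  pts <- rep(0, length(pos))
  idx <- pos >= 1 & pos <= 10
  pts[idx] <- f1_points[pos[idx]]
  pts
}
# example: a driver finishing 2nd, 5th, and 11th across three races
sum(points_for_position(c(2, 5, 11)))   # 18 + 10 + 0 = 28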
What is Free Practice?
On Fridays, constructor teams put out their two drivers on the circuit to collect data about the car, get a feel of how the car handles the track, and in general obtain as much practice and knowledge to prepare and perform during qualifying and race day.
How Qualifying Works:
On Saturdays, drivers have an allotted time in Q1 to record a fast lap; the five slowest drivers are knocked out and take grid positions 16 to 20. In Q2, the remaining fifteen drivers have another allotted time to record a fast lap, and the five slowest of them are knocked out, taking positions 11 to 15. In Q3, the final ten drivers record a last fast lap to earn a grid position from 1 to 10. The position a driver holds when they are knocked out, or at the end of Q3, determines where they will start on the grid on Sunday race day.
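To make the knockout format concrete, here is a small sketch in R that mimics the three sessions with made-up lap times (the driver names and times are entirely hypothetical):
# hypothetical fastest laps (seconds) for 20 drivers in Q1
set.seed(1)
q1 <- data.frame(driver = paste0("driver_", 1:20),
                 q1_time = round(runif(20, 88, 92), 3))
q1 <- q1[order(q1$q1_time), ]
grid_16_20 <- q1$driver[16:20]   # slowest five knocked out; they start 16th-20th
q2 <- q1[1:15, ]
q2$q2_time <- round(runif(15, 87.5, 91), 3)
q2 <- q2[order(q2$q2_time), ]
grid_11_15 <- q2$driver[11:15]   # next five knocked out; they start 11th-15th
q3 <- q2[1:10, ]
q3$q3_time <- round(runif(10, 87, 90.5), 3)
grid_1_10 <- q3$driver[order(q3$q3_time)]   # Q3 order sets grid positions 1-10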
Formula One, an inherently European-rooted sport, has recently gained immense popularity in the United States, partially due to the Netflix show “Drive to Survive” covering the motorsport. While the show itself is highly dramatized and at times an inaccurate representation of actual race events, I, like many others, developed an interest in the sport and began watching archived races, qualifying sessions, and more. It is well known among fans and followers of the sport that racing teams with higher budgets and access to richer resources tend to fare better in the Constructors’ Championship Standings. For many years, wealthy and heavily sponsored constructors like Mercedes, Ferrari, and Red Bull have occupied the higher ranks of the World Championship. However, in recent years the FIA has instituted a budget cap and other regulations in hopes of leveling the playing field. With the 2023 season about to kick off, I want to predict its results, not only for my own fun, but also to see whether the FIA’s regulations are contributing to the beginning of a new era of competition in F1 racing.
We are going to begin this project by just taking a general look at the data and consider what we need to do to construct a data set we can work with, i.e. missing data, useful and/or not useful variables, necessary key/ID values, manipulating and/or converting variables, etc.
To obtain the data, I will be using the Ergast Developer API, which documents F1 statistics since the beginning of the World Championship in 1950. However, for convenience I will be downloading the csv files from the API website instead of pulling data through the API itself. I may also pull further information for certain variables from the official F1 website archive.
Each csv file is its own data set, and some act as lookup tables for the others. The task is to combine these data sets efficiently into a single one before any analysis or model building.
Code in file read_data.rda
Let’s begin by loading our relevant packages, setting a seed with set.seed, reading in the csv files we downloaded from the API website, and taking a quick look at some of our variables.
library(tidyverse)
library(tidymodels)
library(kknn)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(kableExtra)
library(parsnip)
library(recipes)
library(magrittr)
library(workflows)
library(glmnet)
library(themis)
library(ranger)
library(vip)
library(naniar)
library(visdat)
library(dplyr)
library(ISLR)
library(gridExtra) # needed later for grid.arrange
set.seed(2536)
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//read_data.rda")
Here is a general summary of what each data set involves:
circuit - uniquely identified circuits with their locations
races - uniquely identified races with their round number and locations since 1950
seasons - list of years of race seasons
results - collection of results for each driver at each race
drivers - uniquely identified drivers since 1950
driver_standings - uniquely identified record of position in the Drivers’ Championship Standings at the result of a certain race
constructors - uniquely identified constructor teams since 1950
constructor_results - uniquely identified record of points earned by some constructor at a certain race
constructor_standings - uniquely identified record of position in the Constructors’ Championship Standings at the result of a certain race
status - lookup table of status key/value pairs
lap_times - all lap times for each driver at each race
pit_stops - pit stop times for each driver at each race
qualifying - qualifying results for each race
sprint_results - sprint qualifying results for each driver at certain races that included sprint qualifying
Here are some things to consider:
Since a constructor’s standing is determined by how many points they’ve accumulated throughout the race season relative to the other teams, it would be ideal to base our response on a points variable. Also, the number of points a constructor obtains depends on how many points its drivers are able to obtain at races. So, the response variable can come from and/or be calculated from the results, constructor_results, constructor_standings, or driver_standings data sets.
There may be missing data because three rookie drivers are entering the 2023 F1 season: Oscar Piastri, Nyck de Vries, and Logan Sargeant. It would be hard to predict the next season with no F1 data on these three drivers. I will probably take data from previous Formula Two seasons for these drivers, because F2 races concurrently with the F1 season on the same circuits. We can treat the number of points earned by a driver as a measurement of how “good” that driver is. However, since the points a constructor earns is the sum of the points its two drivers earn, we may have some structural collinearity between driver points and constructor points. Drivers also change constructors fairly often, so it is worth being cautious here.
Some constructor teams have changed their names over the years (e.g. Force India became Racing Point Force India, then Racing Point, then Aston Martin), so their data will be scattered across the data set. Despite the different names, we group these “different, but same” teams together because important factors like car part suppliers, key partnerships, etc. usually stay the same regardless of the official name change. So, we will have to align the different team names under one name, preferably the most recent and relevant one.
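A minimal sketch of that alignment using case_when (the reference strings here are illustrative and would need checking against the actual constructorRef values):
# illustrative only: collapse older team identities onto their current name
constructors <- constructors %>%
  mutate(constructorRef = case_when(
    constructorRef %in% c("force_india", "racing_point") ~ "aston_martin",
    constructorRef %in% c("toro_rosso") ~ "alphatauri",
    TRUE ~ constructorRef))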
levels(factor(constructor_results$status))
## [1] "\\N" "D"
As we can see, the status variable of the constructor_results data set has two levels: \\N (missing) and D (disqualified).
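Since \\N is just a placeholder for a missing value, one option (not applied here, but worth noting) would be to recode it as a proper NA:
# recode the "\\N" placeholder as a true missing value (optional)
constructor_results <- constructor_results %>%
  mutate(status = na_if(status, "\\N"))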
# merge constructor names w/ results data set on Id variable
merge_test <- full_join(constructors[,c(1,3)], constructor_results, by="constructorId")
merge_test <- full_join(races[,c(1:3)], merge_test, by="raceId")
merge_test %>%
filter(status == "D")
## raceId year round constructorId name constructorResultsId points status
## 1 36 2007 1 1 McLaren 186 14 D
## 2 37 2007 2 1 McLaren 196 18 D
## 3 38 2007 3 1 McLaren 208 12 D
## 4 39 2007 4 1 McLaren 219 14 D
## 5 40 2007 5 1 McLaren 229 18 D
## 6 41 2007 6 1 McLaren 240 12 D
## 7 42 2007 7 1 McLaren 251 18 D
## 8 43 2007 8 1 McLaren 263 8 D
## 9 44 2007 9 1 McLaren 274 14 D
## 10 45 2007 10 1 McLaren 284 10 D
## 11 46 2007 11 1 McLaren 295 15 D
## 12 47 2007 12 1 McLaren 307 10 D
## 13 48 2007 13 1 McLaren 317 18 D
## 14 49 2007 14 1 McLaren 329 11 D
## 15 50 2007 15 1 McLaren 339 10 D
## 16 51 2007 16 1 McLaren 351 8 D
## 17 52 2007 17 1 McLaren 362 8 D
Here we discover that these rows reflect McLaren being disqualified from the 2007 Constructors’ Championship because of the 2007 espionage controversy. This will come into play later.
Code in file modify_data.rda
The first thing we do in modify_data.rda is remove the url column from the following data sets: constructors, drivers, circuit, races, and seasons.
As I mentioned before, we have three rookie drivers for the 2023 F1 season. If we take a look at our data, we will notice that these drivers are actually already entered in the drivers data set, with their own driverId and other values as well. So, they are also in merge_driv, albeit with mostly NA missing values.
drivers %>%
filter(driverRef == "piastri" | driverRef == "sargeant" | forename=="Nyck")
## driverId driverRef number code forename surname dob nationality
## 1 856 de_vries 45 DEV Nyck de Vries 1995-02-06 Dutch
## 2 857 piastri 81 PIA Oscar Piastri 2001-04-06 Australian
## 3 858 sargeant 2 SAR Logan Sargeant 2000-12-31 American
## url
## 1 http://en.wikipedia.org/wiki/Nyck_de_Vries
## 2 http://en.wikipedia.org/wiki/Oscar_Piastri
## 3 http://en.wikipedia.org/wiki/Logan_Sargeant
# Nyck de Vries completed one race as a reserve driver for Williams
merge_driv %>%
filter(driverId %in% c(856:858))
## raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1 1089 856 3 9 9 2
## 2 1091 856 NA <NA> NA NA
## 3 1092 856 NA <NA> NA NA
## 4 1093 856 NA <NA> NA NA
## 5 1094 856 NA <NA> NA NA
## 6 1095 856 NA <NA> NA NA
## 7 1096 856 NA <NA> NA NA
## 8 1098 857 NA <NA> NA NA
## 9 1098 858 NA <NA> NA NA
## 10 1098 856 NA <NA> NA NA
## laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1 53 41 13 86.624 240.750
## 2 NA <NA> <NA> NA <NA>
## 3 NA <NA> <NA> NA <NA>
## 4 NA <NA> <NA> NA <NA>
## 5 NA <NA> <NA> NA <NA>
## 6 NA <NA> <NA> NA <NA>
## 7 NA <NA> <NA> NA <NA>
## 8 NA <NA> <NA> NA <NA>
## 9 NA <NA> <NA> NA <NA>
## 10 NA <NA> <NA> NA <NA>
## driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1 1 82.471 13 91.21949 10 24.628
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## 7 NA NA NA NA NA NA
## 8 NA NA NA NA NA NA
## 9 NA NA NA NA NA NA
## 10 NA NA NA NA NA NA
## sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1 2 20 20 0 2022 16
## 2 2 20 20 0 2022 17
## 3 2 21 21 0 2022 18
## 4 2 21 21 0 2022 19
## 5 2 21 21 0 2022 20
## 6 2 21 21 0 2022 21
## 7 2 21 21 0 2022 22
## 8 0 12 12 0 2023 1
## 9 0 15 15 0 2023 1
## 10 0 19 19 0 2023 1
## circuitId race_name circ_country circuitRef
## 1 14 Italian Grand Prix Italy monza
## 2 15 Singapore Grand Prix Singapore marina_bay
## 3 22 Japanese Grand Prix Japan suzuka
## 4 69 United States Grand Prix USA americas
## 5 32 Mexico City Grand Prix Mexico rodriguez
## 6 18 Brazilian Grand Prix Brazil interlagos
## 7 24 Abu Dhabi Grand Prix UAE yas_marina
## 8 3 Bahrain Grand Prix Bahrain bahrain
## 9 3 Bahrain Grand Prix Bahrain bahrain
## 10 3 Bahrain Grand Prix Bahrain bahrain
## circ_name driverRef forename surname dob nationality
## 1 Autodromo Nazionale di Monza de_vries Nyck de Vries 1995 Dutch
## 2 Marina Bay Street Circuit de_vries Nyck de Vries 1995 Dutch
## 3 Suzuka Circuit de_vries Nyck de Vries 1995 Dutch
## 4 Circuit of the Americas de_vries Nyck de Vries 1995 Dutch
## 5 Autódromo Hermanos RodrÃguez de_vries Nyck de Vries 1995 Dutch
## 6 Autódromo José Carlos Pace de_vries Nyck de Vries 1995 Dutch
## 7 Yas Marina Circuit de_vries Nyck de Vries 1995 Dutch
## 8 Bahrain International Circuit piastri Oscar Piastri 2001 Australian
## 9 Bahrain International Circuit sargeant Logan Sargeant 2000 American
## 10 Bahrain International Circuit de_vries Nyck de Vries 1995 Dutch
## status
## 1 Finished
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 <NA>
## 10 <NA>
I did consider adding F2 and F3 data to our data set to fill out these three drivers. However, I am now against the idea: while there are similarities between F2, F3, and F1, the race weekends are quite different, and I’m afraid adding certain values while being forced to omit others that don’t exist in F2 and F3 would create even more missing data.
There isn’t a resource from which I could pull this data cleanly and have it combine nicely with the data set we have now, so I will continue without adding extra data, for the sake of time.
Now let’s take a look at our two merged data sets: merge_con and merge_driv.
dim(merge_con)[1]
## [1] 13177
dim(merge_driv)[1]
## [1] 34496
Considering the 20,000+ difference in observations, which is expected since there are more unique drivers than constructors, I am against merging merge_con and merge_driv. At this point of the project, my plan of action is changing a little. Instead of trying to consolidate all my data into a single data set and modelling off of that, I am seriously considering modelling off of two data sets:
Use merge_con to predict the response variable con_pos, using predictors specific to the effect of each individual constructor.
Use merge_driv to predict driv_pnts, using predictors specific to the effect of each individual driver. We can predict driv_pnts for our 2023 drivers, sum the predictions for the drivers on each 2023 constructor team, and rank the constructors accordingly (see the sketch below).
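That last aggregation step would be simple; a sketch, assuming a hypothetical data frame pred_2023 with one row per 2023 driver, a constructor_2023 column, and predicted points in .pred:
# hypothetical: rank 2023 constructors by summing their drivers' predicted points
pred_2023 %>%
  group_by(constructor_2023) %>%
  summarise(pred_con_pnts = sum(.pred)) %>%
  arrange(desc(pred_con_pnts)) %>%
  mutate(pred_con_pos = row_number())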
I think this plan is definitely doable, and it is not too far off from our original plan. We could just use merge_con and proceed with our modelling, but I feel there is so much potential in the merge_driv data that it would be a shame not to use it. However, with the lack of data around our three rookie drivers, the merge_con approach may fare better for our prediction goal.
Here are our two finalized data sets, as well as a codebook for each:
head(merge_con)
## raceId constructorId con_pnts con_sum_pnts con_pos con_posText con_wins
## 1 18 1 14 14 1 1 1
## 2 18 2 8 8 3 3 0
## 3 18 3 9 9 2 2 0
## 4 18 4 5 5 4 4 0
## 5 18 5 2 2 5 5 0
## 6 18 6 1 1 6 6 0
## constructorRef con_name con_nation season round circuitId
## 1 mclaren McLaren British 2008 1 1
## 2 bmw_sauber BMW Sauber German 2008 1 1
## 3 williams Williams British 2008 1 1
## 4 renault Renault French 2008 1 1
## 5 toro_rosso Toro Rosso Italian 2008 1 1
## 6 ferrari Ferrari Italian 2008 1 1
## race_name circ_country circuitRef circ_name
## 1 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 2 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 3 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 4 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 5 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 6 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
merge_con Codebook
raceId
- unique race identifier; numeric
constructorId
- unique constructor identifier; numeric
con_pnts
- number of points a constructor earns at a certain race; numeric
con_sum_pnts
- total number of points a constructor earned so far in the race season including that certain race; numeric
con_pos
- a constructor’s position in the Constructors’ Championship Standings at the point of that certain race during the race season; numeric
con_posText
- con_pos as text; character
con_wins
- number of wins a constructor earned so far in the race season including that certain race; numeric
constructorRef
- constructor reference name; character
con_name
- constructor name; character
con_nation
- nationality of constructor team; character
season
- year of race season; numeric
round
- race order (1 = first race, 2 = second, etc.) in the race season; numeric
circuitId
- unique circuit identifier; numeric
race_name
- name of grand prix; character
circ_country
- country circuit is located in; character
circuitRef
- circuit reference name; character
circ_name
- name of circuit; character
head(merge_driv)
## raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1 18 1 1 1 1 10
## 2 18 2 2 2 2 8
## 3 18 3 3 3 3 6
## 4 18 4 4 4 4 5
## 5 18 5 1 5 5 4
## 6 18 6 3 6 6 3
## laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1 58 39 2 87.452 218.300
## 2 58 41 3 87.739 217.586
## 3 58 41 5 88.090 216.719
## 4 58 58 7 88.603 215.464
## 5 58 43 1 87.418 218.385
## 6 57 50 14 89.639 212.974
## driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1 1 85.187 1 98.11407 1 NA
## 2 1 85.518 5 98.20852 4 NA
## 3 1 86.059 7 98.25481 4 NA
## 4 1 86.188 12 98.41029 8 NA
## 5 1 85.452 3 98.42466 3 NA
## 6 11 86.413 14 100.08219 11 NA
## sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1 10 1 1 1 2008 1
## 2 8 2 2 0 2008 1
## 3 6 3 3 0 2008 1
## 4 5 4 4 0 2008 1
## 5 4 5 5 0 2008 1
## 6 3 6 6 0 2008 1
## circuitId race_name circ_country circuitRef
## 1 1 Australian Grand Prix Australia albert_park
## 2 1 Australian Grand Prix Australia albert_park
## 3 1 Australian Grand Prix Australia albert_park
## 4 1 Australian Grand Prix Australia albert_park
## 5 1 Australian Grand Prix Australia albert_park
## 6 1 Australian Grand Prix Australia albert_park
## circ_name driverRef forename surname dob
## 1 Albert Park Grand Prix Circuit hamilton Lewis Hamilton 1985
## 2 Albert Park Grand Prix Circuit heidfeld Nick Heidfeld 1977
## 3 Albert Park Grand Prix Circuit rosberg Nico Rosberg 1985
## 4 Albert Park Grand Prix Circuit alonso Fernando Alonso 1981
## 5 Albert Park Grand Prix Circuit kovalainen Heikki Kovalainen 1981
## 6 Albert Park Grand Prix Circuit nakajima Kazuki Nakajima 1985
## nationality status
## 1 British Finished
## 2 German Finished
## 3 German Finished
## 4 Spanish Finished
## 5 Finnish Finished
## 6 Japanese +1 Lap
merge_driv Codebook
raceId
- unique race identifier; numeric
driverId
- unique driver identifier; numeric
constructorId
- unique constructor identifier; numeric
driv_positionText
- a driver’s finishing race position at that certain race; factor; includes the following:
D
- disqualified
E
- excluded
F
- failed to qualify
N
- not classified
R
- retired
W
- withdrawn
driv_positionOrder
- a driver’s place in the finishing order of that certain race; numeric
driv_pnts
- number of points a driver earns at a certain race; numeric
laps
- number of laps a driver completed in the race; numeric
fastestLap
- the lap number during which a driver drove their fastest lap during the race; character
fastestLapTime_rank
- the rank of a driver’s fastest lap time among that of the other drivers during the race; character
fastestLapTime
- a driver’s fastest lap time during the race in seconds; numeric
fastestLapSpeed
- fastest speed reached during a driver’s fastest lap; character
driv_statusId
- unique status identifier for the condition in which a driver finished their race; numeric
q_time
- fastest qualifying fast lap time (out of Q1, Q2, and Q3) that a driver completed during that race weekend’s qualifying session; numeric
driv_start_pos
- a driver’s starting grid position for that certain race; numeric
avg_lap_time
- average of all of a driver’s lap times during a race; numeric
avg_lap_pos
- rounded average of all positions a driver was at during a race; numeric
avg_pit
- average pit stop duration of a driver during a race; numeric
sum_driv_pnts
- total number of points a driver earned so far in the race season including that certain race; numeric
driv_standing
- a driver’s position in the Drivers’ Championship Standings at the point of that certain race during the race season; numeric
driv_standingText
- driv_standing as text; character; also includes the following:
D
- disqualified
driv_wins
- number of wins a driver earned so far in the race season including that certain race; numeric
status
- the condition in which a driver finished their race; factor
Code in file eda.rda
Let’s take a look at the missing data in merge_con.
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//eda.rda")
vis_miss(merge_con)
Let’s arrange each constructor’s results in order throughout each race season and fill in con_pnts and con_sum_pnts using each other (since con_sum_pnts is an accumulation of con_pnts throughout the season).
merge_con1 <- merge_con %>%
# arrange each constructor's result in order throughout season
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
# replace con_sum_pnts missing values with the actual value using the other con_pnts column
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
# vice versa
merge_con2 <- merge_con1 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_pnts = case_when((is.na(con_pnts) & season==lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts-lag(con_sum_pnts),
(is.na(con_pnts) & season!=lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts,
is.na(con_pnts)!=TRUE ~ con_pnts))
# and again the first direction for good measure
merge_con3 <- merge_con2 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
vis_miss(merge_con3)
merge_con3$constructorRef <- factor(merge_con3$constructorRef)
After filling in con_pnts and con_sum_pnts, we have a lot less missing data. The missing values in con_pos, con_posText, and con_wins are fairly minimal. We can deal with these during recipe creation using a linear regression on other variables that measure a constructor’s progress throughout the race season, like con_pnts and con_sum_pnts.
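A minimal sketch of what that imputation step could look like using step_impute_linear() from recipes (shown on a standalone recipe here; it could equally be added to the modelling recipe later):
# illustrative: impute missing con_wins from the points columns
recipe(con_pos ~ con_pnts + con_sum_pnts + con_wins + season + round,
       data = merge_con3) %>%
  step_impute_linear(con_wins, impute_with = imp_vars(con_pnts, con_sum_pnts))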
Other than this, we don’t have a lot of missing data and the data set is pretty usable.
Now let’s work on merge_driv similarly.
vis_miss(merge_driv, warn_large_data = FALSE)
That’s a lot of missing data. There are some drivers with no constructor team! Let’s look at those first.
dim(merge_driv %>%
filter(is.na(constructorId)))[1]
## [1] 8646
dim(merge_driv)[1]
## [1] 34496
There are 8646 out of 34496 observations with missing values from constructorId through avg_pit and status. My guess is that when I joined data sets whose information is only available after a certain year, like qualifying, pit_stops, and even results, they conflicted with driver_standings, which has data going back to the beginning of F1, leaving empty spots in the variables only available after a certain date.
I don’t want to simply remove these observations, since they still have data for driv_standing and driv_wins. Some of the missing variables I would prefer to keep, like driv_positionOrder, driv_pnts, and driv_start_pos, are measures of each driver’s success at each race and throughout the season, similar to driv_standing and driv_wins. However, I think I will be forced to remove the rows with missing constructorId, because imputing a categorical variable like constructorId (a factor) would be troublesome.
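The merge_driv1 object visualized below comes from dropping exactly those rows; the step would look something like this:
# drop rows with a missing constructorId (a factor ID we can't sensibly impute)
merge_driv1 <- merge_driv %>%
  filter(!is.na(constructorId))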
vis_miss(merge_driv1)
Next, let’s deal with the missing driv_standing and driv_wins values so we can easily impute the other variables in the recipe later on.
dim(merge_driv1 %>%
filter(is.na(driv_standing)))[1]
## [1] 469
merge_driv2 <- merge_driv1 %>%
group_by(driverId, round) %>%
# replace missing driv_standing with the median of the standings the driver gets in that round across seasons
summarise(driv_standing_na = as.integer(median(driv_standing,na.rm=TRUE)))
merge_driv3 <- full_join(merge_driv1, merge_driv2, by=c("driverId","round"))
merge_driv3 <- merge_driv3 %>%
mutate(driv_standing =
case_when(is.na(driv_standing) ~ driv_standing_na,
!is.na(driv_standing) ~ driv_standing))
dim(merge_driv3 %>%
filter(is.na(driv_standing)))[1]
## [1] 91
By replacing a missing driv_standing with the median standing the driver achieved in that round number across all seasons, we reduce the missing standing values from 469 to 91. This is a decent effort to save some of our data. We can then omit the remaining 91 missing values.
# omit the rest of missing driv_standing
merge_driv4 <- merge_driv3 %>%
filter(!is.na(driv_standing))
vis_miss(merge_driv4, warn_large_data = FALSE)
We can work with this. It’ll get better when imputing during recipe creation.
Let’s start the exploratory analysis by taking a look at our possible response variables, con_pos and driv_pnts.
grid.arrange(plot_con, plot_driv, ncol=2)
range((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)
## [1] 1 22
range((merge_con3 %>%
filter(con_pos==tail(unique(sort((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)))) %>%
select(con_pos,con_name, season, race_name, circ_name))$season)
## [1] 1963 1991
range((merge_driv4 %>%
filter(!is.na(driv_pnts))
)$driv_pnts)
## [1] 0 50
The con_pos data ranges from 1 to 22. The constructors with very high position numbers (i.e. far down the standings) are pre-2000 (from 1963 to 1991), because nowadays there are only ten constructors each season. The range of driv_pnts, both in the data and in theory, is 0 to 50 (first place at the 2014 Abu Dhabi Grand Prix).
* At the 2014 Abu Dhabi Grand Prix, the FIA awarded double points. So, instead of the usual 25 down to 1, first through tenth place earned 50, 36, 30, 24, 20, 16, 12, 8, 4, and 2 points respectively. Understandably, double points were never awarded again.
Let’s look at some possible relationships between our variables.
# remove na, identification variables, only numeric
corrplot(cor(merge_con3 %>%
select_if(is.numeric)%>%
na.omit()), method="shade", diag=FALSE,addCoef.col=1,number.cex = 1, type="lower")
In the merge_con3 data set, we have a positive correlation between con_wins and con_pnts, which makes sense: the more points a constructor earns, the more likely it is that the constructor’s drivers won races. There is also a negative correlation between con_wins and con_pos, which is also intuitive, because the further a constructor is from the top of the standings, the less likely that constructor is to win races. round and con_sum_pnts are positively correlated, so the more races a constructor completes and the more of the race season that passes, the more points a constructor accumulates. On the other hand, round and con_pnts are not as positively correlated, because consistently earning more and more points per race as the season goes on requires quick progress and improvement, which is a difficult thing for an entire constructor team to do, even if it is what every team desires. The positive correlation between season and con_pnts/con_sum_pnts is expected because over time the FIA has increased not only the number of points awarded but also the number of finishing positions that earn points. We also have a negative correlation between con_pos and con_pnts/con_sum_pnts, for similar reasons as with con_wins and con_pos.
# remove na, q_time has too much na, and identification variables, only numeric
corrplot(cor(merge_driv4 %>%
select_if(is.numeric) %>%
na.omit()), method="shade",addCoef.col = 1,number.cex = 0.5, type="lower",diag=FALSE)
Most of the correlations between the merge_driv variables follow similar interpretations as those of the merge_con3 data set. What’s nice to look at are the variables I constructed myself. For example, avg_lap_pos is strongly positively correlated with driv_start_pos, which tells me that the average position a driver holds during a race is likely close to the position they started at. This is intuitively accurate: some races, like street races, make overtaking difficult, so a driver’s start position can have a huge impact on how their race plays out. The same applies to driv_standing. The strong positive correlation between driv_standing and avg_lap_pos arises because the more a driver spends their races as a back-marker (i.e. with a higher race position number), the more drivers there are between them and the top of the standings.
ggplot(merge_con3, aes(x=con_pnts, y=con_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(col="red", stat="smooth", level=FALSE)+
labs(title="Constructor Position in Standings vs. Constructor Points", y="Constructor position in standings throughout season", x="Constructor points earned at each race")
Understandably, the fewer points a constructor earns at a race, the more constructors sit between it and the top of the standings. The relationship here also does not seem linear. Once a constructor begins earning points, it becomes a bit easier to climb positions in the standings, since earning more points for yourself means your competition is earning fewer. The trend here looks roughly inverse exponential, but maybe I’m getting ahead of myself.
Let’s take a closer look at the relationship between con_sum_pnts and round, something I found interesting as it has to do with the progress and improvement of a constructor team throughout a race season.
ggplot(merge_con3, aes(x=round, y=con_sum_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="red")+
labs(title="Accumulated Constructor Points vs. Round", y="Constructor points accumulated throughout season", x="Round (number of race in order of season schedule)")
It is interesting to see that, for the majority of constructors, con_sum_pnts somewhat plateaus as round increases. You can also see the behaviour of the handful of championship-winning constructor teams that manage a steady increase in their accumulated points.
ggplot(merge_driv4, aes(x=avg_lap_pos, y=driv_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="green")+
labs(title="Driver Points vs. Average Lap Position", x="Average lap position", y="Driver's points earned at each race")
We can see that the back-markers, i.e. the drivers that spend most of their race at, say, position fifteen or greater, have very little chance of earning points. Most of the drivers with an average lap position of fifteen or greater are lined up at zero points earned. The reasons drivers spend most of their time out of the points vary from lacking race pace, to issues with the car, to poor tyre management, all of which keep them from making moves and overtaking their way into the points.
What’s also interesting is that the thick pile of zero-point earners extends all the way to the higher positions. It is not a shock for drivers who are consistently in top positions, even race leaders, to suddenly encounter a car failure and DNF from the race. It is entirely possible and is part of the charm of F1.
ggplot(merge_driv4, aes(x=driv_start_pos, y=avg_lap_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="blue")+
labs(title="Average Lap Position vs. Driver Start Position", x="Driver's starting position", y="Average lap position")
As the correlation plot suggested, we have a clear positive relationship between average lap position and driver start position. Since qualifying sessions determine the driver start position, this also supports the idea that qualifying and sprint qualifying have an incredible influence on the course of a race.
Code in file model_building.rda
Now we can start fitting models to see if we can predict con_pos and driv_pnts.
We will be splitting our data sets into a training set and a testing set. The majority of the data should go to the training set, which is what we build our model on. Having separate training and testing sets gives us a truer test of how well our model performs: we need to test the model on data it has not been trained on to obtain a realistic idea of where it stands, and this also helps avoid over-fitting. If we were to test the model on data it has been trained on, the results would look very accurate, but falsely so, because they would not be true predictions. Our testing data should be akin to a blind taste test for our model, where the training data builds the model’s understanding of how to judge what it will be tasting.
We will split our data so that 70% of it goes to training and 30% to testing. We also stratify on the outcome variables con_pos and driv_pnts to make sure the response is distributed similarly across the training and testing data.
# split into training and testing, strata on response variables
con_split <- initial_split(merge_con3, prop = 0.7, strata = con_pos)
con_train <- training(con_split)
con_test <- testing(con_split)
driv_split <- initial_split(merge_driv4, prop = 0.7, strata =driv_pnts)
driv_train <- training(driv_split)
driv_test <- testing(driv_split)
# check split proportions
nrow(con_train)/nrow(merge_con3)
nrow(con_test)/nrow(merge_con3)
nrow(driv_train)/nrow(merge_driv4)
nrow(driv_test)/nrow(merge_driv4)
We have to decide on a recipe, or formula, made up of our response variable and predictors, that can be reused across many types of models later. For the merge_con3 data set, con_pos is the response variable. I suspect that measures of points, home races (nationality), race order, year, and the specific circuit/track have an influence on con_pos. So, for predictors, I have decided on: constructorRef, con_pnts, con_wins, con_nation, season, round, circuitRef, and circ_country.
For merge_driv, the response variable is driv_pnts. For predictors, I want to include: driverRef, constructorRef, driv_positionText, driv_positionOrder, laps, fastestLapTime_rank, fastestLapSpeed, driv_start_pos, avg_lap_time, avg_lap_pos, driv_standing, driv_wins, season, round, circuitRef, circ_country, dob, nationality, and status.
I made sure to include only one of the pnts and sum_pnts variables, because they are calculated from one another. The same goes for fastestLapTime_rank and fastestLap, since one is a ranking of the other. Since driv_start_pos is itself the result of the qualifying sessions, q_time feels less useful; and because q_time is a decimal number of seconds and is more often missing than not, I think excluding it is preferable.
We will impute the missing values in continuous and dummy coded categorical variables.
# omit the small amount of missing values left
merge_con3 <- merge_con3 %>%
na.omit()
# CONSTRUCTOR RECIPE
con_rec <-
recipe(con_pos ~ constructorRef+con_pnts+con_wins+con_nation+season+
round+circuitRef+circ_country, data = con_train) %>%
# omit the small amount of missing values left
step_naomit(all_predictors())%>%
# there are hundreds of different constructors and circuits, so we need to collapse the less common ones into an "other" category
step_other(constructorRef, threshold=.1) %>%
step_other(con_nation, threshold=.1) %>%
step_other(circuitRef, threshold=.1) %>%
step_other(circ_country, threshold=.1) %>%
step_naomit()%>%
# dummy code categorical variables
step_dummy(all_nominal_predictors()) %>%
# remove variables (likely the dummy coded ones) that only contain a single value
step_zv(all_predictors())%>%
# step_novel(all_nominal_predictors())%>%
# step_unknown(all_nominal_predictors())%>%
# normalizing
step_center(all_predictors()) %>%
step_scale(all_predictors())
# prep and bake
prep(con_rec) %>%
bake(new_data = con_train)
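For completeness, an analogous driver recipe built from the predictors listed above might look something like this (a sketch only, since this side of the project is not pursued further below; the step_mutate conversions are an assumption about the character-typed columns):
# sketch of the analogous driver recipe (not tuned further in this project)
driv_rec <-
  recipe(driv_pnts ~ driverRef+constructorRef+driv_positionText+driv_positionOrder+
           laps+fastestLapTime_rank+fastestLapSpeed+driv_start_pos+avg_lap_time+
           avg_lap_pos+driv_standing+driv_wins+season+round+circuitRef+
           circ_country+dob+nationality+status, data = driv_train) %>%
  # fastestLapTime_rank and fastestLapSpeed are stored as character, so convert first
  # (dob appears as a year in merge_driv and is assumed to already be numeric)
  step_mutate(fastestLapTime_rank = as.numeric(fastestLapTime_rank),
              fastestLapSpeed = as.numeric(fastestLapSpeed)) %>%
  step_naomit(all_predictors()) %>%
  # collapse rare drivers, constructors, circuits, countries, and nationalities
  step_other(driverRef, constructorRef, circuitRef, circ_country, nationality, threshold=.1) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())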
We will also use k-fold cross-validation: the training data is split into k folds, and each fold takes a turn acting as the validation set while the remaining k-1 folds are used for fitting. The k validation sets are mutually exclusive, so the process is fitting and assessing k times before one final assessment on the actual testing split of the data. Let’s use 10 folds.
con_folds <- vfold_cv(con_train, v = 10)
# use 5 folds for the random forest, for the sake of time
con_folds2 <- vfold_cv(con_train, v = 5)
At this point I decided to only work on the constructor data set for the sake of time.
Our process for building each model follows the same steps: specify the model (and which parameters to tune), build a workflow combining the model with our recipe, set up a tuning grid, and tune the model across the folds.
# LINEAR MODEL
lm_model<- linear_reg(engine="lm")
# POLYNOMIAL REGRESSION
# tune degree on the numerical variables (degree on these variables will amplify their effect)
# ex. a change in points and wins will affect constructor position more than normal
con_polyrec <- con_rec %>%
step_poly(con_pnts,con_wins,season,round,degree=tune())
poly_model <- linear_reg(mode="regression",
engine="lm")
# K NEAREST NEIGHBORS
# tune neighbors
knn_model <- nearest_neighbor(neighbors = tune(),
mode="regression",
engine="kknn")
# ELASTIC NET LINEAR REGRESSION
# tune penalty and mixture
en_model <- linear_reg(mixture=tune(),
penalty=tune(),
mode="regression",
engine="glmnet")
# ELASTIC NET W/ LASSO
# tune penalty, set mixture to 1 for lasso penalty
en_lasso <- linear_reg(penalty=tune(),
mixture=1,
mode="regression",
engine="glmnet")
# ELASTIC NET W/ RIDGE
# tune penalty, set mixture to 0 for ridge penalty
en_ridge <- linear_reg(penalty=tune(),
mixture=0,
mode="regression",
engine="glmnet")
# RANDOM FOREST
# tune number of predictors (mtry), trees, and minimum number of values in each node (min_n)
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# LINEAR MODEL
con_lmwflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(con_rec)
# POLYNOMIAL REGRESSION
con_polywflow <- workflow() %>%
add_model(poly_model) %>%
add_recipe(con_polyrec)
# K NEAREST NEIGHBORS
con_knnwflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(con_rec)
# ELASTIC NET LINEAR REGRESSION
con_enwflow <- workflow() %>%
add_model(en_model) %>%
add_recipe(con_rec)
# ELASTIC NET W/ LASSO
con_lassowflow <- workflow() %>%
add_model(en_lasso) %>%
add_recipe(con_rec)
# ELASTIC NET W/ RIDGE
con_ridgewflow <- workflow() %>%
add_model(en_ridge) %>%
add_recipe(con_rec)
# RANDOM FOREST
con_rfwflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(con_rec)
# POLYNOMIAL REGRESSION
# range degree from 1 to 5
poly_grid <- grid_regular(degree(range=c(1,5)),
levels=5)
# K NEAREST NEIGHBORS
# range neighbors from 1 to 15
knn_grid <- grid_regular(neighbors(range=c(1,15)),
levels=5)
# ELASTIC NET LINEAR REGRESSION
en_grid <- grid_regular(penalty(range=c(0.01,3), trans=identity_trans()),
mixture(range=c(0,1)),
levels=10)
# ELASTIC NET W/ LASSO and
# ELASTIC NET W/ RIDGE
lasso_ridge_grid <- grid_regular(penalty(range=c(0.01,3),
trans=identity_trans()), levels=10)
# RANDOM FOREST
# predictors (mtry) range depend on the recipe
con_rfgrid <- grid_regular(mtry(range=c(1,9)),
trees(range=c(200,400)),
min_n(range=c(10,20)),
levels=5)
# POLYNOMIAL REGRESSION
con_polytune <- tune_grid(
con_polywflow,
resamples = con_folds,
grid = poly_grid)
# K NEAREST NEIGHBORS
con_knntune <- tune_grid(
con_knnwflow,
resamples = con_folds,
grid = knn_grid)
# ELASTIC NET
con_entune <- tune_grid(
con_enwflow,
resamples = con_folds,
grid = en_grid)
# RIDGE REGRESSION
con_ridgetune <- tune_grid(
con_ridgewflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# LASSO REGRESSION
con_lassotune <- tune_grid(
con_lassowflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# RANDOM FOREST
con_rftune <- tune_grid(
con_rfwflow,
resamples = con_folds2,
grid = con_rfgrid)
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_building_final.rda")
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_results_final.rda")
lm_rmse <- collect_metrics(con_lmfit) %>%
slice(1)
ridge_rmse <- collect_metrics(con_ridgetune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# LASSO REGRESSION
lasso_rmse <- collect_metrics(con_lassotune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# POLYNOMIAL REGRESSION
poly_rmse <- collect_metrics(con_polytune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# K NEAREST NEIGHBORS
knn_rmse <- collect_metrics(con_knntune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# ELASTIC NET
elastic_rmse <- collect_metrics(con_entune) %>%
arrange(mean)%>%
filter(.metric=="rmse") %>%
slice(1)
lm_rmse
## # A tibble: 1 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 3.45 10 0.0155 Preprocessor1_Model1
ridge_rmse
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 rmse standard 3.45 10 0.0157 Preprocessor1_Model01
lasso_rmse
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 rmse standard 3.45 10 0.0157 Preprocessor1_Model01
poly_rmse
## # A tibble: 1 x 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 rmse standard 2.82 10 0.0264 Preprocessor5_Model1
knn_rmse
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 rmse standard 2.65 10 0.0212 Preprocessor1_Model5
elastic_rmse
## # A tibble: 1 x 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 1 rmse standard 3.45 10 0.0157 Preprocessor1_Model091
rf_rmse
## # A tibble: 1 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 350 20 rmse standard 2.37 5 0.0162 Preprocessor1_Model1~
Code in file model_results.rda
Let’s compare the models from before that minimize RMSE to find the best-performing one overall.
The model that minimizes RMSE the most is the random forest model with mtry = 4, trees = 350, and min_n = 20, which achieves a mean RMSE of 2.370339 across 5 folds.
autoplot(con_entune, metric="rmse")
The models with the smallest penalty minimize RMSE the most; as the penalty increases, our model performs increasingly worse. This makes sense, because a high penalty shrinks the coefficients on our predictors, effectively downplaying the effect the predictors have on our response. Also, the discrepancies in performance between models with different mixtures are small, which is consistent with our ridge regression model (mixture = 0) performing with only a slightly higher RMSE of 3.451871 than our lasso regression model (mixture = 1, RMSE of 3.449157).
autoplot(con_polytune, metric="rmse")
As the degree on our continuous variables increases from 1 to 5, the RMSE decreases steadily. This aligned with my expectations: our data, like con_pnts and con_wins, is built from individual points and wins, where small amounts can make the difference in determining a constructor’s place in the standings. Oftentimes a constructor’s position in the standings is only settled at the last race, because competition lasts the entire season. So, by adding degrees, we amplify the differences between our data points, and the effect of each predictor becomes more influential. A degree of 5 performs best here, and I would guess an even higher degree could lower the RMSE further.
autoplot(con_knntune, metric="rmse")
Similarly, the knn model performs increasingly better as the number of neighbors increases. Our knn model with 15 neighbors minimized RMSE the most out of everything except the random forest model, which is not entirely surprising. The non-linear models, like knn and random forest, performed the best, so the relationships in our data may well not be linear.
autoplot(con_rftune, metric="rmse")
Focusing on how the tuning parameters affect RMSE, it looks like minimal node size and number of trees make almost no difference. However, the number of randomly selected predictors, mtry, minimizes RMSE the most when it is larger (roughly 2 to 8), with a slight dip in RMSE across models around mtry = 5. We would not want mtry to include every available predictor, because then every tree considers the same candidate splits, the trees lose their independence, and we essentially end up with bagging rather than a random forest.
# parameters of best model: mtry=4, trees=350, min_n=20
rf_rmse
## # A tibble: 1 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 350 20 rmse standard 2.37 5 0.0162 Preprocessor1_Model1~
# fit best model to training
best_rf_train <- select_best(con_rftune, metric="rmse")
# finalize workflow
final_model_wflow <- finalize_workflow(con_rfwflow, best_rf_train)
final_model_fit <- fit(final_model_wflow, data=con_train)
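Since we set importance = "impurity" in the ranger engine, we can also take a quick look at which predictors the finalized random forest relies on, using the vip package loaded earlier:
# variable importance of the finalized random forest fit
final_model_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)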
Now let’s apply the model on testing data.
con_tibble <- predict(final_model_fit, new_data = con_test %>% select(-con_pos))
con_tibble <- bind_cols(con_tibble, con_test %>% select(con_pos))
con_metric <- metric_set(rmse)
con_tibble_metric <- con_metric(con_tibble, truth=con_pos, estimate=.pred)
con_tibble_metric
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.36
The model performed even better on the testing data, achieving an RMSE of 2.361256. Since RMSE measures how far predicted outcome values are from actual outcome values, and our response variable con_pos ranges from 1 to the low twenties, an RMSE of 2.361256 is not too bad. Our predicted standing positions might be a few places off, which is not ideal, but it’s manageable.
The best-performing model is the random forest model, while the more linear, simple models, like the linear regression and elastic net models, performed poorly in comparison. I was not entirely surprised: intuitively, predicting constructor standings does not seem like a linear trend. There are a lot of factors that come into play, many of which aren’t even included or properly represented in our data here, like more detailed qualifying data, grid place penalties from the FIA, ongoing car issues, etc. With all this in consideration, I did not think a basic linear model could cover what we needed.
As for next steps, I will probably spend a bit more time carefully going over everything, redo the random forest with a proper number of folds and wider ranges for the tuning parameters, spend more time manipulating the data, and perhaps even include data for our rookie drivers.
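As a rough sketch of what the eventual 2023 prediction step could look like, the finalized workflow would be applied to a constructed 2023 schedule (new_2023 below is hypothetical and would need the same predictor columns as con_train):
# hypothetical: score a constructed 2023 calendar and rank constructors
# new_2023 would hold one row per constructor per round with the same predictors
# as con_train (constructorRef, con_pnts, con_wins, con_nation, season, round,
# circuitRef, circ_country)
predict(final_model_fit, new_data = new_2023) %>%
  bind_cols(new_2023 %>% select(constructorRef)) %>%
  group_by(constructorRef) %>%
  summarise(avg_pred_pos = mean(.pred)) %>%
  arrange(avg_pred_pos)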
Overall, this project was successful and a great challenge in compiling data and applying something I am genuinely interested in and passionate about to academic skills I’ve only recently learned in this class.
Data comes from the Ergast Developer API.
Information about Formula 1 and the FIA comes from my own personal knowledge, the Official F1 Website, and Wikipedia.