The goal of this project is to build a machine learning model to predict the Formula One World Constructors’ Championship Standings for the upcoming 2023 season.
Formula One, or F1 for short, is open-wheel, single-seater car racing at the highest level. There are ten constructors, or racing teams, each tasked with building an F1 car before the annual racing season. During each season, the ten constructors compete with their newly launched cars for a position from first to tenth in the World Constructors’ Championship Standings. Each constructor employs two drivers, who in turn compete for a place from first, i.e. World Champion, to twentieth in the World Drivers’ Championship Standings. The sport is overseen and regulated by the Fédération Internationale de l’Automobile (FIA).
Racing occurs over a weekend: Fridays are for free practice, Saturdays are for qualifying, and Sundays are race days. During a race, finishing positions first through tenth earn 25, 18, 15, 12, 10, 8, 6, 4, 2, and 1 points respectively. The number of races, or grands prix, varies from year to year. The upcoming 2023 season has 23 races on the calendar, compared to 22 on the 2022 calendar. Some races are new, and some drivers are new, which may pose some difficulty later on, so it is best we mention it now.
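As a quick reference, the points system is easy to encode directly; here is a small sketch (the example finishing positions are made up):
# points awarded for finishing positions 1 through 10 (0 otherwise)
f1_points <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)
points_for_position <- function(pos) {
  pts <- rep(0, length(pos))
  idx <- pos >= 1 & pos <= 10
  pts[idx] <- f1_points[pos[idx]]
  pts
}
# example: a driver finishing 2nd, 5th, and 11th across three races
sum(points_for_position(c(2, 5, 11)))   # 18 + 10 + 0 = 28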
What is Free Practice?
On Fridays, constructor teams put out their two drivers on the circuit to collect data about the car, get a feel of how the car handles the track, and in general obtain as much practice and knowledge to prepare and perform during qualifying and race day.
How Qualifying Works:
On Saturdays, drivers have an allotted time in Q1 to record a fast lap; the five slowest drivers are knocked out and take grid positions 16 to 20. In Q2, the remaining fifteen drivers have another allotted time to record a fast lap, and the five slowest of them are knocked out, taking positions 11 to 15. In Q3, the final ten drivers record a last fast lap to earn a grid position from 1 to 10. The position a driver holds when they are knocked out, or at the end of Q3, determines where they will start on the grid on Sunday race day.
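To make the knockout format concrete, here is a small sketch in R that mimics the three sessions with made-up lap times (the driver names and times are entirely hypothetical):
# hypothetical fastest laps (seconds) for 20 drivers in Q1
set.seed(1)
q1 <- data.frame(driver = paste0("driver_", 1:20),
                 q1_time = round(runif(20, 88, 92), 3))
q1 <- q1[order(q1$q1_time), ]
grid_16_20 <- q1$driver[16:20]   # slowest five knocked out; they start 16th-20th
q2 <- q1[1:15, ]
q2$q2_time <- round(runif(15, 87.5, 91), 3)
q2 <- q2[order(q2$q2_time), ]
grid_11_15 <- q2$driver[11:15]   # next five knocked out; they start 11th-15th
q3 <- q2[1:10, ]
q3$q3_time <- round(runif(10, 87, 90.5), 3)
grid_1_10 <- q3$driver[order(q3$q3_time)]   # Q3 order sets grid positions 1-10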
Formula One, an inherently European-rooted sport, has recently gained immense popularity in the United States, partially due to the Netflix show “Drive to Survive” covering the motorsport. While the show itself is highly dramatized and at times an inaccurate representation of actual race events, I, like many others, developed an interest in the sport and began watching archived races, qualifying sessions, and more. It is well known among fans and followers of the sport that racing teams with higher budgets and access to richer resources tend to fare better in the Constructors’ Championship Standings. For many years, wealthy and heavily sponsored constructors like Mercedes, Ferrari, and Red Bull have occupied the higher ranks of the World Championship. However, in recent years the FIA has instituted a budget cap and other regulations in hopes of leveling the playing field. With the 2023 season about to kick off, I want to predict its results, not only for my own fun, but also to see whether the FIA’s regulations are contributing to the beginning of a new era of competition in F1 racing.
We are going to begin this project by just taking a general look at the data and consider what we need to do to construct a data set we can work with, i.e. missing data, useful and/or not useful variables, necessary key/ID values, manipulating and/or converting variables, etc.
To obtain the data, I will be using the Ergast Developer API, which documents F1 statistics since the beginning of the World Championship in 1950. However, for convenience I will be downloading the csv files from the API website instead of pulling data through the API itself. I may also pull further information for certain variables from the official F1 website archive.
Each csv file is its own data set, and some act as lookup tables for the others. The task is to combine these data sets efficiently into a single one before any analysis or model building.
Code in file read_data.rda
Let’s begin by loading our relevant packages, setting a seed with set.seed, reading in the csv files we downloaded from the API website, and taking a quick look at some of our variables.
library(tidyverse)
library(tidymodels)
library(kknn)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(kableExtra)
library(parsnip)
library(recipes)
library(magrittr)
library(workflows)
library(glmnet)
library(themis)
library(ranger)
library(vip)
library(naniar)
library(visdat)
library(dplyr)
library(ISLR)
library(gridExtra) # needed later for grid.arrange
set.seed(2536)
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//read_data.rda")
Here is a general summary of what each data set involves:
circuit - uniquely identified circuits with their locations
races - uniquely identified races with their round number and locations since 1950
seasons - list of years of race seasons
results - collection of results for each driver at each race
drivers - uniquely identified drivers since 1950
driver_standings - uniquely identified record of position in the Drivers’ Championship Standings at the result of a certain race
constructors - uniquely identified constructor teams since 1950
constructor_results - uniquely identified record of points earned by some constructor at a certain race
constructor_standings - uniquely identified record of position in the Constructors’ Championship Standings at the result of a certain race
status - lookup table of status key/value pairs
lap_times - all lap times for each driver at each race
pit_stops - pit stop times for each driver at each race
qualifying - qualifying results for each race
sprint_results - sprint qualifying results for each driver at certain races that included sprint qualifying
Here are some things to consider:
Since a constructor’s standing is determined by how many points they’ve accumulated throughout the race season relative to the other teams, it would be ideal to base our response on a points variable. Also, the number of points a constructor obtains depends on how many points its drivers are able to obtain at races. So, the response variable can come from and/or be calculated from the results, constructor_results, constructor_standings, or driver_standings data sets.
There may be missing data because three rookie drivers are entering the 2023 F1 season: Oscar Piastri, Nyck de Vries, and Logan Sargeant. It would be hard to predict the next season with no F1 data on these three drivers. I will probably take data from previous Formula Two seasons for these drivers, because F2 races concurrently with the F1 season on the same circuits. We can treat the number of points earned by a driver as a measurement of how “good” that driver is. However, since the points a constructor earns is the sum of the points its two drivers earn, we may have some structural collinearity between driver points and constructor points. Drivers also change constructors fairly often, so it is worth being cautious here.
Some constructor teams have changed their names over the years (e.g. Force India became Racing Point Force India, then Racing Point, then Aston Martin), so their data will be scattered across the data set. Despite the different names, we group these “different, but same” teams together because important factors like car part suppliers, key partnerships, etc. usually stay the same regardless of the official name change. So, we will have to align the different team names under one name, preferably the most recent and relevant one.
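A minimal sketch of that alignment using case_when (the reference strings here are illustrative and would need checking against the actual constructorRef values):
# illustrative only: collapse older team identities onto their current name
constructors <- constructors %>%
  mutate(constructorRef = case_when(
    constructorRef %in% c("force_india", "racing_point") ~ "aston_martin",
    constructorRef %in% c("toro_rosso") ~ "alphatauri",
    TRUE ~ constructorRef))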
levels(factor(constructor_results$status))
## [1] "\\N" "D"
As we can see, the status variable of the constructor_results data set has two levels: \\N (missing) and D (disqualified).
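Since \\N is just a placeholder for a missing value, one option (not applied here, but worth noting) would be to recode it as a proper NA:
# recode the "\\N" placeholder as a true missing value (optional)
constructor_results <- constructor_results %>%
  mutate(status = na_if(status, "\\N"))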
# merge constructor names w/ results data set on Id variable
merge_test <- full_join(constructors[,c(1,3)], constructor_results, by="constructorId")
merge_test <- full_join(races[,c(1:3)], merge_test, by="raceId")
merge_test %>%
filter(status == "D")
## raceId year round constructorId name constructorResultsId points status
## 1 36 2007 1 1 McLaren 186 14 D
## 2 37 2007 2 1 McLaren 196 18 D
## 3 38 2007 3 1 McLaren 208 12 D
## 4 39 2007 4 1 McLaren 219 14 D
## 5 40 2007 5 1 McLaren 229 18 D
## 6 41 2007 6 1 McLaren 240 12 D
## 7 42 2007 7 1 McLaren 251 18 D
## 8 43 2007 8 1 McLaren 263 8 D
## 9 44 2007 9 1 McLaren 274 14 D
## 10 45 2007 10 1 McLaren 284 10 D
## 11 46 2007 11 1 McLaren 295 15 D
## 12 47 2007 12 1 McLaren 307 10 D
## 13 48 2007 13 1 McLaren 317 18 D
## 14 49 2007 14 1 McLaren 329 11 D
## 15 50 2007 15 1 McLaren 339 10 D
## 16 51 2007 16 1 McLaren 351 8 D
## 17 52 2007 17 1 McLaren 362 8 D
Here we discover that these rows reflect McLaren being disqualified from the 2007 Constructors’ Championship because of the 2007 espionage controversy. This will come into play later.
Code in file modify_data.rda
The first thing we do in modify_data.rda is remove the url column from the following data sets: constructors, drivers, circuit, races, and seasons.
As I mentioned before, we have three rookie drivers for the 2023 F1 season. If we take a look at our data, we will notice that these drivers are actually already entered in the drivers data set, with their own driverId and other values as well. So, they are also in merge_driv, albeit with mostly NA missing values.
drivers %>%
filter(driverRef == "piastri" | driverRef == "sargeant" | forename=="Nyck")
## driverId driverRef number code forename surname dob nationality
## 1 856 de_vries 45 DEV Nyck de Vries 1995-02-06 Dutch
## 2 857 piastri 81 PIA Oscar Piastri 2001-04-06 Australian
## 3 858 sargeant 2 SAR Logan Sargeant 2000-12-31 American
## url
## 1 http://en.wikipedia.org/wiki/Nyck_de_Vries
## 2 http://en.wikipedia.org/wiki/Oscar_Piastri
## 3 http://en.wikipedia.org/wiki/Logan_Sargeant
# Nyck de Vries completed one race as a reserve driver for Williams
merge_driv %>%
filter(driverId %in% c(856:858))
## raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1 1089 856 3 9 9 2
## 2 1091 856 NA <NA> NA NA
## 3 1092 856 NA <NA> NA NA
## 4 1093 856 NA <NA> NA NA
## 5 1094 856 NA <NA> NA NA
## 6 1095 856 NA <NA> NA NA
## 7 1096 856 NA <NA> NA NA
## 8 1098 857 NA <NA> NA NA
## 9 1098 858 NA <NA> NA NA
## 10 1098 856 NA <NA> NA NA
## laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1 53 41 13 86.624 240.750
## 2 NA <NA> <NA> NA <NA>
## 3 NA <NA> <NA> NA <NA>
## 4 NA <NA> <NA> NA <NA>
## 5 NA <NA> <NA> NA <NA>
## 6 NA <NA> <NA> NA <NA>
## 7 NA <NA> <NA> NA <NA>
## 8 NA <NA> <NA> NA <NA>
## 9 NA <NA> <NA> NA <NA>
## 10 NA <NA> <NA> NA <NA>
## driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1 1 82.471 13 91.21949 10 24.628
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## 7 NA NA NA NA NA NA
## 8 NA NA NA NA NA NA
## 9 NA NA NA NA NA NA
## 10 NA NA NA NA NA NA
## sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1 2 20 20 0 2022 16
## 2 2 20 20 0 2022 17
## 3 2 21 21 0 2022 18
## 4 2 21 21 0 2022 19
## 5 2 21 21 0 2022 20
## 6 2 21 21 0 2022 21
## 7 2 21 21 0 2022 22
## 8 0 12 12 0 2023 1
## 9 0 15 15 0 2023 1
## 10 0 19 19 0 2023 1
## circuitId race_name circ_country circuitRef
## 1 14 Italian Grand Prix Italy monza
## 2 15 Singapore Grand Prix Singapore marina_bay
## 3 22 Japanese Grand Prix Japan suzuka
## 4 69 United States Grand Prix USA americas
## 5 32 Mexico City Grand Prix Mexico rodriguez
## 6 18 Brazilian Grand Prix Brazil interlagos
## 7 24 Abu Dhabi Grand Prix UAE yas_marina
## 8 3 Bahrain Grand Prix Bahrain bahrain
## 9 3 Bahrain Grand Prix Bahrain bahrain
## 10 3 Bahrain Grand Prix Bahrain bahrain
## circ_name driverRef forename surname dob nationality
## 1 Autodromo Nazionale di Monza de_vries Nyck de Vries 1995 Dutch
## 2 Marina Bay Street Circuit de_vries Nyck de Vries 1995 Dutch
## 3 Suzuka Circuit de_vries Nyck de Vries 1995 Dutch
## 4 Circuit of the Americas de_vries Nyck de Vries 1995 Dutch
## 5 Autódromo Hermanos RodrÃguez de_vries Nyck de Vries 1995 Dutch
## 6 Autódromo José Carlos Pace de_vries Nyck de Vries 1995 Dutch
## 7 Yas Marina Circuit de_vries Nyck de Vries 1995 Dutch
## 8 Bahrain International Circuit piastri Oscar Piastri 2001 Australian
## 9 Bahrain International Circuit sargeant Logan Sargeant 2000 American
## 10 Bahrain International Circuit de_vries Nyck de Vries 1995 Dutch
## status
## 1 Finished
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 <NA>
## 10 <NA>
I did consider adding F2 and F3 data to our data set to fill out these three drivers. However, I am now against the idea: while there are similarities between F2, F3, and F1, the race weekends are quite different, and I’m afraid adding certain values while being forced to omit others that don’t exist in F2 and F3 would create even more missing data.
There isn’t a resource from which I could pull this data cleanly and have it combine nicely with the data set we have now, so I will continue without adding extra data, for the sake of time.
Now let’s take a look at our two merged data sets: merge_con and merge_driv.
dim(merge_con)[1]
## [1] 13177
dim(merge_driv)[1]
## [1] 34496
Considering the 20,000+ difference in observations, which is expected since there are more unique drivers than constructors, I am against merging merge_con and merge_driv. At this point of the project, my plan of action is changing a little. Instead of trying to consolidate all my data into a single data set and modelling off of that, I am seriously considering modelling off of two data sets:
Use merge_con to predict the response variable con_pos, using predictors specific to the effect of each individual constructor.
Use merge_driv to predict driv_pnts, using predictors specific to the effect of each individual driver. We can predict driv_pnts for our 2023 drivers, sum the predictions for the drivers on each 2023 constructor team, and rank the constructors accordingly (see the sketch below).
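That last aggregation step would be simple; a sketch, assuming a hypothetical data frame pred_2023 with one row per 2023 driver, a constructor_2023 column, and predicted points in .pred:
# hypothetical: rank 2023 constructors by summing their drivers' predicted points
pred_2023 %>%
  group_by(constructor_2023) %>%
  summarise(pred_con_pnts = sum(.pred)) %>%
  arrange(desc(pred_con_pnts)) %>%
  mutate(pred_con_pos = row_number())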
I think this plan is definitely doable, and it is not too far off from our original plan. We could just use merge_con and proceed with our modelling, but I feel there is so much potential in the merge_driv data that it would be a shame not to use it. However, with the lack of data around our three rookie drivers, the merge_con approach may fare better for our prediction goal.
Here are our two finalized data sets, as well as a codebook for each:
head(merge_con)
## raceId constructorId con_pnts con_sum_pnts con_pos con_posText con_wins
## 1 18 1 14 14 1 1 1
## 2 18 2 8 8 3 3 0
## 3 18 3 9 9 2 2 0
## 4 18 4 5 5 4 4 0
## 5 18 5 2 2 5 5 0
## 6 18 6 1 1 6 6 0
## constructorRef con_name con_nation season round circuitId
## 1 mclaren McLaren British 2008 1 1
## 2 bmw_sauber BMW Sauber German 2008 1 1
## 3 williams Williams British 2008 1 1
## 4 renault Renault French 2008 1 1
## 5 toro_rosso Toro Rosso Italian 2008 1 1
## 6 ferrari Ferrari Italian 2008 1 1
## race_name circ_country circuitRef circ_name
## 1 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 2 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 3 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 4 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 5 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
## 6 Australian Grand Prix Australia albert_park Albert Park Grand Prix Circuit
merge_con Codebook
raceId
- unique race identifier; numeric
constructorId
- unique constructor identifier; numeric
con_pnts
- number of points a constructor earns at a certain race; numeric
con_sum_pnts
- total number of points a constructor earned so far in the race season including that certain race; numeric
con_pos
- a constructor’s position in the Constructors’ Championship Standings at the point of that certain race during the race season; numeric
con_posText
- con_pos as text; character
con_wins
- number of wins a constructor earned so far in the race season including that certain race; numeric
constructorRef
- constructor reference name; character
con_name
- constructor name; character
con_nation
- nationality of constructor team; character
season
- year of race season; numeric
round
- race order (1 = first race, 2 = second, etc.) in the race season; numeric
circuitId
- unique circuit identifier; numeric
race_name
- name of grand prix; character
circ_country
- country circuit is located in; character
circuitRef
- circuit reference name; character
circ_name
- name of circuit; character
head(merge_driv)
## raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1 18 1 1 1 1 10
## 2 18 2 2 2 2 8
## 3 18 3 3 3 3 6
## 4 18 4 4 4 4 5
## 5 18 5 1 5 5 4
## 6 18 6 3 6 6 3
## laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1 58 39 2 87.452 218.300
## 2 58 41 3 87.739 217.586
## 3 58 41 5 88.090 216.719
## 4 58 58 7 88.603 215.464
## 5 58 43 1 87.418 218.385
## 6 57 50 14 89.639 212.974
## driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1 1 85.187 1 98.11407 1 NA
## 2 1 85.518 5 98.20852 4 NA
## 3 1 86.059 7 98.25481 4 NA
## 4 1 86.188 12 98.41029 8 NA
## 5 1 85.452 3 98.42466 3 NA
## 6 11 86.413 14 100.08219 11 NA
## sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1 10 1 1 1 2008 1
## 2 8 2 2 0 2008 1
## 3 6 3 3 0 2008 1
## 4 5 4 4 0 2008 1
## 5 4 5 5 0 2008 1
## 6 3 6 6 0 2008 1
## circuitId race_name circ_country circuitRef
## 1 1 Australian Grand Prix Australia albert_park
## 2 1 Australian Grand Prix Australia albert_park
## 3 1 Australian Grand Prix Australia albert_park
## 4 1 Australian Grand Prix Australia albert_park
## 5 1 Australian Grand Prix Australia albert_park
## 6 1 Australian Grand Prix Australia albert_park
## circ_name driverRef forename surname dob
## 1 Albert Park Grand Prix Circuit hamilton Lewis Hamilton 1985
## 2 Albert Park Grand Prix Circuit heidfeld Nick Heidfeld 1977
## 3 Albert Park Grand Prix Circuit rosberg Nico Rosberg 1985
## 4 Albert Park Grand Prix Circuit alonso Fernando Alonso 1981
## 5 Albert Park Grand Prix Circuit kovalainen Heikki Kovalainen 1981
## 6 Albert Park Grand Prix Circuit nakajima Kazuki Nakajima 1985
## nationality status
## 1 British Finished
## 2 German Finished
## 3 German Finished
## 4 Spanish Finished
## 5 Finnish Finished
## 6 Japanese +1 Lap
merge_driv Codebook
raceId
- unique race identifier; numeric
driverId
- unique driver identifier; numeric
constructorId
- unique constructor identifier; numeric
driv_positionText
- a driver’s finishing race position at that certain race; factor; includes the following:
D
- disqualified
E
- excluded
F
- failed to qualify
N
- not classified
R
- retired
W
- withdrawn
driv_positionOrder
- a driver’s place in the finishing order of that certain race; numeric
driv_pnts
- number of points a driver earns at a certain race; numeric
laps
- number of laps a driver completed in the race; numeric
fastestLap
- the lap number during which a driver drove their fastest lap during the race; character
fastestLapTime_rank
- the rank of a driver’s fastest lap time among that of the other drivers during the race; character
fastestLapTime
- a driver’s fastest lap time during the race in seconds; numeric
fastestLapSpeed
- fastest speed reached during a driver’s fastest lap; character
driv_statusId
- unique status identifier for the condition in which a driver finished their race; numeric
q_time
- fastest qualifying fast lap time (out of Q1, Q2, and Q3) that a driver completed during that race weekend’s qualifying session; numeric
driv_start_pos
- a driver’s starting grid position for that certain race; numeric
avg_lap_time
- average of all of a driver’s lap times during a race; numeric
avg_lap_pos
- rounded average of all positions a driver was at during a race; numeric
avg_pit
- average pit stop duration of a driver during a race; numeric
sum_driv_pnts
- total number of points a driver earned so far in the race season including that certain race; numeric
driv_standing
- a driver’s position in the Drivers’ Championship Standings at the point of that certain race during the race season; numeric
driv_standingText
- driv_standing as text; character; also includes the following:
D
- disqualified
driv_wins
- number of wins a driver earned so far in the race season including that certain race; numeric
status
- the condition in which a driver finished their race; factor
Code in file eda.rda
Let’s take a look at the missing data in merge_con.
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//eda.rda")
vis_miss(merge_con)
Let’s arrange each constructor’s results in order throughout each race season and fill in con_pnts and con_sum_pnts using each other (since con_sum_pnts is an accumulation of con_pnts throughout the season).
merge_con1 <- merge_con %>%
# arrange each constructor's result in order throughout season
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
# replace con_sum_pnts missing values with the actual value using the other con_pnts column
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
# vice versa
merge_con2 <- merge_con1 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_pnts = case_when((is.na(con_pnts) & season==lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts-lag(con_sum_pnts),
(is.na(con_pnts) & season!=lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts,
is.na(con_pnts)!=TRUE ~ con_pnts))
# and again the first direction for good measure
merge_con3 <- merge_con2 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
vis_miss(merge_con3)
merge_con3$constructorRef <- factor(merge_con3$constructorRef)
After filling in con_pnts and con_sum_pnts, we have a lot less missing data. The missing values in con_pos, con_posText, and con_wins are fairly minimal. We can deal with these during recipe creation using a linear regression on other variables that measure a constructor’s progress throughout the race season, like con_pnts and con_sum_pnts.
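A minimal sketch of what that imputation step could look like using step_impute_linear() from recipes (shown on a standalone recipe here; it could equally be added to the modelling recipe later):
# illustrative: impute missing con_wins from the points columns
recipe(con_pos ~ con_pnts + con_sum_pnts + con_wins + season + round,
       data = merge_con3) %>%
  step_impute_linear(con_wins, impute_with = imp_vars(con_pnts, con_sum_pnts))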
Other than this, we don’t have a lot of missing data and the data set is pretty usable.
Now let’s work on merge_driv similarly.
vis_miss(merge_driv, warn_large_data = FALSE)
That’s a lot of missing data. There are some drivers with no constructor team! Let’s look at those first.
dim(merge_driv %>%
filter(is.na(constructorId)))[1]
## [1] 8646
dim(merge_driv)[1]
## [1] 34496
There are 8646 out of 34496 observations with missing values from constructorId through avg_pit and status. My guess is that when I joined data sets whose information is only available after a certain year, like qualifying, pit_stops, and even results, they conflicted with driver_standings, which has data going back to the beginning of F1, leaving empty spots in the variables only available after a certain date.
I don’t want to simply remove these observations, since they still have data for driv_standing and driv_wins. Some of the missing variables I would prefer to keep, like driv_positionOrder, driv_pnts, and driv_start_pos, are measures of each driver’s success at each race and throughout the season, similar to driv_standing and driv_wins. However, I think I will be forced to remove the rows with missing constructorId, because imputing a categorical variable like constructorId (a factor) would be troublesome.
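The merge_driv1 object visualized below comes from dropping exactly those rows; the step would look something like this:
# drop rows with a missing constructorId (a factor ID we can't sensibly impute)
merge_driv1 <- merge_driv %>%
  filter(!is.na(constructorId))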
vis_miss(merge_driv1)
Next, let’s deal with the missing driv_standing and driv_wins values so we can easily impute the other variables in the recipe later on.
dim(merge_driv1 %>%
filter(is.na(driv_standing)))[1]
## [1] 469
merge_driv2 <- merge_driv1 %>%
group_by(driverId, round) %>%
# replace missing driv_standing with the median of the standings the driver gets in that round across seasons
summarise(driv_standing_na = as.integer(median(driv_standing,na.rm=TRUE)))
merge_driv3 <- full_join(merge_driv1, merge_driv2, by=c("driverId","round"))
merge_driv3 <- merge_driv3 %>%
mutate(driv_standing =
case_when(is.na(driv_standing) ~ driv_standing_na,
!is.na(driv_standing) ~ driv_standing))
dim(merge_driv3 %>%
filter(is.na(driv_standing)))[1]
## [1] 91
By replacing a missing driv_standing with the median standing the driver achieved in that round number across all seasons, we reduce the missing standing values from 469 to 91. This is a decent effort to save some of our data. We can then omit the remaining 91 missing values.
# omit the rest of missing driv_standing
merge_driv4 <- merge_driv3 %>%
filter(!is.na(driv_standing))
vis_miss(merge_driv4, warn_large_data = FALSE)
We can work with this. It’ll get better when imputing during recipe creation.
Let’s start the exploratory analysis by taking a look at our possible response variables, con_pos and driv_pnts.
grid.arrange(plot_con, plot_driv, ncol=2)
range((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)
## [1] 1 22
range((merge_con3 %>%
filter(con_pos==tail(unique(sort((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)))) %>%
select(con_pos,con_name, season, race_name, circ_name))$season)
## [1] 1963 1991
range((merge_driv4 %>%
filter(!is.na(driv_pnts))
)$driv_pnts)
## [1] 0 50
The con_pos data ranges from 1 to 22. The constructors with very high position numbers (i.e. far down the standings) are pre-2000 (from 1963 to 1991), because nowadays there are only ten constructors each season. The range of driv_pnts, both in the data and in theory, is 0 to 50 (first place at the 2014 Abu Dhabi Grand Prix).
* At the 2014 Abu Dhabi Grand Prix, the FIA awarded double points. So, instead of the usual 25 down to 1, first through tenth place earned 50, 36, 30, 24, 20, 16, 12, 8, 4, and 2 points respectively. Understandably, double points were never awarded again.
Let’s look at some possible relationships between our variables.
# remove na, identification variables, only numeric
corrplot(cor(merge_con3 %>%
select_if(is.numeric)%>%
na.omit()), method="shade", diag=FALSE,addCoef.col=1,number.cex = 1, type="lower")
In the merge_con3 data set, we have a positive correlation between con_wins and con_pnts, which makes sense: the more points a constructor earns, the more likely it is that the constructor’s drivers won races. There is also a negative correlation between con_wins and con_pos, which is also intuitive, because the further a constructor is from the top of the standings, the less likely that constructor is to win races. round and con_sum_pnts are positively correlated, so the more races a constructor completes and the more of the race season that passes, the more points a constructor accumulates. On the other hand, round and con_pnts are not as positively correlated, because consistently earning more and more points per race as the season goes on requires quick progress and improvement, which is a difficult thing for an entire constructor team to do, even if it is what every team desires. The positive correlation between season and con_pnts/con_sum_pnts is expected because over time the FIA has increased not only the number of points awarded but also the number of finishing positions that earn points. We also have a negative correlation between con_pos and con_pnts/con_sum_pnts, for similar reasons as with con_wins and con_pos.
# remove na, q_time has too much na, and identification variables, only numeric
corrplot(cor(merge_driv4 %>%
select_if(is.numeric) %>%
na.omit()), method="shade",addCoef.col = 1,number.cex = 0.5, type="lower",diag=FALSE)
Most of the correlations between the merge_driv variables follow similar interpretations as those of the merge_con3 data set. What’s nice to look at are the variables I constructed myself. For example, avg_lap_pos is strongly positively correlated with driv_start_pos, which tells me that the average position a driver holds during a race is likely close to the position they started at. This is intuitively accurate: some races, like street races, make overtaking difficult, so a driver’s start position can have a huge impact on how their race plays out. The same applies to driv_standing. The strong positive correlation between driv_standing and avg_lap_pos arises because the more a driver spends their races as a back-marker (i.e. with a higher race position number), the more drivers there are between them and the top of the standings.
ggplot(merge_con3, aes(x=con_pnts, y=con_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(col="red", stat="smooth", level=FALSE)+
labs(title="Constructor Position in Standings vs. Constructor Points", y="Constructor position in standings throughout season", x="Constructor points earned at each race")
Understandably, the fewer points a constructor earns at a race, the more constructors sit between it and the top of the standings. The relationship here also does not seem linear. Once a constructor begins earning points, it becomes a bit easier to climb positions in the standings, since earning more points for yourself means your competition is earning fewer. The trend here looks roughly inverse exponential, but maybe I’m getting ahead of myself.
Let’s take a closer look at the relationship between con_sum_pnts and round, something I found interesting as it has to do with the progress and improvement of a constructor team throughout a race season.
ggplot(merge_con3, aes(x=round, y=con_sum_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="red")+
labs(title="Accumulated Constructor Points vs. Round", y="Constructor points accumulated throughout season", x="Round (number of race in order of season schedule)")
It is interesting to see that, for the majority of constructors, con_sum_pnts somewhat plateaus as round increases. You can also see the behaviour of the handful of championship-winning constructor teams that manage a steady increase in their accumulated points.
ggplot(merge_driv4, aes(x=avg_lap_pos, y=driv_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="green")+
labs(title="Driver Points vs. Average Lap Position", x="Average lap position", y="Driver's points earned at each race")
We can see that the back-markers, i.e. the drivers that spend most of their race at, say, position fifteen or greater, have very little chance of earning points. Most of the drivers with an average lap position of fifteen or greater are lined up at zero points earned. The reasons drivers spend most of their time out of the points vary from lacking race pace, to issues with the car, to poor tyre management, all of which keep them from making moves and overtaking their way into the points.
What’s also interesting is that the thick pile of zero-point earners extends all the way to the higher positions. It is not a shock for drivers who are consistently in top positions, even race leaders, to suddenly encounter a car failure and DNF from the race. It is entirely possible and is part of the charm of F1.
ggplot(merge_driv4, aes(x=driv_start_pos, y=avg_lap_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="blue")+
labs(title="Average Lap Position vs. Driver Start Position", x="Driver's starting position", y="Average lap position")
As the correlation plot suggested, we have a clear positive relationship between average lap position and driver start position. Since qualifying sessions determine the driver start position, this also supports the idea that qualifying and sprint qualifying have an incredible influence on the course of a race.
Code in file model_building.rda
Now we can start fitting models to see if we can predict con_pos and driv_pnts.
We will be splitting our data sets into a training set and a testing set. The majority of the data should go to the training set, which is what we build our model on. Having separate training and testing sets gives us a truer test of how well our model performs: we need to test the model on data it has not been trained on to obtain a realistic idea of where it stands, and this also helps avoid over-fitting. If we were to test the model on data it has been trained on, the results would look very accurate, but falsely so, because they would not be true predictions. Our testing data should be akin to a blind taste test for our model, where the training data builds the model’s understanding of how to judge what it will be tasting.
We will split our data so that 70% of it goes to training and 30% to testing. We also stratify on the outcome variables con_pos and driv_pnts to make sure the response is distributed similarly across the training and testing data.
# split into training and testing, strata on response variables
con_split <- initial_split(merge_con3, prop = 0.7, strata = con_pos)
con_train <- training(con_split)
con_test <- testing(con_split)
driv_split <- initial_split(merge_driv4, prop = 0.7, strata =driv_pnts)
driv_train <- training(driv_split)
driv_test <- testing(driv_split)
# check split proportions
nrow(con_train)/nrow(merge_con3)
nrow(con_test)/nrow(merge_con3)
nrow(driv_train)/nrow(merge_driv4)
nrow(driv_test)/nrow(merge_driv4)
We have to decide on a recipe, or formula, made up of our response variable and predictors, that can be reused across many types of models later. For the merge_con3 data set, con_pos is the response variable. I suspect that measures of points, home races (nationality), race order, year, and the specific circuit/track have an influence on con_pos. So, for predictors, I have decided on: constructorRef, con_pnts, con_wins, con_nation, season, round, circuitRef, and circ_country.
For merge_driv, the response variable is driv_pnts. For predictors, I want to include: driverRef, constructorRef, driv_positionText, driv_positionOrder, laps, fastestLapTime_rank, fastestLapSpeed, driv_start_pos, avg_lap_time, avg_lap_pos, driv_standing, driv_wins, season, round, circuitRef, circ_country, dob, nationality, and status.
I made sure to include only one of the pnts and sum_pnts variables, because they are calculated from one another. The same goes for fastestLapTime_rank and fastestLap, since one is a ranking of the other. Since driv_start_pos is itself the result of the qualifying sessions, q_time feels less useful; and because q_time is a decimal number of seconds and is more often missing than not, I think excluding it is preferable.
We will impute the missing values in continuous and dummy coded categorical variables.
# omit the small amount of missing values left
merge_con3 <- merge_con3 %>%
na.omit()
# CONSTRUCTOR RECIPE
con_rec <-
recipe(con_pos ~ constructorRef+con_pnts+con_wins+con_nation+season+
round+circuitRef+circ_country, data = con_train) %>%
# omit the small amount of missing values left
step_naomit(all_predictors())%>%
# there are hundreds of different constructors and circuits, so we need to collapse the less common ones into an "other" category
step_other(constructorRef, threshold=.1) %>%
step_other(con_nation, threshold=.1) %>%
step_other(circuitRef, threshold=.1) %>%
step_other(circ_country, threshold=.1) %>%
step_naomit()%>%
# dummy code categorical variables
step_dummy(all_nominal_predictors()) %>%
# remove variables (likely the dummy coded ones) that only contain a single value
step_zv(all_predictors())%>%
# step_novel(all_nominal_predictors())%>%
# step_unknown(all_nominal_predictors())%>%
# normalizing
step_center(all_predictors()) %>%
step_scale(all_predictors())
# prep and bake
prep(con_rec) %>%
bake(new_data = con_train)
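For completeness, an analogous driver recipe built from the predictors listed above might look something like this (a sketch only, since this side of the project is not pursued further below; the step_mutate conversions are an assumption about the character-typed columns):
# sketch of the analogous driver recipe (not tuned further in this project)
driv_rec <-
  recipe(driv_pnts ~ driverRef+constructorRef+driv_positionText+driv_positionOrder+
           laps+fastestLapTime_rank+fastestLapSpeed+driv_start_pos+avg_lap_time+
           avg_lap_pos+driv_standing+driv_wins+season+round+circuitRef+
           circ_country+dob+nationality+status, data = driv_train) %>%
  # fastestLapTime_rank and fastestLapSpeed are stored as character, so convert first
  # (dob appears as a year in merge_driv and is assumed to already be numeric)
  step_mutate(fastestLapTime_rank = as.numeric(fastestLapTime_rank),
              fastestLapSpeed = as.numeric(fastestLapSpeed)) %>%
  step_naomit(all_predictors()) %>%
  # collapse rare drivers, constructors, circuits, countries, and nationalities
  step_other(driverRef, constructorRef, circuitRef, circ_country, nationality, threshold=.1) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())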
We will also use k-fold cross-validation: the training data is split into k folds, and each fold takes a turn acting as the validation set while the remaining k-1 folds are used for fitting. The k validation sets are mutually exclusive, so the process is fitting and assessing k times before one final assessment on the actual testing split of the data. Let’s use 10 folds.
con_folds <- vfold_cv(con_train, v = 10)
# use 5 folds for the random forest, for the sake of time
con_folds2 <- vfold_cv(con_train, v = 5)
At this point I decided to only work on the constructor data set for the sake of time.
Our process for building each model follows the same steps: specify the model (and which parameters to tune), build a workflow combining the model with our recipe, set up a tuning grid, and tune the model across the folds.
# LINEAR MODEL
lm_model<- linear_reg(engine="lm")
# POLYNOMIAL REGRESSION
# tune degree on the numerical variables (degree on these variables will amplify their effect)
# ex. a change in points and wins will affect constructor position more than normal
con_polyrec <- con_rec %>%
step_poly(con_pnts,con_wins,season,round,degree=tune())
poly_model <- linear_reg(mode="regression",
engine="lm")
# K NEAREST NEIGHBORS
# tune neighbors
knn_model <- nearest_neighbor(neighbors = tune(),
mode="regression",
engine="kknn")
# ELASTIC NET LINEAR REGRESSION
# tune penalty and mixture
en_model <- linear_reg(mixture=tune(),
penalty=tune(),
mode="regression",
engine="glmnet")
# ELASTIC NET W/ LASSO
# tune penalty, set mixture to 1 for lasso penalty
en_lasso <- linear_reg(penalty=tune(),
mixture=1,
mode="regression",
engine="glmnet")
# ELASTIC NET W/ RIDGE
# tune penalty, set mixture to 0 for ridge penalty
en_ridge <- linear_reg(penalty=tune(),
mixture=0,
mode="regression",
engine="glmnet")
# RANDOM FOREST
# tune number of predictors (mtry), trees, and minimum number of values in each node (min_n)
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# LINEAR MODEL
con_lmwflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(con_rec)
# POLYNOMIAL REGRESSION
con_polywflow <- workflow() %>%
add_model(poly_model) %>%
add_recipe(con_polyrec)
# K NEAREST NEIGHBORS
con_knnwflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(con_rec)
# ELASTIC NET LINEAR REGRESSION
con_enwflow <- workflow() %>%
add_model(en_model) %>%
add_recipe(con_rec)
# ELASTIC NET W/ LASSO
con_lassowflow <- workflow() %>%
add_model(en_lasso) %>%
add_recipe(con_rec)
# ELASTIC NET W/ RIDGE
con_ridgewflow <- workflow() %>%
add_model(en_ridge) %>%
add_recipe(con_rec)
# RANDOM FOREST
con_rfwflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(con_rec)
# POLYNOMIAL REGRESSION
# range degree from 1 to 5
poly_grid <- grid_regular(degree(range=c(1,5)),
levels=5)
# K NEAREST NEIGHBORS
# range neighbors from 1 to 15
knn_grid <- grid_regular(neighbors(range=c(1,15)),
levels=5)
# ELASTIC NET LINEAR REGRESSION
en_grid <- grid_regular(penalty(range=c(0.01,3), trans=identity_trans()),
mixture(range=c(0,1)),
levels=10)
# ELASTIC NET W/ LASSO and
# ELASTIC NET W/ RIDGE
lasso_ridge_grid <- grid_regular(penalty(range=c(0.01,3),
trans=identity_trans()), levels=10)
# RANDOM FOREST
# predictors (mtry) range depend on the recipe
con_rfgrid <- grid_regular(mtry(range=c(1,9)),
trees(range=c(200,400)),
min_n(range=c(10,20)),
levels=5)
# POLYNOMIAL REGRESSION
con_polytune <- tune_grid(
con_polywflow,
resamples = con_folds,
grid = poly_grid)
# K NEAREST NEIGHBORS
con_knntune <- tune_grid(
con_knnwflow,
resamples = con_folds,
grid = knn_grid)
# ELASTIC NET
con_entune <- tune_grid(
con_enwflow,
resamples = con_folds,
grid = en_grid)
# RIDGE REGRESSION
con_ridgetune <- tune_grid(
con_ridgewflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# LASSO REGRESSION
con_lassotune <- tune_grid(
con_lassowflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# RANDOM FOREST
con_rftune <- tune_grid(
con_rfwflow,
resamples = con_folds2,
grid = con_rfgrid)
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_building_final.rda")
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_results_final.rda")
lm_rmse <- collect_metrics(con_lmfit) %>%
slice(1)
ridge_rmse <- collect_metrics(con_ridgetune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# LASSO REGRESSION
lasso_rmse <- collect_metrics(con_lassotune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# POLYNOMIAL REGRESSION
poly_rmse <- collect_metrics(con_polytune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# K NEAREST NEIGHBORS
knn_rmse <- collect_metrics(con_knntune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# ELASTIC NET
elastic_rmse <- collect_metrics(con_entune) %>%
arrange(mean)%>%
filter(.metric=="rmse") %>%
slice(1)
lm_rmse
## # A tibble: 1 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 3.45 10 0.0155 Preprocessor1_Model1
ridge_rmse
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 rmse standard 3.45 10 0.0157 Preprocessor1_Model01
lasso_rmse
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 rmse standard 3.45 10 0.0157 Preprocessor1_Model01
poly_rmse
## # A tibble: 1 x 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 rmse standard 2.82 10 0.0264 Preprocessor5_Model1
knn_rmse
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 rmse standard 2.65 10 0.0212 Preprocessor1_Model5
elastic_rmse
## # A tibble: 1 x 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 1 rmse standard 3.45 10 0.0157 Preprocessor1_Model091
rf_rmse
## # A tibble: 1 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 350 20 rmse standard 2.37 5 0.0162 Preprocessor1_Model1~
Code in file model_results.rda
Let’s compare the models from before that minimize RMSE to find the best-performing one overall.
The model that minimizes RMSE the most is the random forest model with mtry = 4, trees = 350, and min_n = 20, which achieves a mean RMSE of 2.370339 across 5 folds.
autoplot(con_entune, metric="rmse")
The models with the smallest penalty minimize RMSE the most; as the penalty increases, our model performs increasingly worse. This makes sense, because a high penalty shrinks the coefficients on our predictors, effectively downplaying the effect the predictors have on our response. Also, the discrepancies in performance between models with different mixtures are small, which is consistent with our ridge regression model (mixture = 0) performing with only a slightly higher RMSE of 3.451871 than our lasso regression model (mixture = 1, RMSE of 3.449157).
autoplot(con_polytune, metric="rmse")
As the degree on our continuous variables increases from 1 to 5, the RMSE decreases steadily. This aligned with my expectations: our data, like con_pnts and con_wins, is built from individual points and wins, where small amounts can make the difference in determining a constructor’s place in the standings. Oftentimes a constructor’s position in the standings is only settled at the last race, because competition lasts the entire season. So, by adding degrees, we amplify the differences between our data points, and the effect of each predictor becomes more influential. A degree of 5 performs best here, and I would guess an even higher degree could lower the RMSE further.
autoplot(con_knntune, metric="rmse")
Similarly, the knn model performs increasingly better as the number of neighbors increases. Our knn model with 15 neighbors minimized RMSE the most out of everything except the random forest model, which is not entirely surprising. The non-linear models, like knn and random forest, performed the best, so the relationships in our data may well not be linear.
autoplot(con_rftune, metric="rmse")
Focusing on how the tuning parameters affect RMSE, it looks like minimal node size and number of trees make almost no difference. However, the number of randomly selected predictors, mtry, minimizes RMSE the most when it is larger (roughly 2 to 8), with a slight dip in RMSE across models around mtry = 5. We would not want mtry to include every available predictor, because then every tree considers the same candidate splits, the trees lose their independence, and we essentially end up with bagging rather than a random forest.
# parameters of best model: mtry=4, trees=350, min_n=20
rf_rmse
## # A tibble: 1 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 350 20 rmse standard 2.37 5 0.0162 Preprocessor1_Model1~
# fit best model to training
best_rf_train <- select_best(con_rftune, metric="rmse")
# finalize workflow
final_model_wflow <- finalize_workflow(con_rfwflow, best_rf_train)
final_model_fit <- fit(final_model_wflow, data=con_train)
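Since we set importance = "impurity" in the ranger engine, we can also take a quick look at which predictors the finalized random forest relies on, using the vip package loaded earlier:
# variable importance of the finalized random forest fit
final_model_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)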
Now let’s apply the model on testing data.
con_tibble <- predict(final_model_fit, new_data = con_test %>% select(-con_pos))
con_tibble <- bind_cols(con_tibble, con_test %>% select(con_pos))
con_metric <- metric_set(rmse)
con_tibble_metric <- con_metric(con_tibble, truth=con_pos, estimate=.pred)
con_tibble_metric
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.36
The model performed even better on the testing data, achieving an RMSE of 2.361256. Since RMSE measures how far predicted outcome values are from actual outcome values, and our response variable con_pos ranges from 1 to the low twenties, an RMSE of 2.361256 is not too bad. Our predicted standing positions might be a few places off, which is not ideal, but it’s manageable.
The best-performing model is the random forest model, while the more linear, simple models, like the linear regression and elastic net models, performed poorly in comparison. I was not entirely surprised: intuitively, predicting constructor standings does not seem like a linear trend. There are a lot of factors that come into play, many of which aren’t even included or properly represented in our data here, like more detailed qualifying data, grid place penalties from the FIA, ongoing car issues, etc. With all this in consideration, I did not think a basic linear model could cover what we needed.
As for next steps, I will probably spend a bit more time carefully going over everything, redo the random forest with a proper number of folds and wider ranges for the tuning parameters, spend more time manipulating the data, and perhaps even include data for our rookie drivers.
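As a rough sketch of what the eventual 2023 prediction step could look like, the finalized workflow would be applied to a constructed 2023 schedule (new_2023 below is hypothetical and would need the same predictor columns as con_train):
# hypothetical: score a constructed 2023 calendar and rank constructors
# new_2023 would hold one row per constructor per round with the same predictors
# as con_train (constructorRef, con_pnts, con_wins, con_nation, season, round,
# circuitRef, circ_country)
predict(final_model_fit, new_data = new_2023) %>%
  bind_cols(new_2023 %>% select(constructorRef)) %>%
  group_by(constructorRef) %>%
  summarise(avg_pred_pos = mean(.pred)) %>%
  arrange(avg_pred_pos)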
Overall, this project was successful and a great challenge in compiling data and applying something I am genuinely interested in and passionate about to academic skills I’ve only recently learned in this class.
Data comes from the Ergast Developer API.
Information about Formula 1 and the FIA comes from my own personal knowledge, the Official F1 Website, and Wikipedia.