Introduction

The goal of this project is to build a machine learning model to predict the Formula One World Constructors’ Championship Standings for the upcoming 2023 season.

What is Formula One? What are constructors?

Formula One, or F1 for short, is open-wheel, single-seater car racing at the highest level. There are a total of ten constructors, or racing teams, each tasked with building an F1 car prior to the annual racing season. During each race season, the ten constructors compete with their newly launched cars for a position from first to tenth in the World Constructors’ Championship Standings. Each constructor employs two drivers, who also compete for a place from first, i.e. World Champion, to twentieth in the World Drivers’ Championship Standings. The F1 race season is overseen and regulated by the Fédération Internationale de l’Automobile (FIA).

Racing occurs over a weekend: Fridays are for free practice, Saturdays are for qualifying, and Sundays are race days. During a race, finishing positions first through tenth earn 25, 18, 15, 12, 10, 8, 6, 4, 2, and 1 points respectively. The number of races, or grands prix, varies from year to year. The upcoming 2023 season has 23 races on the calendar, compared to 22 on the 2022 calendar. Some races are new, and some drivers are new. This may pose some difficulty later on, so it is best we mention it now.
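
As a tiny illustration (not part of the project’s data pipeline), the current points system can be written as a named vector in R, mapping finishing position to points:

# illustrative only: points awarded for finishing positions 1 through 10
f1_points <- c("1"=25, "2"=18, "3"=15, "4"=12, "5"=10, "6"=8, "7"=6, "8"=4, "9"=2, "10"=1)
f1_points["3"]   # points for a third-place finish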

What is Free Practice?
On Fridays, constructor teams put their two drivers out on the circuit to collect data about the car, get a feel for how the car handles the track, and in general gain as much practice and knowledge as possible to prepare for qualifying and race day.

How Qualifying Works:
On Saturdays, drivers have allotted time to record a fast lap in Q1, after which the five slowest drivers are eliminated (positions 16 to 20). In Q2, the remaining fifteen drivers have allotted time to record another fast lap, and the five slowest of those fifteen are eliminated (positions 11 to 15). In Q3, the remaining ten drivers record a final fast lap to earn a position from 1 to 10. The position a driver holds when they are eliminated, or at the end of Q3, determines where they start on Sunday race day.
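
For illustration only (simulated lap times, not project data), the knockout structure looks roughly like this in R:

# toy sketch of the Q1/Q2/Q3 knockout; every value here is hypothetical
library(dplyr)
set.seed(1)
quali <- data.frame(driver = paste0("driver_", 1:20),
                    q1 = runif(20, 88, 92))                 # hypothetical Q1 best laps (seconds)
q2_field <- quali %>% arrange(q1) %>% slice(1:15) %>%       # slowest five (grid 16-20) eliminated
  mutate(q2 = runif(15, 88, 92))
q3_field <- q2_field %>% arrange(q2) %>% slice(1:10) %>%    # slowest five of the rest (grid 11-15) eliminated
  mutate(q3 = runif(10, 88, 92))
q3_field %>% arrange(q3) %>% pull(driver)                   # grid positions 1 to 10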



Why F1?

Formula One, an inherently European-rooted sport, has gained immense popularity recently in the United States, partially due to the Netflix show “Drive to Survive” covering the motorsport. While the show itself is highly dramatized and at times an inaccurate representation of actual race events, I developed an interest in the sport like many others and began watching archived races, qualifying sessions, and more. It is well known amongst fans and followers of the sport that racing teams with a higher budget and access to rich resources tend to fare better in the Constructors’ Championship Standings. For many years, wealthy and heavily sponsored constructors like Mercedes, Ferrari, and Red Bull have occupied the higher ranks of the World Championship. However, in recent years the FIA has introduced a budget cap and other regulations in hopes of leveling the playing field. With the upcoming 2023 season kicking off, I am inclined to predict the results of the 2023 season, not only for my own fun, but also to see whether the FIA’s regulations are contributing to the beginning of a new era of competition in F1 racing.


Exploring and Modifying Data

We are going to begin this project by taking a general look at the data and considering what we need to do to construct a data set we can work with, i.e. missing data, useful and/or not useful variables, necessary key/ID values, manipulating and/or converting variables, etc.

Data Source

To obtain the data, I will be using the Ergast Developer API, which documents F1 statistics since the beginning of the World Championship in 1950. However, for convenience I will be downloading the csv files from the API website instead of pulling data with the actual API. I may also get further information for certain variables from the official F1 website archive.

Each csv file is its own data set, and some of them act as lookup tables for others. Our task is to combine the data sets efficiently into a single one before any analysis or model building.
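
The actual reading and joining happens in read_data.rda and modify_data.rda, but as a rough sketch of the idea (file and column names assumed to match the Ergast csv download, e.g. results.csv and races.csv):

library(readr)
library(dplyr)

# assumed file names from the Ergast csv download
results      <- read_csv("results.csv")
races        <- read_csv("races.csv")
constructors <- read_csv("constructors.csv")

# attach race and constructor information to each result row via the shared ID columns
results_full <- results %>%
  left_join(races %>% select(raceId, year, round, circuitId), by="raceId") %>%
  left_join(constructors %>% select(constructorId, constructorRef, name), by="constructorId")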

Loading Packages and Exploring the Data

Code in file read_data.rda

Let’s begin by loading our relevant packages, setting a seed, reading in the csv files we downloaded from the API website, and taking a quick look at some of our variables.

library(tidyverse)
library(tidymodels)
library(kknn)
library(ggplot2)
library(gridExtra) # for grid.arrange() used in the EDA plots
library(corrplot)
library(ggthemes)
library(kableExtra)
library(parsnip)
library(recipes)
library(magrittr)
library(workflows)
library(glmnet)
library(themis)
library(ranger)
library(vip)
library(naniar)
library(visdat)
library(dplyr)
library(ISLR)
set.seed(2536)

load("C://Users//waliang//Documents//UCSB//third year//pstat 131//read_data.rda")


Here is a general summary of what each data set involves:

  • circuit - uniquely identified circuits with their locations
  • races - uniquely identified races with their round number and locations since 1950
  • seasons - list of years of race seasons
  • results - collection of results for each driver at each race
  • drivers - uniquely identified drivers since 1950
  • driver_standings - uniquely identified record of a driver’s position in the Drivers’ Championship Standings after a certain race
  • constructors - uniquely identified constructor teams since 1950
  • constructor_results - uniquely identified record of points earned by a constructor at a certain race
  • constructor_standings - uniquely identified record of a constructor’s position in the Constructors’ Championship Standings after a certain race
  • status - lookup table data set of status key value pairs
  • lap_times - all lap times for each driver at each race
  • pit_stops - pit stop times for each driver at each race
  • qualifying - qualifying results for each race
  • sprint_results - sprint qualifying results for each driver at certain races that included sprint qualifying

Here are some things to consider:

  1. Since a constructor’s standing is determined by how many points they’ve accumulated throughout the race season relative to other teams, it would be ideal to include a points variable as our response. Also, the number of points a constructor obtains depends on how many points its drivers are able to obtain at races. So, the response variable can come from and/or be calculated from the results, constructor_results, constructor_standings, or driver_standings data sets.

  2. There may be missing data because there will be three rookie drivers entering the 2023 F1 season: Oscar Piastri, Nyck De Vries, and Logan Sargeant. It would be hard to predict the next season with no F1 data on these three drivers. I will probably take data from previous Formula Two seasons for these drivers, because F2 races concurrently with the F1 season on the same circuits. We can treat the number of points a driver earns as a measurement of how “good” that driver is. However, since the points two drivers earn sum to the points their constructor earns, we may have some structural collinearity between driver points and constructor points. Drivers also change constructors fairly often, so it is worth being cautious here.

  3. Some constructor teams have changed their names over the years (e.g. Force India became Racing Point Force India, then Racing Point, then Aston Martin), so their data will be scattered around the data set. Despite the different names, we group these “different, but same” teams together because important factors like car part suppliers, key partnerships, etc. usually stay the same regardless of the official name change. So, we will have to align the different team names under one name, preferably the most recent and relevant one (see the sketch below).
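
A minimal sketch of that alignment, assuming we recode on the constructor name column (the old-to-new mapping shown here is illustrative, not the full list used later):

# illustrative old -> current name mapping; the real mapping covers more teams and eras
constructors_aligned <- constructors %>%
  mutate(name = recode(name,
                       "Force India"              = "Aston Martin",
                       "Racing Point"             = "Aston Martin",
                       "Racing Point Force India" = "Aston Martin"))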



levels(factor(constructor_results$status))
## [1] "\\N" "D"

As we can see, the status variable of the constructor_results data set has two levels: "\\N" (missing) and "D" (disqualified).

# merge constructor names w/ results data set on Id variable
merge_test <- full_join(constructors[,c(1,3)], constructor_results, by="constructorId")
merge_test <- full_join(races[,c(1:3)], merge_test, by="raceId")

merge_test %>%
  filter(status == "D")
##    raceId year round constructorId    name constructorResultsId points status
## 1      36 2007     1             1 McLaren                  186     14      D
## 2      37 2007     2             1 McLaren                  196     18      D
## 3      38 2007     3             1 McLaren                  208     12      D
## 4      39 2007     4             1 McLaren                  219     14      D
## 5      40 2007     5             1 McLaren                  229     18      D
## 6      41 2007     6             1 McLaren                  240     12      D
## 7      42 2007     7             1 McLaren                  251     18      D
## 8      43 2007     8             1 McLaren                  263      8      D
## 9      44 2007     9             1 McLaren                  274     14      D
## 10     45 2007    10             1 McLaren                  284     10      D
## 11     46 2007    11             1 McLaren                  295     15      D
## 12     47 2007    12             1 McLaren                  307     10      D
## 13     48 2007    13             1 McLaren                  317     18      D
## 14     49 2007    14             1 McLaren                  329     11      D
## 15     50 2007    15             1 McLaren                  339     10      D
## 16     51 2007    16             1 McLaren                  351      8      D
## 17     52 2007    17             1 McLaren                  362      8      D

Here we discover that these disqualifications correspond to McLaren being excluded from the 2007 Constructors’ Championship because of the 2007 espionage controversy. This will come into play later.



Modifying Data Sets

Code in file modify_data.rda

The first thing we do in modify_data.rda is remove the url column from the following data sets: constructors, drivers, circuit, races, and seasons.

What about Piastri, Sargeant, and De Vries?

As I mentioned before, we do have three rookie drivers for the 2023 F1 season. If we take a look at our data, we will notice that these drivers are actually included in the drivers data set, with their own driverId and other values as well. They are also in merge_driv, albeit with mostly NA missing values.

drivers %>%
  filter(driverRef == "piastri" | driverRef == "sargeant" | forename=="Nyck")
##   driverId driverRef number code forename  surname        dob nationality
## 1      856  de_vries     45  DEV     Nyck de Vries 1995-02-06       Dutch
## 2      857   piastri     81  PIA    Oscar  Piastri 2001-04-06  Australian
## 3      858  sargeant      2  SAR    Logan Sargeant 2000-12-31    American
##                                           url
## 1  http://en.wikipedia.org/wiki/Nyck_de_Vries
## 2  http://en.wikipedia.org/wiki/Oscar_Piastri
## 3 http://en.wikipedia.org/wiki/Logan_Sargeant
# Nyck de Vries completed one race as a reserve driver for Williams
merge_driv %>%
  filter(driverId %in% c(856:858))
##    raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1    1089      856             3                 9                  9         2
## 2    1091      856            NA              <NA>                 NA        NA
## 3    1092      856            NA              <NA>                 NA        NA
## 4    1093      856            NA              <NA>                 NA        NA
## 5    1094      856            NA              <NA>                 NA        NA
## 6    1095      856            NA              <NA>                 NA        NA
## 7    1096      856            NA              <NA>                 NA        NA
## 8    1098      857            NA              <NA>                 NA        NA
## 9    1098      858            NA              <NA>                 NA        NA
## 10   1098      856            NA              <NA>                 NA        NA
##    laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1    53         41                  13         86.624         240.750
## 2    NA       <NA>                <NA>             NA            <NA>
## 3    NA       <NA>                <NA>             NA            <NA>
## 4    NA       <NA>                <NA>             NA            <NA>
## 5    NA       <NA>                <NA>             NA            <NA>
## 6    NA       <NA>                <NA>             NA            <NA>
## 7    NA       <NA>                <NA>             NA            <NA>
## 8    NA       <NA>                <NA>             NA            <NA>
## 9    NA       <NA>                <NA>             NA            <NA>
## 10   NA       <NA>                <NA>             NA            <NA>
##    driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1              1 82.471             13     91.21949          10  24.628
## 2             NA     NA             NA           NA          NA      NA
## 3             NA     NA             NA           NA          NA      NA
## 4             NA     NA             NA           NA          NA      NA
## 5             NA     NA             NA           NA          NA      NA
## 6             NA     NA             NA           NA          NA      NA
## 7             NA     NA             NA           NA          NA      NA
## 8             NA     NA             NA           NA          NA      NA
## 9             NA     NA             NA           NA          NA      NA
## 10            NA     NA             NA           NA          NA      NA
##    sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1              2            20                20         0   2022    16
## 2              2            20                20         0   2022    17
## 3              2            21                21         0   2022    18
## 4              2            21                21         0   2022    19
## 5              2            21                21         0   2022    20
## 6              2            21                21         0   2022    21
## 7              2            21                21         0   2022    22
## 8              0            12                12         0   2023     1
## 9              0            15                15         0   2023     1
## 10             0            19                19         0   2023     1
##    circuitId                race_name circ_country circuitRef
## 1         14       Italian Grand Prix        Italy      monza
## 2         15     Singapore Grand Prix    Singapore marina_bay
## 3         22      Japanese Grand Prix        Japan     suzuka
## 4         69 United States Grand Prix          USA   americas
## 5         32   Mexico City Grand Prix       Mexico  rodriguez
## 6         18     Brazilian Grand Prix       Brazil interlagos
## 7         24     Abu Dhabi Grand Prix          UAE yas_marina
## 8          3       Bahrain Grand Prix      Bahrain    bahrain
## 9          3       Bahrain Grand Prix      Bahrain    bahrain
## 10         3       Bahrain Grand Prix      Bahrain    bahrain
##                         circ_name driverRef forename  surname  dob nationality
## 1    Autodromo Nazionale di Monza  de_vries     Nyck de Vries 1995       Dutch
## 2       Marina Bay Street Circuit  de_vries     Nyck de Vries 1995       Dutch
## 3                  Suzuka Circuit  de_vries     Nyck de Vries 1995       Dutch
## 4         Circuit of the Americas  de_vries     Nyck de Vries 1995       Dutch
## 5  Autódromo Hermanos Rodríguez  de_vries     Nyck de Vries 1995       Dutch
## 6    Autódromo José Carlos Pace  de_vries     Nyck de Vries 1995       Dutch
## 7              Yas Marina Circuit  de_vries     Nyck de Vries 1995       Dutch
## 8   Bahrain International Circuit   piastri    Oscar  Piastri 2001  Australian
## 9   Bahrain International Circuit  sargeant    Logan Sargeant 2000    American
## 10  Bahrain International Circuit  de_vries     Nyck de Vries 1995       Dutch
##      status
## 1  Finished
## 2      <NA>
## 3      <NA>
## 4      <NA>
## 5      <NA>
## 6      <NA>
## 7      <NA>
## 8      <NA>
## 9      <NA>
## 10     <NA>


I did consider inputting F2 and F3 data into our data set to fill out these three drivers. However, I am now against the idea because, while there may be similarities between F2, F3, and F1, the race weekends are quite different, and I’m afraid adding certain values while being forced to omit others that don’t exist in F2 and F3 would create even more missing data.

There isn’t any resource from which I could pull data neatly and combine it nicely with the data set we have now, so I will continue without adding extra data, for the sake of time.

Change of Plans?

Now let’s take a look at our two merged data sets: merge_con and merge_driv.

dim(merge_con)[1]
## [1] 13177
dim(merge_driv)[1]
## [1] 34496

Considering the 20,000+ difference in observations, which is expected since there are more unique drivers than constructors, I am against merging merge_con and merge_driv. At this point of the project, my plan of action is changing a little. Instead of trying to consolidate all my data into a single data set and modelling off of that, I am seriously considering modelling off of two data sets:

  1. Use merge_con to predict response variable con_pos, using predictors specific to the effect of each individual constructor

  2. Use merge_driv to predict driv_pnts, using predictors specific to the effect of each individual driver. We can predict driv_pnts for our 2023 drivers, sum the predictions for the drivers on each 2023 constructor team, and rank the constructors accordingly (see the sketch below).
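
A sketch of that aggregation, assuming a hypothetical data frame driv_preds_2023 that holds each 2023 driver’s predicted driv_pnts alongside their 2023 constructor (neither the object nor its column names exist in the project yet):

# driv_preds_2023 (hypothetical): one row per 2023 driver with columns
#   driverRef, constructor_2023, pred_driv_pnts (predicted points, summed over the season)
predicted_standings <- driv_preds_2023 %>%
  group_by(constructor_2023) %>%
  summarise(pred_con_pnts = sum(pred_driv_pnts)) %>%
  arrange(desc(pred_con_pnts)) %>%
  mutate(pred_con_pos = row_number())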

I think this plan is definitely doable and is not too far off from our original plan. We could just use merge_con and proceed with our modelling, but I feel there is so much potential in the data in merge_driv that it would be a shame not to use it. However, given the lack of data on our three rookie drivers, the merge_con approach may fare better for our prediction goal.

Finalized Data Set(s) and Codebook(s)

Here are our two finalized data sets, as well as a codebook for each:

head(merge_con)
##   raceId constructorId con_pnts con_sum_pnts con_pos con_posText con_wins
## 1     18             1       14           14       1           1        1
## 2     18             2        8            8       3           3        0
## 3     18             3        9            9       2           2        0
## 4     18             4        5            5       4           4        0
## 5     18             5        2            2       5           5        0
## 6     18             6        1            1       6           6        0
##   constructorRef   con_name con_nation season round circuitId
## 1        mclaren    McLaren    British   2008     1         1
## 2     bmw_sauber BMW Sauber     German   2008     1         1
## 3       williams   Williams    British   2008     1         1
## 4        renault    Renault     French   2008     1         1
## 5     toro_rosso Toro Rosso    Italian   2008     1         1
## 6        ferrari    Ferrari    Italian   2008     1         1
##               race_name circ_country  circuitRef                      circ_name
## 1 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit
## 2 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit
## 3 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit
## 4 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit
## 5 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit
## 6 Australian Grand Prix    Australia albert_park Albert Park Grand Prix Circuit

merge_con Codebook

raceId - unique race identifier; numeric
constructorId - unique constructor identifier; numeric
con_pnts - number of points a constructor earns at a certain race; numeric
con_sum_pnts - total number of points a constructor earned so far in the race season including that certain race; numeric
con_pos - a constructor’s position in the Constructors’ Championship Standings at the point of that certain race during the race season; numeric
con_posText - con_pos as text; character
con_wins - number of wins a constructor earned so far in the race season including that certain race; numeric
constructorRef - constructor reference name; character
con_name - constructor name; character
con_nation - nationality of constructor team; character
season - year of race season; numeric
round - race order (1 = first race, 2 = second, etc.) in the race season; numeric
circuitId - unique circuit identifier; numeric
race_name - name of grand prix; character
circ_country - country circuit is located in; character
circuitRef - circuit reference name; character
circ_name - name of circuit; character

head(merge_driv)
##   raceId driverId constructorId driv_positionText driv_positionOrder driv_pnts
## 1     18        1             1                 1                  1        10
## 2     18        2             2                 2                  2         8
## 3     18        3             3                 3                  3         6
## 4     18        4             4                 4                  4         5
## 5     18        5             1                 5                  5         4
## 6     18        6             3                 6                  6         3
##   laps fastestLap fastestLapTime_rank fastestLapTime fastestLapSpeed
## 1   58         39                   2         87.452         218.300
## 2   58         41                   3         87.739         217.586
## 3   58         41                   5         88.090         216.719
## 4   58         58                   7         88.603         215.464
## 5   58         43                   1         87.418         218.385
## 6   57         50                  14         89.639         212.974
##   driv_statusId q_time driv_start_pos avg_lap_time avg_lap_pos avg_pit
## 1             1 85.187              1     98.11407           1      NA
## 2             1 85.518              5     98.20852           4      NA
## 3             1 86.059              7     98.25481           4      NA
## 4             1 86.188             12     98.41029           8      NA
## 5             1 85.452              3     98.42466           3      NA
## 6            11 86.413             14    100.08219          11      NA
##   sum_driv_pnts driv_standing driv_standingText driv_wins season round
## 1            10             1                 1         1   2008     1
## 2             8             2                 2         0   2008     1
## 3             6             3                 3         0   2008     1
## 4             5             4                 4         0   2008     1
## 5             4             5                 5         0   2008     1
## 6             3             6                 6         0   2008     1
##   circuitId             race_name circ_country  circuitRef
## 1         1 Australian Grand Prix    Australia albert_park
## 2         1 Australian Grand Prix    Australia albert_park
## 3         1 Australian Grand Prix    Australia albert_park
## 4         1 Australian Grand Prix    Australia albert_park
## 5         1 Australian Grand Prix    Australia albert_park
## 6         1 Australian Grand Prix    Australia albert_park
##                        circ_name  driverRef forename    surname  dob
## 1 Albert Park Grand Prix Circuit   hamilton    Lewis   Hamilton 1985
## 2 Albert Park Grand Prix Circuit   heidfeld     Nick   Heidfeld 1977
## 3 Albert Park Grand Prix Circuit    rosberg     Nico    Rosberg 1985
## 4 Albert Park Grand Prix Circuit     alonso Fernando     Alonso 1981
## 5 Albert Park Grand Prix Circuit kovalainen   Heikki Kovalainen 1981
## 6 Albert Park Grand Prix Circuit   nakajima   Kazuki   Nakajima 1985
##   nationality   status
## 1     British Finished
## 2      German Finished
## 3      German Finished
## 4     Spanish Finished
## 5     Finnish Finished
## 6    Japanese   +1 Lap

merge_driv Codebook

raceId - unique race identifier; numeric
driverId - unique driver identifier; numeric
constructorId - unique constructor identifier; numeric
driv_positionText - a driver’s finishing race position at that certain race; factor; includes the following:

D - disqualified
E - excluded
F - failed to qualify
N - not classified
R - retired
W - withdrawn

driv_positionOrder - a driver’s place in the finishing order of that certain race; numeric
driv_pnts - number of points a driver earns at a certain race; numeric
laps - number of laps a driver completed in the race; numeric
fastestLap - the lap number during which a driver drove their fastest lap during the race; character
fastestLapTime_rank - the rank of a driver’s fastest lap time among that of the other drivers during the race; character
fastestLapTime - a driver’s fastest lap time during the race in seconds; numeric
fastestLapSpeed - fastest speed reached during a driver’s fastest lap; character
driv_statusId - unique status identifier for the condition in which a driver finished their race; numeric
q_time - fastest qualifying fast lap time (out of Q1, Q2, and Q3) that a driver completed during that race weekend’s qualifying session; numeric
driv_start_pos - a driver’s starting grid position for that certain race; numeric
avg_lap_time - average of all of a driver’s lap times during a race; numeric
avg_lap_pos - rounded average of all positions a driver was at during a race; numeric
avg_pit - average pit stop duration of a driver during a race; numeric
sum_driv_pnts - total number of points a driver earned so far in the race season including that certain race; numeric
driv_standing - a driver’s position in the Drivers’ Championship Standings at the point of that certain race during the race season; numeric
driv_standingText - driv_standing as text; character; may also include the following:

D - disqualified
driv_wins - number of wins a driver earned so far in the race season including that certain race; numeric
status - the condition in which a driver finished their race; factor



Exploratory Data Analysis

Code in file eda.rda

Missing Data

Let’s take a look at the missing data in merge_con.

load("C://Users//waliang//Documents//UCSB//third year//pstat 131//eda.rda")

vis_miss(merge_con)


Let’s arrange each constructor’s result in order throughout each race season and fill in con_pnts and con_sum_pnts using each other (since con_sum_pnts is an accumulation of con_pnts throughout the season).

merge_con1 <- merge_con %>%
  # arrange each constructor's result in order throughout season
  arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
  # replace con_sum_pnts missing values with the actual value using the other con_pnts column
  mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
                                  (is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
                                  is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))


# vice versa
merge_con2 <- merge_con1 %>%
  arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
  mutate(con_pnts = case_when((is.na(con_pnts) & season==lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts-lag(con_sum_pnts),
                              (is.na(con_pnts) & season!=lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts,
                              is.na(con_pnts)!=TRUE ~ con_pnts))

# and again the first direction for good measure
merge_con3 <- merge_con2 %>%
   arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
   mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
                                   (is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
                                   is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
vis_miss(merge_con3)

merge_con3$constructorRef <- factor(merge_con3$constructorRef)


After fixing con_pnts and con_sum_pnts, we have a lot less missing data. The missing values in con_pos, con_posText, and con_wins are fairly minimal. We can deal with these missing values during recipe creation using a linear regression of other variables that measure a constructor’s progress throughout the race season, like con_pnts and con_sum_pnts.

Other than this, we don’t have a lot of missing data and the data set is pretty usable.

Now let’s work on merge_driv similarly.

vis_miss(merge_driv, warn_large_data = FALSE)


That’s a lot of missing data. There are some drivers with no constructor team! Let’s look at those first.

dim(merge_driv %>%
  filter(is.na(constructorId)))[1]
## [1] 8646
dim(merge_driv)[1]
## [1] 34496

There are 8646 out of 34496 observations with missing values from constructorId through avg_pit, plus status. My guess is that when I joined data sets whose information is only available after a certain year (like qualifying, pit stops, and even results) with driver_standings, which has data going back to the beginning of F1, the join left empty spots in the variables that only exist after a certain date.

I don’t want to just remove these observations, since they still have data for driv_standing and driv_wins. Some of the missing variables that I would prefer to have, like driv_positionOrder, driv_pnts, and driv_start_pos, are measures of each driver’s success at each race and throughout the season, similar to driv_standing and driv_wins. However, I think I will be forced to remove the rows with missing constructorId, because I will have trouble imputing a categorical variable (constructorId is a factor).
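
The construction of merge_driv1 lives in modify_data.rda; a minimal sketch, assuming it simply drops the rows with a missing constructorId, would be:

# assumed definition of merge_driv1: drop observations where the constructor is unknown
merge_driv1 <- merge_driv %>%
  filter(!is.na(constructorId))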

vis_miss(merge_driv1)


Let’s deal with the missing driv_standing and driv_wins values so we can more easily impute the other variables in the recipe later on.

dim(merge_driv1 %>%
      filter(is.na(driv_standing)))[1]
## [1] 469
merge_driv2 <- merge_driv1 %>%
  group_by(driverId, round) %>%
  # replace missing driv_standing with the median of the standings the driver gets in that round across seasons
  summarise(driv_standing_na = as.integer(median(driv_standing,na.rm=TRUE)))

merge_driv3 <- full_join(merge_driv1, merge_driv2, by=c("driverId","round"))

merge_driv3 <- merge_driv3 %>%
  mutate(driv_standing = 
           case_when(is.na(driv_standing) ~ driv_standing_na,
                     !is.na(driv_standing) ~ driv_standing))
dim(merge_driv3 %>%
  filter(is.na(driv_standing)))[1]
## [1] 91

By replacing missing driv_standing values with the median standing in that round number across all seasons, we reduce the number of missing standings from 469 to 91, saving a good chunk of our data. We can then omit the remaining 91 missing values.

# omit the rest of missing driv_standing
merge_driv4 <- merge_driv3 %>%
  filter(!is.na(driv_standing))
vis_miss(merge_driv4, warn_large_data = FALSE)


We can work with this. It’ll get better when imputing during recipe creation.



Visual EDA

Let’s look at some of the relationships between our variables.

Constructor and Driver Points

Let’s take a look at our possible response variables con_pos and driv_pnts.

grid.arrange(plot_con, plot_driv, ncol=2)


range((merge_con3 %>%
         filter(!is.na(con_pos))
         )$con_pos)
## [1]  1 22

range((merge_con3 %>%
  filter(con_pos==tail(unique(sort((merge_con3 %>%
         filter(!is.na(con_pos))
         )$con_pos)))) %>%
  select(con_pos,con_name, season, race_name, circ_name))$season)
## [1] 1963 1991

range((merge_driv4 %>%
         filter(!is.na(driv_pnts))
         )$driv_pnts)
## [1]  0 50

The con_pos values range from 1 to 22. The constructors in the very high (i.e. worst) positions are all pre-2000 (from 1963 to 1991), because nowadays there are only ten constructors each season. The driv_pnts values range from 0 to 50, the maximum coming from first place at the 2014 Abu Dhabi Grand Prix.*

* At the 2014 Abu Dhabi Grand Prix, the FIA awarded double points. So, instead of the usual 25-to-1 range, first through tenth place earned 50, 36, 30, 24, 20, 16, 12, 8, 4, and 2 points respectively. Understandably, double points were never awarded again.



Correlation Plots

Let’s look at some possible relationships between our variables.

# remove na, identification variables, only numeric
corrplot(cor(merge_con3 %>%
               select_if(is.numeric)%>%
               na.omit()), method="shade", diag=FALSE,addCoef.col=1,number.cex = 1, type="lower")


In the merge_con3 data set, we have a positive correlation between con_wins and con_pnts, which makes sense: the more points a constructor earns, the more likely it is that the constructor’s drivers won races. There is also a negative correlation between con_wins and con_pos, which is also intuitive, because the further a constructor sits from the top of the standings, the less likely it is to win races. round and con_sum_pnts are positively correlated, so the more races a constructor completes and the more of the race season that passes, the more points it accumulates. On the other hand, round and con_pnts are not as positively correlated, because making quick progress during the season and consistently earning more and more points as it goes on is a difficult thing for an entire constructor team to do, even if it is what every team desires. The positive correlation between season and con_pnts/con_sum_pnts is expected because over time the FIA increased not only the number of points awarded, but also the number of race finishing positions that score points. We also have a negative correlation between con_pos and con_pnts/con_sum_pnts, for similar reasons as with con_wins and con_pos.

# remove na, q_time has too much na, and identification variables, only numeric
corrplot(cor(merge_driv4 %>%
               select_if(is.numeric) %>%
               na.omit()), method="shade",addCoef.col = 1,number.cex = 0.5, type="lower",diag=FALSE)


Most of the correlations between the merge_driv variables follow similar interpretations as those of the merge_con3 data set. What’s nice to look at are the variables I engineered myself. For example, avg_lap_pos is very positively correlated with driv_start_pos, which tells me that the average position a driver holds during a race tends to be close to the position they started in. This is intuitively accurate, as some races, like street races, can make overtaking difficult, so a driver’s start position can have a huge impact on how their race plays out. This also applies to driv_standing. The strong positive correlation between driv_standing and avg_lap_pos arises because the more time a driver spends as a back-marker (i.e. at a higher numeric race position), the more drivers there are between them and the top of the standings.

Constructor Position in Standings vs. Constructor Points

ggplot(merge_con3, aes(x=con_pnts, y=con_pos)) +
  geom_jitter(width=.5, size=1)+
  geom_smooth(col="red", se=FALSE)+
  labs(title="Constructor Position in Standings vs. Constructor Points", y="Constructor position in standings throughout season", x="Constructor points earned at each race")


Understandably, the fewer points a constructor earns at a race, the more constructors sit between them and the top of the standings. The relationship here also does not seem linear. Once a constructor begins earning points, it becomes a bit easier to climb positions in the standings, since earning more points yourself means your competition is earning fewer. The trend here looks like exponential decay, but maybe I’m getting ahead of myself.

Accumulated Constructor Points vs. Round

Let’s take a closer look at the relationship between con_sum_pnts and round, something I found interesting as it has to do with the progress and improvement of a constructor team throughout a race season.

ggplot(merge_con3, aes(x=round, y=con_sum_pnts)) +
  geom_jitter(width=.5, size=1)+
  geom_smooth(method="lm", col="red")+
  labs(title="Accumulated Constructor Points vs. Round", y="Constructor points accumulated throughout season", x="Round (number of race in order of season schedule)")


It is interesting to see the majority of constructors’ con_sum_pnts somewhat plateau as round increases. You can also see the behavior of the handful of championship-winning constructor teams that manage a steady increase in their accumulated points.

Driver Points vs. Average Lap Position

ggplot(merge_driv4, aes(x=avg_lap_pos, y=driv_pnts)) +
  geom_jitter(width=.5, size=1)+
  geom_smooth(method="lm", col="green")+
  labs(title="Driver Points vs. Average Lap Position", x="Average lap position", y="Driver's points earned at each race")


We can see that the back-markers, i.e. the drivers who happened to spend most of their race at roughly position fifteen or greater, have very little chance of earning points. Most of the drivers with an average lap position of fifteen or greater are lined up at zero points earned. The reasons drivers spend most of their time out of the points vary from lacking race pace, to issues with the car, to poor tyre management, all of which keep them from making moves and overtaking their way into the points.

What’s also interesting is that the thick pile of zero-point earners extends all the way to the higher positions. It is not a shock for drivers who are consistently in top positions, even race leaders, to suddenly encounter a car failure and DNF from the race. It is entirely possible and is part of the charm of F1.

Average Lap Position vs. Driver Start Position

ggplot(merge_driv4, aes(x=driv_start_pos, y=avg_lap_pos)) +
  geom_jitter(width=.5, size=1)+
  geom_smooth(method="lm", col="blue")+
  labs(title="Average Lap Position vs. Driver Start Position", x="Driver's starting position", y="Average lap position")


As the correlation plot suggested, we have a clear positive relationship between average lap position and driver start position. Since qualifying sessions determine driver start position, this also supports the idea that qualifying and sprint qualifying have a huge influence on the course of a race.

Model Set-Up

Code in file model_building.rda

Now we can start fitting models to see if we can predict con_pos and driv_pnts.

Data Split

We will be splitting our data sets into a training set and a testing set. The majority of the data should be in the training set, which is what we build our model on. Having separate training and testing sets gives us a truer test of how well our model performs: we need to test the model on data it has not been trained on to get a realistic idea of where it stands, and this also helps avoid over-fitting. If we were to test our model on data it’s been trained on, the results would look very accurate, but misleadingly so, because it would not be a true prediction at that point. The testing data should be akin to a blind taste test for our model, while the training data is what builds the model’s understanding of what it will be tasting.

We will split our data so that 70% of it goes to training and 30% to testing. We also stratify on the response variables con_pos and driv_pnts to make sure the distribution of the response is similar in our training and testing data.

# split into training and testing, strata on response variables
con_split <- initial_split(merge_con3, prop = 0.7, strata = con_pos)
con_train <- training(con_split)
con_test <- testing(con_split)

driv_split <- initial_split(merge_driv4, prop = 0.7, strata =driv_pnts)
driv_train <- training(driv_split)
driv_test <- testing(driv_split)

# check split proportions
nrow(con_train)/nrow(merge_con3)
nrow(con_test)/nrow(merge_con3)
nrow(driv_train)/nrow(merge_driv4)
nrow(driv_test)/nrow(merge_driv4)



Recipe

We have to decide on a recipe, or formula, made up of our response variable and predictors, that can be reused for many types of models later. From our merge_con3 data set, we are using con_pos as the response variable. I want to include measures of points, home races (nationality), race order, year, and the specific circuit/track, since these can influence con_pos. So, for predictors, I have decided on: constructorRef, con_pnts, con_wins, con_nation, season, round, circuitRef, and circ_country.

For merge_driv, our response variable is driv_pnts. For predictors, I want to include: driverRef, constructorRef, driv_positionText, driv_positionOrder, laps, fastestLapTime_rank, fastestLapSpeed, driv_start_pos, avg_lap_time, avg_lap_pos, driv_standing, driv_wins, season, round, circuitRef, circ_country, dob, nationality, and status.

I made sure to include only one of the pnts and sum_pnts variables because they are calculated from one another. The same goes for fastestLapTime_rank and fastestLap, since one is a ranking of the other. Since driv_start_pos is inherently a result of qualifying sessions, I feel that q_time is less useful; especially since q_time is a decimal number of seconds and is missing more often than not, I think excluding it is preferable.

We will impute the missing values in continuous and dummy coded categorical variables.
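
The constructor recipe below ends up handling the leftover missingness with step_naomit(); if we instead wanted to impute inside the recipe, as suggested earlier, a sketch of that alternative (not the recipe actually used) could look like this:

# alternative sketch: impute missing values instead of dropping rows (not used below)
con_rec_impute <-
  recipe(con_pos ~ constructorRef+con_pnts+con_wins+con_nation+season+
           round+circuitRef+circ_country, data = con_train) %>%
  # fill missing numeric predictors with a linear regression on season-progress variables
  step_impute_linear(con_pnts, con_wins, impute_with = imp_vars(season, round)) %>%
  # fill missing categorical predictors with their most common level
  step_impute_mode(all_nominal_predictors())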

# omit the small amount of missing values left
merge_con3 <- merge_con3 %>%
  na.omit()


# CONSTRUCTOR RECIPE
con_rec <-
  recipe(con_pos ~ constructorRef+con_pnts+con_wins+con_nation+season+
           round+circuitRef+circ_country, data = con_train) %>%
  
  # omit the small amount of missing values left
  step_naomit(all_predictors())%>%
  
  # there are hundreds of different constructors and circuits, so we need to collapse the less common ones into an "other" category
  step_other(constructorRef, threshold=.1) %>%
  step_other(con_nation, threshold=.1) %>%
  step_other(circuitRef, threshold=.1) %>%
  step_other(circ_country, threshold=.1) %>%
  

  # dummy code categorical variables
  step_dummy(all_nominal_predictors()) %>%
  
  # remove variables (likely the dummy coded ones) that only contain a single value
  step_zv(all_predictors())%>%
  # step_novel(all_nominal_predictors())%>%
  # step_unknown(all_nominal_predictors())%>%
  
  # normalizing
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
  

# prep and bake
prep(con_rec) %>%
  bake(new_data = con_train)



K-fold Cross Validation

We will also use k-fold cross-validation. K-fold cross-validation splits the training data into k folds, or sections; each fold takes a turn acting as the validation set while the remaining folds are used for training. The folds are constructed so that the k validation sets are mutually exclusive. So, the process is to fit and assess k times before one final assessment on the actual testing split of the data. Let’s use 10 folds.

con_folds <- vfold_cv(con_train, v = 10)

# fewer folds for the random forest, for the sake of time
con_folds2 <- vfold_cv(con_train, v = 5)



Model Building

At this point I decided to only work on the constructor data set for the sake of time.

Our process for building each of our models follows the same steps:

  1. set up model with tuning parameters, the engine, and mode (either regression or classification)
# LINEAR MODEL
lm_model<- linear_reg(engine="lm")


# POLYNOMIAL REGRESSION
# tune degree on the numerical variables (degree on these variables will amplify their effect)
      # ex. a change in points and wins will affect constructor position more than normal
con_polyrec <- con_rec %>%
  step_poly(con_pnts,con_wins,season,round,degree=tune())

poly_model <- linear_reg(mode="regression",
                         engine="lm")


# K NEAREST NEIGHBORS 
# tune neighbors
knn_model <- nearest_neighbor(neighbors = tune(),
                              mode="regression",
                              engine="kknn")


# ELASTIC NET LINEAR REGRESSION
# tune penalty and mixture
en_model <- linear_reg(mixture=tune(),
                       penalty=tune(),
                       mode="regression",
                       engine="glmnet")


# ELASTIC NET W/ LASSO
# tune penalty, set mixture to 1 for lasso penalty
en_lasso <- linear_reg(penalty=tune(),
                       mixture=1,
                       mode="regression",
                       engine="glmnet")


# ELASTIC NET W/ RIDGE
# tune penalty, set mixture to 0 for ridge penalty
en_ridge <- linear_reg(penalty=tune(),
                       mixture=0,
                       mode="regression",
                       engine="glmnet")


# RANDOM FOREST
# tune number of predictors (mtry), trees, and minimum number of values in each node (min_n)
rf_model <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")


  2. set up a workflow with the model and recipe
# LINEAR MODEL
con_lmwflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(con_rec)



# POLYNOMIAL REGRESSION
con_polywflow <- workflow() %>%
  add_model(poly_model) %>%
  add_recipe(con_polyrec)


# K NEAREST NEIGHBORS 
con_knnwflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(con_rec)


# ELASTIC NET LINEAR REGRESSION
con_enwflow <- workflow() %>%
  add_model(en_model) %>%
  add_recipe(con_rec)



# ELASTIC NET W/ LASSO
con_lassowflow <- workflow() %>%
  add_model(en_lasso) %>%
  add_recipe(con_rec)


# ELASTIC NET W/ RIDGE
con_ridgewflow <- workflow() %>%
  add_model(en_ridge) %>%
  add_recipe(con_rec)



# RANDOM FOREST
con_rfwflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(con_rec)


  3. set up a tuning grid with ranges and levels for the tuning parameters
# POLYNOMIAL REGRESSION
# range degree from 1 to 5
poly_grid <- grid_regular(degree(range=c(1,5)),
                        levels=5)

# K NEAREST NEIGHBORS 
# range neighbors from 1 to 15
knn_grid <- grid_regular(neighbors(range=c(1,15)),
                        levels=5)


# ELASTIC NET LINEAR REGRESSION
en_grid <- grid_regular(penalty(range=c(0.01,3), trans=identity_trans()),
                        mixture(range=c(0,1)),
                        levels=10)


# ELASTIC NET W/ LASSO and
# ELASTIC NET W/ RIDGE
lasso_ridge_grid <- grid_regular(penalty(range=c(0.01,3),
                                         trans=identity_trans()), levels=10)


# RANDOM FOREST
# the range for mtry (number of randomly selected predictors) depends on the number of predictors in the recipe
con_rfgrid <- grid_regular(mtry(range=c(1,9)),
                        trees(range=c(200,400)),
                        min_n(range=c(10,20)),
                        levels=5)


  4. tune the model using the workflow, the k-fold cross-validation folds, and the tuning grid
# POLYNOMIAL REGRESSION
con_polytune <- tune_grid(
  con_polywflow,
  resamples = con_folds,
  grid = poly_grid)



# K NEAREST NEIGHBORS
con_knntune <- tune_grid(
    con_knnwflow,
    resamples = con_folds,
    grid = knn_grid)



# ELASTIC NET
con_entune <- tune_grid(
  con_enwflow,
  resamples = con_folds,
  grid = en_grid)


# RIDGE REGRESSION
con_ridgetune <- tune_grid(
  con_ridgewflow,
  resamples = con_folds,
  grid = lasso_ridge_grid)



# LASSO REGRESSION
con_lassotune <- tune_grid(
  con_lassowflow,
  resamples = con_folds,
  grid = lasso_ridge_grid)


# RANDOM FOREST
con_rftune <- tune_grid(
  con_rfwflow,
  resamples = con_folds2,
  grid = con_rfgrid)


  5. Collect metrics of the tuned models. We want to minimize root mean squared error (RMSE), so find the lowest RMSE for each tuned model.
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_building_final.rda")

load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_results_final.rda")
lm_rmse <- collect_metrics(con_lmfit) %>% 
  slice(1)

ridge_rmse <- collect_metrics(con_ridgetune) %>% 
  arrange(mean) %>%
  filter(.metric=="rmse") %>%
  slice(1)

# LASSO REGRESSION
lasso_rmse <- collect_metrics(con_lassotune) %>% 
  arrange(mean) %>% 
  filter(.metric=="rmse") %>%
  slice(1)

# POLYNOMIAL REGRESSION
poly_rmse <- collect_metrics(con_polytune) %>% 
  arrange(mean) %>% 
  filter(.metric=="rmse") %>%
  slice(1)

# K NEAREST NEIGHBORS
knn_rmse <- collect_metrics(con_knntune) %>% 
  arrange(mean) %>% 
  filter(.metric=="rmse") %>%
  slice(1)

# ELASTIC NET
elastic_rmse <- collect_metrics(con_entune) %>% 
  arrange(mean)%>% 
  filter(.metric=="rmse") %>%
  slice(1)
lm_rmse
## # A tibble: 1 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard    3.45    10  0.0155 Preprocessor1_Model1
ridge_rmse
## # A tibble: 1 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    0.01 rmse    standard    3.45    10  0.0157 Preprocessor1_Model01
lasso_rmse
## # A tibble: 1 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    0.01 rmse    standard    3.45    10  0.0157 Preprocessor1_Model01
poly_rmse
## # A tibble: 1 x 7
##   degree .metric .estimator  mean     n std_err .config             
##    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1      5 rmse    standard    2.82    10  0.0264 Preprocessor5_Model1
knn_rmse
## # A tibble: 1 x 7
##   neighbors .metric .estimator  mean     n std_err .config             
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 rmse    standard    2.65    10  0.0212 Preprocessor1_Model5
elastic_rmse
## # A tibble: 1 x 8
##   penalty mixture .metric .estimator  mean     n std_err .config               
##     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1    0.01       1 rmse    standard    3.45    10  0.0157 Preprocessor1_Model091
rf_rmse
## # A tibble: 1 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4   350    20 rmse    standard    2.37     5  0.0162 Preprocessor1_Model1~



Model Results

Code in file model_results.rda

Let’s compare the best-tuned versions of each model from before to find the best performing one of them all.

The model that minimizes RMSE the most is the random forest model with mtry = 4, trees = 350, and min_n = 20 (see the rf_rmse output above), which achieves a mean RMSE of 2.370339 across 5 folds.
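
As a quick check (assuming the tuned object con_rftune from the previous section), show_best() lists the top parameter combinations by mean RMSE:

# list the best random forest parameter combinations by mean RMSE across the folds
show_best(con_rftune, metric = "rmse", n = 5)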

Elastic Net

autoplot(con_entune, metric="rmse")


The models with a small penalty minimize RMSE the most, regardless of mixture; as the penalty increases, the model performs increasingly worse. This makes sense because a high penalty shrinks the coefficients on our predictors, effectively downplaying the effect the predictors have on our response.

Also, when the penalty is small (around 0.01 to 0.1), the performance differences between models with different mixture values are tiny. This is consistent with our ridge regression model (mixture = 0) achieving an RMSE of 3.451871, only slightly higher than our lasso regression model’s (mixture = 1) 3.449157.
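
For reference, this is how glmnet parameterizes the elastic net, and what penalty and mixture mean in the tuning above: with penalty $\lambda$ and mixture $\alpha$, the model minimizes

$$\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

so mixture = 1 gives the pure lasso penalty, mixture = 0 gives the pure ridge penalty, and the penalty value controls how strongly the coefficients are shrunk toward zero.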

Polynomial Regression

autoplot(con_polytune, metric="rmse")


As the degree of our continuous variables increases from 1 to 5, the RMSE decreases steadily. This aligned with my expectations because our actual data, like con_pnts and con_wins, are based on individual points and wins, where small amounts can make a difference in determining a constructor’s place in the standings. Oftentimes, a constructor’s position in the standings is only guaranteed at the last race, because competition between competitors lasts the entire season. So, by adding degrees, we amplify the differences between our data points, and the effect of each predictor becomes more and more influential. It seems a degree of 5 performs the best here, and an even higher degree might lower the RMSE further, though at some point we would risk over-fitting.

K-Nearest Neighbors

autoplot(con_knntune, metric="rmse")


Similarly, the knn model performs increasingly better as the number of neighbors increases. Our knn model with 15 neighbors minimized RMSE the most of any model except the random forest, which is not entirely surprising. The non-linear models, like knn and the random forest, seem to perform the best, so our data may not be linear.

Random Forest

autoplot(con_rftune, metric="rmse")


Focusing on how the tuning parameters affect RMSE, it looks like minimal node size and the number of trees make almost no difference. However, the number of randomly selected predictors, mtry, minimizes RMSE the most when it is larger (roughly 2 to 8), though there is a slight dip in RMSE across all models at mtry = 5. We also would not want mtry to equal the total number of predictors, because then every tree considers the same candidate predictors at each split and the trees are no longer independent; at that point we are effectively bagging rather than building a random forest.


Results of Best Model

# parameters of best model: mtry=4, trees=350, min_n=20
rf_rmse
## # A tibble: 1 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4   350    20 rmse    standard    2.37     5  0.0162 Preprocessor1_Model1~


Fitting to Training

# fit best model to training
best_rf_train <- select_best(con_rftune, metric="rmse")

# finalize workflow
final_model_wflow <- finalize_workflow(con_rfwflow, best_rf_train)
final_model_fit <- fit(final_model_wflow, data=con_train)
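
Since we set importance = "impurity" in the ranger engine, we can also take a quick look at which predictors the forest leans on, using the vip package loaded earlier (a small sketch; output not shown here):

# variable importance of the finalized random forest fit
final_model_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)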



Testing

Now let’s apply the model to the testing data.

con_tibble <- predict(final_model_fit, new_data = con_test %>% select(-con_pos))
con_tibble <- bind_cols(con_tibble, con_test %>% select(con_pos))

con_metric <- metric_set(rmse)
con_tibble_metric <- con_metric(con_tibble, truth=con_pos, estimate=.pred)
con_tibble_metric
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.36


The model performed even better on the testing data, achieving an RMSE of 2.361256. Since RMSE measures how far predicted values fall from actual values, and our response variable con_pos ranges from 1 to the low twenties, an RMSE of 2.361256 is not too bad. Our predicted standings positions might be a few places off, which is not ideal, but it’s manageable.
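
As one last visual sanity check (a sketch using the con_tibble predictions created above), we can plot predicted against actual positions; points near the diagonal line are predicted well.

ggplot(con_tibble, aes(x=con_pos, y=.pred)) +
  geom_jitter(width=.3, alpha=.3)+
  geom_abline(slope=1, intercept=0, col="red")+   # perfect predictions fall on this line
  labs(title="Predicted vs. Actual Constructor Position", x="Actual position in standings", y="Predicted position in standings")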



Conclusion

The best performing model is the random forest model, while the more linear, simple models, like the linear regression and elastic net models, performed poorly in comparison. I was not entirely surprised by how our models performed, because predicting constructor standings did not intuitively seem like a linear trend. There are a lot of factors at play, many of which aren’t even included or properly represented in our data here, like more detailed qualifying data, grid-place penalties from the FIA, ongoing car issues, etc. With all this in consideration, I did not think a basic linear model could cover what we needed.

As for next steps, I will probably spend a bit more time carefully going over everything and re-fit the random forest model with a proper number of folds and wider ranges for the tuning parameters. I would also spend more time manipulating the data and perhaps even include data for our rookie drivers.

Overall, this project was successful and a great challenge in compiling data and applying something I am genuinely interested in and passionate about to academic skills I’ve only recently learned in this class.



Sources

Data comes from the Ergast Developer API.

Information about Formula 1 and the FIA comes from my own personal knowledge, the Official F1 Website, and Wikipedia.