Experts versus models: How do we rank drivers?

Who was a greater Formula 1 driver: Nigel Mansell or Elio de Angelis? Expert panels have uniformly selected Mansell. Raw statistics of success — wins, championships, and the like — also clearly favor Mansell. Yet, every mathematical model of driver rankings developed to date resoundingly answers de Angelis. Examining the data table below helps to explain why.

[Table: Mansell vs. de Angelis head-to-head results]

This is a motivating example, but certainly not an isolated case of head-to-head results and career achievements telling conflicting stories in Formula 1. Such cases highlight the importance of the metrics used to make driver comparisons, and the different tendencies of experts and models.

When I began this blog, I had just embarked on a major analysis project that ultimately generated model-based rankings of the best Formula 1 drivers across history. The essential difficulty in rating drivers (compared to rating athletes in many other sports) is accounting for the different cars drivers used — with the exception of teammates (and even that comparison can be complicated), no direct driver comparisons are possible. Team performance differences greatly exceed driver performance differences in Formula 1, meaning traditional metrics for success (e.g., wins, points) make for unreliable driver performance metrics. Quantitative models are thus needed to infer any sort of objective rankings.

At the time I started my work, only one such model existed for Formula 1 (Eichenberger and Stadelmann, 2009). I felt that there was still room for significant improvement, and I think my model (Phillips, 2014) generated many new insights, and a driver hierarchy with greater face validity. This model has been the basis for several of my posts on this blog, including yearly rankings.

Recently, another model was published (Bell et al., 2016), unrelated to this blog, also with the goal of separating driver and team performances in Formula 1. This means that there are now three peer-reviewed models. Each of these models is a fully objective method for ranking drivers and teams: provided with a set of data (i.e., historical race results), each model finds the parameter values representing team and driver performances that best fit that data.

Interestingly, each model has slightly different underlying assumptions. These are the choices that dictate what the model takes as inputs, how these are mathematically related, the complexity of the model, and what the model gives as outputs. The existence of multiple models is extremely valuable for understanding how different assumptions affect relative rankings, and how robust rankings are to the structure of the underlying model.

The properties of each model are summarized in the table below, with my model (Phillips (2014) — the one used on f1metrics) in the middle column.

[Table: Properties of the three F1 driver-ranking models]

One of the key differences between the 2009 model and the latter two is the use of finishing position instead of points as the performance metric. Using finishing position in a linear model introduces a few potential problems. First, the model becomes very sensitive to any bad results. For example, a driver who finishes 1st several times and 20th once (e.g., due to a DNF) may be rated worse than a teammate who consistently finishes 3rd. The same DNF penalty does not afflict drivers who habitually run in lower positions, meaning DNFs also penalize drivers differently depending on the competitiveness of their team under this system. Second, the model will experience floor and ceiling effects — driver/car combinations right near the top or the bottom of the field have less scope for differences in finishing positions. A nonlinear performance metric can address this, because points change nonlinearly with finishing position, so equal performance differences need not translate into equal changes in the metric at every part of the field. An obvious downside to traditional points systems is their failure to discriminate between non-scoring positions (e.g., 12th vs. 15th), but this can be addressed using an extended points system, as I described previously, and that is how the other two models work.
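
As an illustration of what an extended, nonlinear points metric might look like, here is a minimal Python sketch. The geometric decay and the specific values are entirely hypothetical, not the scale used by any of the three models.

    # Hypothetical extended points scale: every finishing position scores,
    # and differences near the front are worth more than the same gap mid-field.
    def extended_points(position):
        """Map a finishing position (1 = winner) to a points score."""
        if position < 1:
            raise ValueError("position must be 1 or greater")
        # Geometric decay: each position is worth 80% of the one ahead.
        return 25.0 * 0.8 ** (position - 1)

    for pos in (1, 2, 3, 12, 15):
        print(pos, round(extended_points(pos), 2))

Under a scale like this, 12th vs. 15th is still a measurable difference, while a single 20th place no longer swamps a string of podiums the way it does on a linear finishing-position scale.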

A second key difference is the treatment of DNFs. The model used on this blog (Phillips, 2014) is unique in excluding non-driver DNFs (such as mechanical failures and technical disqualifications) from the analysis. The main downside to doing this is that it is extremely time-consuming to prepare the data in this form — as I can attest, having spent several weeks doing that myself! The upside is that it partially adjusts for bad luck. Bell et al. argue in their paper that, given enough races, the effects of luck will tend to balance out. I agree with this in principle, but in practice drivers often have short careers (tens of races), or extremely unlucky years. This undermines the ability to do per-season rankings (as I’ve presented using my model previously) and can lead to some peculiar driver rankings. For example, Lauda is ranked very low by the Bell et al. model, which is probably partly due to his very poor reliability alongside Prost in 1985 (10 mechanical failures in 14 starts). Meanwhile, Christian Fittipaldi is ranked the 11th best driver of all time by Bell et al., despite mixed results alongside weak teammates — this is because he had many fewer mechanical DNFs than his teammates during his short career. As a counterargument to excluding mechanical DNFs, car preservation was an important skill in the earlier days of F1, so mechanical failures cannot be considered purely bad luck, as they usually are today.
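
In data-preparation terms, the exclusion amounts to filtering results by retirement cause before fitting. A minimal sketch, with entirely hypothetical column names and retirement categories:

    import pandas as pd

    # Illustrative results table; the real data distinguish many more causes.
    results = pd.DataFrame({
        "driver": ["A", "B", "A", "B"],
        "race":   [1, 1, 2, 2],
        "status": ["engine failure", "finished", "finished", "spun off"],
    })

    # Causes treated as non-driver DNFs (mechanical failures, technical DSQs).
    NON_DRIVER_DNFS = {"engine failure", "gearbox", "technical dsq"}

    # Drop those rows so the model never sees outcomes the driver could not
    # control; driver-error retirements (e.g., "spun off") are kept.
    clean = results[~results["status"].isin(NON_DRIVER_DNFS)]
    print(clean)

The time-consuming part is not the filtering itself but classifying every retirement in the historical record into driver and non-driver causes.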

The Bell et al. model is unique in assuming some relationship between team performances across different years. The other two models instead allow team performance to be fit independently in each year. There is some sense to both approaches. Top teams tend to remain top teams between seasons. However, there are also many historical examples of drastic changes in team performance from one season to the next. Both assumptions are therefore worth consideration.
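
The structural difference can be sketched in a few lines: team performance drawn independently each season versus with carry-over from the previous season (here an AR(1)-style process). This is only a caricature of the two assumptions, not either paper's actual specification.

    import numpy as np

    rng = np.random.default_rng(0)
    n_years = 10

    # Independent yearly team effects (as in the 2009 and 2014 models):
    independent = rng.normal(0.0, 1.0, n_years)

    # Team effects with carry-over between seasons (in the spirit of Bell et al.):
    rho = 0.8  # hypothetical year-to-year carry-over
    correlated = np.zeros(n_years)
    for t in range(1, n_years):
        correlated[t] = rho * correlated[t - 1] + rng.normal(0.0, 1.0)

    print(np.round(independent, 2))
    print(np.round(correlated, 2))

The first series can jump arbitrarily between seasons (capturing drastic swings in team form); the second encodes the observation that top teams tend to remain top teams.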

The models of Eichenberger & Stadelmann and Bell et al. also consider effects of secondary factors, such as weather and the track. I preferred to start with a simpler model structure, given the risk of model overspecification, especially since many years have had 0-2 wet races. Nevertheless, Bell et al. used this information to show that drivers are of relatively greater importance on average in the wet and on street circuits. Both of these findings square very well with intuition.

The three models make predictions that are similar in many areas, but significantly different in others. In addition to ranking drivers, the models can also be used for ranking the best teams in history. The two most recent models confirm that team performance is a larger contributor than driver performance to overall performance in Formula 1, with estimates of 61% team from Phillips and 86% team from Bell et al. Eichenberger & Stadelmann probably also computed this figure, but did not report it in their paper.
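
One natural way to read such a figure (the papers' exact calculations may differ, so take this as an assumption about the general idea) is as a variance decomposition over the fitted effects:

    share(team) = Var(team effects) / [Var(team effects) + Var(driver effects)]

Under this reading, 61% and 86% are estimates of the team share implied by each model's fitted parameters.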

The top 50 driver rankings by the three models are listed below. In some cases, two of the models like a particular driver, while one does not.

[Table: Top 50 driver rankings by each of the three models]

One way in which we can use the models is to create a consensus ranking list by combining the rankings of the three individual models. In essence, we are then treating the three models as experts with differing opinions. Generating consensus rankings is a problem that has been studied in many areas outside of sports, from electing political candidates to prioritizing business decisions.

The simplest approach to consensus rankings is to create a list based on the average rankings of each candidate. This is the Borda method. For example, if one driver has ranks of 3, 2, and 6, their average ranking is 3.67. If another driver has ranks of 2, 1, and 4, their average ranking is 2.33, which would put them ahead of the first driver. However, the Borda method has well-known weaknesses. One problem is that it can be easily skewed by extreme outlier votes. For example, a driver who is ranked 5th by 5 models but 100th by 1 model will be ranked behind a driver who is ranked 20th by all 6 models. If we are treating all models as equally valid (which we are for now), it seems more sensible to follow the consensus of 5 models rather than the wildly different ranking of 1. A related problem is that the Borda method fails the Condorcet criterion, which requires that a candidate who beats every other candidate in pairwise majority comparisons must win; in particular, if an absolute majority of ranking lists place a driver in the number 1 position, that driver should be ranked number 1 in the consensus list.
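
Implementing the Borda method takes only a few lines; the ranks below are the made-up ones from the example above.

    # Borda (average-rank) consensus with the illustrative ranks above.
    ranks = {
        "Driver A": [3, 2, 6],
        "Driver B": [2, 1, 4],
    }

    averages = {driver: sum(r) / len(r) for driver, r in ranks.items()}
    consensus = sorted(averages, key=averages.get)  # lower average = better

    print(averages)   # Driver A: ~3.67, Driver B: ~2.33
    print(consensus)  # ['Driver B', 'Driver A']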

An alternative algorithm for finding consensus rankings is the Kemeny-Young method. The goal in this case is to find the consensus ranking list that minimizes total disagreement with the individual lists, using the Kendall tau distance as the measure of disagreement. In simple terms, we check the relative ordering of each pair of drivers in the consensus ranking list against their relative ordering in each individual ranking list. For example, if Driver A is ranked ahead of Driver B in the consensus ranking list, we count how many of the models disagree with that ordering (i.e., by ranking Driver B ahead of Driver A in their lists). The total number of disagreements is summed across all possible pairwise driver comparisons.

This method is guaranteed to satisfy the Condorcet criterion, and has other favorable features. The main downside is that it can be computationally intensive to actually find the optimal consensus ranking list. In this case, it was not difficult, because the lists involved are not too long and there are only three of them.
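
For short lists, a brute-force search over all orderings is enough to see how the method works. Here is a minimal sketch with made-up four-driver lists; real 50-driver lists need a cleverer search, since the number of orderings grows factorially.

    from itertools import combinations, permutations

    # Kendall tau distance: number of driver pairs whose relative order
    # differs between two ranking lists.
    def kendall_tau(list_a, list_b):
        pos_a = {d: i for i, d in enumerate(list_a)}
        pos_b = {d: i for i, d in enumerate(list_b)}
        return sum((pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
                   for x, y in combinations(list(list_a), 2))

    # Kemeny-Young by brute force: keep the ordering with the fewest total
    # pairwise disagreements across all input lists.
    def kemeny_consensus(lists):
        return min(permutations(lists[0]),
                   key=lambda order: sum(kendall_tau(order, lst) for lst in lists))

    model_lists = [["A", "B", "C", "D"],
                   ["A", "C", "B", "D"],
                   ["C", "A", "B", "D"]]
    print(kemeny_consensus(model_lists))  # ('A', 'C', 'B', 'D')

The same distance function also gives us the pairwise disagreement counts used to compare the three lists directly, as in the table below.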

Using this distance metric, we can first check the similarity of the three models’ lists based on how many pairwise disagreements there are between them.

[Table: Pairwise disagreement counts between the three models' ranking lists]

From the table above, we can see that the three models' lists are all similarly close to one another, with the two more recent models slightly closer to each other than either is to the Eichenberger & Stadelmann (2009) list.

The consensus ranking list for the three models (i.e., the list that is closest to all three) is shown below, with the driver ranks for each of the three models shown alongside.

[Table: Consensus ranking list, with each model's individual ranks alongside]

*Schumacher’s rating in the Bell et al. model is derived by ignoring his post-2006 results, and treating N Rosberg’s teammate for 2010-2012 as a different driver.

While the three models do not universally agree, there are some notable cases where they are in good agreement with one another, but in strong disagreement with expert opinions. These are interesting cases to consider, because they indicate drivers who may be overrated or underrated by subjective criteria. One of the most recent driver ranking lists compiled by a panel of experts is the Autosport Top 40, which polled 217 Formula 1 drivers. Below are the cases of most notable expert/model mismatch.

Jack Brabham

[Image: Jack Brabham]

Jack Brabham is quite rightly recognized as one of the giants of the sport. Nobody has repeated his feat of winning a championship in a car of his own construction. Only four drivers have won more drivers' championships than Brabham, and he raced successfully in F1 over a period of 16 years, winning his last race at age 43. However, when his successes are weighed against those of his teammates, all three models conclude that on driving performance alone, there were several better drivers in his generation. He was notably outperformed by Gurney, Rindt, and Ickx as teammates.

Mika Hakkinen

[Image: Mika Hakkinen]

Hakkinen is often cited as Schumacher's greatest rival. Martin Brundle, a pundit I hugely respect, has previously written about how closely he rated the pair on the basis of his experience as a teammate to both. It's important to note, however, that Brundle faced Schumacher in the latter's first full season of Formula 1, whereas Hakkinen was already into his third full season when the two were paired. Brundle thus faced Schumacher well before his peak.

The models attribute the closeness of the Hakkinen-Schumacher rivalry to a car advantage for McLaren in most years. The models also view Coulthard as no stronger than Barrichello, which does Hakkinen no favors given he was much closer matched with Coulthard than Schumacher was with Barrichello. In addition, Coulthard was beaten by a greater margin by Raikkonen than he was by Hakkinen during their respective stints as teammates.

Nigel Mansell

[Image: Nigel Mansell]

Mansell was hugely entertaining to watch. His all-or-nothing approach to racing often generated on-track drama. In the eyes of experts, he is consistently ranked among the all-time top 20, and sometimes even the top 10. It is difficult, however, to square these rankings with the objective facts. Mansell was convincingly beaten by teammate de Angelis, and then closely matched with Keke Rosberg, both of whom are typically ranked much lower by experts than Mansell. Models instead place Mansell closer to 50th in the all-time rankings, and are less than impressed by his tally of only one drivers' championship, given that he spent at least three seasons (1986, 1987, 1992) in undoubtedly the best car on the grid.

Gilles Villeneuve

[Image: Gilles Villeneuve]

Much like Mansell, Gilles Villeneuve had all the qualities that make a driver popular with fans and pundits: phenomenal car control, raw speed, wet-weather skills, and willingness to take (often insane) risks. Yet the results weren’t always there. The unmissable qualities mask issues in other areas, such as consistency and the occasions when the risks simply didn’t pay off. The models — which do not care whether race results were achieved by overt brilliance or boring consistency — view things very differently from the experts. By their assessments, Villeneuve’s results rate him a good driver but not a great one. In their view, if Villeneuve was the complete package, he should have more easily beaten his teammates, including chief rival Pironi, who was actually outscored by every teammate he ever faced in Formula 1.

Nick Heidfeld

[Image: Nick Heidfeld]

Nick Heidfeld holds the rather unfortunate distinctions of most podiums without a win (13) and second most starts without a win (183). Based on model rankings, he should be considered a natural contender for the best driver to never win a race, alongside Chris Amon. Undoubtedly, both of these drivers are better than many race winners in Formula 1 history, and both came agonizingly close to winning. Yet, for whatever reasons, Heidfeld hasn’t yet gained the same legacy as Amon in the eyes of experts.

In Amon’s case, there were parallel exploits in the Tasman Series, non-championship grands prix, and endurance racing by which experts could rate him. In Formula 1, the cars Amon raced were often quick enough to challenge for victories, but unreliable, which kept him in the public eye. Heidfeld, by contrast, raced mostly for middling Formula 1 teams, after narrowly missing two potential moves to McLaren. His results were consistently much better than should have been expected from the machinery, but there were few obvious highs and few obvious lows. He just kept on, metronomically, scoring points. A consistent, low-risk driving style can be devastatingly effective in a front-running team (just ask Alain Prost), but it has never been a fan winner for drivers stuck further down the grid. Heidfeld is the archetypal example.

Elio de Angelis

[Image: Elio de Angelis]

Elio de Angelis is a remarkably forgotten and underrated driver, especially given he died near his peak and had a reputation as one of the most charismatic drivers in the paddock. During their time as teammates at Lotus, de Angelis handily beat Mansell (see the table at the beginning of the article). He then gave Senna a genuine fight in 1985, losing 3-13 in qualifying, 3-5 in races, and 33-38 in points — comparable to Mansell’s results against Prost at Ferrari. If we consider that most expert ranking lists place Senna 1st and Mansell around 15th, it would seem inexcusable to leave de Angelis out of the top 10. In actual fact, he rarely appears inside the top 40!

The models arrive at a very different conclusion from the experts. Namely, they universally propose that de Angelis belongs among the greats of the 1980s, and that he should very likely be ranked ahead of Mansell.

Ayrton Senna & Alain Prost

[Image: Ayrton Senna and Alain Prost]

A final fascinating case to consider here is Senna vs. Prost. Every expert ranking list published by a major English-language motorsport journal since 1997 has ranked Senna the 1st or 2nd greatest driver of all time. In addition, every one of these lists has ranked Senna ahead of Prost, usually by a small margin. During their time as teammates at McLaren, Senna beat Prost 14-9 in races where neither driver had a mechanical failure. Prost was ahead 186-154 on overall points, although this tally is affected by Senna’s much worse luck in 1989.

Interestingly, none of the three mathematical models rank Senna or Prost in 1st place overall. Moreover, all three models rank Prost ahead of Senna, albeit by a small margin in the two more recent models. One could speculate on many reasons for Senna potentially being overrated by experts, including his spectacular driving style compared to Prost’s unassuming but very effective approach, Senna’s cultivated air of mysticism, and of course his untimely death near the peak of his career.

While models have the obvious advantage of being quantitative and objective, they are not without flaws. As a seasoned observer of the sport, I find there are times when my subjective assessments don't accord with my own model's rankings. In some of these cases, my subjective impressions turn out to be without solid foundation, but in others I can give good reasons why I think the model is incomplete. The problem with subjective views, of course, is that there are equally seasoned observers who don't share my opinions or my reasoning — whom then do we choose to believe? Subjective impressions from the keen eyes of experts can nevertheless guide us to areas where the models are currently missing important factors and could in future be improved.

One idea for improving these models is to include other performance metrics, such as timing data, in addition to race results. In my paper, I showed that the predictions of my model do correlate well with timing data. However, there are many challenges associated with deducing performances from lap-times, including safety cars, traffic, weather, tyre and fuel strategies, and drivers choosing to back off (e.g., when they have a comfortable lead or for technical reasons). Moreover, these factors have varied significantly between eras, making it difficult to find a satisfactory model specification. To my knowledge, there has been only one attempt to quantitatively rank driver and team effects using timing data, which is the impressively detailed GrandPrixRatings system. However, this is not a fully objective model, as the top drivers in each year are chosen by the author a priori, and then other driver and team performances are calculated relative to them.

Notably, the three models considered here do not account for factors such as team orders or interactions between drivers and cars (e.g., some drivers preferring particular cars). This is because these factors cannot ever be reliably quantified back to 1950, and typically cannot even be reliably quantified today. In the case of team orders, it is often difficult to assess the extent to which they were used or the knock-on effects they had (both in terms of race strategy and in terms of motivation for the #2 drivers — why fight if it’s futile?). In cases such as Ayrton Senna, Michael Schumacher, and Fernando Alonso, team orders probably had small effects on results because they were already comfortably beating their teammates in the vast majority of races. In the case of driver-car interactions, it is actually impossible to distinguish this from changes in form. Did a driver just have a bad year, or did they not like the car? These two scenarios look identical and are not amenable to any sort of easy quantitative separation.

A chief limitation of all three models is one mentioned in previous blog posts. Namely, the models do not account for systematic changes in performance across a driver's career with age and/or experience. I presented some preliminary work on quantifying age effects in my 2015 review, and I'm currently working to build a new model that includes these factors in a satisfactory fashion. This factor seems to account for most of the spurious driver rankings in my model. Specifically,

  • Nico Rosberg (who benefits from comparison to an older Michael Schumacher)
  • John Watson (who benefits from comparison to a recently-returned Niki Lauda)
  • Heinz-Harald Frentzen (who benefits from comparison to an older Damon Hill).

The Bell et al. model made a special exception in one of these cases, treating the Michael Schumacher of 2010-2012 as a completely different driver (something I have also tested in my model in previous posts), presumably to avoid Nico Rosberg emerging as one of the very top-ranked drivers, but this is an ad hoc solution. Both Eichenberger & Stadelmann and I have explored models that include age or experience effects, but we have yet to find a specification that really makes sense and improves the model.
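
To make the idea concrete, here is one purely illustrative way an age effect could enter such a model: baseline driver skill plus a quadratic age curve. The functional form, peak age, and curvature are all hypothetical; this is not the specification of any published model, nor of my work in progress.

    # Illustrative only: driver performance as baseline skill plus an age
    # curve peaking at a hypothetical age and falling off quadratically.
    def driver_performance(base_skill, age, peak_age=28.0, curvature=0.01):
        return base_skill - curvature * (age - peak_age) ** 2

    for age in (22, 28, 34, 40):
        print(age, round(driver_performance(1.0, age), 3))

Under a curve like this, a comparison between a driver in his mid-20s and a teammate in his 40s (e.g., Rosberg versus the returning Schumacher) would no longer be read as a pure skill difference.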

By its very nature, model development is never finished. It’s an iterative process of improvement and no model is ever perfect. I’ll keep working on mine, and I’m sure others will take up the challenge too. That’s it from me until the annual season review!

References

Eichenberger, R. and Stadelmann, D., 2009. Who is the best Formula 1 driver? An economic approach to evaluating talent. Economic Analysis and Policy, 39(3), pp.389-406.

http://www.sciencedirect.com/science/article/pii/S0313592609500355

Phillips, A.J.K., 2014. Uncovering Formula One driver performances from 1950 to 2013 by adjusting for team and competition effects. Journal of Quantitative Analysis in Sports, 10(2), pp.261-278.

https://www.degruyter.com/dg/viewarticle/j$002fjqas.2014.10.issue-2$002fjqas-2013-0031$002fjqas-2013-0031.xml

[Email me for a pdf copy]

Bell, A., Smith, J., Sabel, C.E. and Jones, K., 2016. Formula for success: Multilevel modelling of Formula One Driver and Constructor performance, 1950–2014. Journal of Quantitative Analysis in Sports, 12(2), pp.99-112.

https://www.degruyter.com/view/j/jqas.ahead-of-print/jqas-2015-0050/jqas-2015-0050.xml
