Cycling is fundamentally a team sport, and like all team sports it has roles that riders fill in each race. Unlike in most team sports, however, teams do not state those roles explicitly before the race. To confuse things further, teams regularly compete in races of very different strength: a rider who is a helper in a World Tour race could easily be the protected leader in a lower-level 1.1 race. The challenge of defining which role each rider fills on their team can be collapsed into two questions: 1) which parcours suits the rider (sprint finishes, hills, mountains), and 2) are they typically the leader or a helper (do they finish as the top rider on their team often or rarely)?
Cluster analysis is regularly used in other team sports to define roles, even in sports with more clearly defined positions. This paper from the 2012 Sloan Sports Analytics Conference discusses clustering based on roles in the context of the NBA. This talk from Opta Pro Forum in 2015 discusses clustering based on player types in the context of football. There have been many more advanced and refined attempts at clustering in both (and other) sports since. Clustering is most easily done with either the K-means method or hierarchical clustering. Both take a set of features for each row of data as input. For K-means, you have to pre-define the number of clusters you’re looking for (this can be optimized, so it’s not necessarily arbitrary), whereas hierarchical clustering builds a tree of nested clusters that can be cut at any level to give smaller and smaller clusters.
Clustering in Pro Cycling
K-means is the method I’ll use here. The key to using K-means (and any clustering method) is choosing features that give the algorithm obvious ways to divide the data. For this, I’ve defined season-long average values for 2017-2020 for four statistics:
- % of points earned in bunch sprint finishes (of all points earned) – points decay with finishing position, from the most for 1st place to the least at a cut-off that falls between 15th and 50th place depending on the strength of the peloton
- Overall points per race-day – with the same definition of points
- % of race-days finishing as #1 rider on your team (must also finish in top 20 in the race)
- Difficulty of the parcours weighted by points earned – where tougher mountain stages are high difficulty and flat stages are low difficulty
These four features capture 1) whether a rider earns points in sprint finishes, 2) whether they are finishing high in races, 3) whether they are leading the team, and 4) whether they fit best on flatter, hillier, or mountainous races. We could generate other features, such as how often a rider is in the breakaway, their performance in time trials, whether they’re successful in tough conditions, or how strong the races they participate in are, but these four give a good start and have strong data availability going back 3+ years.
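As a rough illustration of how the four features could be computed for one rider-season, here is a minimal sketch. The `Result` schema, the `points` function with its linear decay and fixed cut-off of 20, and the sample results are all assumptions made for the example; the post does not give the actual points formula or data layout.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Result:
    """One rider's result on one race-day (hypothetical schema)."""
    rider: str
    position: int        # finishing position in the race
    team_rank: int       # 1 = best-placed rider on the team that day
    bunch_sprint: bool   # True if the day ended in a bunch sprint
    difficulty: float    # parcours difficulty, higher = more mountainous


def points(position: int, cutoff: int = 20) -> float:
    """Assumed linear decay: 1st earns the most, places past the cut-off earn 0."""
    return float(max(cutoff - position + 1, 0))


def rider_features(results: list[Result]) -> dict[str, float]:
    """Season-long values of the four clustering features for one rider."""
    if not results:
        raise ValueError("need at least one race-day")
    pts = [points(r.position) for r in results]
    total = sum(pts)
    race_days = len(results)
    sprint_pts = sum(p for p, r in zip(pts, results) if r.bunch_sprint)
    # Leader days: best on the team and inside the top 20 of the race.
    leader_days = sum(1 for r in results if r.team_rank == 1 and r.position <= 20)
    weighted_difficulty = sum(p * r.difficulty for p, r in zip(pts, results))
    return {
        "sprint_points_share": sprint_pts / total if total else 0.0,
        "points_per_race_day": total / race_days,
        "leader_share": leader_days / race_days,
        "points_weighted_difficulty": weighted_difficulty / total if total else 0.0,
    }


# Made-up race-days for a placeholder rider.
season = [
    Result("Rider A", 1, 1, True, 1.0),
    Result("Rider A", 3, 1, True, 2.0),
    Result("Rider A", 60, 4, False, 8.0),
    Result("Rider A", 11, 2, False, 4.0),
]
features = rider_features(season)
```

A rider with this profile earns most of their points in bunch sprints on easy parcours, which is the kind of signal that lets the algorithm separate sprinters from climbers.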
Performing the Clustering
K-means can be optimized using several methods (elbow, silhouette, etc.) to find the correct number of clusters. Sometimes the number will be obvious and sometimes a small range is appropriate. For this data, between 4 and 7 clusters was the best fit. After fitting models across that range, six clusters produced the most explainable grouping.
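To make the selection step concrete, here is a minimal sketch of comparing cluster counts by silhouette score. It is not the code behind the results in this post: it uses a plain-Python Lloyd's algorithm and synthetic data so that it runs without dependencies, and all function names are made up for the example. In practice a library such as scikit-learn would usually handle both the clustering and the scoring.

```python
import math
import random


def standardize(rows):
    """Rescale every feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(row, centroids[j]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(rows, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for row, lab in zip(rows, labels):
        clusters.setdefault(lab, []).append(row)
    scores = []
    for row, lab in zip(rows, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(row, o) for o in own) / (len(own) - 1)
        b = min(sum(math.dist(row, o) for o in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Synthetic stand-in for the rider-season table: six made-up groups, four features.
rng = random.Random(42)
centers = [[rng.uniform(0, 10) for _ in range(4)] for _ in range(6)]
data = [[c + rng.gauss(0, 0.5) for c in center] for center in centers for _ in range(20)]
scaled = standardize(data)

scores = {}
for k in range(4, 8):  # the 4 to 7 range discussed above
    scores[k] = silhouette(scaled, kmeans(scaled, k))
best_k = max(scores, key=scores.get)
```

Standardizing first matters because K-means relies on Euclidean distance, so a feature on a larger scale (such as points per race-day) would otherwise dominate the percentages. The score only narrows the range; picking the most explainable option within it remains a judgment call.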
The six clusters produced can be broadly defined as three leader clusters and three helper clusters with the three levels corresponding to mountainous or flatter parcours.
- Sprinters – the easiest cluster to define; these riders are most successful in bunch sprints in flatter races and are often the leader
- Climbers – these riders get few points in bunch sprints; rather they earn points in mountainous finishes and are often the leader of the team
- Puncheurs – these riders are best on hillier parcours and can win from the bunch or in smaller groups
- Climbing helper – these riders earn fewer points and are leaders less often, but are more often successful in mountainous/hilly stages
- Sprint train – these riders often earn points in bunch sprint finishes, but are rarely leaders
- Domestiques – this is the catch-all group for riders who aren’t successful in mountain/hilly stages and don’t often earn bunch sprint points; these can be road captains or super-strong men like Tim Declercq whose work is done before the pointy end of the race.
|Cluster|% of Riders|Example (2019)|
|---|---|---|
|Climbing helper|19%|Marc Soler|
|Sprint train|20%|Max Richeze|
So about 28% of riders fit into one of the three leader clusters, another 39% into the two specialized helper clusters, and 34% into the more generic domestique cluster. Put another way, in an eight-man grand tour team you’ll normally have two protected riders, three specialized helpers, and three less specialized domestiques.
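The eight-man breakdown follows directly from those shares. A quick check of the arithmetic, using only the percentages quoted above (which sum to 101%, presumably because of rounding):

```python
# Cluster shares as quoted in the text.
shares = {
    "leader clusters": 0.28,
    "specialized helpers": 0.39,
    "domestiques": 0.34,
}
team_size = 8  # riders in a grand tour team

# Expected riders per group, before and after rounding to whole riders.
expected = {role: share * team_size for role, share in shares.items()}
rounded = {role: round(count) for role, count in expected.items()}
```

The unrounded values are roughly 2.2, 3.1 and 2.7 riders, which round to two, three and three.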
This visual lays out how this looks at the team level with colors denoting clusters, % of races as leader on x-axis, and parcours fit on y-axis. Below is Bora Hansgrohe – one of the most successful teams in the World Tour in 2019.
They had three primary sprinters in 2019, who are clustered on the lower right, and two climbers in the upper right. They had a number of puncheurs, of whom Max Schachmann is the prime example. The clustering isn’t perfect here; Formolo is more of a climbing helper and Postlberger is more involved in the sprint train, but because of their mixed roles they get classified as puncheurs. Muhlberger is certainly a climbing helper though. In the bottom left are numerous support riders, of whom Schwarzmann, Archbold, Burghardt, and Selig are classed as sprint train and most of the rest are domestiques. You can argue Bodnar and Oss are more likely sprint train than not (and Oss is clustered with sprint train for 2017, 2018, and 2020).
In general though, these plots give a strong overview of which roles riders are fulfilling in a team for a given season.
A generic plot of where all riders fell in 2019 is below.
This clustering has numerous applications like:
- does having more sprint train riders predict more success for a team’s sprinters, and likewise for climbers and their climbing helpers?
- how does power output differ across clusters on different stage types?
- which types of riders are most successful on different parcours?
- which teams are most and least balanced (high or low percentage of riders in leader clusters versus helper clusters)?
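As an example of the last question, team balance could be measured as the share of each roster that falls into a leader cluster. This is a minimal sketch with placeholder teams and made-up cluster assignments; `leader_share_by_team` is a hypothetical helper, not part of the analysis above.

```python
from collections import Counter

# The three leader clusters named above.
LEADER_CLUSTERS = {"Sprinters", "Climbers", "Puncheurs"}


def leader_share_by_team(assignments):
    """assignments: iterable of (team, cluster) pairs, one per rider."""
    totals, leaders = Counter(), Counter()
    for team, cluster in assignments:
        totals[team] += 1
        if cluster in LEADER_CLUSTERS:
            leaders[team] += 1
    return {team: leaders[team] / totals[team] for team in totals}


# Placeholder rosters of eight riders each (not real teams).
assignments = (
    [("Team X", "Sprinters"), ("Team X", "Climbers")]
    + [("Team X", "Domestiques")] * 3
    + [("Team X", "Sprint train")] * 3
    + [("Team Y", "Climbers")] * 2
    + [("Team Y", "Puncheurs")] * 2
    + [("Team Y", "Climbing helper")] * 4
)
balance = leader_share_by_team(assignments)
```

Here Team X has a quarter of its riders in leader clusters and Team Y has half, so Team Y is the more leader-heavy roster under this measure.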