Any analysis of cycling depends on dividing races into types: sprint stages, mountain stages, time trials, and so on. In this post I’ll discuss several approaches for sorting one-day races, multi-day stage races, and Grand Tour stages into appropriate groups for analysis.
This classification problem requires gathering a large amount of data and doing significant feature engineering. My first attempt to categorize stages relied solely on race and stage results. Stages are already assigned KOM points and sprint points, and the timing results are strongly indicative of the type of stage (e.g., sprint stages have little or no separation between the times of the top finishers, while mountain stages have significant separation throughout the peloton). The speed and length of a stage are informative as well.
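To make the results-based features concrete, here is a minimal sketch of how they could be computed for one stage. It is an illustration rather than the code behind this analysis: the function name, the use of a population standard deviation, and taking speed from the winner's time are all assumptions, and the finishing times are made up.

```python
from statistics import pstdev


def result_features(finish_times_s, distance_km, top_n=40):
    """Summarise one stage from its results alone.

    finish_times_s: finishing times in seconds, one per classified rider.
    distance_km: stage length in kilometres.
    """
    times = sorted(finish_times_s)
    top = times[:top_n]
    winner_time_h = times[0] / 3600  # assumption: stage speed = winner's speed
    return {
        "distance_km": distance_km,
        "speed_kph": distance_km / winner_time_h,
        # Assumption: population standard deviation of the top finishers' times.
        "top_n_time_sd_s": pstdev(top),
    }


# Hypothetical bunch sprint: 40 riders given the same time over 180 km.
sprint = result_features([4 * 3600] * 40, distance_km=180)
# Hypothetical summit finish: riders arrive 10 seconds apart over 160 km.
summit = result_features([5 * 3600 + 10 * i for i in range(40)], distance_km=160)

print(sprint)
print(summit)
```

The sprint stage shows no spread at all among the top finishers, while the summit finish shows a spread of well over a minute, which is the contrast the timing data is meant to capture.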
I first used Principal Component Analysis (PCA) to find the features that best separated stages. The first PC was always dominated by the KOM difficulty of climbing, and the second PC was always strongly influenced by the distance of the stage (separating short mountain stages from long one-day classic races). Other features, like the standard deviation in time for the top 40 finishers and the speed of the stage, were also strong indicators of stage type.
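For readers unfamiliar with PCA, the sketch below finds the first principal component of a small feature table with plain power iteration. It is not the code used for this post, and in practice a library implementation would be the usual choice. The feature columns and every value in `stages` are hypothetical, so the loadings it prints say nothing about real races.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix, by power iteration.

    rows must already be centred (for example by standardize).
    """
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    # Uneven starting vector, to avoid starting orthogonal to the answer.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical stages: KOM points, distance (km), top-40 time SD (s), speed (kph).
names = ["kom_points", "distance_km", "top40_sd_s", "speed_kph"]
stages = [
    [0, 190, 0, 45],
    [2, 205, 5, 44],
    [10, 180, 60, 41],
    [12, 216, 93, 39],
    [40, 160, 150, 36],
    [55, 110, 240, 34],
    [5, 255, 30, 42],
]

pc1 = first_principal_component(standardize(stages))
for name, loading in zip(names, pc1):
    print(f"{name}: {loading:+.2f}")
```

Features with the largest absolute loadings are the ones that contribute most to the component. The sign of the whole vector is arbitrary.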
I then clustered the stages on these PCs using k-means, experimenting with between 3 and 5 clusters (ignoring time trials), in the hope that the data would sort itself into sprint/intermediate/flat with uphill finish/medium mountain/high mountain or some similar breakdown.
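The clustering step can be sketched with a plain version of Lloyd's algorithm for k-means. This is a simplified stand-in, not the code behind this post: the farthest-point initialisation is an assumption chosen to make the example repeatable, and the two-dimensional `scores` are invented PC scores rather than real stages.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Assumption: deterministic farthest-point initialisation.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical (PC1, PC2) scores for six stages.
scores = [(-2.0, 0.1), (-1.9, -0.2), (0.0, 1.0), (0.2, 1.1), (2.1, -0.5), (2.0, -0.4)]
labels = kmeans(scores, k=3)
print(labels)
```

The cluster numbers themselves carry no meaning. Attaching names such as "sprint" or "high mountain" to them is a manual step done after inspecting which stages land in each cluster.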
However, this process misclassifies many stages. The three profiles below all award very similar KOM points. Stage 5 in 2016 was 216 km, raced at 39 kph, with a standard deviation of 93 seconds between the top 40 finishers. Stage 10 in 2016 was 197 km, raced at 45 kph, with a standard deviation of 236 seconds between the top 40 finishers. Stage 5 in 2017 was 161 km, raced at 43 kph, with a standard deviation of 44 seconds. None of this is very informative: all three stages were assigned to the same cluster, either ‘intermediate’ or ‘medium mountain’ depending on the number of clusters used.
Stage 5 in 2016
Stage 10 in 2016
Stage 5 in 2017
But one is clearly a stage with a summit finish that separated the GC contenders (Stage 5 in 2017), while another is clearly a stage the GC contenders won’t contest (Stage 10 in 2016). Results data alone could not accurately separate mountain stages from intermediate stages, or intermediate stages from sprint stages.
Instead, I was forced to gather actual race route data showing elevation and distance throughout each stage. Linking that data with data on categorized climbs gave me a much richer data set for judging the difficulty of climbing (a topic for a future post) and for separating stage types.
Some of the helpful features for dividing stages were:
- gradient in the final 1 km
- concentration of climbing difficulty (basically the difficulty of the toughest climb in a stage)
- total elevation change (the amount of uphill or downhill distance in a stage)
- number of categorized climbs
- overall climbing difficulty
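Two of the features above can be derived directly from an elevation profile. The sketch below is one possible way to do it, assuming the route is a list of (distance in km, elevation in m) points with strictly increasing distances. The function names and the profile are hypothetical, and the climb-based features are left out because they need categorized climb data.

```python
def elevation_at(profile, km):
    """Linearly interpolate the elevation (m) at a given distance (km)."""
    for (d1, e1), (d2, e2) in zip(profile, profile[1:]):
        if d1 <= km <= d2:
            return e1 + (e2 - e1) * (km - d1) / (d2 - d1)
    raise ValueError("km is outside the profile")


def route_features(profile):
    """profile: (distance_km, elevation_m) points ordered from start to finish."""
    steps = list(zip(profile, profile[1:]))
    total_ascent = sum(max(e2 - e1, 0) for (_, e1), (_, e2) in steps)
    total_descent = sum(max(e1 - e2, 0) for (_, e1), (_, e2) in steps)

    finish_km, finish_elev = profile[-1]
    rise_m = finish_elev - elevation_at(profile, finish_km - 1.0)
    return {
        "total_ascent_m": total_ascent,
        "total_descent_m": total_descent,
        # Metres gained over the last 1000 m, expressed as a percentage.
        "final_km_gradient_pct": rise_m / 1000 * 100,
    }


# Hypothetical stage: rolling terrain, then a steep climb to the line.
profile = [(0, 200), (50, 250), (100, 220), (150, 400), (158, 500), (160, 660)]
features = route_features(profile)
print(features)
```

Real route data is much denser and noisier than this, so ascent totals computed this way usually need some smoothing of the elevation series first.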
I achieved better classification results (as judged against my manual classification of historical Tour de France stages) using these features with the same PCA/k-means methods as before, but I still wasn’t getting the precision necessary. Frustratingly, the Stage 5 2017 finish at La Planche des Belles Filles was being classified either as similar to Stage 6 2018, which ended on the Mûr-de-Bretagne, or as similar to the intermediate/hilly stages from earlier.
I attempted to supplement this process by including decision trees trained on previously tagged Tour de France stage data (about 12% of overall stages were tagged). This again provided additional precision, but not the quantum leap needed.
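A decision tree is built by repeatedly choosing the feature and threshold that best separate the tagged examples. The sketch below shows only that core step, a single split chosen by Gini impurity, and is not the model used here. The two features and the four tagged stages are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds sit halfway between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


# Hypothetical tagged stages: (final km gradient in %, total ascent in m).
rows = [(0.5, 800), (1.0, 1200), (7.5, 3500), (8.0, 4200)]
tags = ["sprint", "sprint", "mountain", "mountain"]

impurity, feature, threshold = best_split(rows, tags)
print(impurity, feature, threshold)
```

A full tree applies the same search again inside each side of the split until the groups are pure or too small to divide further.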
I’ll discuss how I bridged the gap and finally solved this problem using random forests (a model using many decision trees) in a future post.