In my last post I dug into the Tour de France power data shared by @Velofacts, adding to his analysis by breaking down the power output of each rider relative to his own baseline. In other words, instead of judging the power output of Thomas De Gendt against other riders who have different skill sets, judge it relative to his own level. Some of the key findings were: 1) the flat sprint stages saw significantly lower relative power output than the big mountain days, 2) tough days where the peloton pushed hard, like stage 7, saw power output comparable to the mountain days, and 3) riders saw peak power output when they were in the break.
This post takes that analysis further and asks what stage characteristics lead to high or low relative power output across pro races. To do that, I’ve collected nearly 10,000 individual stages linked to specific races for pro riders across the World Tour and continental tours for 2019 and 2020. This dataset includes 292 unique riders, with 98% of the data coming from riders with at least 10 races (the minimum for inclusion in the modelling below). The average normalized power for this sample was 278 watts (4.09 watts/kg), with the 10th to 90th percentile running from 232 to 321 watts (3.49 to 4.67 watts/kg).
Model Creation
I separated 2019 from 2020, with 2019 acting as the training set and 2020 as the test set, so we’ll be able to judge how predictive the model is on data it has not seen. I built two models: a simple linear regression with easy-to-interpret effects, and then a gradient boosted tree model (xgboost), which should theoretically have better performance at the cost of worse interpretability.
I linked the race-level power files to my existing dataset of stage results, which includes variables like whether a race was a time trial, one day race, and/or grand tour, what the climbing difficulty of the stage was, whether the stage ended with an uphill finish, and what class of race it was (World Tour or lower levels), as well as each rider’s finishing position.
To build the linear model, I included:
one_day_race, time_trial, length of stage (km), natural log of finish position, climb_difficulty of stage, and rider_DNF (did not finish race). I also included an interaction between log finish position and climb difficulty with the idea that there is probably a larger difference in power output by finish position on tougher stages.
The model was built to predict the relative power output on the stage, calculated as power on the stage divided by the rider’s average power: e.g., 300 watts on stage / 285 watts on average = 1.053 relative power output.
The linear model achieved an in-sample R^2 of 0.25 with a standard error of 0.10. Obviously, predicting power output is a high-variance task. Five of the seven variables were significant at the p < 0.01 level; the finish position/climb difficulty interaction was significant at the p < 0.05 level, and rider DNF was not significant.
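As a rough illustration of that target variable, here is a minimal Python sketch with made-up numbers. The rider labels and wattages are hypothetical, and it assumes the "average" is a plain mean over each rider's own rides in the sample.

```python
from collections import defaultdict

# Hypothetical sample: (rider, normalized power in watts for one race day).
rides = [
    ("rider_a", 300.0),
    ("rider_a", 270.0),
    ("rider_b", 250.0),
    ("rider_b", 230.0),
]


def relative_power(rides):
    """Divide each ride's power by that rider's own average power."""
    by_rider = defaultdict(list)
    for rider, watts in rides:
        by_rider[rider].append(watts)
    # Assumption: the baseline is the simple mean of the rider's rides.
    averages = {rider: sum(w) / len(w) for rider, w in by_rider.items()}
    return [(rider, watts, watts / averages[rider]) for rider, watts in rides]


for rider, watts, rel in relative_power(rides):
    print(f"{rider}: {watts:.0f} W -> {rel:.3f} relative power")
```

Here rider_a averages 285 watts, so the 300 watt day reproduces the 1.053 example above.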
The coefficients were:
Variable | Coefficient | SE |
Intercept | 1.096 | 0.01 |
Natural log of finish position (1) | -0.014 | 0.002 |
climb_difficulty (2) | 0.007 | 0.001 |
time_trial | 0.138 | 0.009 |
one_day_race | 0.055 | 0.003 |
length in km | -0.0005 | <0.001 |
Natural log of finish position * climb_difficulty | -0.0005 | <0.001 |
Rider DNF | -0.009 | 0.01 |
(1): Actually natural log of rank + 1 to allow for interaction term as LN(1) = 0
(2): Climbing difficulty is judged on a scale starting at 0 where the toughest mountain stages are around 30. Classic races like Flanders and Strade Bianche come in at 4-5, hillier races like Liege-Bastogne-Liege and Fleche Wallonne at 8-10, and grand tour mountain stages typically start at 12.
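To make the table concrete, the coefficients can be plugged into a small function. This is only a sketch in Python: the function and argument names are hypothetical, the coefficients are the rounded values from the table (so outputs will differ slightly from the fitted model), and footnote (1) is read here as ln(finish position + 1). How finish position was coded for riders who did not finish is not stated, so the DNF flag is simply added on.

```python
import math

# Rounded coefficients copied from the table above.
COEF = {
    "intercept": 1.096,
    "log_finish": -0.014,
    "climb_difficulty": 0.007,
    "time_trial": 0.138,
    "one_day_race": 0.055,
    "length_km": -0.0005,
    "log_finish_x_climb": -0.0005,
    "rider_dnf": -0.009,
}


def predict_relative_power(finish_pos, climb_difficulty, length_km,
                           time_trial=False, one_day_race=False, dnf=False):
    """Linear prediction of relative power from the rounded coefficients."""
    # Assumption: footnote (1) means ln(finish position + 1).
    log_finish = math.log(finish_pos + 1)
    return (
        COEF["intercept"]
        + COEF["log_finish"] * log_finish
        + COEF["climb_difficulty"] * climb_difficulty
        + COEF["time_trial"] * int(time_trial)
        + COEF["one_day_race"] * int(one_day_race)
        + COEF["length_km"] * length_km
        + COEF["log_finish_x_climb"] * log_finish * climb_difficulty
        + COEF["rider_dnf"] * int(dnf)
    )


# Hypothetical stage: 10th place on a 150 km mountain stage (difficulty 15).
print(round(predict_relative_power(10, 15, 150), 3))
```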
Practical Impacts
A rider finishing 1st on a mountain stage (climb difficulty = 15) will be estimated to have 9.4% higher power output than the same rider finishing 150th on that mountain stage. On a flat stage, the 1st place rider will have about 6.3% higher power output than the same rider finishing 150th.
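The gap between 1st and 150th depends only on the finish position coefficient and the interaction term, since everything else cancels for the same rider on the same stage. A quick Python check, assuming ln(finish position + 1) as the transform and a climb difficulty of 0 for the flat stage (both assumptions), using the rounded coefficients from the table:

```python
import math

# Rounded coefficients from the table above.
B_LOG_FINISH = -0.014
B_INTERACTION = -0.0005


def gap_first_vs(rank, climb_difficulty):
    """Predicted relative power of 1st place minus that of `rank`."""
    # Effective slope on log finish position at this climb difficulty.
    slope = B_LOG_FINISH + B_INTERACTION * climb_difficulty
    # Assumption: the transform is ln(finish position + 1).
    return slope * (math.log(1 + 1) - math.log(rank + 1))


print(f"mountain stage (15): {gap_first_vs(150, 15):.3f}")
print(f"flat stage (0):      {gap_first_vs(150, 0):.3f}")
```

With the rounded coefficients this comes out at roughly 0.093 and 0.061, close to the 9.4% and 6.3% quoted above. The small differences are presumably down to rounding in the published table.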
One day races are raced with 5.5% higher relative power – which matches the findings of van Erp and Sanders that one day races are ridden at a higher intensity than stage races.
Time trials obviously have much higher normalized power as they are much shorter races: in this case, 14% higher relative power. Relatedly, stage length plays a small role, with shorter stages producing greater power output. As time trials are shorter, much of this impact comes from time trials, but short road stages like Stage 20 of the 2019 Tour de France also have much higher normalized power than longer stages.
Testing
Testing this model on the 2020 data shows similar out-of-sample fit – R^2 of 0.23 and a standard error of 0.10. Again, predicting relative power at the stage level is a high variance endeavor!
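For reference, out-of-sample fit of this kind can be computed along the following lines. This is a minimal Python sketch with made-up numbers; it assumes the "standard error" is the root mean square of the prediction errors, which may not be exactly how the figure above was calculated.

```python
import math


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, measured against the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical held-out relative power values and model predictions.
actual = [1.05, 0.95, 1.20, 0.90]
predicted = [1.02, 0.98, 1.10, 0.97]

print(f"R^2:  {r_squared(actual, predicted):.2f}")
print(f"RMSE: {rmse(actual, predicted):.3f}")
```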
The highest predicted power output in the test set (>2750 races in 2020) was Thomas De Gendt’s stage 20 time trial in the 2020 Tour de France, which was predicted at 121.9% of his average normalized power. The La Planche des Belles Filles time trial had almost all of the elements of a high power output stage: short, a time trial, with a lot of climbing. De Gendt finished 20th, so our predicted power for the higher finishing riders would have been even higher. De Gendt actually recorded 135% of his average normalized power!
The highest road race prediction was Pierre Latour at Mont Ventoux Challenge, a one day race with two significant climbs where Latour finished 4th. The prediction was 118.5% of Latour’s average normalized power, but he only produced 105%. The residuals for ten riders with power files from that race showed it as the 3rd largest negative difference between predicted and actual power, indicating the race required much less power than predicted by the model.
The flip-side of that was Stage 7 of the Tour de France where Bora attempted to make the race extremely difficult to shed Peter Sagan’s sprint rivals. Later the stage exploded in the crosswinds. Overall, it ranked as the 6th largest positive difference between predicted and actual power. Parcours and race type play a significant role in power output in a race, but how the race is ridden is a huge factor.
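Ranking races this way amounts to averaging the residual (actual minus predicted relative power) over every rider with a power file from the race. A minimal Python sketch with invented rows; the race and rider labels are hypothetical, and a simple mean across riders is an assumption about how the race-level figure was formed.

```python
from collections import defaultdict

# Hypothetical rows: (race, rider, predicted relative power, actual relative power).
rows = [
    ("race_x", "rider_a", 1.18, 1.05),
    ("race_x", "rider_b", 1.15, 1.06),
    ("race_y", "rider_a", 1.00, 1.12),
    ("race_y", "rider_c", 1.02, 1.10),
    ("race_z", "rider_b", 1.04, 1.03),
]


def mean_residual_by_race(rows):
    """Average (actual - predicted) per race, sorted most negative first."""
    residuals = defaultdict(list)
    for race, _rider, predicted, actual in rows:
        residuals[race].append(actual - predicted)
    means = {race: sum(r) / len(r) for race, r in residuals.items()}
    return sorted(means.items(), key=lambda item: item[1])


for race, resid in mean_residual_by_race(rows):
    label = "over-estimated" if resid < 0 else "under-estimated"
    print(f"{race}: {resid:+.3f} ({label})")
```

Races at the top of the list were ridden easier than the model expected; races at the bottom were ridden harder.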
Gradient Boosted Model
Gradient boosted models build hundreds or thousands of small decision trees in sequence, each one fit to the errors left by the trees before it, and add them together to derive predictions and learn which variables matter most. In this case, I used the same training and testing data and the same variables with the xgboost package in R.
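To show the idea behind boosting, here is a toy Python version written from scratch: each round fits a one-split tree (a "stump") to the current residuals and adds a shrunken copy of it to the prediction. This is not xgboost and not the model fitted here; the data, function names, and settings are all hypothetical, and it uses a single variable for simplicity.

```python
import math

# Hypothetical data: climb difficulty vs. relative power.
xs = [0, 2, 5, 8, 12, 15, 20, 30]
ys = [0.95, 0.97, 1.00, 1.03, 1.08, 1.10, 1.12, 1.15]


def fit_stump(xs, residuals):
    """Find the single split on x that minimises squared error."""
    best = None
    # Skip the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit stumps to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def rmse(ys, predictions):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))


base, stumps, predictions = boost(xs, ys)
print(f"RMSE predicting the mean: {rmse(ys, [base] * len(ys)):.4f}")
print(f"RMSE after boosting:      {rmse(ys, predictions):.4f}")
```

xgboost grows deeper trees over all of the variables at once and adds regularization, but the sequential fit-to-the-residuals loop is the core of the method.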
Optimizing for root mean square error gave me an error of 0.088 on the training data and 0.099 on the testing data. Out of sample, the R^2 was 0.24 – not much improved on the linear model. Based on that lack of significant improvement from the boosted model, it makes sense to rely on the more easily interpreted linear model.
Predictions
To end, here are the top 10 over-estimated and under-estimated stages by the model for 2020. As mentioned, the Mont Ventoux Challenge was one of the most over-estimated in power output alongside four of the flatter Tour de France stages, the pan flat Milano-Torino race, a flat stage from Tirreno-Adriatico, a Tour of Portugal time trial, and – surprisingly – Milano-Sanremo.

On the under-estimated side, there’s a handful of minor French and Spanish stage races from February along with the World Championship road race, a Binckbank Tour stage where Mathieu van der Poel won from 60km out, and the aforementioned stage 7 of the Tour de France.
