A Better Bunch Sprint Model

I introduced a very basic model for rating riders two months ago which simply took the natural logarithm of finishing rank in each race to make the stat Log Rank. At the end of that piece, I introduced a way to model Log Rank over long time periods to find whether riders a) achieve better or worse finishing ranks overall, b) achieve better or worse ranks in bunch sprint finishes, and c) achieve better or worse ranks in races with a lot of climbing. That ranking model does a good job of distinguishing riders who are expected to perform better or worse in bunch sprints, but not a great job at distinguishing truly great from merely good sprinters.

The issues with that Log Rank model are: 1) it considers all race parcours in building the overall impact data point, not just races ending in bunch sprints, 2) it considers all bunch sprints for a rider, even those where a heavier sprinter was jettisoned on a climb and failed to participate in the sprint finish, 3) it considers bunch sprints where a rider was present in the bunch, but was actually helping a teammate (e.g., Davide Ballerini often sprints for himself in smaller races, but is in the sprint train for bigger ones), and 4) it doesn’t consider the quality of the sprinters participating alongside each rider in the sprint (i.e., the competition on that day may be much reduced by tougher parcours, mechanicals, crashes, or splits in the bunch).

So how do we account for these issues? First, we want to evaluate sprinters based only on bunch sprint finishes. Anything which doesn’t end in a bunch sprint is ignored by this new model. Second, we want to ignore any race for a rider where they didn’t finish with the first group in the sprint AND in the top 25 positions; meeting both conditions indicates they were capable of sprinting. Third, we want to ignore any race where a rider wasn’t the top finisher on the team. Many riders participating as a lead-out man can rack up 10th place finishes, which can pollute our understanding of them as sprinters in races where they compete as team leader. And fourth, we consider the cumulative strength of the sprinting field which meets these first three criteria, based on the simple Log Rank model outputs.
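The first three filters can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not the code behind the model: the `Result` record, its field names, and the riders and teams in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record layout; every field name here is an assumption.
    rider: str
    team: str
    rank: int              # finishing position on the stage
    in_first_group: bool   # finished in the front group
    bunch_sprint: bool     # stage ended in a bunch sprint


def qualifying_sprinters(results, max_rank=25):
    """Apply filters 1-3 to the results of a single race."""
    # 1) Only stages that ended in a bunch sprint count.
    sprint = [r for r in results if r.bunch_sprint]

    # Best finishing position per team, counting every rider on the team.
    best = {}
    for r in sprint:
        best[r.team] = min(best.get(r.team, r.rank), r.rank)

    # 2) In the first group AND in the top 25; 3) top finisher on the team.
    return [
        r for r in sprint
        if r.in_first_group and r.rank <= max_rank and r.rank == best[r.team]
    ]


stage = [
    Result("Sprinter A", "Team X", 1, True, True),
    Result("Lead-out B", "Team X", 10, True, True),   # not top finisher on team
    Result("Sprinter C", "Team Y", 30, True, True),   # outside the top 25
    Result("Sprinter D", "Team Z", 40, False, True),  # dropped from first group
]
names = [r.rider for r in qualifying_sprinters(stage)]
print(names)
```

Only Sprinter A survives in this made-up stage: the lead-out rider is removed by the team-leader rule even though a 10th place would otherwise qualify.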

Determining strength of sprinting field

How does point #4 above work in practice? Seventeen sprinters in UAE Tour stage 1 on Sunday met these criteria, including the top 13 finishers. My basic Log Rank model predicts the following finishing positions in a generic strong race for those seventeen riders.

Rider                   Predicted Rank
Jasper Philipsen        3.0
Arnaud Demare           3.2
Sam Bennett             3.3
Pascal Ackermann        4.0
Dylan Groenewegen       4.9
Elia Viviani            6.9
Mark Cavendish          7.0
Marijn van den Berg     9.3
Olav Kooij              9.6
Marc Sarreau            10.1
Rudy Barbier            10.8
Max Kanter              16.5
Emils Liepins           27.7
Jonathan Milan          29.7
Tom Devriendt           34.5
Michael Schwarzmann     35.2
Jonathan Canaveral      47.3
Qualifying sprinters from UAE Tour Stage 1 (2022)

A lot of very talented sprinters were in this race, including seven with an expected finishing rank of 7.0 or better. Compare that to stage 1 of Tour of Oman, where the top sprinters were Fernando Gaviria (5.0) and Mark Cavendish (7.0), with no one else at a predicted rank better than 10.0.

To determine the cumulative strength of sprinting field, I just take the reciprocal of each rider’s predicted rank from the Log Rank model (1 / predicted rank) and add them together. A top sprinter like Bennett or Philipsen will contribute about 1/3 or 0.33 points, while someone with a much weaker prediction like Canaveral or Schwarzmann will contribute about 1/40 or 0.03 points.
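As a worked example, here is a short Python sketch of that sum for the seventeen UAE Tour stage 1 qualifiers. It assumes, as the 1/3 and 1/40 examples suggest, that the reciprocal is taken of the predicted rank shown in the table; the `field_strength` function name is mine, for illustration only.

```python
# Predicted ranks for the UAE Tour stage 1 qualifiers, copied from the table.
predicted_rank = {
    "Jasper Philipsen": 3.0,
    "Arnaud Demare": 3.2,
    "Sam Bennett": 3.3,
    "Pascal Ackermann": 4.0,
    "Dylan Groenewegen": 4.9,
    "Elia Viviani": 6.9,
    "Mark Cavendish": 7.0,
    "Marijn van den Berg": 9.3,
    "Olav Kooij": 9.6,
    "Marc Sarreau": 10.1,
    "Rudy Barbier": 10.8,
    "Max Kanter": 16.5,
    "Emils Liepins": 27.7,
    "Jonathan Milan": 29.7,
    "Tom Devriendt": 34.5,
    "Michael Schwarzmann": 35.2,
    "Jonathan Canaveral": 47.3,
}


def field_strength(ranks):
    """Sum of reciprocals of predicted rank (assumed reading of the text)."""
    return sum(1.0 / r for r in ranks)


strength = field_strength(predicted_rank.values())

# Share of the total contributed by the seven riders predicted at 7.0 or better.
top_share = field_strength(
    r for r in predicted_rank.values() if r <= 7.0
) / strength

print(round(strength, 2), round(top_share, 2))
```

Because the contributions are reciprocals, the handful of top sprinters account for most of the total, and adding more long-shot riders barely moves it.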

The top races for sprinters tend to be the Tour de France, Milano-Sanremo, Paris-Nice, and UAE Tour with cumulative strength of sprinting fields around 3.0 to 4.0 depending on the specific day. World Tour races in general average just under 2.0, with a wide range, while .Pro races average just above 1.0, again with a wide range. The lowest pro races at .1 level tend to average just below 1.0 with hardly any rating better than 1.5.

With that data calculated, it is simple to specify a model that uses the rider and this strength of sprinting field to predict both finishing rank and whether a rider won the sprint. Both of these models find 1) the impact of the individual rider on the success metric and 2) a potentially non-linear impact of the cumulative strength of sprint field.

To Predict Finishing Rank:
gam(log(finish_rnk) ~ rider + s(strength_sprint_field))

To Predict Win:
gam(win ~ rider + s(strength_sprint_field))

I ran both models for this example on data since the start of 2020, only considering riders who participated in at least 16 sprints meeting the criteria laid out above. This ranged from Wout Van Aert with 19 sprints to Philipsen and Ackermann with 45 each.

Who is the top sprinter in early 2022?

Both models produce similar results given the data. Fabio Jakobsen is seen as the most likely sprinter to win a given race and the sprinter with the best predicted finishing position overall. For example, in a typical World Tour level sprint the models predict Jakobsen to win 53% of the time and to finish in an average position of 1.9. Wout Van Aert is predicted 2nd in win probability at 44% and 4th in finishing rank at 2.7. Sam Bennett is tied with Caleb Ewan for 3rd in win probability at 38%, but slightly ahead of him for 2nd place in finishing rank at 2.2. Ewan is predicted at 2.6 in finishing rank.

Those four comprise a fairly clear top group, with Jakobsen clearly the #1 sprinter in the world. Behind those four are guys like Philipsen, Groenewegen, Cavendish, and Demare. As a sign of his diminished form in recent years, Peter Sagan ranks outside the top 25 in predicted win probability and 15th in predicted finishing rank.

Fabio Jakobsen

Looking at the data this way, it’s obvious why Jakobsen is the top predicted sprinter while ranking only fifth in the PCS Sprinter Ranking and 14th (!) in my own basic Log Rank model. In his comeback from serious injury last year, Jakobsen rode three week-long stage races where he didn’t compete as a sprinter. The basic Log Rank model therefore sees a guy who was “awful” at sprinting for a dozen sprints. But when we restrict to races where he was the team leader and was in the sprint pack, the graph below shows he has been dominant.

Since the start of 2020, Jakobsen has won nearly 70% of the sprints where he was the leader and contested the finish. That blows everyone else away, with Van Aert and Ewan managing only mid-40% win rates in that time. Jakobsen has raced lesser competition than guys like Van Aert, Ewan, and Bennett, but he’s dominated that competition.

One of the big stories of this and last cycling season is Mark Cavendish’s return to massive success with Quick Step, including tying the record for career Tour de France stage wins. He has twelve wins since the start of 2021, including two this season, and easily rates as a top 10 sprinter in the world right now. Because he and Jakobsen race for the same team, only one of them is likely to make the Tour de France team, where Quick Step sprinters have been steered to 14 sprint stage wins in the last five editions. Unfortunately for Cavendish, Jakobsen isn’t just another top 10 sprinter: he’s the best in the world right now.
