Beyond The Prem

Predicting the WSL

by Kieron
28/03/2026
in Women's Football, The Data, The Model
Predicting the WSL: When the Model Tells You Something Unexpected

🤖 BTP Machine Learning Series

The WSL model was built using exactly the same methodology as BTP’s Championship and League One models — same rolling form features, same four-model comparison, same evaluation framework. The results told a different and more interesting story. Every model came back worse than the naive baseline. This post is about what that actually means, why it happened, and what it reveals about women’s football as a statistical object.

The BTP ML system covers three divisions: Championship, League One, and WSL. For Championship and League One, the models beat the baseline — modestly but consistently. For WSL, they didn’t. That is not a failure to hide. It is a finding worth documenting carefully, because the reasons are substantive and not immediately obvious.

For background on the shared methodology — log loss, rolling windows, the four-model comparison — see the Championship explainer. This post covers the WSL-specific findings, with particular attention to what the outcome distribution reveals about the structure of the women’s game.


📊 The Data

📁 WSL Dataset

The BTP database holds WSL results from 2019/20 onwards. After filtering to finished matches and excluding the current 2025/26 season (live data — never used for training), the modelling dataset covers:

  • Matches: 659 finished results
  • Seasons in DB: 7 (2019/20 void, 2025/26 live)
  • Nulls dropped: 0 — clean dataset
  • xG coverage: 0% across the training seasons

Why not 2019/20? The WSL 2019/20 season was declared null and void by the FA in May 2020. Unlike the men’s game — where COVID-affected seasons were flagged with a crowd_present feature and included — there is no meaningful signal in voided results. They were excluded entirely. No crowd flag is used in the WSL model.

Why not pull 2018/19 data?

API-Football has WSL events data back to 2018. It was not used, and the reason is structural rather than technical.

The WSL in 2018 was a structurally different competition. Pre-Euros 2022, the investment gap between clubs was narrower, attendances were a fraction of current levels, and the tier of play was meaningfully lower across the board. Adding data from an era that predates the post-Euros explosion in funding and quality would mean training on a competition that no longer exists in the same form. The model would be learning patterns from a different sport. Stale data from a different structural era is likely to hurt, not help.

No xG — a simple situation

xG data does not exist for WSL in the BTP database for the training seasons. The Championship explainer described a difficult trade-off: theoretically superior xG features versus the data volume advantage of six seasons of goals data. For WSL 2020–2024, there is no trade-off. API-Football only provides WSL fixture statistics from 2023/24 onwards — which falls entirely within the test and live periods. The model uses goals-based rolling features exclusively, and for this dataset, that is the only option.


📊 The Outcome Distribution — The Key Finding

📈 WSL vs Championship vs League One

Before building a single model, the outcome distribution tells you how hard a division is to predict. Here is the comparison across all three BTP divisions:

Outcome        WSL     Championship   League One
🏠 Home win    43.1%   43.1%          43.3%
🤝 Draw        17.9%   25.9%          25.3%
✈️ Away win    38.9%   30.9%          31.4%

Home win rates are almost identical across all three divisions — roughly 43% in each. The dramatic difference is draws. WSL produces nearly 8 percentage points fewer draws than the men’s divisions, with away wins correspondingly higher at 38.9% vs ~31%.

Why fewer draws? Draws cluster when teams are evenly matched. In WSL, the hierarchy between clubs is steeper — the quality gap between the top three (Manchester City, Chelsea, Arsenal) and the bottom three is larger in relative terms than in the Championship or League One. Larger quality gaps between sides produce more decisive results. When a top-six side plays a bottom-six side, the stronger team tends to win rather than draw. The result is a distribution where “away win” becomes significantly more common because the away team is often the stronger side.


⚙️ How the Model Works

Feature Engineering

The WSL feature set is identical to the League One model — rolling goals scored, goals conceded, and points earned over the last 5 and 10 games for both teams, league position at kickoff, and a season ordinal. The model never sees the match it is predicting. Rolling windows reset at each season boundary.

Rolling form windows

Goals scored, goals conceded, and points earned across the last 5 games and last 10 games per team — 12 rolling features total. The 10-game window outranks the 5-game window in feature importance, consistent with the men’s models. Form over a longer spell captures underlying quality more reliably than recent volatility.
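As a rough sketch of how rolling form features like these can be built — not BTP's actual pipeline; the column names and the use of a mean rather than a sum are assumptions for illustration:

```python
import pandas as pd

# Hypothetical long-format results: one row per team per match,
# in date order within each (season, team) group.
matches = pd.DataFrame({
    "season":        ["2022/23"] * 6,
    "team":          ["Arsenal W"] * 6,
    "goals_for":     [2, 0, 3, 1, 4, 2],
    "goals_against": [1, 1, 0, 1, 2, 0],
    "points":        [3, 0, 3, 1, 3, 3],
})

def add_rolling_form(df: pd.DataFrame, cols, window: int) -> pd.DataFrame:
    # shift(1) drops the current match, so the model never sees the
    # game it is predicting; grouping by season resets the window
    # at each season boundary.
    grouped = df.groupby(["season", "team"])
    for col in cols:
        df[f"{col}_last{window}"] = grouped[col].transform(
            lambda s: s.shift(1).rolling(window, min_periods=1).mean()
        )
    return df

matches = add_rolling_form(
    matches, ["goals_for", "goals_against", "points"], window=5
)
```

Repeating the call with `window=10` and joining home- and away-team frames would give the 12 rolling features per fixture described above.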

Position bands — adjusted for a 12-team league

WSL has 12 teams, not 20 or 24. Position bands are recalibrated accordingly: positions 1–3 = title_contender, 4–9 = mid_table, 10–12 = relegation. No crowd_present flag — the 2019/20 voided season was excluded entirely rather than flagged, so there is no meaningful coverage gap to model.
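The banding can be sketched in a few lines (the function name is hypothetical; the cut-offs are the ones described above):

```python
def position_band(position: int) -> str:
    # Map a 12-team WSL league position to its band:
    # 1-3 title_contender, 4-9 mid_table, 10-12 relegation.
    if position <= 3:
        return "title_contender"
    if position <= 9:
        return "mid_table"
    return "relegation"
```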

Top Feature Importances

From the production Random Forest model, league position is the top predictor by mean feature importance, as in both men’s models. The gap between position and rolling form is consistent with League One — the steeper quality hierarchy in WSL makes raw table position a strong signal of expected outcome.


📈 The Honest Results

📊 Model Comparison — 2024/25 Test Season (126 matches)

All four models were evaluated on the 2024/25 WSL season — a season none of them saw during training (train set: 2020/21–2023/24, 493 matches).

📐 What is log loss?

Log loss measures how well a model’s predicted probabilities match what actually happened. Lower is better. A completely uninformed model assigning equal probability to all outcomes (33%/33%/33%) scores around 1.099. The naive baseline — which predicts the training-set outcome distribution for every match — scores 1.0521 for WSL. Any model that scores above 1.0521 is providing worse probability estimates than just repeating the historical frequencies.
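Log loss is simple to compute by hand; a uniform 33/33/33 model reproduces the ≈1.099 figure quoted above (the function name is illustrative):

```python
import math

def log_loss(outcomes, probs):
    # Mean negative log of the probability assigned to the outcome
    # that actually happened; lower is better.
    return -sum(math.log(p[o]) for o, p in zip(outcomes, probs)) / len(outcomes)

# An uninformed model scores ln(3) regardless of what happens.
uniform = [{"H": 1/3, "D": 1/3, "A": 1/3}] * 3
print(round(log_loss(["H", "D", "A"], uniform), 4))  # → 1.0986
```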

Model                  Log Loss ↓   Accuracy   vs Baseline   Result
Naive Baseline         1.0521       46.8%      —             Floor
Logistic Regression    1.6108       59.5%      +0.559 ↑      ❌ worse
Random Forest (300)    1.5688       55.6%      +0.517 ↑      ❌ worse
XGBoost                1.9094       54.0%      +0.857 ↑      ❌ worse

All three trained models score above the baseline on log loss — that is, worse. No model is used in a production sense; predictions are generated as an exploratory exercise only.

What “worse than baseline” actually means

The baseline predicts H=42.2%, D=17.6%, A=40.2% for every single match, without looking at either team. When a trained model scores worse than this, it means its probability estimates — which do look at team form, position, and history — are less well-calibrated than ignoring all of that information. The model is not just failing to improve; it is actively making things worse by introducing noise that overwhelms the signal it can find.
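A toy illustration of the mechanism (the 80/10/10 forecast is invented): an over-confident model pays a far heavier log-loss penalty on a match it calls wrong than the hedged fixed baseline does.

```python
import math

# Fixed baseline from the training-set frequencies, vs a model that
# confidently backs the home side in a match the away side wins.
baseline      = {"H": 0.422, "D": 0.176, "A": 0.402}
overconfident = {"H": 0.80, "D": 0.10, "A": 0.10}

actual = "A"
print(round(-math.log(baseline[actual]), 3))       # → 0.911 — modest penalty
print(round(-math.log(overconfident[actual]), 3))  # → 2.303 — heavy penalty
```

Averaged over a season, a handful of confident misses like this is enough to drag a model's log loss above a baseline that never commits to anything.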

Why does this happen?

Training volume

493 training matches vs 2,633 for League One and 2,700+ for Championship. The WSL model has less than 20% of the training data available to the men’s models. With so few examples, the model picks up noise rather than signal — patterns that appear to hold within the training data but don’t generalise to unseen matches.

Near-binary distribution

With draws at only 17.9%, the outcome distribution is more skewed than in the men’s game. The baseline is hard to beat precisely because it already reflects this — predicting “draw” at 17.6% is surprisingly accurate because draws really are rare. A model that tries to vary draw probabilities by fixture is introducing variability around a signal that is already captured by the fixed rate.

No xG data

The men’s models run without xG and still beat the baseline because goals-based rolling form captures underlying quality reasonably well over 2,600+ matches. With only 493 training rows, the signal-to-noise ratio of goals data is much lower — there simply aren’t enough examples for the model to reliably distinguish quality levels from short-run form fluctuations.

Structural shift mid-training

The training data spans 2020/21–2023/24. The post-Euros 2022 investment explosion fundamentally changed the WSL competitive landscape mid-dataset. Patterns learned from 2020–2022 partially describe a different competition. A model can’t be expected to generalise across a structural break of this magnitude.


🇬🇧 The Euros 2022 Structural Break

⚡ Before and After the Watershed

The England women’s team winning Euro 2022 at Wembley transformed WSL as a commercial and competitive entity. This is not sentiment — it shows up in the data in ways that directly affect model performance.

What changed after Euros 2022

  • WSL average attendances grew from ~2,000 pre-2022 to 10,000+ by 2024/25
  • Chelsea, Manchester City, and Arsenal investment levels now dwarf the rest of the division
  • The quality gap between top and bottom has widened — more decisive results, fewer draws
  • Player quality at the top end is materially different to what the 2020–2022 training data describes

This is also why 2018/19 data was not added despite being available in the API. The 2018 WSL represented an even more structurally different competition — pre-professionalisation, pre-investment, pre-Euros. Adding it would mean the model is simultaneously trying to learn from three different versions of the WSL. More data is only better if it’s describing the same underlying process.

Future approach: Once enough post-Euros seasons have accumulated (likely by 2027/28), a model trained exclusively on 2022/23 onwards should be significantly more coherent. Training on a structurally consistent period is more valuable than maximising raw match count across structural breaks.


🔴 What the Predictions Still Tell Us

📋 GW Predictions — 28–29 March 2026

Even where a model doesn’t beat the baseline, the probability estimates can still encode useful information about the relative strength of teams. A 64% home probability for Chelsea is not noise — it reflects Chelsea’s league position (3rd, 37 pts) against Aston Villa (8th, 20 pts) and their recent form. The position signal is working; the model is just not well-calibrated enough in aggregate to beat a fixed baseline on log loss.

Fixture                       Home %   Draw %   Away %   Predicted
Everton W vs Liverpool W      53%      22%      26%      🏠 Home
Man United W vs Man City W    19%      31%      50%      ✈️ Away
Arsenal W vs Tottenham W      51%      24%      25%      🏠 Home
West Ham W vs London City     18%      22%      60%      ✈️ Away
Chelsea W vs Aston Villa W    64%      15%      21%      🏠 Home
Leicester W vs Brighton W     31%      23%      47%      ✈️ Away

Sanity check: Man City as 50% away favourites at Man United directly reflects their 8-point table lead (46 pts vs 38 pts) and superior goal difference. West Ham at only 18% at home against London City reflects 11th place (12 pts) hosting a 7th-placed side (20 pts) — the position signal is working as expected. The model correctly identifies where the quality differences are largest. The calibration problem is in how it handles matches where sides are close.


🛣️ What Would Improve the Model

📋 Roadmap

Improvement        Detail
More seasons       Each adds ~132 rows; 3–4 more post-Euros seasons needed to make the baseline beatable
xG data            API has WSL xG from 2023/24 — only 2 seasons available now; worth incorporating once 3+ seasons accumulate
Squad data         Injuries and suspensions are not in the BTP database; impact is large in a 12-team league
Home/away split    Buildable now from existing data — next-iteration feature
Post-Euros split   Train exclusively on 2022/23 onwards once sufficient seasons accumulate — a more coherent structural period


🔴 Live Predictions — Exploratory Only

⚡ How It Works on the Site

The WSL pipeline generates predictions using the same infrastructure as the Championship and League One models — a Python script calculates rolling form and live league positions from the BTP database and writes probabilities to the same wprm_btp_ml_predictions table, distinguished by model_version='wsl_goals_v1'. The same shortcode renders them on the page.
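A minimal sketch of the write step — the table name and `model_version` value come from the article, but the column names are assumptions, and an in-memory sqlite3 database stands in for the site's real one:

```python
import sqlite3

# In-memory stand-in for the BTP database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wprm_btp_ml_predictions (
        fixture TEXT, home_prob REAL, draw_prob REAL,
        away_prob REAL, model_version TEXT
    )
""")

# One row of probabilities per fixture, tagged with the model version
# so the shared shortcode can select the WSL model's output.
row = ("Man United W vs Man City W", 0.19, 0.31, 0.50, "wsl_goals_v1")
conn.execute(
    "INSERT INTO wprm_btp_ml_predictions VALUES (?, ?, ?, ?, ?)", row
)
conn.commit()

stored = conn.execute(
    "SELECT * FROM wprm_btp_ml_predictions WHERE model_version = 'wsl_goals_v1'"
).fetchone()
```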

These predictions are published transparently as an exploratory exercise, not a validated forecasting tool. Here is Manchester United W hosting Manchester City W on 28 March — Man City 50% favourites as the table leader despite playing away:

Match Prediction

Manchester United W vs Manchester City W — 28 March 2026. City top on 46 pts, United 2nd on 38 pts. City 50% away favourites on position and form.

These are probability estimates, not certainties. A 50% probability for City does not mean they will win — it means the model thinks they are the most likely winner, with a genuine chance of a United win (19%) or draw (31%). The model is acknowledging that this is a close-at-the-top derby and United are not a soft home side.


📅 Going Forward

📌 The Honest Summary

BTP applied the same ML methodology to WSL that worked for Championship and League One. The model didn’t beat the baseline. That result is documented here in full, with the reasons explained as clearly as possible, because that’s more valuable than a post that only publishes models that worked.

The model will be retrained at the end of 2025/26 with one more season of data. When 2023/24 and 2024/25 xG becomes available in the BTP database, a parallel xG model will be attempted using the same methodology as the Championship. Most honestly: the WSL needs a few more post-Euros seasons before the underlying patterns are stable enough for a goals-based model to reliably beat a fixed baseline.

This is the companion piece to the Championship and League One explainers. All three are built on the same BTP database. The methodology will always be documented.

Three divisions, three models, three different findings. Championship: logistic regression wins by a hair. League One: random forest wins more clearly. WSL: no model beats the baseline, and the reason turns out to be a story about structural change, data volume, and the post-Euros transformation of women’s football in England. That last one is the most interesting result of the three.


Model: Random Forest (scikit-learn, 300 estimators). Training data: WSL 2020/21–2023/24 (493 matches). Test: 2024/25 (126 matches). 2019/20 season excluded (FA null and void). Predictions for 2025/26 use completed fixtures up to point of generation. Exploratory only — model does not beat naive baseline.

Tags: BTP data, data analysis, Euros 2022, football statistics, machine learning, match prediction, probability model, Random Forest, women's football analytics, Women's Super League, WSL, WSL 2025-26
Kieron

Kieron is a healthcare professional turned data analyst and football obsessive. BeyondThePrem was built from scratch as a passion project — the ML models, the pipeline and the plugin are all his own work. He thinks the Championship is the most interesting division in world football and has the spreadsheets to prove it.


© 2025 Beyond The Prem. All rights reserved.
