Article by Ben Griffis
This short article discusses a statistical model called a “regression model” and uses it to answer the question in the title: does Diego Simeone’s Atlético actually have the best defense in La Liga? It’s a much-discussed subject, and a regression model can help us answer it. For this scenario, my null hypothesis is that there is no difference between Atlético’s defense and the other teams’ defenses. Null hypotheses are important for statistical tests, as we use statistics to see whether we have evidence to reject the null hypothesis as (most likely) false.
To answer this question, I first gathered information on all La Liga matches from the past 4 completed seasons and the current 21/22 season (after MD 37, so all but one match for each team has been played). All data is from fbref.com.
Then, using a linear regression model, I controlled for a host of factors:
- Focal team
  - No base (essentially, with categorical variables, as opposed to continuous ones like possession, one of the entries has to be the “base”, or what all others are compared to).
- Opponent
  - Base = Atlético Madrid
- Focal team’s possession
- Match number (1, 2, 3, …, 38)
- The season
- Whether the match was home/away (for the focal team, not the opponent)
  - Base = away
- Day of the week
  - Base = Friday
- Local time of kickoff
  - Base = 12:00 Noon
- The referee in charge of the match
  - Base = Adrián Cordero
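To make the “base” idea concrete, here is a minimal Python sketch of dummy (treatment) coding. The function name `dummy_code` and the short level lists are illustrative assumptions, not the code used to fit the model in this article.

```python
def dummy_code(value, levels, base=None):
    """Return 0/1 indicator columns for one categorical value.

    The base level (if any) gets no column: a row at the base level is all
    zeros, so every other level's estimate is read relative to the base.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(level == value) for level in levels if level != base}


# Illustrative subsets only; the real model has many more levels.
venues = ["Away", "Home"]
opponents = ["Atlético Madrid", "Barcelona", "Getafe"]

# Venue with base = away: only a "Home" column exists.
print(dummy_code("Home", venues, base="Away"))
print(dummy_code("Away", venues, base="Away"))

# Opponent with base = Atlético Madrid: the base row is all zeros.
print(dummy_code("Atlético Madrid", opponents, base="Atlético Madrid"))
print(dummy_code("Barcelona", opponents, base="Atlético Madrid"))

# Focal team with no base: every team keeps its own column.
print(dummy_code("Getafe", opponents))
```

Because the base level contributes nothing to a row, each remaining estimate can be read as “how much more (or less) xG compared with the base”.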
The model result is shown at the bottom of this post. Basically, it indicates that, yes, it has been harder to generate xG against Atlético than against any other La Liga team in the past 5 seasons. If we use xG conceded as a measure of a team’s defensive strength, then that shows us Atlético have had the best defense over at least the last 5 seasons.
The way the model shows this is:
- All but two clubs (Getafe & Leganés) have significant p-values (below 0.05, which indicate a statistical difference from the base, Atlético) AND
- No other club has a negative “estimate”.
We can read each club’s estimate as: “Controlling for [all factors in the bulleted list above], teams should generate X more xG against this club than when playing against Atlético Madrid”, with X being the estimate. A p-value below 0.05 is typically called “significant”, meaning that the data would be unlikely if the null hypothesis were true.
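The two checks above can be sketched in a few lines of Python. The rows below are a small subset copied from the Opponent section of the output table at the bottom of this post; the variable names are my own, and a reported “<0.001” is stored as 0.001 (an upper bound) so it can be compared with the threshold.

```python
# (estimate, CI low, CI high, p-value) for a subset of Opponent rows.
opponent_rows = {
    "Barcelona": (0.20, 0.06, 0.34, 0.004),
    "Elche": (0.66, 0.48, 0.84, 0.001),     # reported as <0.001
    "Getafe": (0.06, -0.07, 0.20, 0.358),
    "Leganés": (0.10, -0.05, 0.26, 0.192),
    "Real Madrid": (0.22, 0.08, 0.36, 0.002),
    "Sevilla": (0.20, 0.07, 0.34, 0.004),
}
ALPHA = 0.05  # conventional significance threshold

# Check 1: which clubs are NOT significantly different from Atlético?
not_significant = sorted(
    name for name, (_, _, _, p) in opponent_rows.items() if p >= ALPHA
)

# Check 2: does any club have a negative estimate (harder to play than Atlético)?
negative = [name for name, (est, _, _, _) in opponent_rows.items() if est < 0]

print(not_significant)
print(negative)
```

In these rows, the clubs with p-values of 0.05 or more are also exactly the ones whose confidence interval includes zero, which is another way to read the same result.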
Example Case
So, let’s take Barcelona as an example. Controlling for all these factors, teams should generate 0.20 more xG against Barcelona than when playing Atlético Madrid.
And let’s look at two example scenarios. First, a Cádiz match against Atlético, and then a Cádiz match against Barcelona. We’ll keep everything else the same (let’s use the bases, so we’ll say Cádiz is the away team, Friday, noon kickoff, with Adrián Cordero in charge. We’ll also say it’s this season (2022) and match number 10, with Cádiz having 45% possession). The method to calculate any match is:
Team’s estimate + [Opponent Estimate] + (Possession * 0.0006748) + (Match Number * 0.0006748) + (Season End Year * -0.0280301) + [Home/Away Estimate] + [Day Estimate] + [Time Estimate] + [Ref Estimate]
If you use the base cases, you do not need to include those estimates, since we are using those as comparisons.
- Atlético-Cádiz: 57.2159207 + (45 * 0.0006748) + (10 * 0.0006748) + (2022 * -0.0280301)
  - Cádiz should generate 0.576 xG
- Barcelona-Cádiz: 57.2159207 + 0.2045895 + (45 * 0.0006748) + (10 * 0.0006748) + (2022 * -0.0280301)
  - Cádiz should generate 0.781 xG
From this we can see how we can interpret Barcelona’s estimate. Keeping all else constant, a team would generate 0.205 more xG against Barcelona than they would against Atlético!
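The same arithmetic can be written as a short Python sketch. The estimates are the ones quoted in the example above; the function `predicted_xg` and its parameter names are my own shorthand for the calculation, not output from the model software.

```python
# Estimates quoted in the worked example above.
CADIZ_TEAM = 57.2159207         # Team [Cádiz]
BARCELONA_OPPONENT = 0.2045895  # Opponent [Barcelona]
POSSESSION = 0.0006748          # per percentage point of possession
MATCH_NUMBER = 0.0006748        # per match number
SEASON = -0.0280301             # per season end year


def predicted_xg(team, opponent=0.0, possession=45, match_number=10,
                 season_end_year=2022, venue=0.0, day=0.0, time=0.0,
                 referee=0.0):
    """Add up the estimates; base cases contribute 0 and can be left out."""
    return (team + opponent
            + possession * POSSESSION
            + match_number * MATCH_NUMBER
            + season_end_year * SEASON
            + venue + day + time + referee)


vs_atletico = predicted_xg(CADIZ_TEAM)
vs_barcelona = predicted_xg(CADIZ_TEAM, opponent=BARCELONA_OPPONENT)

print(round(vs_atletico, 3))                 # 0.576
print(round(vs_barcelona, 3))                # 0.781
print(round(vs_barcelona - vs_atletico, 3))  # 0.205
```

Any non-base scenario (a home match, a different referee, and so on) would be handled by passing the matching estimate from the table into the corresponding argument.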
A Final Word: Statistical Models’ Use in Football Data
Statistical models have a lot of possible uses in football. One way we can use them is to see which players might be putting in great (or poor) performances. By including many different variables, instead of just the 2-3 we can show on a scatter plot, we can then use a residual plot to visualize players who are exceeding their predicted levels of performance, or who may be far below them.
Another way to use models is to conduct a statistical test, like I did here, to begin to answer a specific question. I asked a question about teams, but we can also use models to begin answering questions about players or leagues as well. The key is that you include the variable you want to test (here, Atlético as the opponent team base case) and see the significance of that value (in this example, we needed to look at all other opponent teams’ estimates and p-values). Make sure to include other variables to act as controls. That way, you can start isolating the effect of what you want to test!
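As a rough sketch of the residual idea: a residual is simply the actual value minus the model’s predicted value. The players and numbers below are hypothetical, made up purely for illustration, and do not come from any model in this article.

```python
# Hypothetical players and numbers, made up purely for illustration.
players = [
    # (name, actual output, model-predicted output)
    ("Player A", 0.45, 0.30),
    ("Player B", 0.20, 0.35),
    ("Player C", 0.28, 0.27),
]

# Residual = actual - predicted; positive means exceeding the prediction.
residuals = [(name, actual - predicted) for name, actual, predicted in players]
residuals.sort(key=lambda item: item[1], reverse=True)

for name, residual in residuals:
    label = "above" if residual > 0 else "below"
    print(f"{name}: {residual:+.2f} ({label} prediction)")
```

Plotting these residuals against the predicted values would give the residual plot described above, with over-performers sitting above zero and under-performers below it.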
Linear Regression Output
Note: these values are rounded to 2 places. None of the numbers are exactly 0.00.
Dependent variable: xG

Predictors | Estimates | CI | p-value |
---|---|---|---|
Team [Atlético Madrid] | 57.49 | 11.93 – 103.04 | 0.013 |
Team [Alavés] | 57.22 | 11.66 – 102.77 | 0.014 |
Team [Athletic Club] | 57.32 | 11.77 – 102.87 | 0.014 |
Team [Barcelona] | 58.03 | 12.48 – 103.58 | 0.013 |
Team [Cádiz] | 57.22 | 11.64 – 102.79 | 0.014 |
Team [Celta Vigo] | 57.36 | 11.81 – 102.91 | 0.014 |
Team [Deportivo] | 57.43 | 11.90 – 102.95 | 0.013 |
Team [Eibar] | 57.29 | 11.74 – 102.83 | 0.014 |
Team [Elche] | 57.04 | 11.47 – 102.61 | 0.014 |
Team [Espanyol] | 57.24 | 11.69 – 102.78 | 0.014 |
Team [Getafe] | 57.18 | 11.63 – 102.73 | 0.014 |
Team [Girona] | 57.28 | 11.75 – 102.81 | 0.014 |
Team [Granada] | 57.28 | 11.71 – 102.84 | 0.014 |
Team [Huesca] | 57.23 | 11.68 – 102.78 | 0.014 |
Team [Las Palmas] | 57.11 | 11.59 – 102.63 | 0.014 |
Team [Leganés] | 57.17 | 11.63 – 102.71 | 0.014 |
Team [Levante] | 57.38 | 11.83 – 102.93 | 0.014 |
Team [Málaga] | 57.03 | 11.51 – 102.56 | 0.014 |
Team [Mallorca] | 57.26 | 11.69 – 102.82 | 0.014 |
Team [Osasuna] | 57.24 | 11.67 – 102.80 | 0.014 |
Team [Rayo Vallecano] | 57.29 | 11.73 – 102.85 | 0.014 |
Team [Real Betis] | 57.41 | 11.86 – 102.96 | 0.014 |
Team [Real Madrid] | 57.94 | 12.39 – 103.50 | 0.013 |
Team [Real Sociedad] | 57.51 | 11.96 – 103.06 | 0.013 |
Team [Sevilla] | 57.52 | 11.97 – 103.07 | 0.013 |
Team [Valencia] | 57.43 | 11.87 – 102.98 | 0.013 |
Team [Valladolid] | 57.17 | 11.62 – 102.72 | 0.014 |
Team [Villarreal] | 57.62 | 12.07 – 103.17 | 0.013 |
Opponent [Alavés] | 0.44 | 0.31 – 0.58 | <0.001 |
Opponent [Athletic Club] | 0.20 | 0.06 – 0.33 | 0.004 |
Opponent [Barcelona] | 0.20 | 0.06 – 0.34 | 0.004 |
Opponent [Betis] | 0.36 | 0.22 – 0.50 | <0.001 |
Opponent [Cádiz] | 0.50 | 0.32 – 0.68 | <0.001 |
Opponent [Celta Vigo] | 0.41 | 0.27 – 0.55 | <0.001 |
Opponent [Eibar] | 0.29 | 0.15 – 0.44 | <0.001 |
Opponent [Elche] | 0.66 | 0.48 – 0.84 | <0.001 |
Opponent [Espanyol] | 0.32 | 0.18 – 0.47 | <0.001 |
Opponent [Getafe] | 0.06 | -0.07 – 0.20 | 0.358 |
Opponent [Girona] | 0.33 | 0.15 – 0.51 | <0.001 |
Opponent [Granada] | 0.45 | 0.29 – 0.61 | <0.001 |
Opponent [Huesca] | 0.37 | 0.19 – 0.54 | <0.001 |
Opponent [La Coruña] | 0.56 | 0.33 – 0.79 | <0.001 |
Opponent [Las Palmas] | 0.79 | 0.54 – 1.05 | <0.001 |
Opponent [Leganés] | 0.10 | -0.05 – 0.26 | 0.192 |
Opponent [Levante] | 0.63 | 0.49 – 0.77 | <0.001 |
Opponent [Málaga] | 0.44 | 0.20 – 0.68 | <0.001 |
Opponent [Mallorca] | 0.53 | 0.35 – 0.71 | <0.001 |
Opponent [Osasuna] | 0.33 | 0.17 – 0.49 | <0.001 |
Opponent [Rayo Vallecano] | 0.38 | 0.20 – 0.56 | <0.001 |
Opponent [Real Madrid] | 0.22 | 0.08 – 0.36 | 0.002 |
Opponent [Real Sociedad] | 0.20 | 0.06 – 0.33 | 0.004 |
Opponent [Sevilla] | 0.20 | 0.07 – 0.34 | 0.004 |
Opponent [Valencia] | 0.33 | 0.20 – 0.47 | <0.001 |
Opponent [Valladolid] | 0.36 | 0.20 – 0.51 | <0.001 |
Opponent [Villarreal] | 0.38 | 0.24 – 0.51 | <0.001 |
Possession | 0.00 | 0.00 – 0.01 | 0.013 |
Match Number | 0.00 | -0.00 – 0.00 | 0.521 |
Season | -0.03 | -0.05 – -0.01 | 0.015 |
Venue [Home] | 0.31 | 0.27 – 0.35 | <0.001 |
Day [Mon] | -0.02 | -0.14 – 0.09 | 0.708 |
Day [Sat] | -0.03 | -0.14 – 0.08 | 0.567 |
Day [Sun] | 0.01 | -0.09 – 0.12 | 0.781 |
Day [Thu] | 0.08 | -0.07 – 0.22 | 0.308 |
Day [Tue] | 0.08 | -0.08 – 0.23 | 0.327 |
Day [Wed] | 0.08 | -0.06 – 0.22 | 0.239 |
Time 13:00 | 0.06 | -0.09 – 0.21 | 0.423 |
Time 14:00 | -0.03 | -0.16 – 0.10 | 0.642 |
Time 16:00 | -0.05 | -0.21 – 0.12 | 0.577 |
Time 16:15 | -0.09 | -0.21 – 0.04 | 0.167 |
Time 17:00 | -0.27 | -0.50 – -0.05 | 0.017 |
Time 17:30 | 0.04 | -0.25 – 0.33 | 0.800 |
Time 18:00 | -0.20 | -0.57 – 0.17 | 0.288 |
Time 18:15 | 0.09 | -0.24 – 0.41 | 0.602 |
Time 18:30 | -0.06 | -0.18 – 0.05 | 0.283 |
Time 18:45 | 0.23 | -0.70 – 1.17 | 0.626 |
Time 19:00 | -0.14 | -0.32 – 0.04 | 0.127 |
Time 19:15 | 0.10 | -0.46 – 0.65 | 0.736 |
Time 19:30 | -0.21 | -0.36 – -0.05 | 0.010 |
Time 19:45 | -0.26 | -0.70 – 0.18 | 0.242 |
Time 20:00 | -0.17 | -0.37 – 0.02 | 0.077 |
Time 20:15 | -0.03 | -0.30 – 0.24 | 0.848 |
Time 20:30 | -0.12 | -0.43 – 0.20 | 0.464 |
Time 20:45 | -0.03 | -0.17 – 0.11 | 0.665 |
Time 21:00 | -0.12 | -0.24 – 0.01 | 0.077 |
Time 21:15 | 0.11 | -0.33 – 0.54 | 0.627 |
Time 21:30 | -0.21 | -0.41 – -0.00 | 0.048 |
Time 22:00 | -0.13 | -0.30 – 0.03 | 0.114 |
Time 22:15 | 0.08 | -0.20 – 0.36 | 0.565 |
Time 22:30 | 0.05 | -0.88 – 0.97 | 0.920 |
Referee [Alberola Rojas] | -0.07 | -0.21 – 0.07 | 0.335 |
Referee [Alberto Undiano] | -0.26 | -0.45 – -0.08 | 0.005 |
Referee [Alejandro Hernández] | -0.12 | -0.26 – 0.03 | 0.120 |
Referee [Alejandro Muñíz] | -0.05 | -0.29 – 0.19 | 0.687 |
Referee [Alfonso Álvarez] | -0.25 | -0.51 – 0.00 | 0.054 |
Referee [Antonio Matéu Lahoz] | -0.21 | -0.35 – -0.06 | 0.005 |
Referee [César Soto] | -0.05 | -0.22 – 0.11 | 0.517 |
Referee [Carlos del Cerro] | -0.17 | -0.31 – -0.02 | 0.022 |
Referee [Daniel Trujillo] | -0.09 | -0.33 – 0.15 | 0.451 |
Referee [David Fernández] | -0.06 | -0.31 – 0.18 | 0.611 |
Referee [David Medié] | -0.04 | -0.20 – 0.11 | 0.607 |
Referee [Eduardo Prieto] | -0.24 | -0.43 – -0.06 | 0.011 |
Referee [Guillermo Cuadra] | -0.08 | -0.23 – 0.07 | 0.314 |
Referee [Ignacio Iglesias] | -0.12 | -0.31 – 0.06 | 0.201 |
Referee [Isidro Díaz de Mera] | -0.01 | -0.19 – 0.18 | 0.938 |
Referee [Javier Estrada] | -0.14 | -0.29 – 0.01 | 0.064 |
Referee [Jesús Gil] | -0.06 | -0.21 – 0.08 | 0.405 |
Referee [Jorge Figueroa] | -0.15 | -0.34 – 0.03 | 0.106 |
Referee [José González] | -0.27 | -0.44 – -0.11 | 0.001 |
Referee [José Luis Munuera] | -0.10 | -0.25 – 0.05 | 0.184 |
Referee [José Munuera] | -0.21 | -0.45 – 0.03 | 0.087 |
Referee [José Sánchez] | -0.23 | -0.37 – -0.09 | 0.002 |
Referee [Juan Martínez] | -0.10 | -0.25 – 0.04 | 0.167 |
Referee [Mario Melero] | -0.19 | -0.34 – -0.05 | 0.008 |
Referee [Miguel Ángel Ortiz Arias] | -0.06 | -0.31 – 0.19 | 0.625 |
Referee [Pablo González] | -0.25 | -0.39 – -0.10 | 0.001 |
Referee [Ricardo de Burgos] | -0.07 | -0.21 – 0.07 | 0.337 |
Referee [Santiago Jaime] | -0.23 | -0.37 – -0.09 | 0.002 |
Referee [Valentín Pizarro] | -0.10 | -0.26 – 0.07 | 0.253 |
Referee [Xavi Estrada] | -0.33 | -1.27 – 0.62 | 0.501 |
Observations | 3780 | | |
R² / R² adjusted | 0.799 / 0.793 | | |