CMSC320 Final Tutorial (Fall 2020)

Analyzing the relationship between home matches and match wins in the English Premier League (Soccer)

GROUP -> We Attac We Protec

Vikam Sehgal Ashray Wadhwa Naman Dua

Introduction

With the English Premier League in full swing, discussion about home advantage is a hot topic yet again. And with the onset of the Covid-19 pandemic, our group has had a little extra time to think. As a result, we set out to answer the question on everyone's mind: is there actually a strong relationship between the home games a team plays and the wins that team ends up securing in a single season?

Of course, there is a reason we want to find the answer for ourselves and, more importantly, for you. Many soccer fans may want the answer so they can bet on matches with more confidence and earn better profits, but it would ideally also help advance soccer analysis. For instance, what we learn might contribute to data analysis that compares player and team quality across soccer history.

So why wait, let's dive right in!

Getting Started

The research paper linked below forms the basis of our approach to measuring home advantage. Please be sure to check it out for a deeper understanding of the formulas we use in our data analysis.

Link for research paper on calculating home advantage: https://www.researchgate.net/publication/261402166_Calculating_the_Home_Advantage_in_Soccer_Leagues

Link where you can find all the data we used for our analysis (for seasons -> 2014-2019): https://datahub.io/sports-data/english-premier-league

Please note that we do not take the 2019-2020 Premier League season into account because of the Covid-19 pandemic and how it affected stadium attendance, the matches played, and the sport in general.

Our data analysis relies on the following Python libraries: pandas, NumPy, seaborn, Matplotlib, and BeautifulSoup.

Loading Data, Transforming, and Tidying

We will now load, transform, and tidy our data so that we can run the desired operations on it. Unfortunately, our dataset does not provide an end-of-season table, but we can construct our own from the match information it does contain: which team played at home, which team played away, and which of the two won. These are all the questions we need answered to construct our table.

Steps:

  1. We begin by creating a list of all the teams mentioned in the dataset. We iterate over all the matches held in the season and add both the home team and the away team to the "teams" list. Since each team plays multiple matches, the list will contain duplicates. We can simply convert the list to a set and then back to a list before storing it in our end-of-season table dataframe.

  2. We then create a dictionary that stores each team name as the key and a list of match stats as the value, with each index corresponding to one of: wins, losses, draws, home wins, home goals scored, and home goals conceded.

  3. Lastly, we loop over this dictionary and add the stats to our dataset. Points follow the simple English Premier League rule: a win earns the winning team 3 points, a loss earns 0 points, and a draw earns both teams 1 point.
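The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than our exact notebook code: the sample matches are made up, the column names HomeTeam, AwayTeam, FTHG (full-time home goals), and FTAG (full-time away goals) are assumed to match the dataset, and the function name build_season_table is hypothetical.

```python
# Index positions inside each team's stat list (naming is our own assumption)
WIN, LOSS, DRAW, HOME_WIN, HOME_SCORED, HOME_CONCEDED = range(6)

# Made-up sample matches; the column names are assumed, not confirmed
matches = [
    {"HomeTeam": "Arsenal", "AwayTeam": "Chelsea", "FTHG": 2, "FTAG": 0},
    {"HomeTeam": "Chelsea", "AwayTeam": "Everton", "FTHG": 1, "FTAG": 1},
    {"HomeTeam": "Everton", "AwayTeam": "Arsenal", "FTHG": 0, "FTAG": 3},
    {"HomeTeam": "Chelsea", "AwayTeam": "Arsenal", "FTHG": 2, "FTAG": 1},
]


def build_season_table(matches):
    # Step 1: collect every team name; the set removes duplicates
    teams = []
    for m in matches:
        teams.append(m["HomeTeam"])
        teams.append(m["AwayTeam"])
    teams = list(set(teams))

    # Step 2: team name -> list of six counters
    stats = {team: [0, 0, 0, 0, 0, 0] for team in teams}
    for m in matches:
        home, away = m["HomeTeam"], m["AwayTeam"]
        home_goals, away_goals = m["FTHG"], m["FTAG"]
        stats[home][HOME_SCORED] += home_goals
        stats[home][HOME_CONCEDED] += away_goals
        if home_goals > away_goals:
            stats[home][WIN] += 1
            stats[home][HOME_WIN] += 1
            stats[away][LOSS] += 1
        elif home_goals < away_goals:
            stats[away][WIN] += 1
            stats[home][LOSS] += 1
        else:
            stats[home][DRAW] += 1
            stats[away][DRAW] += 1

    # Step 3: turn the counters into table rows and add points (3 / 1 / 0)
    table = []
    for team, s in stats.items():
        table.append({
            "team": team,
            "wins": s[WIN],
            "losses": s[LOSS],
            "draws": s[DRAW],
            "home_wins": s[HOME_WIN],
            "home_goals_scored": s[HOME_SCORED],
            "home_goals_conceded": s[HOME_CONCEDED],
            "points": 3 * s[WIN] + s[DRAW],
        })
    # Highest points first, as in a league table
    table.sort(key=lambda row: row["points"], reverse=True)
    return table


season_table = build_season_table(matches)
```

The resulting list of rows can be passed directly to pandas.DataFrame to obtain the end-of-season table as a dataframe.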

Now that we have ensured that the data has been loaded and tidied up appropriately, we add certain new calculated entries to the table in the form of total win percentage and a home win percentage.

Adding Some Important Metrics

Now that we have the table for a single season, we add some new columns that bring us one step closer to our answer.

We will calculate the following:

  1. Home Win % = home_wins / home_matches_played (19 home matches per team)
  2. Total Win % = wins / matches_played

To get a clearer idea of the scale of the advantage a home game gives, we add the 'home_advantage' feature, which is a team's average goal difference per home match. Home advantage in the English Premier League is calculated as follows:

Home Advantage = (Home Goals Scored - Home Goals Conceded) / 19

Source for the calculation of home-field advantage in soccer: https://www.pinnacle.com/en/betting-articles/Soccer/Home-Field-Advantage/FGU2ZXMPGZCTFHSE
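As a worked example of the three formulas above, here is a small sketch. The function name add_metrics, the dictionary keys, and the sample numbers are assumptions for illustration; the values are returned as fractions (multiply by 100 for a percentage), and 38 is the number of matches each team plays in a 20-team season.

```python
HOME_MATCHES = 19   # each team hosts the other 19 clubs once
TOTAL_MATCHES = 38  # 19 home matches + 19 away matches


def add_metrics(row):
    """Return a copy of a team's season row with the three derived metrics."""
    out = dict(row)
    out["home_win_pct"] = row["home_wins"] / HOME_MATCHES
    out["total_win_pct"] = row["wins"] / TOTAL_MATCHES
    # Average goal difference per home match
    out["home_advantage"] = (
        row["home_goals_scored"] - row["home_goals_conceded"]
    ) / HOME_MATCHES
    return out


# Made-up season totals for a hypothetical team
example = add_metrics({
    "team": "Example FC",
    "wins": 20,
    "home_wins": 12,
    "home_goals_scored": 40,
    "home_goals_conceded": 21,
})
print(example["home_advantage"])
```

A home advantage of 1 therefore means the team outscored its visitors by one goal per home match on average, while a value near 0 means home goals scored and conceded roughly cancel out.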

Now that we have the 18-19 season table for analysis, we use descriptive statistics in the form of graphs and charts to understand how teams have performed in their home games over the years.

Plots for season 18-19

We now visualize the home advantage and the home_win% by plotting line charts.

Here the x axis represents the teams in the premier league that season, and y axis represents the home advantage and home_win%.

Observation:

We observe that both charts follow a similar trend, which is expected because home win percentage and home advantage go hand in hand. However, if all teams benefited evenly from playing at home, we would expect to see a much flatter chart.

While these charts give a very high-level view of whether home games and performance correlate, we cannot extract much information from them.

This is why we will look at data from 4 earlier seasons and use more sophisticated techniques to move closer to completing our analysis.

Scatter plots of home advantage for each season, with teams ordered left to right from best to worst.

Observation :

The graphs plotted above for the 14/15 to 17/18 seasons look similar in every case, which supports our hypothesis that the greater a team's home advantage, the better it performed in the season. Seasonal performance is shown on the x-axis itself: team names are arranged in each graph by final league position, so the leftmost team finished first and the rightmost team finished last. From these graphs, home advantage appears to be highly correlated with a team's seasonal performance. In simple terms, a team with a high home advantage tends to have a great season. There is one outlier, Newcastle United in the 15/16 season, who had a high home advantage but performed very poorly that season. Other than this outlier, our hypothesis holds for now.

We now combine the tables from all the seasons under consideration, keeping the variables needed for violin plots, to better understand the relationship between home games and match performance.

We take increments of 1 year and use the performance data from 5 seasons across all teams. The x-axis represents the seasons, and the y-axis depicts the spread of home advantage across all teams for each season.
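Combining the per-season tables into one long table could look like the sketch below. The season labels, team names, and numbers are made up, and the variable names season_tables and combined are our own assumptions; the only requirement is that each row carries a season label alongside its home advantage.

```python
import pandas as pd

# Made-up per-season tables; in the real analysis each would hold 20 teams
season_tables = {
    "2014-15": pd.DataFrame(
        {"team": ["Team A", "Team B"], "home_advantage": [1.2, -0.1]}
    ),
    "2015-16": pd.DataFrame(
        {"team": ["Team A", "Team C"], "home_advantage": [0.8, 0.3]}
    ),
}

# Tag every row with its season, then stack the tables on top of each other
frames = []
for season, table in season_tables.items():
    frames.append(table.assign(season=season))
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

With the data in this long format, a call such as seaborn.violinplot(x="season", y="home_advantage", data=combined) draws one violin per season.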

Observation:

The violin plot shows the distribution of values for each year more clearly: the width of each violin at a given value represents the density of teams at that value.

For all the seasons considered, we see an almost identical density distribution of home advantage across teams. The density is usually highest around 0, with the 2015-2016 season peaking a little unusually around 1.

All the violins represent unimodal data, and the season does not really affect the shape of the distribution.

This bottom-heavy shape, dense around 0, in fact works against the relationship we were hoping to find between home games and team performance. The violin plot shows that most teams in the league have a home advantage close to 0, which suggests that the home advantage might apply only to some of the teams.

In order to analyse the data further, we make more pictorial depictions related to our current information.

For a better understanding of what is happening, let us look at the yearly home advantage for each team in the English Premier League, along with a second plot showing each team's average home advantage across 2014-2018.
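The data behind these two plots can be prepared roughly as follows. This is a sketch on made-up numbers; the column names team, season, and home_advantage and the variable names yearly and average are assumptions.

```python
import pandas as pd

# Made-up long-format data: one row per team per season
combined = pd.DataFrame({
    "team": ["Team A", "Team B", "Team A", "Team C", "Team B"],
    "season": ["2014-15", "2014-15", "2015-16", "2015-16", "2015-16"],
    "home_advantage": [1.0, 0.2, 0.5, -0.3, 0.4],
})

# Yearly view: one row per season, one column per team.
# Teams absent from a season (e.g. not in the league that year) show up as NaN.
yearly = combined.pivot(index="season", columns="team", values="home_advantage")

# Average view: mean home advantage per team over the seasons it played
average = (
    combined.groupby("team")["home_advantage"]
    .mean()
    .sort_values(ascending=False)
)

print(yearly)
print(average)
```

Note that the mean is taken only over the seasons in which a team actually appears, so promoted or relegated teams are averaged over fewer seasons.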

Analyzing the plots

Relation between Home Advantage and Attendance

Our first analysis asks whether home advantage is related to crowd attendance. A big home crowd is known to be very intimidating for the away team: these crowds chant constantly and can be crucial in lifting the home team's morale. To investigate, we used BeautifulSoup4 to scrape the attendance data for each season from worldfootball.net, starting with the page "Premier League 2018/2019 » Attendance » Home matches" and repeating the process for the earlier seasons from 14/15 onwards. We store the data in a dictionary with the team name as the key and three values per team, the first being the total attendance across the team's home matches for the season and the third being the average attendance per home match.
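The parsing step can be sketched as follows. This is a minimal illustration and not our exact scraper: the HTML snippet is made up, the real page layout may differ, and the five-cell row layout, the helper names to_int and parse_attendance, and the use of matches played as the second stored value are all assumptions.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded attendance page
html = """
<table>
  <tr><th>#</th><th>Team</th><th>Total</th><th>Matches</th><th>Average</th></tr>
  <tr><td>1</td><td>Team A</td><td>1.400.000</td><td>19</td><td>73.684</td></tr>
  <tr><td>2</td><td>Team B</td><td>190.000</td><td>19</td><td>10.000</td></tr>
</table>
"""


def to_int(text):
    # Drop thousands separators such as "." or "," before converting
    return int(text.replace(".", "").replace(",", ""))


def parse_attendance(html):
    soup = BeautifulSoup(html, "html.parser")
    attendance = {}
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != 5:
            continue  # header rows use <th>, so they yield no <td> cells
        _, team, total, matches, average = cells
        # team name -> [total attendance, matches (assumed), average attendance]
        attendance[team] = [to_int(total), to_int(matches), to_int(average)]
    return attendance


attendance = parse_attendance(html)
print(attendance)
```

In practice the HTML string would come from downloading one attendance page per season before passing it to the parser.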

From this observation, we can start investigating why certain teams seem to have more distinct home advantages than others. Intuitively, it should be easier to measure what affects home advantage than what affects a team's overall performance. While teams cycle through players of varying quality each year, the difference in conditions between home and away games remains relatively constant, apart from a few factors such as player fitness, player quality, and current form.

In our graphs, the x-axis shows the team names and the y-axis shows crowd attendance (in blue) and home advantage (in red).

Observation:

Comparing teams such as Manchester United and Bournemouth makes the difference that home advantage creates easy to see. Manchester United had a good season, finishing 6th in the table with more home wins than Bournemouth, and had the highest overall attendance of the season. Bournemouth finished 14th and had the lowest attendance of the season. As one of the oldest clubs in English football, Manchester United has built great fan loyalty, which we believe lifts team morale and performance and raises the home advantage.

We further our research by making bar graphs and computing the correlation between home advantage and attendance, which lets us check whether our hypothesis holds. We did not take the 19/20 season into account because Premier League matches were played without in-person attendance due to Covid-19. Since this anomaly would distort our analysis of whether home advantage and crowd attendance are correlated, we chose to exclude that season.

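The pandas corr() function gives the correlation between two columns of a data frame. By default it returns the Pearson correlation coefficient, which ranges from -1 to 1: a value of 1 indicates a perfect positive linear relationship, and values near 0 indicate little or no linear relationship. Since the value we obtain is less than 1, it does not signify a perfect correlation between attendance and home advantage.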
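As a small illustration of how corr() behaves, consider the sketch below. The attendance and home advantage figures are made up, and the column names avg_attendance and home_advantage are assumptions.

```python
import pandas as pd

# Made-up figures for six hypothetical teams
df = pd.DataFrame({
    "avg_attendance": [75000, 60000, 54000, 40000, 30000, 10500],
    "home_advantage": [0.9, 1.1, 0.4, 0.6, -0.2, 0.1],
})

# Pearson correlation between two columns (the default method)
r = df["avg_attendance"].corr(df["home_advantage"])

# Full correlation matrix for every pair of numeric columns
matrix = df.corr()

# Sanity check: a perfectly linear relationship gives a coefficient of 1
perfect = pd.Series([1, 2, 3]).corr(pd.Series([2, 4, 6]))

print(r)
print(matrix)
print(perfect)
```

The coefficient only measures the strength of a linear relationship, so a moderate value is better read as "some association" than as proof or disproof of the hypothesis.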

Conclusion

Having gathered the attendance data by scraping with BeautifulSoup and loaded the CSV files, to which we applied data transformations and tidying, we carried out exploratory analysis using descriptive statistics and Python libraries such as pandas, seaborn, NumPy, and Matplotlib as part of the pipeline. This allowed us to provide insights and hypothesize about the relationships between the data and events.

Through this data analysis, we conclude that although isolated events may make home advantage look like a major factor in determining match outcomes, a deeper look at the data suggests that the so-called advantage does not seem to factor into seasonal outcomes.

References ->

https://seaborn.pydata.org/generated/seaborn.violinplot.html
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
https://www.worldfootball.net/attendance/eng-premier-league-2019-2020/1/ (Web page scraping)
https://datahub.io/sports-data/english-premier-league
https://www.worldfootball.net/attendance/eng-premier-league-2017-2018/1/
https://www.worldfootball.net/attendance/eng-premier-league-2016-2017/1/
https://www.worldfootball.net/attendance/eng-premier-league-2015-2016/1/
https://www.worldfootball.net/attendance/eng-premier-league-2014-2015/1/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/