Soccer Players Recommendation System Using Machine Learning, Python, and GridDB

The soccer business is a multi-billion dollar industry combining high-performing athletes, passionate fans, and big sponsorship deals. Team owners and managers around the globe are always looking for an edge in finding the best talent to strengthen their teams. Using machine learning to help managers assemble a balanced, talented squad is a natural way to combine technology and data to add value to a business.

This article covers a recommendation system model that helps managers find talented and upcoming players based on their performance and soccer league data. The objective is to use the recommendation system as a monitoring and prospecting tool for finding high-performing soccer players. We build the model from the players' match data using Python and GridDB.

Please download the source code from here:

$ git clone https://github.com/griddbnet/Blogs.git --branch soccer

Setting up your environment

To implement the recommendation system described in this article, we begin by configuring the environment so that the Python code runs properly. The following prerequisites must be met:

GridDB: The database that stores the data used in the recommendation system model.
Python 3.11.2: The Python version used in our solution.
Jupyter Notebook: The web-based interactive environment in which we run our Python code.

If you need to install any missing packages, you can do so through the command line by typing the following:

pip install package-name
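Before installing anything, it can help to see which modules are actually missing. The snippet below is a minimal sketch using only the standard library; the helper name find_missing is hypothetical, and the module list is taken from the imports used later in this article.

```python
from importlib.util import find_spec

# Import names of the modules used later in this article.
REQUIRED_MODULES = [
    "numpy", "pandas", "seaborn", "matplotlib", "sklearn", "griddb_python",
]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

Note that an import name can differ from the pip package name: for example, sklearn is installed as scikit-learn and griddb_python as griddb-python.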

In addition, to work with GridDB from Python, you need the GridDB Python client library. In our case it was missing, so we install it with pip in the Jupyter terminal:

pip install griddb-python

We may now explore our dataset after successfully installing and configuring our environment.

Introduction to the dataset

The dataset used in this article contains 416 rows and 17 columns. It comprises attributes that describe soccer players in terms of their role and historical accomplishments, including current market value, field position, and points scored.

The following is the list of the features that are found in our dataset:

Name: Player name in a text value.
Club: Club name in a text value.
Age: Player age in numerical value.
Position: Player position in the field in a text value.
Position Category: A categorical variable representing the player’s field position.
Market Value: Player market price in numerical value.
Page Views: The player's average daily Wikipedia page views in numerical value.
Fantasy League Value: Player Fantasy League Price in numerical value.
Fantasy League Selection: Player Fantasy League Selection in numerical value.
Fantasy League Points: Player Fantasy League Points in numerical value.
Region: A categorical variable representing the region of the player.
Nationality: A text value that represents the nationality of the player.
New Foreign: A boolean value that indicates whether the player newly signed with a foreign club.
Age Category: The age group of the player.
Club ID: A numerical value that represents a club identifier.
Big Club: A boolean value that indicates whether the player is signed with a big club.
New Signing: A boolean value that indicates whether the player is a new signing for the club.

The dataset was extracted from the English Premier League Players Dataset. The table below is the first three rows of this dataset:

Importing the necessary libraries

To build our recommendation system, we use several Python modules, grouped below by purpose:

Python libraries used to read and preprocess the dataset:

  import numpy as np 
  import pandas as pd

Python libraries used to explore the dataset using graphs and plots:

  import seaborn as sns
  import matplotlib.pyplot as plt

Python libraries used to build the recommendation system model:

  from sklearn.preprocessing import StandardScaler
  from sklearn.neighbors import NearestNeighbors
  from sklearn.decomposition import PCA

Python library used to connect to a GridDB cluster:

  import griddb_python as griddb

After successfully importing the required libraries, we begin with reading our English Premier League Players Dataset.

Loading the Dataset

GridDB plays a significant role in our recommendation system, as it is where we store our dataset. To store the data, we first create a GridDB container using the griddb_python library we installed earlier. Next, we insert the data row by row with container.put in a loop. Once done, we load the data back into a data frame to continue building our recommendation system.

The code described in this section can be written as follows:

factory = griddb.StoreFactory.get_instance()

# Provide the necessary arguments
gridstore = factory.get_store(
    notification_member = '127.0.0.1:10001',
    cluster_name = 'myCluster',
    username = 'admin',
    password = 'admin'
)

# Define the container info
conInfo = griddb.ContainerInfo(
    "football_players",
    [
        ["name", griddb.Type.STRING],
        ["club", griddb.Type.STRING],
        ["age", griddb.Type.DOUBLE],
        ["position", griddb.Type.STRING],
        ["position_cat", griddb.Type.DOUBLE],
        ["market_value", griddb.Type.DOUBLE],
        ["page_views", griddb.Type.DOUBLE],
        ["fpl_value", griddb.Type.DOUBLE],
        ["fpl_sel", griddb.Type.STRING],
        ["fpl_points", griddb.Type.DOUBLE],
        ["region", griddb.Type.DOUBLE],
        ["nationality", griddb.Type.STRING],
        ["new_foreign", griddb.Type.DOUBLE],
        ["age_cat", griddb.Type.DOUBLE],
        ["club_id", griddb.Type.DOUBLE],
        ["big_club", griddb.Type.DOUBLE],
        ["new_signing", griddb.Type.DOUBLE]
    ],
    griddb.ContainerType.COLLECTION, True
)

# Drop container if it exists
gridstore.drop_container(conInfo.name)

# Create a container
container = gridstore.put_container(conInfo)

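# Load the data from a local CSV copy of the dataset
# (the file name below is an assumption; adjust it to your download)
data = pd.read_csv("epldata_final.csv")
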
# Put rows
for i in range(len(data)):
  row = data.iloc[i].tolist()
  try:
    container.put(row)
  except Exception as e:
    print(f"Error on row {i}: {row}")
    print(e)

# Re-open the container to confirm that it exists
cont = gridstore.get_container("football_players")

if cont is None:
  raise RuntimeError("Container football_players does not exist")

print("connection successful")

# Select every column of the container
columns = ["*"]

select_statement = "SELECT " + ", ".join(columns) + " FROM football_players"

# Execute the query
query = cont.query(select_statement)
rs = query.fetch(False)

# Load the result set into a pandas data frame
data = rs.fetch_rows()

print(data.head())
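The insertion loop above passes each pandas row to container.put as-is. If a row is rejected because a value does not match the declared column type (for example, an integer where the schema declares DOUBLE), one option is to coerce the values first. The helper below is only a sketch: coerce_row, COLUMN_ORDER, and STRING_COLUMNS are hypothetical names that mirror the container schema, and whether coercion is needed at all depends on your data and client version.

```python
# Column order mirrors the container schema defined above.
COLUMN_ORDER = [
    "name", "club", "age", "position", "position_cat", "market_value",
    "page_views", "fpl_value", "fpl_sel", "fpl_points", "region",
    "nationality", "new_foreign", "age_cat", "club_id", "big_club",
    "new_signing",
]
# Columns declared as griddb.Type.STRING; all others are DOUBLE.
STRING_COLUMNS = {"name", "club", "position", "fpl_sel", "nationality"}


def coerce_row(record):
    """Build a schema-ordered list with str values for STRING columns
    and float values for DOUBLE columns (hypothetical helper)."""
    return [
        str(record[col]) if col in STRING_COLUMNS else float(record[col])
        for col in COLUMN_ORDER
    ]


# Example with a made-up record (not taken from the dataset).
example = {col: 1 for col in COLUMN_ORDER}
example.update(name="Player A", club="Club X", position="CF",
               fpl_sel="1", nationality="Nowhere")
row = coerce_row(example)
print(row[:3])
```

Inside the loop, the helper could be applied as container.put(coerce_row(data.iloc[i].to_dict())), assuming the data frame uses the same column names as the schema.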

Exploratory Data Analysis

Before we build our recommendation system, we begin with an exploratory data analysis to find any inconsistencies in our data and to get an overall view of the dataset.
First, we begin by checking for any null values in our attributes. This is achieved with the following lines of code:

data.isnull().sum()

This cell outputs the following results, indicating that we have one missing value for the region attribute:

name            0
club            0
age             0
position        0
position_cat    0
market_value    0
page_views      0
fpl_value       0
fpl_sel         0
fpl_points      0
region          1
nationality     0
new_foreign     0
age_cat         0
club_id         0
big_club        0
new_signing     0
dtype: int64

To remove the row with the missing value, we can use the built-in pandas method dropna(). This method returns a cleaned data frame, which we assign back to replace the original.
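As a small illustration of how the null check and dropna() work together, here is a sketch on a made-up three-row frame. The values and the names sample and cleaned are assumptions for demonstration only, not taken from the dataset.

```python
import pandas as pd

# Made-up frame with one missing region value, mimicking the situation above.
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "region": [1.0, None, 2.0],
})

# Count missing values per column.
null_counts = sample.isnull().sum()
print(null_counts)

# dropna() returns a new frame without the rows that contain missing values;
# reset_index gives the remaining rows a fresh 0..n-1 index.
cleaned = sample.dropna().reset_index(drop=True)
print(cleaned)
```

Because dropna() does not modify the frame in place by default, the result has to be assigned to a variable for the cleanup to take effect.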