Data Collection and Preparation: The Backbone of Machine Learning

Welcome to the third installment of our end-to-end Machine Learning (ML) series. We’ve laid the groundwork with a solid understanding of our problem – predicting diabetes – and have defined it in ML terms. Now, we’re ready to dive into the crucial phase of data collection and preparation. This process, also known as data preprocessing, forms the backbone of any ML project and significantly influences the model’s performance.

Data Collection: Finding the Right Data

Data is the fuel for machine learning. It powers our algorithms and drives our results. In our case, we’re looking for diabetes patient records that include a range of features about each patient, such as BMI, gender, and more.

But where do we find this data?

  1. Public Datasets: There are various public datasets available that could suit our needs. Platforms like Kaggle, UCI Machine Learning Repository, and Google’s Dataset Search are good starting points.
  2. Government Databases: Many governments maintain databases of public records, including health statistics and medical surveys, that could be used for our project. Make sure to check the legal usage of this data.
  3. Paid Databases: Some databases require a fee to access, but they often provide higher-quality, more specific, or more up-to-date data.
  4. Scraping Data: If the data isn’t readily available, another option could be to scrape it from websites that publish relevant health data. However, this must be done respectfully, following the platform’s terms of service and legal regulations.

For this series, we’ll assume that we’ve found a suitable dataset on a public platform that includes records of patients with and without diabetes along with different features such as age, BMI, and more.

Data Preparation: Making the Data Machine-Learnable

Once we’ve collected our data, the next step is to prepare it for our ML model. This involves cleaning the data, handling missing values, dealing with outliers, and feature engineering.

Data loading:

This step depends on the dataset you’re working with, how it is stored, and other factors. In our case, we use the “Diabetes” dataset, which can be loaded with a simple function that scikit-learn provides:

from sklearn.datasets import load_diabetes
import pandas as pd

# Load the dataset
diabetes = load_diabetes(scaled=False)

# Create a DataFrame
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# a quantitative measure of disease progression one year after baseline
data['diabetes_measure'] = diabetes.target

# Display the dataset size and the first few rows of the DataFrame
print('Dataset size:', len(data))
data.head()
Output of data.head() on the loaded dataset

Data Cleaning:

This step involves removing duplicates, correcting errors, and dealing with inconsistencies in the data. For instance, if ‘sex’ is sometimes written as ‘male’ and sometimes as ‘M,’ we’d standardize this.
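In the scikit-learn diabetes dataset the ‘sex’ column is already numeric, so no such cleanup is needed here. Still, here is a minimal, hypothetical sketch of how standardizing inconsistent string values could look on raw data (the values below are made up for illustration):

import pandas as pd

# Hypothetical raw data where 'sex' is recorded inconsistently
raw = pd.DataFrame({'sex': ['male', 'M', 'Female', 'f', 'MALE']})

# Map the different spellings to a single canonical form
sex_mapping = {'male': 'M', 'm': 'M', 'female': 'F', 'f': 'F'}
raw['sex'] = raw['sex'].str.lower().map(sex_mapping)

print(raw['sex'].value_counts())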

import matplotlib.pyplot as plt

# Drop duplicates
data = data.drop_duplicates()

# Check for inconsistencies, in this case, in the 'age' column
data['age'].hist()
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Check for inconsistencies, in this case, in the 'sex' column
data['sex'].hist()
plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()
Age and sex histograms plotted with Matplotlib
Handling Missing Values:

It’s common for datasets to have missing values. We could fill them in with a constant, the mean or median value, or estimate them with a method like regression or a more advanced imputation technique (one such option is sketched after the code below).

# Check for missing values
print(data.isnull().sum())

# In case of missing values, fill with the median value
data = data.fillna(data.median())
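The diabetes dataset we loaded has no missing values, so the fill above is effectively a no-op. As a minimal sketch of the “more advanced technique” mentioned above, we could use scikit-learn’s KNNImputer, which estimates each missing value from the most similar rows:

from sklearn.impute import KNNImputer

# Estimate each missing value from the 5 most similar rows
# (shown for illustration; our dataset has no missing values)
imputer = KNNImputer(n_neighbors=5)
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)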
Dealing with Outliers:

Outliers can significantly impact our model. Techniques to handle outliers include setting a threshold and capping anything above it, transforming the data (e.g., log transform), or using robust models that are less sensitive to outliers.

Here we apply a simple outlier-removal process that drops all examples diverging from the mean by more than 3 standard deviations (an alternative capping approach is sketched after the code below).

from scipy import stats
import numpy as np

# Calculate Z-scores
z_scores = np.abs(stats.zscore(data))

# Remove rows with Z-scores above 3
data = data[(z_scores < 3).all(axis=1)]
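Instead of dropping rows, the “set a threshold and cap” option mentioned above clips extreme values in place. A minimal sketch, where the 1st/99th percentile cut-offs are an arbitrary choice:

# Cap extreme values instead of dropping rows:
# clip every column to its 1st and 99th percentiles
lower = data.quantile(0.01)
upper = data.quantile(0.99)
data_capped = data.clip(lower=lower, upper=upper, axis=1)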
Feature Engineering:

This step involves creating new features from existing ones that might better represent the problem to the model.

Feature engineering is critically important: it is often the step where we “inject” subject-matter knowledge, keep only the most informative features, and more.

For instance, in addition to ‘bmi’ and ‘sex’, we might create a new feature, ‘sex_bmi’. Here we inject domain knowledge, since we know that BMI recommendations differ by gender.

data['sex_bmi'] = data['sex'] * data['bmi']
Feature Scaling:

Many ML algorithms perform better when numerical input variables are scaled to a standard range. Methods include Normalization and Standardization.

  • Normalization – scaling values to a range of 0-1 (a MinMaxScaler sketch appears after the standardization code below).
  • Standardization – reshaping the distribution to have a mean of 0 and a standard deviation of 1.

Let’s use scikit-learn to standardize the numerical columns (“features”):

  • Step 1 – Identify the relevant columns
  • Step 2 – Call the fit_transform method of the StandardScaler object.
from sklearn.preprocessing import StandardScaler

# Initialize a scaler
scaler = StandardScaler()

# Scale the numerical feature columns
# (in our dataset all original features are numerical; the engineered
# 'sex_bmi' column and the target are left unscaled here)
numerical_cols = diabetes.feature_names
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

data.head()
Output of data.head() after scaling
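If we preferred Normalization (the 0-1 range mentioned above) over Standardization, a minimal sketch with scikit-learn’s MinMaxScaler, applied to the same columns, could look like this:

from sklearn.preprocessing import MinMaxScaler

# Alternative: scale the same columns to the 0-1 range instead
min_max_scaler = MinMaxScaler()
data_normalized = data.copy()
data_normalized[numerical_cols] = min_max_scaler.fit_transform(data[numerical_cols])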
Encoding Categorical Variables:

Many ML models can only handle numerical values, so we need to convert categorical variables into numerical ones.

Common methods include One-Hot Encoding and Ordinal Encoding.

For example, suppose we have a column “city” which contains names of cities in which the patients live. We want to transform the string value into a numerical value.

How? Using pandas’ get_dummies function:

data = pd.get_dummies(data, drop_first=True)
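For the Ordinal Encoding option mentioned above, here is a minimal sketch using scikit-learn’s OrdinalEncoder on the hypothetical ‘city’ column (the values are made up for illustration):

from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example: map each city name to an integer code
cities = pd.DataFrame({'city': ['Tel Aviv', 'Haifa', 'Tel Aviv', 'Jerusalem']})
encoder = OrdinalEncoder()
cities['city_code'] = encoder.fit_transform(cities[['city']]).ravel()
print(cities)

Keep in mind that ordinal codes imply an ordering between categories, so one-hot encoding is usually the safer choice when no natural order exists.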

Data Exploration: Understanding Our Data

Before we conclude our data preparation, it’s a good idea to explore our data to gain insights that could be useful when developing our model. Data visualization tools can help us understand the distribution of the data, identify outliers, and detect relationships between variables. For instance, we might find that disease progression is strongly correlated with BMI but much less so with other features.

With our data cleaned and preprocessed, let’s perform some exploratory data analysis to understand our data better. We’ll use Python’s Matplotlib and Seaborn libraries for this.

Understanding the distribution of data:

import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms for each numerical attribute
data.hist(bins=50, figsize=(20,15))
plt.show()
Histograms of all numerical features (columns)
Identifying correlations:

Next, it is important to understand the correlations among the features themselves and, no less importantly, between the features and the target.

# Calculate correlation matrix
corr_matrix = data.corr()

# Display correlations with the target variable
print(corr_matrix["diabetes_measure"].sort_values(ascending=False))
corr_matrix["diabetes_measure"].sort_values(ascending=False).plot.bar()
plt.ylabel('Correlation')
plt.show()
Feature correlation plot

Looking at the results, “diabetes_measure” has a correlation of 1 with itself (obviously; a good sanity check).

In addition, BMI, for example, is the most correlated feature, while gender seems like a poor predictor.

Visualizing relationships:
# Visualize the relationship between S1-S6 features
sns.pairplot(data[["s1", "s2", "s3", "s4", "s5", "s6"]])
plt.show()

# Visualize the relationship between S1-S6 features by gender
sns.pairplot(data[["s1", "s2", "s3", "s4", "s5", "s6", "sex"]], hue='sex')
plt.show()

# Visualize the relationship between gender and diabetes_measure
sns.boxplot(x=data['sex'], y=data['diabetes_measure'])
plt.show()
Pair plot of the s1-s6 features
Pair plot of the s1-s6 features, conditioned on gender
Box plot of diabetes_measure by gender

There are many more ways to visualize and explore our data. This step is so important!

Conclusion: The Power of Prepared Data

Data collection and preparation is a critical phase of any machine learning project, and it sits at the project’s core. A model is only as good as the data it learns from: the quality and relevance of the data we collect, and how well we prepare it, directly affect our model’s performance.

Having now framed our problem, defined it in ML terms, and prepared our data, we’re ready to embark on the next phase of our journey – choosing and training our machine learning model.

Stay tuned for our next post in this series, where we’ll explore how to select a suitable machine learning algorithm for our problem and how to train our model. We encourage you to share your thoughts, ask questions, and leave comments as we navigate this exciting journey.

Happy learning!

What’s next?

For more guides, click here

Want to dive deeper into recent papers and their summaries? Click here

Pandas AI – Combine LLMs with Pandas

Learn how to Create ChatGPT in your CLI

Back to the beginning