Building a Predictive Model: Regression and Evaluation

Welcome back to our ML series, where we’re developing a Machine Learning model end-to-end using the Diabetes dataset.
In our previous posts, we defined our problem, collected our dataset, cleaned it, performed exploratory data analysis, and prepared it for machine learning.
Now we’re stepping into the realm of model building, training, and evaluation.

What you’ll learn

  • Model Selection: Picking the Right Tool
  • Data Splitting: Train and Test Sets
  • Model Training: Teaching Our Model to Learn
  • Making Predictions: Asking Our Model Questions
  • Model Evaluation: How Well Did We Do?
  • Conclusion: Onwards to Tuning and Optimization

Model Selection: Picking the Right Tool

Choosing the right model is both art and science.

The selection is usually based on the nature of your data and the problem you’re trying to solve. In our case, we’re dealing with a classic regression problem: predicting a continuous outcome (diabetes index) from several predictors.

There are many types of regression models we could use:

  • Linear regression
  • Decision trees
  • Random forests
  • Gradient boosting
  • Neural networks, and more

For simplicity and interpretability, we’ll start with Linear Regression as our initial model. Linear regression is a good starting point because it’s simple, fast, and provides a baseline against which we can compare other, more complex models.

Data Splitting: Train and Test Sets

Before we can train our model, we need to split our data into a training set and a test set. The training set is what we’ll use to teach our model about the relationship between the features and the target variable. The test set, on the other hand, is used to evaluate the model’s performance on unseen data.

In Python, we can use the train_test_split function from sklearn.model_selection to randomly split our data.

from sklearn.model_selection import train_test_split

# Define our predictors and target
X = data.drop('diabetes_measure', axis=1)
y = data['diabetes_measure']

# Split our data into training and test sets
TEST_FRACTION = 0.2
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_FRACTION, random_state=SEED)

A common practice is to keep 80% of the data for training and 20% for testing, though the right ratio depends on the dataset we are working with. More on that in future posts.

The random_state parameter ensures that the split is reproducible.
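To see what reproducibility means in practice, here is a minimal sketch on a small toy array. The names X_demo and y_demo are hypothetical stand-ins for this illustration, not the series' dataset. Splitting twice with the same seed should give identical partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data (10 rows, 2 features); not the series' `data` frame
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each call returns (X_train, X_test, y_train, y_test); compare them pairwise
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing
n_train, n_test = len(split_a[0]), len(split_a[1])
print(same_seed_matches, n_train, n_test)
```

Leaving random_state unset would give a different shuffle on each run, which makes results harder to compare between experiments.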

Model Training: Teaching Our Model to Learn

With our data split, we can now train our model. Training a model with scikit-learn is as easy as calling its “fit” method.

from sklearn.linear_model import LinearRegression

# Initialize a Linear Regression model
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

During training, the model learns the relationship between the features and the target variable.

In the case of linear regression, it learns the optimal coefficients for the equation that predicts the target variable from the features.
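The learned coefficients can be inspected directly after fitting. The sketch below uses hypothetical noise-free toy data (X_toy, y_toy) generated from a known equation, so we can check that the fitted model recovers it. On the real dataset the coefficients would of course be different and not exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data built from a known equation: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ holds the constant term
print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)
```

For the model trained above, the same attributes are available as lr.coef_ and lr.intercept_.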

Making Predictions: Asking Our Model Questions

Once the model is trained, we can use it to predict the diabetes index for new, unseen data. This is done using the “predict” method of the model.

# Make predictions on the training and test data
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

These predictions can then be compared to the actual values to evaluate how well our model is performing.
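For a linear regression, a prediction is just the weighted sum of the features plus the intercept. The following sketch, on hypothetical toy data, checks that computing this by hand gives the same values as the “predict” method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with a little noise; not the series' dataset
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 3))
y_toy = X_toy @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_toy, y_toy)

# Manual prediction: dot product of features and coefficients, plus intercept
manual_pred = X_toy @ model.coef_ + model.intercept_

# Should agree with the library's own predictions up to floating-point error
matches = np.allclose(manual_pred, model.predict(X_toy))
print(matches)
```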

Model Evaluation: How Well Did We Do?

Evaluating a model is a crucial step in the machine-learning workflow.

It tells us how well our model is performing and can also help us compare different models or approaches.

There are several ways to evaluate a regression model’s performance, but we’ll focus on two common metrics: Root Mean Squared Error (RMSE) and Coefficient of Determination (R^2).

  • Root Mean Squared Error (RMSE): This is a popular metric for regression problems. It is the square root of the average squared difference between predicted and actual values, so it is expressed in the same units as the target and ignores the direction of each error. Because errors are squared before averaging, large errors are penalized more heavily, which makes RMSE particularly useful when large errors are undesirable. The closer it is to 0, the better the predictions.
  • Coefficient of Determination (R^2): This metric indicates how well a set of predictions fits the actual values. In other words, it tells us how much of the variability in the outcome is explained by the predictors in our model. R^2 is at most 1. A value closer to 1 means that a larger proportion of the variance is accounted for by the model, a value of 0 corresponds to always predicting the mean, and the value can be negative on held-out data when the model does worse than that.
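Both metrics are short enough to compute by hand, which can make the definitions more concrete. This is a minimal NumPy sketch with hypothetical helper names (rmse, r_squared) and made-up numbers. It also illustrates that R^2 drops below 0 when predictions are worse than always predicting the mean.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 6.5, 9.5]   # every prediction is off by 0.5
bad_pred = [9.0, 7.0, 5.0, 3.0]    # predictions in reversed order

print(rmse(y_true, good_pred), r_squared(y_true, good_pred))  # 0.5 and 0.95
print(rmse(y_true, bad_pred), r_squared(y_true, bad_pred))    # R^2 is -3.0
```

In practice the scikit-learn functions mean_squared_error and r2_score do the same job; the hand-written versions are only here to show the arithmetic.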

Let’s calculate these metrics for our model:

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Calculate RMSE for the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Calculate R^2 for the training and test sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Display results
print(f'Training RMSE: {rmse_train:.2f}')
print(f'Test RMSE: {rmse_test:.2f}')
print(f'Training R^2: {r2_train:.2f}')
print(f'Test R^2: {r2_test:.2f}')

Output:

Training RMSE: 53.26
Test RMSE: 53.23
Training R^2: 0.53
Test R^2: 0.39

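The RMSE values give us an idea of how much error the model makes in its predictions, with a higher weight for large errors.

The R^2 values, on the other hand, tell us how well our model fits the data. A high R^2 and a low RMSE indicate a good fit. Here, the training and test RMSE are almost identical (53.26 vs. 53.23), while the test R^2 (0.39) is lower than the training R^2 (0.53). Because R^2 is measured relative to the variance of the target values in each set, the two metrics do not always move together.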
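An RMSE value is easier to judge next to a naive reference. One possible approach, sketched here on hypothetical synthetic data rather than the series' dataset, is to compare the linear model with scikit-learn's DummyRegressor, which always predicts the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Naive reference: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = float(np.sqrt(mean_squared_error(y_te, baseline.predict(X_te))))
linear_rmse = float(np.sqrt(mean_squared_error(y_te, linear.predict(X_te))))
baseline_r2 = r2_score(y_te, baseline.predict(X_te))
linear_r2 = r2_score(y_te, linear.predict(X_te))

print(f"Baseline  RMSE: {baseline_rmse:.2f}  R^2: {baseline_r2:.2f}")
print(f"Linear    RMSE: {linear_rmse:.2f}  R^2: {linear_r2:.2f}")
```

If a model's test RMSE is not clearly below that of the mean predictor, the features are adding little. The same comparison could be run with the X_train, X_test, y_train, and y_test variables from this post.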

Conclusion: Onwards to Tuning and Optimization

We’ve now successfully built and evaluated a simple linear regression model on the Diabetes dataset. However, our journey doesn’t end here.

ML models rarely provide the best possible results on the first try, and our model is no exception.

There are several ways to potentially improve the performance of our model, such as:

  • Feature engineering: Can we create new features that capture more information?
  • Model selection: Are there other models that might perform better?
  • Hyperparameter tuning: Can we adjust the settings of our model to get better results?
  • Ensemble methods: Can we combine the predictions of multiple models to get better results?

In the next post, we’ll explore some of these options to improve our model. As always, happy coding and learning!
