Problem Understanding and Definition in Machine Learning
Welcome to the second post in our series on End-to-End Machine Learning Projects. In the first post, we provided an overview of what an end-to-end machine learning (ML) project looks like. In addition, we introduced the problem that we will be working on throughout this series: predicting house prices.
Haven’t read the first post? Click here.
Today, we dig deeper into understanding the problem and defining it in machine learning terms. A detailed exploration at this stage shapes the course of an ML project and helps ensure its success.
What you’ll learn
- Understanding the Problem
- Defining the Problem in Machine Learning Terms
- The Full Picture
- The Key to a Successful Start
Unveiling the Layers of the Problem
In our case, the problem at hand appears simple: predicting house prices. Even so, there are several facets that we need to understand and untangle. To do this, we’ll break it down into four key aspects:
- Goal
- Use Case
- End Users
- Data Availability
Goal
The first step in understanding a problem is to crystallize what we’re aiming to achieve. In our case, we aim to develop a machine-learning model that can accurately predict house prices. While this seems straightforward, we must ask ourselves: what features will influence this price? What kind of price range are we considering? For this project, we’ll consider various features such as the number of rooms, the age of the house, location, proximity to amenities, and more. We’ll assume our prices span the range of typical residential house prices.
Use Case
It’s important to identify how our ML model will be used, whether it will be a part of a larger system or a standalone tool. How will it interact with other components or users? In our case, we assume the model will be integrated into a real estate platform, providing instant price estimates to users interested in specific properties. Understanding this not only helps in creating the model but also in making decisions around the model’s interpretability, speed, and integration with the platform’s existing architecture.
End Users
Users can range from homeowners and buyers to real estate agents, investors, and app developers, and understanding who will use the model influences how we shape it.
For our project, let’s say the model is primarily aimed at buyers and real estate agents. This user-centric approach helps us decide how to present the results, what levels of accuracy are acceptable, and which features matter most to our users.
Data Availability
An ML model is only as good as the data it’s trained on. Therefore, understanding what data is available for training the model is critical. Do we have historical data on house sales? Do we have enough data to capture the nuances of the house prices? In our scenario, let’s assume we have access to a rich dataset of past house sales that includes the sale price and various features of each house.
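One way to make the "do we have enough data?" question concrete is to measure how often each field is actually filled in. The sketch below is only an illustration: the `HouseSale` record, its field names, and the `availability_report` helper are hypothetical and do not come from a real dataset.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HouseSale:
    """One historical sale. All field names are hypothetical placeholders."""
    sale_price: Optional[float]
    num_rooms: Optional[int]
    house_age_years: Optional[float]
    location: Optional[str]


def availability_report(sales):
    """Return, for each field, the fraction of records where it is not missing."""
    total = len(sales)
    report = {}
    for field in fields(HouseSale):
        present = sum(1 for sale in sales if getattr(sale, field.name) is not None)
        report[field.name] = present / total if total else 0.0
    return report


# Made-up records, with None marking a missing value.
sales = [
    HouseSale(250_000.0, 3, 12.0, "suburb"),
    HouseSale(410_000.0, 4, None, "city"),
    HouseSale(None, 2, 30.0, "rural"),
    HouseSale(320_000.0, 3, 8.0, None),
]

report = availability_report(sales)
for name, share in report.items():
    print(f"{name}: {share:.0%} present")
```

A record without a sale price cannot be used as a training example for a price model, so the coverage of that field is usually the first number worth checking.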
Translating the Problem into the ML Lexicon
Once we’ve dissected the problem and understood its aspects, we must define it in terms of machine learning. This involves pinning down the type of ML problem we’re tackling and the metrics to measure our model’s success.
- Problem Type: In general, we can categorize machine learning problems into a few types: supervised (classification and regression), unsupervised (for example, clustering), semi-supervised, and reinforcement learning. In our case, as we’re predicting a continuous numerical value (house prices), we’re dealing with a supervised regression problem.
- Success Metrics: Defining the criteria for success is pivotal in machine learning projects. Without a success metric, we have no way to measure improvement, to tell whether we have achieved our goal, or to know when the project is ready to be deployed.
For regression problems, common metrics include:
– Mean Absolute Error (MAE).
– Mean Squared Error (MSE).
– Root Mean Squared Error (RMSE).
– R² score.
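For reference, these are the standard definitions of the four metrics. The notation is an assumption added here, not taken from the series: $y_i$ is the actual sale price of house $i$, $\hat{y}_i$ is the predicted price, $\bar{y}$ is the mean of the actual prices, and $n$ is the number of houses evaluated.

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
```

MAE and RMSE are expressed in the same units as the price, MSE is in squared units, and R² is unitless.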
For our project, we’ll choose RMSE as our primary metric. This metric works well for our case as it heavily penalizes larger errors, which is important when dealing with high-value transactions like house sales.
However, we’ll also track MAE, which reports the typical size of an error in the same units as the price, and R², which measures the proportion of the variance in our dependent variable that is predictable from the independent variables.
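To see why RMSE reacts more strongly to large errors than MAE does, here is a minimal sketch that computes the metrics in plain Python. The prices and predictions are made-up numbers chosen for illustration. In a real project you would more likely use a library implementation.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in price units."""
    return math.sqrt(mse(y_true, y_pred))


def r2_score(y_true, y_pred):
    """R²: one minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative sale prices and two hypothetical sets of predictions.
actual = [200_000, 300_000, 400_000, 500_000]
even_errors = [210_000, 290_000, 410_000, 490_000]   # every house off by 10,000
one_big_miss = [200_000, 300_000, 400_000, 540_000]  # one house off by 40,000

for label, predicted in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(
        f"{label}: MAE={mae(actual, predicted):.0f} "
        f"RMSE={rmse(actual, predicted):.0f} "
        f"R2={r2_score(actual, predicted):.3f}"
    )
```

Both prediction sets have the same MAE of 10,000, but RMSE doubles from 10,000 to 20,000 when the total error is concentrated in a single house. That is the behavior we want from a primary metric when one badly mispriced property is costlier than several small misses.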
Tying it all Together: Defining the Problem Statement
After gaining a thorough understanding of our problem and defining it in machine learning terms, we can finally pen down our problem statement:
“Develop a supervised machine learning model to predict residential house prices based on a range of house features, such as the number of rooms, age of the house, location, and more, with the lowest possible RMSE. The model will be integrated into a real estate platform primarily used by home buyers and real estate agents.”
This clear and succinct statement will guide our decisions as we proceed through each step of this project.
Final words: The Key to a Successful Start
Understanding and defining the problem is the cornerstone of any machine-learning project. It sets the direction for everything, from data collection and preparation to model selection, training, and evaluation.
Having now framed our problem and its machine learning interpretation, we’re ready to embark on our next journey: Data Collection and Preparation. In our next post, we’ll dive into finding the right data, collecting it, and preparing it for our machine-learning model.
We look forward to your engagement in this learning journey, and we encourage you to leave comments, ask questions, and share insights as we move forward. Until then, happy learning!
What’s next?
End-to-End ML Project – Post 1 – Introduction
End-to-End ML Project – Post 3 – Data Collection and Preparation