ScikitLLM – A powerful combination of SKLearn and LLMs

GitHub Repo

Scikit-LLM is an open-source Python library that combines large language models, such as ChatGPT, with Scikit-learn, a popular machine-learning library. The combination makes text analysis tasks such as classification and vectorization available through an interface that many practitioners already know. In this post, we explore the GitHub repository and see how the library fits into everyday machine learning and text analysis work.

Join me for a hands-on guide to using the ScikitLLM library and integrating it smoothly with your existing SKLearn code.

Introduction to SKLearn (Scikit-learn)

Scikit-learn, often referred to as SKLearn, is one of the most widely used open-source Python libraries in the field of machine learning.
It’s a go-to resource for many data scientists and machine learning enthusiasts due to its comprehensive suite of algorithms, simplicity, and ease of use.

Whether you’re looking to implement regression, classification, clustering, or dimensionality reduction, Scikit-learn has got you covered.

Designed with a focus on the practical application of machine learning, Scikit-learn is built on the foundations of Python’s scientific computing libraries: NumPy, SciPy, and Matplotlib.
This means that it integrates well with Python’s scientific stack and can work efficiently with NumPy arrays and SciPy sparse matrices as inputs.

A central philosophy behind Scikit-learn is its consistent and simple interface. Regardless of the machine learning algorithm you choose to use, the process of using it is usually the same:

  • Import the appropriate class and create an instance of it.
  • Call the ‘fit’ method with your data.
  • Use methods such as ‘predict’ or ‘transform’ to apply the model.

This uniformity provides a smooth user experience and reduces the learning curve for using new models. Let’s take a look at a basic example, using a decision tree classifier:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an instance of a DecisionTreeClassifier
clf = tree.DecisionTreeClassifier()

# Train the model
clf.fit(X_train, y_train)

# Use the model to make predictions on unseen data
predictions = clf.predict(X_test)

In the above code, we first import the necessary libraries and load the Iris dataset, which is included in Scikit-learn’s datasets module.
We split this data into a training set, to train our model, and a test set, to evaluate its performance.
We then create an instance of the DecisionTreeClassifier, fit it to our training data, and finally, use it to predict the classes of our test data.

This simple and consistent interface, combined with a wide range of machine learning algorithms, is part of why Scikit-learn is such a popular choice for machine learning in Python.
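
To make the convention concrete, here is a minimal sketch of a hand-written classifier that follows the same fit/predict pattern. MajorityClassifier is a hypothetical name used only for illustration and is not part of Scikit-learn (its own baseline of this kind is DummyClassifier). The point is that any object exposing these two methods can be used the same way, which is the idea ScikitLLM builds on.

```python
from collections import Counter


class MajorityClassifier:
    """Toy classifier: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learn the only "parameter" this model has: the most common label.
        # The trailing underscore marks an attribute learned during fit,
        # following Scikit-learn's naming convention.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows clf.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per input row, regardless of the row's contents.
        return [self.majority_ for _ in X]


# Tiny made-up training set, just to exercise the interface.
X_train = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
y_train = ["setosa", "setosa", "versicolor"]

clf = MajorityClassifier().fit(X_train, y_train)
predictions = clf.predict([[5.0, 3.4], [6.0, 3.0]])
print(predictions)
```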

Now that we are familiar with SKLearn, let’s dive into ScikitLLM!

ScikitLLM – Introduction

Scikit-LLM is a Python library that integrates large language models, such as ChatGPT, into the Scikit-learn framework. The library itself is open source and free to use.
It provides a seamless way to perform advanced natural language processing (NLP) tasks, from zero-shot text classification to sophisticated text vectorization, all within the Scikit-learn pipeline.

ScikitLLM wraps the interaction with the OpenAI API, automatically handling tasks such as API key configuration and response processing.

The library ensures compatibility with Scikit-learn’s interface, meaning you can use familiar methods like fit and predict. This allows users to leverage the power of large language models while maintaining the familiar workflows and practices associated with Scikit-learn.

Installation

So, how do we install the library? You first need to have Python installed. From there, the path is straightforward, because the library is open source and available through pip. Run the following command to install ScikitLLM:

pip install scikit-llm

Simple, right?
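
If you want to confirm that the installation worked, one option is to ask Python for the installed version of the package. This is a small sketch using only the standard library; installed_version is a helper name made up for this example.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a pip distribution, or None."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scikit-llm")
print(version if version else "scikit-llm is not installed")
```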

OpenAI API Key

As ScikitLLM is a very new library, it currently supports only OpenAI chat models (ChatGPT, GPT-3/4, etc.). So in order to use it, you need an API key, which you can generate in your OpenAI account. If you are not familiar with how to get your own OpenAI API key, please click here for a step-by-step guide to get the key in 60 seconds.

Now that we have our API key, let’s dive into the code.

Getting started with the code

We start by importing ScikitLLM and setting the OpenAI API key configuration. Why? Whenever we train, predict, or otherwise use a model with ScikitLLM, it uses this configuration to “talk” to the OpenAI LLMs.

from skllm.config import SKLLMConfig

# Replace the placeholder strings with your own API key and organization ID
SKLLMConfig.set_openai_key("OPENAI_API_KEY")
SKLLMConfig.set_openai_org("OPENAI_API_ORG")
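
A common alternative to pasting the key into your source code is to read it from environment variables. The sketch below assumes the variables are named OPENAI_API_KEY and OPENAI_API_ORG (names borrowed from the placeholders above, not something the library requires), and read_openai_credentials is a helper made up for this example.

```python
import os


def read_openai_credentials(env=os.environ):
    """Read the API key and organization ID from environment variables."""
    key = env.get("OPENAI_API_KEY")
    org = env.get("OPENAI_API_ORG")
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key, org


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": "sk-demo", "OPENAI_API_ORG": "org-demo"}
key, org = read_openai_credentials(demo_env)
print(org)
```

In a real script you would call read_openai_credentials() with no arguments and pass the two values to SKLLMConfig.set_openai_key and SKLLMConfig.set_openai_org.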

Train and Predict with ZeroShotGPTClassifier

First, we import our model:

from skllm import ZeroShotGPTClassifier

Next, let’s pick a simple dataset for our experiment. You can use any dataset you wish to work with:

from skllm.datasets import get_classification_dataset

# demo sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset() 

Create the model by providing the exact OpenAI model to use. It can be “gpt-3.5-turbo” or “gpt-4” (if your account has access to it).

# Create our model object - using "gpt-3.5-turbo" LLM
clf = ZeroShotGPTClassifier(openai_model = "gpt-3.5-turbo")

Now we start training. The beautiful part is that the API is the same as in SKLearn:

# Fit/Train our model
clf.fit(X, y)

Now we move to the prediction part. Here too, the API is SKLearn-like:

# Predict/Inference on our Data using trained model
labels = clf.predict(X)

All together:

from skllm.config import SKLLMConfig
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Configure OpenAI access
SKLLMConfig.set_openai_key("OPENAI_API_KEY")
SKLLMConfig.set_openai_org("OPENAI_API_ORG")

# demo sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset() 

# Create our model object - using "gpt-3.5-turbo" LLM
clf = ZeroShotGPTClassifier(openai_model = "gpt-3.5-turbo")

# Fit/Train our model
clf.fit(X, y)

# Predict/Inference on our Data using trained model
labels = clf.predict(X)
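
Since the demo dataset comes with true labels, a natural next step is to compare them with the predictions. The sketch below computes plain accuracy by hand so that it runs without an API key; y_demo and labels_demo are made-up stand-ins for y and labels, not real model output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up stand-ins for y and labels so the sketch runs offline.
y_demo = ["positive", "negative", "neutral", "positive"]
labels_demo = ["positive", "negative", "positive", "positive"]
print(f"Accuracy: {accuracy(y_demo, labels_demo):.2f}")
```

In a Scikit-learn workflow, sklearn.metrics.accuracy_score(y, labels) gives the same number.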

Truly Zero Shot Unlabeled Model

So far we saw an example that starts by fitting a model and then predicting. The unstated assumption was that we have labeled data, i.e., y exists and is available.

The power of recent LLMs is that they can make good predictions without any additional training: true zero-shot predictions.

Let’s see how we can implement this use case in our code. We have X, which is our input, but y, our labels, is missing.

We just need to provide the possible labels/categories of our task:

# Import model and dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load the data
X, _ = get_classification_dataset()

# Create the model
clf = ZeroShotGPTClassifier()

# Only provide the possible categories (i.e., the labels)
clf.fit(None, ['positive', 'negative', 'neutral'])

# Predict
labels = clf.predict(X)
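
To build some intuition for why fit can work with labels only, here is a sketch of how a zero-shot classifier could be written. It is an illustration under assumptions, not ScikitLLM’s actual implementation: ZeroShotSketch, the prompt wording, and fake_llm (an offline stand-in for a real model call) are all made up for this example.

```python
class ZeroShotSketch:
    """Illustrative zero-shot classifier; not Scikit-LLM's actual code."""

    def __init__(self, complete):
        # `complete` is any callable that maps a prompt string to the
        # model's text reply (hypothetical stand-in for an LLM API call).
        self.complete = complete

    def fit(self, X, y):
        # No training happens here: only the candidate labels are stored.
        self.classes_ = sorted(set(y))
        return self

    def _prompt(self, text):
        options = ", ".join(self.classes_)
        return (
            f"Classify the text into one of: {options}.\n"
            f"Text: {text}\n"
            "Answer with the label only."
        )

    def predict(self, X):
        predictions = []
        for text in X:
            reply = self.complete(self._prompt(text)).strip().lower()
            # Fall back to the first label if the reply is not a known label.
            predictions.append(reply if reply in self.classes_ else self.classes_[0])
        return predictions


def fake_llm(prompt):
    """Offline stand-in for a real model: keyword matching on the text line."""
    text = prompt.split("Text: ")[1].split("\n")[0].lower()
    if "love" in text:
        return "positive"
    if "terrible" in text:
        return "negative"
    return "neutral"


clf = ZeroShotSketch(fake_llm).fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(
    ["I love this movie", "The service was terrible", "It arrived on Tuesday"]
)
print(labels)
```

Because the model is passed in as a plain callable, the same class can be exercised offline with a fake and later pointed at a real API.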

Text Vectorization

In addition to its use as a classifier, GPT can also be employed purely for data preprocessing. The GPTVectorizer class converts text blocks of any length into fixed-dimensional vectors. These vectors can then be used with virtually any classification or regression model, which makes the preprocessing step reusable across many downstream models.

The code:

# Import
from skllm.preprocessing import GPTVectorizer

# Create the model (the vectorizer uses an embedding model, so no chat model name is passed)
model = GPTVectorizer()

# Fit and Transform (one after the other)
vectors = model.fit_transform(X)
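
To see what “fixed-dimensional” means in practice without calling the API, here is a toy transformer with the same fit/transform interface. It uses hashed word counts, which is a much simpler technique than GPT embeddings and is swapped in here only so the example runs offline; HashingSketchVectorizer and n_features are names made up for this sketch.

```python
import zlib


class HashingSketchVectorizer:
    """Toy transformer: maps text of any length to a fixed-size count vector."""

    def __init__(self, n_features=8):
        self.n_features = n_features

    def fit(self, X, y=None):
        # Nothing to learn; present only to match the fit/transform interface.
        return self

    def transform(self, X):
        vectors = []
        for text in X:
            vector = [0] * self.n_features
            for token in text.lower().split():
                # crc32 is stable across runs, unlike the built-in hash().
                index = zlib.crc32(token.encode("utf-8")) % self.n_features
                vector[index] += 1
            vectors.append(vector)
        return vectors

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


texts = ["short text", "a much longer piece of text with many more words in it"]
vectors = HashingSketchVectorizer(n_features=8).fit_transform(texts)
print([len(v) for v in vectors])
```

Both texts end up as vectors of the same length, which is what lets a downstream classifier treat them uniformly.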

That was easy, too easy.
Let’s see a more complicated example:

  • Vectorize the data
  • Train a classifier on the Vectorized data instead of the raw text
  • Incorporate Scikit-learn API and different features such as:
    • Pipeline – Create a flow/pipeline of models
    • LabelEncoder – Encode your labels as integers
    • XGBClassifier – XGBoost classifier

We will work with the SKLearn Pipeline object to build a complex model out of smaller building blocks. In our case, GPTVectorizer is the first step and XGBClassifier is the second step.

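# Import SKLearn stuff
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer
from skllm.datasets import get_classification_dataset

# Load the demo dataset and split it into train and test sets
X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

le = LabelEncoder()
# Encode the labels - both train and test
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# Create a pipeline with two steps: GPTVectorizer + XGBClassifier
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]
clf = Pipeline(steps)
clf.fit(X_train, y_train_encoded)

# Predict encoded labels for the test set
yh = clf.predict(X_test)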
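
Note that yh contains integer codes rather than the original text labels, because the classifier was trained on the encoded labels. The sketch below mimics in plain Python what LabelEncoder does (sorted unique labels mapped to 0, 1, 2, ...) and how codes are turned back into labels; the values of y_train and yh here are made-up stand-ins.

```python
# Made-up stand-in for the training labels.
y_train = ["positive", "negative", "neutral", "positive"]

# Sorted unique labels; a label's position in this list is its integer code.
classes = sorted(set(y_train))
to_code = {label: code for code, label in enumerate(classes)}

y_train_encoded = [to_code[label] for label in y_train]
print(y_train_encoded)

# Made-up stand-in for the integer predictions a pipeline would return.
yh = [2, 0, 1]
decoded = [classes[code] for code in yh]
print(decoded)
```

With the real encoder from the pipeline example, le.inverse_transform(yh) performs the decoding step.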

It is impressive how convenient it is to use GPT now!

Conclusion

So what did we learn today? We got familiar with the new open-source Python library ScikitLLM. Now you can use OpenAI LLMs for different Natural Language Processing (NLP) tasks, just like any other model in SKLearn (Scikit-learn).
It can speed up your work and make your experiments clearer, all under the same umbrella that you are probably already working with.
So what are you waiting for?
