5 Hidden Gems for Machine Learning Enthusiasts: Discovering Lesser-Known Libraries That Can…

“Don’t limit yourself to the popular choices. The best solutions often come from exploring the unknown.”

“The true joy of discovery lies in uncovering hidden gems that have yet to shine.”

“Innovation comes from stepping outside your comfort zone and trying something new. These libraries are your passport to new frontiers.”

Hello, fellow data scientists and machine learning enthusiasts! In this article, I want to introduce you to some lesser-known but incredibly useful libraries that can help you build better machine learning models. While TensorFlow and PyTorch may be the most popular choices, there are many other libraries out there that are worth exploring.

Criteria for selecting libraries:

Before we dive into the libraries themselves, let me explain the criteria I used to select them. First and foremost, I looked for libraries that are relatively new or have recently gained popularity. I also considered their usefulness and potential for solving common machine learning problems. Finally, I wanted to include a mix of libraries for different tasks, such as hyperparameter optimization, time series forecasting, and natural language processing.

With that in mind, let’s take a closer look at some of these hidden gems.

CatBoost:

First up, we have CatBoost. This is a gradient boosting library that can handle categorical features and missing values without requiring preprocessing. It’s been gaining popularity in recent years, and for good reason.

CatBoost encodes categorical features internally, using ordered target statistics, so they go straight into the decision trees without manual preprocessing. This is incredibly useful for datasets with many categorical columns, as it eliminates the need for one-hot encoding or other encoding steps, and its ordered boosting scheme helps reduce the target leakage that naive target encoding can introduce.

Here’s an example of how to use CatBoost to train a model:

import catboost as cb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset and hold out a test split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# A small gradient-boosted model; verbose=0 silences the per-iteration log
clf = cb.CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out test set
print(clf.score(X_test, y_test))

In this example, we’re using CatBoost to classify breast cancer samples as either malignant or benign. We split the data into training and testing sets, then create a CatBoostClassifier object and fit it to the training data. Finally, we evaluate the model’s accuracy on the test data.
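The breast cancer dataset is entirely numeric, so the example above doesn’t actually exercise CatBoost’s headline feature. Here’s a minimal sketch of passing categorical columns straight in via cat_features; the toy DataFrame and its column names are made up purely for illustration:

import pandas as pd
import catboost as cb

# Tiny made-up dataset with raw string columns; no one-hot encoding needed
df = pd.DataFrame({
    'city': ['london', 'paris', 'paris', 'berlin', 'london', 'berlin'],
    'plan': ['basic', 'pro', 'basic', 'pro', 'pro', 'basic'],
    'usage': [10.5, 42.0, 7.2, 55.1, 38.4, 9.9],
    'churned': [0, 1, 0, 1, 1, 0],
})

X = df[['city', 'plan', 'usage']]
y = df['churned']

# Tell CatBoost which columns are categorical; it encodes them internally
clf = cb.CatBoostClassifier(iterations=50, depth=2, verbose=0)
clf.fit(X, y, cat_features=['city', 'plan'])

print(clf.predict(X))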

Optuna:

Next up, we have Optuna. This is a hyperparameter optimization library that uses Bayesian optimization and other techniques to find the best hyperparameters for a model. Hyperparameter tuning can be a tedious and time-consuming task, but Optuna makes it much easier.

Optuna works through a define-by-run API: you describe the search space for the hyperparameters inside an objective function, and Optuna repeatedly calls that function, trying different combinations to see which ones perform best. Under the hood, its samplers (Tree-structured Parzen Estimators by default) explore the search space efficiently to home in on the optimal hyperparameters.

Here’s an example of how to use Optuna to optimize the hyperparameters for a support vector machine (SVM) classifier:

import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

def objective(trial):
    # Sample both hyperparameters on a log scale over a wide range
    C = trial.suggest_loguniform('C', 1e-10, 1e10)
    gamma = trial.suggest_loguniform('gamma', 1e-10, 1e10)
    clf = SVC(C=C, gamma=gamma)
    clf.fit(X_train, y_train)
    # The value Optuna tries to maximize: accuracy on the held-out set
    return clf.score(X_test, y_test)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(study.best_params)

In this example, we’re using Optuna to find the best values for the C and gamma hyperparameters of an SVM classifier. We define an objective function that receives a trial object, samples values for C and gamma from it, creates an SVM with those hyperparameters, fits it to the training data, and returns the accuracy on the test data.

We then create an Optuna study object, which represents the optimization process. We tell it to maximize the objective function (i.e., find the hyperparameters that result in the highest accuracy) and to run 100 trials. Finally, we print out the best hyperparameters found by Optuna.
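Once the study finishes, study.best_params is just a plain dictionary, so you can train a final model with those values directly. A quick follow-on sketch, continuing from the code above:

# Retrain an SVM with the best hyperparameters found by the study
best_clf = SVC(**study.best_params)
best_clf.fit(X_train, y_train)
print(best_clf.score(X_test, y_test))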

PyCaret:

Next up, we have PyCaret. This is a low-code library that simplifies the process of building, training, and deploying machine learning models. It’s designed to be easy to use for beginners, but it’s also powerful enough for advanced users.

PyCaret provides a wide range of pre-processing, modeling, and evaluation functions, all of which can be accessed through a simple and intuitive interface. It also includes a number of useful features, such as automatic feature engineering and model selection.

Here’s an example of how to use PyCaret to find, tune, and finalize the best classifier for a credit card default dataset:

from pycaret.datasets import get_data
from pycaret.classification import *

# Credit card default dataset bundled with PyCaret
data = get_data('credit')

# setup() handles preprocessing; silent=True skips the interactive confirmation prompt
clf = setup(data, target='default', silent=True)

# Train and cross-validate a suite of classifiers, keeping the best one
best = compare_models()

# Tune the best model's hyperparameters, then retrain on the full dataset
tuned = tune_model(best)
final = finalize_model(tuned)

# Score the original data with the finalized model
predict = predict_model(final, data)

print(predict.head())

In this example, we’re using PyCaret on a credit card default dataset. We use the setup function to preprocess the data and set the target variable, then call the compare_models function to train and cross-validate several candidate classifiers and return the best-performing one. We then use the tune_model function to optimize that model’s hyperparameters, and the finalize_model function to retrain it on the entire dataset. Finally, we use the predict_model function to generate predictions on the original dataset.
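PyCaret also covers the hand-off to deployment. The finalized pipeline (preprocessing plus model) can be saved to disk and reloaded later for scoring; a minimal sketch continuing from the code above, where the file name is arbitrary:

from pycaret.classification import save_model, load_model

# Persist the whole pipeline to credit_pipeline.pkl
save_model(final, 'credit_pipeline')

# ...later, in another session
loaded = load_model('credit_pipeline')
print(predict_model(loaded, data).head())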

LightGBM:

Next up, we have LightGBM. This is a gradient boosting library that uses histogram-based algorithms to speed up training and reduce memory usage. It’s been gaining popularity in recent years due to its impressive performance and efficiency.

LightGBM uses a number of innovative techniques, such as histogram-based split finding, Gradient-based One-Side Sampling (GOSS), and Exclusive Feature Bundling (EFB), to train faster and use less memory than traditional gradient boosting implementations, typically with little or no loss in accuracy.

Here’s an example of how to use LightGBM to train a model:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Wrap the training data in LightGBM's Dataset format
train_data = lgb.Dataset(X_train, label=y_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,   # fraction of features sampled per tree
    'bagging_fraction': 0.8,   # fraction of rows sampled per bagging round
    'bagging_freq': 5,
    'verbose': 0
}

# Train for 100 boosting rounds
clf = lgb.train(params, train_data, num_boost_round=100)

# predict() returns probabilities for the binary objective; threshold at 0.5
y_pred = (clf.predict(X_test) >= 0.5).astype(int)

print((y_pred == y_test).mean())

In this example, we’re using LightGBM to classify breast cancer samples as either malignant or benign. We split the data into training and testing sets, then create a LightGBM Dataset object with the training data. We then define a set of parameters for the model, including the boosting type, objective function, and learning rate. Finally, we train the model with 100 iterations and evaluate its accuracy on the test data.
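If you prefer the scikit-learn estimator interface, LightGBM also ships a wrapper class, which makes it easy to drop into existing pipelines and tools like GridSearchCV. A minimal sketch reusing the same train/test split as above:

# Scikit-learn style wrapper around the same booster
clf = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.05, num_leaves=31)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))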

Prophet:

Last but not least, we have Prophet. This is a time series forecasting library that uses additive models to capture trends, seasonality, and other patterns. It was developed by Facebook and has been gaining popularity in recent years.

Prophet is designed to be easy to use and provides a number of useful features, such as built-in holiday effects for many countries and support for custom seasonality patterns (a short sketch of both follows the forecasting example below). It is also robust to missing data and outliers, making it a useful tool for real-world datasets.

Here’s an example of how to use Prophet to forecast the number of airline passengers over time:

import pandas as pd
from fbprophet import Prophet
from fbprophet.plot import add_changepoints_to_plot

# Classic airline passengers dataset (columns: year, month, passengers)
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv')

# Prophet expects a datetime column 'ds' and a value column 'y',
# so combine year and month into a proper date first
data['ds'] = pd.to_datetime(data['year'].astype(str) + '-' + data['month'].astype(str))
data = data.rename(columns={'passengers': 'y'})[['ds', 'y']]

m = Prophet()
m.fit(data)

# Extend the dataframe 36 months past the last observation and forecast
future = m.make_future_dataframe(periods=36, freq='MS')
forecast = m.predict(future)

fig = m.plot(forecast)
_ = add_changepoints_to_plot(fig.gca(), m, forecast)

In this example, we’re using Prophet to forecast the number of airline passengers over time. We start by reading the data from a CSV file, combining the year and month columns into a single datetime column named ds, and renaming passengers to y, which is the format Prophet expects. We then create a Prophet object and fit it to the data. We use the make_future_dataframe function to extend the dataframe 36 months beyond the last observation, and then use the predict function to generate a forecast. Finally, we plot the forecast with the plot function and overlay the detected changepoints with the add_changepoints_to_plot function.
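As mentioned above, Prophet supports holiday effects and custom seasonalities, and both are added with one call each on the model object before fitting. A minimal sketch, where the monthly seasonality settings are illustrative rather than tuned:

m = Prophet()
m.add_country_holidays(country_name='US')  # built-in holiday calendar for the US
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)  # extra custom seasonal term
m.fit(data)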

Conclusion:

And that’s it! We’ve covered five lesser-known but incredibly useful libraries for machine learning. CatBoost handles categorical features and missing values without preprocessing, Optuna takes the tedium out of hyperparameter optimization, PyCaret provides a low-code interface for building and training models, LightGBM is a fast and memory-efficient gradient boosting library, and Prophet makes time series forecasting straightforward. By exploring these libraries and incorporating them into your workflows, you can build better and more efficient machine learning models.