How to Generate Test Data for Machine Learning in Python using scikit-learn

A great place to start when testing a new machine learning algorithm is to generate test data. Collecting data can be a tedious task, and often the best (and easiest) solution is to use generated data rather than collecting it yourself. More often than not, you simply want to compare different machine learning algorithms and don't care about the origin of the data. The Python library scikit-learn (sklearn) lets you create test datasets suited to many different machine learning problems. Scikit-learn is a popular library that contains a wide range of machine learning algorithms and can be used for data mining and data analysis.


There are two ways to generate test data in Python using sklearn. The first is to load existing datasets, as explained in the following section. The second is to create the test data yourself using sklearn. This guide will go over both approaches.



All scikit-learn Test Datasets and How to Load Them From Python

Python’s scikit-learn library has a great list of test datasets available for you to play around with.

The sklearn library provides a set of “toy datasets” for the purpose of testing machine learning algorithms. The data is returned by the following sklearn.datasets functions:

  • load_boston() Boston housing prices for regression (deprecated and removed in scikit-learn 1.2; use fetch_california_housing() instead)
  • load_iris() The iris dataset for classification
  • load_diabetes() The diabetes dataset for regression
  • load_digits() Images of digits for classification
  • load_linnerud() The linnerud dataset for multivariate regression
  • load_wine() The wine dataset for classification
  • load_breast_cancer() The breast cancer dataset for classification

Here’s a quick example of how to load one of the datasets above.

# Import libraries
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt

# Load the data
digits = load_digits()

# Plot one of the digits ("8" in this case)
plt.gray()
plt.matshow(digits.images[8])
plt.show()

This gives us the following figure:

[Figure: the digit "8" from the load_digits dataset]

Now that we have seen how to load test data, let’s look at how to generate the data ourselves.

Generate Test Data for Linear Regression Problems

Regression is a technique used to estimate the relationship between variables. In linear regression, one seeks the best possible linear fit between two or more variables. Regression belongs to the branch of machine learning called supervised learning. The data is generated with the sklearn.datasets.make_regression() function.

Let’s look at a Python example of generating test data for a linear regression problem using sklearn.

# Import libraries
from sklearn import datasets
from matplotlib import pyplot as plt

# Get regression data from scikit-learn
x, y = datasets.make_regression(n_samples=20, n_features=1, noise=0.5)

# Visualize the data
plt.scatter(x,y)
plt.show()

The function make_regression() takes several inputs, as shown in the example above. The inputs configured here are the number of generated data points (n_samples), the number of input features (n_features), and the noise level in the output data (noise).

Running the code in Python produces the following result:

[Figure: scatter plot of the generated regression data]

Create Test Data for Clustering

Clustering is about finding distinct clusters, or patterns, in your data. There are various machine learning algorithms that can classify data into clusters. We create the data using the sklearn.datasets.make_blobs() function (older scikit-learn versions exposed it as sklearn.datasets.samples_generator.make_blobs).

make_blobs from sklearn can be used to generate clustering data with any number of features n_features, along with the corresponding labels.

A code example using the scikit-learn library and make_blobs is shown below.

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import pandas as pd

# Create test data: features (X) and labels (y)
X, y = make_blobs(n_samples=200, centers=4, n_features=2)

# Group the data by labels
Xy = pd.DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
groups = Xy.groupby('label')

# Plot the blobs
fig, ax = plt.subplots()
colors = ["blue", "red", "green", "purple"]
for idx, classification in groups:
    classification.plot(ax=ax, kind='scatter', x='x1', y='x2', label=idx, color=colors[idx])
plt.show()

This gives us the following plot:

[Figure: four clusters generated with make_blobs]

Generate Test Data for Face Recognition – The Olivetti Faces Dataset

Now for my favourite dataset from scikit-learn, the Olivetti faces. Let’s generate test data for facial recognition using Python and sklearn.

The Olivetti faces test data is quite old, as all the photos were taken between 1992 and 1994. All the photos are black and white, 64×64 pixels, and the faces have been centered, which makes them ideal for testing a face recognition machine learning algorithm.

The images are retrieved from sklearn in Python using the function fetch_olivetti_faces(). When calling this function, Python will download all the images, which may take some time. When you want to plot the images, it can therefore be a good idea to only plot a small subset of them to avoid memory problems.

Here is a Python example of how to load the Olivetti faces from sklearn using the fetch_olivetti_faces() function.

# Import libraries
from sklearn import datasets
import numpy as np
from matplotlib import pyplot as plt

# Get the data using fetch_olivetti_faces()
faces = datasets.fetch_olivetti_faces()

# Visualize the faces (let's take the first 15)
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(faces.images[i], cmap=plt.cm.bone)
plt.show()

And here we see the first 15 faces of the Olivetti faces dataset:

[Figure: the first 15 Olivetti faces]

Another face images dataset from sklearn – Labeled Faces in the Wild

For a newer and colorised dataset, we suggest using the Labeled Faces in the Wild (LFW) dataset. This is a larger dataset (200 MB) but it can be loaded in a very similar way.

Labeled Faces in the Wild is a dataset of face photographs for designing and training face recognition algorithms. The photos in the dataset are of famous people such as Tony Blair, Ariel Sharon, Colin Powell and George W. Bush.

The LFW dataset can be loaded from Python using the function fetch_lfw_people(min_faces_per_person=50, resize=0.5), where min_faces_per_person sets the minimum number of pictures per person and resize is a resizing factor applied to each face.
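As a sketch, loading and plotting LFW looks very much like the Olivetti example above (note that the first call downloads the data, roughly 200 MB, so it can take a while):

```python
from sklearn.datasets import fetch_lfw_people
from matplotlib import pyplot as plt

# Download (on first call) and load the LFW faces
faces = fetch_lfw_people(min_faces_per_person=50, resize=0.5)

# Plot the first 15 faces, just like with the Olivetti dataset
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(faces.images[i], cmap=plt.cm.bone)
    ax.set_title(faces.target_names[faces.target[i]], fontsize=8)
plt.show()
```

Unlike the Olivetti faces, LFW comes with the person's name for each image in target_names, which makes it convenient for supervised face recognition experiments.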

Generate Test Data for Circle Classification for Machine Learning

Classification is an important branch of machine learning. This section and the next will help you create some great test datasets for classification problems. Our next scikit-learn function is sklearn.datasets.make_circles.

This section will teach you how to use make_circles to generate two “circle classes” for your machine learning algorithm to classify. In this example the function takes two inputs: the number of samples n_samples and the noise level noise.

Here is the Python code:

# Import libraries
from sklearn.datasets import make_circles # This is how we get the circles!
from matplotlib import pyplot
from pandas import DataFrame

# First, let's generate the data using make_circles()
X, y = make_circles(n_samples=200, noise=0.1)

# Split the data into two vectors (one for each class)
# and organize the data in a dataframe
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
grouped = df.groupby('label')

#Finally, let's plot the results
fig, ax = pyplot.subplots()
colors = ["red", "blue"]
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x1', y='x2', label=key, color=colors[key])
pyplot.show()

Executing the above code gives us the following plot:

[Figure: two concentric circle classes generated with make_circles]

Generate Test Data for Moon Classification

We just looked at how to create circles for classification. Now, let’s look at how to create moon-shaped test data! Scikit-learn also lets you make two half moons to test your classification algorithms.

This time we are going to use the function make_moons to generate two opposite “half moon” classes for our classification problem. This function also needs to know the number of samples n_samples and the noise level noise.

This is how the code will look in Python using sklearn:

# Import libraries
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame

# Generate test data using make_moons()
X, y = make_moons(n_samples=100, noise=0.1)

# Just like with the circles, split and organize the data 
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
grouped = df.groupby('label')

# And plot it
colors = ["red", "blue"]
fig, ax = pyplot.subplots()
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x1', y='x2', label=key, color=colors[key])
pyplot.show()

After running this code, we get:

[Figure: two half-moon classes generated with make_moons]
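The moon data is not linearly separable, which makes it a nice smoke test for a classifier. Here is a minimal sketch using a k-nearest neighbours classifier as an example (any other sklearn classifier would slot in the same way; random_state is fixed only for reproducibility):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate reproducible moon data and hold out a test set
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the classifier and score it on the held-out data
clf = KNeighborsClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the test set
```

Raising the noise level makes the two moons overlap more, which is an easy way to compare how different classifiers cope with harder versions of the same problem.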

Summary

We hope this guide on how to create test data for machine learning in Python using scikit-learn was useful to some of you! If you enjoy the site and you want the guides to keep coming, feel free to leave a comment or follow us on Facebook.
