How to Generate Test Data for Machine Learning in Python using scikit-learn
A great place to start when testing a new machine learning algorithm is to generate test data. Collecting data can be a tedious task, and often the best (and easiest) solution is to use generated data rather than collecting it yourself. More often than not, you simply want to compare different machine learning algorithms and don't care about the origin of the data. The Python library scikit-learn (sklearn) lets you create test datasets suited to many different machine learning problems. Scikit-learn is a popular library that contains a wide range of machine learning algorithms and can be used for data mining and data analysis.
Table of Contents
- All scikit-learn Test Datasets and How to Load Them From Python
- Create Linear Regression Data
- Generate Test Data for Clustering
- Data for Face Recognition
- Circle Classification Data for Machine Learning
- Test Data for Moon Classification
There are two ways to generate test data in Python using sklearn. The first is to load existing datasets, as explained in the following section. The second is to create the test data yourself using sklearn. This guide will go over both approaches.
All scikit-learn Test Datasets and How to Load Them From Python
Python's scikit-learn library has a great list of test datasets available for you to play around with. These "toy datasets" are meant for testing machine learning algorithms, and each is returned by one of the following functions:
- `load_boston()` – Boston housing prices for regression (note: deprecated and removed in scikit-learn 1.2)
- `load_iris()` – the iris dataset for classification
- `load_diabetes()` – the diabetes dataset for regression
- `load_digits()` – images of digits for classification
- `load_linnerud()` – the linnerud dataset for multivariate regression
- `load_wine()` – the wine dataset for classification
- `load_breast_cancer()` – the breast cancer dataset for classification
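All of these loaders return the same kind of object (a `Bunch` with a `data` array of features and a `target` array of labels), so once you know one, you know them all. A minimal sketch using the iris dataset to illustrate the shared layout:

```python
from sklearn.datasets import load_iris

# Load the iris toy dataset
iris = load_iris()

# Every toy dataset exposes feature data and target labels
print(iris.data.shape)    # 150 samples, 4 features
print(iris.target.shape)  # one label per sample
print(list(iris.target_names))
```

The same `.data` / `.target` pattern works for `load_wine()`, `load_digits()`, and the rest.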
Here's a quick example of how to load one of the datasets above.
```python
# Import libraries
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt

# Load the data
data = load_digits()

# Plot one of the digits ("8" in this case)
plt.gray()
plt.matshow(data.images[8])
plt.show()
```
Which gives us this figure:
Now that we have seen how to load test data, let's look into how to generate the data ourselves.
Generate Test Data for Linear Regression Problems
Regression is a technique used to estimate the relationship between variables. In linear regression, one wishes to find the best possible linear fit correlating two or more variables. Regression belongs to the machine learning branch called supervised learning. The data is generated with the `sklearn.datasets.make_regression()` function.
Let’s have an example in Python of how to generate test data for a linear regression problem using sklearn.
```python
# Import libraries
from sklearn import datasets
from matplotlib import pyplot as plt

# Get regression data from scikit-learn
x, y = datasets.make_regression(n_samples=20, n_features=1, noise=0.5)

# Visualize the data
plt.scatter(x, y)
plt.show()
```
`make_regression()` takes several inputs, as shown in the example above: the number of test data points to generate (`n_samples`), the number of input features (`n_features`), and finally the noise level (`noise`) in the output data.
The following result is obtained by running the code in Python.
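Because `make_regression()` already returns data in the `(X, y)` shape scikit-learn estimators expect, you can feed it straight into a model. A minimal sketch (fitting the model is our addition here, not part of the original example):

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Generate the same kind of regression data as above
x, y = datasets.make_regression(n_samples=20, n_features=1, noise=0.5)

# Fit a linear model to the generated data
model = LinearRegression().fit(x, y)
print(model.coef_.shape)  # one coefficient per feature
```

This makes generated data especially convenient for quickly comparing regression algorithms.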
Create Test Data for Clustering
Clustering has to do with finding distinct clusters or patterns in one's data. There are various machine learning algorithms that can classify data into clusters. Read more about clustering here. We create the data using the `sklearn.datasets.make_blobs()` function (in older scikit-learn versions it was located under `sklearn.datasets.samples_generator`).
`make_blobs` from sklearn can be used to generate clustering data for any number of features (`n_features`), together with the corresponding labels. A code example using scikit-learn, pandas, and matplotlib is shown below.
```python
# Import libraries
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import pandas as pd

# Create test data: features (X) and labels (y)
X, y = make_blobs(n_samples=200, centers=4, n_features=2)

# Group the data by labels
Xy = pd.DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))
groups = Xy.groupby('label')

# Plot the blobs
fig, ax = plt.subplots()
colors = ["blue", "red", "green", "purple"]
for idx, classification in groups:
    classification.plot(ax=ax, kind='scatter', x='x1', y='x2',
                        label=idx, color=colors[idx])
plt.show()
```
This gives us the following plot:
Generate Test Data for Face Recognition – The Olivetti Faces Dataset
Now for my favourite dataset from scikit-learn: the Olivetti faces. Let's generate test data for facial recognition using Python and sklearn.
The Olivetti faces test data is quite old, as all the photos were taken between 1992 and 1994. All the photos are black and white, 64×64 pixels, and the faces are centered, which makes them ideal for testing a face recognition machine learning algorithm.
The images are retrieved from sklearn in Python using the function `fetch_olivetti_faces()`. When calling this function, Python will load all the images, which may take some time. When plotting, it can therefore be a good idea to show only a small subset of the images to avoid memory problems.
Here is a Python example of how to load the Olivetti faces from sklearn using the `fetch_olivetti_faces()` function.
```python
# Import libraries
from sklearn import datasets
from matplotlib import pyplot as plt

# Get the data using fetch_olivetti_faces()
faces = datasets.fetch_olivetti_faces()

# Visualize the faces (let's take the first 15)
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(faces.images[i], cmap=plt.cm.bone)
plt.show()
```
And here we see the first 15 faces of the Olivetti faces dataset:
Another face images dataset from sklearn – Labeled Faces in the Wild
For a newer, colorised dataset, we suggest using the Labeled Faces in the Wild (LFW) dataset. It is larger (about 200 MB) but can be loaded in a very similar way.
Labeled Faces in the Wild is a dataset of face photographs for designing and training face recognition algorithms. The photos in the dataset are of famous people such as Tony Blair, Ariel Sharon, Colin Powell and George W. Bush.
The LFW dataset can be loaded from Python using the function `fetch_lfw_people(min_faces_per_person=50, resize=0.5)`, which takes a minimum number of faces per person (`min_faces_per_person`) and a resizing factor (`resize`).
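Here is a sketch of loading and plotting the LFW faces, analogous to the Olivetti example above. Note that the first call downloads the dataset (roughly 200 MB), so it can take a while:

```python
from sklearn.datasets import fetch_lfw_people
from matplotlib import pyplot as plt

# Download (on first call) and load the LFW faces
lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.5)

# Visualize the first 15 faces, titled with the person's name
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw.images[i], cmap=plt.cm.bone)
    ax.set_title(lfw.target_names[lfw.target[i]], fontsize=8)
plt.show()
```

Unlike the toy datasets, `fetch_lfw_people` also provides `target_names`, so each image can be matched to the person it shows.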
Generate Test Data for Circle Classification for Machine Learning
Classification is an important branch of machine learning. This section and the next will help you create some great test datasets for classification problems. Our next scikit-learn function is `sklearn.datasets.make_circles()`.
This section will teach you how to use the function `make_circles()` to make two "circle classes" for your machine learning algorithm to classify. The method takes two inputs: the amount of data you want to generate (`n_samples`) and the noise level in the data (`noise`).
Here is the Python code:
```python
# Import libraries
from sklearn.datasets import make_circles  # This is how we get the circles!
from matplotlib import pyplot
from pandas import DataFrame

# First, let's generate the data using make_circles()
X, y = make_circles(n_samples=200, noise=0.1)

# Split the data into two vectors (one for each class)
# and organize the data in a dataframe
df = DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))
grouped = df.groupby('label')

# Finally, let's plot the results
fig, ax = pyplot.subplots()
colors = ["red", "blue"]
labels = ["class 0", "class 1"]
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x1', y='x2',
               label=labels[key], color=colors[key])
pyplot.show()
```
Executing the above code gives us the following plot:
Generate Test Data for Moon Classification
We just looked at how to create circles for classification. Now, let's look at how to create test data moons! Scikit-learn also lets you make two half moons to test your classification algorithms.
This time we are going to use the function `make_moons()` to generate two opposite "half moon" classes for our classification problem. This function also needs to know the amount of data you want to generate (`n_samples`) and the noise level (`noise`).
This is how the code will look in Python using sklearn:
```python
# Import libraries
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame

# Generate test data using make_moons()
X, y = make_moons(n_samples=100, noise=0.1)

# Just like with the circles, split and organize the data
df = DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))
grouped = df.groupby('label')

# And plot it
colors = ["red", "blue"]
labels = ["class 0", "class 1"]
fig, ax = pyplot.subplots()
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x1', y='x2',
               label=labels[key], color=colors[key])
pyplot.show()
```
After running this code, we get:
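Generated moons are also a quick way to exercise a full classification workflow end to end. A minimal sketch using a k-nearest-neighbours model (our choice of classifier here for illustration, not something the moons data requires):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate moons and hold out a test split
X, y = make_moons(n_samples=200, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit and score a simple classifier on the generated data
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Swapping in a different classifier is a one-line change, which is exactly why generated test data is so handy for comparing algorithms.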
We hope this guide on how to create test data for machine learning in Python using scikit-learn was useful to some of you! If you enjoy the site and you want the guides to keep coming, feel free to leave a comment or follow us on Facebook.