Recognize Handwritten Digits with a Neural Network in TensorFlow

This guide shows you how to build a neural network that recognizes handwritten digits from the MNIST dataset using TensorFlow, a popular Python library. TensorFlow is an open source software library for machine learning developed by the Google Brain team. Read more about the TensorFlow library at the bottom of this page. Some knowledge of neural networks is needed to get the most out of this guide, although anyone can run the code. Feel free to check out this Introduction to Neural Networks. Throughout the guide, comments in the code explain what is going on. If anything remains unclear, you are very welcome to ask questions in the comment section below this guide.

How to Build a Neural Network in TensorFlow for the MNIST Dataset

In order to create a neural network in TensorFlow, one must implement a series of main functionalities. One way to do this is to design and implement the following functions:

  • initialize_parameters initializes the parameters to be trained by the neural network.
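  • forward_propagation computes the activation of the output layer from the inputs and the parameters.
  • calculate_cost measures how far the calculated labels are from the actual labels.
  • train_parameters runs the optimizer to train the parameters.
  • predict uses the trained parameters to predict labels for new data.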

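First things first. We want to start out by installing TensorFlow. Follow this guide on how to install TensorFlow. Note that the code in this guide uses the TensorFlow 1.x API (tf.Session, tf.placeholder, tf.get_variable), so it needs a 1.x release.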

In this guide, we will be using the MNIST dataset, which can be obtained using the sklearn library. Check out this guide for more information on how to get test data using sklearn.

Here is a quick example of how you can load the MNIST dataset using sklearn:

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

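# Load the MNIST data
# Note: fetch_mldata was removed in scikit-learn 0.22; on newer versions,
# fetch_openml('mnist_784', version=1) is the replacement
data = fetch_mldata('MNIST original')

# Prepare the data for machine learning
X = data.data
y = data.target.reshape(-1, 1)
encoder = OneHotEncoder()
# Fit the encoder and turn each label into a one-hot row
y = encoder.fit_transform(y).toarray()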

# Divide X and y into a train- and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
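
To make the effect of the one-hot encoding step concrete, here is a minimal sketch in plain Python. The helper one_hot is hypothetical and only mimics what OneHotEncoder produces for integer digit labels; it is not part of sklearn.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows


# Three example digit labels, encoded for the 10 MNIST classes
encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded[0])  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each row has the same length as the output layer, which is why the layout later uses len(y_train[0]) as the size of the last layer.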

Now we are ready to get started with deep learning in TensorFlow!


Initialize Parameters

The first function we are going to write is initialize_parameters(layout). This function takes the layout of the neural network as input and returns initial parameters for our model. The parameters are a weight matrix W and a bias vector b for each layer.

The neural network layout is defined as a dictionary as follows. Feel free to try different network layouts.

layout = {"l0":len(X_train[0]), "l1":100, "l2":100, "l3":60, "l4":len(y_train[0])}
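
The layout fixes the shape of every parameter in the network. The sketch below lists those shapes in plain Python, assuming the [current_layer_size, previous_layer_size] convention for W used in this guide, 784 input pixels (28 x 28 images), and 10 digit classes. The helper parameter_shapes is hypothetical and exists only for illustration.

```python
def parameter_shapes(layout):
    """Map each parameter name to its shape; W is [current, previous], b is [current, 1]."""
    shapes = {}
    # Every layer except the input layer l0 gets a W and a b
    for i in range(1, len(layout)):
        current = layout["l" + str(i)]
        previous = layout["l" + str(i - 1)]
        shapes["W" + str(i)] = (current, previous)
        shapes["b" + str(i)] = (current, 1)
    return shapes


# Assumed sizes: 784 input features and 10 output classes, as in MNIST
example_layout = {"l0": 784, "l1": 100, "l2": 100, "l3": 60, "l4": 10}
shapes = parameter_shapes(example_layout)
total = sum(rows * cols for rows, cols in shapes.values())
print(shapes["W1"], shapes["b4"], total)  # (100, 784) (10, 1) 95270
```

A five-entry layout therefore yields four pairs of W and b, one for each hidden layer and one for the output layer.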

For initialize_parameters(), we want to loop over each of the layers in layout and initialize parameters W and b for each.

import tensorflow as tf

def initialize_parameters(layout):
    """
    Arguments:
        layout (dict)     : Layer sizes of the neural network

    Returns:
        parameters (dict) : Initialized parameters
    """
    # Create parameters dictionary to store W and b and initialize counter i
    parameters = {}
    i = 1
    # Loop over all the hidden layers and the output layer
    # and save initialized values for W and b in parameters
    for _ in layout:
        # Define layer conditions
        W = "W" + str(i)
        b = "b" + str(i)
        current_layer_size = layout["l" + str(i)]
        previous_layer_size = layout["l" + str(i-1)]
        # Initialize parameters: random values for W, zeros for b
        # (assumed initializers: Glorot uniform for W, zeros for b)
        parameters[W] = tf.get_variable(W, [current_layer_size, previous_layer_size],
                                        initializer=tf.glorot_uniform_initializer())
        parameters[b] = tf.get_variable(b, [current_layer_size, 1],
                                        initializer=tf.zeros_initializer())
        # Move to next layer
        i += 1
        # Stop after the output layer (the input layer l0 needs no parameters)
        if i == len(layout):
            break

    return parameters

Notice how we use tf.get_variable to initialize both W and b. b is just initialized as a zero vector, but we must initialize W to random values to avoid risking a situation where no nodes will ever fire.
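
The following NumPy sketch illustrates why W needs random values. It is an illustration under assumed sizes (2 samples, 4 features, 3 nodes), not the TensorFlow code from this guide: with zero weights, a relu layer outputs zero for every input, so no information about the input is passed on.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(2, 4)                 # two different inputs with 4 features each
b = np.zeros((3, 1))               # zero bias vector for a layer with 3 nodes

W_zero = np.zeros((3, 4))          # weights that were never randomized
Z_zero = X @ W_zero.T + b.T        # every pre-activation is 0
A_zero = np.maximum(Z_zero, 0.0)   # relu keeps them at 0 for both inputs

W_random = rng.normal(scale=0.1, size=(3, 4))
Z_random = X @ W_random.T + b.T    # now the two inputs give different values

print(A_zero)
print(Z_random)
```

With the random weights, the two rows of Z_random differ, which gives the optimizer something to work with.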

Forward Propagation

The next step is to write forward_propagation(X, parameters), which takes inputs X and parameters and returns the activation of the output layer, A_output. The input X will be our training data, while parameters are the parameters obtained from the initialization function above.

def forward_propagation(X, parameters):
    """
    Arguments:
        X (float array)   : Features (X_train)
        parameters (dict) : Parameters W and b

    Returns:
        A_output (tf.Tensor) : The activation of the output layer
    """
    # Define a dictionary with the input (activation) layer
    # to add more activations as we iterate over the layers
    temp_dict = {"A0": X}
    # Determine the number of layers to iterate over
    layer_count = int(len(parameters) / 2)

    for i in range(1, layer_count+1):
        # Define layer conditions
        W = "W" + str(i)
        b = "b" + str(i)
        Z = "Z" + str(i)
        A = "A" + str(i)
        A_previous = "A" + str(i-1)
        # Perform forward propagation
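        # Z = A_previous * W^T + b^T (the operands after the cast are reconstructed)
        temp_dict[Z] = tf.add(tf.matmul(tf.cast(temp_dict[A_previous], tf.float32),
                                        tf.transpose(parameters[W])),
                              tf.transpose(parameters[b]))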
        # Use a relu activation function for the hidden layers
        # and a sigmoid activation function for the output layer
        if i == layer_count:
            temp_dict[A] = tf.nn.sigmoid(temp_dict[Z])
        else:
            temp_dict[A] = tf.nn.relu(temp_dict[Z])
    # Return only the activation of the output layer
    A_output = temp_dict["A" + str(layer_count)]
    return A_output

In this code, we have chosen to use a relu activation function tf.nn.relu for the hidden layers and a sigmoid activation function tf.nn.sigmoid for the output layer. You are free to play around with this and see what works best for you.
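
To follow the shapes through the network without starting a TensorFlow session, the same computation can be sketched in NumPy. This is an illustrative mirror of the function above, not a replacement for it. The name forward_propagation_np and the tiny 4-5-3 network are assumptions made for the example.

```python
import numpy as np


def relu(Z):
    return np.maximum(Z, 0.0)


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


def forward_propagation_np(X, parameters):
    """X is [samples, features]; W{i} is [current, previous]; b{i} is [current, 1]."""
    layer_count = len(parameters) // 2
    A = X
    for i in range(1, layer_count + 1):
        Z = A @ parameters["W" + str(i)].T + parameters["b" + str(i)].T
        # relu for the hidden layers, sigmoid for the output layer
        A = sigmoid(Z) if i == layer_count else relu(Z)
    return A


# Hypothetical tiny network: 4 inputs, 5 hidden nodes, 3 outputs
rng = np.random.RandomState(0)
sizes = [4, 5, 3]
tiny_parameters = {}
for i in range(1, len(sizes)):
    tiny_parameters["W" + str(i)] = rng.normal(scale=0.1, size=(sizes[i], sizes[i - 1]))
    tiny_parameters["b" + str(i)] = np.zeros((sizes[i], 1))

A_output = forward_propagation_np(rng.rand(2, 4), tiny_parameters)
print(A_output.shape)  # (2, 3)
```

Each row of A_output holds one value between 0 and 1 per output node, which is what the cost function compares against the one-hot labels.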

Calculate Cost

Now let’s define a cost function. The optimizer in TensorFlow uses this function to determine how much the parameters should be corrected in each iteration. The calculate_cost() function takes A_output, y, and parameters as inputs and returns the cost.

def calculate_cost(A_output, y, parameters):
    """
    Arguments:
        A_output (tf.Tensor) : Activation of the output layer
        y (float array)      : Labels
        parameters (dict)    : Parameters W and b

    Returns:
        cost (tf.Tensor)     : The mean squared difference between calculated and actual labels
    """
    # We will compare the calculated labels to the actual ones
    calculated_labels = tf.transpose(A_output)
    actual_labels = tf.transpose(y)

    # Determine the cost as the mean of the squared differences
    cost = tf.reduce_mean(tf.squared_difference(calculated_labels, actual_labels))
    return cost

Here we use the squared difference in tf.squared_difference, but it is also possible to calculate the cost in other ways. Another popular choice is tf.nn.softmax_cross_entropy_with_logits, which expects the raw output layer values Z (the logits) rather than the sigmoid activations.
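
To compare the two cost choices on concrete numbers, here is a small NumPy sketch. The functions mean_squared_cost and softmax_cross_entropy are illustrative stand-ins written for this example, not the TensorFlow implementations, and the input values are made up.

```python
import numpy as np


def mean_squared_cost(A_output, y):
    """Mean of the squared differences between activations and one-hot labels."""
    return float(np.mean((A_output - y) ** 2))


def softmax_cross_entropy(logits, y):
    """Mean cross-entropy; expects raw scores (logits), not sigmoid activations."""
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.sum(y * log_probs, axis=1)))


y = np.array([[1.0, 0.0], [0.0, 1.0]])
A_output = np.array([[0.9, 0.1], [0.2, 0.8]])
logits = np.array([[2.0, 0.0], [0.0, 2.0]])

mse = mean_squared_cost(A_output, y)          # approximately 0.025
cross_entropy = softmax_cross_entropy(logits, y)
print(mse, cross_entropy)
```

Both values shrink toward zero as the outputs approach the one-hot labels, so either can serve as the quantity the optimizer minimizes.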

Train Parameters (Backward Propagation)

Now it is time to train our model. The function train_parameters(X_train, y_train, X_test, y_test, layout, learning_rate=0.005, epochs=20, batch_size=1000) takes a rather long list of inputs that should be mostly self-explanatory. More information about the inputs is provided in the docstring below. The function returns the trained parameters.

We have already written all the functions we need in a neural network except for backward propagation. This is, however, done very easily in TensorFlow as you do not have to write the code for the back propagation yourself. You simply have to define an optimizer and run it. The code below shows this.

Let's put together everything we have done so far:

def train_parameters(X_train, y_train, X_test, y_test, layout, 
                     learning_rate=0.005, epochs=20, batch_size=1000):
    """
    Arguments:
        X_train (float array) : Train features
        y_train (float array) : Train labels
        X_test (float array)  : Test features
        y_test (float array)  : Test labels
        layout (dict)         : Layer sizes of the neural network
        learning_rate (float) : The learning rate for training
        epochs (int)          : Epochs through the data
        batch_size (int)      : Size of each data batch

    Returns:
        parameters (dict)     : The trained model parameters
    """
    # Reset to default graph to avoid overwriting tf.variables
    tf.reset_default_graph()

    # Create placeholders for tensors X and y
    (m, features_count) = X_train.shape
    labels_count = y_train.shape[1]
    X = tf.placeholder(tf.float32, [None, features_count], name="X")
    y = tf.placeholder(tf.float32, [None, labels_count], name="y")

    # Initialize parameters
    parameters = initialize_parameters(layout)

    # Do the forward propagation
    A_output = forward_propagation(X, parameters)

    # Calculate the cost
    cost = calculate_cost(A_output, y, parameters)

    # Define the tensorflow optimizer to do the back propagation
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Initialize all the variables
    init = tf.global_variables_initializer()
    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:

        # Run the global initializer
        sess.run(init)
        # Loop over all epochs while training the model
        for epoch in range(epochs):
            epoch_cost = 0
            # Split the data into batches of size batch_size
            for batch in range(int(m / batch_size)):
                # Get a batch of X and y                
                X_batch = X_train[batch*batch_size:(1 + batch)*batch_size]
                y_batch = y_train[batch*batch_size:(1 + batch)*batch_size]
                # Run the optimizer
                _ , batch_cost = sess.run([optimizer, cost], feed_dict={X: X_batch, y: y_batch})
                epoch_cost += batch_cost
            # Print the cost to check if it's decreasing
            print("Cost after epoch " + str(epoch) + ": " + str(epoch_cost))

        print("Training complete!")

        # Check how many of the predictions were correct
        #check_predictions = tf.equal(tf.round(A_output), y)
        check_predictions = tf.equal(tf.argmax(A_output, axis=1), tf.argmax(y, axis=1))

        # Check accuracy on train and test set
        accuracy = tf.reduce_mean(tf.cast(check_predictions, tf.float32))
        accuracy_train = accuracy.eval({X: X_train, y: y_train})
        accuracy_test = accuracy.eval({X: X_test, y: y_test})
        print("Accuracy on the training data: " + str(accuracy_train*100) + " %")
        print("Accuracy on the test data: " + str(accuracy_test*100) + " %")
        # Save trained parameters
        # Evaluate the variables so that parameters holds plain arrays
        parameters = sess.run(parameters)
        return parameters
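
The batch loop in train_parameters slices the training data by index. The plain Python sketch below reproduces only that index arithmetic; the helper batch_slices is hypothetical. It shows that int(m / batch_size) drops any samples left over after the last full batch.

```python
def batch_slices(sample_count, batch_size):
    """Return the (start, stop) index pairs used to slice each batch."""
    return [(batch * batch_size, (1 + batch) * batch_size)
            for batch in range(int(sample_count / batch_size))]


# 10 samples in batches of 4: samples 8 and 9 are never used
print(batch_slices(10, 4))  # [(0, 4), (4, 8)]

# Assuming 56,000 training samples (80 % of the 70,000 MNIST images)
print(len(batch_slices(56000, 1000)))  # 56
```

With the default batch_size of 1000 and that training set size, nothing is dropped, but other batch sizes may leave a remainder unused in every epoch.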

The output in the terminal after running train_parameters looks like this:

>>> train_parameters(X_train, y_train, X_test, y_test, layout)
Cost after epoch 0: 2.29775438644
Cost after epoch 1: 0.968977658078
Cost after epoch 2: 0.84340425767
Cost after epoch 3: 0.536806614138
Cost after epoch 4: 0.252858966356
Cost after epoch 5: 0.208994472632
Cost after epoch 6: 0.17735228478
Cost after epoch 7: 0.156209989451
Cost after epoch 8: 0.143354448723
Cost after epoch 9: 0.132179139298
Cost after epoch 10: 0.123276243918
Cost after epoch 11: 0.119895999436
Cost after epoch 12: 0.112337087048
Cost after epoch 13: 0.107231109519
Cost after epoch 14: 0.0964344530366
Cost after epoch 15: 0.0971446636831
Cost after epoch 16: 0.0877964202664
Cost after epoch 17: 0.077983592113
Cost after epoch 18: 0.0708390149521
Cost after epoch 19: 0.0632036953466
Training complete!
Accuracy on the training data: 99.2500007153 %
Accuracy on the test data: 96.9714283943 %

This is far from an optimal result on the MNIST dataset, but it is sufficient for this guide. You should be able to improve it by experimenting with hyperparameters such as layout, learning_rate, epochs, and batch_size. Also, consider adding regularization to the cost!
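
One common form of regularization is an L2 penalty on the weights. The NumPy sketch below shows the idea on made-up numbers. The function l2_penalty and the strength value are assumptions for illustration and are not part of this guide's code.

```python
import numpy as np


def l2_penalty(parameters, strength):
    """Sum of the squared weights (biases excluded), scaled by strength."""
    return strength * sum(float(np.sum(value ** 2))
                          for name, value in parameters.items()
                          if name.startswith("W"))


# Made-up parameters and cost, just to show the arithmetic
example_parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[5.0]])}
base_cost = 0.25
regularized_cost = base_cost + l2_penalty(example_parameters, strength=0.01)
print(regularized_cost)  # 0.25 + 0.01 * (1 + 4), approximately 0.3
```

In TensorFlow 1.x, tf.nn.l2_loss(W) computes sum(W ** 2) / 2, so it could play the role of the inner sum when adding such a term inside calculate_cost.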


Predict

For the model parameters to be useful, we need a function that takes new data along with the trained parameters and predicts labels for the data. This function is called predict(X_predict, parameters) and returns the predicted label(s).

def predict(X_predict, parameters):
    """
    Arguments:
        X_predict (float array)  : Data for which we want to predict a label
        parameters (dict)        : The trained model parameters

    Returns:
        prediction (int array)   : Predicted label(s)
    """
    # Reshape X_predict if necessary
    X_predict = X_predict.reshape([1,len(X_predict)]) if len(X_predict.shape)==1 else X_predict
    # Create placeholder for X
    features_count = X_predict.shape[0] if len(X_predict.shape)==1 else X_predict.shape[1]
    X = tf.placeholder(tf.float32, [None, features_count], name="X")
    # Do the forward propagation
    A_output = forward_propagation(X,parameters)
    # Get the prediction
    prediction = tf.argmax(A_output, axis=1)
    # Evaluate the result
    with tf.Session():
        prediction = prediction.eval({X: X_predict})
        return prediction

Let's see what we get when we test predict(), with parameters holding the dictionary returned by train_parameters:

>>> y_predict = predict(X_test[2], parameters)[0]
>>> y_actual = list(y_test[2]).index(1)
>>> print("Predicted label: " + str(y_predict) + "\nActual label: " + str(y_actual))
Predicted label: 8
Actual label: 8

It seems like everything works as intended. Now try to predict all labels for X_test at once:

>>> predict(X_test, parameters)
array([1, 8, 8, ..., 0, 8, 4], dtype=int64)
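
Both predict and the accuracy check in train_parameters rely on argmax: the predicted digit is the index of the largest output activation. A minimal plain Python sketch of that step follows, using made-up activations for three samples and three classes. The helpers argmax and accuracy are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda index: values[index])


def accuracy(outputs, one_hot_labels):
    """Fraction of samples whose predicted index matches the label index."""
    matches = [argmax(output) == argmax(label)
               for output, label in zip(outputs, one_hot_labels)]
    return sum(matches) / len(matches)


outputs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

print([argmax(output) for output in outputs])  # [1, 0, 2]
print(accuracy(outputs, labels))               # 0.6666666666666666
```

The second sample is predicted as class 0 while its label is class 2, so two out of three predictions are correct.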



That was our guide on how to build a deep neural network in TensorFlow that can be used for image classification. You have learned how to write each of the main functions for a neural network in TensorFlow. We hope that you enjoyed the guide!