# 6 Steps To Write Any Machine Learning Algorithm From Scratch: Perceptron Case Study

**Writing a machine learning algorithm from scratch is an extremely rewarding learning experience.**

It provides you with that “ah ha!” moment where it finally clicks, and you understand what’s really going on under the hood.

Some algorithms are just more complicated than others, so start with something simple Instead, start with something very simple, such as the single layer Perceptron.

I’ll walk you through the following 6-step process to write algorithms from scratch, using the Perceptron as a case-study:

- Get a basic understanding of the algorithm
- Find some different learning sources
- Break the algorithm into chunks
- Start with a simple example
- Validate with a trusted implementation
- Write up your process

## Get a Basic Understanding

This goes back to what I originally stated. If you don’t understand the basics, don’t tackle an algorithm from scratch.

At the very least, you should be able to answer the following questions:

- What is it?
- What is it typically used for?
- When CAN’T I use this?

For the Perceptron, let’s go ahead and answer these questions:

- The single layer Perceptron is the most basic neural network. It’s typically used for binary classification problems (1 or 0, “yes” or “no”).
- Some simple uses might be sentiment analysis (positive or negative response) or loan default prediction (“will default”, “will not default”). For both cases, the decision boundary would need to be linear.
- If the decision boundary is non-linear, you really can’t use the Perceptron. For those problems, you’ll need to use something different.

## Use Different Sources for Learning

After you have a basic understanding of the model, it’s time to start doing your research.

Some people learn better with textbooks, some people learn better with video.

Personally, I like to bounce around and use various types of sources.

For the mathematical details, textbooks do a great job, but for more practical examples, I prefer blog posts and YouTube videos.

For the Perceptron, here’s some great sources:

**Textbooks**

- The Elements of Statistical Learning, Sec. 4.5.1
- Understanding Machine Learning: From Theory To Algorithms, Sec. 21.4

**Blogs**

- How To Implement The Perceptron Algorithm From Scratch In Python, by Jason Brownlee
- Single-Layer Neural Networks and Gradient Descent, by Sebastian Raschka

**Videos**

## Break The Algorithm Into Chunks

Now that we’ve gathered our sources, it’s time to start learning.

Rather than read a chapter or blog post all the way through, start by skimming for section headings, and other important info.

Write down bullet points, and try to outline the algorithm.

After going through the sources, I’ve broken the Perceptron into the following 5 chunks:

- Initialize the weights
- Multiply weights by input and sum them up
- Compare result against the threshold to compute the output (1 or 0)
- Update the weights
- Repeat

Let’s go through each in detail.

**1. Initialize the weights**

To start, we’ll initialize the weight vector.

The number of weights needs to match the number of features. Assuming we have three features, here’s what the weight vector will look likeThe weight vector is usually initialized with zeros, so I’m going to stick with that for the example.

**2. Multiply weights by inputs and sum them up**

Next, we’re going to multiply the weights by the inputs, and then sum them up.

To make it easier to follow along, I’ve colored the weights and their corresponding features in the first row.

After we multiply the weights by the features, we sum them up. This is also known as the dot product.

The final result is 0. I’m going to refer to this temporary result as “f”.

**3. Compare against threshold**

After we’ve calculated the dot product, we need to compare it against a threshold.

I’ve chosen to use 0 as my threshold, but you can play around with this and try some different numbers.

Since the dot product “f” that we calculated (0) is not greater than our threshold (0), our estimate is equal to zero.

I’ve denoted the estimate as a y with a hat (aka, “y hat”), with a 0 subscript to correspond with the first row. You could have used a 1 for the first row instead, it doesn’t really matter. I just chose to start with 0.

If we compare this result with the actual value, we can see that our current weights did not correctly predict the actual output.

Since our prediction was wrong, we need to update the weights, which brings us to the next step.

**4. Update the weights**

Next, we’re going to update the weights. Here’s the equation that we’re going to use:

The basic idea is that we adjust the current weight at iteration “n” so that we get a new weight to use in the next iteration, “n+1”.

To adjust the weight, we need to set a “learning rate”. This is denoted by the Greek letter “eta”.

I’ve chosen to use 0.1 for the learning rate, but you can play around with different numbers, just like with the threshold.

Here’s a quick summary of what we have so far:

Now let’s go ahead and calculate the new weights for iteration n=2.

We’ve successfully completed the first iteration of the Perceptron algorithm.

**5. Repeat**

Since our algorithm didn’t compute the right output, we need to keep going.

Typically we’ll need numerous iterations. Looping through each row in the data set, we’ll update the weights each time.

One complete sweep through the data set is known as an “epoch.”

Since our data set has 3 rows, it will take us three iterations to complete 1 epoch.

We could either set a total number of iterations or epochs to continue performing the algorithm. Maybe we want to specify 30 iterations (or 10 epochs).

As with the threshold and learning rate, the number of epochs is a parameter that you can play around with.

In the next iteration, we move on to the second row of features.

I won’t go through every step again, but here’s the next calculation of the dot product:

Next we would compare the dot product with the threshold to calculate a new estimate, update the weights, and then keep going. If our data is linearly separable, the Perceptron will converge.

## Start With A Simple Example

Now that we’ve broken the algorithm into chunks by hand, it’s time to start implementing it in code.

To keep things simple, I always like to start with a very small “toy data set.”

A nice small, linearly separable data set for this type of problem is a NAND gate. This is a common logic gate used in digital electronics.Since this is a fairly small data set, we can just manually enter it in to Python.

I’m going to add in a dummy feature “x0” which is a column of 1’s. I did this so that our model will calculate the bias term.

You can think of the bias as the intercept term that correctly allows our model to separate the two classes.

Here’s the code for entering in the data:

```
# Importing libraries
# NAND Gate
# Note: x0 is a dummy variable for the bias term
# x0 x1 x2
x = [[1., 0., 0.],
[1., 0., 1.],
[1., 1., 0.],
[1., 1., 1.]]
y =[1.,
1.,
1.,
0.]
```

As with the previous section, I’ll step through the algorithm in chunks, writing code and testing it as we go.

**1. Initialize the weights**

The first step is to initialize the weights.

```
# Initialize the weights
import numpy as np
w = np.zeros(len(x[0]))
```

```
Out:
[ 0. 0. 0.]
```

Keep in mind that the length of the weight vector needs to match the number of features. For this NAND gate example, that length is 3.

**2. Multiply weights by inputs and sum them up**

Next, we’re going to multiply the weights by the inputs, and then sum them up.

Another name for this is the “dot product.”

Again, we can use Numpy to easily perform this operation. The method we’ll use is `.dot()`

.

Let’s start by taking the dot product of the weight vector and the first row of features.

```
# Dot Product
f = np.dot(w, x[0])
print f
```

```
Out:
0.0
```

As expected, the result is 0.

To stay consistent with our notes in the previous section, I assigned the dot product to the variable “f”.

**3. Compare against the threshold**

After we’ve computed the dot product, we’re ready to compare the result against the threshold to make our prediction of the output.

Again, I’m going to stay consistent with our notes in the previous section.

I’m going to make the threshold “z” equal to 0. If the dot product “f” is greater than 0, our prediction will be 1. Otherwise, it will be zero.

Keep in mind that the prediction is usually denoted with a carat over the top, also know as a “hat”. The variable I’ll assign the prediction to is `yhat`

.

```
# Activation Function
z = 0.0
if f > z:
yhat = 1.
else:
yhat = 0.
print yhat
```

```
Out:
0.0
```

As expected, the prediction is 0.

You’ll notice that in the note above the code I’ve called this the “Activation function”. This is a more formal description of what we’re doing.

Taking a look at the first row of the NAND output, we can see that the actual value is 1. Since our prediction was wrong, we’ll need to go ahead and update the weights.

**4. Update the weights**

Now that we’ve made our prediction, we’re ready to update the weights.

We’ll need to set a learning rate before we can do that. To be consistent with our previous example, I’ll assign the learning rate “eta” a value of 0.1.

I’m going to hard code the update for each weight to make things easier to read.

```
# Update the weights
eta = 0.1
w[0] = w[0] + eta*(y[0] - yhat)*x[0][0]
w[1] = w[1] + eta*(y[0] - yhat)*x[0][1]
w[2] = w[2] + eta*(y[0] - yhat)*x[0][2]
print w
```

```
Out:
[ 0.1 0. 0. ]
```

We can see that our weights have now been updated, so we’re ready to keep going.

**5. Repeat**

Now that we’ve worked through each step, it’s time to put everything together.

The one final piece we haven’t discussed is our loss function. This is the function we’re trying to minimize, which in our case will be the sum-of-squared (SSE) error.

This is what we’ll use to calculate our error, and see how the model is performing.

Tying it all together, here’s what the full function looks like:

```
import numpy as np
# Perceptron function
def perceptron(x, y, z, eta, t):
'''
Input Parameters:
x: data set of input features
y: actual outputs
z: activation function threshold
eta: learning rate
t: number of iterations
'''
# initializing the weights
w = np.zeros(len(x[0]))
n = 0
# initializing additional parameters to compute sum-of-squared errors
yhat_vec = np.ones(len(y)) # vector for predictions
errors = np.ones(len(y)) # vector for errors (actual - predictions)
J = [] # vector for the SSE cost function
while n < t: for i in xrange(0, len(x)): # dot product f = np.dot(x[i], w) # activation function if f >= z:
yhat = 1.
else:
yhat = 0.
yhat_vec[i] = yhat
# updating the weights
for j in xrange(0, len(w)):
w[j] = w[j] + eta*(y[i]-yhat)*x[i][j]
n += 1
# computing the sum-of-squared errors
for i in xrange(0,len(y)):
errors[i] = (y[i]-yhat_vec[i])**2
J.append(0.5*np.sum(errors))
return w, J
```

Now that we’ve coded the full perceptron, let’s go ahead and run it:

```
# x0 x1 x2
x = [[1., 0., 0.],
[1., 0., 1.],
[1., 1., 0.],
[1., 1., 1.]]
y =[1.,
1.,
1.,
0.]
z = 0.0
eta = 0.1
t = 50
print "The weights are:"
print perceptron(x, y, z, eta, t)[0]
print "The errors are:"
print perceptron(x, y, z, eta, t)[0]
```

```
Out:
The weights are:
[ 0.2 -0.2 -0.1]
The errors are:
[0.5, 1.5, 1.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Taking a look at the errors we can see that the error went to 0 around the 6th iteration. For the remainder of the iterations it remained at 0.

When the error goes to 0 and stays there, we know that our model has converged. This tells us that our model has correctly “learned” the appropriate weights.

In the next section we’ll use our calculated weights on a larger data

set to make predictions.

## Validate with a trusted implementation

Up to this point we’ve found different learning resources, worked through the algorithm by hand, and tested it in code with a simple example.

Now it’s time to compare our results with a trusted implementation. For comparison, we’ll use the Perceptron from scikit-learn.

We’ll work through this comparison using the following steps:

- Import the data
- Split the data into train/test sets
- Train our Perceptron
- Test the Perceptron
- Compare with the scikit-learn Perceptron

**1. Import the data**

Lets start by importing the data. You can get a copy of the dataset here.

This is a linearly separable dataset that I created to make sure that the Perceptron would work. Just to confirm, let’s go ahead and plot the data.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("dataset.csv")
plt.scatter(df.values[:,1], df.values[:,2], c = df['3'], alpha=0.8)
```

text

Taking a look at the plot it’s pretty easy to see that we can separate this data with a straight line.

Before we move on, I’ll explain my plotting code above.

I used Pandas to import the csv, which automatically puts the data in a dataframe.

To plot the data I had to pull the values out of the dataframe, so that’s why i used the `.values`

method.

The features are in columns 1 and 2, so that’s why I used those in the scatterplot function. Column 0 is a dummy feature of 1’s that I included so that the intercept is calculated. This should be familiar to what we did with the NAND gate in the previous section.

Finally, I colored the two classes using `c = df['3'], alpha = 0.8`

in the scatterplot function. The output is the data in column 3 (0 or 1), so I told the function to color the two classes using column ‘3’.

You can find more info on Matplotlib’s scatterplot function here.

**2. Split the data into train/test sets**

Now that we’ve confirmed the data can be separated linearly, it’s time to split the data.

It’s always good practice to train a model on a separate dataset from the one you test on. This helps to avoid overfitting.

There’s different methods for doing this, but to keep things simple I’ll just use a training set and a test set.

I’ll start by shuffling up my data. If you take a look at the original file you’ll see that the data is grouped by rows with 0 in the output (third column) and then followed by all of the 1’s. I want to shake things up and add some randomness, so I’m going to shuffle it.

```
df = df.values
np.random.seed(5)
np.random.shuffle(df)
```

I started by changing the data from a dataframe to a numpy array. This is going to make it easier to work with a lot of the numpy functions I’m going to use, such as `.shuffle`

.

To make the results reproducible for you I set a random seed (5). After we’ve finished, try changing the random seed and see how the results change.

Next I’m going split 70% of my data into a training set, and 30% into a test set.

```
train = df[0:int(0.7*len(df))]
test = df[int(0.7*len(df)):int(len(df))]
```

The final step is to separate out the features and the outputs for the training and test sets.

```
x_train = train[:, 0:3]
y_train = train[:, 3]
x_test = test[:, 0:3]
y_test = test[:, 3]
```

I chose 70%/30% for my train/test sets just for the sake of this example, but I encourage you to look into other methods, such as k-fold cross validation.

**3. Train our Perceptron**

Next, we’ll train our Perceptron.

This is pretty straightforward, we’re just going to reuse the code that we build in the previous section.

```
def perceptron_train(x, y, z, eta, t):
'''
Input Parameters:
x: data set of input features
y: actual outputs
z: activation function threshold
eta: learning rate
t: number of iterations
'''
# initializing the weights
w = np.zeros(len(x[0]))
n = 0
# initializing additional parameters to compute sum-of-squared errors
yhat_vec = np.ones(len(y)) # vector for predictions
errors = np.ones(len(y)) # vector for errors (actual - predictions)
J = [] # vector for the SSE cost function
while n < t: for i in xrange(0, len(x)): # dot product f = np.dot(x[i], w) # activation function if f >= z:
yhat = 1.
else:
yhat = 0.
yhat_vec[i] = yhat
# updating the weights
for j in xrange(0, len(w)):
w[j] = w[j] + eta*(y[i]-yhat)*x[i][j]
n += 1
# computing the sum-of-squared errors
for i in xrange(0,len(y)):
errors[i] = (y[i]-yhat_vec[i])**2
J.append(0.5*np.sum(errors))
return w, J
z = 0.0
eta = 0.1
t = 50
perceptron_train(x_train, y_train, z, eta, t)
```

Let’s go ahead and take a look at the weights and the sum-of-squared error.

```
w = perceptron_train(x_train, y_train, z, eta, t)[0]
J = perceptron_train(x_train, y_train, z, eta, t)[1]
print w
print J
```

```
Out:
[-0.5 -0.29850122 0.35054929]
[4.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The weights don’t mean much to us right now, but we’ll use those numbers in the next section to test our perceptron. We’ll also use the weights to compare our model to the scikit-learn model.

Taking a look at the sum-of-squared error we can see that our Perceptron has converged, which we’d expect since the data is linearly separable.

**4. Test our Perceptron**

Now it’s time to test our Perceptron. To do that we’ll build a small `perceptron_test`

function.

This is very similar to what we’ve already seen. This function takes the dot product of the weights we calculated using the `perceptron_train`

function and the features, along with the activation function, to make predictions.

The only piece we haven’t seen is the `accuracy_score`

. This is an evaluation metric function from scikit-learn. You can learn more about it here.

Putting all of this together, here’s what the code looks like:

```
from sklearn.metrics import accuracy_score
w = perceptron_train(x_train, y_train, z, eta, t)[0]
def perceptron_test(x, w, z, eta, t):
y_pred = []
for i in xrange(0, len(x-1)):
f = np.dot(x[i], w)
# activation function
if f > z:
yhat = 1
else:
yhat = 0
y_pred.append(yhat)
return y_pred
y_pred = perceptron_test(x_test, w, z, eta, t)
print "The accuracy score is:"
print accuracy_score(y_test, y_pred)
```

```
Out:
The accuracy score is:
1.0
```

A score of 1.0 indicates that our model correctly made predictions on all of the test data. This dataset is clearly separable, so we’d expect this result.

**5. Compare with the scikit-learn Perceptron**

The final step is to compare our results with the Perceptron from scikit-learn. Here’s the code from that model:

```
from sklearn.linear_model import Perceptron
# training the sklearn Perceptron
clf = Perceptron(random_state=None, eta0=0.1, shuffle=False, fit_intercept=False)
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
```

Now that we’ve trained the model, lets compare the weights to the weights that our model calculated.

```
Out:
sklearn weights:
[-0.5 -0.29850122 0.35054929]
my perceptron weights:
[-0.5 -0.29850122 0.35054929]
```

The weights from the scikit-learn model are identical to our weights. This means that our model is working correctly, which is great news.

There’s a few minor points to go over before we wrap up. In the scikit-learn model we had to set the random state to ‘None’ and turn off the shuffling. We already set a random seed and shuffled the data, so we didn’t need to do this again.

We also had to set the learning rate ‘eta0’ to 0.1 to be identical to our model.

The final point is the intercept. Since we already include a dummy feature column of 1s we were fitting intercept automatically, so we didn’t need to turn this on in the scikit-learn perceptron.

These all seem like minor details, but if we didn’t set these, we wouldn’t be able to reproduce the same results as our model.

This is an important point. Before using a model, it’s very important to read the documentation and understand what all of the different settings do.

## Write Up Your Process

This last step in the process is probably the most important.

You’ve just gone through all the work of learning, taking notes, writing the algorithm from scratch, and comparing it with a trusted implementation. Don’t let all that good work go to waste!

Writing up the process is important for two reasons:

- You’ll gain an even deeper understanding because you’re teaching others what you just learned.
- You can showcase it to potential employers.

It’s one thing to show that you can implement an algorithm from a machine learning library, but it’s even more impressive if you can implement it yourself from scratch.

A great way to showcase your work is with a GitHub Pages portfolio.

## Conclusion

In this post we learned how to implement the Perceptron from scratch.

More importantly, we learned how to find useful learning resources, and how to break an algorithm into chunks.

We then learned how to implement and test an algorithm in code using a toy dataset.

Finally, we finished up by comparing the results from our model to a trusted implementation. To get a full copy of the Python code used, click on the green button below.

This is a great methodology to learn an algorithm on a deeper level so that you can implement it yourself.

Most of the time you’ll use a trusted implementation, but if you really want to gain a deeper understanding of what’s going on under the hood, implementing it from scratch is a great exercise.

Be sure to leave a comment below and let me know if you have any other tips that help you during the learning process!

# 6 Steps To Write Any Machine Learning Algorithm From Scratch: Perceptron Case Study

**Writing a machine learning algorithm from scratch is an extremely rewarding learning experience.**

It provides you with that “ah ha!” moment where it finally clicks, and you understand what’s really going on under the hood.

I’ll walk you through the following 6-step process to write algorithms from scratch, using the Perceptron as a case-study:

- Get a basic understanding
- Find different sources
- Break the algorithm into chunks
- Start with a simple example
- Validate with a trusted model
- Write up your process

## Get a Basic Understanding

This goes back to what I originally stated. If you don’t understand the basics, don’t tackle an algorithm from scratch.

At the very least, you should be able to answer the following questions:

- What is it?
- What is it typically used for?
- When CAN’T I use this?

For the Perceptron, let’s go ahead and answer these questions:

- The single layer Perceptron is the most basic neural network. It’s typically used for binary classification problems (1 or 0, “yes” or “no”).
- Some simple uses might be sentiment analysis (positive or negative response) or loan default prediction (“will default”, “will not default”). For both cases, the decision boundary would need to be linear.
- If the decision boundary is non-linear, you really can’t use the Perceptron. For those problems, you’ll need to use something different.

## Use Different Sources for Learning

After you have a basic understanding of the model, it’s time to start doing your research.

Some people learn better with textbooks, some people learn better with video.

Personally, I like to bounce around and use various types of sources.

For the mathematical details, textbooks do a great job, but for more practical examples, I prefer blog posts and YouTube videos.

For the Perceptron, here’s some great sources:

**Textbooks**

- The Elements of Statistical Learning, Sec. 4.5.1
- Understanding Machine Learning: From Theory To Algorithms, Sec. 21.4

**Blogs**

- How To Implement The Perceptron Algorithm From Scratch In Python, by Jason Brownlee
- Single-Layer Neural Networks and Gradient Descent, by Sebastian Raschka

**Videos**

## Break The Algorithm Into Chunks

Now that we’ve gathered our sources, it’s time to start learning.

Rather than read a chapter or blog post all the way through, start by skimming for section headings, and other important info.

Write down bullet points, and try to outline the algorithm.

After going through the sources, I’ve broken the Perceptron into the following 5 chunks:

- Initialize the weights
- Multiply weights by input and sum
- Compare result against threshold
- Update the weights
- Repeat

Let’s go through each in detail.

**1. Initialize the weights**

To start, we’ll initialize the weight vector.

The number of weights needs to match the number of features. Assuming we have three features, here’s what the weight vector will look likeThe weight vector is usually initialized with zeros, so I’m going to stick with that for the example.

**2. Multiply weights by inputs and sum them up**

Next, we’re going to multiply the weights by the inputs, and then sum them up.

To make it easier to follow along, I’ve colored the weights and their corresponding features in the first row.

After we multiply the weights by the features, we sum them up. This is also known as the dot product.

The final result is 0. I’m going to refer to this temporary result as “f”.

**3. Compare against threshold**

After we’ve calculated the dot product, we need to compare it against a threshold.

I’ve chosen to use 0 as my threshold, but you can play around with this and try some different numbers.

Since the dot product “f” that we calculated (0) is not greater than our threshold (0), our estimate is equal to zero.

I’ve denoted the estimate as a y with a hat (aka, “y hat”), with a 0 subscript to correspond with the first row. You could have used a 1 for the first row instead, it doesn’t really matter. I just chose to start with 0.

If we compare this result with the actual value, we can see that our current weights did not correctly predict the actual output.

Since our prediction was wrong, we need to update the weights, which brings us to the next step.

**4. Update the weights**

Next, we’re going to update the weights. Here’s the equation that we’re going to use:

The basic idea is that we adjust the current weight at iteration “n” so that we get a new weight to use in the next iteration, “n+1”.

To adjust the weight, we need to set a “learning rate”. This is denoted by the Greek letter “eta”.

I’ve chosen to use 0.1 for the learning rate, but you can play around with different numbers, just like with the threshold.

Here’s a quick summary of what we have so far:

Now let’s go ahead and calculate the new weights for iteration n=2.

We’ve successfully completed the first iteration of the Perceptron algorithm.

**5. Repeat**

Since our algorithm didn’t compute the right output, we need to keep going.

Typically we’ll need numerous iterations. Looping through each row in the data set, we’ll update the weights each time.

One complete sweep through the data set is known as an “epoch.”

Since our data set has 3 rows, it will take us three iterations to complete 1 epoch.

We could either set a total number of iterations or epochs to continue performing the algorithm. Maybe we want to specify 30 iterations (or 10 epochs).

As with the threshold and learning rate, the number of epochs is a parameter that you can play around with.

In the next iteration, we move on to the second row of features.

I won’t go through every step again, but here’s the next calculation of the dot product:

Next we would compare the dot product with the threshold to calculate a new estimate, update the weights, and then keep going. If our data is linearly separable, the Perceptron will converge.

## Start With A Simple Example

Now that we’ve broken the algorithm into chunks by hand, it’s time to start implementing it in code.

To keep things simple, I always like to start with a very small “toy data set.”

A nice small, linearly separable data set for this type of problem is a NAND gate. This is a common logic gate used in digital electronics.Since this is a fairly small data set, we can just manually enter it in to Python.

I’m going to add in a dummy feature “x0” which is a column of 1’s. I did this so that our model will calculate the bias term.

You can think of the bias as the intercept term that correctly allows our model to separate the two classes.

Here’s the code for entering in the data:

```
# Importing libraries
# NAND Gate
# Note: x0 is a dummy variable for the bias term
# x0 x1 x2
x = [[1., 0., 0.],
[1., 0., 1.],
[1., 1., 0.],
[1., 1., 1.]]
y =[1.,
1.,
1.,
0.]
```

As with the previous section, I’ll step through the algorithm in chunks, writing code and testing it as we go.

**1. Initialize the weights**

The first step is to initialize the weights.

```
# Initialize the weights
import numpy as np
w = np.zeros(len(x[0]))
```

```
Out:
[ 0. 0. 0.]
```

Keep in mind that the length of the weight vector needs to match the number of features. For this NAND gate example, that length is 3.

**2. Multiply weights by inputs and sum them up**

Next, we’re going to multiply the weights by the inputs, and then sum them up.

Another name for this is the “dot product.”

Again, we can use Numpy to easily perform this operation. The method we’ll use is `.dot()`

.

Let’s start by taking the dot product of the weight vector and the first row of features.

```
# Dot Product
f = np.dot(w, x[0])
print f
```

```
Out:
0.0
```

As expected, the result is 0.

To stay consistent with our notes in the previous section, I assigned the dot product to the variable “f”.

**3. Compare against the threshold**

After we’ve computed the dot product, we’re ready to compare the result against the threshold to make our prediction of the output.

Again, I’m going to stay consistent with our notes in the previous section.

I’m going to make the threshold “z” equal to 0. If the dot product “f” is greater than 0, our prediction will be 1. Otherwise, it will be zero.

Keep in mind that the prediction is usually denoted with a carat over the top, also know as a “hat”. The variable I’ll assign the prediction to is `yhat`

.

```
# Activation Function
z = 0.0
if f > z:
yhat = 1.
else:
yhat = 0.
print yhat
```

```
Out:
0.0
```

As expected, the prediction is 0.

You’ll notice that in the note above the code I’ve called this the “Activation function”. This is a more formal description of what we’re doing.

Taking a look at the first row of the NAND output, we can see that the actual value is 1. Since our prediction was wrong, we’ll need to go ahead and update the weights.

**4. Update the weights**

Now that we’ve made our prediction, we’re ready to update the weights.

We’ll need to set a learning rate before we can do that. To be consistent with our previous example, I’ll assign the learning rate “eta” a value of 0.1.

I’m going to hard code the update for each weight to make things easier to read.

```
# Update the weights
eta = 0.1
w[0] = w[0] + eta*(y[0] - yhat)*x[0][0]
w[1] = w[1] + eta*(y[0] - yhat)*x[0][1]
w[2] = w[2] + eta*(y[0] - yhat)*x[0][2]
print w
```

```
Out:
[ 0.1 0. 0. ]
```

We can see that our weights have now been updated, so we’re ready to keep going.

**5. Repeat**

Now that we’ve worked through each step, it’s time to put everything together.

The one final piece we haven’t discussed is our loss function. This is the function we’re trying to minimize, which in our case will be the sum-of-squared (SSE) error.

This is what we’ll use to calculate our error, and see how the model is performing.

Tying it all together, here’s what the full function looks like:

```
import numpy as np
# Perceptron function
def perceptron(x, y, z, eta, t):
'''
Input Parameters:
x: data set of input features
y: actual outputs
z: activation function threshold
eta: learning rate
t: number of iterations
'''
# initializing the weights
w = np.zeros(len(x[0]))
n = 0
# initializing additional parameters to compute sum-of-squared errors
yhat_vec = np.ones(len(y)) # vector for predictions
errors = np.ones(len(y)) # vector for errors (actual - predictions)
J = [] # vector for the SSE cost function
while n < t: for i in xrange(0, len(x)): # dot product f = np.dot(x[i], w) # activation function if f >= z:
yhat = 1.
else:
yhat = 0.
yhat_vec[i] = yhat
# updating the weights
for j in xrange(0, len(w)):
w[j] = w[j] + eta*(y[i]-yhat)*x[i][j]
n += 1
# computing the sum-of-squared errors
for i in xrange(0,len(y)):
errors[i] = (y[i]-yhat_vec[i])**2
J.append(0.5*np.sum(errors))
return w, J
```

Now that we’ve coded the full perceptron, let’s go ahead and run it:

```
# x0 x1 x2
x = [[1., 0., 0.],
[1., 0., 1.],
[1., 1., 0.],
[1., 1., 1.]]
y =[1.,
1.,
1.,
0.]
z = 0.0
eta = 0.1
t = 50
print "The weights are:"
print perceptron(x, y, z, eta, t)[0]
print "The errors are:"
print perceptron(x, y, z, eta, t)[0]
```

```
Out:
The weights are:
[ 0.2 -0.2 -0.1]
The errors are:
[0.5, 1.5, 1.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Taking a look at the errors we can see that the error went to 0 around the 6th iteration. For the remainder of the iterations it remained at 0.

When the error goes to 0 and stays there, we know that our model has converged. This tells us that our model has correctly “learned” the appropriate weights.

In the next section we’ll use our calculated weights on a larger data

set to make predictions.

## Validate with a trusted implementation

Up to this point we’ve found different learning resources, worked through the algorithm by hand, and tested it in code with a simple example.

Now it’s time to compare our results with a trusted implementation. For comparison, we’ll use the Perceptron from scikit-learn.

We’ll work through this comparison using the following steps:

- Import the data
- Split the data into train/test sets
- Train our Perceptron
- Test the Perceptron
- Compare with the scikit-learn Perceptron

**1. Import the data**

Lets start by importing the data. You can get a copy of the dataset here.

This is a linearly separable dataset that I created to make sure that the Perceptron would work. Just to confirm, let’s go ahead and plot the data.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("dataset.csv")
plt.scatter(df.values[:,1], df.values[:,2], c = df['3'], alpha=0.8)
```

text

Taking a look at the plot it’s pretty easy to see that we can separate this data with a straight line.

Before we move on, I’ll explain my plotting code above.

I used Pandas to import the csv, which automatically puts the data in a dataframe.

To plot the data I had to pull the values out of the dataframe, so that’s why i used the `.values`

method.

The features are in columns 1 and 2, so that’s why I used those in the scatterplot function. Column 0 is a dummy feature of 1’s that I included so that the intercept is calculated. This should be familiar to what we did with the NAND gate in the previous section.

Finally, I colored the two classes using `c = df['3'], alpha = 0.8`

in the scatterplot function. The output is the data in column 3 (0 or 1), so I told the function to color the two classes using column ‘3’.

You can find more info on Matplotlib’s scatterplot function here.

**2. Split the data into train/test sets**

Now that we’ve confirmed the data can be separated linearly, it’s time to split the data.

It’s always good practice to train a model on a separate dataset from the one you test on. This helps to avoid overfitting.

There’s different methods for doing this, but to keep things simple I’ll just use a training set and a test set.

I’ll start by shuffling up my data. If you take a look at the original file you’ll see that the data is grouped by rows with 0 in the output (third column) and then followed by all of the 1’s. I want to shake things up and add some randomness, so I’m going to shuffle it.

```
df = df.values
np.random.seed(5)
np.random.shuffle(df)
```

I started by changing the data from a dataframe to a numpy array. This is going to make it easier to work with a lot of the numpy functions I’m going to use, such as `.shuffle`

.

To make the results reproducible for you I set a random seed (5). After we’ve finished, try changing the random seed and see how the results change.

Next I’m going split 70% of my data into a training set, and 30% into a test set.

```
train = df[0:int(0.7*len(df))]
test = df[int(0.7*len(df)):int(len(df))]
```

The final step is to separate out the features and the outputs for the training and test sets.

```
x_train = train[:, 0:3]
y_train = train[:, 3]
x_test = test[:, 0:3]
y_test = test[:, 3]
```

I chose 70%/30% for my train/test sets just for the sake of this example, but I encourage you to look into other methods, such as k-fold cross validation.

**3. Train our Perceptron**

Next, we’ll train our Perceptron.

This is pretty straightforward, we’re just going to reuse the code that we build in the previous section.

```
def perceptron_train(x, y, z, eta, t):
'''
Input Parameters:
x: data set of input features
y: actual outputs
z: activation function threshold
eta: learning rate
t: number of iterations
'''
# initializing the weights
w = np.zeros(len(x[0]))
n = 0
# initializing additional parameters to compute sum-of-squared errors
yhat_vec = np.ones(len(y)) # vector for predictions
errors = np.ones(len(y)) # vector for errors (actual - predictions)
J = [] # vector for the SSE cost function
while n < t: for i in xrange(0, len(x)): # dot product f = np.dot(x[i], w) # activation function if f >= z:
yhat = 1.
else:
yhat = 0.
yhat_vec[i] = yhat
# updating the weights
for j in xrange(0, len(w)):
w[j] = w[j] + eta*(y[i]-yhat)*x[i][j]
n += 1
# computing the sum-of-squared errors
for i in xrange(0,len(y)):
errors[i] = (y[i]-yhat_vec[i])**2
J.append(0.5*np.sum(errors))
return w, J
z = 0.0
eta = 0.1
t = 50
perceptron_train(x_train, y_train, z, eta, t)
```

Let’s go ahead and take a look at the weights and the sum-of-squared error.

```
w = perceptron_train(x_train, y_train, z, eta, t)[0]
J = perceptron_train(x_train, y_train, z, eta, t)[1]
print w
print J
```

```
Out:
[-0.5 -0.29850122 0.35054929]
[4.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The weights don’t mean much to us right now, but we’ll use those numbers in the next section to test our perceptron. We’ll also use the weights to compare our model to the scikit-learn model.

Taking a look at the sum-of-squared error we can see that our Perceptron has converged, which we’d expect since the data is linearly separable.

**4. Test our Perceptron**

Now it’s time to test our Perceptron. To do that we’ll build a small `perceptron_test`

function.

This is very similar to what we’ve already seen. This function takes the dot product of the weights we calculated using the `perceptron_train`

function and the features, along with the activation function, to make predictions.

The only piece we haven’t seen is the `accuracy_score`

. This is an evaluation metric function from scikit-learn. You can learn more about it here.

Putting all of this together, here’s what the code looks like:

```
from sklearn.metrics import accuracy_score
w = perceptron_train(x_train, y_train, z, eta, t)[0]
def perceptron_test(x, w, z, eta, t):
y_pred = []
for i in xrange(0, len(x-1)):
f = np.dot(x[i], w)
# activation function
if f > z:
yhat = 1
else:
yhat = 0
y_pred.append(yhat)
return y_pred
y_pred = perceptron_test(x_test, w, z, eta, t)
print "The accuracy score is:"
print accuracy_score(y_test, y_pred)
```

```
Out:
The accuracy score is:
1.0
```

A score of 1.0 indicates that our model correctly made predictions on all of the test data. This dataset is clearly separable, so we’d expect this result.

**5. Compare with the scikit-learn Perceptron**

The final step is to compare our results with the Perceptron from scikit-learn. Here’s the code from that model:

```
from sklearn.linear_model import Perceptron
# training the sklearn Perceptron
clf = Perceptron(random_state=None, eta0=0.1, shuffle=False, fit_intercept=False)
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
```

Now that we’ve trained the model, lets compare the weights to the weights that our model calculated.

```
Out:
sklearn weights:
[-0.5 -0.29850122 0.35054929]
my perceptron weights:
[-0.5 -0.29850122 0.35054929]
```

The weights from the scikit-learn model are identical to our weights. This means that our model is working correctly, which is great news.

There’s a few minor points to go over before we wrap up. In the scikit-learn model we had to set the random state to ‘None’ and turn off the shuffling. We already set a random seed and shuffled the data, so we didn’t need to do this again.

We also had to set the learning rate ‘eta0’ to 0.1 to be identical to our model.

The final point is the intercept. Since we already include a dummy feature column of 1s we were fitting intercept automatically, so we didn’t need to turn this on in the scikit-learn perceptron.

These all seem like minor details, but if we didn’t set these, we wouldn’t be able to reproduce the same results as our model.

This is an important point. Before using a model, it’s very important to read the documentation and understand what all of the different settings do.

## Write Up Your Process

This last step in the process is probably the most important.

You’ve just gone through all the work of learning, taking notes, writing the algorithm from scratch, and comparing it with a trusted implementation. Don’t let all that good work go to waste!

Writing up the process is important for two reasons:

- You’ll gain an even deeper understanding because you’re teaching others what you just learned.
- You can showcase it to potential employers.

It’s one thing to show that you can implement an algorithm from a machine learning library, but it’s even more impressive if you can implement it yourself from scratch.

A great way to showcase your work is with a GitHub Pages portfolio.

## Conclusion

In this post we learned how to implement the Perceptron from scratch.

More importantly, we learned how to find useful learning resources, and how to break an algorithm into chunks.

We then learned how to implement and test an algorithm in code using a toy dataset.

Finally, we finished up by comparing the results from our model to a trusted implementation. To get a full copy of the Python code used, click on the green button below.

This is a great methodology to learn an algorithm on a deeper level so that you can implement it yourself.

Most of the time you’ll use a trusted implementation, but if you really want to gain a deeper understanding of what’s going on under the hood, implementing it from scratch is a great exercise.

Be sure to leave a comment below and let me know if you have any other tips that help you during the learning process!