Today we are going to implement logistic regression as a neural network. This is definitely one of the simplest neural network, and is great to get your feet wet in neural network. After completing this tutorial, you will know:

• How to implement logistic regression.
• How to use logistic regression.
• How to use gradient descent.
• How a neural network works.
• Case study - Breast Cancer Wisconsin Data Set (predict whether the tumor is benign or malignant)

## Introduction to Logistic Regression

Logistic regression is a binary classification algorithm, which can be used to classify linearly seperable classes, i.e. two classes that can be separated by a line. It is one of the most basic algorithms. Most of them time, this is the first algorithm that I use for classification. It kind of provides the lower bound for accuracy. Let us see the architecture of logistic regression.

# The General Architecture of the learning algorithm

Before you get started please check out the notation that is being used below. The following image explains the architecture of logistic regression as a neural network:

Mathematical expression of the algorithm

For one example $x^{(i)}$:

$z^{(i)} = w^T x^{(i)} + b$ $\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})$ $\mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^{(i)} ) \log(1-a^{(i)})$

The cost is then computed by summing over all training examples: $J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})$

Key steps: We will carry out the following steps: - Initialize the parameters of the model - Learn the parameters for the model by minimizing the cost
- Use the learned parameters to make predictions (on the test set) - Analyse the results and conclude

## Building the parts of our algorithm

The main steps for building a Neural Network are:

1. Define the model structure (such as number of input features)
2. Initialize the model’s parameters
3. Loop:
• Calculate current loss (forward propagation)
• Calculate current gradient (backward propagation)

You often build 1-3 separately and integrate them into one function we call model().

# import the libraries needed
import numpy as np


### Helper functions

Implement sigmoid() function to compute $sigmoid( w^T x + b)$ to make predictions.

def sigmoid(z):
"""
Computer the sigmoid of z

Arguments:
x -- A scalar or numpy array of any size.

Return:
s -- sigmoid(z)
"""

# Calculate sigmoid
s = 1 / (1 + np.exp(-z))

return s

# function to test sigmoid
def test_sigmoid():
print ("sigmoid(0) = " + str(sigmoid(0)))
print ("sigmoid(9.2) = " + str(sigmoid(9.2)))

test_sigmoid()


Output

    sigmoid(0) = 0.5
sigmoid(9.2) = 0.999898970806


### Initializing parameters

Implement parameter initialization to initialize w as a vector of zeros and b to zero.

def initialize_with_zeros(dim):
"""
This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.

Argument:
dim -- size of the w vector we want (or number of parameters in this case)

Returns:
w -- initialized vector of shape (dim, 1)
b -- initialized scalar (corresponds to the bias)
"""

# Initialize w and b
w = np.zeros(shape=(dim, 1))
b = 0

assert(w.shape == (dim, 1))
assert(isinstance(b, float) or isinstance(b, int))

return w, b

def test_initialize_with_zeros(dim):
w, b = initialize_with_zeros(dim)
print ("w = " + str(w))
print ("b = " + str(b))

test_initialize_with_zeros(2)


Output

    w = [[ 0.]
[ 0.]]
b = 0


### Forward and Backward propagation

Now that your parameters are initialized, you can do the “forward” and “backward” propagation steps for learning the parameters. Forward Propagation:

• You get X
• You compute $A = \sigma(w^T X + b) = (a^{(0)}, a^{(1)}, ..., a^{(m-1)}, a^{(m)})$
• You calculate the cost function: $J = -\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})$

Here are the two formulas you will be using:

$\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$ $\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$
def propagate(w, b, X, Y):
"""
Implement the cost function and its gradient for the propagation explained above

Arguments:
w -- weights, a numpy array
b -- bias, a scalar
X -- data of size (number of features, number of examples)
Y -- true "label" vector of size (1, number of examples)

Return:
cost -- negative log-likelihood cost for logistic regression
dw -- gradient of the loss with respect to w, thus same shape as w
db -- gradient of the loss with respect to b, thus same shape as b
"""

m = X.shape[1]

### Forward Propagation
# compute activation
A = sigmoid(np.dot(w.T, X) + b)
# compute cost
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))

### Backward Propagation
dw = (1 / m) * np.dot(X, (A - Y).T)
db = (1 / m) * np.sum(A - Y)

assert(dw.shape == w.shape)
assert(db.dtype == float)
cost = np.squeeze(cost)
assert(cost.shape == ())

"db": db}


def test_propagate():
w, b, X, Y = np.array([[1], [2]]), 2, np.array([[1,2], [3,4]]), np.array([[1, 0]])
grads, cost = propagate(w, b, X, Y)
print ("dw = " + str(grads["dw"]))
print ("db = " + str(grads["db"]))
print ("cost = " + str(cost))

test_propagate()


Output

    dw = [[ 0.99993216]
[ 1.99980262]]
db = 0.499935230625
cost = 6.00006477319


### Optimization

• You have initialized your parameters.
• You are also able to compute a cost function and its gradient.
• Now, you want to update the parameters using gradient descent.

Implementing the optimization function. The goal is to learn $w$ and $b$ by minimizing the cost function $J$. For a parameter $\theta$, the update rule is $\theta = \theta - \alpha \text{ } d\theta$, where $\alpha$ is the learning rate.

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
"""
This function optimizes w and b by running a gradient descent algorithm

Arguments:
w -- weights, a numpy array of size (number of features, 1)
b -- bias, a scalar
X -- data of shape (number of features, number of examples)
Y -- true "label" vector (containing 0 or 1), of shape (1, number of examples)
num_iterations -- number of iterations of the optimization loop
learning_rate -- learning rate of the gradient descent update rule
print_cost -- True to print the loss every 100 steps

Returns:
params -- dictionary containing the weights w and bias b
grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.

Tips:
You basically need to write down two steps and iterate through them:
1) Calculate the cost and the gradient for the current parameters. Use propagate().
2) Update the parameters using gradient descent rule for w and b.
"""

costs = []

for i in range(num_iterations):

grads, cost = propagate(w, b, X, Y)

# update weights and bias
w = w - learning_rate * dw
b = b - learning_rate * db

# Record the costs
if i % 100 == 0:
costs.append(cost)

# Print the cost every 100 training examples
if print_cost and i % 100 == 0:
print ("Cost after iteration %i: %f" % (i, cost))

params = {"w": w,
"b": b}

"db": db}


def test_optimize():
w, b, X, Y = np.array([[1], [2]]), 2, np.array([[1,2], [3,4]]), np.array([[1, 0]])

params, grads, costs = optimize(w, b, X, Y, num_iterations= 100, learning_rate = 0.009, print_cost = False)

print ("w = " + str(params["w"]))
print ("b = " + str(params["b"]))
print ("dw = " + str(grads["dw"]))
print ("db = " + str(grads["db"]))

test_optimize()


Output

    w = [[ 0.1124579 ]
[ 0.23106775]]
b = 1.55930492484
dw = [[ 0.90158428]
[ 1.76250842]]
db = 0.430462071679


### Predict

The previous function will output the learned w and b. We are able to use w and b to predict the labels for a dataset X. Implement the predict() function. There is two steps to computing predictions:

1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

2. Convert the entries of a into 0 (if activation <= 0.5) or 1 (if activation > 0.5), stores the predictions in a vector Y_prediction.

def predict(w, b, X):
"""
Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)

Arguments:
w -- weights, a numpy array of size (number of features, 1)
b -- bias, a scalar
X -- data of size (number of features, number of examples)

Returns:
Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
"""

m = X.shape[1]
Y_prediction = np.zeros((1, m))
w = w.reshape(X.shape[0], 1)

# Compute vector "A" predicting the probabilities of class being "1"
A = sigmoid(np.dot(w.T, X) + b)

for i in range(A.shape[1]):
# Convert probabilities a[0,i] to actual predictions p[0,i]
Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0

assert(Y_prediction.shape == (1, m))

return Y_prediction

def test_predict():
w, b, X = np.array([[1], [2]]), 2, np.array([[1,2], [3,4]])
print("predictions = " + str(predict(w, b, X)))

test_predict()


Output

predictions = [[ 1.  1.]]


What to remember: You’ve implemented several functions that:

• Initialize (w,b)
• Optimize the loss iteratively to learn parameters (w,b):
• computing the cost and its gradient
• updating the parameters using gradient descent
• Use the learned (w,b) to predict the labels for a given set of examples

### Merge all functions into a model

You will now see how the overall model is structured by putting together all the building blocks (functions implemented in the previous parts) together, in the right order.

Implement the model function. Use the following notation: - Y_prediction for your predictions on the test set - Y_prediction_train for your predictions on the train set - w, costs, grads for the outputs of optimize()

def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=False):
"""
Builds the logistic regression model by calling the function you've implemented previously

Arguments:
X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
print_cost -- Set to true to print the cost every 100 iterations

Returns:
d -- dictionary containing information about the model.
"""

# initialize parameters with zeros
w, b = initialize_with_zeros(X_train.shape[0])

parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)

# Retrieve parameters w and b from dictionary "parameters"
w = parameters["w"]
b = parameters["b"]

# Predict test/train set examples
Y_prediction_test = predict(w, b, X_test)
Y_prediction_train = predict(w, b, X_train)

# Print train/test Errors
print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

d = {"costs": costs,
"Y_prediction_test": Y_prediction_test,
"Y_prediction_train" : Y_prediction_train,
"w" : w,
"b" : b,
"learning_rate" : learning_rate,
"num_iterations": num_iterations}

return d


We have successfully implemented logistic regression as a neural network. Now, let us use it for prediction.

## Case Study - Breast Cancer Wisconsin Data Set

Breast Cancer Wisconsin (Diagnostic) Data Set’s features can be used to predict the type of tumor, malignant or benign. You can check the data set’s description here and download it from here. Its attributes are as follows:

#### Attribute Information

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32) Other attributes

Now, let us load this data set.

import pandas as pd

print(data.shape)


Ouput

(569, 32)


So we have 569 rows and 32 coulmns.

Let us check what the data looks like?

# Explore a bit


Ouput

0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

In this set, the 0th column is id, 1st is the class, B for benign and M for malignant, and further columns are real-valued input features. Let us gather the input features and corresponding output classes from data.

# Extract input features from column 2 to 32
X = data.iloc[:, 2:]
# Extract diagnosis class column 1
Y = data.iloc[:, 1]

# Changing M to 1 and B to 0, for malignant and benign respectively
Y = (Y == 'M').astype(np.float64)

# Changing DataFrame to numpy arrays
X = X.values
Y = Y.values

# Changing shape of Y
Y = np.resize(Y, (Y.shape[0], 1))

# Normalize the data features
from sklearn.preprocessing import normalize
X = normalize(X, axis=0)


While working with neural networks and optimization algorithms, always normalize your data features. This is done for following two reasons:

• It prevents the values involved in a network from becoming too large or too small, hence reduces the chances of overflow or underflow
• It helps the optimizer to converge (or approach convergence) faster.

To see what happens when you don’t normalize your data, try commenting out those lines and run the code. (You will get cost’s value nan for nearly all epochs).

# Prepare training and test data
# Put 70% of examples in training set and the remaing 30% in testing set
X_train = X[:int(0.7*X.shape[0]), :].T
Y_train = Y[:int(0.7*Y.shape[0]), :].T
X_test = X[int(0.7*X.shape[0]):, :].T
Y_test = Y[int(0.7*Y.shape[0]):, :].T

# Let's run the model now!!!
model(X_train, Y_train, X_test, Y_test, print_cost=True)


Ouput

    Cost after iteration 0: 0.693147
Cost after iteration 100: 0.647681
Cost after iteration 200: 0.614502
Cost after iteration 300: 0.585481
Cost after iteration 400: 0.559981
Cost after iteration 500: 0.537452
Cost after iteration 600: 0.517429
Cost after iteration 700: 0.499527
Cost after iteration 800: 0.483430
Cost after iteration 900: 0.468875
Cost after iteration 1000: 0.455647
Cost after iteration 1100: 0.443567
Cost after iteration 1200: 0.432484
Cost after iteration 1300: 0.422276
Cost after iteration 1400: 0.412835
Cost after iteration 1500: 0.404075
Cost after iteration 1600: 0.395917
Cost after iteration 1700: 0.388299
Cost after iteration 1800: 0.381163
Cost after iteration 1900: 0.374461
train accuracy: 90.20100502512562 %
test accuracy: 95.32163742690058 %

{'Y_prediction_test': array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,
0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,
0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,
0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,
0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,
1.,  0.]]),
'Y_prediction_train': array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,
0.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,
1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,
0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
1.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,
0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,
0.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,
0.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,
0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
1.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,
0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,
0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.]]),
'b': -2.82004015631605,
'costs': [0.69314718055994518,
0.64768121608659124,
0.61450221434667129,
0.58548108050436609,
0.55998108981562744,
0.53745153619366792,
0.51742861842701759,
0.49952734366191881,
0.4834301607574058,
0.46887545101664851,
0.45564724081802283,
0.44356654183651678,
0.43248428219793783,
0.42227562488511811,
0.41283543232151004,
0.40407465235038642,
0.395917434789408,
0.38829882374096664,
0.3811629029392809,
0.37446129799735056],
'learning_rate': 0.5,
'num_iterations': 2000,
'w': array([[ 2.2402501 ],
[ 1.46136967],
[ 2.39127058],
[ 4.24369791],
[ 0.58075704],
[ 3.17224642],
[ 5.31606193],
[ 6.06826111],
[ 0.47147587],
[-0.25072496],
[ 3.70336513],
[-0.19500842],
[ 3.66594003],
[ 4.71289032],
[-0.68066637],
[ 1.12038991],
[ 0.69818476],
[ 1.72171634],
[-0.62144059],
[-0.66633135],
[ 2.92714447],
[ 1.82405968],
[ 3.06332473],
[ 5.21202012],
[ 1.06115042],
[ 4.13504157],
[ 4.99853943],
[ 5.33078157],
[ 1.25268158],
[ 0.8297114 ]])}


We get,

train accuracy: 90.20 %

test accuracy: 95.32 %

Not bad for such a simple algorithm.

Well done for completing the tutorial. Hope you liked it.

The following is the complete code of the application of the above model.

import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Extract input features from column 2 to 32
X = data.iloc[:, 2:]
# Extract diagnosis class column 1
Y = data.iloc[:, 1]

# Changing M to 1 and B to 0, for malignant and benign respectively
Y = (Y == 'M').astype(np.float64)

# Changing DataFrame to numpy arrays
X = X.values
Y = Y.values

# Changing shape of Y
Y = np.resize(Y, (Y.shape[0], 1))

# Normalize the data features
X = normalize(X, axis=0)

# More preprocessing, preaparing data to be fed to our model
X_train = X[:int(0.7*X.shape[0]), :].T
Y_train = Y[:int(0.7*Y.shape[0]), :].T
X_test = X[int(0.7*X.shape[0]):, :].T
Y_test = Y[int(0.7*Y.shape[0]):, :].T

# Let's run the model now!!!
model(X_train, Y_train, X_test, Y_test, print_cost=True)


Things left for another day:

• Analysis of logistic regression.
• How to choose appropriate threshold for the logistic regression.
• Prevent overfitting and underfitting (Regularization, one of the methods).
• Why logistic regression performs poorly on data whose classes cannot be linearly separated.
• How to tweak logistic regression to support multiclass classification.
• After getting 95% test accuracy, why we need to experiment more to get accurate results (cross-validation).