See the latest book content here.

2 Logistic Regression Type Neural Networks

Learning outcomes from this chapter

Logistic regression view as a shallow Neural Network
Maximum Likelihood, loss function, cross-entropy
Softmax regression/ multinomial regression model as a Multiclass Perceptron.
Optimisation procedure: gradient descent, stochastic gradient descent, Mini-Batches
Understand the forward pass and backpropogration step
Implementation from first principles

2.1 Logistic regression view as a shallow Neural Network

2.1.1 Sigmoid function

The sigmoid function $σ (\cdot)$ , also known as the logistic function, is defined as follows:

$\forall z \in R, σ (z) = \frac{1}{1 + e^{- z}} \in] 0, 1 [$

Figure 2.1: Sigmoid function

z <- seq(-5, 5, 0.01)
sigma = 1 / (1 + exp(-z))

plot(sigma~z,type="l",ylab=expression(sigma(z)))

2.1.2 Logistic regression

The logistic regression is a probabilistic model that aims to predict the probability that the outcome variable $y$ is 1. It is defined by assuming that $y | x; θ \sim Bernoulli (ϕ)$ . Then, the logistic regression is defined by applying the soft sigmoid function to the linear predictor $θ^{T} x$ :

$ϕ = h_{θ} (x) = p (y = 1 | x; θ) = \frac{1}{1 + \exp (- θ^{T} x)} = σ (θ^{T} x)$

The logistic regression is also presented:

$Logit [h_{θ} (x)] = l o g i t [p (y = 1 | x; θ)] = θ^{T} x$ where $Logit (p) = l o g (\frac{p}{1 - p})$ .

Remark about notation

$x = (x_{0}, \dots, x_{d})^{T}$ represent a vector of $d + 1$ features/predictors and by convention $x_{0} = 1$
$θ = (θ_{0}, \dots, θ_{d})^{T}$ is the vector of parameter related to the features $x$
$θ_{0}$ is called the intercept by the statistician and named bias by the computer scientist (noted generally $b$ )

2.1.3 Logistic regression for classification

Let’s play with a simple example with 10 points, and two classes (red and blue)

Figure 2.2: Classify red and blue points

clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x,y,z)
plot(x,y,pch=19,cex=2,col=clr1[z+1])

In order to classify the points, we run a logistic regression to get predictions

model <- glm(z~x+y,data=df,family=binomial)
#summary(model)

Then, we use the fitted model to define our classifier which is defined as attributed the class that is the most likely.

pred_model <- function(x,y){
 predict(model,newdata=data.frame(x=x,
 y=y),type="response")>.5
}

Using our decision rule, we can visualise the produced partition of the space.

Figure 2.3: Partition using the logistic model

x_grid<-seq(0,1,length=101)
y_grid<-seq(0,1,length=101)
z_grid <- outer(x_grid,y_grid,pred_model)
image(x_grid,y_grid,z_grid,col=clr2)
points(x,y,pch=19,cex=2,col=clr1[z+1])

2.1.4 Likelihood of the logistic model

The maximum likelihood estimation procedure is generally used to estimate the parameters of the models $θ_{0}, \dots, θ_{d}$ .

$p (y | x; θ) = {\begin{cases} h_{θ} (x) & if y = 1, and \\ 1 - h_{θ} (x) & otherwise . \end{cases}$ which could be written as

$p (y | x; θ) = h_{θ} (x)^{y} (1 - h_{θ} (x))^{1 - y},$

Consider now the observation of $m$ training samples denoted by ${(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})}$ as i.i.d. observations from the logistic model. The likelihood is

$\begin{array}{rcl} L (θ) & = & \prod_{i = 1}^{m} p (y^{(i)} | x^{(i)}; θ) \\ = & \prod_{i = 1}^{m} h_{θ} (x^{(i)})^{y} (1 - h_{θ} (x^{(i)}))^{1 - y} \end{array}$

Then, the following log likelihood is maximized to the estimates of $θ$ :

$ℓ (θ) = log L (θ) = \sum_{i = 1}^{m} [y^{(i)} \log h_{θ} (x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))]$

2.1.5 Shallow Neural Network

The logistic model can ve viewed as a shallow Neural Network.

This figure used here the same notation as the regression logistic model presented by the statistical point of view. However, in the following we will adopt the notation used the most frequently in deep learning framework.

Figure 2.4: Shallow Neural Network

In this figure, $z = w^{T} z + b = w_{1} x_{1} + \dots + w_{d} x_{d} + b$ is the linear combination of the $d$ features/predictors and $a = σ (Z)$ is called the activation function which is the non-linear part of the Neural Network to get a close prediction $\hat{y} \approx y$ .

Remark: In the sequel, we will adopt the following notation:

$x = (x_{1}, \dots, x_{d})^{T} \in ℜ^{d}$ representz a vector of $d$ features/predictors
$w = (w_{1}, \dots, w_{d})^{T}$ is the vector of weight related to the features $x$
$b$ is called the biais
We consider the observations of $m$ training samples denoted by ${(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})}$

2.1.6 Entropy, Cross-entropy and Kullback-Leibler

Let’s first talk about the cross-entropy which is widely used as loss function for classification purpose.

Cross Entropy (CE) is related to the entropy and the kullback-Leibler risk.

The entropy of a discrete probability distribution $p = (p_{1}, \dots, p_{n})$ is defined as

$H (p) = H (p_{1}, \dots, p_{n}) = - \sum_{i = 1}^{n} p_{i} \log p_{i}$ which is a ``measurement of the disorder or randomness of a system’’.

Kullback and Leibler known also as KL divergence quantifies how similar a probability distribution $p$ is to a candidate distribution $q$ .

$K L (p; q) = - \sum_{i = 1}^{n} p_{i} \log \frac{p_{i}}{q_{i}}$ Note that the $K L$ divergence is not a distance measure as $K L (p; q) \neq K L (q; p)$ . $K L$ is non-negative and zero if and only if $p_{i} = q_{i}$ for all $i$ .

One can easily show that

$K L (p; q) = \underset{cross entropy}{\underset{⏟}{\sum_{i = 1}^{n} p_{i} \log \frac{1}{q_{i}}}} - H (p)$

where the first term of the right part is the cross entropy:

$C E (p, q) = \sum_{i = 1}^{n} p_{i} \log \frac{1}{q_{i}} = - \sum_{i = 1}^{n} p_{i} \log q_{i}$ And we have the relation $C E (p, q) = H (p) + K L (p; q)$

Thus, the cross entropy can be interpreted as the uncertainty implicit in $H (p)$ plus the likelihood that the distribution $p$ could have be generated by the distribution $q$ .

2.1.7 Mathematical expression of the Neural Network:

For one example $x^{(i)}$ , the ouput of this Neural Network is given by:

${\hat{y}}^{(i)} = \underset{\underset{activation function}{\underset{⏟}{a^{(i)}}}}{\underset{⏟}{σ (w^{T} x^{(i)} + b)}},$

where $σ (\cdot)$ is the sigmoid function.

We aim to get the weight $w$ and the biais $b$ such that ${\hat{y}}^{(i)} \approx y^{(i)}$ .

The loss function used for this network is the cross-entropy which is defined for one sample $(x, y)$ :

$L (\hat{y}, y) = - (y \log \hat{y} + (1 - y) \log (1 - \hat{y}))$

Then the Cost function for the entire training data set is:

$J (w, b) = \frac{1}{m} \sum_{i = 1}^{m} L ({\hat{y}}^{(i)}, y^{(i)})$

Further, it is easy to see the connection with the log-likelihood function of the logistic model:

$\begin{array}{rcl} J (w, b) & = & \frac{1}{m} \sum_{i = 1}^{m} L ({\hat{y}}^{(i)}, y^{(i)}) \\ = & - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} \log a^{(i)}) + (1 - y^{(i)}) \log (1 - a^{(i)})] \\ = & - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} \log h_{θ} (x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))] \\ \equiv & - \frac{1}{m} ℓ (θ) \end{array}$

where $b \equiv θ_{0}$ and $w \equiv (θ_{1}, \dots, θ_{d})$ .

The optimization step will be carried using Gradient Descent procedures and extension which will be briefly presented in the sub-section Optimization

2.2 Softmax regression

A softmax regression, also called a multiclass logistic regression, is used to generalized logistic regression when there are more than 2 outcome classes ( $k = 1, \dots, K$ ). The outcome variable is a discrete variable $y$ which can take one of the $K$ values, $y \in {1, \dots, K}$ . The multinomial regression model is also a GLM (Generalized Linear Model) where the distribution of the outcome $y$ is a Multinomial $(1, π)$ where $π = (ϕ_{1}, \dots, ϕ_{K})$ is a vector with probabilities of success for each category. This Multinomial $(1, π)$ is more precisely called categorical distribution.

The multinomial regression model is parameterize by $K - 1$ parameters, $ϕ_{1}, \dots, ϕ_{K}$ , where $ϕ_{i} = p (y = i; ϕ)$ , and $ϕ_{K} = p (y = K; ϕ) = 1 - \sum_{i = 1}^{K - 1} ϕ_{i}$ .

By convention, we set $θ_{K} = 0$ , which makes the Bernoulli parameter $ϕ_{i}$ of each class $i$ be such that

$ϕ_{i} = \frac{\exp (θ_{i}^{T} x)}{\sum_{j = 1}^{K} \exp (θ_{j}^{T} x)}$ where $θ_{1}, \dots, θ_{K - 1} \in ℜ^{d + 1}$ are the parameters of the model. This model is also called softmax regression and generalize the logistic regression. The output of the model is the estimated probability that $p (y = i | x; θ)$ , for every value of $i = 1, \dots, K$ .

2.2.1 Multinomial regression for classification

We illustrate the Multinomial model by considering three classes: red, yellow and blue.

Figure 2.5: Classify for three color points

clr1 <- c(rgb(1,0,0,1),rgb(1,1,0,1),rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2),rgb(1,1,0,.2),rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,2,2,2,1,0,0,1,0,0)
df <- data.frame(x,y,z)
plot(x,y,pch=19,cex=2,col=clr1[z+1])

One can use the R package to run a mutinomial regression model

library(nnet)
model.mult <- multinom(z~x+y,data=df)

# weights:  12 (6 variable)
initial  value 10.986123 
iter  10 value 0.794930
iter  20 value 0.065712
iter  30 value 0.064409
iter  40 value 0.061612
iter  50 value 0.058756
iter  60 value 0.056225
iter  70 value 0.055332
iter  80 value 0.052887
iter  90 value 0.050644
iter 100 value 0.048117
final  value 0.048117 
stopped after 100 iterations

Then, the output gives a predicted probability to the three colours and we attribute the color that is the most likely.

pred_mult <- function(x,y){
res <- predict(model.mult,
newdata=data.frame(x=x,y=y),type="probs")
apply(res,MARGIN=1,which.max)
}

x_grid<-seq(0,1,length=101)
y_grid<-seq(0,1,length=101)
z_grid <- outer(x_grid,y_grid,FUN=pred_mult)

We can now visualize the three regions, the frontier being linear, and the intersection being the equiprobable case.

Figure 2.6: Classifier using multinomial model

image(x_grid,y_grid,z_grid,col=clr2)
points(x,y,pch=19,cex=2,col=clr1[z+1])

2.2.2 Likelihood of the softmax model

The maximum likelihood estimation procedure consists of maximizing the log-likelihood:

$\begin{array}{rcl} ℓ (θ) & = & \sum_{i = 1}^{m} \log p (y^{(i) | x^{(i)}; θ}) \\ = & \sum_{i = 1}^{m} \log \prod_{l = 1}^{K} {(\frac{e^{θ_{l}^{T} x^{(i)}}}{\sum_{j = 1}^{K} e^{θ_{j}^{T} x^{(i)}}})}^{1_{{y^{(i)} = l}}} \end{array}$

2.2.3 Softmax regression as shallow Neural Network

The Softmax regression model can be viewed as a shallow Neural Network.

In this representation, there are $K$ neurons where each neuron is defined by his own set of weights $w_{i} \in ℜ^{d}$ and a bais term $b_{i}$ . The linear part is denoted by $z_{i} = w_{i}^{T} x + b_{i}$ and the non linear part (activation part) is $σ_{i} = a_{i} = \frac{\exp (z_{i})}{\sum_{j = 1}^{K} \exp (z_{j})}$ . Note that the denominator of the activation part is defined using the weights from the other neurons. The output is a vector of probabilities $(a_{1}, \dots, a_{K})$ and the function is used for classification purpose:

$\hat{y} = \underset{i \in {1, \dots, K}}{argmax a_{i}}$

2.2.4 Loss function: cross-entropy for categorical variable

Let consider first one training sample $(x, y)$ . The cross entropy loss for categorical response variable, also called Softmax Loss is defined as:

$\begin{array}{rcl} C E & = & - \sum_{i = 1}^{K} {\tilde{y}}_{i} \log p (y = i) \\ = & - \sum_{i = 1}^{K} {\tilde{y}}_{i} \log a_{i} \\ = & - \sum_{i = 1}^{K} {\tilde{y}}_{i} \log (\frac{\exp (z_{i})}{\sum_{j = 1}^{K} \exp (z_{j})}) \end{array}$ where ${\tilde{y}}_{i} = 1_{{y = i}}$ is a binary variable indicating if $y$ is in the class $i$ .

This expression can be rewritten as

$\begin{array}{rcl} C E & = & - \log \prod_{i = 1}^{K} {(\frac{\exp (z_{i})}{\sum_{j = 1}^{K} \exp (z_{j})})}^{1_{{y = i}}} \end{array}$

Then, the cost function for the $m$ training samples is defined as

$\begin{array}{rcl} J (w, b) & = & - \frac{1}{m} \sum_{i = 1}^{m} \log \prod_{k = 1}^{K} {(\frac{\exp (z_{k}^{(i)})}{\sum_{j = 1}^{K} \exp (z_{j}^{(i)})})}^{1_{{y^{(i)} = k}}} \\ \equiv & - \frac{1}{m} ℓ (θ) \end{array}$

2.3 Optimisation

2.3.1 Gradient Descent

Consider unconstrained, smooth convex optimization

$min_{θ \in ℜ^{d}} f (θ),$

Algorithm : Gradient Descent

Choose initial point $θ^{(0)} \in R^{d}$
Repeat $θ^{(k + 1)} = θ^{(k)} - α . \nabla f (θ^{(k)}), k = 1, 2, 3, \dots$
Stop at some point: $| | θ^{(k + 1)} - θ^{(k)} | |_{2}^{2} < ϵ$

Here, $α$ is called the learning rate.

Suppose that we want to find $x$ that minimizes: $f (x) = 1.2 (x - 2)^{2} + 3.2$

Figure 2.7: Closed from solution (red)

f.x <- function(x) 1.2*(x-2)**2+3.2
curve(1.2*(x-2)**2+3.2,0,4,ylab="fx)")
abline(v=2,col="red")

In general, we cannot find a closed form solution, but can compute $\nabla f (x)$

simple.grad.des <- function(x0,alpha,epsilon=0.00001,max.iter=300){
  tol <- 1; xold <- x0; res <- x0; iter <- 1
  while (tol>epsilon & iter < max.iter){
    xnew <- xold - alpha*2.4*(xold-2)
    tol <- abs(xnew-xold)
    xold <- xnew
  res <- c(res,xnew)
  iter <- iter +1
  }
 return(res)
}
result <- simple.grad.des(0,0.01,max.iter=200)
#result[length(result)]

Convergence with a learning rate=0.01

Figure 2.8: alpha=0.01

curve(1.2*(x-2)**2+3.2,0,4,ylab="fx)")
abline(v=2,col="red")
points(result,f.x(result),col="blue")

Convergence with a learning rate=0.83

Figure 2.9: alpha=0.83

result2 <- simple.grad.des(0,0.83,max.iter=200)
#result2[length(result2)]
curve(1.2*(x-2)**2+3.2,0,4,ylab="fx)")
abline(v=2,col="red")
points(result2,f.x(result2),col="blue",type="o")

2.3.2 Gradient Descent for logistic regression

Given $(x^{(i)}, y^{(i)}) \in ℜ \times {0, 1}$ for $i = 1, \dots, m$ , consider the cross-entropy loss function for this data set:

$f (w) = \frac{1}{m} \sum_{i = 1}^{m} (- y^{(i)} w^{T} x^{(i)} + \log (1 + \exp (w^{T} x^{(i)}))) = \sum_{i = 1}^{m} f_{i} (w)$

The gradient is

$\nabla f (w) = \frac{1}{m} \sum_{i = 1}^{m} (p^{(i)} (w) - y^{(i)}) x^{(i)}$

where $\begin{array}{rcl} p^{(i)} (w)) & = & p (Y = 1 | x^{(i), w}) \\ = & \exp (w^{T} x^{(i)}) / (1 + \exp (w^{T} x^{(i)})), i = 1, \dots, m \end{array}$

Algorithm : Batch Gradient Descent

Initialize $w = (0, \dots, 0)$
Repeat until convergence
- Let $g = (0, \dots, 0)$ be the gradient vector
- for $i = 1 : m$ do $p^{(i)} = \exp (w^{T} x^{(i)}) / (1 + \exp (w^{T} x^{(i)}))$ $e r r o r^{(i)} = p^{(i)} - y_{i}$ $g = g + e r r o r^{(i)} . w^{(i)}$
- end
$w = w - α . g / m$
End repeat until convergence

Note that algorithm uses all samples to compute the gradient. This approach is called batch gradient descent.

2.3.3 Stochastic gradient descent

Algorithm : Stochastic Gradient Descent

Initialize $w = (0, \dots, 0)$
Repeat until convergence

Pick sample $i$

$p^{(i)} = \exp (w^{T} x^{(i)}) / (1 + \exp (w^{T} x^{(i)}))$

$e r r o r^{(i)} = p^{(i)} - y_{i}$

$w = w - α (e r r o r^{(i)} \times x^{(i)})$
End repeat until convergence

Coefficient $w$ is updated after each sample.

Remark: The gradient computation $\nabla f (w) = \sum_{i = 1}^{m} (p^{(i)} (w) - y^{(i)}) x^{(i)}$ is doable when $m$ is moderate, but not when $m \sim 500 m i l l i o n$ .

One batch update costs $O (m d)$
One stochastic update costs $O (d)$

2.3.4 Mini-Batches

In practice, mini-batch is often used to:

Compute gradient based on a small subset of samples
Make update to coefficient vector

2.3.5 Example with logistic regression $d = 2$

Simulate some $m$ samples from true model:

$p (Y = 1 | x^{(i)}, w) = \frac{1}{1 + \exp (- w_{1} x_{1}^{(i)} - w_{2} x^{(i)})}$

set.seed(10)
m <- 5000 ;d <- 2 ;w <- c(0.5,-1.5)
x <- matrix(rnorm(m*2),ncol=2,nrow=m)
ptrue <- 1/(1+exp(-x%*%matrix(w,ncol=1)))
y <- rbinom(m,size=1,prob = ptrue)
(w.est <- coef(glm(y~x[,1]+x[,2]-1,family=binomial)))

##    x[, 1]    x[, 2] 
##  0.557587 -1.569509

The cross-entropy loss for this dataset

Cost.fct <- function(w1,w2) {
  w <- c(w1,w2)
  cost <- sum(-y*x%*%matrix(w,ncol=1)+log(1+exp(x%*%matrix(w,ncol=1))))
  return(cost)
}

Figure 2.10: Contour plot of the Cost function

w1 <- seq(0, 1, 0.05)
w2 <- seq(-2, -1, 0.05)
cost <- outer(w1, w2, function(x,y) mapply(Cost.fct,x,y))
contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)

Implementation of Batch Gradient Descent

sigmoid <- function(x) 1/(1+exp(-x))

batch.GD <- function(theta,alpha,epsilon,iter.max=500){
  tol <- 1
  iter <-1
  res.cost <- Cost.fct(theta[1],theta[2])
  res.theta <- theta
  while (tol > epsilon & iter<iter.max) {
      error <- sigmoid(x%*%matrix(theta,ncol=1))-y
      theta.up <- theta-as.vector(alpha*matrix(error,nrow=1)%*%x)
      res.theta <- cbind(res.theta,theta.up)
      tol <- sum((theta-theta.up)**2)^0.5
      theta <- theta.up
      cost <- Cost.fct(theta[1],theta[2])
      res.cost <- c(res.cost,cost)
      iter <- iter +1
    }
  result <- list(theta=theta,res.theta=res.theta,res.cost=res.cost,iter=iter,tol.theta=tol)
  
  return(result)
}

dim(x);length(y)

## [1] 5000    2

## [1] 5000

Figure 2.11: Convergence Batch Gradient Descent

theta0 <- c(0,-1); alpha=0.001
test <- batch.GD(theta=theta0,alpha,epsilon = 0.0000001)
plot(test$res.cost,ylab="cost function",xlab="iteration",main="alpha=0.01",type="l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")

Figure 2.12: Convergence of BGD

contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record <- as.data.frame(t(test$res.theta))
points(record,col="red",type="o")

Implementation of Stochastic Gradient Descent

Stochastic.GD <- function(theta,alpha,epsilon=0.0001,epoch=50){
  epoch.max <- epoch
  tol <- 1
  epoch <-1
  res.cost <- Cost.fct(theta[1],theta[2])
  res.cost.outer <- res.cost
  res.theta <- theta
  while (tol > epsilon & epoch<epoch.max) {
    for (i in 1:nrow(x)){
      errori <- sigmoid(sum(x[i,]*theta))-y[i]
      xi <- x[i,]
      theta.up <- theta-alpha*errori*xi
      res.theta <- cbind(res.theta,theta.up)
      tol <- sum((theta-theta.up)**2)^0.5
      theta <- theta.up
      cost <- Cost.fct(theta[1],theta[2])
      res.cost <- c(res.cost,cost)
    }
    epoch <- epoch +1
    cost.outer <- Cost.fct(theta[1],theta[2])
    res.cost.outer <- c(res.cost.outer,cost.outer)
    }
  result <- list(theta=theta,res.theta=res.theta,res.cost=res.cost,epoch=epoch,tol.theta=tol)
}

test.SGD <- Stochastic.GD(theta=theta0,alpha,epsilon = 0.0001,epoch=10)

Figure 2.13: Convergence Stochastic Gradient Descent

plot(test.SGD$res.cost,ylab="cost function",xlab="iteration",main="alpha=0.01",type="l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")

Figure 2.14: Convergence of Stochastic Gradient Descent

contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record2 <- as.data.frame(t(test.SGD$res.theta))
points(record2,col="red",lwd=0.5)

Implementation of mini batch Gradient Descent

Mini.Batch <- function (theta,dataTrain, alpha = 0.1, maxIter = 10, nBatch = 2, seed = NULL,intercept=NULL) 
{
    batchRate <- 1/nBatch
    dataTrain <- matrix(unlist(dataTrain), ncol = ncol(dataTrain), byrow = FALSE)
    set.seed(seed)
    dataTrain <- dataTrain[sample(nrow(dataTrain)), ]
    set.seed(NULL)
    
    res.cost <- Cost.fct(theta[1],theta[2])
    res.cost.outer <- res.cost
    res.theta <- theta
    
    if(!is.null(intercept)) dataTrain <- cbind(1, dataTrain)
    
    temporaryTheta <- matrix(ncol = length(theta), nrow = 1)
    theta <- matrix(theta,ncol = length(theta), nrow = 1)
    
    for (iteration in 1:maxIter ) {
        if (iteration%%nBatch == 1 | nBatch == 1) {
            temp <- 1
            x <- nrow(dataTrain) * batchRate
            temp2 <- x
        }
        batch <- dataTrain[temp:temp2, ]
        inputData <- batch[, 1:ncol(batch) - 1]
        outputData <- batch[, ncol(batch)]
        rowLength <- nrow(batch)
        temp <- temp + x
        temp2 <- temp2 + x
        error <- matrix(sigmoid(inputData %*% t(theta)),ncol=1) - outputData
        for (column in 1:length(theta)) {
            term <- error * inputData[, column]
            gradient <- sum(term)#/rowLength
            temporaryTheta[1, column] = theta[1, column] - (alpha * 
                gradient)
        }
        
        theta <- temporaryTheta
        res.theta <- cbind(res.theta,as.vector(theta))
        cost.outer <- Cost.fct(theta[1,1],theta[1,2])
        res.cost.outer <- c(res.cost.outer,cost.outer)
        
    }
    result <- list(theta=theta,res.theta=res.theta,res.cost.outer=res.cost.outer)
    return(result)
}

theta0 <- c(0,-1); alpha=0.001
data.Train <- cbind(x,y)
test.miniBatch <- Mini.Batch(theta=theta0,dataTrain=data.Train, alpha = 0.001, maxIter = 100, nBatch = 10, seed = NULL,intercept=NULL)
##Result frm Mini-Batch
test.miniBatch$theta

##           [,1]      [,2]
## [1,] 0.5515469 -1.561252

Figure 2.15: Convergence Mini Batch

plot(test.miniBatch$res.cost.outer,ylab="cost function",xlab="iteration",main="alpha=0.001",type="l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")

Figure 2.16: Convergence of Stochastic Gradient Descent

contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record3 <- as.data.frame(t(test.miniBatch$res.theta))
points(record3,col="red",lwd=0.5,type="o")

2.4 Chain rule

The univariate chain rule and the multivariate chain rule are the key concepts to calculate the derivative of cost with respect to any weight in the network. In the following a refresher of the different chain rules.

2.4.1 Univariate Chain rule

Univariate chain rule

$\frac{\partial f (g (w))}{\partial w} = \frac{\partial f (g (w))}{\partial g (w)} . \frac{\partial g (w)}{\partial w}$

2.4.2 Multivariate Chain Rule

Part I: Let $z = f (x, y)$ , $x = g (t)$ and $y = h (t)$ , where $f, g$ and $h$ are differentiable functions. Then $z = f (x, y) = f (g (t), h (t)))$ is a function of $t$ , and

$\begin{array}{rcl} \frac{d z}{d t} = \frac{d f}{d t} & = & f_{x} (x, y) \frac{d x}{d t} + f_{y} (x, y) \frac{d y}{d t} \\ = & \frac{\partial f}{\partial x} \frac{d x}{d t} + \frac{\partial f}{\partial y} \frac{d y}{d t} . \end{array}$

Part II:

Let $z = f (x, y)$ , $x = g (s, t)$ and $y = h (s, t)$ , where $f, g$ and $h$ are differentiable functions. Then $z$ is a function of $s$ and $t$ , and

$\frac{\partial z}{\partial s} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial s} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial s}$

and

$\frac{\partial z}{\partial t} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial t}$

Let $z = f (x_{1}, x_{2}, \dots, x_{m})$ be a differentiable function of $m$ variables, where each of the $x_{i}$ is a differentiable function of the variables $t_{1}, t_{2}, \dots, t_{n}$ . Then $z$ is a function of the $t_{i}$ , and

$\frac{\partial z}{\partial t_{i}} = \frac{\partial f}{\partial x_{1}} \frac{\partial x_{1}}{\partial t_{i}} + \frac{\partial f}{\partial x_{2}} \frac{\partial x_{2}}{\partial t_{i}} + \dots + \frac{\partial f}{\partial x_{m}} \frac{\partial x_{m}}{\partial t_{i}} .$

2.5 Forward pass and backpropagation procedures

Backpropagation is the key tool adopted by the neural network community to update the weights. This method exploits the derivative with respect to each weight $w$ using the chain rule (univariate and multivariate rules)

2.5.1 Example with the logistic model using cross-entropy loss

The logistic model can be viewed as a simple neural network with no hidden layer and only a single output node with a sigmoid activation function. The sigmoid activation function $σ (\cdot)$ is applied to the linear combination of the features $z = w^{T} x + b$ and provides the predicted value $a$ that represent the probability that the input $x$ belongs to class one.

Figure 2.17: Forward pass: Logistic model

Forward pass: the forward propagation step consists to get predictions for the training samples and compute the error through a loss function in order to further adapt the weights $w$ and the bias $b$ to decrease the error. This forward pass is going through the following equations:

$z^{(i)} = w^{T} x^{(i)} + b$
$a^{(i)} = σ (z^{(i)}) = \frac{1}{1 + e^{- z^{(i)}}} i = 1, \dots, m$
$L = J (w, b) = - \sum_{i = 1}^{m} y^{(i)} \log (a^{(i)}) + (1 - y^{(i)}) \log (1 - a^{(i)})$

The cost $L = J (w, b)$ is the error we want to reduce by adjusting the weights $w$ and the bais. Variations of the gradient descent algorithm are exploited to update iteratively the parameters. Thus, we have to derive the equations for the gradients on the loss function in order to propagate back the error to adapt the model parameters $w$ and $b$ .

Backward pass based on computation graph:

The chain rule is used and generally illustrated through a computation graph:

Figure 2.18: backpropagation: Logistic model

First to simplify this illustration, remind that:

Thus $\frac{\partial J (w, b)}{\partial w} = \frac{1}{m} \sum_{i = 1}^{m} \frac{\partial L ({\hat{y}}^{(i)}, y^{(i)})}{\partial w}$

To get $\frac{\partial L ({\hat{y}}^{(i)}, y^{(i)})}{\partial w}$ the chain rule is used by considering one sample $(x, y)$ (the notation $^{(i)}$ is ommitted).

Computation graphs are mainly exploited to show dependencies to derive easely the equations for the gradients. Thus to compute the gradient of the cost (lost function), one only need to go back the computation graph and multiply the gradients by each other:

$\frac{\partial L (\hat{y}, y)}{\partial w} = \frac{\partial J (w, b)}{\partial a} . \frac{\partial a}{\partial z} . \frac{\partial z}{\partial w}$

where $a = σ (z)$ and $z = b + w^{T} x$ .

$\frac{\partial J (w, b)}{\partial a} = - \frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a (1 - a)}$
$\frac{\partial σ (z)}{\partial z} = σ (z) (1 - σ (z)) = a (1 - a)$
$\frac{\partial z}{\partial w} = x$

Thus,

$\frac{\partial L (\hat{y}, y)}{\partial w} = x (a - y)$

and so, $\frac{\partial J (w, b)}{\partial w} = \frac{1}{m} \sum_{i = 1}^{m} x^{(i)} (σ (z^{(i)}) - y^{(i)})$ In the same vein, it follows

$\frac{\partial J (w, b)}{\partial b} = \frac{1}{m} \sum_{i = 1}^{m} (σ (z^{(i)}) - y^{(i)})$

2.5.2 Updating weights using Backpropagation

For neural network framework, the weights are updated using gradient descent concepts

$w = w - α \frac{\partial J (w, b)}{\partial w}$

The main steps for updating weights are

1. Take a batch of training sample
2. Forward propagation to get the corresponding cost 
3. Backpropagate the cost to get the gradients
4. update the weights using the gradients
5. Repeat step 1 to 4 for a number of iterations

2.6 Backpropagation for the Softmax Shallow Network

2.6.1 Remind some notations

We consider $K$ class: $y^{(i)} \in {1, \dots, K}$ . Given a sample $x$ we want to estimate $p (y = k | x)$ for each $k = 1, \dots, K$ . The softmax model is defined by $σ_{i} = a_{i} = \frac{\exp (z_{i})}{\sum_{j = 1}^{K} \exp (z_{j})}$ , with $a_{i} = p (y = i | x, W)$ , $z_{i} = b_{i} + w_{i}^{T} x$ and we write $W$ to denote all the weights of our network. Then, $W = (w_{1}, \dots, w_{K})$ a $d \times K$ matrix obtained by concatenating $w_{1}, \dots, w_{K}$ into columns.

2.6.2 Chain rule using cross-entropy loss

Let consider one training sample $(x, y)$ . The cross entropy loss is

$C E = - \sum_{i = 1}^{K} {\tilde{y}}_{i} \log a_{i}$

where ${\tilde{y}}_{i} = 1_{{y = i}}$ is a binary variable indicating if $y$ is in the class $i$ .

To get $\frac{\partial C E}{\partial w_{j}}$ ( $j = 1, \dots, K)$ we need to use the multivariate chain rule:

First, we derive

$\begin{array}{rcl} \frac{\partial C E}{\partial z_{j}} & = & \sum_{k}^{K} \frac{\partial C E}{\partial a_{k}} . \frac{\partial a_{k}}{\partial z_{j}} \\ = & \frac{\partial C E}{\partial a_{j}} . \frac{\partial a_{j}}{\partial z_{j}} - \sum_{k \neq j}^{K} \frac{\partial C E}{\partial a_{k}} . \frac{\partial a_{k}}{\partial z_{j}} \end{array}$

$\frac{\partial C E}{\partial a_{j}} = - \frac{{\tilde{y}}_{j}}{a_{j}}$
if $i = j$

$\frac{\partial a_{i}}{\partial z_{j}} = \frac{\partial \frac{e^{z_{i}}}{\sum_{k = 1}^{K} e^{z_{k}}}}{\partial z_{j}} = \frac{e^{z_{i}} (\sum_{k = 1}^{K} e^{z_{k}} - e^{z_{j}})}{{(\sum_{k = 1}^{K} e^{z_{k}})}^{2}} = \frac{e^{z_{j}}}{\sum_{k = 1}^{K} e^{z_{k}}} \times \frac{(\sum_{k = 1}^{K} e^{z_{k}} - e^{z_{j}})}{\sum_{k = 1}^{K} e^{z_{k}}} = a_{i} (1 - a_{j})$

if $i \neq j$

$\frac{\partial a_{i}}{\partial z_{j}} = \frac{\partial \frac{e^{z_{i}}}{\sum_{k = 1}^{K} e^{z_{k}}}}{\partial z_{j}} = \frac{0 - e^{z_{j}} e^{z_{i}}}{{(\sum_{k = 1}^{K} e^{z_{k}})}^{2}} = \frac{- e^{z_{j}}}{\sum_{k = 1}^{K} e^{z_{k}}} \times \frac{e^{z_{i}}}{\sum_{k = 1}^{K} e^{z_{k}}} = - a_{j} . a_{i}$

So we can rewrite it as

$\frac{\partial a_{i}}{\partial z_{j}} = {\begin{cases} a_{i} (1 - a_{j}) & i f & i = j \\ - a_{j} . a_{i} & i f & i \neq j \end{cases}$

Thus, we get

$\begin{array}{rcl} \frac{\partial C E}{\partial z_{j}} & = & \frac{\partial C E}{\partial a_{j}} . \frac{\partial a_{j}}{\partial z_{j}} - \sum_{k \neq j}^{K} \frac{\partial C E}{\partial a_{k}} . \frac{\partial a_{k}}{\partial z_{j}} \\ = & - {\tilde{y}}_{j} (1 - a_{j}) + \sum_{k \neq j}^{K} {\tilde{y}}_{j} a_{k} \\ = & - {\tilde{y}}_{j} + a_{j} \sum_{k}^{K} {\tilde{y}}_{k} = a_{j} - {\tilde{y}}_{j} \end{array}$

We can now derive the gradient for the weights as:

$\begin{array}{rcl} \frac{\partial C E}{\partial w_{j}} & = & \sum_{k}^{K} \frac{\partial C E}{\partial z_{k}} . \frac{\partial z_{k}}{\partial w_{j}} \\ = & (a_{j} - {\tilde{y}}_{j}) x \end{array}$

In the same way, we get

$\begin{array}{rcl} \frac{\partial C E}{\partial b_{j}} & = & \sum_{k}^{K} \frac{\partial C E}{\partial z_{k}} . \frac{\partial z_{k}}{\partial b_{j}} \\ = & (a_{j} - {\tilde{y}}_{j}) \end{array}$

2.6.3 Computation Graph could help

Let consider a simple example with $K = 3$ ( $y \in {1, 2, 3}$ ) and two features ( $x_{1}$ and $x_{2}$ ). The computational graph for this softmax neural network model help us to visualize dependencies between nodes and then to derive the gradient of the cost (loss) in respect to each parameter ( $w_{j} \in ℜ^{2}$ and $b_{j} \in ℜ$ , $j = 1, 2, 3$ ).

Figure 2.19: Computation Graph: softmax

Let’s write as an example for $\frac{\partial L}{\partial w_{2, 1}}$ :

$\begin{aligned} \frac{\partial L}{\partial w_{2, 1}} & = \sum_{i = 1}^{3} (\frac{\partial L}{\partial a_{i}}) (\frac{\partial a_{i}}{\partial z_{2}}) (\frac{\partial z_{2}}{\partial w_{2, 1}}) \\ = (\frac{\partial L}{\partial a_{1}}) (\frac{\partial a_{1}}{\partial z_{2}}) (\frac{\partial z_{2}}{\partial w_{2, 1}}) + (\frac{\partial L}{\partial a_{2}}) (\frac{\partial a_{2}}{\partial z_{2}}) (\frac{\partial z_{2}}{\partial w_{2, 1}}) + (\frac{\partial L}{\partial a_{3}}) (\frac{\partial a_{3}}{\partial z_{2}}) (\frac{\partial z_{2}}{\partial w_{2, 1}}) \end{aligned}$

In fact, we are summing up the contribution of the change of $w_{2, 1}$ over different “paths” (in red from figure above). When we change $w_{2, 1}$ ; $a_{1} a_{2}$ and $a_{3}$ changes as a result. Then, the change of $a_{1} a_{2}$ and $a_{3}$ affects $L$ . We sum up all the changes $w_{2, 1}$ produced over $a_{1} a_{2}$ and $a_{3}$ to $L$ .

import numpy as np
import matplotlib.pyplot as plt
import sympy
from scipy import optimize

Page built: 2021-03-04 using R version 4.0.3 (2020-10-10)

See the latest book content here.

2 Logistic Regression Type Neural Networks

2.1 Logistic regression view as a shallow Neural NetworkCopy link

2.1.1 Sigmoid functionCopy link

2.1.2 Logistic regressionCopy link

2.1.3 Logistic regression for classificationCopy link

2.1.4 Likelihood of the logistic modelCopy link

2.1.5 Shallow Neural NetworkCopy link

2.1.6 Entropy, Cross-entropy and Kullback-LeiblerCopy link

2.1.7 Mathematical expression of the Neural Network:Copy link

2.2 Softmax regressionCopy link

2.2.1 Multinomial regression for classificationCopy link

2.2.2 Likelihood of the softmax modelCopy link

2.2.3 Softmax regression as shallow Neural NetworkCopy link

2.2.4 Loss function: cross-entropy for categorical variableCopy link

2.3 OptimisationCopy link

2.3.1 Gradient DescentCopy link

2.3.2 Gradient Descent for logistic regressionCopy link

2.3.3 Stochastic gradient descentCopy link

2.3.4 Mini-BatchesCopy link

2.3.5 Example with logistic regression d=2d=2 Copy link

2.4 Chain ruleCopy link

2.4.1 Univariate Chain ruleCopy link

2.4.2 Multivariate Chain RuleCopy link

2.5 Forward pass and backpropagation proceduresCopy link

2.5.1 Example with the logistic model using cross-entropy lossCopy link

2.5.2 Updating weights using BackpropagationCopy link

2.6 Backpropagation for the Softmax Shallow NetworkCopy link

2.6.1 Remind some notationsCopy link

2.6.2 Chain rule using cross-entropy lossCopy link

2.6.3 Computation Graph could helpCopy link