Learning outcomes from this chapter
The sigmoid function $\sigma(\cdot)$, also known as the logistic function, is defined as follows:
$$\forall z \in \mathbb{R}, \quad \sigma(z) = \frac{1}{1+e^{-z}} \in \; ]0,1[$$
Figure 2.1: Sigmoid function
z <- seq(-5, 5, 0.01)
sigma <- 1/(1 + exp(-z))
plot(sigma ~ z, type = "l", ylab = expression(sigma(z)))
Logistic regression is a probabilistic model that aims to predict the probability that the outcome variable $y$ is 1. It is defined by assuming that $y|x;\theta \sim \text{Bernoulli}(\phi)$. Then, the logistic regression is defined by applying the sigmoid function to the linear predictor $\theta^T x$:
$$\phi = h_\theta(x) = p(y=1|x;\theta) = \frac{1}{1+\exp(-\theta^T x)} = \sigma(\theta^T x)$$
The logistic regression can also be presented through the logit link:
$$\text{logit}[h_\theta(x)] = \text{logit}[p(y=1|x;\theta)] = \theta^T x, \quad \text{where } \text{logit}(p) = \log\left(\frac{p}{1-p}\right).$$
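In R, the sigmoid and its inverse, the logit, are available as plogis() and qlogis(); a quick sanity check that the logit undoes the sigmoid:

z0 <- 0.7
all.equal(qlogis(plogis(z0)), z0)  # TRUE: logit(sigma(z)) = z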
Remark about notation
Let's play with a simple example with 10 points and two classes (red and blue).
Figure 2.2: Classify red and blue points
clr1 <- c(rgb(1,0,0,1), rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2), rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x,y,z)
plot(x, y, pch = 19, cex = 2, col = clr1[z+1])
In order to classify the points, we run a logistic regression to get predictions:
model <- glm(z ~ x + y, data = df, family = binomial)
#summary(model)
Then, we use the fitted model to define our classifier, which assigns the class that is the most likely.
pred_model <- function(x, y){
  predict(model, newdata = data.frame(x = x, y = y), type = "response") > .5
}
Using our decision rule, we can visualise the resulting partition of the space.
Figure 2.3: Partition using the logistic model
x_grid <- seq(0, 1, length = 101)
y_grid <- seq(0, 1, length = 101)
z_grid <- outer(x_grid, y_grid, pred_model)
image(x_grid, y_grid, z_grid, col = clr2)
points(x,y,pch=19,cex=2,col=clr1[z+1])
The maximum likelihood estimation procedure is generally used to estimate the parameters $\theta_0,\dots,\theta_d$ of the model.
$$p(y|x;\theta) = \begin{cases} h_\theta(x) & \text{if } y=1, \text{ and}\\ 1-h_\theta(x) & \text{otherwise,} \end{cases}$$
which can be written as
$$p(y|x;\theta) = h_\theta(x)^y \,(1-h_\theta(x))^{1-y},$$
Consider now the observation of $m$ training samples denoted by $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$ as i.i.d. observations from the logistic model. The likelihood is
$$L(\theta) = \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}\,(1-h_\theta(x^{(i)}))^{1-y^{(i)}}$$
Then, the following log-likelihood is maximized to obtain the estimates of $\theta$:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$$
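This maximization can also be done numerically. A minimal sketch, reusing the red/blue toy data above and R's general-purpose optimizer optim(); the estimates should be close to those returned by glm():

neg.loglik <- function(theta){
  eta <- theta[1] + theta[2]*x + theta[3]*y   # linear predictor theta^T x
  p <- 1/(1 + exp(-eta))                      # h_theta(x)
  -sum(z*log(p) + (1 - z)*log(1 - p))         # minus the log-likelihood
}
optim(c(0, 0, 0), neg.loglik)$par             # numerical MLE
coef(glm(z ~ x + y, data = df, family = binomial))  # for comparison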
The logistic model can be viewed as a shallow Neural Network.
This figure uses the same notation as the logistic regression model presented from the statistical point of view. However, in the following we will adopt the notation most frequently used in the deep learning framework.
Figure 2.4: Shallow Neural Network
In this figure, $z = w^T x + b = w_1 x_1 + \dots + w_d x_d + b$ is the linear combination of the $d$ features/predictors, and $a = \sigma(z)$ is called the activation function; it is the non-linear part of the Neural Network used to get a close prediction $\hat{y} \approx y$.
Remark: In the sequel, we will adopt the following notation:
Let's first talk about the cross-entropy, which is widely used as a loss function for classification purposes.
Cross-Entropy (CE) is related to the entropy and the Kullback-Leibler divergence.
The entropy of a discrete probability distribution $p = (p_1,\dots,p_n)$ is defined as
$$H(p) = H(p_1,\dots,p_n) = -\sum_{i=1}^n p_i \log p_i,$$
which is a "measurement of the disorder or randomness of a system".
The Kullback-Leibler divergence, also known as the KL divergence, quantifies how similar a probability distribution $p$ is to a candidate distribution $q$:
$$KL(p;q) = \sum_{i=1}^n p_i \log\frac{p_i}{q_i}$$
Note that the KL divergence is not a distance measure, as $KL(p;q) \neq KL(q;p)$. KL is non-negative, and zero if and only if $p_i = q_i$ for all $i$.
One can easily show that
$$KL(p;q) = \underbrace{\sum_{i=1}^n p_i \log\frac{1}{q_i}}_{\text{cross entropy}} - H(p)$$
where the first term on the right-hand side is the cross-entropy:
$$CE(p,q) = \sum_{i=1}^n p_i \log\frac{1}{q_i} = -\sum_{i=1}^n p_i \log q_i$$
And we have the relation $CE(p,q) = H(p) + KL(p;q)$.
Thus, the cross-entropy can be interpreted as the uncertainty implicit in $H(p)$ plus the likelihood that the distribution $p$ could have been generated by the distribution $q$.
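This decomposition is easy to verify numerically. A small illustrative check on two made-up distributions $p$ and $q$:

p <- c(0.2, 0.5, 0.3)
q <- c(0.3, 0.4, 0.3)
H  <- -sum(p*log(p))      # entropy of p
CE <- -sum(p*log(q))      # cross-entropy CE(p,q)
KL <- sum(p*log(p/q))     # KL divergence KL(p;q)
all.equal(CE, H + KL)     # TRUE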
For one example $x^{(i)}$, the output of this Neural Network is given by:
$$\hat{y}^{(i)} = \underbrace{\sigma(w^T x^{(i)} + b)}_{a^{(i)}\text{, the activation}},$$
where $\sigma(\cdot)$ is the sigmoid function.
We aim to get the weights $w$ and the bias $b$ such that $\hat{y}^{(i)} \approx y^{(i)}$.
The loss function used for this network is the cross-entropy, which is defined for one sample $(x,y)$ as:
$$\mathcal{L}(\hat{y},y) = -\left( y\log\hat{y} + (1-y)\log(1-\hat{y}) \right)$$
Then the Cost function for the entire training data set is:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$
Further, it is easy to see the connection with the log-likelihood function of the logistic model:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)}) \right] = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] \equiv -\frac{1}{m}\ell(\theta)$$
where $b \equiv \theta_0$ and $w \equiv (\theta_1,\dots,\theta_d)$.
The optimization step will be carried out using gradient descent procedures and their extensions, which are briefly presented in the sub-section Optimization.
A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes ($k = 1,\dots,K$). The outcome variable is a discrete variable $y$ which can take one of the $K$ values, $y \in \{1,\dots,K\}$. The multinomial regression model is also a GLM (Generalized Linear Model) where the distribution of the outcome $y$ is a Multinomial$(1,\pi)$, where $\pi = (\phi_1,\dots,\phi_K)$ is a vector with the probabilities of success for each category. This Multinomial$(1,\pi)$ is more precisely called the categorical distribution.
The multinomial regression model is parameterized by $K-1$ parameters, $\phi_1,\dots,\phi_{K-1}$, where $\phi_i = p(y=i;\phi)$ and $\phi_K = p(y=K;\phi) = 1 - \sum_{i=1}^{K-1}\phi_i$.
By convention, we set $\theta_K = 0$, which makes the parameter $\phi_i$ of each class $i$ be such that
$$\phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^K \exp(\theta_j^T x)}$$
where $\theta_1,\dots,\theta_{K-1} \in \mathbb{R}^{d+1}$ are the parameters of the model. This model is also called softmax regression and generalizes the logistic regression. The output of the model is the estimated probability $p(y=i|x;\theta)$, for every value of $i = 1,\dots,K$.
We illustrate the Multinomial model by considering three classes: red, yellow and blue.
Figure 2.5: Classify for three color points
clr1 <- c(rgb(1,0,0,1), rgb(1,1,0,1), rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2), rgb(1,1,0,.2), rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,2,2,2,1,0,0,1,0,0)
df <- data.frame(x,y,z)
plot(x, y, pch = 19, cex = 2, col = clr1[z+1])
One can use the R package nnet to run a multinomial regression model:
library(nnet)
model.mult <- multinom(z ~ x + y, data = df)
# weights: 12 (6 variable)
initial value 10.986123
iter 10 value 0.794930
iter 20 value 0.065712
iter 30 value 0.064409
iter 40 value 0.061612
iter 50 value 0.058756
iter 60 value 0.056225
iter 70 value 0.055332
iter 80 value 0.052887
iter 90 value 0.050644
iter 100 value 0.048117
final value 0.048117
stopped after 100 iterations
Then, the output gives a predicted probability for each of the three colours, and we attribute the colour that is the most likely.
pred_mult <- function(x, y){
  res <- predict(model.mult, newdata = data.frame(x = x, y = y), type = "probs")
  apply(res, MARGIN = 1, which.max)
}
x_grid <- seq(0, 1, length = 101)
y_grid <- seq(0, 1, length = 101)
z_grid <- outer(x_grid, y_grid, FUN = pred_mult)
We can now visualize the three regions, the frontiers being linear, and the intersection being the equiprobable case.
Figure 2.6: Classifier using multinomial model
image(x_grid,y_grid,z_grid,col=clr2)
points(x,y,pch=19,cex=2,col=clr1[z+1])
The maximum likelihood estimation procedure consists of maximizing the log-likelihood:
$$\ell(\theta) = \sum_{i=1}^m \log p(y^{(i)}|x^{(i)};\theta) = \sum_{i=1}^m \log \prod_{l=1}^K \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^K e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)}=l\}}$$
The Softmax regression model can be viewed as a shallow Neural Network.
In this representation, there are $K$ neurons, where each neuron is defined by its own set of weights $w_i \in \mathbb{R}^d$ and a bias term $b_i$. The linear part is denoted by $z_i = w_i^T x + b_i$ and the non-linear part (activation part) is $\sigma_i = a_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}$. Note that the denominator of the activation part is defined using the weights from the other neurons. The output is a vector of probabilities $(a_1,\dots,a_K)$ and the following function is used for classification purposes:
$$\hat{y} = \underset{i \in \{1,\dots,K\}}{\operatorname{argmax}}\; a_i$$
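A minimal sketch of this decision rule in R, with hypothetical weights W0 (a $d \times K$ matrix) and biases b0. Subtracting the maximum of $z$ before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged:

softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))  # stable softmax
W0 <- matrix(rnorm(2*3), nrow = 2, ncol = 3)  # hypothetical weights, d = 2, K = 3
b0 <- rep(0, 3)                               # hypothetical biases
x0 <- c(0.5, -0.2)                            # one input
a0 <- softmax(as.vector(t(W0) %*% x0 + b0))   # probabilities (a_1,...,a_K)
which.max(a0)                                 # predicted class: argmax_i a_i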
Let us first consider one training sample $(x,y)$. The cross-entropy loss for a categorical response variable, also called the Softmax Loss, is defined as:
$$CE = -\sum_{i=1}^K \tilde{y}_i \log p(y=i) = -\sum_{i=1}^K \tilde{y}_i \log a_i = -\sum_{i=1}^K \tilde{y}_i \log\left( \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)} \right)$$
where $\tilde{y}_i = 1\{y=i\}$ is a binary variable indicating whether $y$ is in class $i$.
This expression can be rewritten as
$$CE = -\log \prod_{i=1}^K \left( \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)} \right)^{1\{y=i\}}$$
Then, the cost function for the $m$ training samples is defined as
$$J(w,b) = -\frac{1}{m}\sum_{i=1}^m \log \prod_{k=1}^K \left( \frac{\exp(z_k^{(i)})}{\sum_{j=1}^K \exp(z_j^{(i)})} \right)^{1\{y^{(i)}=k\}} \equiv -\frac{1}{m}\ell(\theta)$$
Consider unconstrained, smooth convex optimization
$$\min_{\theta \in \mathbb{R}^d} f(\theta),$$
Algorithm : Gradient Descent
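The algorithm reduces to a single update, repeated until convergence (the standard form, consistent with the implementations below):
$$\theta \leftarrow \theta - \alpha \nabla f(\theta)$$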
Here, α is called the learning rate.
Suppose that we want to find $x$ that minimizes $f(x) = 1.2(x-2)^2 + 3.2$.
Figure 2.7: Closed form solution (red)
f.x <- function(x) 1.2*(x-2)**2 + 3.2
curve(1.2*(x-2)**2+3.2, 0, 4, ylab = "f(x)")
abline(v = 2, col = "red")
In general, we cannot find a closed form solution, but we can compute $\nabla f(x)$.
simple.grad.des <- function(x0, alpha, epsilon = 0.00001, max.iter = 300){
  tol <- 1; xold <- x0; res <- x0; iter <- 1
  while (tol > epsilon & iter < max.iter){
    xnew <- xold - alpha*2.4*(xold - 2)  # gradient step: f'(x) = 2.4*(x-2)
    tol <- abs(xnew - xold)
    xold <- xnew
    res <- c(res, xnew)
    iter <- iter + 1
  }
  return(res)
}
result <- simple.grad.des(0, 0.01, max.iter = 200)
#result[length(result)]
Convergence with a learning rate = 0.01
Figure 2.8: alpha=0.01
curve(1.2*(x-2)**2+3.2, 0, 4, ylab = "f(x)")
abline(v=2,col="red")
points(result,f.x(result),col="blue")
Convergence with a learning rate = 0.83
Figure 2.9: alpha=0.83
result2 <- simple.grad.des(0, 0.83, max.iter = 200)
#result2[length(result2)]
curve(1.2*(x-2)**2+3.2, 0, 4, ylab = "f(x)")
abline(v=2,col="red")
points(result2,f.x(result2),col="blue",type="o")
Given $(x^{(i)}, y^{(i)}) \in \mathbb{R}^d \times \{0,1\}$ for $i=1,\dots,m$, consider the cross-entropy loss function for this data set:
$$f(w) = \frac{1}{m}\sum_{i=1}^m \left( -y^{(i)} w^T x^{(i)} + \log(1+\exp(w^T x^{(i)})) \right) = \sum_{i=1}^m f_i(w)$$
The gradient is
$$\nabla f(w) = \frac{1}{m}\sum_{i=1}^m \left( p^{(i)}(w) - y^{(i)} \right) x^{(i)}$$
where $p^{(i)}(w) = p(Y=1|x^{(i)},w) = \exp(w^T x^{(i)})/(1+\exp(w^T x^{(i)}))$, $i=1,\dots,m$.
Algorithm : Batch Gradient Descent
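In batch form, each iteration applies the generic update with the full gradient derived above (a sketch, consistent with the batch implementation later in this section):
Initialize $w = (0,\dots,0)$
Repeat until convergence: $w = w - \alpha \nabla f(w)$, with $\nabla f(w) = \frac{1}{m}\sum_{i=1}^m (p^{(i)}(w) - y^{(i)})\,x^{(i)}$ computed over all $m$ samples.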
Note that the algorithm uses all $m$ samples to compute the gradient. This approach is called batch gradient descent.
Algorithm : Stochastic Gradient Descent
Initialize $w = (0,\dots,0)$
Repeat until convergence
  Pick sample $i$
  $p^{(i)} = \exp(w^T x^{(i)})/(1+\exp(w^T x^{(i)}))$
  $error^{(i)} = p^{(i)} - y^{(i)}$
  $w = w - \alpha \left( error^{(i)} \times x^{(i)} \right)$
End repeat until convergence
Coefficient w is updated after each sample.
Remark: The gradient computation $\nabla f(w) = \sum_{i=1}^m (p^{(i)}(w) - y^{(i)})\,x^{(i)}$ is doable when $m$ is moderate, but not when $m \sim 500$ million.
One batch update costs $O(md)$.
One stochastic update costs $O(d)$.
In practice, mini-batch gradient descent is often used: the gradient is computed on a small subset of the samples at each step, balancing the stability of batch updates with the low per-iteration cost of stochastic updates.
As an illustration, we simulate data from the logistic model
$$p(Y=1|x^{(i)},w) = \frac{1}{1+\exp(-w_1 x_1^{(i)} - w_2 x_2^{(i)})}$$
set.seed(10)
m <- 5000; d <- 2; w <- c(0.5, -1.5)
x <- matrix(rnorm(m*2), ncol = 2, nrow = m)
ptrue <- 1/(1+exp(-x %*% matrix(w, ncol = 1)))
y <- rbinom(m, size = 1, prob = ptrue)
(w.est <- coef(glm(y ~ x[,1] + x[,2] - 1, family = binomial)))
## x[, 1] x[, 2]
## 0.557587 -1.569509
Cost.fct <- function(w1, w2) {
  w <- c(w1, w2)
  cost <- sum(-y*x %*% matrix(w, ncol = 1) + log(1 + exp(x %*% matrix(w, ncol = 1))))
  return(cost)
}
Figure 2.10: Contour plot of the Cost function
w1 <- seq(0, 1, 0.05)
w2 <- seq(-2, -1, 0.05)
cost <- outer(w1, w2, function(x, y) mapply(Cost.fct, x, y))
contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
sigmoid <- function(x) 1/(1+exp(-x))

batch.GD <- function(theta, alpha, epsilon, iter.max = 500){
  tol <- 1
  iter <- 1
  res.cost <- Cost.fct(theta[1], theta[2])
  res.theta <- theta
  while (tol > epsilon & iter < iter.max) {
    error <- sigmoid(x %*% matrix(theta, ncol = 1)) - y
    theta.up <- theta - as.vector(alpha*matrix(error, nrow = 1) %*% x)
    res.theta <- cbind(res.theta, theta.up)
    tol <- sum((theta - theta.up)**2)^0.5
    theta <- theta.up
    cost <- Cost.fct(theta[1], theta[2])
    res.cost <- c(res.cost, cost)
    iter <- iter + 1
  }
  result <- list(theta = theta, res.theta = res.theta, res.cost = res.cost, iter = iter, tol.theta = tol)
  return(result)
}
dim(x);length(y)
## [1] 5000 2
## [1] 5000
Figure 2.11: Convergence Batch Gradient Descent
theta0 <- c(0, -1); alpha = 0.001
test <- batch.GD(theta = theta0, alpha, epsilon = 0.0000001)
plot(test$res.cost, ylab = "cost function", xlab = "iteration", main = "alpha=0.001", type = "l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")
Figure 2.12: Convergence of BGD
contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record <- as.data.frame(t(test$res.theta))
points(record, col = "red", type = "o")
Stochastic.GD <- function(theta, alpha, epsilon = 0.0001, epoch = 50){
  epoch.max <- epoch
  tol <- 1
  epoch <- 1
  res.cost <- Cost.fct(theta[1], theta[2])
  res.cost.outer <- res.cost
  res.theta <- theta
  while (tol > epsilon & epoch < epoch.max) {
    for (i in 1:nrow(x)){
      errori <- sigmoid(sum(x[i,]*theta)) - y[i]
      xi <- x[i,]
      theta.up <- theta - alpha*errori*xi
      res.theta <- cbind(res.theta, theta.up)
      tol <- sum((theta - theta.up)**2)^0.5
      theta <- theta.up
      cost <- Cost.fct(theta[1], theta[2])
      res.cost <- c(res.cost, cost)
    }
    epoch <- epoch + 1
    cost.outer <- Cost.fct(theta[1], theta[2])
    res.cost.outer <- c(res.cost.outer, cost.outer)
  }
  result <- list(theta = theta, res.theta = res.theta, res.cost = res.cost, epoch = epoch, tol.theta = tol)
  return(result)
}
test.SGD <- Stochastic.GD(theta = theta0, alpha, epsilon = 0.0001, epoch = 10)
Figure 2.13: Convergence Stochastic Gradient Descent
plot(test.SGD$res.cost, ylab = "cost function", xlab = "iteration", main = "alpha=0.001", type = "l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")
Figure 2.14: Convergence of Stochastic Gradient Descent
contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record2 <- as.data.frame(t(test.SGD$res.theta))
points(record2, col = "red", lwd = 0.5)
Mini.Batch <- function(theta, dataTrain, alpha = 0.1, maxIter = 10, nBatch = 2, seed = NULL, intercept = NULL){
  batchRate <- 1/nBatch
  dataTrain <- matrix(unlist(dataTrain), ncol = ncol(dataTrain), byrow = FALSE)
  set.seed(seed)
  dataTrain <- dataTrain[sample(nrow(dataTrain)), ]
  set.seed(NULL)
  res.cost <- Cost.fct(theta[1], theta[2])
  res.cost.outer <- res.cost
  res.theta <- theta
  if(!is.null(intercept)) dataTrain <- cbind(1, dataTrain)
  temporaryTheta <- matrix(ncol = length(theta), nrow = 1)
  theta <- matrix(theta, ncol = length(theta), nrow = 1)
  for (iteration in 1:maxIter) {
    if (iteration %% nBatch == 1 | nBatch == 1) {
      temp <- 1
      x <- nrow(dataTrain) * batchRate
      temp2 <- x
    }
    batch <- dataTrain[temp:temp2, ]
    inputData <- batch[, 1:ncol(batch) - 1]
    outputData <- batch[, ncol(batch)]
    rowLength <- nrow(batch)
    temp <- temp + x
    temp2 <- temp2 + x
    error <- matrix(sigmoid(inputData %*% t(theta)), ncol = 1) - outputData
    for (column in 1:length(theta)) {
      term <- error * inputData[, column]
      gradient <- sum(term) #/rowLength
      temporaryTheta[1, column] = theta[1, column] - (alpha * gradient)
    }
    theta <- temporaryTheta
    res.theta <- cbind(res.theta, as.vector(theta))
    cost.outer <- Cost.fct(theta[1,1], theta[1,2])
    res.cost.outer <- c(res.cost.outer, cost.outer)
  }
  result <- list(theta = theta, res.theta = res.theta, res.cost.outer = res.cost.outer)
  return(result)
}
theta0 <- c(0, -1); alpha = 0.001
data.Train <- cbind(x, y)
test.miniBatch <- Mini.Batch(theta = theta0, dataTrain = data.Train, alpha = 0.001, maxIter = 100, nBatch = 10, seed = NULL, intercept = NULL)
## Result from Mini-Batch
test.miniBatch$theta
## [,1] [,2]
## [1,] 0.5515469 -1.561252
Figure 2.15: Convergence Mini Batch
plot(test.miniBatch$res.cost.outer,ylab="cost function",xlab="iteration",main="alpha=0.001",type="l")
abline(h=Cost.fct(w.est[1],w.est[2]),col="red")
Figure 2.16: Convergence of Mini-Batch Gradient Descent
contour(x = w1, y = w2, z = cost)
points(x=w.est[1],y=w.est[2],col="black",lwd=2,lty=2,pch=8)
record3 <- as.data.frame(t(test.miniBatch$res.theta))
points(record3, col = "red", lwd = 0.5, type = "o")
The univariate chain rule and the multivariate chain rule are the key concepts for calculating the derivative of the cost with respect to any weight in the network. In the following, a refresher on the different chain rules.
$$\frac{\partial f(g(w))}{\partial w} = \frac{\partial f(g(w))}{\partial g(w)} \cdot \frac{\partial g(w)}{\partial w}$$
$$\frac{dz}{dt} = \frac{df}{dt} = f_x(x,y)\frac{dx}{dt} + f_y(x,y)\frac{dy}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$
$$\frac{\partial z}{\partial s} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial s}$$
and
$$\frac{\partial z}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$$
$$\frac{\partial z}{\partial t_i} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t_i} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t_i} + \cdots + \frac{\partial f}{\partial x_m}\frac{\partial x_m}{\partial t_i}.$$
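As a quick worked example (hypothetical, chosen for illustration): take $z = f(x_1,x_2) = x_1^2 x_2$ with $x_1 = t$ and $x_2 = t^2$. The multivariate rule gives
$$\frac{dz}{dt} = \frac{\partial f}{\partial x_1}\frac{dx_1}{dt} + \frac{\partial f}{\partial x_2}\frac{dx_2}{dt} = 2x_1 x_2 \cdot 1 + x_1^2 \cdot 2t = 2t^3 + 2t^3 = 4t^3,$$
which matches differentiating $z = t^4$ directly.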
Backpropagation is the key tool adopted by the neural network community to update the weights. This method exploits the chain rule (univariate and multivariate) to compute the derivative of the cost with respect to each weight $w$.
The logistic model can be viewed as a simple neural network with no hidden layer and only a single output node with a sigmoid activation function. The sigmoid activation function $\sigma(\cdot)$ is applied to the linear combination of the features, $z = w^T x + b$, and provides the predicted value $a$, which represents the probability that the input $x$ belongs to class one.
Figure 2.17: Forward pass: Logistic model
Forward pass: the forward propagation step consists of getting predictions for the training samples and computing the error through a loss function, in order to further adapt the weights $w$ and the bias $b$ to decrease the error. The forward pass goes through the following equations:
$$z^{(i)} = w^T x^{(i)} + b$$
$$a^{(i)} = \sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}, \quad i = 1,\dots,m$$
$$\mathcal{L} = J(w,b) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)}) \right]$$
The cost $\mathcal{L} = J(w,b)$ is the error we want to reduce by adjusting the weights $w$ and the bias $b$. Variations of the gradient descent algorithm are exploited to update the parameters iteratively. Thus, we have to derive the equations for the gradients of the loss function in order to propagate the error back and adapt the model parameters $w$ and $b$.
Backward pass based on computation graph:
The chain rule is used and generally illustrated through a computation graph:
Figure 2.18: Backpropagation: Logistic model
First, to simplify this illustration, recall that:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)}) \right]$$
Thus
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^m \frac{\partial \mathcal{L}(\hat{y}^{(i)},y^{(i)})}{\partial w}$$
To get $\frac{\partial \mathcal{L}(\hat{y}^{(i)},y^{(i)})}{\partial w}$, the chain rule is used by considering one sample $(x,y)$ (the notation $(i)$ is omitted).
Computation graphs are mainly exploited to show dependencies, which makes it easy to derive the equations for the gradients. Thus, to compute the gradient of the cost (loss function), one only needs to go back through the computation graph and multiply the gradients by each other:
$$\frac{\partial \mathcal{L}(\hat{y},y)}{\partial w} = \frac{\partial J(w,b)}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
where $a = \sigma(z)$ and $z = b + w^T x$.
Thus, since $\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$, $\frac{\partial a}{\partial z} = a(1-a)$ and $\frac{\partial z}{\partial w} = x$, the product of the three terms simplifies to
$$\frac{\partial \mathcal{L}(\hat{y},y)}{\partial w} = x(a-y)$$
and so,
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^m x^{(i)}\left( \sigma(z^{(i)}) - y^{(i)} \right)$$
In the same vein, it follows that
$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^m \left( \sigma(z^{(i)}) - y^{(i)} \right)$$
In the neural network framework, the weights are updated using gradient descent concepts:
$$w = w - \alpha\frac{\partial J(w,b)}{\partial w}$$
The main steps for updating the weights are (a small R sketch follows the list):
1. Take a batch of training sample
2. Forward propagation to get the corresponding cost
3. Backpropagate the cost to get the gradients
4. Update the weights using the gradients
5. Repeat step 1 to 4 for a number of iterations
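A minimal sketch of these steps for the logistic network, assuming a feature matrix X (m x d) and a 0/1 label vector y01, using the gradients derived above (illustrative, not the chapter's reference implementation):

train.logistic <- function(X, y01, alpha = 0.1, n.iter = 100){
  m <- nrow(X); w <- rep(0, ncol(X)); b <- 0
  for (it in 1:n.iter){
    z <- as.vector(X %*% w + b)            # forward pass: linear part
    a <- 1/(1 + exp(-z))                   # activation a = sigma(z)
    cost <- -mean(y01*log(a) + (1 - y01)*log(1 - a))  # cross-entropy J(w,b)
    dw <- as.vector(t(X) %*% (a - y01))/m  # dJ/dw = (1/m) sum x(i)(a(i)-y(i))
    db <- mean(a - y01)                    # dJ/db = (1/m) sum (a(i)-y(i))
    w <- w - alpha*dw; b <- b - alpha*db   # gradient descent update
  }
  list(w = w, b = b, cost = cost)
}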
We consider $K$ classes: $y^{(i)} \in \{1,\dots,K\}$. Given a sample $x$, we want to estimate $p(y=k|x)$ for each $k = 1,\dots,K$. The softmax model is defined by
$$\sigma_i = a_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)},$$
with $a_i = p(y=i|x,W)$ and $z_i = b_i + w_i^T x$, where $W$ denotes all the weights of our network. Then $W = (w_1,\dots,w_K)$ is a $d \times K$ matrix obtained by concatenating $w_1,\dots,w_K$ into columns.
Let us consider one training sample $(x,y)$. The cross-entropy loss is
$$CE = -\sum_{i=1}^K \tilde{y}_i \log a_i$$
where $\tilde{y}_i = 1\{y=i\}$ is a binary variable indicating whether $y$ is in class $i$.
To get $\frac{\partial CE}{\partial w_j}$ ($j = 1,\dots,K$), we need to use the multivariate chain rule:
First, we derive
$$\frac{\partial CE}{\partial z_j} = \sum_{k}^{K} \frac{\partial CE}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_j} = \frac{\partial CE}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} + \sum_{k\neq j}^{K} \frac{\partial CE}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_j}$$
$$\frac{\partial CE}{\partial a_j} = -\frac{\tilde{y}_j}{a_j}$$
If $i = j$:
$$\frac{\partial a_i}{\partial z_j} = \frac{\partial}{\partial z_j}\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} = \frac{e^{z_i}\left(\sum_{k=1}^K e^{z_k} - e^{z_j}\right)}{\left(\sum_{k=1}^K e^{z_k}\right)^2} = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \times \frac{\sum_{k=1}^K e^{z_k} - e^{z_j}}{\sum_{k=1}^K e^{z_k}} = a_i(1-a_j)$$
If $i \neq j$:
$$\frac{\partial a_i}{\partial z_j} = \frac{\partial}{\partial z_j}\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} = \frac{0 - e^{z_j}e^{z_i}}{\left(\sum_{k=1}^K e^{z_k}\right)^2} = -\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \times \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} = -a_j \, a_i$$
So we can rewrite it as
$$\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i(1-a_j) & \text{if } i = j \\ -a_j \, a_i & \text{if } i \neq j \end{cases}$$
Thus, we get
$$\frac{\partial CE}{\partial z_j} = \frac{\partial CE}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} + \sum_{k\neq j}^{K} \frac{\partial CE}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_j} = -\tilde{y}_j(1-a_j) + \sum_{k\neq j}^{K} \tilde{y}_k \, a_j = -\tilde{y}_j + a_j\sum_{k}^{K} \tilde{y}_k = a_j - \tilde{y}_j$$
We can now derive the gradient for the weights as:
$$\frac{\partial CE}{\partial w_j} = \sum_{k}^{K} \frac{\partial CE}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_j} = (a_j - \tilde{y}_j)\,x$$
In the same way, we get
$$\frac{\partial CE}{\partial b_j} = \sum_{k}^{K} \frac{\partial CE}{\partial z_k} \cdot \frac{\partial z_k}{\partial b_j} = (a_j - \tilde{y}_j)$$
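These gradients are straightforward to compute in practice. A minimal sketch for one sample, reusing the hypothetical W0, b0, x0 and the softmax() helper from the earlier sketch:

z0 <- as.vector(t(W0) %*% x0 + b0)
a0 <- softmax(z0)                 # probabilities (a_1,...,a_K)
y.tilde <- as.numeric(1:3 == 2)   # one-hot encoding, true class 2 (hypothetical)
dz <- a0 - y.tilde                # dCE/dz_j = a_j - y_tilde_j
dW <- outer(x0, dz)               # dCE/dw_j = (a_j - y_tilde_j) x, one column per j
db <- dz                          # dCE/db_j = (a_j - y_tilde_j)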
Let us consider a simple example with $K = 3$ ($y \in \{1,2,3\}$) and two features ($x_1$ and $x_2$). The computational graph for this softmax neural network model helps us visualize the dependencies between nodes and then derive the gradient of the cost (loss) with respect to each parameter ($w_j \in \mathbb{R}^2$ and $b_j \in \mathbb{R}$, $j = 1,2,3$).
Figure 2.19: Computation Graph: softmax
Let's write, as an example, $\frac{\partial \mathcal{L}}{\partial w_{2,1}}$:
$$\frac{\partial \mathcal{L}}{\partial w_{2,1}} = \sum_{i=1}^3 \left(\frac{\partial \mathcal{L}}{\partial a_i}\right)\left(\frac{\partial a_i}{\partial z_2}\right)\left(\frac{\partial z_2}{\partial w_{2,1}}\right) = \left(\frac{\partial \mathcal{L}}{\partial a_1}\right)\left(\frac{\partial a_1}{\partial z_2}\right)\left(\frac{\partial z_2}{\partial w_{2,1}}\right) + \left(\frac{\partial \mathcal{L}}{\partial a_2}\right)\left(\frac{\partial a_2}{\partial z_2}\right)\left(\frac{\partial z_2}{\partial w_{2,1}}\right) + \left(\frac{\partial \mathcal{L}}{\partial a_3}\right)\left(\frac{\partial a_3}{\partial z_2}\right)\left(\frac{\partial z_2}{\partial w_{2,1}}\right)$$
In fact, we are summing up the contributions of the change of $w_{2,1}$ over the different "paths" (in red in the figure above). When we change $w_{2,1}$, then $a_1$, $a_2$ and $a_3$ change as a result. The changes in $a_1$, $a_2$ and $a_3$ in turn affect $\mathcal{L}$: we sum up all the changes that $w_{2,1}$ produces through $a_1$, $a_2$ and $a_3$ on $\mathcal{L}$.