Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
B Cross Entropy and Other Expectations with Logarithms
This appendix expands on basic properties of cross entropy, the KL-divergence, and related
concepts, also in the context of the multivariate normal distribution. It is not meant to be
an extensive review of these concepts but rather provides key definitions, properties, and
results needed for the content of the book.
B.1 Divergences and Entropies
We first define the relative entropy (KL-divergence), cross entropy, and entropy in the
context of discrete probability distributions. We then provide a definition of
the KL-divergence for continuous random variables. Finally, we define the Jensen–Shannon
divergence.
The KL-divergence for Discrete Distributions
Assume two probability distributions $p(\cdot)$ and $q(\cdot)$ over elements in some discrete sets
$\mathcal{X}_p$ and $\mathcal{X}_q$ respectively. That is, $p(x)$ or $q(x)$ denote the respective probabilities, which are
strictly positive unless $x \notin \mathcal{X}_p$, for which $p(x) = 0$ (or similarly $x \notin \mathcal{X}_q$, for which $q(x) = 0$).
A key measure for the proximity between the distributions $p(\cdot)$ and $q(\cdot)$ is the Kullback–Leibler
divergence, also shortened as KL-divergence, and also known as the relative entropy. It is
denoted $D_{\text{KL}}(p \,\|\, q)$ and as long as $\mathcal{X}_p \subseteq \mathcal{X}_q$ it is the expected value of $\log p(X)/q(X)$ where
$X$ is a random variable following the probability law $p(\cdot)$. Namely,
\[
D_{\text{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}_p} p(x) \log \frac{p(x)}{q(x)}. \tag{B.1}
\]
Further, if $\mathcal{X}_p \not\subseteq \mathcal{X}_q$, that is, if there is some element in $\mathcal{X}_p$ that is not in $\mathcal{X}_q$, then by definition
$D_{\text{KL}}(p \,\|\, q) = +\infty$. This definition as infinity is natural since we would otherwise divide by
$0$ for some $q(x)$.
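As a quick numerical illustration of (B.1), the following Python sketch computes the KL-divergence between two discrete distributions given as probability vectors over a common support. The function name and the handling of zero-probability entries are our own, not from the text.

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence between two discrete distributions given as
    probability vectors over a common index set (illustrative helper)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # If q(x) = 0 while p(x) > 0, the divergence is infinite by definition.
    if np.any((q == 0) & (p > 0)):
        return np.inf
    # Terms with p(x) = 0 contribute nothing to the sum.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # strictly positive since p != q
print(kl_divergence(p, p))  # 0.0
```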
Observe that the expression for $D_{\text{KL}}(p \,\|\, q)$ from (B.1) can be decomposed into the difference
of $H(p)$ from $H(p, q)$ via,
\[
D_{\text{KL}}(p \,\|\, q) = \underbrace{\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)}}_{H(p,\,q)} - \underbrace{\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}}_{H(p)}.
\]
Here,
\[
H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) \tag{B.2}
\]
is called the cross entropy of $p$ and $q$ and
\[
H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{B.3}
\]
is called the entropy of $p$. Hence in words, the KL-divergence or relative entropy of $p$ and $q$ is
the cross entropy of $p$ and $q$ with the entropy of $p$ subtracted. Note that in the case where there
are only two values in $\mathcal{X}$, say $0$ and $1$, where we denote $p(1) = p_1$ and $q(1) = q_1$, we have
\[
H(p) = -\bigl(p_1 \log p_1 + (1 - p_1) \log(1 - p_1)\bigr), \tag{B.4}
\]
\[
H(p, q) = -\bigl(p_1 \log q_1 + (1 - p_1) \log(1 - q_1)\bigr). \tag{B.5}
\]
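The binary expressions (B.4) and (B.5) are easy to evaluate directly. The sketch below (helper names are ours) also recovers the KL-divergence as cross entropy minus entropy.

```python
import math

def binary_entropy(p1):
    # H(p) for a two-valued distribution with p(1) = p1, as in (B.4)
    return -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))

def binary_cross_entropy(p1, q1):
    # H(p, q) as in (B.5)
    return -(p1 * math.log(q1) + (1 - p1) * math.log(1 - q1))

p1, q1 = 0.7, 0.6
kl = binary_cross_entropy(p1, q1) - binary_entropy(p1)
print(kl)  # D_KL(p || q): nonnegative, and 0 only when q1 = p1
```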
Some observations are in order. First observe that $D_{\text{KL}}(p \,\|\, q) \ge 0$. Further note that in
general $D_{\text{KL}}(p \,\|\, q) \ne D_{\text{KL}}(q \,\|\, p)$ and similarly $H(p, q) \ne H(q, p)$. Hence as a ``distance
measure'' the KL-divergence is not a true metric since it is not symmetric in its arguments.
Nevertheless, when $p = q$ the KL-divergence is $0$ and similarly the cross entropy equals the
entropy. In addition, it can be shown that $D_{\text{KL}}(p \,\|\, q) = 0$ only when $p = q$. Hence the
KL-divergence may play a role similar to a distance metric in certain applications. In fact,
one may consider a sequence $q^{(1)}, q^{(2)}, \ldots$ which has decreasing $D_{\text{KL}}(p \,\|\, q^{(t)})$ approaching
$0$ as $t \to \infty$. For such a sequence the probability distributions $q^{(t)}$ approach$^{1}$ the target
distribution $p$ since the KL-divergence converges to $0$.
The KL-divergence for Continuous Distributions
The KL-divergence in (B.1) naturally extends to arbitrary probability distributions that are
not necessarily discrete. In our case let us consider continuous multi-dimensional distributions.
In this case $p(\cdot)$ and $q(\cdot)$ are probability densities, and the sets $\mathcal{X}_p$ and $\mathcal{X}_q$ are their respective
supports. Now very similarly to (B.1), as long as $\mathcal{X}_p \subseteq \mathcal{X}_q$ we define,
\[
D_{\text{KL}}(p \,\|\, q) = \int_{x \in \mathcal{X}_p} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{B.6}
\]
The Jensen-Shannon Divergence
A related measure to the KL-divergence which is symmetric in its arguments is the Jensen--Shannon
divergence, denoted $\mathrm{JSD}(p \,\|\, q)$. Either for the discrete or continuous case, it is defined
by considering a mixture distribution with support $\mathcal{X}_p \cup \mathcal{X}_q$,
\[
m(x) = \frac{1}{2}\bigl(p(x) + q(x)\bigr),
\]
and then averaging the KL-divergence between each of the distributions and $m(\cdot)$, namely,
\[
\mathrm{JSD}(p \,\|\, q) = \frac{D_{\text{KL}}(p \,\|\, m) + D_{\text{KL}}(q \,\|\, m)}{2}. \tag{B.7}
\]
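A sketch of (B.7) for discrete distributions represented as probability vectors on a shared index set (helper names are ours, not from the text); note the symmetry in the two arguments, and that the mixture is positive wherever either distribution is.

```python
import numpy as np

def kl(p, q):
    # Discrete KL-divergence as in (B.1); zero-probability terms drop out.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    # Jensen-Shannon divergence via the mixture m = (p + q) / 2, as in (B.7)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return (kl(p, m) + kl(q, m)) / 2

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.9])
print(jsd(p, q))            # finite even though supports differ
print(np.sqrt(jsd(p, q)))   # the Jensen-Shannon distance (a metric)
```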
$^{1}$There are multiple ways to define convergence of such a sequence of probability distributions. The exact
form is out of our scope.
The square root of $\mathrm{JSD}(p \,\|\, q)$, sometimes called the Jensen--Shannon distance, is a metric in
the mathematical sense.
B.2 Computations for Multivariate Normal Distributions
A univariate (single variable) normal, or Gaussian, distribution has a probability density
function,
\[
N(x \,;\, \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for } x \in \mathbb{R},
\]
and is parameterized by $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ which are the mean and variance of the
distribution respectively. The standard normal case has $\mu = 0$ and $\sigma^2 = 1$.
An $m$ dimensional multivariate normal distribution is characterized by a mean vector $\mu \in \mathbb{R}^m$
and a covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ which is assumed to be symmetric and positive definite.
The probability density function (pdf) of a multivariate normal distribution is,
\[
N(x \,;\, \mu, \Sigma) = \frac{1}{(\det \Sigma)^{1/2} (2\pi)^{m/2}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}, \quad \text{for } x \in \mathbb{R}^m,
\]
where $\det \Sigma$ stands for the determinant of a matrix $\Sigma$. There are many useful formulas
associated with this distribution with one particular case being the log-density,
\[
\log N(x \,;\, \mu, \Sigma) = -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{m}{2} \log(2\pi) - \frac{1}{2} \log(\det \Sigma). \tag{B.8}
\]
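The log-density (B.8) can be checked term by term in code. The following sketch (function name is ours) evaluates it for a small example, solving a linear system rather than forming $\Sigma^{-1}$ explicitly.

```python
import numpy as np

def log_mvn_density(x, mu, Sigma):
    # Log-density of the multivariate normal, term by term as in (B.8)
    m = len(mu)
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)   # quadratic form term
            - 0.5 * m * np.log(2 * np.pi)          # -(m/2) log(2 pi)
            - 0.5 * np.log(np.linalg.det(Sigma)))  # -(1/2) log det Sigma

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(log_mvn_density(x, mu, Sigma))
```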
It is also useful to consider the KL-divergence between two multivariate normal distributions.
For short, denote such a distribution as $N_{\mu,\Sigma}$ when the mean vector is $\mu$ and the covariance
matrix is $\Sigma$. Then if we consider two such distributions on $\mathbb{R}^m$ with corresponding mean
vectors $\mu_1$ and $\mu_2$, and corresponding covariance matrices $\Sigma_1$ and $\Sigma_2$, then it is possible to
show that,
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{\mu_2,\Sigma_2}) = \frac{1}{2}\left( (\mu_1 - \mu_2)^\top \Sigma_2^{-1} (\mu_1 - \mu_2) - m + \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + \log \frac{\det(\Sigma_2)}{\det(\Sigma_1)} \right). \tag{B.9}
\]
A particularly useful case is one where $\Sigma_2 = \sigma_2^2 I$ for some constant $\sigma_2^2 > 0$. In this case,
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{\mu_2,\sigma_2^2 I}) = \frac{1}{2\sigma_2^2} \|\mu_1 - \mu_2\|^2 - \frac{m}{2} + \frac{\operatorname{tr}(\Sigma_1)}{2\sigma_2^2} + \frac{m \log \sigma_2^2}{2} - \frac{\log \det(\Sigma_1)}{2}. \tag{B.10}
\]
Furthermore, if the second distribution is standard, i.e., $\mu_2 = 0$ and $\sigma_2^2 = 1$, then
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{0,I}) = \frac{1}{2} \|\mu_1\|^2 - \frac{m}{2} + \frac{\operatorname{tr}(\Sigma_1)}{2} - \frac{\log \det(\Sigma_1)}{2}. \tag{B.11}
\]
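Formula (B.9) is straightforward to code, and the special case (B.11) provides a handy consistency check. The sketch below (function name is ours, not from the text) compares the general formula against the standard-normal special case.

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    # Closed-form KL-divergence between N(mu1, S1) and N(mu2, S2), as in (B.9)
    m = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (d @ S2inv @ d - m + np.trace(S2inv @ S1)
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

mu1 = np.array([1.0, -1.0])
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])

# Against the standard normal N(0, I), this should match (B.11)
kl_std = (0.5 * (mu1 @ mu1) - 1.0
          + 0.5 * np.trace(S1) - 0.5 * np.log(np.linalg.det(S1)))
print(kl_mvn(mu1, S1, np.zeros(2), np.eye(2)), kl_std)  # the two should agree
```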