Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
B Cross Entropy and Other Expectations with Logarithms
This appendix expands on basic properties of cross entropy, the KL-divergence, and related
concepts, also in the context of the multivariate normal distribution. It is not meant to be
an extensive review of these concepts but rather provides key definitions, properties, and
results needed for the content of the book.
B.1 Divergences and Entropies
We first define the relative entropy (KL-divergence), cross entropy, and entropy in the
context of discrete probability distributions. We then provide a definition of
the KL-divergence for continuous random variables. Finally, we define the Jensen–Shannon
divergence.
The KL-divergence for Discrete Distributions
Assume two probability distributions $p(\cdot)$ and $q(\cdot)$ over elements in some discrete sets
$\mathcal{X}_p$ and $\mathcal{X}_q$ respectively. That is, $p(x)$ or $q(x)$ denote the respective probabilities, which are
strictly positive unless $x \notin \mathcal{X}_p$, for which $p(x) = 0$ (or similarly $x \notin \mathcal{X}_q$, for which $q(x) = 0$).
A key measure for the proximity between the distributions $p(\cdot)$ and $q(\cdot)$ is the Kullback–Leibler
divergence, also shortened as KL-divergence, and also known as the relative entropy. It is
denoted $D_{\text{KL}}(p \,\|\, q)$ and as long as $\mathcal{X}_p \subseteq \mathcal{X}_q$ it is the expected value of $\log p(X)/q(X)$ where
$X$ is a random variable following the probability law $p(\cdot)$. Namely,
\[
D_{\text{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}_p} p(x) \log \frac{p(x)}{q(x)}. \tag{B.1}
\]
Further, if $\mathcal{X}_p \not\subseteq \mathcal{X}_q$, that is, if there is some element in $\mathcal{X}_p$ that is not in $\mathcal{X}_q$, then by definition
$D_{\text{KL}}(p \,\|\, q) = +\infty$. This definition as infinity is natural since we would otherwise divide by
$0$ for some $q(x)$.
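As a quick numerical illustration of (B.1), the following Python sketch computes the KL-divergence between two discrete distributions given as probability vectors over a common support. The function name and the handling of zero-probability entries are our own, not from the text.

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence between two discrete distributions given as
    probability vectors over a common index set (illustrative helper)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # If q(x) = 0 while p(x) > 0, the divergence is infinite by definition.
    if np.any((q == 0) & (p > 0)):
        return np.inf
    # Terms with p(x) = 0 contribute nothing to the sum.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # strictly positive since p != q
print(kl_divergence(p, p))  # 0.0
```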
Observe that the expression for $D_{\text{KL}}(p \,\|\, q)$ from (B.1) can be decomposed into the difference
of $H(p)$ from $H(p, q)$ via,
\[
D_{\text{KL}}(p \,\|\, q) = \underbrace{\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)}}_{H(p,\,q)} - \underbrace{\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}}_{H(p)}.
\]
Here,
\[
H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) \tag{B.2}
\]
is called the cross entropy of $p$ and $q$ and
\[
H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{B.3}
\]
is called the entropy of $p$. Hence in words, the KL-divergence or relative entropy of $p$ and $q$ is
the cross entropy of $p$ and $q$ with the entropy of $p$ subtracted. Note that in the case where there
are only two values in $\mathcal{X}$, say $0$ and $1$, where we denote $p(1) = p_1$ and $q(1) = q_1$, we have
\[
H(p) = -\bigl(p_1 \log p_1 + (1 - p_1) \log(1 - p_1)\bigr), \tag{B.4}
\]
\[
H(p, q) = -\bigl(p_1 \log q_1 + (1 - p_1) \log(1 - q_1)\bigr). \tag{B.5}
\]
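The binary expressions (B.4) and (B.5) are easy to evaluate directly. The sketch below (helper names are ours) also recovers the KL-divergence as cross entropy minus entropy.

```python
import math

def binary_entropy(p1):
    # H(p) for a two-valued distribution with p(1) = p1, as in (B.4)
    return -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))

def binary_cross_entropy(p1, q1):
    # H(p, q) as in (B.5)
    return -(p1 * math.log(q1) + (1 - p1) * math.log(1 - q1))

p1, q1 = 0.7, 0.6
kl = binary_cross_entropy(p1, q1) - binary_entropy(p1)
print(kl)  # D_KL(p || q): nonnegative, and 0 only when q1 = p1
```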
Some observations are in order. First observe that $D_{\text{KL}}(p \,\|\, q) \ge 0$. Further note that in
general $D_{\text{KL}}(p \,\|\, q) \ne D_{\text{KL}}(q \,\|\, p)$ and similarly $H(p, q) \ne H(q, p)$. Hence as a ``distance
measure'' the KL-divergence is not a true metric since it is not symmetric in its arguments.
Nevertheless, when $p = q$ the KL-divergence is $0$ and similarly the cross entropy equals the
entropy. In addition, it can be shown that $D_{\text{KL}}(p \,\|\, q) = 0$ only when $p = q$. Hence the
KL-divergence may play a role similar to a distance metric in certain applications. In fact,
one may consider a sequence $q^{(1)}, q^{(2)}, \ldots$ which has decreasing $D_{\text{KL}}(p \,\|\, q^{(t)})$ approaching
$0$ as $t \to \infty$. For such a sequence the probability distributions $q^{(t)}$ approach$^{1}$ the target
distribution $p$ since the KL-divergence converges to $0$.
The KL-divergence for Continuous Distributions
The KL-divergence in (B.1) naturally extends to arbitrary probability distributions that are
not necessarily discrete. In our case let us consider continuous multi-dimensional distributions.
In this case $p(\cdot)$ and $q(\cdot)$ are probability densities, and the sets $\mathcal{X}_p$ and $\mathcal{X}_q$ are their respective
supports. Now very similarly to (B.1), as long as $\mathcal{X}_p \subseteq \mathcal{X}_q$ we define,
\[
D_{\text{KL}}(p \,\|\, q) = \int_{x \in \mathcal{X}_p} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{B.6}
\]
The Jensen-Shannon Divergence
A related measure to the KL-divergence which is symmetric in its arguments is the Jensen--Shannon
divergence, denoted $\mathrm{JSD}(p \,\|\, q)$. Either for the discrete or continuous case, it is defined
by considering a mixture distribution with support $\mathcal{X}_p \cup \mathcal{X}_q$,
\[
m(x) = \frac{1}{2}\bigl(p(x) + q(x)\bigr),
\]
and then averaging the KL-divergence between each of the distributions and $m(\cdot)$, namely,
\[
\mathrm{JSD}(p \,\|\, q) = \frac{D_{\text{KL}}(p \,\|\, m) + D_{\text{KL}}(q \,\|\, m)}{2}. \tag{B.7}
\]
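A sketch of (B.7) for discrete distributions represented as probability vectors on a shared index set (helper names are ours, not from the text); note the symmetry in the two arguments, and that the mixture is positive wherever either distribution is.

```python
import numpy as np

def kl(p, q):
    # Discrete KL-divergence as in (B.1); zero-probability terms drop out.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    # Jensen-Shannon divergence via the mixture m = (p + q) / 2, as in (B.7)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return (kl(p, m) + kl(q, m)) / 2

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.9])
print(jsd(p, q))            # finite even though supports differ
print(np.sqrt(jsd(p, q)))   # the Jensen-Shannon distance (a metric)
```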
$^{1}$There are multiple ways to define convergence of such a sequence of probability distributions. The exact
form is out of our scope.
The square root of $\mathrm{JSD}(p \,\|\, q)$, sometimes called the Jensen--Shannon distance, is a metric in
the mathematical sense.
B.2 Computations for Multivariate Normal Distributions
A univariate (single variable) normal, or Gaussian, distribution has a probability density
function,
\[
N(x \,;\, \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for } x \in \mathbb{R},
\]
and is parameterized by $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ which are the mean and variance of the
distribution respectively. The standard normal case has $\mu = 0$ and $\sigma^2 = 1$.
An $m$ dimensional multivariate normal distribution is characterized by a mean vector $\mu \in \mathbb{R}^m$
and a covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ which is assumed to be symmetric and positive definite.
The probability density function (pdf) of a multivariate normal distribution is,
\[
N(x \,;\, \mu, \Sigma) = \frac{1}{(\det \Sigma)^{1/2} (2\pi)^{m/2}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}, \quad \text{for } x \in \mathbb{R}^m,
\]
where $\det \Sigma$ stands for the determinant of a matrix $\Sigma$. There are many useful formulas
associated with this distribution with one particular case being the log-density,
\[
\log N(x \,;\, \mu, \Sigma) = -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{m}{2} \log(2\pi) - \frac{1}{2} \log(\det \Sigma). \tag{B.8}
\]
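The log-density (B.8) can be checked term by term in code. The following sketch (function name is ours) evaluates it for a small example, solving a linear system rather than forming $\Sigma^{-1}$ explicitly.

```python
import numpy as np

def log_mvn_density(x, mu, Sigma):
    # Log-density of the multivariate normal, term by term as in (B.8)
    m = len(mu)
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)   # quadratic form term
            - 0.5 * m * np.log(2 * np.pi)          # -(m/2) log(2 pi)
            - 0.5 * np.log(np.linalg.det(Sigma)))  # -(1/2) log det Sigma

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(log_mvn_density(x, mu, Sigma))
```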
It is also useful to consider the KL-divergence between two multivariate normal distributions.
For short, denote such a distribution as $N_{\mu,\Sigma}$ when the mean vector is $\mu$ and the covariance
matrix is $\Sigma$. Then if we consider two such distributions on $\mathbb{R}^m$ with corresponding mean
vectors $\mu_1$ and $\mu_2$, and corresponding covariance matrices $\Sigma_1$ and $\Sigma_2$, then it is possible to
show that,
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{\mu_2,\Sigma_2}) = \frac{1}{2}\left( (\mu_1 - \mu_2)^\top \Sigma_2^{-1} (\mu_1 - \mu_2) - m + \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + \log \frac{\det(\Sigma_2)}{\det(\Sigma_1)} \right). \tag{B.9}
\]
A particularly useful case is one where $\Sigma_2 = \sigma_2^2 I$ for some constant $\sigma_2^2 > 0$. In this case,
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{\mu_2,\sigma_2^2 I}) = \frac{1}{2\sigma_2^2} \|\mu_1 - \mu_2\|^2 - \frac{m}{2} + \frac{\operatorname{tr}(\Sigma_1)}{2\sigma_2^2} + \frac{m \log \sigma_2^2}{2} - \frac{\log \det(\Sigma_1)}{2}. \tag{B.10}
\]
Furthermore, if the second distribution is standard, i.e., $\mu_2 = 0$ and $\sigma_2^2 = 1$, then
\[
D_{\text{KL}}(N_{\mu_1,\Sigma_1} \,\|\, N_{0,I}) = \frac{1}{2} \|\mu_1\|^2 - \frac{m}{2} + \frac{\operatorname{tr}(\Sigma_1)}{2} - \frac{\log \det(\Sigma_1)}{2}. \tag{B.11}
\]
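Formula (B.9) is straightforward to code, and the special case (B.11) provides a handy consistency check. The sketch below (function name is ours, not from the text) compares the general formula against the standard-normal special case.

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    # Closed-form KL-divergence between N(mu1, S1) and N(mu2, S2), as in (B.9)
    m = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (d @ S2inv @ d - m + np.trace(S2inv @ S1)
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

mu1 = np.array([1.0, -1.0])
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])

# Against the standard normal N(0, I), this should match (B.11)
kl_std = (0.5 * (mu1 @ mu1) - 1.0
          + 0.5 * np.trace(S1) - 0.5 * np.log(np.linalg.det(S1)))
print(kl_mvn(mu1, S1, np.zeros(2), np.eye(2)), kl_std)  # the two should agree
```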