B Cross Entropy and Other Expectations with Logarithms
This appendix expands on basic properties of cross entropy, the KL-divergence, and related
concepts, also in the context of the multivariate normal distribution. It is not meant to be
an extensive review of these concepts but rather provides key definitions, properties, and
results needed for the content of the book.
B.1 Divergences and Entropies
We first define the relative entropy (KL-divergence), cross entropy, and entropy in the
context of discrete probability distributions. We then provide a definition of the
KL-divergence for continuous random variables. Finally, we define the Jensen–Shannon
divergence.
The KL-divergence for Discrete Distributions
Assume two probability distributions $p(\cdot)$ and $q(\cdot)$ over elements in some discrete sets $X_p$ and $X_q$, respectively. That is, $p(x)$ or $q(x)$ denote the respective probabilities, which are strictly positive unless $x \notin X_p$, for which $p(x) = 0$ (or similarly $x \notin X_q$, for which $q(x) = 0$).
A key measure of the proximity between the distributions $p(\cdot)$ and $q(\cdot)$ is the Kullback–Leibler divergence, also shortened as KL-divergence, and also known as the relative entropy. It is denoted $D_{\mathrm{KL}}(p \,\|\, q)$ and, as long as $X_p \subseteq X_q$, it is the expected value of $\log p(X)/q(X)$ where $X$ is a random variable following the probability law $p(\cdot)$. Namely,
$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in X_p} p(x) \log \frac{p(x)}{q(x)}. \tag{B.1}
$$
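As a concrete check of (B.1), the following sketch computes the KL-divergence between two small discrete distributions represented as dictionaries; the particular distributions are illustrative choices, not taken from the text.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as dicts mapping
    outcomes to strictly positive probabilities. Returns +inf when some
    outcome of p is missing from q, i.e. X_p is not a subset of X_q."""
    total = 0.0
    for x, px in p.items():
        if x not in q:
            return math.inf
        total += px * math.log(px / q[x])
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, q))           # positive, since p and q differ
print(kl_divergence(p, p))           # 0.0: a distribution compared to itself
print(kl_divergence(p, {"a": 1.0}))  # inf: outcome "b" lies outside X_q
```

Note that the sum in (B.1) runs only over $X_p$, so outcomes with $p(x) = 0$ contribute nothing; the infinite case arises exactly when some outcome of $p$ has $q(x) = 0$.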
Further, if $X_p \not\subseteq X_q$, that is, if there is some element in $X_p$ that is not in $X_q$, then by definition $D_{\mathrm{KL}}(p \,\|\, q) = +\infty$. This definition as infinity is natural since we would otherwise divide by $0$ for some $q(x)$.
Observe that the expression for $D_{\mathrm{KL}}(p \,\|\, q)$ from (B.1) can be decomposed into the difference of $H(p)$ from $H(p, q)$ via,
$$
D_{\mathrm{KL}}(p \,\|\, q)
= \underbrace{\sum_{x \in X_p} p(x) \log \frac{1}{q(x)}}_{H(p,\,q)}
\;-\; \underbrace{\sum_{x \in X_p} p(x) \log \frac{1}{p(x)}}_{H(p)}.
$$
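The decomposition of the KL-divergence into the cross entropy $H(p, q)$ minus the entropy $H(p)$ can be verified numerically; the sketch below uses illustrative distributions with full shared support, so all three quantities are finite.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

# H(p, q): expected value under p of log(1/q(x))
cross_entropy = sum(px * math.log(1.0 / q[x]) for x, px in p.items())

# H(p): expected value under p of log(1/p(x))
entropy = sum(px * math.log(1.0 / px) for x, px in p.items())

# D_KL(p || q) computed directly from (B.1)
kl = sum(px * math.log(px / q[x]) for x, px in p.items())

# The decomposition states kl == cross_entropy - entropy
print(cross_entropy - entropy, kl)  # equal up to floating-point error
```

This identity follows term by term, since $\log\frac{p(x)}{q(x)} = \log\frac{1}{q(x)} - \log\frac{1}{p(x)}$ for each $x$.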