Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
A Some Multivariable Calculus
This appendix provides key results and notation from multivariable calculus. It is not an exhaustive summary of multivariable calculus but rather contains the results needed for the contents of the book.
A.1 Vectors and Functions in $\mathbb{R}^n$
Denote the set of all real numbers by $\mathbb{R}$ and the real coordinate space of dimension $n$ by $\mathbb{R}^n$. Each element of $\mathbb{R}^n$ is an $n$-dimensional vector, interpreted as a column of the form
\[
u = (u_1, \ldots, u_n) = [u_1 \cdots u_n]^\top = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}.
\]
The Euclidean norm of $u \in \mathbb{R}^n$, measuring the geometric length of $u$ and also known as the $L_2$ norm, is
\[
\|u\|_2 = \sqrt{u^\top u} = \Bigg(\sum_{i=1}^{n} u_i^2\Bigg)^{1/2}.
\]
Here the scalar $u^\top v$ is the inner product between two vectors $u, v \in \mathbb{R}^n$. A normalized form of the inner product, called the cosine of the angle between the two vectors, sometimes simply denoted $\cos \theta$, is
\[
\cos \theta = \frac{u^\top v}{\|u\|_2 \, \|v\|_2}. \tag{A.1}
\]
The Euclidean norm is a special case of the $L_p$ norm, which is defined via
\[
\|u\|_p = \Bigg(\sum_{i=1}^{n} |u_i|^p\Bigg)^{1/p},
\]
for $p \ge 1$. When $p$ in $\|\cdot\|_p$ is not specified, we interpret $\|\cdot\|$ as the $L_2$ norm.
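As a quick numerical illustration (a sketch, not part of the book's text), the following computes the $L_2$ norm, a general $L_p$ norm, and the cosine of the angle in (A.1) with NumPy; the vectors `u` and `v` are arbitrary example choices.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# L2 norm: (sum of squares)^(1/2)
norm_u = np.sqrt(np.sum(u**2))  # equals 5.0 for this u

# General Lp norm for p >= 1
def lp_norm(x, p):
    return np.sum(np.abs(x)**p)**(1.0 / p)

# Cosine of the angle between u and v, as in (A.1)
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Here `u @ v` is the inner product $u^\top v$, and `np.linalg.norm` defaults to the $L_2$ norm.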
Focusing on the $L_2$ norm and the inner product $u^\top v$, the Cauchy-Schwarz inequality is
\[
|u^\top v| \le \|u\| \, \|v\|, \tag{A.2}
\]
where the two sides are equal if and only if $u$ and $v$ are linearly dependent (that is, $u = c\,v$ for some $c \in \mathbb{R}$). Also, the Euclidean distance (or simply the distance) between $u$ and $v$ is
defined as
\[
\|u - v\| = \Bigg(\sum_{i=1}^{n} (u_i - v_i)^2\Bigg)^{1/2}.
\]
An important consequence of the Cauchy-Schwarz inequality is that the Euclidean norm satisfies the triangle inequality: for any $u, v \in \mathbb{R}^n$,
\[
\|u + v\| \le \|u\| + \|v\|. \tag{A.3}
\]
To see this, observe that
\[
\|u + v\|^2 = \|u\|^2 + \|v\|^2 + 2 u^\top v \le \|u\|^2 + \|v\|^2 + 2\|u\|\|v\| = (\|u\| + \|v\|)^2.
\]
Convergence of a sequence of vectors can be defined via scalar convergence of the distance. That is, a sequence of vectors $u^{(1)}, u^{(2)}, \ldots$ in $\mathbb{R}^n$ is said to converge to a vector $u \in \mathbb{R}^n$, denoted via $\lim_{k \to \infty} u^{(k)} = u$, if
\[
\lim_{k \to \infty} \|u^{(k)} - u\| = 0.
\]
That is, if for every $\varepsilon > 0$ there exists an $N_0$ such that for all $k \ge N_0$, $\|u^{(k)} - u\| < \varepsilon$.
Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $n$-dimensional multivariate function that maps each vector $u = (u_1, \ldots, u_n) \in \mathbb{R}^n$ to a real number. Then, the function is said to be continuous at $u \in \mathbb{R}^n$ if for any sequence $u^{(1)}, u^{(2)}, \ldots$ such that $\lim_{k \to \infty} u^{(k)} = u$, we have that
\[
\lim_{k \to \infty} f(u^{(k)}) = f(u).
\]
Alternatively, $f$ is continuous at $u \in \mathbb{R}^n$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that
\[
|f(u) - f(v)| < \varepsilon,
\]
for every $v \in \mathbb{R}^n$ with $\|u - v\| < \delta$. Continuity of $f$ at $u$ implies that the values of $f$ at $u$ and at $v$ can be made arbitrarily close by setting the point $v$ to be arbitrarily close to $u$.
We can extend the above continuity definitions to multivariate vector valued functions of the form $f : \mathbb{R}^n \to \mathbb{R}^m$ that map every $n$ dimensional real-valued vector to an $m$ dimensional real-valued vector. Such functions can be written as
\[
f(u) = [f_1(u) \cdots f_m(u)]^\top, \tag{A.4}
\]
where $f_i : \mathbb{R}^n \to \mathbb{R}$ for each $i = 1, \ldots, m$. Then, the function $f$ is called continuous at $u$ if each $f_i$ is continuous at $u$. We say that the function $f$ is continuous on a set $U \subseteq \mathbb{R}^n$ if $f$ is continuous at each point in $U$.
A.2 Derivatives
Consider an $n$-dimensional multivariate function $f : \mathbb{R}^n \to \mathbb{R}$. The partial derivative $\partial f(u)/\partial u_i$ of $f$ with respect to $u_i$ is the derivative taken with respect to the variable $u_i$ while keeping all other variables constant. That is,
\[
\frac{\partial f(u)}{\partial u_i} = \lim_{h \to 0} \frac{f(u_1, \ldots, u_{i-1}, u_i + h, u_{i+1}, \ldots, u_n) - f(u)}{h}. \tag{A.5}
\]
Suppose that the partial derivative (A.5) exists for all $i = 1, \ldots, n$. Then the gradient of $f$ at $u$, denoted by $\nabla f(u)$ or $\frac{\partial f(u)}{\partial u}$, is a concatenation of the partial derivatives of $f$ with respect to all its variables, and it is expressed as a vector:
\[
\nabla f(u) = \frac{\partial f(u)}{\partial u} = \Bigg[\frac{\partial f(u)}{\partial u_1} \cdots \frac{\partial f(u)}{\partial u_n}\Bigg]^\top. \tag{A.6}
\]
The gradient $\nabla f(u)$ is a vector capturing the direction of steepest ascent at $u$. Further, $h\|\nabla f(u)\|$ is the first-order approximation of the increase in $f$ when moving in that direction for an infinitesimal distance $h$.
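The limit in (A.5) suggests a simple numerical check of the gradient. The sketch below, using a hypothetical example function $f(u) = u_1^2 + 3u_1 u_2$ (an assumption for illustration, not from the text), approximates each partial derivative by a finite difference and compares against the exact gradient $\nabla f(u) = [2u_1 + 3u_2,\ 3u_1]^\top$.

```python
import numpy as np

def f(u):
    # Example function (illustrative choice): f(u) = u1^2 + 3 u1 u2
    return u[0]**2 + 3.0 * u[0] * u[1]

def numerical_gradient(f, u, h=1e-6):
    """Approximate each partial derivative via the difference quotient in (A.5)."""
    grad = np.zeros_like(u)
    for i in range(len(u)):
        e = np.zeros_like(u)
        e[i] = h
        grad[i] = (f(u + e) - f(u)) / h
    return grad

u = np.array([1.0, 2.0])
grad_exact = np.array([2 * u[0] + 3 * u[1], 3 * u[0]])
grad_num = numerical_gradient(f, u)
```

The finite-difference estimate agrees with the exact gradient up to an error of order $h$.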
In some situations, instead of a vector form, the variables of the function are represented as a matrix. In that scenario, multivariate functions are of the form $f : \mathbb{R}^{n \times m} \to \mathbb{R}$, that is, $f$ maps matrices $U = (u_{i,j})$ of dimension $n \times m$ to real values $f(U)$. If the partial derivative $\partial f(U)/\partial u_{i,j}$ exists for all $i = 1, \ldots, n$ and $j = 1, \ldots, m$, it is convenient to use the notation $\frac{\partial f(U)}{\partial U}$ to denote the collection of the partial derivatives of $f$ with respect to all its variables as a matrix of the same dimension $n \times m$,
\[
\frac{\partial f(U)}{\partial U} =
\begin{bmatrix}
\frac{\partial f(U)}{\partial u_{1,1}} & \ldots & \frac{\partial f(U)}{\partial u_{1,m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f(U)}{\partial u_{n,1}} & \ldots & \frac{\partial f(U)}{\partial u_{n,m}}
\end{bmatrix}. \tag{A.7}
\]
Directional Derivatives
The directional derivative of $f : \mathbb{R}^n \to \mathbb{R}$ at $u$ in the direction $v \in \mathbb{R}^n$ is the scalar defined by
\[
\nabla_v f(u) = \lim_{h \to 0} \frac{f(u + hv) - f(u)}{h}.
\]
The directional derivative generalizes the notion of the partial derivative. In fact, the partial derivative $\partial f(u)/\partial u_i$ is the directional derivative at $u$ in the direction of the vector $e_i$, which consists of 1 at the $i$-th coordinate and zeros everywhere else. This simply follows from the observation that
\[
\nabla_{e_i} f(u) = \lim_{h \to 0} \frac{f(u_1, \ldots, u_{i-1}, u_i + h, u_{i+1}, \ldots, u_n) - f(u)}{h} = \frac{\partial f(u)}{\partial u_i}.
\]
3
i
i
i
i
i
i
i
i
A Some Multivariable Calculus
As a consequence, if the gradient of $f$ exists at $u$, the directional derivative exists in every direction $v$ and we have
\[
\nabla_v f(u) = v^\top \nabla f(u). \tag{A.8}
\]
One way to see (A.8) in the case of continuity of the partial derivatives is via a Taylor's theorem based first-order approximation (see Theorem A.1):
\[
f(u + hv) = f(u) + (h v)^\top \nabla f(u) + O(h^2),
\]
where $O(h^k)$ denotes a function such that $O(h^k)/h^k$ goes to a constant as $h \to 0$. Thus,
\[
\frac{f(u + hv) - f(u)}{h} = v^\top \nabla f(u) + O(h).
\]
Now take the limit $h \to 0$ on both sides to get (A.8).
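Identity (A.8) is easy to verify numerically. In this sketch, the example function $f(u) = u_1^2 + u_2^3$ and the direction $v$ are illustrative assumptions; the finite-difference quotient is compared against $v^\top \nabla f(u)$.

```python
import numpy as np

def f(u):
    # Example function (illustrative choice): f(u) = u1^2 + u2^3
    return u[0]**2 + u[1]**3

u = np.array([1.0, 2.0])
grad = np.array([2 * u[0], 3 * u[1]**2])  # exact gradient of f at u
v = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit direction

h = 1e-6
directional_limit = (f(u + h * v) - f(u)) / h  # difference-quotient version
directional_formula = v @ grad                  # v^T grad f(u), as in (A.8)
```

Both quantities agree up to an $O(h)$ error, consistent with the derivation above.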
It is useful to note that the directional derivative $\nabla_v f(u)$ is maximal in the direction of the gradient, in the sense that over all unit length vectors $v$, the choice $v = \nabla f(u)/\|\nabla f(u)\|$ maximizes $|\nabla_v f(u)|$. This is a consequence of the Cauchy-Schwarz inequality (A.2):
\[
|\nabla_v f(u)| = |v^\top \nabla f(u)| \le \|v\| \, \|\nabla f(u)\| = \|\nabla f(u)\|.
\]
Setting $v = \nabla f(u)/\|\nabla f(u)\|$ achieves the equality.
Jacobians
The Jacobian is useful for functions of the form $f : \mathbb{R}^n \to \mathbb{R}^m$ as in (A.4), where each $f_i$ is a real-valued function of $u$. The Jacobian of $f$ at $u$, denoted by $J_f(u)$, is the $m \times n$ matrix defined via
\[
J_f(u) =
\begin{bmatrix}
\frac{\partial f_1(u)}{\partial u_1} & \ldots & \frac{\partial f_1(u)}{\partial u_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m(u)}{\partial u_1} & \ldots & \frac{\partial f_m(u)}{\partial u_n}
\end{bmatrix}. \tag{A.9}
\]
In other words, the $i$-th row of the Jacobian is the gradient $\nabla f_i(u)^\top$. In some situations, it is convenient to use the notation $\frac{\partial f(u)}{\partial u}$ to denote the transpose of the Jacobian of $f$ at $u$. That is,
\[
\frac{\partial f(u)}{\partial u} = (J_f(u))^\top. \tag{A.10}
\]
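Definition (A.9) can be exercised numerically by building the Jacobian column by column from difference quotients. The map $f(u) = [u_1 u_2,\ \sin u_1]^\top$ below is a hypothetical example chosen for illustration.

```python
import numpy as np

def f(u):
    """Example map from R^2 to R^2 (illustrative): f(u) = [u1*u2, sin(u1)]."""
    return np.array([u[0] * u[1], np.sin(u[0])])

def jacobian_exact(u):
    # Rows are the gradients of f1 and f2, as in (A.9)
    return np.array([[u[1], u[0]],
                     [np.cos(u[0]), 0.0]])

def jacobian_numerical(f, u, h=1e-6):
    """Build the m x n Jacobian column by column via finite differences."""
    m, n = len(f(u)), len(u)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(u + e) - f(u)) / h
    return J

u = np.array([0.5, 2.0])
```

Comparing `jacobian_numerical(f, u)` against `jacobian_exact(u)` confirms the layout: row $i$ holds the partial derivatives of $f_i$.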
Hessians
Returning to functions of the form $f : \mathbb{R}^n \to \mathbb{R}$, to describe the curvature of the function $f$ at a given $u \in \mathbb{R}^n$, it is important to consider the second-order partial derivatives at $u$.
These partial derivatives are arranged as an $n \times n$ matrix, called the Hessian and defined by
\[
\nabla^2 f(u) = \frac{\partial \nabla f(u)}{\partial u} =
\begin{bmatrix}
\frac{\partial^2 f}{\partial u_1^2} & \frac{\partial^2 f}{\partial u_1 \partial u_2} & \cdots & \frac{\partial^2 f}{\partial u_1 \partial u_n} \\
\frac{\partial^2 f}{\partial u_2 \partial u_1} & \frac{\partial^2 f}{\partial u_2^2} & \cdots & \frac{\partial^2 f}{\partial u_2 \partial u_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial u_n \partial u_1} & \frac{\partial^2 f}{\partial u_n \partial u_2} & \cdots & \frac{\partial^2 f}{\partial u_n^2}
\end{bmatrix}, \tag{A.11}
\]
where $\frac{\partial^2 f}{\partial u_i \partial u_j} = \frac{\partial}{\partial u_i}\Big(\frac{\partial f}{\partial u_j}\Big)$. Note that if all the second-order partial derivatives are continuous at $u$, then the Hessian $\nabla^2 f(u)$ is a symmetric matrix. That is, for all $i, j \in \{1, \ldots, n\}$,
\[
\frac{\partial^2 f}{\partial u_i \partial u_j} = \frac{\partial^2 f}{\partial u_j \partial u_i}.
\]
This result is known as Schwarz's theorem or Clairaut's theorem. Observe that using the Jacobian, we can treat the Hessian as the Jacobian of the gradient vector. That is,
\[
\nabla^2 f(u) = J_{\nabla f}(u).
\]
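A small numerical sketch of (A.11) and the symmetry statement: the function $f(u) = u_1^2 u_2 + e^{u_1 u_2}$ is a hypothetical smooth example, its Hessian is approximated by second differences and compared against the closed form.

```python
import numpy as np

def f(u):
    # Example smooth function (illustrative): f(u) = u1^2 u2 + exp(u1 u2)
    return u[0]**2 * u[1] + np.exp(u[0] * u[1])

def hessian_numerical(f, u, h=1e-5):
    """Approximate the Hessian entries by second-order finite differences."""
    n = len(u)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(u + ei + ej) - f(u + ei) - f(u + ej) + f(u)) / h**2
    return H

u = np.array([0.3, 0.7])
H = hessian_numerical(f, u)

# Closed-form Hessian of this particular f
ex = np.exp(u[0] * u[1])
H_exact = np.array([
    [2 * u[1] + u[1]**2 * ex, 2 * u[0] + (1 + u[0] * u[1]) * ex],
    [2 * u[0] + (1 + u[0] * u[1]) * ex, u[0]**2 * ex],
])
```

Since the second-order partials are continuous here, the off-diagonal entries agree, as Schwarz's theorem guarantees.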
Certain attributes of optimization problems are often defined via positive (semi)definiteness of the Hessian $\nabla^2 f(\theta)$ at $\theta$. In particular, a symmetric matrix $A$ is said to be positive semidefinite if for all $\phi \in \mathbb{R}^d$,
\[
\phi^\top A \, \phi \ge 0. \tag{A.12}
\]
Furthermore, $A$ is said to be positive definite if the inequality in (A.12) is strict for all $\phi \in \mathbb{R}^d \setminus \{0\}$. Note that the matrix $A$ is called negative semidefinite (respectively, negative definite) when $-A$ is positive semidefinite (respectively, positive definite).
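In practice, positive definiteness of a symmetric matrix is usually checked through its eigenvalues (all strictly positive) rather than by testing (A.12) directly. The sketch below, with two illustrative $2 \times 2$ matrices, does both; the matrices and the random test vector are assumptions for the example.

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Symmetric A is positive definite iff all its eigenvalues are > 0."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # eigenvalues 1 and 3: positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])  # eigenvalues -1 and 3: indefinite

# Spot-check the quadratic form in (A.12) for one vector phi
rng = np.random.default_rng(0)
phi = rng.standard_normal(2)
quad = phi @ A @ phi  # nonnegative for any phi when A is positive semidefinite
```

A single vector `phi` cannot prove definiteness, but the eigenvalue test characterizes it exactly for symmetric matrices.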
Differentiability
A multivariate vector valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be differentiable at $u \in \mathbb{R}^n$ if there is an $m \times n$ dimensional matrix $A$ such that
\[
\lim_{v \to u} \frac{\|f(u) - f(v) - A(u - v)\|}{\|u - v\|} = 0.
\]
Here the limit notation $v \to u$ implies that the limit exists for every sequence $\{v^{(k)} : k \ge 1\}$ such that $\lim_{k \to \infty} v^{(k)} = u$. The matrix $A$ is called the derivative. If the function $f$ is differentiable at $u$, then the derivative at $u$ is equal to the Jacobian $J_f(u)$. In particular, if $f$ is a real-valued function (that is, $m = 1$) and differentiable at $u$, then the derivative at $u$ is $\nabla f(u)^\top$. If the derivative is continuous on a set $U \subseteq \mathbb{R}^n$, we say that $f$ is continuously differentiable on $U$, and in that case all the partial derivatives $\partial f(u)/\partial u_i$ are continuous on $U$.
A.3 The Multivariable Chain Rule
Consider a multivariate vector valued function $h : \mathbb{R}^n \to \mathbb{R}^k$ and a multivariate real-valued function $g : \mathbb{R}^k \to \mathbb{R}$. Suppose that $h$ is differentiable at $u \in \mathbb{R}^n$ and $g$ is differentiable at $h(u) = [h_1(u) \cdots h_k(u)]^\top$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be the composition $f = g \circ h$, or $f(u) = g(h(u))$. For each $i = 1, \ldots, n$, the multivariate chain rule is
\[
\frac{\partial f(u)}{\partial u_i} = \frac{\partial g(h(u))}{\partial v_1} \frac{\partial h_1(u)}{\partial u_i} + \cdots + \frac{\partial g(h(u))}{\partial v_k} \frac{\partial h_k(u)}{\partial u_i},
\]
where $\frac{\partial g}{\partial v_i}$ denotes the partial derivative of $g$ with respect to the $i$-th coordinate. Thus,
\[
\frac{\partial f(u)}{\partial u_i} = \Bigg[\frac{\partial h_1(u)}{\partial u_i} \cdots \frac{\partial h_k(u)}{\partial u_i}\Bigg] \nabla g\big(h(u)\big),
\]
and combining for all $i = 1, \ldots, n$,
\[
\nabla f(u) = J_h(u)^\top \, \nabla g\big(h(u)\big).
\]
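The identity $\nabla f(u) = J_h(u)^\top \nabla g(h(u))$ can be checked on a concrete composition. Here $h(u) = [u_1 + u_2,\ u_1 u_2]^\top$ and $g(v) = v_1^2 + v_2$ are illustrative choices, so that $f(u) = (u_1 + u_2)^2 + u_1 u_2$ has a gradient we can also write down directly.

```python
import numpy as np

def h(u):
    # Example inner map (illustrative): h(u) = [u1 + u2, u1*u2]
    return np.array([u[0] + u[1], u[0] * u[1]])

def grad_g(v):
    # g(v) = v1^2 + v2, so grad g(v) = [2 v1, 1]
    return np.array([2 * v[0], 1.0])

def J_h(u):
    # Jacobian of h, rows are gradients of h1 and h2
    return np.array([[1.0, 1.0],
                     [u[1], u[0]]])

u = np.array([1.0, 2.0])
grad_f = J_h(u).T @ grad_g(h(u))  # chain rule: grad f = J_h(u)^T grad g(h(u))

# Direct differentiation of f(u) = (u1 + u2)^2 + u1 u2
grad_f_direct = np.array([2 * (u[0] + u[1]) + u[1],
                          2 * (u[0] + u[1]) + u[0]])
```

Both routes give the same gradient, as the chain rule asserts.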
Now consider the case where $g$ is also a multivariate vector valued function. That is, suppose $h : \mathbb{R}^n \to \mathbb{R}^k$ is differentiable at $u$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $h(u)$. Then the composition $f = g \circ h : \mathbb{R}^n \to \mathbb{R}^m$ is a vector valued function with Jacobian
\[
J_f(u) = J_g\big(h(u)\big) \, J_h(u). \tag{A.13}
\]
The expression in (A.13) is called the multivariable chain rule. In terms of the notation (A.10), we may represent the multivariable chain rule as
\[
\Bigg[\frac{\partial f}{\partial u}\Bigg]^\top = \Bigg[\frac{\partial g}{\partial h}\Bigg]^\top \Bigg[\frac{\partial h}{\partial u}\Bigg]^\top, \quad \text{or} \quad \frac{\partial f}{\partial u} = \frac{\partial h}{\partial u} \frac{\partial g}{\partial h}. \tag{A.14}
\]
The Chain Rule for a Matrix Derivative of an Affine Transformation
Let us focus on the case $y = g(h(u))$ where $h : \mathbb{R}^n \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}$. Specifically, let us assume that $h(\cdot)$ is the affine function $h(u) = W u + b$ where $W \in \mathbb{R}^{k \times n}$ and $b \in \mathbb{R}^k$. That is,
\[
y = g(z), \quad \text{with} \quad z = W u + b.
\]
We are often interested in the derivative of the scalar output $y$ with respect to the matrix $W = [w_{i,j}]$. This is denoted via $\frac{\partial y}{\partial W}$ as in (A.7). It turns out that we can represent this matrix derivative as the outer product
\[
\frac{\partial y}{\partial W} = \frac{\partial y}{\partial z} \, u^\top, \tag{A.15}
\]
where $\frac{\partial y}{\partial z}$ is the gradient of $g(\cdot)$ evaluated at $z$. To see (A.15), denote the columns of $W$ via $w^{(1)}, \ldots, w^{(n)}$, each an element of $\mathbb{R}^k$, and observe that
\[
z = b + \sum_{i=1}^{n} u_i \, w^{(i)}.
\]
We may now observe that the transposed Jacobian $\partial z/\partial w^{(i)}$ is $u_i I$, where $I$ is the $k \times k$ identity matrix. Hence, using (A.13), we have
\[
\frac{\partial y}{\partial w^{(i)}} = \frac{\partial z}{\partial w^{(i)}} \frac{\partial y}{\partial z} = u_i \, \frac{\partial y}{\partial z}.
\]
Now we can construct $\frac{\partial y}{\partial W}$ column by column,
\[
\frac{\partial y}{\partial W} = \Bigg[\frac{\partial y}{\partial w^{(1)}} \cdots \frac{\partial y}{\partial w^{(n)}}\Bigg] = \Bigg[u_1 \frac{\partial y}{\partial z} \cdots u_n \frac{\partial y}{\partial z}\Bigg] = \frac{\partial y}{\partial z} \, u^\top.
\]
Jacobian Vector Products and Vector Jacobian Products
Let $f = (f_1, \ldots, f_m) = h_L \circ h_{L-1} \circ \cdots \circ h_1$ be a composition of $L$ differentiable functions $h_1, h_2, \ldots, h_L$ such that $h_\ell : \mathbb{R}^{m_{\ell-1}} \to \mathbb{R}^{m_\ell}$, where $m_0, m_1, \ldots, m_L$ are positive integers with $m_0 = n$ and $m_L = m$. Further, to simplify the notation, for each $\ell = 1, \ldots, L$, let
\[
g_\ell(u) = h_\ell\big(h_{\ell-1}(\cdots(h_1(u)) \cdots)\big).
\]
Then, $g_L(u) = f(u)$ and by recursive application of (A.13), we obtain
\[
J_f(u) = J_{h_L}(g_{L-1}(u)) \, J_{h_{L-1}}(g_{L-2}(u)) \cdots J_{h_1}(u). \tag{A.16}
\]
Note that from the definition of the Jacobian, the $j$-th column of $J_f(u)$ is the $m$ dimensional vector
\[
\frac{\partial f(u)}{\partial u_j} = \Bigg[\frac{\partial f_1(u)}{\partial u_j}, \ldots, \frac{\partial f_m(u)}{\partial u_j}\Bigg]^\top = J_f(u) \, e_j,
\]
where $e_j$ is the $j$-th unit vector of appropriate dimension. Therefore, using (A.16), for each $j = 1, \ldots, n$,
\[
\frac{\partial f(u)}{\partial u_j} = J_{h_L}(g_{L-1}(u)) \Big[ J_{h_{L-1}}(g_{L-2}(u)) \big[ \cdots [J_{h_1}(u) \, e_j] \cdots \big] \Big]. \tag{A.17}
\]
That is, for each $j = 1, \ldots, n$, $\frac{\partial f(u)}{\partial u_j}$ can be obtained by recursively computing the Jacobian vector product given by
\[
v_\ell := J_{h_\ell}(g_{\ell-1}(u)) \, v_{\ell-1},
\]
for $\ell = 1, \ldots, L$, starting with $v_0 = e_j$ and $g_0(u) = u$.
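The Jacobian vector product recursion can be sketched for a two-layer composition $f = h_2 \circ h_1$; the particular maps $h_1(u) = [\sin u_1,\ u_1 u_2]^\top$ and $h_2(v) = [v_1 + v_2,\ v_1 v_2]^\top$ are illustrative assumptions. One pass of the recursion recovers the $j$-th column of $J_f(u)$, checked against a finite difference.

```python
import numpy as np

# Two-layer composition f = h2 ∘ h1 (example maps for illustration)
def h1(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_h1(u):
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def h2(v):
    return np.array([v[0] + v[1], v[0] * v[1]])

def J_h2(v):
    return np.array([[1.0, 1.0],
                     [v[1], v[0]]])

def f(u):
    return h2(h1(u))

u = np.array([0.4, 1.5])
j = 0                        # which column of J_f(u) to compute
v = np.zeros(2)
v[j] = 1.0                   # v_0 = e_j

# Jacobian vector products, innermost Jacobian applied first, as in (A.17)
v = J_h1(u) @ v              # v_1 = J_h1(u) v_0
v = J_h2(h1(u)) @ v          # v_2 = J_h2(g_1(u)) v_1

# Finite-difference check of the j-th column of J_f(u)
h = 1e-6
e = np.zeros(2)
e[j] = h
col_num = (f(u + e) - f(u)) / h
```

This is the forward-mode pattern: one sweep from the innermost function outward yields one column of the Jacobian.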
On the other hand, since the $i$-th row of $J_f(u)$ is the gradient $\nabla f_i(u)^\top$, we have
\[
\nabla f_i(u)^\top = e_i^\top J_f(u) = \Big[ \cdots \big[ e_i^\top J_{h_L}(g_{L-1}(u)) \big] \, J_{h_{L-1}}(g_{L-2}(u)) \cdots \Big] J_{h_1}(u). \tag{A.18}
\]
That is, for each $i = 1, \ldots, m$, $\nabla f_i(u)$ can be obtained by recursively computing the vector Jacobian product given by
\[
v_\ell^\top := v_{\ell-1}^\top \, J_{h_{L-\ell+1}}(g_{L-\ell}(u)),
\]
for $\ell = 1, \ldots, L$, starting with $v_0 = e_i$ and $g_0(u) = u$.
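The mirror-image vector Jacobian product recursion can be sketched with the same kind of two-layer example (the maps $h_1$, $h_2$ below are illustrative assumptions): multiplying a row vector through the Jacobians from the outermost function inward recovers one row of $J_f(u)$, that is, one gradient $\nabla f_i(u)^\top$.

```python
import numpy as np

# Same style of two-layer composition, now traversed in reverse
def h1(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_h1(u):
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def h2(v):
    return np.array([v[0] + v[1], v[0] * v[1]])

def J_h2(v):
    return np.array([[1.0, 1.0],
                     [v[1], v[0]]])

u = np.array([0.4, 1.5])
i = 1                        # which component f_i to differentiate
w = np.zeros(2)
w[i] = 1.0                   # v_0 = e_i

# Vector Jacobian products, outermost Jacobian applied first, as in (A.18)
w = w @ J_h2(h1(u))          # v_1^T = e_i^T J_h2(g_1(u))
w = w @ J_h1(u)              # v_2^T = v_1^T J_h1(u), now equal to grad f_i(u)^T

# Check against the full Jacobian product (A.16)
J_f = J_h2(h1(u)) @ J_h1(u)
```

This is the reverse-mode (backpropagation) pattern: one sweep yields a full gradient, regardless of the input dimension $n$.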
A.4 Taylor’s Theorem
Once again consider a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$. If all the $k$-th order derivatives of $f$ are continuous at a point $u \in \mathbb{R}^n$, then Taylor's theorem offers an approximation for $f$ within a neighborhood of $u$ in terms of these derivatives. We are particularly interested in the cases $k = 1$ and $k = 2$, as they are crucial in the implementation of, respectively, first-order and second-order optimization methods. It is easy to understand the theorem when the function $f$ is univariate. Hence we start with the univariate case and then move to the general multivariate case. We omit the proof of Taylor's theorem as it is a well known result that can be found in any standard multivariate calculus textbook.
Univariate Case
Suppose that $n = 1$, that is, $f$ is a univariate real-valued function. We say that $f$ is $k$-times continuously differentiable on an open interval $U \subseteq \mathbb{R}$ if $f$ is $k$ times differentiable at every point on $U$ (i.e., the $k$-th order derivative $\frac{d^k f(u)}{du^k}$ exists for all $u \in U$) and $\frac{d^k f(u)}{du^k}$ is continuous on $U$. If $k = 0$, we interpret $\frac{d^k f(u)}{du^k}$ simply as $f(u)$.
Theorem A.1 (Taylor's Theorem in $\mathbb{R}$). Let $f : \mathbb{R} \to \mathbb{R}$ be $k$-times continuously differentiable on an open interval $U \subseteq \mathbb{R}$. Then, for any $u, v \in U$,
\[
f(u) = \sum_{i=0}^{k} \frac{(u - v)^i}{i!} \frac{d^i f(v)}{du^i} + O\big(|u - v|^{k+1}\big). \tag{A.19}
\]
The polynomial
\[
P_k(u) = \sum_{i=0}^{k} \frac{(u - v)^i}{i!} \frac{d^i f(v)}{du^i},
\]
appearing in (A.19), is called the $k$-th order Taylor polynomial. Since the remainder
\[
R_k(u) = f(u) - P_k(u) \to 0, \quad \text{as } u \to v,
\]
$f(u)$ is approximately equal to $P_k(u)$ for $u$ within a small neighborhood of $v$. In particular, for a point $u$ near $v$, $P_1(u)$ is a linear approximation of $f(u)$ and $P_2(u)$ is a quadratic approximation of $f(u)$.
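A concrete univariate sketch: for the example function $f(u) = e^u$ expanded around $v = 0$ (both are illustrative choices), the first-order polynomial $P_1(u) = 1 + u$ and second-order polynomial $P_2(u) = 1 + u + u^2/2$ show the expected improvement in accuracy.

```python
import numpy as np

# Taylor polynomials of f(u) = exp(u) around v = 0:
#   P1(u) = 1 + u
#   P2(u) = 1 + u + u^2 / 2
v = 0.0
u = 0.1

f_u = np.exp(u)
P1 = 1.0 + (u - v)
P2 = P1 + (u - v)**2 / 2.0

err1 = abs(f_u - P1)  # remainder of order |u - v|^2
err2 = abs(f_u - P2)  # remainder of order |u - v|^3
```

As (A.19) predicts, the quadratic approximation has a markedly smaller error than the linear one at the same point.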
Multivariate Case
Now consider the multivariate case, that is, $f$ is a multivariate real-valued function. In order to state Taylor's theorem for this case, we need some new notation that is relevant only here.
An $n$-tuple $\alpha = (\alpha_1, \ldots, \alpha_n)$ is called a multi-index if each $\alpha_i$ is a non-negative integer. For a multi-index $\alpha$, let
\[
|\alpha| = \sum_{i=1}^{n} \alpha_i, \quad \alpha! = \alpha_1! \cdots \alpha_n!, \quad \text{and} \quad u^\alpha = u_1^{\alpha_1} \cdots u_n^{\alpha_n},
\]
for any $u \in \mathbb{R}^n$. Then, the higher order partial derivatives are expressed as
\[
D^\alpha f(u) = \frac{\partial^{|\alpha|} f(u)}{\partial u_1^{\alpha_1} \cdots \partial u_n^{\alpha_n}}.
\]
We say that $f$ is $k$-times continuously differentiable on an open set $U \subseteq \mathbb{R}^n$ if all the higher order partial derivatives $D^\alpha f(u)$ exist and are continuous on $U$ for every multi-index $\alpha$ such that $|\alpha| \le k$.
Theorem A.2 (Taylor's Theorem in $\mathbb{R}^n$). Let $f : \mathbb{R}^n \to \mathbb{R}$ be $k$-times continuously differentiable on an open set $U \subseteq \mathbb{R}^n$. Then, for any $u, v \in U$,
\[
f(u) = \sum_{\alpha : |\alpha| \le k} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} + O\big(\|u - v\|^{k+1}\big). \tag{A.20}
\]
The polynomial
\[
P_k(u) = \sum_{\alpha : |\alpha| \le k} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!},
\]
is called the $k$-th order Taylor polynomial. In particular,
\[
P_1(u) = \sum_{\alpha : |\alpha| \le 1} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} = f(v) + (u - v)^\top \nabla f(v), \tag{A.21}
\]
for $u$ near $v$, provides a linear approximation, also called the first-order Taylor approximation, to $f(u)$, while
\[
P_2(u) = \sum_{\alpha : |\alpha| \le 2} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} = f(v) + (u - v)^\top \nabla f(v) + \frac{1}{2} (u - v)^\top \nabla^2 f(v) (u - v) \tag{A.22}
\]
provides a quadratic approximation, also called the second-order Taylor approximation, to $f(u)$.
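A multivariate sketch of (A.21) and (A.22): the example function $f(u) = e^{u_1} + u_1 u_2^2$, the expansion point $v$, and the nearby point $u$ are illustrative assumptions; the gradient and Hessian are written in closed form so that $P_1$ and $P_2$ can be compared against the true value.

```python
import numpy as np

# f(u) = exp(u1) + u1 * u2^2 (example function for illustration)
def f(u):
    return np.exp(u[0]) + u[0] * u[1]**2

def grad_f(u):
    return np.array([np.exp(u[0]) + u[1]**2, 2 * u[0] * u[1]])

def hess_f(u):
    return np.array([[np.exp(u[0]), 2 * u[1]],
                     [2 * u[1], 2 * u[0]]])

v = np.array([0.0, 1.0])      # expansion point
u = v + np.array([0.05, -0.02])
d = u - v

P1 = f(v) + d @ grad_f(v)                 # first-order approximation (A.21)
P2 = P1 + 0.5 * d @ hess_f(v) @ d         # second-order approximation (A.22)

err1 = abs(f(u) - P1)
err2 = abs(f(u) - P2)
```

Adding the quadratic Hessian term shrinks the approximation error, which is exactly why second-order optimization methods use (A.22).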
Linear Approximation with Jacobians and Hessians
Consider a differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$ with the $m \times n$ Jacobian $J_f(\cdot)$. Then with Theorem A.2 we may construct a first order linear approximation to $f(\cdot)$ around any $u_0 \in \mathbb{R}^n$,
\[
\tilde{f}(u) = f(u_0) + J_f(u_0)(u - u_0), \tag{A.23}
\]
where $\tilde{f}(u) \approx f(u)$.
Now consider a twice differentiable $g : \mathbb{R}^n \to \mathbb{R}$ with gradient $\nabla g(\cdot)$ and Hessian matrix $\nabla^2 g(\cdot)$. We can set $f(u) = \nabla g(u)$ with $f : \mathbb{R}^n \to \mathbb{R}^n$. Since the Hessian of $g(\cdot)$ is the Jacobian of $f(\cdot)$, from (A.23) we obtain a first order linear approximation for the gradient around $u_0 \in \mathbb{R}^n$,
\[
\widetilde{\nabla g}(u) = \nabla g(u_0) + \nabla^2 g(u_0)(u - u_0), \tag{A.24}
\]
where $\widetilde{\nabla g}(u) \approx \nabla g(u)$.