Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
A Some Multivariable Calculus
This appendix provides key results and notation from multivariable calculus. It is not an exhaustive summary of multivariable calculus but rather contains the results needed for the contents of the book.
A.1 Vectors and Functions in $\mathbb{R}^n$
Denote the set of all real numbers by $\mathbb{R}$ and the real coordinate space of dimension $n$ by $\mathbb{R}^n$. Each element of $\mathbb{R}^n$ is an $n$-dimensional vector, interpreted as a column of the form
\[
u = (u_1, \ldots, u_n) = [u_1 \cdots u_n]^\top = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}.
\]
The Euclidean norm of $u \in \mathbb{R}^n$, measuring the geometric length of $u$ and also known as the $L_2$ norm, is
\[
\|u\|_2 = \sqrt{u^\top u} = \Bigg(\sum_{i=1}^{n} u_i^2\Bigg)^{1/2}.
\]
Here the scalar $u^\top v$ is the inner product between two vectors $u, v \in \mathbb{R}^n$. A normalized form of the inner product, called the cosine of the angle between the two vectors, sometimes simply denoted $\cos \theta$, is
\[
\cos \theta = \frac{u^\top v}{\|u\|_2 \, \|v\|_2}. \tag{A.1}
\]
The Euclidean norm is a special case of the $L_p$ norm, which is defined via
\[
\|u\|_p = \Bigg(\sum_{i=1}^{n} |u_i|^p\Bigg)^{1/p},
\]
for $p \ge 1$. When $p$ in $\|\cdot\|_p$ is not specified, we interpret $\|\cdot\|$ as the $L_2$ norm.
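As a quick numerical illustration (a sketch, not part of the book's text), the following computes the $L_2$ norm, a general $L_p$ norm, and the cosine of the angle in (A.1) with NumPy; the vectors `u` and `v` are arbitrary example choices.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# L2 norm: (sum of squares)^(1/2)
norm_u = np.sqrt(np.sum(u**2))  # equals 5.0 for this u

# General Lp norm for p >= 1
def lp_norm(x, p):
    return np.sum(np.abs(x)**p)**(1.0 / p)

# Cosine of the angle between u and v, as in (A.1)
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Here `u @ v` is the inner product $u^\top v$, and `np.linalg.norm` defaults to the $L_2$ norm.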
Focusing on the $L_2$ norm and the inner product $u^\top v$, the Cauchy-Schwarz inequality is
\[
|u^\top v| \le \|u\| \, \|v\|, \tag{A.2}
\]
where the two sides are equal if and only if $u$ and $v$ are linearly dependent (that is, $u = c\,v$ for some $c \in \mathbb{R}$). Also, the Euclidean distance (or simply the distance) between $u$ and $v$ is
defined as
\[
\|u - v\| = \Bigg(\sum_{i=1}^{n} (u_i - v_i)^2\Bigg)^{1/2}.
\]
An important consequence of the Cauchy-Schwarz inequality is that the Euclidean norm satisfies the triangle inequality: for any $u, v \in \mathbb{R}^n$,
\[
\|u + v\| \le \|u\| + \|v\|. \tag{A.3}
\]
To see this, observe that
\[
\|u + v\|^2 = \|u\|^2 + \|v\|^2 + 2 u^\top v \le \|u\|^2 + \|v\|^2 + 2\|u\|\|v\| = (\|u\| + \|v\|)^2.
\]
Convergence of a sequence of vectors can be defined via scalar convergence of the distance. That is, a sequence of vectors $u^{(1)}, u^{(2)}, \ldots$ in $\mathbb{R}^n$ is said to converge to a vector $u \in \mathbb{R}^n$, denoted via $\lim_{k \to \infty} u^{(k)} = u$, if
\[
\lim_{k \to \infty} \|u^{(k)} - u\| = 0.
\]
That is, if for every $\varepsilon > 0$ there exists an $N_0$ such that for all $k \ge N_0$, $\|u^{(k)} - u\| < \varepsilon$.
Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $n$-dimensional multivariate function that maps each vector $u = (u_1, \ldots, u_n) \in \mathbb{R}^n$ to a real number. Then, the function is said to be continuous at $u \in \mathbb{R}^n$ if for any sequence $u^{(1)}, u^{(2)}, \ldots$ such that $\lim_{k \to \infty} u^{(k)} = u$, we have that
\[
\lim_{k \to \infty} f(u^{(k)}) = f(u).
\]
Alternatively, $f$ is continuous at $u \in \mathbb{R}^n$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that
\[
|f(u) - f(v)| < \varepsilon,
\]
for every $v \in \mathbb{R}^n$ with $\|u - v\| < \delta$. Continuity of $f$ at $u$ implies that the values of $f$ at $u$ and at $v$ can be made arbitrarily close by setting the point $v$ to be arbitrarily close to $u$.
We can extend the above continuity definitions to multivariate vector valued functions of the form $f : \mathbb{R}^n \to \mathbb{R}^m$ that map every $n$ dimensional real-valued vector to an $m$ dimensional real-valued vector. Such functions can be written as
\[
f(u) = [f_1(u) \cdots f_m(u)]^\top, \tag{A.4}
\]
where $f_i : \mathbb{R}^n \to \mathbb{R}$ for each $i = 1, \ldots, m$. Then, the function $f$ is called continuous at $u$ if each $f_i$ is continuous at $u$. We say that the function $f$ is continuous on a set $U \subseteq \mathbb{R}^n$ if $f$ is continuous at each point in $U$.
A.2 Derivatives
Consider an $n$-dimensional multivariate function $f : \mathbb{R}^n \to \mathbb{R}$. The partial derivative $\partial f(u)/\partial u_i$ of $f$ with respect to $u_i$ is the derivative taken with respect to the variable $u_i$ while keeping all other variables constant. That is,
\[
\frac{\partial f(u)}{\partial u_i} = \lim_{h \to 0} \frac{f(u_1, \ldots, u_{i-1}, u_i + h, u_{i+1}, \ldots, u_n) - f(u)}{h}. \tag{A.5}
\]
Suppose that the partial derivative (A.5) exists for all $i = 1, \ldots, n$. Then the gradient of $f$ at $u$, denoted by $\nabla f(u)$ or $\frac{\partial f(u)}{\partial u}$, is a concatenation of the partial derivatives of $f$ with respect to all its variables, and it is expressed as a vector:
\[
\nabla f(u) = \frac{\partial f(u)}{\partial u} = \Bigg[\frac{\partial f(u)}{\partial u_1} \cdots \frac{\partial f(u)}{\partial u_n}\Bigg]^\top. \tag{A.6}
\]
The gradient $\nabla f(u)$ is a vector capturing the direction of steepest ascent at $u$. Further, $h\|\nabla f(u)\|$ is the first-order approximation of the increase in $f$ when moving in that direction for an infinitesimal distance $h$.
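The limit in (A.5) suggests a simple numerical check of the gradient. The sketch below, using a hypothetical example function $f(u) = u_1^2 + 3u_1 u_2$ (an assumption for illustration, not from the text), approximates each partial derivative by a finite difference and compares against the exact gradient $\nabla f(u) = [2u_1 + 3u_2,\ 3u_1]^\top$.

```python
import numpy as np

def f(u):
    # Example function (illustrative choice): f(u) = u1^2 + 3 u1 u2
    return u[0]**2 + 3.0 * u[0] * u[1]

def numerical_gradient(f, u, h=1e-6):
    """Approximate each partial derivative via the difference quotient in (A.5)."""
    grad = np.zeros_like(u)
    for i in range(len(u)):
        e = np.zeros_like(u)
        e[i] = h
        grad[i] = (f(u + e) - f(u)) / h
    return grad

u = np.array([1.0, 2.0])
grad_exact = np.array([2 * u[0] + 3 * u[1], 3 * u[0]])
grad_num = numerical_gradient(f, u)
```

The finite-difference estimate agrees with the exact gradient up to an error of order $h$.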
In some situations, instead of a vector form, the variables of the function are represented as a matrix. In that scenario, multivariate functions are of the form $f : \mathbb{R}^{n \times m} \to \mathbb{R}$, that is, $f$ maps matrices $U = (u_{i,j})$ of dimension $n \times m$ to real values $f(U)$. If the partial derivative $\partial f(U)/\partial u_{i,j}$ exists for all $i = 1, \ldots, n$ and $j = 1, \ldots, m$, it is convenient to use the notation $\frac{\partial f(U)}{\partial U}$ to denote the collection of the partial derivatives of $f$ with respect to all its variables as a matrix of the same dimension $n \times m$,
\[
\frac{\partial f(U)}{\partial U} =
\begin{bmatrix}
\frac{\partial f(U)}{\partial u_{1,1}} & \ldots & \frac{\partial f(U)}{\partial u_{1,m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f(U)}{\partial u_{n,1}} & \ldots & \frac{\partial f(U)}{\partial u_{n,m}}
\end{bmatrix}. \tag{A.7}
\]
Directional Derivatives
The directional derivative of $f : \mathbb{R}^n \to \mathbb{R}$ at $u$ in the direction $v \in \mathbb{R}^n$ is the scalar defined by
\[
\nabla_v f(u) = \lim_{h \to 0} \frac{f(u + hv) - f(u)}{h}.
\]
The directional derivative generalizes the notion of the partial derivative. In fact, the partial derivative $\partial f(u)/\partial u_i$ is the directional derivative at $u$ in the direction of the vector $e_i$, which consists of 1 at the $i$-th coordinate and zeros everywhere else. This simply follows from the observation that
\[
\nabla_{e_i} f(u) = \lim_{h \to 0} \frac{f(u_1, \ldots, u_{i-1}, u_i + h, u_{i+1}, \ldots, u_n) - f(u)}{h} = \frac{\partial f(u)}{\partial u_i}.
\]
3
i
i
i
i
i
i
i
i
A Some Multivariable Calculus
As a consequence, if the gradient of $f$ exists at $u$, the directional derivative exists in every direction $v$ and we have
\[
\nabla_v f(u) = v^\top \nabla f(u). \tag{A.8}
\]
One way to see (A.8) in the case of continuity of the partial derivatives is via a Taylor's theorem based first-order approximation (see Theorem A.1):
\[
f(u + hv) = f(u) + (h v)^\top \nabla f(u) + O(h^2),
\]
where $O(h^k)$ denotes a function such that $O(h^k)/h^k$ goes to a constant as $h \to 0$. Thus,
\[
\frac{f(u + hv) - f(u)}{h} = v^\top \nabla f(u) + O(h).
\]
Now take the limit $h \to 0$ on both sides to get (A.8).
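Identity (A.8) is easy to verify numerically. In this sketch, the example function $f(u) = u_1^2 + u_2^3$ and the direction $v$ are illustrative assumptions; the finite-difference quotient is compared against $v^\top \nabla f(u)$.

```python
import numpy as np

def f(u):
    # Example function (illustrative choice): f(u) = u1^2 + u2^3
    return u[0]**2 + u[1]**3

u = np.array([1.0, 2.0])
grad = np.array([2 * u[0], 3 * u[1]**2])  # exact gradient of f at u
v = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit direction

h = 1e-6
directional_limit = (f(u + h * v) - f(u)) / h  # difference-quotient version
directional_formula = v @ grad                  # v^T grad f(u), as in (A.8)
```

Both quantities agree up to an $O(h)$ error, consistent with the derivation above.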
It is useful to note that the directional derivative $\nabla_v f(u)$ is maximal in the direction of the gradient, in the sense that over all unit length vectors $v$, the choice $v = \nabla f(u)/\|\nabla f(u)\|$ maximizes $|\nabla_v f(u)|$. This is a consequence of the Cauchy-Schwarz inequality (A.2):
\[
|\nabla_v f(u)| = |v^\top \nabla f(u)| \le \|v\| \, \|\nabla f(u)\| = \|\nabla f(u)\|.
\]
Setting $v = \nabla f(u)/\|\nabla f(u)\|$ achieves the equality.
Jacobians
The Jacobian is useful for functions of the form $f : \mathbb{R}^n \to \mathbb{R}^m$ as in (A.4), where each $f_i$ is a real-valued function of $u$. The Jacobian of $f$ at $u$, denoted by $J_f(u)$, is the $m \times n$ matrix defined via
\[
J_f(u) =
\begin{bmatrix}
\frac{\partial f_1(u)}{\partial u_1} & \ldots & \frac{\partial f_1(u)}{\partial u_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m(u)}{\partial u_1} & \ldots & \frac{\partial f_m(u)}{\partial u_n}
\end{bmatrix}. \tag{A.9}
\]
In other words, the $i$-th row of the Jacobian is the gradient $\nabla f_i(u)^\top$. In some situations, it is convenient to use the notation $\frac{\partial f(u)}{\partial u}$ to denote the transpose of the Jacobian of $f$ at $u$. That is,
\[
\frac{\partial f(u)}{\partial u} = (J_f(u))^\top. \tag{A.10}
\]
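Definition (A.9) can be exercised numerically by building the Jacobian column by column from difference quotients. The map $f(u) = [u_1 u_2,\ \sin u_1]^\top$ below is a hypothetical example chosen for illustration.

```python
import numpy as np

def f(u):
    """Example map from R^2 to R^2 (illustrative): f(u) = [u1*u2, sin(u1)]."""
    return np.array([u[0] * u[1], np.sin(u[0])])

def jacobian_exact(u):
    # Rows are the gradients of f1 and f2, as in (A.9)
    return np.array([[u[1], u[0]],
                     [np.cos(u[0]), 0.0]])

def jacobian_numerical(f, u, h=1e-6):
    """Build the m x n Jacobian column by column via finite differences."""
    m, n = len(f(u)), len(u)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(u + e) - f(u)) / h
    return J

u = np.array([0.5, 2.0])
```

Comparing `jacobian_numerical(f, u)` against `jacobian_exact(u)` confirms the layout: row $i$ holds the partial derivatives of $f_i$.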
Hessians
Returning to functions of the form $f : \mathbb{R}^n \to \mathbb{R}$, to describe the curvature of the function $f$ at a given $u \in \mathbb{R}^n$, it is important to consider the second-order partial derivatives at $u$.
These partial derivatives are arranged as an $n \times n$ matrix, called the Hessian and defined by
\[
\nabla^2 f(u) = \frac{\partial \nabla f(u)}{\partial u} =
\begin{bmatrix}
\frac{\partial^2 f}{\partial u_1^2} & \frac{\partial^2 f}{\partial u_1 \partial u_2} & \cdots & \frac{\partial^2 f}{\partial u_1 \partial u_n} \\
\frac{\partial^2 f}{\partial u_2 \partial u_1} & \frac{\partial^2 f}{\partial u_2^2} & \cdots & \frac{\partial^2 f}{\partial u_2 \partial u_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial u_n \partial u_1} & \frac{\partial^2 f}{\partial u_n \partial u_2} & \cdots & \frac{\partial^2 f}{\partial u_n^2}
\end{bmatrix}, \tag{A.11}
\]
where $\frac{\partial^2 f}{\partial u_i \partial u_j} = \frac{\partial}{\partial u_i}\Big(\frac{\partial f}{\partial u_j}\Big)$. Note that if all the second-order partial derivatives are continuous at $u$, then the Hessian $\nabla^2 f(u)$ is a symmetric matrix. That is, for all $i, j \in \{1, \ldots, n\}$,
\[
\frac{\partial^2 f}{\partial u_i \partial u_j} = \frac{\partial^2 f}{\partial u_j \partial u_i}.
\]
This result is known as Schwarz's theorem or Clairaut's theorem. Observe that using the Jacobian, we can treat the Hessian as the Jacobian of the gradient vector. That is,
\[
\nabla^2 f(u) = J_{\nabla f}(u).
\]
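A small numerical sketch of (A.11) and the symmetry statement: the function $f(u) = u_1^2 u_2 + e^{u_1 u_2}$ is a hypothetical smooth example, its Hessian is approximated by second differences and compared against the closed form.

```python
import numpy as np

def f(u):
    # Example smooth function (illustrative): f(u) = u1^2 u2 + exp(u1 u2)
    return u[0]**2 * u[1] + np.exp(u[0] * u[1])

def hessian_numerical(f, u, h=1e-5):
    """Approximate the Hessian entries by second-order finite differences."""
    n = len(u)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(u + ei + ej) - f(u + ei) - f(u + ej) + f(u)) / h**2
    return H

u = np.array([0.3, 0.7])
H = hessian_numerical(f, u)

# Closed-form Hessian of this particular f
ex = np.exp(u[0] * u[1])
H_exact = np.array([
    [2 * u[1] + u[1]**2 * ex, 2 * u[0] + (1 + u[0] * u[1]) * ex],
    [2 * u[0] + (1 + u[0] * u[1]) * ex, u[0]**2 * ex],
])
```

Since the second-order partials are continuous here, the off-diagonal entries agree, as Schwarz's theorem guarantees.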
Certain attributes of optimization problems are often defined via positive (semi)definiteness of the Hessian $\nabla^2 f(\theta)$ at $\theta$. In particular, a symmetric matrix $A$ is said to be positive semidefinite if for all $\phi \in \mathbb{R}^d$,
\[
\phi^\top A \, \phi \ge 0. \tag{A.12}
\]
Furthermore, $A$ is said to be positive definite if the inequality in (A.12) is strict for all $\phi \in \mathbb{R}^d \setminus \{0\}$. Note that the matrix $A$ is called negative semidefinite (respectively, negative definite) when $-A$ is positive semidefinite (respectively, positive definite).
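In practice, positive definiteness of a symmetric matrix is usually checked through its eigenvalues (all strictly positive) rather than by testing (A.12) directly. The sketch below, with two illustrative $2 \times 2$ matrices, does both; the matrices and the random test vector are assumptions for the example.

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Symmetric A is positive definite iff all its eigenvalues are > 0."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # eigenvalues 1 and 3: positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])  # eigenvalues -1 and 3: indefinite

# Spot-check the quadratic form in (A.12) for one vector phi
rng = np.random.default_rng(0)
phi = rng.standard_normal(2)
quad = phi @ A @ phi  # nonnegative for any phi when A is positive semidefinite
```

A single vector `phi` cannot prove definiteness, but the eigenvalue test characterizes it exactly for symmetric matrices.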
Differentiability
A multivariate vector valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be differentiable at $u \in \mathbb{R}^n$ if there is an $m \times n$ dimensional matrix $A$ such that
\[
\lim_{v \to u} \frac{\|f(u) - f(v) - A(u - v)\|}{\|u - v\|} = 0.
\]
Here the limit notation $v \to u$ implies that the limit exists for every sequence $\{v^{(k)} : k \ge 1\}$ such that $\lim_{k \to \infty} v^{(k)} = u$. The matrix $A$ is called the derivative. If the function $f$ is differentiable at $u$, then the derivative at $u$ is equal to the Jacobian $J_f(u)$. In particular, if $f$ is a real-valued function (that is, $m = 1$) and differentiable at $u$, then the derivative at $u$ is $\nabla f(u)^\top$. If the derivative is continuous on a set $U \subseteq \mathbb{R}^n$, we say that $f$ is continuously differentiable on $U$, and in that case all the partial derivatives $\partial f(u)/\partial u_i$ are continuous on $U$.
A.3 The Multivariable Chain Rule
Consider a multivariate vector valued function $h : \mathbb{R}^n \to \mathbb{R}^k$ and a multivariate real-valued function $g : \mathbb{R}^k \to \mathbb{R}$. Suppose that $h$ is differentiable at $u \in \mathbb{R}^n$ and $g$ is differentiable at $h(u) = [h_1(u) \cdots h_k(u)]^\top$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be the composition $f = g \circ h$, or $f(u) = g(h(u))$. For each $i = 1, \ldots, n$, the multivariate chain rule is
\[
\frac{\partial f(u)}{\partial u_i} = \frac{\partial g(h(u))}{\partial v_1} \frac{\partial h_1(u)}{\partial u_i} + \cdots + \frac{\partial g(h(u))}{\partial v_k} \frac{\partial h_k(u)}{\partial u_i},
\]
where $\frac{\partial g}{\partial v_i}$ denotes the partial derivative of $g$ with respect to the $i$-th coordinate. Thus,
\[
\frac{\partial f(u)}{\partial u_i} = \Bigg[\frac{\partial h_1(u)}{\partial u_i} \cdots \frac{\partial h_k(u)}{\partial u_i}\Bigg] \nabla g\big(h(u)\big),
\]
and combining for all $i = 1, \ldots, n$,
\[
\nabla f(u) = J_h(u)^\top \, \nabla g\big(h(u)\big).
\]
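The identity $\nabla f(u) = J_h(u)^\top \nabla g(h(u))$ can be checked on a concrete composition. Here $h(u) = [u_1 + u_2,\ u_1 u_2]^\top$ and $g(v) = v_1^2 + v_2$ are illustrative choices, so that $f(u) = (u_1 + u_2)^2 + u_1 u_2$ has a gradient we can also write down directly.

```python
import numpy as np

def h(u):
    # Example inner map (illustrative): h(u) = [u1 + u2, u1*u2]
    return np.array([u[0] + u[1], u[0] * u[1]])

def grad_g(v):
    # g(v) = v1^2 + v2, so grad g(v) = [2 v1, 1]
    return np.array([2 * v[0], 1.0])

def J_h(u):
    # Jacobian of h, rows are gradients of h1 and h2
    return np.array([[1.0, 1.0],
                     [u[1], u[0]]])

u = np.array([1.0, 2.0])
grad_f = J_h(u).T @ grad_g(h(u))  # chain rule: grad f = J_h(u)^T grad g(h(u))

# Direct differentiation of f(u) = (u1 + u2)^2 + u1 u2
grad_f_direct = np.array([2 * (u[0] + u[1]) + u[1],
                          2 * (u[0] + u[1]) + u[0]])
```

Both routes give the same gradient, as the chain rule asserts.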
Now consider the case where $g$ is also a multivariate vector valued function. That is, suppose $h : \mathbb{R}^n \to \mathbb{R}^k$ is differentiable at $u$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $h(u)$. Then the composition $f = g \circ h : \mathbb{R}^n \to \mathbb{R}^m$ is a vector valued function with Jacobian
\[
J_f(u) = J_g\big(h(u)\big) \, J_h(u). \tag{A.13}
\]
The expression in (A.13) is called the multivariable chain rule. In terms of the notation (A.10), we may represent the multivariable chain rule as
\[
\Bigg[\frac{\partial f}{\partial u}\Bigg]^\top = \Bigg[\frac{\partial g}{\partial h}\Bigg]^\top \Bigg[\frac{\partial h}{\partial u}\Bigg]^\top, \quad \text{or} \quad \frac{\partial f}{\partial u} = \frac{\partial h}{\partial u} \frac{\partial g}{\partial h}. \tag{A.14}
\]
The Chain Rule for a Matrix Derivative of an Affine Transformation
Let us focus on the case $y = g(h(u))$ where $h : \mathbb{R}^n \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}$. Specifically, let us assume that $h(\cdot)$ is the affine function $h(u) = W u + b$ where $W \in \mathbb{R}^{k \times n}$ and $b \in \mathbb{R}^k$. That is,
\[
y = g(z), \quad \text{with} \quad z = W u + b.
\]
We are often interested in the derivative of the scalar output $y$ with respect to the matrix $W = [w_{i,j}]$. This is denoted via $\frac{\partial y}{\partial W}$ as in (A.7). It turns out that we can represent this matrix derivative as the outer product
\[
\frac{\partial y}{\partial W} = \frac{\partial y}{\partial z} \, u^\top, \tag{A.15}
\]
where $\frac{\partial y}{\partial z}$ is the gradient of $g(\cdot)$ evaluated at $z$. To see (A.15), denote the columns of $W$ via $w^{(1)}, \ldots, w^{(n)}$, each an element of $\mathbb{R}^k$, and observe that
\[
z = b + \sum_{i=1}^{n} u_i \, w^{(i)}.
\]
We may now observe that the transposed Jacobian $\partial z/\partial w^{(i)}$ is $u_i I$, where $I$ is the $k \times k$ identity matrix. Hence, using (A.13), we have
\[
\frac{\partial y}{\partial w^{(i)}} = \frac{\partial z}{\partial w^{(i)}} \frac{\partial y}{\partial z} = u_i \, \frac{\partial y}{\partial z}.
\]
Now we can construct $\frac{\partial y}{\partial W}$ column by column,
\[
\frac{\partial y}{\partial W} = \Bigg[\frac{\partial y}{\partial w^{(1)}} \cdots \frac{\partial y}{\partial w^{(n)}}\Bigg] = \Bigg[u_1 \frac{\partial y}{\partial z} \cdots u_n \frac{\partial y}{\partial z}\Bigg] = \frac{\partial y}{\partial z} \, u^\top.
\]
Jacobian Vector Products and Vector Jacobian Products
Let $f = (f_1, \ldots, f_m) = h_L \circ h_{L-1} \circ \cdots \circ h_1$ be a composition of $L$ differentiable functions $h_1, h_2, \ldots, h_L$ such that $h_\ell : \mathbb{R}^{m_{\ell-1}} \to \mathbb{R}^{m_\ell}$, where $m_0, m_1, \ldots, m_L$ are positive integers with $m_0 = n$ and $m_L = m$. Further, to simplify the notation, for each $\ell = 1, \ldots, L$, let
\[
g_\ell(u) = h_\ell\big(h_{\ell-1}(\cdots(h_1(u)) \cdots)\big).
\]
Then, $g_L(u) = f(u)$ and by recursive application of (A.13), we obtain
\[
J_f(u) = J_{h_L}(g_{L-1}(u)) \, J_{h_{L-1}}(g_{L-2}(u)) \cdots J_{h_1}(u). \tag{A.16}
\]
Note that from the definition of the Jacobian, the $j$-th column of $J_f(u)$ is the $m$ dimensional vector
\[
\frac{\partial f(u)}{\partial u_j} = \Bigg[\frac{\partial f_1(u)}{\partial u_j}, \ldots, \frac{\partial f_m(u)}{\partial u_j}\Bigg]^\top = J_f(u) \, e_j,
\]
where $e_j$ is the $j$-th unit vector of appropriate dimension. Therefore, using (A.16), for each $j = 1, \ldots, n$,
\[
\frac{\partial f(u)}{\partial u_j} = J_{h_L}(g_{L-1}(u)) \Big[ J_{h_{L-1}}(g_{L-2}(u)) \big[ \cdots [J_{h_1}(u) \, e_j] \cdots \big] \Big]. \tag{A.17}
\]
That is, for each $j = 1, \ldots, n$, $\frac{\partial f(u)}{\partial u_j}$ can be obtained by recursively computing the Jacobian vector product given by
\[
v_\ell := J_{h_\ell}(g_{\ell-1}(u)) \, v_{\ell-1},
\]
for $\ell = 1, \ldots, L$, starting with $v_0 = e_j$ and $g_0(u) = u$.
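The Jacobian vector product recursion can be sketched for a two-layer composition $f = h_2 \circ h_1$; the particular maps $h_1(u) = [\sin u_1,\ u_1 u_2]^\top$ and $h_2(v) = [v_1 + v_2,\ v_1 v_2]^\top$ are illustrative assumptions. One pass of the recursion recovers the $j$-th column of $J_f(u)$, checked against a finite difference.

```python
import numpy as np

# Two-layer composition f = h2 ∘ h1 (example maps for illustration)
def h1(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_h1(u):
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def h2(v):
    return np.array([v[0] + v[1], v[0] * v[1]])

def J_h2(v):
    return np.array([[1.0, 1.0],
                     [v[1], v[0]]])

def f(u):
    return h2(h1(u))

u = np.array([0.4, 1.5])
j = 0                        # which column of J_f(u) to compute
v = np.zeros(2)
v[j] = 1.0                   # v_0 = e_j

# Jacobian vector products, innermost Jacobian applied first, as in (A.17)
v = J_h1(u) @ v              # v_1 = J_h1(u) v_0
v = J_h2(h1(u)) @ v          # v_2 = J_h2(g_1(u)) v_1

# Finite-difference check of the j-th column of J_f(u)
h = 1e-6
e = np.zeros(2)
e[j] = h
col_num = (f(u + e) - f(u)) / h
```

This is the forward-mode pattern: one sweep from the innermost function outward yields one column of the Jacobian.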
On the other hand, since the $i$-th row of $J_f(u)$ is the gradient $\nabla f_i(u)^\top$, we have
\[
\nabla f_i(u)^\top = e_i^\top J_f(u) = \Big[ \cdots \big[ e_i^\top J_{h_L}(g_{L-1}(u)) \big] \, J_{h_{L-1}}(g_{L-2}(u)) \cdots \Big] J_{h_1}(u). \tag{A.18}
\]
That is, for each $i = 1, \ldots, m$, $\nabla f_i(u)$ can be obtained by recursively computing the vector Jacobian product given by
\[
v_\ell^\top := v_{\ell-1}^\top \, J_{h_{L-\ell+1}}(g_{L-\ell}(u)),
\]
for $\ell = 1, \ldots, L$, starting with $v_0 = e_i$ and $g_0(u) = u$.
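The mirror-image vector Jacobian product recursion can be sketched with the same kind of two-layer example (the maps $h_1$, $h_2$ below are illustrative assumptions): multiplying a row vector through the Jacobians from the outermost function inward recovers one row of $J_f(u)$, that is, one gradient $\nabla f_i(u)^\top$.

```python
import numpy as np

# Same style of two-layer composition, now traversed in reverse
def h1(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_h1(u):
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def h2(v):
    return np.array([v[0] + v[1], v[0] * v[1]])

def J_h2(v):
    return np.array([[1.0, 1.0],
                     [v[1], v[0]]])

u = np.array([0.4, 1.5])
i = 1                        # which component f_i to differentiate
w = np.zeros(2)
w[i] = 1.0                   # v_0 = e_i

# Vector Jacobian products, outermost Jacobian applied first, as in (A.18)
w = w @ J_h2(h1(u))          # v_1^T = e_i^T J_h2(g_1(u))
w = w @ J_h1(u)              # v_2^T = v_1^T J_h1(u), now equal to grad f_i(u)^T

# Check against the full Jacobian product (A.16)
J_f = J_h2(h1(u)) @ J_h1(u)
```

This is the reverse-mode (backpropagation) pattern: one sweep yields a full gradient, regardless of the input dimension $n$.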
A.4 Taylor’s Theorem
Once again consider a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$. If all the $k$-th order derivatives of $f$ are continuous at a point $u \in \mathbb{R}^n$, then Taylor's theorem offers an approximation for $f$ within a neighborhood of $u$ in terms of these derivatives. We are particularly interested in the cases $k = 1$ and $k = 2$, as they are crucial in the implementation of, respectively, first-order and second-order optimization methods. It is easy to understand the theorem when the function $f$ is univariate. Hence we start with the univariate case and then move to the general multivariate case. We omit the proof of Taylor's theorem as it is a well known result that can be found in any standard multivariate calculus textbook.
Univariate Case
Suppose that $n = 1$, that is, $f$ is a univariate real-valued function. We say that $f$ is $k$-times continuously differentiable on an open interval $U \subseteq \mathbb{R}$ if $f$ is $k$ times differentiable at every point on $U$ (i.e., the $k$-th order derivative $\frac{d^k f(u)}{du^k}$ exists for all $u \in U$) and $\frac{d^k f(u)}{du^k}$ is continuous on $U$. If $k = 0$, we interpret $\frac{d^k f(u)}{du^k}$ simply as $f(u)$.
Theorem A.1 (Taylor's Theorem in $\mathbb{R}$). Let $f : \mathbb{R} \to \mathbb{R}$ be $k$-times continuously differentiable on an open interval $U \subseteq \mathbb{R}$. Then, for any $u, v \in U$,
\[
f(u) = \sum_{i=0}^{k} \frac{(u - v)^i}{i!} \frac{d^i f(v)}{du^i} + O\big(|u - v|^{k+1}\big). \tag{A.19}
\]
The polynomial
\[
P_k(u) = \sum_{i=0}^{k} \frac{(u - v)^i}{i!} \frac{d^i f(v)}{du^i},
\]
appearing in (A.19), is called the $k$-th order Taylor polynomial. Since the remainder
\[
R_k(u) = f(u) - P_k(u) \to 0, \quad \text{as } u \to v,
\]
$f(u)$ is approximately equal to $P_k(u)$ for $u$ within a small neighborhood of $v$. In particular, for a point $u$ near $v$, $P_1(u)$ is a linear approximation of $f(u)$ and $P_2(u)$ is a quadratic approximation of $f(u)$.
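A concrete univariate sketch: for the example function $f(u) = e^u$ expanded around $v = 0$ (both are illustrative choices), the first-order polynomial $P_1(u) = 1 + u$ and second-order polynomial $P_2(u) = 1 + u + u^2/2$ show the expected improvement in accuracy.

```python
import numpy as np

# Taylor polynomials of f(u) = exp(u) around v = 0:
#   P1(u) = 1 + u
#   P2(u) = 1 + u + u^2 / 2
v = 0.0
u = 0.1

f_u = np.exp(u)
P1 = 1.0 + (u - v)
P2 = P1 + (u - v)**2 / 2.0

err1 = abs(f_u - P1)  # remainder of order |u - v|^2
err2 = abs(f_u - P2)  # remainder of order |u - v|^3
```

As (A.19) predicts, the quadratic approximation has a markedly smaller error than the linear one at the same point.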
Multivariate Case
Now consider the multivariate case, that is, $f$ is a multivariate real-valued function. In order to state Taylor's theorem for this case, we need some new notation that is relevant only here.
An $n$-tuple $\alpha = (\alpha_1, \ldots, \alpha_n)$ is called a multi-index if each $\alpha_i$ is a non-negative integer. For a multi-index $\alpha$, let
\[
|\alpha| = \sum_{i=1}^{n} \alpha_i, \quad \alpha! = \alpha_1! \cdots \alpha_n!, \quad \text{and} \quad u^\alpha = u_1^{\alpha_1} \cdots u_n^{\alpha_n},
\]
for any $u \in \mathbb{R}^n$. Then, the higher order partial derivatives are expressed as
\[
D^\alpha f(u) = \frac{\partial^{|\alpha|} f(u)}{\partial u_1^{\alpha_1} \cdots \partial u_n^{\alpha_n}}.
\]
We say that $f$ is $k$-times continuously differentiable on an open set $U \subseteq \mathbb{R}^n$ if all the higher order partial derivatives $D^\alpha f(u)$ exist and are continuous on $U$ for every multi-index $\alpha$ such that $|\alpha| \le k$.
Theorem A.2 (Taylor's Theorem in $\mathbb{R}^n$). Let $f : \mathbb{R}^n \to \mathbb{R}$ be $k$-times continuously differentiable on an open set $U \subseteq \mathbb{R}^n$. Then, for any $u, v \in U$,
\[
f(u) = \sum_{\alpha : |\alpha| \le k} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} + O\big(\|u - v\|^{k+1}\big). \tag{A.20}
\]
The polynomial
\[
P_k(u) = \sum_{\alpha : |\alpha| \le k} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!},
\]
is called the $k$-th order Taylor polynomial. In particular,
\[
P_1(u) = \sum_{\alpha : |\alpha| \le 1} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} = f(v) + (u - v)^\top \nabla f(v), \tag{A.21}
\]
for $u$ near $v$, provides a linear approximation, also called the first-order Taylor approximation, to $f(u)$, while
\[
P_2(u) = \sum_{\alpha : |\alpha| \le 2} D^\alpha f(v) \frac{(u - v)^\alpha}{\alpha!} = f(v) + (u - v)^\top \nabla f(v) + \frac{1}{2} (u - v)^\top \nabla^2 f(v) (u - v) \tag{A.22}
\]
provides a quadratic approximation, also called the second-order Taylor approximation, to $f(u)$.
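A multivariate sketch of (A.21) and (A.22): the example function $f(u) = e^{u_1} + u_1 u_2^2$, the expansion point $v$, and the nearby point $u$ are illustrative assumptions; the gradient and Hessian are written in closed form so that $P_1$ and $P_2$ can be compared against the true value.

```python
import numpy as np

# f(u) = exp(u1) + u1 * u2^2 (example function for illustration)
def f(u):
    return np.exp(u[0]) + u[0] * u[1]**2

def grad_f(u):
    return np.array([np.exp(u[0]) + u[1]**2, 2 * u[0] * u[1]])

def hess_f(u):
    return np.array([[np.exp(u[0]), 2 * u[1]],
                     [2 * u[1], 2 * u[0]]])

v = np.array([0.0, 1.0])      # expansion point
u = v + np.array([0.05, -0.02])
d = u - v

P1 = f(v) + d @ grad_f(v)                 # first-order approximation (A.21)
P2 = P1 + 0.5 * d @ hess_f(v) @ d         # second-order approximation (A.22)

err1 = abs(f(u) - P1)
err2 = abs(f(u) - P2)
```

Adding the quadratic Hessian term shrinks the approximation error, which is exactly why second-order optimization methods use (A.22).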
Linear Approximation with Jacobians and Hessians
Consider a differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$ with the $m \times n$ Jacobian $J_f(\cdot)$. Then with Theorem A.2 we may construct a first order linear approximation to $f(\cdot)$ around any $u_0 \in \mathbb{R}^n$,
\[
\tilde{f}(u) = f(u_0) + J_f(u_0)(u - u_0), \tag{A.23}
\]
where $\tilde{f}(u) \approx f(u)$.
Now consider a twice differentiable $g : \mathbb{R}^n \to \mathbb{R}$ with gradient $\nabla g(\cdot)$ and Hessian matrix $\nabla^2 g(\cdot)$. We can set $f(u) = \nabla g(u)$ with $f : \mathbb{R}^n \to \mathbb{R}^n$. Since the Hessian of $g(\cdot)$ is the Jacobian of $f(\cdot)$, from (A.23) we obtain a first order linear approximation for the gradient around $u_0 \in \mathbb{R}^n$,
\[
\widetilde{\nabla g}(u) = \nabla g(u_0) + \nabla^2 g(u_0)(u - u_0), \tag{A.24}
\]
where $\widetilde{\nabla g}(u) \approx \nabla g(u)$.