Mathematical Engineering of Deep Learning

Benoit Liquet, Sarat Moka and Yoni Nazarathy

March 2, 2026

To my mom,

Benoit Liquet.

To my mother Mariyamma and my wife Toshali,

Sarat Moka.

To Emily, Kayley, and Yarden,

Yoni Nazarathy.

Preface

In the last few years deep learning has seen explosive growth and has even been dubbed the

“new electricity”. The ﬁeld has shown incredible success in automated applications and

predictive tasks. Deep learning models are mathematical in nature, and hence to understand

deep learning, one needs to understand the mathematical description of the models. This

book aims to provide such an understanding via a concise, accessible, and self-contained

presentation.

Many deep learning resources focus on programming while making an eﬀort to hide the

mathematics. Other resources focus on theoretical results without an attempt to disambiguate

the terminology of the ﬁeld. A third breed of resources puts heavy emphasis on historical

progression. Each of these viewpoints is important for a specific purpose; however, we

conjecture that for a mathematical audience, these are all suboptimal ways to learn about

deep learning. Hence we created this book.

Our focus is on the basic mathematical constructs that make up the ﬁeld. We call this

mathematical engineering of deep learning. Using the language of equations and

algorithms, deep learning objects interface together to make very powerful models. A reader

armed with basic familiarity with mathematical notation and knowledge of basic calculus,

basic probability, and basic linear algebra can go a long way in understanding deep learning

quickly. For this, we use simple mathematical notation to outline the mechanisms used for

training, execution, and application of deep neural networks.

Deep learning is certainly not solely about mathematics, as it also requires good software,

hardware, and data. However, we aim to present the mathematical technology of deep

learning without focusing on the implementation aspects that one would consider if trying

to use deep learning frameworks in practice. We also aim to focus on the current state of

the art as opposed to the historical progression of the ﬁeld. Finally, we aim to minimize the

focus on the human brain and the loose analogies that one can make between deep artiﬁcial

neural networks and actual biological neurons. All of these aspects that we downplay are

important, but we believe that if in an initial exposure to the ﬁeld one spends too much

time on implementation, history, or bio-neurological analogies, then the simplicity of deep

learning is missed.

The book is primarily intended for readers from engineering, signal processing, statistics,

physics, econometrics, operations research, quantitative management, pure mathematics,

bioinformatics, applied machine learning, or even applied deep learning. A reader with

background in one of these domains will be able to get a concentrated and concise description

of deep learning. In cases where a mathematical refresher is needed, appendices provide a

condensed review, for example a review of key aspects of multivariable calculus.

The book can be read sequentially, or alternatively readers may wish to jump between

chapters for quick lookup. It is assumed that readers have had exposure to mathematical

notation at the level equivalent to at least 3 or 4 university courses. Hence set notation,

matrices, basic probability, and calculus are used without apology. However, no explicit

knowledge of machine learning, statistics, optimization, or advanced probability is needed or

assumed. Our hope is that we strike the right balance so that a mathematically equipped

non-expert can easily read the book in a self-contained manner.

While the focus of the book is “mathematical engineering”, we fully acknowledge the

importance of applications and the ability to use software and hardware eﬀectively. For

this you may also use the companion website, https://deeplearningmath.org/, where

additional examples, links, and software usage details are provided.

Outline of the Contents

The book has 8 chapters and 2 appendices. Chapters 1 – 4 introduce the ﬁeld, outline key

concepts from machine learning, present an overview of optimization concepts needed for

deep learning, and focus on fundamental models and concepts. Chapter 5 is the central

chapter introducing fully connected deep neural networks. Chapters 6 and 7 deal with the

core models and architectures of deep learning, including convolutional networks, recurrent

neural networks, and transformers. Chapter 8 covers additional popular domains such as

generative models, reinforcement learning, and graph neural networks. Appendices A and B

provide mathematical support. Here is a detailed outline of the contents.

Chapter 1 – Introduction: In this chapter we present an overview of deep learning,

demonstrate key applications, survey the associated ecosystems of high performance computing,

discuss big and high-dimensional data, and set the tone for the rest of the book. The

chapter discusses key terminology including data science, machine learning, and statistical

learning, and with this we place these terms in the context of the book. Key popular datasets

such as ImageNet and MNIST digits are also presented together with a description of the

deep learning culture that emerged.

Chapter 2 – Principles of Machine Learning: Deep learning can be viewed as a

sub-discipline of machine learning and hence this chapter provides an overview of key

machine learning concepts and paradigms. The reader is introduced to supervised learning,

unsupervised learning, and the general concept of iterative optimization for learning.

The concepts of training sets, test sets, and the like, together with principles of cross

validation and model selection are introduced. A key object explored in the chapter is the

linear model, which can also be trained via iterative optimization. We introduce the simplest

gradient descent algorithm, which is later refined in Chapter 4. Gradient descent

is used for training almost any deep learning model. We also explore basic unsupervised

learning algorithms including K-means clustering, principal component analysis (PCA), and

the singular value decomposition (SVD).

Chapter 3 – Simple Neural Networks: In this chapter we focus on logistic regression

(sigmoid) for binary classiﬁcation and the related multinomial regression model (softmax)

for multi-class problems. These models are the most popular shallow neural networks. The

chapter sets the tone for more complex models by introducing principles of deep learning

such as the cross entropy loss and other basic terminology. The chapter also presents a simple

non-linear autoencoder architecture and with this introduces general ideas of autoencoders.

Chapter 4 – Optimization Algorithms: The training of deep learning models involves

optimization over the learned parameters. Hence a solid understanding of optimization

algorithms is required, as well as an understanding of specialized optimization techniques

that work well for deep learning models such as the ADAM algorithm. In this chapter we focus

on such techniques. We also study the details of various forms of automatic diﬀerentiation,

a tool that has become critical in deep learning for computing gradients. Other optimization

techniques, not always popular in contemporary deep learning, are also surveyed. This

includes various ﬁrst-order and second-order methods.

Chapter 5 – Feedforward Deep Networks: This chapter is the heart of the book where

the general feedforward deep neural network, also known as the multi-layer perceptron, is

deﬁned and introduced. After introducing the basic architecture and exploring the expressive

power of deep neural networks, we dive into the details of training by understanding the

backpropagation algorithm for gradient evaluation. We also explore other aspects such as

weight initialization, batch normalization, and dropout.

Chapter 6 – Convolutional Neural Networks: Convolutional neural networks are

natural models for images and similar spatial data formats. In this chapter we explore the

convolution concept and then see it used in the context of deep learning models. The concepts

of channels and general convolutional neural networks are introduced. We then follow with

an exploration of landmark architectures that have made significant impact and are

still in use today. We also explore a few key tasks associated with images such as object

localization and face identiﬁcation.

Chapter 7 – Sequence Models: Sequence models are critical for data such as text with

applications in natural language processing, conversational agents, and translation. In this

chapter we get a taste for the key deep learning ideas of the ﬁeld. We explore recurrent

neural networks and their generalizations including long short-term memory (LSTM) models

and gated recurrent unit (GRU) models. We then explore encoder-decoder architectures

building up to the concept of attention where we formalize the attention mechanism. This

idea is then integrated into transformer models, which in many ways are the state-of-the-art

models used in large language models (LLMs).

Chapter 8 – Specialized Architectures and Paradigms: In this ﬁnal chapter we

survey key ideas of specialized architectures and paradigms which are used for various

types of tasks. This includes generative models, reinforcement learning, and graph neural

networks. In terms of generative models we start by diving into the variational autoencoder

architecture, a probabilistic deep learning model. We then extend to Markovian hierarchical

variational autoencoders of which diﬀusion models are a special case. We then study generative

adversarial networks (GANs) which were the ﬁrst class of highly powerful deep learning

models for realistic looking image generation. The chapter then moves to study reinforcement

learning where we ﬁrst present an overview of basics of Markov decision processes and then

hint at how deep reinforcement learning can be implemented. We close with an introduction

to graph neural networks. As such, the multitude of ideas in this chapter encompasses several

paradigms where deep learning models can be modiﬁed or joined together for specialized

purposes.

With Thanks

We began this project while undertaking instruction at the 2021 AMSI (Australian Mathematical

Sciences Institute) summer school. In that course we taught 60 students from all

over Australia for 28 lecture hours. See a link to the course material through the book

website, https://deeplearningmath.org/. We thank the students for embarking on the

journey with us and further appreciate student comments useful for creating the book. We

also mention that without support from our families and loved ones, this book would not have been

possible. We thank Alan White for supplying the banana for Figure 1.1. We thank various

family members for appearing in some of our images.

We especially thank Vishnu Prasath and Ajay Hemanth of Richmond Enterprises PVT LTD

for working on many illustrations of the book. The TikZ source code for these illustrations

is now open sourced with a link available through the course website. A few of the images in

our ﬁgures, when mentioned, are taken from other research papers and other sources. We

thank the authors for permission to use these images. We also thank Toshali Banerjee for

art design.

We also thank the following people for detailed comments and useful discussions: Teo Nguyen,

Thomas Grahm, Vektor Dewanto, and Miriam Redding. In addition, useful comments were

received from Marcus Gallagher, Matt Dirks, Adam Bennaceur, Gabriel Bianconi, Kwangsoo

Cho, Jerzy Filar, Liam Bluett, Fred Roosta-Khorasani, and Maria Vlasiou. Sarat Moka

also thanks Celestien Warnaar-Notschaele and Ole Warnaar for friendship and an extended

accommodation period in Brisbane during the extensive Sydney lockdown of 2021.

We hope that you enjoy the book.

Benoit Liquet, Sarat Moka, and Yoni Nazarathy.

February 2024.

Contents

Preface

1 Introduction
1.1 The Age of Deep Learning
1.2 A Taste of Tasks and Architectures
1.3 Key Ingredients of Deep Learning
1.4 DATA, Data, data!
1.5 Deep Learning as a Mathematical Engineering Discipline
1.6 Notation and Mathematical Background
Notes and References

2 Principles of Machine Learning
2.1 Key Activities of Machine Learning
2.2 Supervised Learning
2.3 Linear Models at Our Core
2.4 Iterative Optimization Based Learning
2.5 Generalization, Regularization, and Validation
2.6 A Taste of Unsupervised Learning
Notes and References

3 Simple Neural Networks
3.1 Logistic Regression in Statistics
3.2 Logistic Regression as a Shallow Neural Network
3.3 Multi-class Problems with Softmax
3.4 Beyond Linear Decision Boundaries
3.5 Shallow Autoencoders
Notes and References

4 Optimization Algorithms
4.1 Formulation of Optimization
4.2 Optimization in the Context of Deep Learning
4.3 Adaptive Optimization with ADAM
4.4 Automatic Differentiation
4.5 Additional Techniques for First-Order Methods
4.6 Concepts of Second-Order Methods
Notes and References

5 Feedforward Deep Networks
5.1 The General Fully Connected Architecture
5.2 The Expressive Power of Neural Networks
5.3 Activation Function Alternatives
5.4 The Backpropagation Algorithm
5.5 Weight Initialization
5.6 Batch Normalization
5.7 Mitigating Overfitting with Dropout and Regularization
Notes and References

6 Convolutional Neural Networks
6.1 Overview of Convolutional Neural Networks
6.2 The Convolution Operation
6.3 Building a Convolutional Layer
6.4 Building a Convolutional Neural Network
6.5 Inception, ResNets, and Other Landmark Architectures
6.6 Beyond Classification
Notes and References

7 Sequence Models
7.1 Overview of Models and Activities for Sequence Data
7.2 Basic Recurrent Neural Networks
7.3 Generalizations and Modifications to RNNs
7.4 Encoders Decoders and the Attention Mechanism
7.5 Transformers
Notes and References

8 Specialized Architectures and Paradigms
8.1 Generative Modelling Principles
8.2 Diffusion Models
8.3 Generative Adversarial Networks
8.4 Reinforcement Learning
8.5 Graph Neural Networks
Notes and References

Epilogue

A Some Multivariable Calculus
A.1 Vectors and Functions in R
A.2 Derivatives
A.3 The Multivariable Chain Rule
A.4 Taylor’s Theorem

B Cross Entropy and Other Expectations with Logarithms
B.1 Divergences and Entropies
B.2 Computations for Multivariate Normal Distributions

Bibliography

Index