
2 Principles of Machine Learning
Notes and References
One does not need to master all other branches of machine learning to understand deep learning; nevertheless, getting a taste of key elements of the field is useful. Beyond the basics that we presented in this chapter, one may consult several general machine learning texts. We recommend [43] for a comprehensive mathematical account of practical machine learning, and the more classic [4] as an additional resource. Further, the book [53] provides a probabilistic approach. Focusing on linear algebra, the introductory book [6] covers foundations such as K-means, least squares, and ridge regression. Further, [65] provides a richer context covering PCA, SVD, and many aspects of matrix algebra appearing in machine learning. Finally, for a short read that provides an overview of many practical aspects of machine learning, see [12]. An additional recommended reference is [45].
The worlds of machine learning and statistical inference are intertwined, and methods developed in one field are often used in the other. For those with expertise in one or both of the fields it is quite easy to spot the differences between the approaches; however, for those entering these worlds afresh it may be helpful to read the survey paper “Statistical modeling: The two cultures” by Leo Breiman, [10]. On that note, to get a feel for many statistical aspects of linear regression, see, e.g., [51] or one of many other statistical books. Note that [51] is also a good reference for understanding interaction terms, a concept that we mentioned in the chapter but did not cover. A general text that integrates methodology and algorithms with statistical inference and machine learning, together with speculations about future directions, is [21].
Throughout this chapter we have made reference to several aspects of statistics or machine learning that are not studied further in this book. Here are some references for each. In general, a good reference for likelihood-based inference is [3]. Specifically, the Akaike information criterion (AIC), introduced in [1], and the Bayesian information criterion (BIC) are surveyed in [70]. A general class of models also appearing in the next chapter is generalized linear models (GLMs); these first appeared in [56], and a good contemporary applied reference is [24]. Other models are generalized additive models (GAMs), which extend generalized linear models by modelling some predictor variables with smooth functions; see [34]. In terms of non-linear regression, the LOESS method is a generalization of moving average and polynomial regression; see [17]. Further, Nadaraya-Watson kernel regression is a non-parametric regression method in which a kernel function is exploited; see [62].
We have covered the basics of decision theory via binary classification; however, there is much more study of these aspects. See the comprehensive survey [22] on metrics for binary classification, as well as [29]. For a discussion of different uses of receiver operating characteristic curves and different approaches to them, see [7] and [57]. The origins of the F1 score can be attributed to Cornelis Joost van Rijsbergen, who introduced the effectiveness function of which the F1 score is a special case; see [67]. The SMOTE method for dealing with unbalanced data is from [14]. See also the surveys [28], [38], and [60].
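For concreteness, the relationship between van Rijsbergen's effectiveness function and the F1 score is commonly expressed via the F-beta family (the notation below follows common usage rather than the original text of [67]): writing P for precision and R for recall,

\[
F_\beta \;=\; \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_1 \;=\; \frac{2\,P\,R}{P + R},
\]

so the F1 score is the harmonic mean of precision and recall, obtained by setting \(\beta = 1\), which weights the two quantities equally.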
We briefly mentioned the differences between discriminative and generative learning. More on the topic is in Chapter 9 of [53], together with a treatment of the naive Bayes classifier and linear discriminant analysis (LDA). Support vector machines became extremely popular in the world of machine learning, with their height of popularity during the 1990s and the decade that followed. A complete treatment of these methods is in [43], together with the associated ideas of kernel methods. Specific to this area is the concept of VC dimension (standing for Vapnik–Chervonenkis), which we did not cover here; see [68].
Decision trees are also very popular machine learning techniques; see Chapter 8 of [43] for an overview. Within the study of machine learning, the generic methods of boosting and bagging are prominent in the context of decision trees; specifically, see [61] and [8]. The random forest algorithm is one such method and has been hailed as the most usable generic method when there is no further information about the problem; see [9]. Gradient boosting has become very popular due to a software package called XGBoost; see [15]. The K-nearest neighbours classification algorithm that we mention is often used as an introductory example. See for example Section 2.3 of [32].
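To illustrate why K-nearest neighbours serves so well as an introductory example, here is a minimal sketch of the classifier using only the Python standard library; the toy data points, labels, and choice of k below are illustrative inventions, not taken from any of the referenced texts.

```python
# Minimal K-nearest neighbours classifier sketch (standard library only).
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with its label.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote over the labels of the k closest points.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters as training data.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.15, 0.15)))  # query near the first cluster -> "a"
print(knn_predict(points, labels, (1.05, 0.95)))  # query near the second cluster -> "b"
```

The method has no training phase beyond storing the data, which is exactly what makes it a natural first example: all the work happens at prediction time.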
The origins of least squares fitting are from the turn of the 19th century, initially with applications to astronomy. The first least squares publication is typically attributed to an 1809 paper by Gauss [25], although an earlier 1805 publication by Legendre publicized the concept first. An interesting historical investigation into “who invented least squares” is in [63]. Since then, least squares methods