
2 Principles of Machine Learning
Notes and References
One does not need to master every other branch of machine learning to understand deep learning;
nevertheless, getting a taste for key elements of the field is useful. Beyond the basics presented
in this chapter, one may consult several general machine learning texts. We recommend [240] for a
comprehensive mathematical account of practical machine learning, and the more classic [39] as an
additional resource. Further, the book [299] provides a probabilistic approach. Focusing on linear
algebra, the book [56] is a good introduction to foundations such as K-means, least squares, and
ridge regression. Further, [391] provides a richer context covering PCA, SVD, and many aspects of
matrix algebra appearing in machine learning. Finally, for a short read that provides an overview
of many practical aspects of machine learning, see [68]. An additional recommended reference
is [263].
The worlds of machine learning and statistical inference are intertwined, and methods developed
in one field are often used in the other. For those with expertise in one or both fields it is
quite easy to spot the differences between the approaches; however, for those entering these
worlds afresh it may be helpful to read the survey paper “Statistical modeling: The two cultures”
by Leo Breiman [61]. On that note, to get a feel for many statistical aspects of linear
regression, see, e.g., [296] or one of many other statistical books. Note that [296] is also a good
reference for understanding interaction terms, a concept that we mentioned in the chapter but did
not cover. A general text that integrates methodology and algorithms with statistical inference and
machine learning, together with speculations on future directions, is [115].
Throughout this chapter we have made reference to several aspects of statistics or machine learning
that are not studied further in this book. Here are some references for each. In general, a good
reference for likelihood-based inference is [31]. Specifically, the Akaike information criterion
(AIC), introduced in [7], and the Bayesian information criterion (BIC) are surveyed in [432]. A
general class of models also appearing in the next chapter is generalized linear models (GLMs);
these first appeared in [304], and a good contemporary applied reference is [121]. Other models are
generalized additive models (GAMs), which extend generalized linear models by modelling some
predictor variables with smooth functions; see [168]. In terms of non-linear regression, the LOESS
method is a generalization of moving average and polynomial regression; see [88]. Further,
Nadaraya-Watson kernel regression is a non-parametric regression method in which a kernel function
is used to weight the observations; see [378].
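As a small illustration of the kernel regression idea mentioned above, the following is a minimal sketch of a Nadaraya-Watson estimator in plain Python. It is not taken from [378]; the Gaussian kernel, the `bandwidth` parameter, the function names, and the toy data are all assumptions chosen for illustration. The estimate at a query point is a weighted average of the observed responses, with weights given by the kernel evaluated at the scaled distance to each observation.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density, used here as the (assumed) kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of ys, with weights centred at x_query."""
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero: the query is too far from the data
        # relative to the bandwidth.
        raise ValueError("kernel weights are all zero; try a larger bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical toy data: noiseless samples of y = x^2 on a grid over [-2, 2].
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
estimate = nadaraya_watson(0.5, xs, ys, bandwidth=0.2)
```

A smaller bandwidth makes the estimate follow the data more closely, while a larger one averages over a wider neighbourhood; how to choose it is not addressed in this sketch.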
We have covered the basics of decision theory via binary classification; however, there is much
more to study in this area. See the comprehensive survey [118] on metrics for binary
classification, as well as [158]. For a discussion of different uses of receiver operating
characteristic (ROC) curves and different approaches to them, see [58] and [315]. The origins of
the $F_1$ score can be attributed to Cornelis Joost van Rijsbergen, who introduced the
effectiveness function of which the $F_1$ score is a special case; see [408]. The SMOTE method for
dealing with unbalanced data is from [75]. See also the surveys [153], [211], and [348].
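To make the remark about the effectiveness function concrete, here is a minimal sketch in plain Python. It assumes the commonly stated form of van Rijsbergen's effectiveness measure, E = 1 - 1/(α/P + (1-α)/R), where P is precision, R is recall, and α is a weight in (0, 1); the function names are hypothetical. Under this assumption, 1 - E with α = 1/2 is the harmonic mean of precision and recall, that is, the $F_1$ score, and α = 1/(1 + β²) gives the more general F-beta score.

```python
def effectiveness(precision, recall, alpha=0.5):
    """Assumed form of van Rijsbergen's effectiveness measure E."""
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the F1 score (harmonic mean of P and R)."""
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision and recall values for illustration.
p, r = 0.5, 0.8
f1_direct = f_beta(p, r, beta=1.0)
f1_via_e = 1.0 - effectiveness(p, r, alpha=0.5)
```

Both routes give the same value, 2PR/(P + R), which is the sense in which the $F_1$ score is a special case of the effectiveness function.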
We briefly mentioned the differences between discriminative and generative learning. More on the
topic is in chapter 9 of [299], together with a treatment of the naive Bayes classifier and linear
discriminant analysis (LDA). Support vector machines became extremely popular in the world of
machine learning, with their popularity peaking during the 1990s and the decade that followed. A
complete treatment of these methods is in [240], together with associated ideas of kernel methods.
Specific to this area is the concept of VC dimension (standing for Vapnik–Chervonenkis), which we
did not cover here; see [409].
Decision trees are also very popular machine learning techniques; see chapter 8 of [240] for an
overview. Within the study of machine learning, the generic methods of boosting and bagging are
prominent in the context of decision trees; see specifically [366] and [59]. The random forest
algorithm is one such method, and it has been hailed as the most usable ad-hoc generic method when
there is no further information about the problem; see [60]. Gradient boosting has become very
popular due to a software package called XGBoost; see [77]. The K-nearest neighbours classification
algorithm that we mentioned is often used as an introductory example; see, for example,
Section 2.3 of [166].
The origins of least squares fitting are from the turn of the 19th century, initially with applications
to astronomy. The first least squares publication is typically attributed to an 1809 paper by Gauss