
7 Sequence Models
Notes and References
A useful applied introductory text on the analysis of time-series sequence data is [198], and a more
theoretical book is [64]. Yet, while these are texts about sequence models, the traditional statistical
and forecasting time-series focus is on cases where each x^⟨t⟩ is a scalar or a low-dimensional vector.
Neural network models, the topic of this chapter, are very different. For an early review of
recurrent neural networks and generalizations, see chapter 10 of [142] and the many references
therein; key references are also listed below.
As the most common application of sequence models is textual data, let us mention early texts on
natural language processing (NLP). General early approaches to NLP are summarized in [280] and
[216], where the topic is tackled via rule-based approaches based on the statistics of grammar. A
much more modern summary of applications is [228], and a review of applications of neural networks
for NLP is in [138], yet this field is advancing quickly at the time of publishing of this book.
See also chapter 7 of the book [4] for a comprehensive discussion of RNNs as well as their long short
term memory (LSTM) and gated recurrent unit (GRU) generalizations.
Recurrent neural networks (RNNs) are useful in a broad range of applications, such as DNA sequencing
(see for example [375]), image captioning as in [188], time series prediction as in [22], sentiment
analysis as in [274], speech recognition as in [146], and many others. Possibly one of the first
constructions of RNNs in their modern form appeared in [117] and is
sometimes referred to as an Elman network. Yet this was not the inception of ideas for recurrent
neural networks, and earlier ideas appeared in several influential works over the previous decades.
See [367] for a historical account, with notable earlier publications including [13] in 1972, and
[185] and [357] in the 1980s.
The introduction of bidirectional RNNs is in [369]. Long short term memory (LSTM) models were
introduced in the late 1990s in [184]. Gated recurrent units (GRUs) are a much more
recent concept and were introduced in [80] and [85], after the big spark of interest in deep learning
occurred. An empirical comparison of these various approaches is in [214]. A more contemporary
review of LSTM is in [438]. These days, for advanced NLP tasks LSTMs and GRUs are generally
outperformed by transformer models, yet in non-NLP applications we expect LSTMs to remain
a useful tool for many years to come. Some recent example application papers include [220], [295],
[351], and [446], among many others.
Moving on to textual data, the idea of word embeddings is now standard in the world of NLP. The
key principle originates with the word2vec work in [288]. Word embedding was further developed
with GloVe in [327]. These days, when considering dictionaries, lexicons, tokenizations, and word
embeddings, one may often use dedicated libraries, for example those supplied through (and
contributed to) HuggingFace.¹⁶ An applied book in this domain is [403], and since the field is
moving quickly, many others are to appear as well.
The modern neural encoder-decoder approach was pioneered by [218], and in the context of
machine translation, influential works are [81] and [392]. The idea of using attention in recent times,
first for handwriting recognition, was proposed in [145]; the work in [20] then extended the idea
and applied it to machine translation, as we illustrate in our Figure 7.10. A massive advance came with
the 2017 paper “Attention is all you need” [410], which introduced the transformer architecture,
the backbone of almost all of today’s leading large language models. Ideas of layer normalization
are from [19]. Further details of transformers can be found in [331], and a survey of variants of
transformers as well as non-NLP applications can be found in [262].
At the time of publishing of this book, the hottest topics in the world of deep learning are large
language models and their multimodal counterparts. A recent comprehensive survey is in [449],
and other surveys are [73] and [155]. We should note that as this particular field is moving very
rapidly at the time of publication of the book, there will surely be significant advances in the
coming years. Multimodal models are being developed and deployed as well, and these models have images
as input and output in addition to text; see [428] for a survey. Indeed, beyond the initial task of
machine translation, transformers have also been applied to images with incredible success. A first
landmark paper in this direction is [107]. See also the survey papers [227] and [269].
¹⁶ https://huggingface.co.