
7 Sequence Models
Notes and References
A useful applied introductory text about time-series sequence data analysis is [24] and a more theoretical book is [6]. Yet, while these are texts about sequence models, the traditional statistical and forecasting time-series focus is on cases where each x⟨t⟩ is a scalar or a low-dimensional vector.
Neural network models, the topic of this chapter, are very different, and for an early review of recurrent neural networks and generalizations see chapter 10 of [17] and the many references therein, where key references are also listed below.
As the most common application of sequence models is textual data, let us mention early texts on natural language processing (NLP). General early approaches to NLP are summarized in [35] and [26], where the topic is tackled via rule-based approaches based on the statistics of grammar. A much more modern summary of applications is [30], and a review of applications of neural networks for NLP is in [16], yet this field is quickly advancing at the time of publication of this book. See also chapter 7 of the book [1] for a comprehensive discussion of RNNs as well as their long short-term memory (LSTM) and gated recurrent unit (GRU) generalizations.
Recurrent neural networks (RNNs) are useful for broad applications such as DNA sequencing, see for example [46], image captioning as in [23], time series prediction as in [5], sentiment analysis as in [34], speech recognition as in [19], and many other applications. Possibly one of the first constructions of RNNs in their modern form appeared in [15] and is sometimes referred to as an Elman network. Yet this was not the inception of ideas for recurrent neural networks, and earlier ideas appeared in several influential works over the previous decades. See [44] for a historical account with notable earlier publications including [2] in 1972, and [22] and [43] in the 1980s.
The introduction of bidirectional RNNs is in [45]. The introduction of long short-term memory (LSTM) models in the late 1990s was in [21]. Gated recurrent units (GRUs) are much more recent concepts and were introduced in [9] and [11] after the big spark of interest in deep learning occurred. An empirical comparison of these various approaches is in [25]. A more contemporary review of LSTMs is in [54]. These days, for advanced NLP tasks, LSTMs and GRUs are generally outperformed by transformer models, yet in non-NLP applications we expect LSTMs to remain a useful tool for many years to come. Some recent example application papers include [28], [37], [42], and [55], among many others.
Moving on to textual data, the idea of word embeddings is now standard in the world of NLP. The key principle originates with the word2vec work in [36]. Word embedding was further developed with GloVe in [38]. These days when considering dictionaries, lexicons, tokenizations, and word embeddings, one may often use dedicated libraries such as those supplied (and contributed to) with HuggingFace (https://huggingface.co). An applied book in this domain is [50] and since the field is moving quickly, many others are to appear as well.
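As a toy illustration of the word-embedding principle behind word2vec and GloVe, semantic relatedness between words can be measured as the cosine similarity of their embedding vectors. The vectors below are made-up low-dimensional values purely for illustration; real embeddings typically have hundreds of dimensions and are learned from data.

```python
import math

# Hypothetical 4-dimensional word embeddings (values invented for illustration).
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.9, 0.2, 0.1, 0.8],
    "apple": [0.1, 0.1, 0.9, 0.2],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # higher
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower
```

In a trained embedding space this geometric structure also supports analogies of the kind popularized by word2vec, where vector arithmetic such as king − man + woman lands near queen.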
The modern neural encoder-decoder approach was pioneered by [27], and then, in the context of machine translation, influential works are [10] and [48]. The idea of using attention in recent times, first for handwriting recognition, was proposed in [18], and then the work in [4] extended the idea and applied it to machine translation as we illustrate in our Figure 7.10. A massive advance came with the 2017 paper, "Attention is all you need" [51], which introduced the transformer architecture, the backbone of almost all of today's leading large language models. Ideas of layer normalization are from [3]. Further details of transformers can be found in [39], and a survey of variants of transformers as well as non-NLP applications can be found in [31].
At the time of publishing of this book, the hottest topics in the world of deep learning are large language models and their multi-modal counterparts. A recent comprehensive survey is in [56], and other surveys are [8] and [20]. We should note that as this particular field is moving very rapidly, there will surely be significant advances in the coming years. Multimodal models are being developed and deployed as well, and these models have images as input and output in addition to text; see [52] for a survey. Indeed, beyond the initial task of machine translation, transformers have also been applied to images with incredible success. A first landmark paper on this avenue is [13]. See also the survey papers [29] and [33].