Bibliography
[1]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on
heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2]
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann
machines. Cognitive Science, 1985.
[3]
J. Adler and S. Lunz. Banach Wasserstein GAN. Advances in Neural Information
Processing Systems, 2018.
[4] C. C. Aggarwal. Neural networks and deep learning. Springer, 2018.
[5] A. Agresti. Categorical data analysis. John Wiley & Sons, 2003.
[6] A. Agresti. Analysis of ordinal categorical data. John Wiley & Sons, 2010.
[7]
H. Akaike. A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 1974.
[8]
K. Akuzawa, Y. Iwasawa, and Y. Matsuo. Expressive speech synthesis via modeling
expressions with variational autoencoder. arXiv:1804.02135, 2018.
[9] P. Albertos and I. Mareels. Feedback and control for everyone. Springer, 2010.
[10] J. J. Allaire. Deep Learning with R. Simon and Schuster, 2018.
[11]
A. Alotaibi. Deep generative adversarial networks for image-to-image translation: A
review. Symmetry, 2020.
[12]
S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic
Computers, 1967.
[13]
S. Amari. Learning patterns and pattern sequences by self-organizing nets of threshold
elements. IEEE Transactions on Computers, 1972.
[14] P. J. Antsaklis and A. N. Michel. Linear systems. Springer, 1997.
[15]
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks.
In International Conference on Machine Learning, 2017.
[16]
L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 1966.
[17]
S. Arora, Z. Li, and K. Lyu. Theoretical analysis of auto rate-tuning by batch
normalization. arXiv:1812.03981, 2018.
[18]
J. Atwood and D. Towsley. Diffusion-convolutional neural networks. Advances in
neural information processing systems, 2016.
[19]
J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[20]
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning
to align and translate. arXiv:1409.0473, 2014.
[21]
P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning
from examples without local minima. Neural networks, 1989.
[22]
W. Bao, J. Yue, and Y. Rao. A deep learning framework for financial time series using
stacked autoencoders and long-short term memory. PLoS ONE, 2017.
[23]
D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying
interpretability of deep visual representations. In Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017.
[24]
A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic
differentiation in machine learning: A survey. Journal of Machine Learning Research,
2018.
[25]
L. M. Beda, L. N. Korolev, N. V. Sukkikh, and T. S. Frolova. Programs for automatic
differentiation for the machine BESM (in Russian). Technical report, Institute for
Precise Mechanics and Computation Techniques, Academy of Science, Moscow, USSR,
1959.
[26]
M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy
of Sciences, 2019.
[27]
A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. SIAM,
Philadelphia, PA; MPS, Philadelphia, PA, 2001.
[28]
Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine
Learning, 2009.
[29]
Y. Bengio. Practical recommendations for gradient-based training of deep architectures.
In Neural Networks: Tricks of the Trade. 2012.
[30]
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of
deep networks. Advances in Neural Information Processing Systems, 2006.
[31]
J. O. Berger and R. L. Wolpert. The Likelihood Principle: A Review, Generalizations,
and Statistical Implications. Lecture Notes—Monograph Series, 1988.
[32]
D. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex analysis and optimization. Athena
Scientific, 2003.
[33]
D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume II. Athena
Scientific, 3rd edition, 2007.
[34]
D. P. Bertsekas. Dynamic programming and optimal control: Volume I. Athena
Scientific, 2012.
[35] D. P. Bertsekas. Nonlinear programming. Athena Scientific, 3rd edition, 2016.
[36] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific,
1996.
[37]
D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization. Athena Scientific,
1997.
[38]
C. Biever. ChatGPT broke the Turing test – the race is on for new ways to assess AI.
Nature, 2023.
[39] C. M. Bishop. Pattern Recognition and Machine learning. Springer, 2006.
[40]
A. Bjerhammar. Application of calculus of matrices to method of least squares: with
special reference to geodetic calculations. Elander, 1951.
[41]
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for
statisticians. J. Amer. Statist. Assoc., 2017.
[42] C. I. Bliss. The method of probits. Science, 1934.
[43]
S. Bock and M. Weiß. A Proof of Local Convergence for the Adam Optimizer. In 2019
International Joint Conference on Neural Networks (IJCNN), 2019.
[44]
D. Böhning. Multinomial logistic regression algorithm. Annals of the Institute of
Statistical Mathematics, 1992.
[45]
D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT: Real-Time Instance Segmentation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[46]
S. Bond-Taylor, A. Leach, Y. Long, and C. G. Willcocks. Deep generative modelling: A
comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive
models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[47]
J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical
optimization: Theoretical and practical aspects. Springer Science & Business Media,
2006.
[48]
L. Bottou. Online algorithms and stochastic approximations. Online learning in neural
networks, 1998.
[49]
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings
of COMPSTAT’2010: 19th International Conference on Computational Statistics, 2010.
[50]
L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade.
2012.
[51]
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine
learning. SIAM review, 2018.
[52]
L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information
Processing Systems, 2003.
[53]
H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular
value decomposition. Biological cybernetics, 1988.
[54]
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio.
Generating sentences from a continuous space. In 20th SIGNLL Conference on
Computational Natural Language Learning, CoNLL 2016, 2016.
[55]
S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
2004.
[56]
S. Boyd and L. Vandenberghe. Introduction to applied linear algebra: Vectors, matrices,
and least squares. Cambridge university press, 2018.
[57]
J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula,
A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable
Transformations of Python+NumPy Programs. http://github.com/google/jax, 2018.
[58]
A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern Recognition, 1997.
[59] L. Breiman. Bagging predictors. Machine Learning, 1996.
[60] L. Breiman. Random forests. Machine Learning, 2001.
[61]
L. Breiman. Statistical Modeling: The Two Cultures (with Comments and a Rejoinder
by the Author). Statistical Science, 2001.
[62]
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression
trees. Wadsworth & Brooks/Cole Statistics/Probability Series, 1984.
[63]
J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs,
with relationships to statistical pattern recognition. In Neurocomputing. 1990.
[64]
P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer Science
& Business Media, 1991.
[65]
J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification
using a "siamese" time delay neural network. Advances in neural information processing
systems, 1993.
[66]
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances
in Neural Information Processing Systems, 2020.
[67]
J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally
connected networks on graphs. arXiv:1312.6203, 2013.
[68]
A. Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov Quebec City,
QC, Canada, 2019.
[69] O. Calin. Deep Learning Architectures. Springer, 2020.
[70]
A. Canziani, A. Paszke, and E. Culurciello. An Analysis of Deep Neural Network
Models for Practical Applications. arXiv:1605.07678, 2016.
[71]
H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P. A. Heng, and S. Z. Li. A survey on
generative diffusion model. arXiv:2209.02646, 2022.
[72]
A. Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées.
Comp. Rend. Sci. Paris, 1847.
[73]
Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang,
Y. Wang, et al. A survey on evaluation of large language models. arXiv:2307.03109,
2023.
[74]
D. Charte, F. Charte, S. García, M. J. del Jesus, and F. Herrera. A practical tutorial on
autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines.
Information Fusion, 2018.
[75]
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002.
[76]
L. Chen, S. Li, Q. Bai, J. Yang, S. Jiang, and Y. Miao. Review of image classification
algorithms based on convolutional neural networks. Remote Sensing, 2021.
[77]
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2016.
[78]
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN:
Interpretable representation learning by information maximizing generative adversarial
nets. Advances in Neural Information Processing Systems, 2016.
[79]
P. S. Chib and P. Singh. Recent advancements in end-to-end autonomous driving
using deep learning: A survey. IEEE Transactions on Intelligent Vehicles, 2023.
[80]
K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of
neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
[81]
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and
Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv:1406.1078, 2014.
[82]
D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical
comparisons of optimizers for deep learning. arXiv:1910.05446, 2019.
[83]
Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. StarGAN: Unified generative
adversarial networks for multi-domain image-to-image translation. In Proceedings of
the IEEE conference on computer vision and pattern recognition, 2018.
[84]
S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively,
with application to face verification. In 2005 IEEE computer society conference on
computer vision and pattern recognition (CVPR’05), 2005.
[85]
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv:1412.3555, 2014.
[86] D. C. Cireșan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 2010.
[87]
G. Claeskens and N. L. Hjort. Model selection and model averaging. Cambridge Books,
2008.
[88]
W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association, 1979.
[89]
A. K. Cline and I. S. Dhillon. Computation of the singular value decomposition. In
Handbook of linear algebra. 2006.
[90]
N. Cohen, O. Sharir, and A. Shashua. On the Expressive Power of Deep Learning: A
Tensor Analysis. In Conference on learning theory, 2016.
[91]
D. Commenges and H. Jacqmin-Gadda. Dynamical biostatistical models. CRC Press,
2015.
[92] P. Congdon. Bayesian Statistical Modelling. John Wiley & Sons, 2007.
[93]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms.
MIT press, 2022.
[94] D. R. Cox and D. V. Hinkley. Theoretical statistics. CRC Press, 1979.
[95]
J. S. Cramer. The origins of logistic regression. Tinbergen Institute Working Paper
No. 2002-119/4, 2002.
[96]
F. A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A
survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[97]
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals, and Systems, 1989.
[98]
Y. H. Dai. Convergence properties of the BFGS algorithm. SIAM Journal on
Optimization, 2002.
[99]
W. C. Davidon. Variable metric method for minimization. Technical report, Argonne
National Lab., Lemont, Ill., 1959.
[100]
W. C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1991.
[101]
M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine Learning.
Cambridge University Press, 2020.
[102]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale
Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and
Pattern Recognition, 2009.
[103]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
[104]
P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. Advances
in Neural Information Processing Systems, 2021.
[105]
D. Q. Lee. Numerically efficient methods for solving least squares problems. Pennsylvania State University, 2012.
[106]
A. J. Dobson and A. G. Barnett. An introduction to generalized linear models. Chapman
and Hall/CRC, 2018.
[107]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An Image Is Worth 16x16
Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020.
[108]
J. C. Douma and J. T. Weedon. Analysing continuous proportions in ecology and
evolution: A practical introduction to beta and Dirichlet regression. Methods in Ecology
and Evolution, 2019.
[109]
T. Dozat. Incorporating Nesterov momentum into Adam. International Conference
on Learning Representations (ICLR) Workshop, 2016.
[110]
N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W.
Yu, O. Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts.
In International Conference on Machine Learning, 2022.
[111]
S. R. Dubey, S. K. Singh, and B. B. Chaudhuri. Activation functions in deep learning:
A comprehensive survey and benchmark. Neurocomputing, 2022.
[112]
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 2011.
[113]
V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style.
arXiv:1610.07629, 2016.
[114]
C. Eckart and G. Young. The approximation of one matrix by another of lower rank.
Psychometrika, 1936.
[115]
B. Efron and T. Hastie. Computer age statistical inference. Cambridge University
Press, 2016.
[116]
R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In
Conference on learning theory, 2016.
[117] J. L. Elman. Finding structure in time. Cognitive science, 1990.
[118]
F. Emmert-Streib, S. Moutari, and M. Dehmer. A comprehensive survey of error
measures for evaluating binary decision making in data science. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery, 2019.
[119]
D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features
of a deep network. University of Montreal, 2009.
[120]
A. E. Ezugwu, A. M. Ikotun, O. O. Oyelade, L. Abualigah, J. O. Agushaka, C. I. Eke,
and A. A. Akinyelu. A comprehensive survey of clustering algorithms: State-of-the-art
machine learning applications, taxonomy, challenges, and future research prospects.
Engineering Applications of Artificial Intelligence, 2022.
[121]
J. J. Faraway. Extending the Linear Model with R: Generalized Linear, Mixed Effects
and Nonparametric Regression Models. Chapman and Hall/CRC, 2016.
[122]
C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional Two-Stream Network
Fusion for Video Action Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016.
[123] R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
[124]
R. Fletcher and M. J. D. Powell. A rapidly convergent descent method for minimization.
The Computer Journal, 1963.
[125]
V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau. An
introduction to deep reinforcement learning. Foundations and Trends in Machine
Learning, 2018.
[126]
K. Fukushima. Visual feature extraction by a multilayered network of analog threshold
elements. IEEE Transactions on Systems Science and Cybernetics, 1969.
[127]
K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological cybernetics, 1980.
[128]
Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model
uncertainty in deep learning. In International conference on machine learning, 2016.
[129]
A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-
Gonzalez, and J. Garcia-Rodriguez. A Survey on Deep Learning Techniques for Image
and Video Semantic Segmentation. Applied Soft Computing, 2018.
[130]
S. Garg and G. Ramakrishnan. Advances in Quantum Deep Learning: An Overview.
arXiv:2005.04316, 2020.
[131]
L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style.
arXiv:1508.06576, 2015.
[132]
C. F. Gauss. Theoria Motus Corporum Coelestium. Perthes, Hamburg, 1809. Translation reprinted as Theory of the Motions of the Heavenly Bodies Moving about the Sun in Conic Sections, Dover, New York, 1963.
[133]
A. Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, 2019.
[134]
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message
passing for quantum chemistry. In International conference on machine learning, 2017.
[135]
R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on
Computer Vision, 2015.
[136]
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, 2014.
[137]
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the thirteenth international conference on artificial
intelligence and statistics, 2010.
[138]
Y. Goldberg. Neural Network Methods for Natural Language Processing. Springer
Nature, 2022.
[139]
G. H. Golub. Least squares, singular values and matrix approximations. Aplikace
matematiky, 1968.
[140]
G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions.
In Linear algebra. 1971.
[141] G. H. Golub and C. F. Van Loan. Matrix computations. JHU press, 2013.
[142] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[143]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.
[144]
M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains.
In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005.
[145]
A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850,
2013.
[146]
A. Graves, A. R. Mohamed, and G. Hinton. Speech recognition with deep recurrent
neural networks. In 2013 IEEE international conference on acoustics, speech and
signal processing, 2013.
[147]
A. Griewank and A. Walther. Evaluating derivatives: Principles and techniques of
algorithmic differentiation. SIAM, 2008.
[148]
A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Pro-
ceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining, 2016.
[149]
R. Gueorguieva, R. Rosenheck, and D. Zelterman. Dirichlet component regression and
its applications to psychiatric data. Computational Statistics & Data Analysis, 2008.
[150]
J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye. A review on generative adversarial networks:
Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data
Engineering, 2021.
[151]
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved
training of Wasserstein GANs. Advances in Neural Information Processing Systems,
2017.
[152]
I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville.
PixelVAE: A Latent Variable Model for Natural Images. In International Conference
on Learning Representations, 2016.
[153]
H. Guo, Y. Li, J. Shang, M. Gu, Y. Huang, and B. Gong. Learning from class-
imbalanced data: Review of methods and applications. Expert systems with applications,
2017.
[154]
I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural
network character recognizer for a touch terminal. Pattern Recognition, 1991.
[155]
M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu,
S. Mirjalili, et al. Large language models: A comprehensive survey of its applications,
challenges, limitations, and future prospects. Authorea Preprints, 2023.
[156]
W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large
graphs. Advances in neural information processing systems, 2017.
[157]
W. L. Hamilton. Graph Representation Learning. Morgan & Claypool Publishers,
2020.
[158]
D. J. Hand. Assessing the performance of classification methods. International
Statistical Review, 2012.
[159]
K. Hara, D. Saitoh, and H. Shouno. Analysis of dropout learning regarded as ensemble
learning. In Artificial Neural Networks and Machine Learning–ICANN 2016: 25th
International Conference on Artificial Neural Networks, Barcelona, Spain, September
6-9, 2016, Proceedings, Part II 25, 2016.
[160] M. A. Hardy. Regression with dummy variables. Sage, 1993.
[161]
D. Harrison Jr and D. L. Rubinfeld. Hedonic Housing Prices and the Demand for
Clean Air. Journal of Environmental Economics and Management, 1978.
[162]
G. M. Harshvardhan, M. K. Gourisaria, M. Pandey, and S. S. Rautaray. A comprehen-
sive survey and analysis of generative models in machine learning. Computer Science
Review, 2020.
[163]
W. Harvey, S. Naderiparizi, V. Masrani, C. Weilbach, and F. Wood. Flexible diffusion
modeling of long videos. Advances in Neural Information Processing Systems, 2022.
[164] D. Hassabis. Artificial Intelligence: Chess Match of the Century. Nature, 2017.
[165]
D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-Inspired
Artificial Intelligence. Neuron, 2017.
[166]
T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of
statistical learning: Data mining, inference, and prediction. Springer, 2009.
[167]
T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity.
Monographs on statistics and applied probability, 2015.
[168] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Routledge, 2017.
[169] S. Haykin. Neural networks: a comprehensive foundation. Prentice Hall, 1998.
[170]
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the
IEEE International Conference on Computer Vision, 2017.
[171]
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-
level performance on ImageNet classification. In Proceedings of the IEEE international
conference on computer vision, 2015.
[172]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
[173]
R. Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks
for Perception. 1992.
[174]
M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured
data. arXiv:1506.05163, 2015.
[175]
S. Herculano-Houzel. The human brain in numbers: a linearly scaled-up primate brain.
Frontiers in human neuroscience, 2009.
[176]
A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or.
Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626, 2022.
[177]
M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 1952.
[178]
R. H. Hijazi and R. W. Jernigan. Modelling compositional data using Dirichlet
regression models. Journal of Applied Probability & Statistics, 2009.
[179] J. M. Hilbe. Logistic regression models. Chapman and Hall/CRC, 2009.
[180]
G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief
nets. Neural Computation, 2006.
[181]
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors.
arXiv:1207.0580, 2012.
[182]
J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole,
M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with
diffusion models. arXiv:2210.02303, 2022.
[183]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems, 2020.
[184]
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation,
1997.
[185]
J. J. Hopfield. Neural networks and physical systems with emergent collective compu-
tational abilities. Proceedings of the National Academy of Sciences, 1982.
[186]
K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural
Networks, 1991.
[187]
D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant. Applied logistic regression.
John Wiley & Sons, 2013.
[188]
M. Z. Hossain, F. Sohel, M. Shiratuddin, and H. Laga. A comprehensive survey of
deep learning for image captioning. ACM Computing Surveys (CSUR), 2019.
[189]
H. Hotelling. Analysis of a complex of statistical variables into principal components.
Journal of educational psychology, 1933.
[190]
J. Howard and S. Gugger. Deep Learning for Coders with fastai and PyTorch. O’Reilly
Media, 2020.
[191]
C. C. Hsu, H. T. Hwang, Y. C. Wu, Y. Tsao, and H. M. Wang. Voice conversion from
unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. In Interspeech, 2017.
[192]
G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales. When
Face Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural
Networks for Face Recognition. In Proceedings of the IEEE International Conference
on Computer Vision Workshops, 2015.
[193]
Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled
generation of text. In International Conference on Machine Learning, 2017.
[194]
L. Huang, J. Qin, Y. Zhou, F. Zhu, L. Liu, and L. Shao. Normalization techniques in
training DNNs: Methodology, analysis and application. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2023.
[195]
D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate
cortex. The Journal of physiology, 1959.
[196]
D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. The Journal of Physiology, 1962.
[197]
P. J. Huber. Robust regression: asymptotics, conjectures and Monte Carlo. The Annals
of Statistics, 1973.
[198]
R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts,
3rd edition, 2021.
[199]
F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer.
SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv:1602.07360, 2016.
[200]
G. Iglesias, E. Talavera, and A. Díaz-Álvarez. A survey on GANs for computer vision:
Recent research, analysis and taxonomy. Computer Science Review, 2023.
[201]
M. Innes. Flux: Elegant Machine Learning with Julia. Journal of Open Source Software,
2018.
[202]
S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-
normalized models. Advances in Neural Information Processing Systems, 2017.
[203]
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning,
2015.
[204]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional
adversarial networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017.
[205]
A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on
Systems, Man, and Cybernetics, 1971.
[206]
A. G. Ivakhnenko and V. G. Lapa. Cybernetic predicting devices. Purdue Univ
Lafayette Ind School of Electrical Engineering, appearing in The Defense Technical
Information Center, 1966.
[207] B. Jähne. Digital image processing. Springer Science & Business Media, 2005.
[208]
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM computing
surveys (CSUR), 1999.
[209]
G. Jin, Y. Liang, Y. Fang, Z. Shao, J. Huang, J. Zhang, and Y. Zheng. Spatio-temporal
graph neural networks for predictive learning in urban computing: A survey. IEEE
Transactions on Knowledge and Data Engineering, 2023.
[210]
W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for
molecular graph generation. In International Conference on Machine Learning, 2018.
[211]
J. M. Johnson and T. M. Khoshgoftaar. Survey on deep learning with class imbalance.
Journal of Big Data, 2019.
[212] I. T. Jolliffe. Principal component analysis for special types of data. Springer, 2002.
[213]
L. V. Jospin, W. Buntine, F. Boussaid, H. Laga, and M. Bennamoun. Hands-on
Bayesian Neural Networks–A Tutorial for Deep Learning Users. arXiv:2007.06823,
2020.
[214]
R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent
network architectures. In International conference on machine learning, 2015.
[215]
Y. Jung. Multiple predicting K-fold cross-validation for model selection. Journal of
Nonparametric Statistics, 2018.
[216] D. Jurafsky and J. H. Martin. Speech and Language Processing. Pearson, 2000.
[217]
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4, 1996.
[218]
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In
Proceedings of the 2013 conference on empirical methods in natural language processing,
2013.
[219]
M. Kang, J. Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling
up GANs for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2023.
[220]
Z. Karevan and J. A. K. Suykens. Transductive LSTM for time-series prediction: An
application to weather forecasting. Neural Networks, 2020.
[221]
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-
Scale Video Classification with Convolutional Neural Networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[222]
T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved
quality, stability, and variation. arXiv:1710.10196, 2017.
[223]
T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training
generative adversarial networks with limited data. Advances in Neural Information
Processing Systems, 2020.
[224] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 2019.
[225]
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and
improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2020.
[226]
L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster
analysis. John Wiley & Sons, 2009.
[227]
S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah. Transformers
in vision: A survey. ACM Computing Surveys (CSUR), 2022.
[228]
D. Khurana, A. Koli, K. Khatter, and S. Singh. Natural language processing: State of
the art, current trends and challenges. Multimedia Tools and Applications, 2023.
[229]
J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American
mathematical society, 1953.
[230]
J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society:
Series B (Methodological), 1959.
[231]
J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression
function. The Annals of Mathematical Statistics, 1952.
[232]
J. H. Kim. Estimating classification error rate: Repeated cross-validation, repeated
hold-out and bootstrap. Computational statistics & data analysis, 2009.
[233]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980,
2014.
[234]
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114,
2013.
[235]
D. P. Kingma and M. Welling. An introduction to variational autoencoders. Founda-
tions and Trends® in Machine Learning, 2019.
[236]
T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv:1611.07308, 2016.
[237]
S. Kiranyaz, T. Ince, and M. Gabbouj. Optimization techniques: An overview. Multidi-
mensional Particle Swarm Optimization for Machine Learning and Pattern Recognition,
2014.
[238]
M. J. Kochenderfer and T. A. Wheeler. Algorithms for optimization. MIT Press, 2019.
[239]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep
convolutional neural networks. Advances in neural information processing systems,
2012.
[240]
D. P. Kroese, Z. Botev, T. Taimre, and R. Vaisman. Data science and machine
learning: Mathematical and statistical methods. CRC Press, 2019.
[241]
A. Krogh and J. Hertz. A Simple Weight Decay Can Improve Generalization. In
Advances in Neural Information Processing Systems, 1991.
[242]
J. Kuehn, S. Abadie, B. Liquet, and V. Roeber. A deep learning super-resolution
model to speed up computations of coastal sea states. Applied Ocean Research, 2023.
[243]
J. Kukačka, V. Golkov, and D. Cremers. Regularization for deep learning: A taxonomy.
arXiv:1710.10686, 2017.
[244] H. Kwakernaak and R. Sivan. Modern signals and systems. Prentice Hall, 1991.
[245]
P. Lafaye de Micheaux, R. Drouilhet, and B. Liquet. The R software: Fundamentals
of programming and statistical analysis. Springer, 2013.
[246]
H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin, and S. Yan. Deep recurrent
regression for facial landmark detection. IEEE Transactions on Circuits and Systems
for Video Technology, 2016.
[247]
K. J. Lang. A time-delay neural network architecture for speech recognition. Technical
Report, Carnegie-Mellon University, 1988.
[248]
H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for
training deep neural networks. Journal of Machine Learning Research, 2009.
[249] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[250]
Y. LeCun, B. Boser, J. Denker, D. Henderson, W. Hubbard, and L. Jackel. Handwritten
digit recognition with a back-propagation network. Advances in neural information
processing systems, 1989.
[251]
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1989.
[252] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 1998.
[253] C. Lemaréchal. Cauchy and the gradient method. Doc Math Extra, 2012.
[254]
K. Levenberg. A method for the solution of certain non-linear problems in least squares.
Quarterly of applied mathematics, 1944.
[255]
H. Levitt. Transformed Up-Down Methods in Psychoacoustics. The Journal of the
Acoustical Society of America, 1971.
[256]
Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen. Medical Image Classifi-
cation with Convolutional Neural Network. In 2014 13th International Conference on
Control Automation Robotics & Vision (ICARCV), 2014.
[257]
Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network:
Data-driven traffic forecasting. arXiv:1707.01926, 2017.
[258]
Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou. A survey of convolutional neural networks:
Analysis, applications, and prospects. IEEE Transactions on Neural Networks and
Learning Systems, 2021.
[259]
H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so
well? Journal of Statistical Physics, 2017.
[260] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[261]
T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi. Don’t Use Large Mini-batches, Use
Local SGD. In International Conference on Learning Representations, 2020.
[262] T. Lin, Y. Wang, X. Liu, and X. Qiu. A survey of transformers. AI Open, 2022.
[263]
A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön. Machine Learning - A
First Course for Engineers and Scientists. Cambridge University Press, 2022.
[264]
H. Ling, K. Kreis, D. Li, S. W. Kim, A. Torralba, and S. Fidler. EditGAN: High-precision
semantic image editing. Advances in Neural Information Processing Systems, 2021.
[265] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm
as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ.
Helsinki, 1970.
[266]
D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale
optimization. Mathematical Programming, 1989.
[267]
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path Aggregation Network for Instance
Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018.
[268]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.
arXiv:1907.11692, 2019.
[269]
Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, and
Z. He. A survey of visual transformers. IEEE Transactions on Neural Networks and
Learning Systems, 2023.
[270]
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv:1711.05101,
2017.
[271]
M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created
equal? A large-scale study. Advances in Neural Information Processing Systems, 2018.
[272]
D. G. Luenberger and Y. Ye. Linear and nonlinear programming. Springer, Cham,
5th edition, 2021.
[273] Y. Ma and J. Tang. Deep Learning on Graphs. Cambridge University Press, 2021.
[274]
A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning
word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies, 2011.
[275]
A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning
Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies, 2011.
[276]
A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural
network acoustic models. In International Conference on Machine Learning, 2013.
[277]
J. MacQueen. Some methods for classification and analysis of multivariate observations.
In Proceedings of the 5th Berkeley symposium on mathematical statistics and probability,
1967.
[278]
A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting
them. In Proceedings of the IEEE conference on computer vision and pattern recognition,
2015.
[279]
K. M. Malan. A survey of advances in landscape analysis for optimization. Algorithms,
2021.
[280] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing.
MIT press, 1999.
[281]
E. R. Mansfield and B. P. Helms. Detecting multicollinearity. The American Statistician, 1982.
[282]
N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev. Reinforcement learning for
combinatorial optimization: A survey. Computers & Operations Research, 2021.
[283]
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Biophysics, 1943.
[284]
S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex
losses. The Annals of Statistics, 2018.
[285]
G. Menghani. Efficient deep learning: A survey on making deep learning models
smaller, faster, and better. ACM Computing Surveys, 2023.
[286]
L. Mero, D. Yi, M. Dianati, and A. Mouzakitis. A survey on imitation learning
techniques for end-to-end autonomous vehicles. IEEE Transactions on Intelligent
Transportation Systems, 2022.
[287]
A. Micheli. Neural network for graphs: A contextual constructive approach. IEEE
Transactions on Neural Networks, 2009.
[288]
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word represen-
tations in vector space. arXiv:1301.3781, 2013.
[289] G. A. Miller. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[290]
S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and D. Terzopoulos.
Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2021.
[291]
M. Minsky and S. A. Papert. Perceptrons: An Introduction to Computational Geometry.
MIT Press, 1969.
[292]
L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The quarterly
journal of mathematics, 1960.
[293]
M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784,
2014.
[294]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
[295]
A. Moghar and M. Hamiche. Stock market prediction using LSTM recurrent neural
network. Procedia Computer Science, 2020.
[296] D. C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, 2017.
[297]
E. H. Moore. On the reciprocal of the general algebraic matrix. Bull. Am. Math. Soc.,
1920.
[298] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[299] K. P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
[300]
F. Murtagh and P. Contreras. Algorithms for hierarchical clustering: an overview.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2012.
[301]
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines.
In International conference on machine learning, 2010.
[302]
S. C. Narula and J. F. Wellington. The minimum sum of absolute errors regression:
A state of the art survey. International Statistical Review/Revue Internationale de
Statistique, 1982.
[303] Y. Nazarathy and H. Klok. Statistics with Julia. Springer, 2021.
[304]
J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the
Royal Statistical Society: Series A, 1972.
[305]
Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 1983.
[306]
A. Ng. Machine Learning Yearning. https://info.deeplearning.ai/machine-learning-yearning-book, 2017.
[307]
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and
M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided
diffusion models. arXiv:2112.10741, 2021.
[308]
M. A. Nielsen. Neural networks and deep learning. Determination press San Francisco,
CA, 2015.
[309]
M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for
graphs. In International conference on machine learning, 2016.
[310]
J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of
Computation, 1980.
[311] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 1999.
[312]
H. Noh, S. Hong, and B. Han. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision,
2015.
[313]
J. F. Nolan. Analytical differentiation on a digital computer. PhD thesis, Massachusetts
Institute of Technology, 1953.
[314]
A. B. Novikoff. On convergence proofs for perceptrons. Technical report, Stanford
Research Inst. Menlo Park CA, 1963.
[315]
N. A. Obuchowski and J. A. Bullen. Receiver Operating Characteristic (ROC) Curves:
Review of Methods with Applications in Diagnostic Medicine. Physics in Medicine &
Biology, 2018.
[316]
A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier
GANs. In International Conference on Machine Learning, 2017.
[317]
M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving
graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference
on Knowledge discovery and data mining, 2016.
[318]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang,
S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems, 2022.
[319]
A. S. Pandya and R. B. Macy. Pattern Recognition with Neural Networks in C++.
CRC Press, 1995.
[320]
J. Papa. PyTorch Pocket Reference: Building and Deploying Deep Learning Models.
O’Reilly Media, 2021.
[321]
J. M. Papakonstantinou and R. A. Tapia. Origin and evolution of the secant method
in one dimension. The American Mathematical Monthly, 2013.
[322]
G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan.
Normalizing flows for probabilistic modeling and inference. The Journal of Machine
Learning Research, 2021.
[323]
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural
networks. In International conference on machine learning, 2013.
[324]
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 31st Conference on
Neural Information Processing Systems (NIPS2017), 2017.
[325]
J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect.
Basic Books, 2018.
[326]
K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin philosophical magazine and journal of science, 1901.
[327]
J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), 2014.
[328]
R. Penrose. A generalized inverse for matrices. In Mathematical proceedings of the
Cambridge Philosophical Society, 1955.
[329]
B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, 2014.
[330]
K. B. Petersen and M. S. Pedersen. The matrix cookbook. Technical University of
Denmark, 2012.
[331]
M. Phuong and M. Hutter. Formal algorithms for transformers. arXiv:2207.09238,
2022.
[332]
E. Plaut. From principal subspaces to principal components with linear autoencoders.
arXiv:1804.10253, 2018.
[333]
T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings
of the National Academy of Sciences, 2020.
[334]
T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and When Can
Deep but Not Shallow Networks Avoid the Curse of Dimensionality: A Review. 2017.
[335]
B. T. Polyak. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 1964.
[336] S. J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
[337]
M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, 2014.
[338]
C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner,
A. Ramadhan, and A. Edelman. Universal differential equations for scientific machine
learning. arXiv:2001.04385, 2020.
[339]
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[340]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 2019.
[341]
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides,
S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis &
insights from training gopher. arXiv:2112.11446, 2021.
[342] L. Ramalho. Fluent Python. O’Reilly Media, Incorporated, 2021.
[343]
R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International
conference on machine learning. PMLR, 2016.
[344]
W. Rawat and Z. Wang. Deep convolutional neural networks for image classification:
A comprehensive review. Neural Computation, 2017.
[345]
S. J. Reddi, S. Kale, and S. Kumar. On the Convergence of Adam and Beyond.
arXiv:1904.09237, 2019.
[346]
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016.
[347]
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In
International Conference on Machine Learning. PMLR, 2015.
[348]
S. Rezvani and X. Wang. A broad review on class imbalance learning techniques.
Applied Soft Computing, 2023.
[349]
H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the
American Mathematical Society, 1952.
[350]
H. Robbins and S. Monro. A stochastic approximation method. The annals of
mathematical statistics, 1951.
[351]
J. Rodriguez-Perez, C. Leigh, B. Liquet, C. Kermorvant, E. Peterson, D. Sous, and
K. Mengersen. Detecting technical anomalies in high-frequency water-quality data
using artificial neural networks. Environmental Science & Technology, 2020.
[352]
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical
image segmentation. In Medical Image Computing and Computer-Assisted Intervention–
MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015,
Proceedings, Part III 18, 2015.
[353]
F. Rosenblatt. The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 1958.
[354]
H. H. Rosenbrock. An automatic method for finding the greatest or least value of a
function. The computer journal, 1960.
[355] S. M. Ross. A first course in probability. Pearson, 2014.
[356]
S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747,
2016.
[357]
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by
back-propagating errors. Nature, 1986.
[358]
C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi.
Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference
Proceedings, 2022.
[359]
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour,
R. G. Lopes, B. K. Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion
models with deep language understanding. Advances in Neural Information Processing
Systems, 2022.
[360]
C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image
super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2022.
[361]
R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Artificial Intelligence
and Statistics, 2009.
[362]
S. Sarao Mannelli and P. Urbani. Analytical study of momentum-based acceleration
methods in paradigmatic high-dimensional non-convex problems. Advances in Neural
Information Processing Systems, 2021.
[363]
M. Sarigül and M. Avci. Performance comparison of different momentum techniques
on deep reinforcement learning. Journal of Information and Telecommunication, 2018.
[364] N. Savage. How AI and Neuroscience Drive Each Other Forwards. Nature, 2019.
[365]
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph
neural network model. IEEE transactions on neural networks, 2008.
[366]
R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press,
2012.
[367]
J. Schmidhuber. Annotated history of modern AI and deep learning. arXiv:2212.11279,
2022.
[368]
F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face
recognition and clustering. In Proceedings of the IEEE conference on computer vision
and pattern recognition, 2015.
[369]
M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE
transactions on Signal Processing, 1997.
[370] T. J. Sejnowski. The Deep Learning Revolution. MIT Press, 2018.
[371]
Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling
with graph convolutional recurrent networks. In Neural Information Processing: 25th
International Conference, ICONIP, 2018, Proceedings, Part I 25, 2018.
[372]
P. Sermanet and Y. LeCun. Traffic Sign Recognition with Multi-Scale Convolutional
Networks. In The 2011 International Joint Conference on Neural Networks, 2011.
[373]
V. Sharma, M. Gupta, A. Kumar, and D. Mishra. Video Processing Using Deep
Learning Techniques: A Systematic Literature Review. IEEE Access, 2021.
[374]
A. Shashua and A. Levin. Ranking with large margin principle: Two approaches.
Advances in neural information processing systems, 2002.
[375]
Z. Shen, W. Bao, and D. S. Huang. Recurrent neural network for predicting transcription factor binding sites. Scientific Reports, 2018.
[376]
R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via
information. arXiv:1703.00810, 2017.
[377]
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the
game of Go with deep neural networks and tree search. Nature, 2016.
[378] J. S. Simonoff. Smoothing Methods in Statistics. Springer Science & Business Media,
2012.
[379]
K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action
Recognition in Videos. Advances in Neural Information Processing Systems, 2014.
[380]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv:1409.1556, 2014.
[381]
S. A. Sisson, Y. Fan, and M. A. Beaumont, editors. Handbook of Approximate Bayesian
Computation. CRC Press, Boca Raton, FL, 2019.
[382]
S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu,
S. Prabhumoye, G. Zerveas, V. Korthikanti, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model.
arXiv:2201.11990, 2022.
[383]
I. Sobel. History and definition of the Sobel operator. Retrieved from the World Wide
Web, 2014.
[384]
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on
Machine Learning, 2015.
[385]
C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder
variational autoencoders. Advances in neural information processing systems, 2016.
[386]
B. Speelpenning. Compiling fast partial derivatives of functions given by algorithms.
PhD thesis, University of Illinois at Urbana-Champaign, 1980.
[387]
A. Sperduti and A. Starita. Supervised neural networks for the classification of
structures. IEEE Transactions on Neural Networks, 1997.
[388]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout:
a simple way to prevent neural networks from overfitting. The Journal of Machine
Learning Research, 2014.
[389]
S. M. Stigler. Gauss and the invention of least squares. The Annals of Statistics, 1981.
[390]
P. Stoica and Y. Selen. Model-order selection: a review of information criterion rules.
IEEE Signal Processing Magazine, 2004.
[391]
G. Strang. Linear algebra and learning from data. Wellesley-Cambridge Press, 2019.
[392]
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, 2014.
[393]
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press,
2018.
[394]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, 2015.
[395]
M. Tan and Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks. In International Conference on Machine Learning, 2019.
[396]
M. Tan and Q. V. Le. EfficientNetV2: Smaller Models and Faster Training. Interna-
tional Conference on Machine Learning, PMLR, 2021.
[397]
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information
network embedding. In Proceedings of the 24th international conference on world wide
web, 2015.
[398]
A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida. Deep
Learning in Spiking Neural Networks. Neural Networks, 2019.
[399]
M. Telgarsky. Benefits of depth in neural networks. In Conference on learning theory,
2016.
[400]
R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin,
T. Bos, L. Baker, Y. Du, et al. LaMDA: Language models for dialog applications.
arXiv:2201.08239, 2022.
[401]
A. N. Tikhonov. On the stability of inverse problems. Comptes Rendus de l’Académie
des Sciences de l’URSS, 1943.
[402]
S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional
neural-network approach. IEEE Transactions on Neural Networks, 1997.
[403]
L. Tunstall, L. V. Werra, and T. Wolf. Natural Language Processing with Transformers.
O’Reilly Media, Inc., 2022.
[404]
A. M. Turing. Computing machinery and intelligence. Mind, 1950.
[405]
I. Ulku and E. Akagündüz. A Survey on Deep Learning-Based Architectures for
Semantic Segmentation on 2D Images. Applied Artificial Intelligence, 2022.
[406]
D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing
ingredient for fast stylization. arXiv:1607.08022, 2016.
[407]
R. van de Schoot, S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse,
M. Vannucci, A. Gelman, D. Veen, J. Willemsen, and C. Yau. Bayesian Statistics and
Modelling. Nature Reviews Methods Primers, 2021.
[408] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Butterworths, 1979.
[409] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[410]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin. Attention is all you need. In Advances in neural information
processing systems, 2017.
[411]
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph
Attention Networks. International Conference on Learning Representations (ICLR),
2018.
[412]
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition
using time-delay neural networks. In Backpropagation. 2013.
[413]
L. Waikhom and R. Patgiri. A survey of graph neural networks in various learning
paradigms: methods, applications, and challenges. Artificial Intelligence Review, 2023.
[414]
J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting
from static images using variational autoencoders. In Computer Vision–ECCV 2016:
14th European Conference, Proceedings, Part VII, 2016.
[415]
C. Y. Wang, A. Bochkovskiy, and H. Y. M. Liao. YOLOv7: Trainable Bag-of-Freebies
Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[416]
L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal
Segment Networks: Towards Good Practices for Deep Action Recognition. In European
Conference on Computer Vision, 2016.
[417]
T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image
synthesis and semantic manipulation with conditional GANs. In Proceedings of the
IEEE conference on computer vision and pattern recognition, 2018.
[418]
Z. Wang, L. Zhao, and W. Xing. Stylediffusion: Controllable disentangled style transfer
via diffusion models. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2023.
[419] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 1992.
[420] R. E. Wengert. A simple automatic derivative evaluation program. Communications
of the ACM, 1964.
[421]
P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System
Modeling and Optimization: Proceedings of the 10th IFIP Conference, 1981, 2005.
[422] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 1969.
[423]
P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM
Review, 1971.
[424]
T. Wong and P. Yeh. Reliable accuracy estimates from k-fold cross-validation. IEEE
Transactions on Knowledge and Data Engineering, 2019.
[425]
H. Wu, Z. Xu, J. Zhang, W. Yan, and X. Ma. Face Recognition Based on Convolution
Siamese Networks. In 2017 10th International Congress on Image and Signal Processing,
BioMedical Engineering and Informatics (CISP-BMEI), 2017.
[426]
Y. Wu and K. He. Group normalization. In Proceedings of the European conference
on computer vision (ECCV), 2018.
[427]
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on
graph neural networks. IEEE Transactions on Neural Networks and Learning Systems,
2020.
[428]
P. Xu, X. Zhu, and D. A. Clifton. Multimodal learning with transformers: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[429]
A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang,
D. Yan, et al. Baichuan 2: Open large-scale language models. arXiv:2309.10305, 2023.
[430]
L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and
M. Yang. Diffusion models: A comprehensive survey of methods and applications.
ACM Computing Surveys, 2023.
[431]
T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum
methods for convex and non-convex optimization. arXiv:1604.03257, 2016.
[432]
Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model
identification and regression estimation. Biometrika, 2005.
[433]
Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet:
Generalized autoregressive pretraining for language understanding. Advances in neural
information processing systems, 2019.
[434]
G. Yao, T. Lei, and J. Zhong. A Review of Convolutional-Neural-Network-Based
Action Recognition. Pattern Recognition Letters, 2019.
[435]
F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions.
arXiv:1511.07122, 2015.
[436]
F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Proceedings of
the IEEE conference on computer vision and pattern recognition, 2017.
[437]
L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets
with policy gradient. In Proceedings of the AAAI conference on artificial intelligence,
2017.
[438]
Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells
and network architectures. Neural Computation, 2019.
[439]
Y. Yuan. Recent advances in trust region algorithms. Mathematical Programming,
2015.
[440] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv:1212.5701, 2012.
[441]
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European conference on computer vision, 2014.
[442]
M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for
mid and high level feature learning. In 2011 International Conference on Computer
Vision, 2011.
[443]
X. Zeng and T. R. Martinez. Distribution-balanced stratified cross-validation for
accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence,
2000.
[444]
C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[445]
M. Zhang and Y. Chen. Link prediction based on graph neural networks. Advances in
neural information processing systems, 2018.
[446]
N. Zhang, S. Shen, A. Zhou, and Y. Jin. Application of LSTM approach for modelling
stress–strain behaviour of soil. Applied Soft Computing, 2021.
[447]
Q. Zhang and S. Zhu. Visual interpretability for deep learning: a survey. Frontiers of
Information Technology & Electronic Engineering, 2018.
[448]
W. Zhao, J. P. Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement
learning for robotics: a survey. In 2020 IEEE symposium series on computational
intelligence (SSCI), 2020.
[449]
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang,
Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu,
P. Liu, J. Y. Nie, and J. R. Wen. A survey of large language models. arXiv:2303.18223,
2023.
[450]
Y. Zhao, X. Li, W. Zhang, S. Zhao, M. Makkie, M. Zhang, Q. Li, and T. Liu. Modeling
4D fMRI Data via Spatio-Temporal Convolutional Neural Networks (ST-CNN). In
Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st
International Conference, Proceedings, Part III, 2018.
[451]
J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph
neural networks: A review of methods and applications. AI open, 2020.
[452]
Q. Zhou, W. Chen, S. Song, J. Gardner, K. Weinberger, and Y. Chen. A reduction of
the elastic net to support vector machines with an application to GPU computing. In
Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
[453]
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 2005.
[454]
Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye. Object Detection in 20 Years: A Survey.
Proceedings of the IEEE, 2023.
Index
absolute error loss, 43, 73
absolute improvement, 118
AC-GAN, 352
accuracy, 36
actions, 12, 326
activation function, 83
active learning, 29
Adadelta, 142, 143
Adagrad, 130
ADAM algorithm, 127
Adamax, 142, 145
adaptive instance normalization, 325
adaptive learning rates per coordinate, 127
adaptive moment estimation, 127
adaptive subgradient, 130
adjacency lists, 339
adjacency matrix, 339
adjacency weight matrix, 339
adjoints, 140
adversarial, 314, 315
adversarial autoencoders, 110
affine discrete time linear dynamical system, 50
agent, 12, 29, 326, 329
aggregate, 344
Akaike information criterion, 30, 72
AlexNet, 10, 15, 25, 201, 234, 245
algorithms, 27, 28
alignment function, 272
alignment model, 272
alignment scores, 272
AlphaGo, 26
anchor, 242
annotated, 17
approximate Bayesian computation, 351
arctan, 180
area under the curve (AUC), 37
Armijo condition, 149
Armijo line search, 162
artificial general intelligence, 13
artificial intelligence, 2
artificial neural network, 1
artificial neuron, 83
attention mechanism, 7, 11, 269, 272
attention weights, 272
auto-regressive, 250
auto-regressive stochastic sequence, 309
autoencoders, 7, 9, 99
automatic control, 326
automatic differentiation, 133, 162
auxiliary classifier generative adversarial network, 324
average-pooling, 226
backpropagation, 245
backpropagation algorithm, 9, 182, 201
backpropagation through time, 257, 258
backtracking, 149
backtracking line search, 149, 162
backward mode, 136
backward mode automatic differentiation,
201
backward pass, 140
bag of words, 252
bagging, 72
balanced, 19
basic gradient descent with exact line search, 147
batch normalization, 9, 192, 202, 231
batch normalization inception, 245
Bayesian information criterion, 30, 72
Bayesian neural networks, 26
Bayesian statistics, 351
beam search, 290
Bellman equations, 331
BERT, 293
bias, 39, 76
bias and variance tradeoff, 53, 57
bias correction, 131
bias vector, 167
bidirectional recurrent neural network, 263, 292
big data (analytics), 2
binary cross entropy, 79
binomial distribution, 76
Boltzmann machine, 351
boosting, 72
bottleneck, 99
bottleneck layers, 236
bounding box, 241
Broyden-Fletcher-Goldfarb-Shanno, 159
C-GAN, 352
C/C++, 25
categorical, 17
categorical cross entropy, 91
categorical distribution, 87
Cauchy-Schwarz, 355
causal modeling, 26
cell state, 253, 265
centered data matrix, 66
centers (K-means), 63
central processing units, 15
centroids, 63
channels, 10, 221
ChatGPT, 11
Chinchilla, 293
CIFAR-10 dataset, 20
classification, 3, 28, 32
closed loop control, 329
cloud computing, 15
clustering, 28, 62
code (encoder-decoder), 99, 270
colorization, 351
column-major, 17
combine (message passing GNN), 344
commutative, 207
comparison network, 242
complete graph, 339
computational graph, 137
computer algebra systems, 134
computer scientists, 6
computer vision, 238
conditional generation paradigm, 324
conditional generative adversarial network, 324
confidence bands, 32
confidence intervals, 110
confusion matrix, 36, 47
conjugate, 149
conjugate gradient method, 148
connected, 339
consistent, 54
constrained, 112
constrained optimization, 104
constraint set, 112
context vector, 250, 270
continuous, 356
continuously differentiable, 359, 362, 363
contraction mapping, 333
contractive autoencoders, 110
control policy, 329
control theory, 7, 326
controller, 329
converge, 356
convex function, 113
convex hull, 121
convex set, 113
convexity, 113
convolution, 204, 207, 210, 211
convolution theorem, 347
convolutional kernel, 217
convolutional layers, 204, 224
convolutional neural network, 7, 10, 203, 245
convolutionalization, 230
correlated features, 66
cosine of the angle, 355
covariance, 66
covariance matrix, 298
cross attention layer, 288
cross entropy, 366
cross validation, 60
curvature condition, 158
data, 27
data augmentation, 324
data matrix, 65
data mining, 2
data reduction, 29
data science, 2
data scientists, 6
data to data paradigm, 325
Davidon-Fletcher-Powell, 163
de-mean, 66
de-noising, 107
decay parameter, 130
decision rule, 35
decision trees, 32, 39, 72
decoder, 7, 9, 11, 99, 269, 296, 300
deconvolution architecture, 234, 246
deep, 84
Deep Blue, 26
deep learning (DL), 2
deep Q-learning, 335
deep reinforcement learning, 326, 334
DeepWalk, 352
degradation problem, 236
degree, 96, 339
denoising autoencoder, 108
denoising matching, 308
denoising mechanism, 306
dense layers, 224
dense network, 165
dense neural network, 7
depth, 212, 221
depth reduction layer, 230
derivative, 359
descent direction, 145
descent direction method, 117
descent step, 117
design matrix, 40
differentiable, 359
differentiable programming, 133, 134
diffusion model, 7, 25, 296, 309, 351
dilation, 218, 220
dimension, 17
Dirac delta function, 208
directed graph, 338
directional derivative, 357
Dirichlet regression, 110
discount factor, 330
discriminative models, 38
discriminator, 11, 296, 314
distance channel, 239
dropout, 9, 195, 202, 231
duality theory, 321
dummy, 45
dying ReLU, 180
dynamic context vector, 275
dynamic equilibrium, 315
dynamic graph neural networks, 342
dynamic graphs, 352
dynamic programming principle, 332
early stopping, 124
Eckart-Young-Mirsky theorem, 70, 73, 104
edge, 338
edge level features, 340
edge set, 337
EditGAN, 352
effectiveness function, 72
EfficientNet, 234, 235, 238, 245
elastic net, 59, 73
elbow, 65
Elman network, 292
elu, 180
embedding vector, 248
encoder, 7, 9, 11, 99, 269, 296, 300
encoder-decoder architecture, 250, 269
engineered feature, 34
enhance, 325
ensemble, 197
ensemble learning, 197
ensemble method, 197
environment, 12, 326
epoch, 123
epsilon greedy, 333
error, 32, 40
estimated gradient, 123
ethical aspects, 26
Euclidean distance, 355
Euclidean norm, 355
evidence, 302
evidence lower bound (ELBO), 302
exact line search, 146
expected generalization performance, 55
exploding gradient, 189, 202
exponential decay parameter, 119
exponential smoothing, 127
expression swell, 133
extracted features, 166
F1 score, 38
Fβ score, 38
face recognition, 241, 245
false negative (FN), 36
false positive (FP), 36
false positive rate, 37
fashion MNIST dataset, 19
fast Fourier transform, 245
fast.ai, 4, 25
feature based, 232
feature engineering, 176
feature extraction, 9
feature maps, 206, 223
feature vectors, 17
feedback control, 329
feedforward deep neural network, 165
feedforward fully connected neural network, 7, 9
feedforward network, 7, 165
feedforward pass, 170
filtering, 204
fine-tuning, 4
finite sum problem, 114
first Wolfe condition, 149
first-order method, 127
first-order Taylor’s approximation, 363
Flux.jl, 25, 162, 201
fMRI (functional magnetic resonance imaging), 239
forward mode, 136
forward pass, 140, 168
Fourier analysis, 283
Fourier transform, 347
freezing, 277
freezing layers, 228
Fruits 360, 4
full SVD, 42, 69
fully connected deep autoencoder architecture, 9
fully connected graph, 339
fully connected layers, 9
fully connected network, 7, 165
fully convolutional network, 224, 230
game theory, 11
gate, 174, 255
gated recurrent unit, 10, 268, 292
Gaussian, 367
Gaussian mixture model, 299, 351
general additive models, 72
general fully connected architecture, 165
general fully connected neural network, 165
generalization ability, 52
generalization error, 54
generalization gap, 55
generalization performance, 54
generalized additive model, 34
generalized linear model, 34, 72, 76
generalized recursive neuron, 352
generative adversarial network (GAN), 7, 11, 26, 296, 351
generative modelling, 29, 295
generative models, 38
generator, 11, 314
GLaM, 293
global minimum, 112
GloVe, 292
Golub-Reinsch algorithm, 73
GoogLeNet, 234, 235, 245
Gopher, 293
GPT-2, 293
GPT-3, 26, 293
gradient, 357
gradient boosted trees, 39
gradient boosting, 72
gradient clipping, 189, 262
gradient descent, 48
gradient magnitude, 118
gradient penalty, 323
Gram matrix, 42
graph, 337
graph attention networks, 349, 352
graph convolutional network, 346, 352
graph embeddings, 341
graph neural networks, 12, 273, 336
graphical processing units, 15
GraphSAGE, 352
grid-structured data, 203
group normalization, 231, 245
Hadamard product, 130
handwriting recognition, 292
He initialization, 190
Heaviside step function, 110
Hessian, 359
Hessian matrix, 84
hidden layer, 99, 166
hidden Markov models, 351
hidden state, 253, 265, 268, 342
hierarchical clustering, 73
hierarchical Markovian variational autoencoders, 11, 351
hierarchical variational autoencoders, 306, 351
high dimensional, 15
high pass filtering, 348
hold out set, 30
HOPE, 352
Hopfield networks, 201
Huber error loss, 43, 73
HuggingFace, 292
hyper-parameter, 59
hyperbolic tangent, 179
hypothesis tests, 32, 110
i.i.d., 77
identifiable, 87
identification, 241
identity activation function, 83
image captioning, 250
image classification, 246
image processing, 10
image sequences, 239
image to image paradigm, 324, 325
image to text, 269
Imagen, 351
Imagen Video, 351
ImageNet, 25
ImageNet challenge, 20, 245
ImageNet database, 3, 20
imitation game, 13
imitation learning, 352
impulse response, 208
impulse signal, 208
in-degree, 339
inception module, 235
inception network, 234, 235, 245
indicator, 45
inductive learning, 341
inexact line search, 146, 149
infinite horizon expected discounted reward, 330
inflection point, 152
Info-GAN, 324, 352
information theory, 7, 201
inherent noise, 58
inner product, 355
inpainting, 351
input channels, 213
input layer, 165
input neuron, 165
instance segmentation, 241
interaction term, 34, 72
intercept, 39, 76
internal cell state, 265
internal features, 206
internal gates, 265
internal hidden state, 268
interpolation on the latent space, 109
interpretable machine learning, 232
interpretable models, 80
interpretation, 33, 80, 231
Jacobian, 358
Jacobian vector product, 135, 361
JAX, 162, 201
Jensen-Shannon distance, 367
Jensen-Shannon divergence, 366
Julia, 25
K-fold cross validation, 30, 60, 73
K-means, 62, 73
K-nearest neighbours, 39, 72
Kalman filtering, 326
Kantorovich-Rubinstein duality theorem,
321
Keras, 4, 25
kernel methods, 72
key, 272, 279
knee point, 65
Krylov subspace methods, 154
Kullback–Leibler divergence, 365
L-BFGS, 151
labelling, 63
labels, 17
LaMDA, 293
landmark detection, 240
Laplacian matrix, 347
large language models, 7, 11, 13, 250, 269, 290, 353
lasso, 59, 73
latent space, 107, 250, 296
latent variable, 77
latent variable sample marginal distribution, 301
layer, 165
layer normalization, 286, 292
leaky ReLU, 180, 201
learning, 3, 27
learning rate, 48, 128
learning to rank, 110
least absolute shrinkage and selection operator, 59
least squares, 72
least squares problem, 41
LeNet-5, 234, 245
likelihood, 30
likelihood function, 77
limited-memory BFGS, 151, 159
LINE, 352
line search, 142, 145
linear algebra, 26
linear approximation, 362, 363
linear autoencoder, 103
linear classifiers, 81
linear discriminant analysis, 38, 72
linear model, 9
linear programming, 321
linear regression, 72
linear time invariant, 207, 245
linearity, 208
linearly separable, 110
link function, 76
loading vector, 67
local minimum, 112
locality, 204, 343
localization, 5
localization and classification, 240
locally convex, 114
locally estimated scatterplot smoothing (LOESS), 34, 72
log odds, 76
log-density, 367
log-likelihood, 78
log-sum-exp, 116
logistic, 179
logistic distribution, 77
logistic function, 76
logistic regression, 9, 35
logit, 76
long short term memory, 10, 292
look-ahead momentum, 143
look-ahead prediction, 250
loss function, 31, 40
loss landscape, 51
low pass filtering, 348
low rank approximation, 70
machine learning, 2
machine translation, 11, 250, 269
manual annotation process, 18
many to many, 251
many to one, 251
Maple, 25
mapping network, 325
margin, 243
Markov chain, 306, 328
Markov decision processes, 327
Markov property, 306
Markovian, 306
Markovian hierarchical variational autoencoder, 306
masked self attention, 281
masking, 282
Mathematica, 25
mathematical engineering, 7
mathematical game, 315
MATLAB, 25
matrix calculus, 202
max-pooling, 226
maximum a posteriori probability, 93, 168
maximum likelihood estimation, 43, 77
mean computation, 63
mean square error, 31
mean vector, 298
Mechanical Turk, 25
Megatron-Turing NLG, 293
message, 344
message passing, 344, 352
message passing neural network (MPNN), 344
mini-batch, 123
mini-batch gradient descent, 123
minimax objective, 315
mixed models, 34
mixture components, 299
mixture weights, 299
MNIST database, 18
mode collapse, 316
model bias, 53, 58
model misspecification, 34
model parameters, 33
model selection, 52, 73
model variance, 53, 58
models, 27
momentum, 127, 128
momentum parameter, 128
momentum update, 128
monomial, 96
Moore-Penrose pseudo inverse, 42, 73
multi-class classification, 32, 45
multi-class logistic regression, 86
multi-collinearity, 73
multi-graph, 340
multi-head attention, 278
multi-head cross attention, 289
multi-head self attention, 278
multi-index, 363
multi-layer dense network, 165
multi-layer perceptron, 7, 9, 165, 201
multimodal model, 11, 292
multinomial distribution, 87
multinomial logistic regression, 86
multinomial regression model, 9, 86
multiplication gate, 174
multivariate chain rule, 360
multivariate normal distributions, 298
Nadam, 142
Nadaraya-Watson kernel regression, 34, 72
naive Bayes classifier, 38, 72
narrow tasks, 13
natural language processing (NLP), 10, 20, 248, 292
NCHW, 188
negative definite, 359
negative predictive value, 36
negative semidefinite, 359
neighbours, 339
Neocognitron, 234, 245
Nesterov accelerated gradient, 142
Nesterov acceleration, 142
Nesterov momentum, 142
network dissection, 246
network inversion, 246
network within a network, 234, 235
network-in-network, 245
neural network, 1
neural style transfer, 325
neuron, 165
Newton’s method, 151
Newton-Raphson, 151, 163
NHWC, 188
no self loops, 344
node level features, 340
node set, 337
node2vec, 352
noise features, 66
noising mechanism, 306
nominal categorical variable, 17
non-interpretable, 80
non-linear PCA, 105
non-saturating GAN, 319
normal, 367
normal equations, 42
normalization, 192
normalization of the data, 31
normalizing flows, 352
NS-GAN, 352
number of output channels, 223
numerical, 17
numerical differentiation, 133
object detection, 240
object localization, 240
objective function, 112
observations, 12, 326
occlusions, 232
odds, 76
odds ratio, 80
one by one convolutional layer, 229
one step transition probabilities, 306
one to many, 251
one vs. all, 45
one vs. one, 45
one vs. rest, 45
one-hot encoding, 43, 248
one-hot encoding positional embedding, 282
open loop control, 329
ordinal categorical variable, 17
ordinal regression, 110
oscillation, 153
out-degree, 339
output layer, 166
over-training, 124
overcomplete, 110
overfit, 58
overfitting, 52, 124
overshoot, 152
padding, 218
parameter estimates, 33
parametric model, 298
partial derivative, 357
partially observable Markov decision processes, 327
path, 339
peaks function, 113
perceptron, 9, 25, 110, 201
perceptron learning algorithm, 110
performance function, 53
performance metrics, 53
permutation invariance, 343
permutation matrix, 339
piecewise affine function, 180
planar graph, 340
policy, 329
policy iteration, 332
pooling, 10, 224
pooling stride, 227
positional embeddings, 278, 282
positive definite, 359
positive predictive value, 36
positive semidefinite, 359
power product, 96
precision, 36
predefined learning rate schedule, 119
prediction, 28
PReLU, 180, 201
preprocessing, 31, 228
primal, 138
principal component analysis (PCA), 62, 66
principal components, 66
prior matching, 301, 308
probability, 7
probit regression model, 77, 110
prompt, 6
proxy vectors, 272
pure function, 135
pure mathematics, 7
Python, 4, 25
PyTorch, 4, 25, 162, 201
PyTorch Lightning, 25
Q-function, 331
Q-learning, 333, 352
Q-table, 333
quadratic approximation, 362, 363
quadratic loss, 40
quantum deep learning, 26
quasi-Newton, 151
quasi-Newton method, 154, 163
query, 272, 279
R statistical computing system, 25
random forest, 39, 72
ranking learning, 110
real world, 27, 28
recall, 36
receiver operating characteristic (ROC) curve, 36
receptive field, 220
receptive field of a derived feature, 227
reconstruction, 301
reconstruction term, 308
recurrence relation, 253
recurrent neural network, 7, 248, 253, 292
recursive graph, 253
reduced SVD, 42, 69
reference level, 45
regression, 5, 28, 32
regression parameter, 39, 76
regularization, 53, 59, 195, 202
regularization parameter, 59
regularization term, 59
reinforcement learning, 7, 12, 29, 326
relative entropy, 365
relative improvement criterion, 118
ReLU, 180, 201
reparameterization trick, 311
replay memory, 335
researchers, 7
reset gate, 268
residual, 40
residual connections, 234, 238
ResNet, 235, 238, 245
restoration, 351
reward, 12, 326
reward function, 330
ridge regression, 59, 73
RMSprop, 130
RNN cell, 255
RNN unit, 255
RoBERTa, 293
robust autoencoders, 110
root mean square propagation, 130
Rosenbrock function, 162
row-major, 17
saddle points, 114
sample correlation, 52
sample correlation matrix, 66
sample covariance matrix, 66
sample mean, 31
sample standard deviation, 31
sample variance, 31
Schur product, 130
scientific machine learning, 15
score function, 272
scree plot, 71
secant equation, 158
secant method, 151, 153
second Wolfe condition, 150
second-order methods, 150
second-order Taylor’s approximation, 363
seen data, 30, 52
self attention, 278, 279
self loops, 338, 344
self-supervised learning, 30
self-supervision, 262
selu, 180
semantic segmentation, 5, 240
semi-supervised learning, 29
sensitivity, 36
sentiment, 18
sentiment analysis, 250
sequence GAN, 352
shallow, 84
shallow neural network, 83
shortcut connections, 237
siamese network, 242
sigmoid, 168, 174, 179
sigmoid function, 76
simple gate, 255
simple linear regression model, 32
single layer, 83
single sample Bellman estimate, 334
singular value decomposition (SVD), 42, 69, 73
singular values, 69
singular vectors, 69
skip-gram model, 252
Sobel filter, 204, 245
softmax, 168
softmax activation function, 88
softmax logistic regression, 86
softmax regression, 86
softplus, 180
softsign, 180
sparse autoencoders, 110
spatial-temporal graph neural networks, 352
specificity, 36
spectral, 347
spectral convolutional graph neural networks, 347, 352
spectral decomposition, 69, 347
spectral graph neural networks, 347
spectral graph theory, 347
spectral weights, 348
spiking neural network, 25
split strategy (80-20), 35
square loss, 40
SqueezeNet, 245
stacked recurrent neural network, 264
standard multivariate normal, 298
standard normal, 367
standardization of the data, 31
standardized samples, 31
state evolution, 253
state exploration, 333
state information, 328
state space, 329
static graph neural networks, 342
statistical learning, 2
statisticians, 6
statistics, 2, 7
steepest descent, 117
step, 179
step size, 117
Stirling’s approximation, 97
stochastic approximation, 334
stochastic gradient descent, 121, 162, 201
strictly convex, 114
stride, 218
stride of one, 219
strong Wolfe condition, 150
style transfer paradigm, 324, 325
style-GAN, 325, 352
supervised learning, 17, 28
support vector machines (SVM), 39, 72
swish, 180
symbolic differentiation, 133
synthesis network, 325
synthetic minority oversampling technique (SMOTE), 38
tangent, 138
tanh, 179, 201
tasks, 5
tasks on edges, 341
tasks on graphs, 341
tasks on nodes, 341
Taylor polynomial, 362
Taylor’s theorem, 362
teacher forcing, 263
temporal difference learning, 335
tensor processing units, 15
TensorFlow, 4, 25, 162, 201
termination condition, 48, 117, 132, 157
test set, 18, 30
testing data, 30
text in reverse order, 271
text to image, 269, 351
threshold, 168
Tikhonov regularization, 59, 73
time delay neural network, 245
time homogeneous, 328
time horizon, 250
time invariance, 208
time-series, 292
Toeplitz matrix, 210, 245
tokenizers, 248
tokens, 248
train set, 18
train-validate split, 60
train-validate-test split, 61
trainable convolutions, 10
training, 2
training data, 30
training loss, 124
training set, 30, 60
training time, 57
transductive learning, 341
transfer learning, 4, 29, 228
transformer architecture, 11, 353
transformer block, 278, 284
transformer decoder block, 288
transformer encoder-decoder architecture, 287
transformer models, 7, 246, 269, 292
transformers, 235, 248, 277
translation invariance, 204, 343
triangle inequality, 356
triplet loss, 243
true negative (TN), 36
true positive (TP), 36
truncated backpropagation through time, 262
trust region methods, 153
tuple, 23
Turing test, 13, 25
unbiased, 121
unbiased estimator, 54
unconstrained, 112
uncropping, 351
undercomplete, 110
underfitting, 52, 58
undirected graph, 338
unfolded graph, 253
unit, 103, 165
unit (RNN), 250
univariate, 32
unseen data, 30, 52
unseen input data, 3
unsupervised learning, 28
update, 344
update gate, 268
update rule (quasi-Newton), 156
upsampling, 325
validation set, 30, 60
value function, 331
value iteration, 332
values, 279
vanishing, 202
vanishing gradient, 189
variational autoencoder, 11, 110, 296, 351
variational Bayes, 351
variational posterior, 300
VC dimension, 72
vector, 23
vector Jacobian product, 135, 362
vertex, 338
VGG model, 245
VGG16, 25
VGG19, 3, 25
VGGNet, 235
visual cortex, 13
volume convolution, 212
W-GAN, 352
Wasserstein distance, 319, 320
Wasserstein GAN, 319
weight based, 232
weight clipping, 323
weight decay, 199
weight initialization, 9, 171, 202
weight matrix, 167
weight vector, 39, 76
weighted Frobenius norm, 159
weighted graph, 339
Wisconsin breast cancer dataset, 35
word embedding, 248, 252, 292
word2vec, 252, 292
WordNet, 25
Xavier initialization, 190, 202
XGBoost, 72
XLNet, 293
YOLO, 245
ZFNet, 235