# More Annotations

Last Days of Spring - Personal Lifestyle Blog

HIFI-REGLER - Specialty Mail Order for HiFi and Home Cinema

Real Estate Philippines - Buy, Sell, & Rent Properties - MyProperty.ph

A complete backup of maximus3102math.weebly.com

Browse or Chat with Thousands of Asian Girls with the Asian Dating Site - AsiaMe.com

death wears green — LiveJournal

# Favourite Annotations

Quality Handbags + Cheap Handbags + Children's Goods + Luggage - mankshop.cz

HodlBot - The Best Crypto Trading Bot for Binance & Kraken

Official Website - Sokrytoe Sokrovishche (Hidden Treasure)

Everything for an Unforgettable Birthday

Expat Academy - Where Global Mobility Professionals Learn, Connect & Share

# Text

### 7.1. Deep Convolutional Neural Networks (AlexNet)

Although CNNs were well known in the computer vision and machine learning communities following the introduction of LeNet, they did not immediately dominate the field.

### 10.6. Self-Attention and Positional Encoding

10.6.2. Comparing CNNs, RNNs, and Self-Attention. Let us compare architectures for mapping a sequence of \(n\) tokens to another sequence of equal length, where each input or output token is represented by a \(d\)-dimensional vector. Specifically, we will consider CNNs, RNNs, and self-attention.

### 13.11. Fully Convolutional Networks (FCN)

13.11.1. Constructing a Model. Here, we demonstrate the most basic design of a fully convolutional network model. As shown in Fig. 13.11.1, the fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a \(1\times 1\) convolution layer, and finally transforms the height and width of the feature maps back to those of the input image via transposed convolution.

### 9.1. Gated Recurrent Units (GRU)

9.1.1.1. Reset Gate and Update Gate. The first things we need to introduce are the reset gate and the update gate. We engineer them to be vectors with entries in \((0, 1)\) so that we can perform convex combinations. For instance, a reset gate would allow us to control how much of the previous state we might still want to remember.

### 6.3. Padding and Stride

6.3.2. Stride. When computing the cross-correlation, we start with the convolution window at the top-left corner of the input tensor and then slide it over all locations, both down and to the right.

### 6.6. Convolutional Neural Networks (LeNet)

We now have all the ingredients required to assemble a fully functional CNN. In our earlier encounter with image data, we applied a softmax regression model (Section 3.6) and an MLP model (Section 4.2) to pictures of clothing in the Fashion-MNIST dataset.

### Dive into Deep Learning

An interactive deep learning book with code, math, and discussions. Implemented with NumPy/MXNet, PyTorch, and TensorFlow. Adopted at 175 universities from 40 countries.

### 3.4. Softmax Regression

3.4.1. Classification Problem. To get our feet wet, let us start off with a simple image classification problem. Here, each input consists of a \(2\times2\) grayscale image. We can represent each pixel value with a single scalar, giving us four features \(x_1, x_2, x_3, x_4\). Further, let us assume that each image belongs to one of the categories "cat", "chicken", and "dog".

### 10.5. Multi-Head Attention

In practice, given the same set of queries, keys, and values, we may want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence.

### 9.7. Sequence to Sequence Learning

9.7.1. Encoder. Technically speaking, the encoder transforms an input sequence of variable length into a fixed-shape context variable \(\mathbf{c}\), and encodes the input sequence information in this context variable. As depicted in Fig. 9.7.1, we can use an RNN to design the encoder. Let us consider a sequence example.
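The softmax classifier described above maps raw per-category scores to a probability distribution. A minimal sketch in plain Python (the function name and the three-class scores are illustrative, not from the book):

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution.

    Subtracting the maximum score first is a standard trick for
    numerical stability; it does not change the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for the categories "cat", "chicken", "dog"
probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive and sum to 1, and the largest score receives the largest probability, which is why the model's prediction is simply the arg-max category.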

### 11.1. Optimization and Deep Learning

11.1.2.3. Vanishing Gradients. Probably the most insidious problem to encounter is the vanishing gradient. Recall our commonly used activation functions and their derivatives in Section 4.1.2. For instance, assume that we want to minimize the function \(f(x) = \tanh(x)\) and we happen to get started at \(x = 4\). As we can see, the gradient of \(f\) is close to nil.

### 16. Recommender Systems

Shuai Zhang (Amazon), Aston Zhang (Amazon), and Yi Tay (Google). Recommender systems are widely employed in industry and are ubiquitous in our daily lives. These systems are utilized in a number of areas, such as online shopping sites (e.g., amazon.com), music/movie service sites (e.g., Netflix and Spotify), and mobile application stores (e.g., the iOS App Store and Google Play).
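The vanishing-gradient example above is easy to check numerically: for \(f(x) = \tanh(x)\), the derivative is \(f'(x) = 1 - \tanh^2(x)\), which is tiny at \(x = 4\). A quick sketch:

```python
import math

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

g = tanh_grad(4.0)  # roughly 0.0013: gradient descent barely moves from here
```

At \(x = 0\) the same derivative equals 1, so the problem is specific to saturated regions of the activation, which is exactly why initialization and activation choice matter.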

### d2l.torch — Dive into Deep Learning documentation

Defined in file: ./chapter_recurrent-neural-networks/language-models-and-dataset.md

### 10.4. Bahdanau Attention

10.4.1. Model. When describing Bahdanau attention for the RNN encoder-decoder below, we will follow the same notation as in Section 9.7. The new attention-based model is the same as that in Section 9.7, except that the context variable \(\mathbf{c}\) is replaced by \(\mathbf{c}_{t'}\) at any decoding time step \(t'\). Suppose that there are \(T\) tokens in the input sequence; the context variable at decoding time step \(t'\) is the output of attention pooling.

### 18.7. Maximum Likelihood

18.7.1. The Maximum Likelihood Principle. This has a Bayesian interpretation which can be helpful to think about. Suppose that we have a model with parameters \(\boldsymbol{\theta}\) and a collection of data examples \(X\). For concreteness, we can imagine that \(\boldsymbol{\theta}\) is a single value representing the probability that a coin comes up heads when flipped, and \(X\) is a sequence of independent coin flips.

### 11.11. Learning Rate Scheduling

Let us have a look at what happens if we invoke this algorithm with default settings, such as a learning rate of \(0.3\), and train for \(30\) iterations. Note how the training accuracy keeps on increasing while progress in terms of test accuracy stalls beyond a point.

### 14.10. Pretraining BERT

With the BERT model implemented in Section 14.8 and the pretraining examples generated from the WikiText-2 dataset in Section 14.9, we will pretrain BERT on the WikiText-2 dataset in this section.

### 9.2. Long Short-Term Memory (LSTM)

9.2.1. Gated Memory Cell. Arguably LSTM's design is inspired by the logic gates of a computer. LSTM introduces a memory cell (or cell for short) that has the same shape as the hidden state (some literature considers the memory cell a special type of hidden state), engineered to record additional information. To control the memory cell we need a number of gates.

### 13.10. Transposed Convolution

13.10.1. Basic 2D Transposed Convolution. Let us consider the basic case in which both input and output channels are 1, with 0 padding and stride 1. Fig. 13.10.1 illustrates how a transposed convolution with a \(2\times 2\) kernel is computed on a \(2\times 2\) input matrix.
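The basic 2D transposed convolution described above (single channel, 0 padding, stride 1) can be computed by hand: each input element scales the kernel, and the scaled copies are summed into an output of shape \((n_h + k_h - 1) \times (n_w + k_w - 1)\). A plain-Python sketch (the function name and the sample values are illustrative, not taken from the book's figure):

```python
def trans_conv2d(X, K):
    """Transposed convolution: single channel, stride 1, no padding."""
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    # Output is larger than the input: (n_h + k_h - 1) x (n_w + k_w - 1).
    Y = [[0.0] * (n_w + k_w - 1) for _ in range(n_h + k_h - 1)]
    for i in range(n_h):
        for j in range(n_w):
            # Each input element broadcasts a scaled copy of the kernel.
            for a in range(k_h):
                for b in range(k_w):
                    Y[i + a][j + b] += X[i][j] * K[a][b]
    return Y

X = [[0.0, 1.0], [2.0, 3.0]]
K = [[0.0, 1.0], [2.0, 3.0]]
Y = trans_conv2d(X, K)  # 3x3 output: [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
```

Note the shape rule is the exact inverse of the valid cross-correlation's \(n - k + 1\), which is why transposed convolution is used in the FCN above to grow feature maps back to image size.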
### 9.4. Bidirectional Recurrent Neural Networks

9.4.1. Dynamic Programming in Hidden Markov Models. This subsection serves to illustrate the dynamic programming problem. The specific technical details do not matter for understanding the deep learning models, but they help in motivating why one might use deep learning.

### 9.3. Deep Recurrent Neural Networks

9.3.1. Functional Dependencies. We can formalize the functional dependencies within the deep architecture of \(L\) hidden layers depicted in Fig. 9.3.1. Our following discussion focuses primarily on the vanilla RNN model, but it applies to other sequence models, too.

### 6.3. Padding and Stride

In the previous example of Fig. 6.2.1, our input had both a height and width of 3 and our convolution kernel had both a height and width of 2, yielding an output representation with dimension \(2\times2\). As we generalized in Section 6.2, assuming that the input shape is \(n_h\times n_w\) and the convolution kernel shape is \(k_h\times k_w\), the output shape will be \((n_h-k_h+1) \times (n_w-k_w+1)\).

### 9.5. Machine Translation and the Dataset

9.5.1. Downloading and Preprocessing the Dataset. To begin with, we download an English-French dataset that consists of bilingual sentence pairs from the Tatoeba Project. Each line in the dataset is a tab-delimited pair of an English text sequence and the translated French text sequence.
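The output-shape rule quoted above, \((n_h-k_h+1) \times (n_w-k_w+1)\) for a no-padding, stride-1 cross-correlation, generalizes to \(\lfloor(n-k+p+s)/s\rfloor\) per dimension once total padding \(p\) and stride \(s\) enter. A sanity-check sketch (the helper name is illustrative):

```python
def conv_output_shape(n_h, n_w, k_h, k_w, p_h=0, p_w=0, s_h=1, s_w=1):
    """Output shape of a 2D cross-correlation.

    p_h/p_w count total padding (both sides); with p = 0 and
    s = 1 the general rule floor((n - k + p + s) / s) reduces
    to the familiar n - k + 1.
    """
    return ((n_h - k_h + p_h + s_h) // s_h,
            (n_w - k_w + p_w + s_w) // s_w)

# The 3x3 input and 2x2 kernel from the example above give a 2x2 output.
shape = conv_output_shape(3, 3, 2, 2)  # (2, 2)
```

The same helper shows why "same" convolutions pick \(p = k - 1\): an \(8\times8\) input with a \(3\times3\) kernel and total padding 2 per dimension stays \(8\times8\).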

### 18. Appendix: Mathematics for Deep Learning

Brent Werness (Amazon), Rachel Hu (Amazon), and the authors of this book. One of the wonderful parts of modern deep learning is the fact that much of it can be understood and used without a full understanding of the mathematics below it.

### 9.6. Encoder-Decoder Architecture

As we have discussed in Section 9.5, machine translation is a major problem domain for sequence transduction models, whose inputs and outputs are both variable-length sequences. To handle this type of inputs and outputs, we can design an architecture with two major components: an encoder and a decoder.
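The two-component design described above can be captured as a pair of interfaces: an encoder that turns a variable-length input into state, and a decoder that consumes that state. A minimal framework-free sketch (class and method names are illustrative, modeled loosely on the book's interface rather than copied from it):

```python
class Encoder:
    """Maps a variable-length input sequence to an encoded state."""
    def forward(self, X):
        raise NotImplementedError

class Decoder:
    """Generates an output sequence from the encoder's state."""
    def init_state(self, enc_outputs):
        raise NotImplementedError
    def forward(self, X, state):
        raise NotImplementedError

class EncoderDecoder:
    """Chains the two components: encode, initialize state, decode."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder
    def forward(self, enc_X, dec_X):
        enc_outputs = self.encoder.forward(enc_X)
        state = self.decoder.init_state(enc_outputs)
        return self.decoder.forward(dec_X, state)
```

Keeping `init_state` separate from `forward` is what lets attention-based variants (such as Bahdanau attention above) recompute the context at every decoding step instead of fixing one context vector.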

### 16.6. Neural Collaborative Filtering for Personalized Ranking

16.6.1. The NeuMF Model. As aforementioned, NeuMF fuses two subnetworks. The GMF is a generic neural network version of matrix factorization, where the input is the elementwise product of the user and item latent factors.
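The GMF subnetwork described above scores a user-item pair from the elementwise product of their latent factors. A minimal sketch in plain Python (names and the final weighted sum are illustrative; the full NeuMF model additionally fuses an MLP branch and applies a sigmoid):

```python
def gmf_score(user_factors, item_factors, weights):
    """Generalized matrix factorization: weight the elementwise
    product of user and item latent factors and sum.

    With all weights equal to 1 this reduces to the plain dot
    product of classic matrix factorization.
    """
    products = [u * v for u, v in zip(user_factors, item_factors)]
    return sum(w * p for w, p in zip(weights, products))

score = gmf_score([0.5, 1.0], [2.0, 3.0], [1.0, 1.0])  # plain dot product: 4.0
```

The learned output weights are what make GMF a *generalization*: they let the model weight each latent dimension's interaction differently instead of summing them uniformly.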


### 5.1. Layers and Blocks

When we first introduced neural networks, we focused on linear models with a single output. Here, the entire model consists of just a single neuron. Note that a single neuron (i) takes some set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of associated parameters that can be updated to optimize some objective function of interest.
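The single-neuron description above can be illustrated without any framework. A toy sketch in plain Python (class names are illustrative, not the book's PyTorch/MXNet code):

```python
class Neuron:
    """A single neuron: (i) takes some inputs, (ii) produces one
    scalar output, (iii) holds parameters training can update."""
    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias
    def forward(self, inputs):
        return sum(w * x for w, x in zip(self.weights, inputs)) + self.bias

class Layer:
    """A block of neurons that share the same inputs; the layer's
    output is the vector of its neurons' scalar outputs."""
    def __init__(self, neurons):
        self.neurons = neurons
    def forward(self, inputs):
        return [n.forward(inputs) for n in self.neurons]

layer = Layer([Neuron([1.0, 0.0]), Neuron([0.0, 1.0], bias=1.0)])
out = layer.forward([2.0, 3.0])  # [2.0, 4.0]
```

The point of the block abstraction is that `Layer.forward` has the same shape as `Neuron.forward`, so blocks can be composed into bigger blocks without the caller caring what is inside.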

19.5. SELECTING SERVERS AND GPUS 19.5. Selecting Servers and GPUs — Dive into Deep Learning 0.16.4 documentation. 19.5. Selecting Servers and GPUs. Deep learning training generally requires large amounts of computation. At present GPUs are the most cost-effective hardware accelerators for deep learning. In particular, compared with CPUs, GPUs are cheaper and offer higher

3.4. SOFTMAX REGRESSION 3.4.1. Classification Problem¶. To get our feet wet, let us start off with a simple image classification problem. Here, each input consists of a \(2\times2\) grayscale image. We can represent each pixel value with a single scalar, giving us four features \(x_1, x_2, x_3, x_4\). Further, let us assume that each image belongs to one among the categories “cat”, “chicken”, and “dog”. 7.1. DEEP CONVOLUTIONAL NEURAL NETWORKS (ALEXNET) Deep Convolutional Neural Networks (AlexNet) — Dive into Deep Learning 0.16.2 documentation. 7.1. Deep Convolutional Neural Networks (AlexNet) Although CNNs were well known in the computer vision and machine learning communities following the introduction of LeNet, they did not immediately dominate the field. 16.9. FACTORIZATION MACHINES Factorization machines (FM), proposed by Steffen Rendle in 2010, is a supervised algorithm that can be used for classification, regression, and ranking tasks. It quickly attracted notice and became a popular and impactful method for making predictions and recommendations. In particular, it is a generalization of the linear regression model and the matrix factorization model.

### 11.10. ADAM

11.10. Adam. In the discussions leading up to this section we encountered a number of techniques for efficient optimization. Let us recap them in detail here: We saw that Section 11.4 is more effective than Gradient Descent when solving optimization problems, e.g., OPERATOR OPTIMIZATIONS ON CPUS Operator Optimizations on CPUs¶. In the past three chapters we mainly focus on the functionality of operators, namely, how to implement the operators to function correctly in TVM.
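As a companion to the Adam recap above, here is a minimal NumPy sketch of one Adam update with the standard defaults (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\)). The function name and toy values are ours, not the book's:

```python
import numpy as np

def adam_step(params, grads, state, t, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update: leaky averages of the first and second
    moments of the gradient, with bias correction at step t."""
    v, s = state
    v = beta1 * v + (1 - beta1) * grads
    s = beta2 * s + (1 - beta2) * grads ** 2
    v_hat = v / (1 - beta1 ** t)          # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)          # bias-corrected second moment
    params = params - lr * v_hat / (np.sqrt(s_hat) + eps)
    return params, (v, s)

w = np.array([1.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w))
g = np.array([0.5, -0.5])
w, state = adam_step(w, g, state, t=1)
```

On the very first step the bias correction exactly cancels the zero initialization, so the update is close to `lr` times the sign of the gradient; here `w` moves to roughly `[0.99, -1.99]`.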

9.3. DEEP RECURRENT NEURAL NETWORKS 9.3.1. Functional Dependencies¶. We can formalize the functional dependencies within the deep architecture of \(L\) hidden layers depicted in Fig. 9.3.1. Our following discussion focuses primarily on the vanilla RNN model, but it applies to other sequence models, too.

### 13. COMPUTER VISION

13. Computer Vision — Dive into Deep Learning 0.16.4 documentation. 13. Computer Vision. Whether it is medical diagnosis, self-driving vehicles, camera monitoring, or smart filters, many applications in the field of computer vision are closely related to our current and future lives. In recent years, deep learning has been the transformative

6.3. PADDING AND STRIDE In the previous example of Fig. 6.2.1, our input had both a height and width of 3 and our convolution kernel had both a height and width of 2, yielding an output representation with dimension \(2\times2\). As we generalized in Section 6.2, assuming that the input shape is \(n_h\times n_w\) and the convolution kernel shape is \(k_h\times k_w\), then the output shape will be \((n_h-k_h+1) \times (n_w-k_w+1)\). 9.4. BIDIRECTIONAL RECURRENT NEURAL NETWORKS 9.4.1. Dynamic Programming in Hidden Markov Models¶. This subsection serves to illustrate the dynamic programming problem. The specific technical details do not matter for understanding the deep learning models but they help in motivating why one might use deep 6. CONVOLUTIONAL NEURAL NETWORKS Convolutional Neural Networks — Dive into Deep Learning 0.16.4 documentation. 6. Convolutional Neural Networks. In earlier chapters, we came up against image data, for which each example consists of a two-dimensional grid of pixels. Depending on whether we are handling black-and-white or color images, each pixel location might be associated
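The output-shape rule in the 6.3 snippet above extends, once total padding \(p\) and stride \(s\) enter, to \(\lfloor (n_h-k_h+p_h+s_h)/s_h \rfloor \times \lfloor (n_w-k_w+p_w+s_w)/s_w \rfloor\). A small helper makes it easy to check; the function name is ours:

```python
def conv_output_shape(n_h, n_w, k_h, k_w, p_h=0, p_w=0, s_h=1, s_w=1):
    """Output height/width of a 2-D cross-correlation, where p_h/p_w are
    total padding rows/columns and s_h/s_w are strides. With p = 0 and
    s = 1 this reduces to (n_h - k_h + 1, n_w - k_w + 1) as in the text."""
    return ((n_h - k_h + p_h + s_h) // s_h,
            (n_w - k_w + p_w + s_w) // s_w)

# The example above: a 3x3 input with a 2x2 kernel gives a 2x2 output.
shape = conv_output_shape(3, 3, 2, 2)
```

Padding one row/column on each side of an \(8\times8\) input with a \(3\times3\) kernel (`p_h = p_w = 2` in total) keeps the output at \(8\times8\); adding stride 2 halves it to \(4\times4\).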

14.2. APPROXIMATE TRAINING Because the softmax operation has considered that the context word could be any word in the dictionary \(\mathcal{V}\), the loss mentioned above actually contains a sum over all the items in the dictionary. From the last section, we know that for both the skip-gram model and the CBOW model, because they both get the conditional probability using a softmax operation, the gradient 11.11. LEARNING RATE SCHEDULING Let us have a look at what happens if we invoke this algorithm with default settings, such as a learning rate of \(0.3\), and train for \(30\) iterations. Note how the training accuracy keeps on increasing while progress in terms of test accuracy stalls beyond a point. 18.7. MAXIMUM LIKELIHOOD 18.7.1. The Maximum Likelihood Principle¶. This has a Bayesian interpretation which can be helpful to think about. Suppose that we have a model with parameters \(\boldsymbol{\theta}\) and a collection of data examples \(X\). For concreteness, we can imagine that \(\boldsymbol{\theta}\) is a single value representing the probability that a coin comes up heads when flipped, and \(X\) is a

### 11.9. ADADELTA

11.9.1. The Algorithm¶. In a nutshell, Adadelta uses two state variables, \(\mathbf{s}_t\) to store a leaky average of the second moment of the gradient and \(\Delta\mathbf{x}_t\) to store a leaky average of the second moment of the change of parameters in the model itself. Note that we use the original notation and naming of the authors for compatibility with other publications and
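The two-state Adadelta update just described can be sketched as follows. This is a minimal illustration with our own function name and toy values; `rho` plays the role of the leaky-average coefficient:

```python
import numpy as np

def adadelta_step(x, grad, s, delta, rho=0.9, eps=1e-5):
    """One Adadelta update using the two state variables described above:
    s, a leaky average of squared gradients, and delta, a leaky average
    of squared parameter changes."""
    s = rho * s + (1 - rho) * grad ** 2
    # Rescaled gradient: the ratio of the two running scales replaces
    # a global learning rate.
    g_prime = np.sqrt((delta + eps) / (s + eps)) * grad
    x = x - g_prime
    delta = rho * delta + (1 - rho) * g_prime ** 2
    return x, s, delta

x = np.array([1.0])
s, delta = np.zeros_like(x), np.zeros_like(x)
x, s, delta = adadelta_step(x, np.array([2.0]), s, delta)
```

Because `delta` starts at zero, the first step is deliberately tiny (on the order of `sqrt(eps)`), and the effective step size then adapts per coordinate as both averages fill in.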


### Table Of Contents

* Preface
* Installation
* Notation
* 1. Introduction
* 2. Preliminaries: 2.1. Data Manipulation; 2.2. Data Preprocessing; 2.3. Linear Algebra; 2.4. Calculus; 2.5. Automatic Differentiation; 2.6. Probability; 2.7. Documentation
* 3. Linear Neural Networks: 3.1. Linear Regression; 3.2. Linear Regression Implementation from Scratch; 3.3. Concise Implementation of Linear Regression; 3.4. Softmax Regression; 3.5. The Image Classification Dataset; 3.6. Implementation of Softmax Regression from Scratch; 3.7. Concise Implementation of Softmax Regression
* 4. Multilayer Perceptrons: 4.1. Multilayer Perceptrons; 4.2. Implementation of Multilayer Perceptrons from Scratch; 4.3. Concise Implementation of Multilayer Perceptrons; 4.4. Model Selection, Underfitting, and Overfitting; 4.5. Weight Decay; 4.6. Dropout; 4.7. Forward Propagation, Backward Propagation, and Computational Graphs; 4.8. Numerical Stability and Initialization; 4.9. Environment and Distribution Shift; 4.10. Predicting House Prices on Kaggle
* 5. Deep Learning Computation: 5.1. Layers and Blocks; 5.2. Parameter Management; 5.3. Deferred Initialization; 5.4. Custom Layers; 5.5. File I/O; 5.6. GPUs
* 6. Convolutional Neural Networks: 6.1. From Fully-Connected Layers to Convolutions; 6.2. Convolutions for Images; 6.3. Padding and Stride; 6.4. Multiple Input and Multiple Output Channels; 6.5. Pooling; 6.6. Convolutional Neural Networks (LeNet)
* 7. Modern Convolutional Neural Networks: 7.1. Deep Convolutional Neural Networks (AlexNet); 7.2. Networks Using Blocks (VGG); 7.3. Network in Network (NiN); 7.4. Networks with Parallel Concatenations (GoogLeNet); 7.5. Batch Normalization; 7.6. Residual Networks (ResNet); 7.7. Densely Connected Networks (DenseNet)
* 8. Recurrent Neural Networks: 8.1. Sequence Models; 8.2. Text Preprocessing; 8.3. Language Models and the Dataset; 8.4. Recurrent Neural Networks; 8.5. Implementation of Recurrent Neural Networks from Scratch; 8.6. Concise Implementation of Recurrent Neural Networks; 8.7. Backpropagation Through Time
* 9. Modern Recurrent Neural Networks: 9.1. Gated Recurrent Units (GRU); 9.2. Long Short-Term Memory (LSTM); 9.3. Deep Recurrent Neural Networks; 9.4. Bidirectional Recurrent Neural Networks; 9.5. Machine Translation and the Dataset; 9.6. Encoder-Decoder Architecture; 9.7. Sequence to Sequence Learning; 9.8. Beam Search
* 10. Attention Mechanisms: 10.1. Attention Mechanisms; 10.2. Sequence to Sequence with Attention Mechanisms; 10.3. Transformer
* 11. Optimization Algorithms: 11.1. Optimization and Deep Learning; 11.2. Convexity; 11.3. Gradient Descent; 11.4. Stochastic Gradient Descent; 11.5. Minibatch Stochastic Gradient Descent; 11.6. Momentum; 11.7. Adagrad; 11.8. RMSProp; 11.9. Adadelta; 11.10. Adam; 11.11. Learning Rate Scheduling
* 12. Computational Performance: 12.1. Compilers and Interpreters; 12.2. Asynchronous Computation; 12.3. Automatic Parallelism; 12.4. Hardware; 12.5. Training on Multiple GPUs; 12.6. Concise Implementation for Multiple GPUs; 12.7. Parameter Servers
* 13. Computer Vision: 13.1. Image Augmentation; 13.2. Fine-Tuning; 13.3. Object Detection and Bounding Boxes; 13.4. Anchor Boxes; 13.5. Multiscale Object Detection; 13.6. The Object Detection Dataset; 13.7. Single Shot Multibox Detection (SSD); 13.8. Region-based CNNs (R-CNNs); 13.9. Semantic Segmentation and the Dataset; 13.10. Transposed Convolution; 13.11. Fully Convolutional Networks (FCN); 13.12. Neural Style Transfer; 13.13. Image Classification (CIFAR-10) on Kaggle; 13.14. Dog Breed Identification (ImageNet Dogs) on Kaggle
* 14. Natural Language Processing: Pretraining: 14.1. Word Embedding (word2vec); 14.2. Approximate Training; 14.3. The Dataset for Pretraining Word Embedding; 14.4. Pretraining word2vec; 14.5. Word Embedding with Global Vectors (GloVe); 14.6. Subword Embedding; 14.7. Finding Synonyms and Analogies; 14.8. Bidirectional Encoder Representations from Transformers (BERT); 14.9. The Dataset for Pretraining BERT; 14.10. Pretraining BERT
* 15. Natural Language Processing: Applications: 15.1. Sentiment Analysis and the Dataset; 15.2. Sentiment Analysis: Using Recurrent Neural Networks; 15.3. Sentiment Analysis: Using Convolutional Neural Networks; 15.4. Natural Language Inference and the Dataset; 15.5. Natural Language Inference: Using Attention; 15.6. Fine-Tuning BERT for Sequence-Level and Token-Level Applications; 15.7. Natural Language Inference: Fine-Tuning BERT
* 16. Recommender Systems: 16.1. Overview of Recommender Systems; 16.2. The MovieLens Dataset; 16.3. Matrix Factorization; 16.4. AutoRec: Rating Prediction with Autoencoders; 16.5. Personalized Ranking for Recommender Systems; 16.6. Neural Collaborative Filtering for Personalized Ranking; 16.7. Sequence-Aware Recommender Systems; 16.8. Feature-Rich Recommender Systems; 16.9. Factorization Machines; 16.10. Deep Factorization Machines
* 17. Generative Adversarial Networks: 17.1. Generative Adversarial Networks; 17.2. Deep Convolutional Generative Adversarial Networks
* 18. Appendix: Mathematics for Deep Learning: 18.1. Geometry and Linear Algebraic Operations; 18.2. Eigendecompositions; 18.3. Single Variable Calculus; 18.4. Multivariable Calculus; 18.5. Integral Calculus; 18.6. Random Variables; 18.7. Maximum Likelihood; 18.8. Distributions; 18.9. Naive Bayes; 18.10. Statistics; 18.11. Information Theory
* 19. Appendix: Tools for Deep Learning: 19.1. Using Jupyter; 19.2. Using Amazon SageMaker; 19.3. Using AWS EC2 Instances; 19.4. Using Google Colab; 19.5. Selecting Servers and GPUs; 19.6. Contributing to This Book; 19.7. d2l API Document
* References


# DIVE INTO DEEP LEARNING

Interactive deep learning book with code, math, and discussions. Implemented with NumPy/MXNet, PyTorch, and TensorFlow. Adopted at 140 universities from 35 countries.

### ANNOUNCEMENTS

* If you plan to use D2L to teach your class in the 2021 Spring semester, you may apply for free computing resources for your class by 11/22/2020.
* We have added PyTorch implementations up to Chapter 11 (Optimization) and TensorFlow implementations up to Chapter 7 (Modern CNNs). To keep track of the latest updates, please follow D2L's open-source project.
* We have re-organized Chapter: NLP pretraining and Chapter: NLP applications, and added sections on BERT (model, data, pretraining, fine-tuning, application) and natural language inference (data, model).
* The Chinese version is the No. 1 best seller of new books in "Computers and Internet" at the largest Chinese online bookstore.
* Slides, Jupyter notebooks, assignments, and videos of the Berkeley course can be found at the syllabus page.

### AUTHORS

### ASTON ZHANG
Amazon Senior Scientist

### ZACHARY C. LIPTON
Amazon Scientist, CMU Assistant Professor

### MU LI
Amazon Senior Principal Scientist

### ALEX J. SMOLA
Amazon VP/Distinguished Scientist

### CHAPTER AUTHORS

### BRENT WERNESS
Amazon Scientist
_Mathematics for Deep Learning_

### RACHEL HU
Amazon Scientist
_Mathematics for Deep Learning_

### SHUAI ZHANG
ETH Zürich Postdoctoral Researcher
_Recommender Systems_

### YI TAY
Google Scientist
_Recommender Systems_

### FRAMEWORK ADAPTATION AUTHORS

### ANIRUDH DAGAR
IIT Roorkee Student
_PyTorch Adaptation_

### YUAN TANG
Ant Group Senior Engineer
_TensorFlow Adaptation_

WE THANK ALL THE COMMUNITY CONTRIBUTORS FOR MAKING THIS OPEN SOURCE BOOK BETTER FOR EVERYONE. CONTRIBUTE TO THE BOOK.

### EACH SECTION IS AN EXECUTABLE JUPYTER NOTEBOOK

You can modify the code and tune hyperparameters to get instant feedback and accumulate practical experience in deep learning. Run locally, on Amazon SageMaker, or in Colab.
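To illustrate the tune-and-rerun workflow the notebooks encourage, here is a minimal stand-alone sketch (plain NumPy, not the book's own `d2l` code, and a hypothetical toy problem): fitting a linear model with gradient descent, where changing `lr` or the epoch count and re-running gives the kind of instant feedback described above.

```python
import numpy as np

# Toy experiment: recover y = 2x + 1 with full-batch gradient descent
# on (half) mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)

w, b = 0.0, 0.0
lr = 0.1                          # hyperparameter: try other values and re-run
for epoch in range(100):
    err = w * x + b - y
    w -= lr * (err * x).mean()    # gradient of the loss w.r.t. w
    b -= lr * err.mean()          # gradient of the loss w.r.t. b

print(f"w={w:.2f}, b={b:.2f}")    # should land near w=2.00, b=1.00
```

Setting `lr` too high makes the loop diverge and too low makes it crawl, which is exactly the behavior the interactive notebooks let you observe immediately.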

### MATHEMATICS + FIGURES + CODE

We offer an interactive learning experience with mathematics, figures, code, text, and discussions, where concepts and techniques are illustrated and implemented with experiments on real data sets.

### ACTIVE COMMUNITY SUPPORT

You can discuss and learn with thousands of peers in the community through the link provided in each section.

### D2L AS A TEXTBOOK OR A REFERENCE BOOK

Alexandria University
Ateneo de Naga University
Birla Institute of Technology and Science, Hyderabad
Cairo University
Carnegie Mellon University
College of Engineering Pune
Columbia University
Duke University
Durban University of Technology
Emory University
Federal University Lokoja
Fudan University
Gayatri Vidya Parishad College of Engineering (Autonomous)
Gazi Üniversitesi
Georgia Institute of Technology
Golden Gate University
Habib University
Hangzhou Dianzi University
Hankuk University of Foreign Studies
Harbin Institute of Technology
Hasso-Plattner-Institut
Heinrich-Heine-Universität Düsseldorf
Hiroshima University
Ho Chi Minh City University of Foreign Languages and Information Technology
Hochschule Bremen
Hochschule für Technik und Wirtschaft
Hong Kong University of Science and Technology
Huazhong University of Science and Technology
Imperial College London
Indian Institute of Technology Bombay
Indian Institute of Technology Jodhpur
Indian Institute of Technology Kanpur
Indian Institute of Technology Kharagpur
Indian Institute of Technology Mandi
Indian Institute of Technology Ropar
Indira Gandhi National Open University
Indraprastha Institute of Information Technology, Delhi
Institut de recherche en informatique de Toulouse
Institut Supérieur d'Informatique et des Techniques de Communication
Institut Supérieur De L'electronique Et Du Numérique
Instituto Tecnológico Autónomo de México
İstanbul Teknik Üniversitesi
IT-Universitetet i København
King Abdullah University of Science and Technology
Kongu Engineering College
KPR Institute of Engineering and Technology
Kyungpook National University
Lancaster University
Leibniz Universität Hannover
Leuphana University of Lüneburg
London School of Economics & Political Science
Massachusetts Institute of Technology
McGill University
Milwaukee School of Engineering
Minia University
Monash University
Multimedia University
National Chung Hsing University
National Institute of Technical Teachers Training & Research
National Institute of Technology, Warangal
National Taiwan University
National United University
National University of Singapore
Nazarbayev University
New York University
Newman University
North Ossetian State University
Northeastern University
Ohio University
Peking University
Politecnico di Milano
Pontificia Universidad Católica de Chile
Portland State University
Purdue University
Queen's University
Radboud Universiteit
Rowan University
Rutgers, The State University of New Jersey
Sapienza Università di Roma
Shanghai Jiao Tong University
Shanghai University of Finance and Economics
Sogang University
Southern New Hampshire University
St. Pölten University of Applied Sciences
Stanford University
Stevens Institute of Technology
Technische Universiteit Delft
Tekirdağ Namık Kemal Üniversitesi
Texas A&M University
Thapar Institute of Engineering and Technology
The State University of New York at Binghamton
The University of Texas at Austin
Tsinghua University
Universidad Carlos III de Madrid
Universidad de Zaragoza
Universidad Militar Nueva Granada
Universidad Nacional Agraria La Molina
Universidad Nacional de Colombia Sede Manizales
Universidade Federal de Minas Gerais
Universidade Federal de Ouro Preto
Universidade Federal do Rio Grande
Universidade NOVA de Lisboa
Universidade Presbiteriana Mackenzie
Università degli Studi di Brescia
Università degli Studi di Catania
Università degli Studi di Padova
Universität Heidelberg
Universitat Politècnica de Catalunya
Universitatea de Vest din Timișoara
Université Paris-Saclay
University of Arkansas
University of Augsburg
University of California, Berkeley
University of California, Los Angeles
University of California, San Diego
University of California, Santa Barbara
University of Cambridge
University of Cincinnati
University of Illinois at Urbana-Champaign
University of Liège
University of Maryland
University of Minnesota, Twin Cities
University of New Hampshire
University of North Carolina at Chapel Hill
University of North Texas
University of Northern Philippines
University of Pennsylvania
University of Science and Technology of China
University of Southern Maine
University of St Andrews
University of Technology Sydney
University of Washington
University of Waterloo
University of Wisconsin Madison
Univerzita Komenského v Bratislave
Vietnamese-German University
Wageningen University
West Virginia University
Western University
Xavier University Bhubaneswar
Yeshiva University
Yunnan University
Zhejiang University

IF YOU USE D2L TO TEACH (OR PLAN TO) AND WOULD LIKE TO RECEIVE A FREE HARDCOPY, PLEASE CONTACT US.

### BIBTEX ENTRY FOR CITING THE BOOK

```
@book{zhang2020dive,
    title={Dive into Deep Learning},
    author={Aston Zhang and Zachary C. Lipton and Mu Li and Alexander J. Smola},
    note={\url{https://d2l.ai}},
    year={2020}
}
```

### TABLE OF CONTENTS

* Preface
* Installation
* Notation
* 1. Introduction
  * 1.1. A Motivating Example
  * 1.2. Key Components
  * 1.3. Kinds of Machine Learning Problems
  * 1.4. Roots
  * 1.5. The Road to Deep Learning
  * 1.6. Success Stories
  * 1.7. Characteristics
  * 1.8. Summary
  * 1.9. Exercises
* 2. Preliminaries
  * 2.1. Data Manipulation
  * 2.2. Data Preprocessing
  * 2.3. Linear Algebra
  * 2.4. Calculus
  * 2.5. Automatic Differentiation
  * 2.6. Probability
  * 2.7. Documentation
* 3. Linear Neural Networks
  * 3.1. Linear Regression
  * 3.2. Linear Regression Implementation from Scratch
  * 3.3. Concise Implementation of Linear Regression
  * 3.4. Softmax Regression
  * 3.5. The Image Classification Dataset
  * 3.6. Implementation of Softmax Regression from Scratch
  * 3.7. Concise Implementation of Softmax Regression
* 4. Multilayer Perceptrons
  * 4.1. Multilayer Perceptrons
  * 4.2. Implementation of Multilayer Perceptrons from Scratch
  * 4.3. Concise Implementation of Multilayer Perceptrons
  * 4.4. Model Selection, Underfitting, and Overfitting
  * 4.5. Weight Decay
  * 4.6. Dropout
  * 4.7. Forward Propagation, Backward Propagation, and Computational Graphs
  * 4.8. Numerical Stability and Initialization
  * 4.9. Environment and Distribution Shift
  * 4.10. Predicting House Prices on Kaggle
* 5. Deep Learning Computation
  * 5.1. Layers and Blocks
  * 5.2. Parameter Management
  * 5.3. Deferred Initialization
  * 5.4. Custom Layers
  * 5.5. File I/O
  * 5.6. GPUs
* 6. Convolutional Neural Networks
  * 6.1. From Fully-Connected Layers to Convolutions
  * 6.2. Convolutions for Images
  * 6.3. Padding and Stride
  * 6.4. Multiple Input and Multiple Output Channels
  * 6.5. Pooling
  * 6.6. Convolutional Neural Networks (LeNet)
* 7. Modern Convolutional Neural Networks
  * 7.1. Deep Convolutional Neural Networks (AlexNet)
  * 7.2. Networks Using Blocks (VGG)
  * 7.3. Network in Network (NiN)
  * 7.4. Networks with Parallel Concatenations (GoogLeNet)
  * 7.5. Batch Normalization
  * 7.6. Residual Networks (ResNet)
  * 7.7. Densely Connected Networks (DenseNet)
* 8. Recurrent Neural Networks
  * 8.1. Sequence Models
  * 8.2. Text Preprocessing
  * 8.3. Language Models and the Dataset
  * 8.4. Recurrent Neural Networks
  * 8.5. Implementation of Recurrent Neural Networks from Scratch
  * 8.6. Concise Implementation of Recurrent Neural Networks
  * 8.7. Backpropagation Through Time
* 9. Modern Recurrent Neural Networks
  * 9.1. Gated Recurrent Units (GRU)
  * 9.2. Long Short-Term Memory (LSTM)
  * 9.3. Deep Recurrent Neural Networks
  * 9.4. Bidirectional Recurrent Neural Networks
  * 9.5. Machine Translation and the Dataset
  * 9.6. Encoder-Decoder Architecture
  * 9.7. Sequence to Sequence Learning
  * 9.8. Beam Search
* 10. Attention Mechanisms
  * 10.1. Attention Mechanisms
  * 10.2. Sequence to Sequence with Attention Mechanisms
  * 10.3. Transformer
* 11. Optimization Algorithms
  * 11.1. Optimization and Deep Learning
  * 11.2. Convexity
  * 11.3. Gradient Descent
  * 11.4. Stochastic Gradient Descent
  * 11.5. Minibatch Stochastic Gradient Descent
  * 11.6. Momentum
  * 11.7. Adagrad
  * 11.8. RMSProp
  * 11.9. Adadelta
  * 11.10. Adam
  * 11.11. Learning Rate Scheduling
* 12. Computational Performance
  * 12.1. Compilers and Interpreters
  * 12.2. Asynchronous Computation
  * 12.3. Automatic Parallelism
  * 12.4. Hardware
  * 12.5. Training on Multiple GPUs
  * 12.6. Concise Implementation for Multiple GPUs
  * 12.7. Parameter Servers
* 13. Computer Vision
  * 13.1. Image Augmentation
  * 13.2. Fine-Tuning
  * 13.3. Object Detection and Bounding Boxes
  * 13.4. Anchor Boxes
  * 13.5. Multiscale Object Detection
  * 13.6. The Object Detection Dataset
  * 13.7. Single Shot Multibox Detection (SSD)
  * 13.8. Region-based CNNs (R-CNNs)
  * 13.9. Semantic Segmentation and the Dataset
  * 13.10. Transposed Convolution
  * 13.11. Fully Convolutional Networks (FCN)
  * 13.12. Neural Style Transfer
  * 13.13. Image Classification (CIFAR-10) on Kaggle
  * 13.14. Dog Breed Identification (ImageNet Dogs) on Kaggle
* 14. Natural Language Processing: Pretraining
  * 14.1. Word Embedding (word2vec)
  * 14.2. Approximate Training
  * 14.3. The Dataset for Pretraining Word Embedding
  * 14.4. Pretraining word2vec
  * 14.5. Word Embedding with Global Vectors (GloVe)
  * 14.6. Subword Embedding
  * 14.7. Finding Synonyms and Analogies
  * 14.8. Bidirectional Encoder Representations from Transformers (BERT)
  * 14.9. The Dataset for Pretraining BERT
  * 14.10. Pretraining BERT
* 15. Natural Language Processing: Applications
  * 15.1. Sentiment Analysis and the Dataset
  * 15.2. Sentiment Analysis: Using Recurrent Neural Networks
  * 15.3. Sentiment Analysis: Using Convolutional Neural Networks
  * 15.4. Natural Language Inference and the Dataset
  * 15.5. Natural Language Inference: Using Attention
  * 15.6. Fine-Tuning BERT for Sequence-Level and Token-Level Applications
  * 15.7. Natural Language Inference: Fine-Tuning BERT
* 16. Recommender Systems
  * 16.1. Overview of Recommender Systems
  * 16.2. The MovieLens Dataset
  * 16.3. Matrix Factorization
  * 16.4. AutoRec: Rating Prediction with Autoencoders
  * 16.5. Personalized Ranking for Recommender Systems
  * 16.6. Neural Collaborative Filtering for Personalized Ranking
  * 16.7. Sequence-Aware Recommender Systems
  * 16.8. Feature-Rich Recommender Systems
  * 16.9. Factorization Machines
  * 16.10. Deep Factorization Machines
* 17. Generative Adversarial Networks
  * 17.1. Generative Adversarial Networks
  * 17.2. Deep Convolutional Generative Adversarial Networks
* 18. Appendix: Mathematics for Deep Learning
  * 18.1. Geometry and Linear Algebraic Operations
  * 18.2. Eigendecompositions
  * 18.3. Single Variable Calculus
  * 18.4. Multivariable Calculus
  * 18.5. Integral Calculus
  * 18.6. Random Variables
  * 18.7. Maximum Likelihood
  * 18.8. Distributions
  * 18.9. Naive Bayes
  * 18.10. Statistics
  * 18.11. Information Theory
* 19. Appendix: Tools for Deep Learning
  * 19.1. Using Jupyter
  * 19.2. Using Amazon SageMaker
  * 19.3. Using AWS EC2 Instances
  * 19.4. Using Google Colab
  * 19.5. Selecting Servers and GPUs
  * 19.6. Contributing to This Book
  * 19.7. d2l API Document
* References

