
1 Goals

In this tutorial, we mainly use the MNIST dataset to explore deep neural network (DNN) models for classification. By the end of this tutorial, you should be comfortable using a software package (here keras) to run different models for a classification task. You will explore different models by tuning various hyperparameters of the DNN.

2 MNIST Data set

2.1 Loading package and dataset

First you have to install the keras R package using

install.packages("keras")

We also load some extra packages that you may need to install first.

library(dplyr)         

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(keras)         
library(dslabs)
library(tidyr)
library(stringr)
library(purrr)
library(ggplot2)       # used below for plotting the training histories

The MNIST data are available from the dslabs package.

# Import MNIST training data
mnist <- dslabs::read_mnist()
mnist_x <- mnist$train$images
mnist_y <- mnist$train$labels
mnist_x_test <- mnist$test$images
mnist_y_test <- mnist$test$labels
dim(mnist_x)
[1] 60000   784
dim(mnist_x_test)
[1] 10000   784
length(mnist_y)
[1] 60000
length(mnist_y_test)
[1] 10000

Reminder: Keep in mind that all features need to be numeric to run a feedforward DNN. If you have categorical features, you have to transform them into numerical values, for example with one-hot encoding.
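The MNIST pixel values are already numeric, so nothing needs to be done here. As an illustration only, a hypothetical categorical feature x_cat could be one-hot encoded in base R as follows:

x_cat <- factor(c("red", "green", "blue", "green"))   # hypothetical categorical feature
model.matrix(~ x_cat - 1)                             # one indicator column per level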

2.1.1 Scale the data set

The images are in gray scale, with pixel intensities ranging from 0 to 255. We therefore scale the data by dividing by the maximum pixel value (255).

colnames(mnist_x) <- paste0("V", 1:ncol(mnist_x))
mnist_x <- mnist_x / 255
colnames(mnist_x_test) <- paste0("V", 1:ncol(mnist_x_test))
mnist_x_test <- mnist_x_test / 255
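As a quick sanity check, the rescaled pixel values should now lie between 0 and 1:

range(mnist_x)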

2.1.2 Transform the outcome

For a multi-class classification model (multinomial response, digits 0 to 9), Keras expects a one-hot encoded outcome.

# One-hot encode response
mnist_y <- to_categorical(mnist_y, 10)
dim(mnist_y)
[1] 60000    10
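You can check the correspondence between the original labels and their one-hot rows; each row of mnist_y contains a single 1, in the column matching the digit:

mnist$train$labels[1:3]
mnist_y[1:3, ]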

2.2 Implementing a DNN using Keras

2.2.1 Procedure

  • Initialize a sequential feed-forward DNN using keras_model_sequential()
  • Add some dense layers.
model <- keras_model_sequential() %>%
  layer_dense(units = 128, input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 64) %>%
  layer_dense(units = 10)

Here, we have two hidden layers and an output layer:

  • 128 neurons for the first layer
  • 64 for the second
  • 10 neurons for the output layer

Note the input_shape argument, which specifies the number of features in the input data (here 784).

2.2.2 Activation

We now need to choose the activation functions:

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

Here, it is natural to choose the softmax function for the output layer; ReLU is the most common activation for the hidden layers.
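To make these two functions concrete, here is a small R sketch of what each activation computes (keras applies them internally; these helper functions are only for illustration):

relu    <- function(x) pmax(0, x)               # ReLU: element-wise max(0, x)
softmax <- function(x) exp(x) / sum(exp(x))     # softmax: positive weights summing to 1

z <- c(-1, 0, 2)
relu(z)       # 0 0 2
softmax(z)    # class probabilities for a 3-class example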

2.2.3 Backpropagation and Optimizer

Now, we have to define the objective function to optimize and the optimizer used to find a solution. The natural choice here is the cross-entropy loss for a categorical outcome. You have learned in lecture 3 about the different variants of gradient descent. Keras offers several optimizers:

  • Stochastic gradient descent (sgd) optimizer
  • Adaptive Moment Estimation (adam)
  • rmsprop
  • Adaptive learning rate (Adadelta)

In this example we will use the rmsprop optimizer.

model <- keras_model_sequential() %>%
  
  # Network architecture
  layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  
  # Backpropagation
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(),
    metrics = c('accuracy')
  )
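If you want to try another optimizer, only the optimizer argument of compile() changes; for example, a sketch using adam with its default settings (model_adam is an illustrative name):

model_adam <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),   # instead of optimizer_rmsprop()
    metrics = c('accuracy')
  )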

Note that we also define a metric on which the model will be evaluated; here we use the “accuracy” metric to assess the performance of the model.

2.2.4 Summary of our model

summary(model)
Model: "sequential_2"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
dense_8 (Dense)                     (None, 128)                     100480      
________________________________________________________________________________
dense_7 (Dense)                     (None, 64)                      8256        
________________________________________________________________________________
dense_6 (Dense)                     (None, 10)                      650         
================================================================================
Total params: 109,386
Trainable params: 109,386
Non-trainable params: 0
________________________________________________________________________________

2.2.5 Train our model

We will train our model for 25 epochs with a batch size of 128. We also use \(20\%\) of our data for evaluation during the training phase, meaning that \(60,000\times 0.2=12,000\) samples are used for the validation step while 48,000 samples are used for the optimization step.

Reminder: An epoch is one pass of the algorithm through the entire training set. With 48,000 training samples and a batch size of 128, one epoch corresponds to 375 batch updates.
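A quick check of that arithmetic (48,000 training samples after the validation split, in batches of 128):

48000 / 128   # 375 batch updates per epoch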

# Train the model
set.seed(1)
fit1 <- model %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 25,
    batch_size = 128,
    validation_split = 0.2,
    verbose = FALSE
  )

# Display output
fit1

Final epoch (plot to see history):
        loss: 0.002657
    accuracy: 0.9991
    val_loss: 0.1722
val_accuracy: 0.9773 
plot(fit1)
`geom_smooth()` using formula 'y ~ x'

Figure 1: Training and validation performance over 25 epochs

We can see that the loss function improves rapidly. However, there is a potential overfit after about 10 epochs: the validation accuracy flattens out after 10 epochs while the training loss keeps improving.

2.2.6 Prediction

We can predict now the class (digits) for a new image:

model %>% predict_classes(mnist_x_test[1:10, ])
 [1] 7 2 1 0 4 1 4 9 6 9

Compared to the true label

mnist_y_test[1:10]
 [1] 7 2 1 0 4 1 4 9 5 9

It looks good: only the ninth image is misclassified (a 5 predicted as a 6).

pred <- model %>% predict_classes(mnist_x_test)
table(pred, mnist_y_test)
    mnist_y_test
pred    0    1    2    3    4    5    6    7    8    9
   0  973    0    2    0    2    2    4    2    2    4
   1    0 1114    0    0    0    1    2    2    0    1
   2    0    5 1007    1    5    0    1   13    6    0
   3    3    1    6  997    1    9    1    2    7   10
   4    0    0    1    0  940    1    3    1    2    2
   5    1    2    0    4    2  869    5    0    3    2
   6    1    4    3    0    5    4  940    0    0    1
   7    1    3    4    3    4    0    0 1001    4    7
   8    1    6    9    3    2    3    2    3  947    5
   9    0    0    0    2   21    3    0    4    3  977
sum(diag(table(pred,mnist_y_test)))/10000
[1] 0.9765

2.2.7 Performance on test set

library(caret)
res <- confusionMatrix(factor(pred), factor(mnist_y_test))
res$table
          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0  973    0    2    0    2    2    4    2    2    4
         1    0 1114    0    0    0    1    2    2    0    1
         2    0    5 1007    1    5    0    1   13    6    0
         3    3    1    6  997    1    9    1    2    7   10
         4    0    0    1    0  940    1    3    1    2    2
         5    1    2    0    4    2  869    5    0    3    2
         6    1    4    3    0    5    4  940    0    0    1
         7    1    3    4    3    4    0    0 1001    4    7
         8    1    6    9    3    2    3    2    3  947    5
         9    0    0    0    2   21    3    0    4    3  977
res$overall[1]
Accuracy 
  0.9765 
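Alternatively, keras can compute the test loss and accuracy directly with evaluate(); a minimal sketch (it needs one-hot encoded test labels, here stored in an illustrative variable mnist_y_test_onehot):

mnist_y_test_onehot <- to_categorical(mnist_y_test, 10)
model %>% evaluate(mnist_x_test, mnist_y_test_onehot, verbose = 0)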

2.3 Improve our model by tuning some parameters

2.3.1 Model complexity

We will explore different model sizes by varying the number of hidden layers from 1 to 3 and the number of neurons per layer. Complex models have a higher capacity to learn features and patterns in the data, but they can also overfit the training data. We try to achieve high validation performance while keeping the complexity of the model as low as possible. The following table presents the 9 models we will explore:

Table 1: Nine models of different complexity, according to the number of hidden layers and the number of nodes per layer.

Size     1 layer   2 layers    3 layers
small    16        16, 8       16, 8, 4
medium   64        64, 32      64, 32, 16
large    256       256, 128    256, 128, 64

We have 9 models to run, so it is better to wrap the common compilation and training steps into helper functions.

2.3.2 One-layer models

# Helper: compile a model with the loss, optimizer and metric used throughout
compiler <- function(object) {
  compile(
    object,
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(),
    metrics = c('accuracy')
  )
}

# Helper: train a compiled model with the fit() settings used above
trainer <- function(object) {
  fit(
    object,
    x = mnist_x,
    y = mnist_y,
    epochs = 25,
    batch_size = 128,
    validation_split = .2,
    verbose = FALSE
    )
}

# One layer models -------------------------------------------------------------
# small  model
`1 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# medium
`1 layer_medium` <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# large
`1 layer_large` <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

We can plot the results:

models <- ls(pattern = "layer_") 
df <- models %>%
  map(get) %>%
  map(~ data.frame(
    `Validation error` = .$metrics$val_loss,
    `Training error`   = .$metrics$loss,
    epoch = seq_len(.$params$epoch)
    )) %>%
  map2_df(models, ~ mutate(.x, model = .y)) %>%
  separate(model, into = c("Middle layers", "Number of nodes"), sep = "_") %>%
  gather(Validation, Loss, Validation.error:Training.error) %>%
  mutate(
    Validation = str_replace_all(Validation, "\\.", " "),
    `Number of nodes` = factor(`Number of nodes`, levels = c("small", "medium", "large"))
    )

best <- df %>% 
  filter(Validation == "Validation error") %>%
  group_by(`Middle layers`, `Number of nodes`) %>% 
  filter(Loss == min(Loss)) %>%
  mutate(label = paste("Min validation error:", round(Loss, 4)))

ggplot(df, aes(epoch, Loss)) +
  geom_hline(data = best, aes(yintercept = Loss), lty = "dashed", color = "grey50") +
  geom_text(data = best, aes(x = 25, y = 0.95, label = label), size = 4, hjust = 1, vjust = 1) + 
  geom_point(aes(color = Validation)) +
  geom_line(aes(color = Validation)) +
  facet_grid(`Number of nodes` ~ `Middle layers`, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.title = element_blank(),
        legend.position = "top") +
  xlab("Epoch")

2.4 Task 1

  • Repeat it for the 2-layer and 3-layer models.
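As a starting point, a sketch of the small 2-layer model (16 and 8 nodes, following Table 1), reusing the compiler() and trainer() helpers:

`2 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()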

You should get a figure similar to the following one:

2.5 Task 2

What are your conclusions from this experiment? Which models present overfitting issues? Which models would you keep?

2.6 Batch normalization

Here we add a batch normalization step after each hidden layer, as in the following code.

model_w_norm <- keras_model_sequential() %>%
  
  # Network architecture with batch normalization
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%

  # Backpropagation
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(),
    metrics = c("accuracy")
  )

Question: What is the complexity of this model (number of parameters)?
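To answer this, you can inspect the parameter count directly (the batch normalization layers contribute their own parameters):

summary(model_w_norm)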

Now we can explore the effect of batch normalization on our 9 models.

# One layer models -------------------------------------------------------------
# small  model
`1 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# medium
`1 layer_medium` <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# large
`1 layer_large` <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# Two layer models -------------------------------------------------------------
# small capacity model
`2 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# medium
`2 layer_medium` <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# large
`2 layer_large` <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# Three layer models -------------------------------------------------------------
# small capacity model
`3 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# medium
`3 layer_medium` <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

# large
`3 layer_large` <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()

models <- ls(pattern = "layer_") 
df_batch <- models %>%
  map(get) %>%
  map(~ data.frame(
    `Validation error` = .$metrics$val_loss,
    `Training error`   = .$metrics$loss,
    epoch = seq_len(.$params$epoch)
    )) %>%
  map2_df(models, ~ mutate(.x, model = .y)) %>%
  separate(model, into = c("Middle layers", "Number of nodes"), sep = "_") %>%
  gather(Validation, Loss, Validation.error:Training.error) %>%
  mutate(
    Validation = str_replace_all(Validation, "\\.", " "),
    `Number of nodes` = factor(`Number of nodes`, levels = c("small", "medium", "large")),
    `Batch normalization` = TRUE
    )

We plot the results:

df2 <- df %>%
  mutate(`Batch normalization` = FALSE) %>%
  bind_rows(df_batch) %>% 
  filter(Validation == "Validation error")

best <- df2 %>% 
  filter(Validation == "Validation error") %>%
  group_by(`Middle layers`, `Number of nodes`) %>% 
  filter(Loss == min(Loss)) %>%
  mutate(label = paste("Min validation error:", round(Loss, 4)))

ggplot(df2, aes(epoch, Loss, color = `Batch normalization`)) + 
  geom_text(data = best, aes(x = 25, y = 0.95, label = label), size = 4, hjust = 1, vjust = 1) + 
  geom_point() +
  geom_line() +
  facet_grid(`Number of nodes` ~ `Middle layers`, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1)) +
  xlab("Epoch") +
  scale_color_discrete("Batch normalization") +
  theme(legend.position = "top")

Figure 2: The effect of batch normalization on validation loss for various model capacities

2.7 Regularization

Regularization is generally good practice to address overfitting. Here we explore \(L_1\) and \(L_2\) regularization.

Below is the code for a model with batch normalization and \(L_2\) regularization.

model_w_reg <- keras_model_sequential() %>%
  
  # Network architecture with L2 regularization and batch normalization
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x),
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 128, activation = "relu", 
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 64, activation = "relu", 
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%

  # Backpropagation
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(),
    metrics = c("accuracy")
  )
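Switching to an \(L_1\) penalty only changes the kernel_regularizer argument; a minimal sketch for a single-hidden-layer model (the penalty value 0.001 is kept only for comparability, and model_w_l1 is an illustrative name):

model_w_l1 <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x),
              kernel_regularizer = regularizer_l1(0.001)) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler()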

2.7.1 Task

Present the results of the 3-layer model with 256, 128, and 64 nodes in the respective layers, with batch normalization and \(L_2\) regularization.
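A minimal sketch of how such a fit could be run and inspected, assuming the model_w_reg defined above and the trainer() helper (history_w_reg is just an illustrative name):

history_w_reg <- model_w_reg %>% trainer()
history_w_reg
plot(history_w_reg)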


Final epoch (plot to see history):
        loss: 0.1015
    accuracy: 0.9846
    val_loss: 0.1553
val_accuracy: 0.9707 
`geom_smooth()` using formula 'y ~ x'

Did the \(L_2\) regularization improve your model?

2.8 Dropout

Another way to address overfitting is to use the dropout strategy.

model_w_drop <- keras_model_sequential() %>%
  
  # Network architecture with 20% dropout
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax") %>%

  # Backpropagation
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(),
    metrics = c("accuracy")
  )

Here we compare three models: baseline, batch normalization, and dropout + batch normalization.

fit_baseline <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 35,
    batch_size = 128,
    validation_split = 0.2,
    verbose = FALSE
  )

fit_norm <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 35,
    batch_size = 128,
    validation_split = 0.2,
    verbose = FALSE
  )

fit_reg <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 35,
    batch_size = 128,
    validation_split = 0.2,
    verbose = FALSE
  )

models <- ls(pattern = "fit_") 
df_reg <- models %>%
  map(get) %>%
  map(~ data.frame(
    `Validation error` = .$metrics$val_loss,
    `Training error`   = .$metrics$loss,
    epoch = seq_len(.$params$epoch)
    )) %>%
  map2_df(models, ~ mutate(.x, model = .y)) %>%
  mutate(Model = case_when(
    model == "fit_baseline" ~ "Baseline",
    model == "fit_norm"     ~ "Baseline + batch normalization",
    model == "fit_reg"      ~ "Baseline + batch normalization + dropout"
  )) %>%
  gather(Validation, Loss, Validation.error:Training.error)

We plot the results:

baseline <- df %>%
  filter(`Middle layers` == "3 layer", `Number of nodes` == "large") %>%
  mutate(Model = "Baseline") %>%
  select(epoch, Model, Validation, Loss)
batch <- df_batch %>%
  filter(`Middle layers` == "3 layer", `Number of nodes` == "large") %>%
  mutate(Model = "Baseline + batch normalization") %>%
  select(epoch, Model, Validation, Loss)
df_reg <- df_reg %>%
  select(-model) %>%
  filter(Model == "Baseline + batch normalization + dropout") %>%
  mutate(Validation = stringr::str_replace_all(Validation, "\\.", " ")) %>%
  bind_rows(batch, baseline)

best <- df_reg %>% 
  filter(Validation == "Validation error") %>%
  group_by(Model) %>% 
  filter(Loss == min(Loss)) %>%
  mutate(label = paste("Min validation error:", round(Loss, 4)))

ggplot(df_reg, aes(epoch, Loss)) + 
  geom_text(data = best, aes(x = 35, y = 0.49, label = label), size = 4, hjust = 1, vjust = 1) +
  geom_point(aes(color = Validation)) +
  geom_line(aes(color = Validation)) +
  facet_wrap(~ Model) +
  xlab("Epoch") +
  theme(legend.title = element_blank(),
        legend.position = "top")

Figure 3: The effect of regularization with dropout on validation loss

Task: What are your conclusions?

2.9 What about the initialization of the weights

Keep in mind that weight initialization can have an impact on both the convergence rate and the accuracy of your network. In Keras, the kernel_initializer argument of layer_dense lets us choose a different weight initialization. The default is the Glorot uniform initializer, which draws samples from a uniform distribution on \([-a,a]\) with \(a=\sqrt{\frac{6}{\text{fan}_{\text{in}}+\text{fan}_{\text{out}}}}\), where \(\text{fan}_{\text{in}}\) is the number of input units in the weight tensor and \(\text{fan}_{\text{out}}\) is the number of output units.
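For reference, a short sketch of how other built-in initializers are passed (He normal is a common alternative for ReLU layers; model_init is just an illustrative name). The model below then uses initializer_random_normal() instead of the default:

model_init <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x),
              kernel_initializer = initializer_he_normal()) %>%       # He initialization
  layer_dense(units = 10, activation = "softmax",
              kernel_initializer = initializer_glorot_uniform())      # the Keras default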

fit_reg_init <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu",
              kernel_initializer = initializer_random_normal(),
              input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu",
              kernel_initializer = initializer_random_normal()) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu",
              kernel_initializer = initializer_random_normal()) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax",
              kernel_initializer = initializer_random_normal()) %>%
  compiler() %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 35,
    batch_size = 128,
    validation_split = 0.2,
    verbose = FALSE
  )
fit_reg_init

Final epoch (plot to see history):
        loss: 0.0419
    accuracy: 0.9869
    val_loss: 0.07696
val_accuracy: 0.9812 
plot(fit_reg_init)
`geom_smooth()` using formula 'y ~ x'

Do you see any differences? Is it better?

2.10 Early stopping

You can also adjust the number of epochs by adding callback_early_stopping(patience = 5), which stops training if the validation loss has not improved after 5 epochs.
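The callback itself is just an object passed to fit() via the callbacks argument; a minimal sketch of its main arguments (val_loss is the monitored quantity by default):

early_stop <- callback_early_stopping(
  monitor = "val_loss",   # quantity to watch
  patience = 5            # epochs without improvement before stopping
)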

model_final <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu",
              kernel_initializer = initializer_random_normal(),
              input_shape = ncol(mnist_x)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu",
              kernel_initializer = initializer_random_normal()) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu",
              kernel_initializer = initializer_random_normal()) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax",
              kernel_initializer = initializer_random_normal()) %>%
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(),
    metrics = c('accuracy')
  ) 

model_final.fit <- model_final %>%
  fit(
    x = mnist_x,
    y = mnist_y,
    epochs = 35,
    batch_size = 128,
    validation_split = 0.2,
    callbacks = list(callback_early_stopping(patience = 5)),
    verbose = FALSE
  )

model_final.fit

Final epoch (plot to see history):
        loss: 0.06366
    accuracy: 0.9808
    val_loss: 0.08111
val_accuracy: 0.9791 
# Optimal
min(model_final.fit$metrics$val_loss)
[1] 0.07242727
max(model_final.fit$metrics$val_acc)
[1] 0.97975
plot(model_final.fit)
`geom_smooth()` using formula 'y ~ x'

Figure 4: Training and validation performance on our 3-layer large network with dropout

2.11 Final Task

Present the performance of your final model on the test set:

  • confusion matrix
  • accuracy
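A minimal sketch of how these could be obtained from the final model, following the same approach as in section 2.2.7 (pred_final and res_final are illustrative names):

pred_final <- model_final %>% predict_classes(mnist_x_test)
res_final  <- confusionMatrix(factor(pred_final), factor(mnist_y_test))
res_final$table        # confusion matrix
res_final$overall[1]   # accuracy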

2.11.1 Performance

          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0  969    0    1    1    1    2    3    1    2    6
         1    1 1126    1    0    2    0    3    6    0    5
         2    1    4 1019    3    4    0    1   11    4    0
         3    1    1    1  996    0   11    0    3    5    8
         4    1    0    1    0  958    0    1    0    4   13
         5    0    1    0    2    0  866    2    0    1    0
         6    4    2    0    0    7    7  948    0    2    1
         7    1    0    6    4    2    1    0 1001    3    6
         8    2    1    3    3    2    3    0    2  952    3
         9    0    0    0    1    6    2    0    4    1  967
Accuracy 
  0.9802