\(\newcommand{\vect}[1]{\boldsymbol{#1}}\) \(\newcommand{\transp}{^{\text{T}}}\) \(\newcommand{\mat}[1]{\boldsymbol{\mathcal{#1}}}\) \(\newcommand{\sign}{\text{sign}}\)
In this tutorial, we mainly use the MNIST dataset to explore deep neural network (DNN) models for classification. At the end of this tutorial, you should be comfortable using a software package (here keras) to run different models for a classification task. You will explore different models by tuning the hyperparameters of the DNN: the network size (number of layers and nodes per layer), the activation functions, the optimizer, batch normalization, regularization, dropout, weight initialization, and early stopping.
First you have to install the keras R package using
install.packages("keras")
(you may also need to run install_keras() afterwards to set up the underlying TensorFlow backend).
We load some extra packages that you have to install too.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(keras)
library(dslabs)
library(tidyr)
library(stringr)
library(purrr)
The MNIST data are available from the dslabs package:
# Import MNIST training data
mnist <- dslabs::read_mnist()
mnist_x <- mnist$train$images
mnist_y <- mnist$train$labels
mnist_x_test <- mnist$test$images
mnist_y_test <- mnist$test$labels
dim(mnist_x)
[1] 60000 784
dim(mnist_x_test)
[1] 10000 784
length(mnist_y)
[1] 60000
length(mnist_y_test)
[1] 10000
Reminder: Keep in mind that all features need to be numeric to run a feedforward DNN. If you have categorical features, you must transform them into numerical values, for example with one-hot encoding.
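MNIST pixels are already numeric, but as a quick illustration, here is a minimal sketch of one-hot encoding a hypothetical categorical feature with base R's model.matrix() (the data frame and column are made up for the example):
# Hypothetical data frame with one categorical feature
df_example <- data.frame(color = c("red", "green", "blue", "red"))
# One-hot encode the factor levels (drop the intercept with - 1)
model.matrix(~ color - 1, data = df_example)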
The images are in grayscale, with pixel intensities ranging from 0 to 255. We therefore rescale the features to \([0, 1]\) by dividing by the maximum intensity, 255.
colnames(mnist_x) <- paste0("V", 1:ncol(mnist_x))
mnist_x <- mnist_x / 255
colnames(mnist_x_test) <- paste0("V", 1:ncol(mnist_x_test))
mnist_x_test <- mnist_x_test / 255
For a multi-class model (multinomial response, digits 0 to 9), Keras expects a one-hot encoded outcome:
# One-hot encode response
mnist_y <- to_categorical(mnist_y, 10)
dim(mnist_y)
[1] 60000 10
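To see what to_categorical() does, here is a toy example on a short vector of labels (each row has a 1 in the column of its class):
# One-hot encode the labels 0, 1 and 3 over 10 classes
to_categorical(c(0, 1, 3), num_classes = 10)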
We build the network architecture with keras_model_sequential():
model <- keras_model_sequential() %>%
layer_dense(units = 128, input_shape = ncol(mnist_x)) %>%
layer_dense(units = 64) %>%
layer_dense(units = 10)
Here, we have two hidden layers with 128 and 64 units, respectively.
Note the input_shape argument, which gives the number of features in the data (here 784).
We also need to choose activation functions:
model <- keras_model_sequential() %>%
layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax")
Here, it is natural to choose the softmax function for the output layer, and ReLU is the most common choice of activation for the hidden layers.
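As a reminder of what these activations compute, here is a minimal sketch in base R (for intuition only; Keras applies them internally):
# ReLU: element-wise max(0, x)
relu <- function(x) pmax(0, x)
# Softmax: exponentiate and normalize so the outputs sum to 1
softmax <- function(x) exp(x) / sum(exp(x))
relu(c(-1, 0, 2))    # 0 0 2
softmax(c(1, 2, 3))  # probabilities summing to 1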
Now, we have to define the objective function to optimize and the optimizer used to find a solution. The natural choice here is the cross-entropy loss for a categorical outcome. You have learned in lecture 3 the different variants of gradient descent; Keras offers several optimizers (SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam).
In this example we will use the RMSprop optimizer.
model <- keras_model_sequential() %>%
# Network architecture
layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax") %>%
# Backpropagation
compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
Note that we also define a metric on which the model will be evaluated; here we use accuracy to assess the performance of the model.
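Swapping the optimizer only requires changing the optimizer argument of compile(); as an illustrative sketch, the same architecture compiled with Adam instead of RMSprop:
# Same architecture, compiled with the Adam optimizer
model_adam <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = c('accuracy')
  )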
summary(model)
Model: "sequential_2"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_8 (Dense) (None, 128) 100480
________________________________________________________________________________
dense_7 (Dense) (None, 64) 8256
________________________________________________________________________________
dense_6 (Dense) (None, 10) 650
================================================================================
Total params: 109,386
Trainable params: 109,386
Non-trainable params: 0
________________________________________________________________________________
We will train our model over 25 epochs with a batch size of 128. We also use \(20\%\) of our data for the evaluation step during the training phase, meaning that \(60,000\times 0.2=12,000\) samples are used for the validation step while 48,000 samples are used for the optimization step.
Reminder: An epoch describes one pass of the algorithm through the entire training set. So with 48,000 training samples and a batch size of 128, one epoch is achieved after 375 batch updates.
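A quick check of that arithmetic:
# 80% of the 60,000 samples are used for optimization, in batches of 128
n_train <- 60000 * 0.8   # 48,000 training samples
n_train / 128            # 375 batch updates per epoch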
# Train the model
set.seed(1)
fit1 <- model %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 25,
batch_size = 128,
validation_split = 0.2,
verbose = FALSE
)
# Display output
fit1
Final epoch (plot to see history):
loss: 0.002657
accuracy: 0.9991
val_loss: 0.1722
val_accuracy: 0.9773
plot(fit1)
`geom_smooth()` using formula 'y ~ x'
We can see that the loss function improves rapidly. However, there is a potential overfit after about 10 epochs: the accuracy on the validation set plateaus while the training loss keeps decreasing.
We can predict now the class (digits) for a new image:
model %>% predict_classes(mnist_x_test[1:10,])
[1] 7 2 1 0 4 1 4 9 6 9
Compared to the true labels:
mnist_y_test[1:10]
[1] 7 2 1 0 4 1 4 9 5 9
It looks good: only one of the first ten digits (a 5 predicted as a 6) is misclassified. Let us now look at the whole test set:
pred <- model %>% predict_classes(mnist_x_test)
table(pred, mnist_y_test)
mnist_y_test
pred 0 1 2 3 4 5 6 7 8 9
0 973 0 2 0 2 2 4 2 2 4
1 0 1114 0 0 0 1 2 2 0 1
2 0 5 1007 1 5 0 1 13 6 0
3 3 1 6 997 1 9 1 2 7 10
4 0 0 1 0 940 1 3 1 2 2
5 1 2 0 4 2 869 5 0 3 2
6 1 4 3 0 5 4 940 0 0 1
7 1 3 4 3 4 0 0 1001 4 7
8 1 6 9 3 2 3 2 3 947 5
9 0 0 0 2 21 3 0 4 3 977
sum(diag(table(pred,mnist_y_test)))/10000
[1] 0.9765
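Note that predict_classes() is no longer available in more recent versions of keras; if that is the case for your installation, here is a sketch of an equivalent using predict() directly:
# Predicted class = column with the highest predicted probability
# (subtract 1 because R columns are 1-indexed while the digits start at 0)
prob <- model %>% predict(mnist_x_test)
pred <- apply(prob, 1, which.max) - 1
head(pred, 10)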
library(caret)
res <- confusionMatrix(factor(pred), factor(mnist_y_test))
res$table
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 973 0 2 0 2 2 4 2 2 4
1 0 1114 0 0 0 1 2 2 0 1
2 0 5 1007 1 5 0 1 13 6 0
3 3 1 6 997 1 9 1 2 7 10
4 0 0 1 0 940 1 3 1 2 2
5 1 2 0 4 2 869 5 0 3 2
6 1 4 3 0 5 4 940 0 0 1
7 1 3 4 3 4 0 0 1001 4 7
8 1 6 9 3 2 3 2 3 947 5
9 0 0 0 2 21 3 0 4 3 977
res$overall[1]
Accuracy
0.9765
We will explore different model sizes by varying the number of hidden layers from 1 to 3 and the number of neurons per layer. Complex models have a higher capacity to learn more features and patterns in the data, but they can also overfit the training data. We try to achieve a high validation performance while minimizing the complexity of our model. The following table presents the 9 models we will explore:
Size | 1 layer | 2 layers | 3 layers |
---|---|---|---|
small | 16 | 16, 8 | 16, 8, 4 |
medium | 64 | 64, 32 | 64, 32, 16 |
large | 256 | 256, 128 | 256, 128, 64 |
With 9 models to run, it is better to wrap the compile and fit steps into helper functions.
compiler <- function(object) {
compile(
object,
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
}
trainer <- function(object) {
fit(
object,
x = mnist_x,
y = mnist_y,
epochs = 25,
batch_size = 128,
validation_split = .2,
verbose = FALSE
)
}
# One layer models -------------------------------------------------------------
# small model
`1 layer_small` <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# medium
`1 layer_medium` <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# large
`1 layer_large` <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
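The 2-layer and 3-layer models without batch normalization follow exactly the same pattern as above; as a sketch, here are the two small-capacity variants (the medium and large ones are analogous, with the unit counts from the table):
# Two layers, small capacity
`2 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()
# Three layers, small capacity
`3 layer_small` <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compiler() %>%
  trainer()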
We can plot the results:
models <- ls(pattern = "layer_")
df <- models %>%
map(get) %>%
map(~ data.frame(
`Validation error` = .$metrics$val_loss,
`Training error` = .$metrics$loss,
epoch = seq_len(.$params$epoch)
)) %>%
map2_df(models, ~ mutate(.x, model = .y)) %>%
separate(model, into = c("Middle layers", "Number of nodes"), sep = "_") %>%
gather(Validation, Loss, Validation.error:Training.error) %>%
mutate(
Validation = str_replace_all(Validation, "\\.", " "),
`Number of nodes` = factor(`Number of nodes`, levels = c("small", "medium", "large"))
)
best <- df %>%
filter(Validation == "Validation error") %>%
group_by(`Middle layers`, `Number of nodes`) %>%
filter(Loss == min(Loss)) %>%
mutate(label = paste("Min validation error:", round(Loss, 4)))
ggplot(df, aes(epoch, Loss)) +
geom_hline(data = best, aes(yintercept = Loss), lty = "dashed", color = "grey50") +
geom_text(data = best, aes(x = 25, y = 0.95, label = label), size = 4, hjust = 1, vjust = 1) +
geom_point(aes(color = Validation)) +
geom_line(aes(color = Validation)) +
facet_grid(`Number of nodes` ~ `Middle layers`, scales = "free_y") +
scale_y_continuous(limits = c(0, 1)) +
theme(legend.title = element_blank(),
legend.position = "top") +
xlab("Epoch")
You should get a figure similar to the following one:
What are your conclusions from this experiment? Which models show signs of overfitting? Which models would you keep?
Here we will add a batch normalization step after each hidden layer, as in the following code.
model_w_norm <- keras_model_sequential() %>%
# Network architecture with batch normalization
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 128, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
# Backpropagation
compile(
loss = "categorical_crossentropy",
optimizer = optimizer_rmsprop(),
metrics = c("accuracy")
)
Question: What is the complexity of this model (number of parameters)?
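One way to check (the counts in the comments are just the standard formulas, not Keras output):
# Inspect the trainable and non-trainable parameter counts
summary(model_w_norm)
# Dense layers: weights + biases
(784 * 256 + 256) + (256 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10)
# Batch normalization: 4 parameters per unit (gamma, beta, moving mean, moving variance),
# of which only gamma and beta are trainable
4 * (256 + 128 + 64)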
Now we can explore the effect of batch normalization on our 9 models.
# One layer models -------------------------------------------------------------
# small model
`1 layer_small` <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# medium
`1 layer_medium` <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# large
`1 layer_large` <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# Two layer models -------------------------------------------------------------
# small capacity model
`2 layer_small` <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 8, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# medium
`2 layer_medium` <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# large
`2 layer_large` <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 128, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# Three layer models -------------------------------------------------------------
# small capacity model
`3 layer_small` <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 8, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 4, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# medium
`3 layer_medium` <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 16, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
# large
`3 layer_large` <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 128, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
trainer()
models <- ls(pattern = "layer_")
df_batch <- models %>%
map(get) %>%
map(~ data.frame(
`Validation error` = .$metrics$val_loss,
`Training error` = .$metrics$loss,
epoch = seq_len(.$params$epoch)
)) %>%
map2_df(models, ~ mutate(.x, model = .y)) %>%
separate(model, into = c("Middle layers", "Number of nodes"), sep = "_") %>%
gather(Validation, Loss, Validation.error:Training.error) %>%
mutate(
Validation = str_replace_all(Validation, "\\.", " "),
`Number of nodes` = factor(`Number of nodes`, levels = c("small", "medium", "large")),
`Batch normalization` = TRUE
)
Plot the results
df2 <- df %>%
mutate(`Batch normalization` = FALSE) %>%
bind_rows(df_batch) %>%
filter(Validation == "Validation error")
best <- df2 %>%
filter(Validation == "Validation error") %>%
group_by(`Middle layers`, `Number of nodes`) %>%
filter(Loss == min(Loss)) %>%
mutate(label = paste("Min validation error:", round(Loss, 4)))
ggplot(df2, aes(epoch, Loss, color = `Batch normalization`)) +
geom_text(data = best, aes(x = 25, y = 0.95, label = label), size = 4, hjust = 1, vjust = 1) +
geom_point() +
geom_line() +
facet_grid(`Number of nodes` ~ `Middle layers`, scales = "free_y") +
scale_y_continuous(limits = c(0, 1)) +
xlab("Epoch") +
scale_color_discrete("Batch normalization") +
theme(legend.position = "top")
Regularization is generally good practice for dealing with overfitting. Here we explore \(L_1\) and \(L_2\) regularization.
Here is the code for a model with batch normalization and \(L_2\) regularization.
model_w_reg <- keras_model_sequential() %>%
# Network architecture with L2 regularization and batch normalization
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x),
kernel_regularizer = regularizer_l2(0.001)) %>%
layer_batch_normalization() %>%
layer_dense(units = 128, activation = "relu",
kernel_regularizer = regularizer_l2(0.001)) %>%
layer_batch_normalization() %>%
layer_dense(units = 64, activation = "relu",
kernel_regularizer = regularizer_l2(0.001)) %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
# Backpropagation
compile(
loss = "categorical_crossentropy",
optimizer = optimizer_rmsprop(),
metrics = c("accuracy")
)
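Switching to an \(L_1\) or elastic-net penalty only changes the kernel_regularizer argument; here is a minimal sketch with a single hidden layer (the penalty values are illustrative, not tuned):
# Same idea with a combined L1 + L2 (elastic-net) penalty on the weights
model_w_l1l2 <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x),
              kernel_regularizer = regularizer_l1_l2(l1 = 0.001, l2 = 0.001)) %>%
  layer_dense(units = 10, activation = "softmax") %>%
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(),
    metrics = c("accuracy")
  )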
Present the results of the 3-layer model with 256, 128, and 64 nodes per layer, with batch normalization and \(L_2\) regularization.
Final epoch (plot to see history):
loss: 0.1015
accuracy: 0.9846
val_loss: 0.1553
val_accuracy: 0.9707
`geom_smooth()` using formula 'y ~ x'
Did the \(L_2\) regularization improve your model?
Another strategy for dealing with overfitting is dropout.
model_w_drop <- keras_model_sequential() %>%
# Network architecture with 20% dropout
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = "softmax") %>%
# Backpropagation
compile(
loss = "categorical_crossentropy",
optimizer = optimizer_rmsprop(),
metrics = c("accuracy")
)
Here we compare 3 models: baseline, baseline + batch normalization, and baseline + batch normalization + dropout.
fit_baseline <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 35,
batch_size = 128,
validation_split = 0.2,
verbose = FALSE
)
fit_norm <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dense(units = 128, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 35,
batch_size = 128,
validation_split = 0.2,
verbose = FALSE
)
fit_reg <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = "softmax") %>%
compiler() %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 35,
batch_size = 128,
validation_split = 0.2,
verbose = FALSE
)
models <- ls(pattern = "fit_")
df_reg <- models %>%
map(get) %>%
map(~ data.frame(
`Validation error` = .$metrics$val_loss,
`Training error` = .$metrics$loss,
epoch = seq_len(.$params$epoch)
)) %>%
map2_df(models, ~ mutate(.x, model = .y)) %>%
mutate(Model = case_when(
model == "fit_baseline" ~ "Baseline",
model == "fit_norm" ~ "Baseline + batch normalization",
model == "fit_reg" ~ "Baseline + batch normalization + dropout"
)) %>%
gather(Validation, Loss, Validation.error:Training.error)
We plot the results
baseline <- df %>%
filter(`Middle layers` == "3 layer", `Number of nodes` == "large") %>%
mutate(Model = "Baseline") %>%
select(epoch, Model, Validation, Loss)
batch <- df_batch %>%
filter(`Middle layers` == "3 layer", `Number of nodes` == "large") %>%
mutate(Model = "Baseline + batch normalization") %>%
select(epoch, Model, Validation, Loss)
df_reg <- df_reg %>%
select(-model) %>%
filter(Model == "Baseline + batch normalization + dropout") %>%
mutate(Validation = stringr::str_replace_all(Validation, "\\.", " ")) %>%
bind_rows(batch, baseline)
best <- df_reg %>%
filter(Validation == "Validation error") %>%
group_by(Model) %>%
filter(Loss == min(Loss)) %>%
mutate(label = paste("Min validation error:", round(Loss, 4)))
ggplot(df_reg, aes(epoch, Loss)) +
geom_text(data = best, aes(x = 35, y = 0.49, label = label), size = 4, hjust = 1, vjust = 1) +
geom_point(aes(color = Validation)) +
geom_line(aes(color = Validation)) +
facet_wrap(~ Model) +
xlab("Epoch") +
theme(legend.title = element_blank(),
legend.position = "top")
Task: What are your conclusions?
Keep in mind that weight initialization can have an impact on both the convergence rate and the accuracy of your network. In Keras, the kernel_initializer argument of layer_dense() allows us to set different weight initializations. The default is the Glorot uniform initializer, which draws samples from a uniform distribution on \([-a,a]\) with \(a=\sqrt{\frac{6}{\text{fan}_{\text{in}}+\text{fan}_{\text{out}}}}\), where \(\text{fan}_{\text{in}}\) is the number of input units in the weight tensor and \(\text{fan}_{\text{out}}\) is the number of output units.
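For example, the Glorot uniform limit \(a\) for the first hidden layer of our large model (784 inputs, 256 units):
# Glorot uniform limit for a dense layer with 784 inputs and 256 outputs
fan_in  <- 784
fan_out <- 256
a <- sqrt(6 / (fan_in + fan_out))
a   # weights are drawn uniformly from [-a, a]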
fit_reg_init <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", kernel_initializer=initializer_random_normal(),input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, kernel_initializer=initializer_random_normal(),activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 64, kernel_initializer=initializer_random_normal(),activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, kernel_initializer=initializer_random_normal(),activation = "softmax") %>%
compiler() %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 35,
batch_size = 128,
validation_split = 0.2,
verbose = FALSE
)
fit_reg_init
Final epoch (plot to see history):
loss: 0.0419
accuracy: 0.9869
val_loss: 0.07696
val_accuracy: 0.9812
plot(fit_reg_init)
`geom_smooth()` using formula 'y ~ x'
Do you see any differences? Is it better?
You can also adjust the number of epochs by adding callback_early_stopping(patience = 5), which stops training if the monitored validation loss has not improved after 5 epochs.
model_final <- keras_model_sequential() %>%
layer_dense(units = 256, activation = "relu", kernel_initializer=initializer_random_normal(),input_shape = ncol(mnist_x)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, kernel_initializer=initializer_random_normal(),activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 64, kernel_initializer=initializer_random_normal(),activation = "relu") %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, kernel_initializer=initializer_random_normal(),activation = "softmax") %>%
compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
model_final.fit <- model_final %>%
fit(
x = mnist_x,
y = mnist_y,
epochs = 35,
batch_size = 128,
validation_split = 0.2,
callbacks = list(callback_early_stopping(patience = 5)),
verbose = FALSE
)
model_final.fit
Final epoch (plot to see history):
loss: 0.06366
accuracy: 0.9808
val_loss: 0.08111
val_accuracy: 0.9791
# Optimal
min(model_final.fit$metrics$val_loss)
[1] 0.07242727
max(model_final.fit$metrics$val_acc)
[1] 0.97975
plot(model_final.fit)
`geom_smooth()` using formula 'y ~ x'
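Other callbacks can be combined with early stopping; for example, here is a sketch that also reduces the learning rate when the validation loss plateaus (the argument values are purely illustrative):
callbacks_list <- list(
  callback_early_stopping(patience = 5),
  # Halve the learning rate if the validation loss has not improved for 3 epochs
  callback_reduce_lr_on_plateau(monitor = "val_loss", factor = 0.5, patience = 3)
)
# then pass callbacks = callbacks_list to fit()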
Present the performance of your final model on the test set:
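A minimal sketch of code that could produce the confusion matrix below, reusing the approach from earlier (replace predict_classes() with the predict()-based alternative if it is not available in your version):
pred_final <- model_final %>% predict_classes(mnist_x_test)
res_final <- caret::confusionMatrix(factor(pred_final), factor(mnist_y_test))
res_final$table
res_final$overall[1]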
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 969 0 1 1 1 2 3 1 2 6
1 1 1126 1 0 2 0 3 6 0 5
2 1 4 1019 3 4 0 1 11 4 0
3 1 1 1 996 0 11 0 3 5 8
4 1 0 1 0 958 0 1 0 4 13
5 0 1 0 2 0 866 2 0 1 0
6 4 2 0 0 7 7 948 0 2 1
7 1 0 6 4 2 1 0 1001 3 6
8 2 1 3 3 2 3 0 2 952 3
9 0 0 0 1 6 2 0 4 1 967
Accuracy
0.9802