How does dropout help to avoid overfitting in neural networks? (2024)

As we know, a flexible neural network trained on the same dataset for many epochs will overfit, because its decision boundaries become too flexible. So rather than training a single network for N epochs, we could fit many different neural networks on the same dataset and average their predictions. But this kind of ensembling requires additional computational resources that are not feasible in practice.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout, and it offers a very computationally cheap and remarkably effective form of regularization.
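
To make the mechanism concrete, here is a minimal NumPy sketch of "inverted dropout" as it is commonly implemented (this is an illustration, not code from the article; the function name dropout_forward and the values used are assumptions): each unit is kept with probability 1 - rate, and the surviving activations are rescaled so their expected value is unchanged.

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, rate=0.5):
    # Each unit survives with probability keep_prob = 1 - rate.
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Zero the dropped units and rescale survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

a = np.ones((1, 8))
print(dropout_forward(a, rate=0.5))  # roughly half the entries become 0.0, the survivors become 2.0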

Implementing Dropout using Keras

Dropout can be implemented on any or all hidden layers in the network as well as the visible or input layer. It is not used on the output layer.

Dropout can be applied to the input neurons, known as the visible layer.

In the example below we add a new Dropout layer between the input (or visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

Additionally, as recommended in the original paper on Dropout, a constraint is imposed on the weights for each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. This is done by setting the kernel_constraint argument on the Dense class when constructing the layers.

The learning rate was lifted by one order of magnitude and the momentum was increased to 0.9. Both of these increases were also recommended in the original Dropout paper.

# create model (wrapped in a function so the final `return model` line is valid;
# the function name create_model and the imports assume the Keras 2-era API used in this example)
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import MaxNorm
from keras.optimizers import SGD

def create_model():
    model = Sequential()
    # 20% dropout on the visible (input) layer of 60 features
    model.add(Dropout(0.2, input_shape=(60,)))
    model.add(Dense(60, kernel_initializer='normal', activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dense(30, kernel_initializer='normal', activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model with SGD: learning rate raised to 0.1, momentum 0.9
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model
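
For context, a minimal, hypothetical usage sketch of the model defined above; the synthetic X and y arrays, the epoch count, and the batch size are placeholder assumptions, not values from the article.

import numpy as np

# Hypothetical stand-in data: 200 samples with 60 input features and binary labels.
X = np.random.rand(200, 60)
y = np.random.randint(0, 2, size=(200,))

model = create_model()
model.fit(X, y, epochs=50, batch_size=16, verbose=0)
loss, accuracy = model.evaluate(X, y, verbose=0)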

The original paper on Dropout provides experimental results on a suite of standard machine learning problems, and from those results its authors derive a number of useful heuristics to consider when using dropout in practice.

  • Generally, use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.
  • Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results.
  • Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
  • Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization with a size of 4 or 5, has been shown to improve results. A sketch combining these heuristics follows this list.
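
The sketch below shows one way these heuristics might be combined; the layer widths, dropout rates, decay value, and the function name create_larger_dropout_model are illustrative choices rather than values from the article, and the code assumes the same Keras 2-era API as the earlier example.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import MaxNorm
from keras.optimizers import SGD

def create_larger_dropout_model():
    model = Sequential()
    # small dropout on the visible (input) layer
    model.add(Dropout(0.2, input_shape=(60,)))
    # wider hidden layers than a dropout-free baseline, with dropout after each one
    model.add(Dense(120, activation='relu', kernel_constraint=MaxNorm(4)))
    model.add(Dropout(0.5))
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(4)))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    # larger learning rate with decay and a high momentum, per the heuristics above
    sgd = SGD(lr=0.1, momentum=0.9, decay=1e-4)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model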

In the original paper, dropout is parameterized by the probability p of retaining a unit, where 1.0 means no dropout and 0.0 means no outputs from the layer.

A good value for the retention probability p in a hidden layer is between 0.5 and 0.8. Input layers use a larger value, such as 0.8.

A good rule of thumb is to divide the number of nodes in the layer before dropout by the retention probability p and use that as the number of nodes in the new network that uses dropout. For example, a layer with 100 nodes and p = 0.5 will require 200 nodes (100 / 0.5) when using dropout.

In Keras, the dropout rate argument is (1 - p). For intermediate layers, choosing (1 - p) = 0.5 works well for large networks. For the input layer, (1 - p) should be kept at about 0.2 or lower, because dropping too much of the input data can adversely affect training. A (1 - p) greater than 0.5 is not advised, as it removes too many connections without improving regularization.
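
As a quick check of this convention, here is a small sketch assuming TensorFlow 2.x and its bundled Keras: with rate = 0.5 in training mode, about half the values are zeroed and the survivors are scaled by 1 / (1 - rate).

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dropout(0.5)   # rate = (1 - p) = 0.5
x = np.ones((1, 10), dtype="float32")

# training=True enables dropout; at inference the layer passes inputs through unchanged.
# Roughly half the values are zeroed and the rest are scaled by 1 / (1 - rate) = 2.0,
# so the expected sum of the outputs matches the input.
print(layer(x, training=True).numpy())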

Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.

The material above is drawn from an article by Upendra Vijay, dated September 9, 2019, on regularization in neural networks. The following is a brief summary of its key points.

The article delves into the challenges of overfitting in machine learning models, specifically neural networks, and proposes a solution through the use of dropout. Dropout is a regularization technique that involves randomly dropping out nodes during training to prevent overfitting. The key idea is to simulate the presence of multiple network architectures by training on different subsets of nodes.

Now, let's break down the concepts discussed in the article:

  1. Overfitting and Flexibility:

    • Overfitting occurs when a model is trained for too many epochs, making decision boundaries overly flexible.
    • Training a single model for an extensive number of epochs can lead to overfitting.
  2. Dropout as Regularization:

    • Dropout is introduced as a regularization technique to combat overfitting.
    • It involves randomly excluding nodes during training, making the network more robust.
    • The dropout method is computationally inexpensive and effective for regularization.
  3. Implementation in Keras:

    • Dropout can be applied to hidden layers and the visible (input) layer but is not used on the output layer.
    • The article provides an example of implementing dropout in Keras, specifying a dropout rate of 20%.
    • A constraint is imposed on the weights of each hidden layer to prevent their norm from exceeding a certain value.
  4. Model Architecture in Keras:

    • The Keras Sequential model is used to build a neural network with dropout.
    • The model includes dropout layers, dense layers with ReLU activation, and a sigmoid output layer.
    • Stochastic Gradient Descent (SGD) is used as the optimizer with specific configurations.
  5. Heuristics for Using Dropout:

    • The article suggests several heuristics for effectively using dropout in practice.
    • Recommendations include using a small dropout value (20%-50%), applying dropout to both incoming and hidden units, and using a larger network for better performance.
  6. Hyperparameter Tuning:

    • Suggestions for hyperparameter tuning include using a large learning rate with decay, a high momentum value, and constraining the size of network weights.
  7. Dropout Rate Guidelines:

    • Guidelines are provided for selecting dropout rates for hidden and input layers.
    • A rule of thumb is to divide the number of nodes in the layer before dropout by the retention probability to size the dropout network.
  8. Training Impact of Dropout:

    • Dropout roughly doubles the number of iterations required for convergence.
    • However, the training time for each epoch is reduced.

In conclusion, the article provides a comprehensive understanding of dropout as a regularization technique and offers practical insights into its implementation and tuning for effective use in neural network training.
