3 Images for Why Weight and Data Normalization Are Important
Setup: deep neural net with 4 Dense layers (sizes 100, 20, 10, 1), trained on 4898 samples with 11 features for 1000 epochs, ReLU activations.
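For reference, here is a minimal Keras sketch of that setup. Only the layer sizes, sample/feature counts, epoch count, and ReLU activations come from the text; the optimizer, loss, placeholder data, and the `build_model` helper are assumptions added so the three scenarios below can be reproduced by swapping the weight initializer and toggling input normalization.

```python
# Sketch under assumptions: loss, optimizer, and the placeholder data are not
# specified in the original text; layer sizes, shapes, and epochs are.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_SAMPLES, N_FEATURES = 4898, 11

# Placeholder data with the stated shape; the real dataset is not specified.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, size=(N_SAMPLES, N_FEATURES)).astype("float32")
y_train = rng.uniform(3.0, 9.0, size=(N_SAMPLES, 1)).astype("float32")

def build_model(initializer, normalize_inputs, x_train=None):
    """Hypothetical helper: builds the 100-20-10-1 ReLU network described above."""
    inputs = keras.Input(shape=(N_FEATURES,))
    x = inputs
    if normalize_inputs:
        # Preprocessing layer: standardizes each feature to zero mean / unit
        # variance using statistics computed from the training data.
        norm = layers.Normalization()
        norm.adapt(x_train)
        x = norm(x)
    # Hidden layers of sizes 100, 20, 10 with ReLU, then one linear output unit.
    for units in (100, 20, 10):
        x = layers.Dense(units, activation="relu",
                         kernel_initializer=initializer)(x)
    outputs = layers.Dense(1, kernel_initializer=initializer)(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # assumed loss/optimizer
    return model
```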
Image 1: Bad weight initialization (weights drawn uniformly from the [0, 1] interval) and no data scaling (no preprocessing layer). The plot cuts off because the gradient explodes, producing NaN values.
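This scenario corresponds to the following sketch, reusing the assumed `build_model` helper above:

```python
# Image 1 scenario: weights drawn uniformly from [0, 1], inputs left unscaled.
bad_init = keras.initializers.RandomUniform(minval=0.0, maxval=1.0)
model_bad = build_model(initializer=bad_init, normalize_inputs=False)
# Loss and gradients grow until they overflow to NaN, which is where the plot cuts off.
history_bad = model_bad.fit(x_train, y_train, epochs=1000, verbose=0)
```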
Image 2: Properly scaled (normalized) weights, but still no data normalization. The gradient still explodes, just a bit later.
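A sketch of this intermediate case; He-normal initialization is an assumption standing in for the "normalized weights" of the text, as a common choice for ReLU layers:

```python
# Image 2 scenario: sensibly scaled weights, inputs still raw.
model_weights_only = build_model(initializer="he_normal", normalize_inputs=False)
history_weights_only = model_weights_only.fit(x_train, y_train, epochs=1000, verbose=0)
```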
Image 3: Healthy training plot. The input data was normalized and the weights were sampled from appropriately scaled normal distributions. Also tested with deeper networks: the gradient begins to explode at around 13 layers. An additional adjustment of the initialization variance based on layer depth could probably help, but this is good enough for now.
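The healthy configuration, as a sketch with the same assumed helper and initializer (normalized inputs via the preprocessing layer plus He-normal weights):

```python
# Image 3 scenario: normalized inputs + appropriately scaled weights -> stable training.
model_healthy = build_model(initializer="he_normal", normalize_inputs=True,
                            x_train=x_train)
history_healthy = model_healthy.fit(x_train, y_train, epochs=1000, verbose=0)
```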