3 Images for Why Weight and Data Normalization are Important

Setup: a deep neural net with 4 Dense layers (sizes 100, 20, 10, 1), trained for 1000 epochs on 4898 samples with 11 features, ReLU activations.

1. Bad weight initialization (sampled uniformly from [0, 1]) and no data scaling (no preprocessing layer). The plot cuts off because the gradient explodes and produces NaN values.
2. Normalized weights, but still no data normalization. The gradient still explodes, just a bit later.
3. Healthy training plot: the input data was normalized and the weights were sampled from appropriate normal distributions.

I also tested deeper networks; the gradient begins to explode again at around 13 layers. I think an initialization that also adjusts the variance based on layer depth could help with that, but honestly, it's good enough for now.
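
As a toy illustration of what the first two plots show (my own sketch, not raML code), the snippet below pushes random inputs through a stack of 100-wide ReLU layers and prints the activation scale per layer under three initializations: uniform [0, 1], zero-centered N(0, 1), and a He-style scheme that scales the variance by the fan-in (the standard version of the variance adjustment mentioned above, not something raML implements yet).

    # Toy illustration (not raML code): how weight initialization changes the
    # scale of ReLU activations as depth grows. Width and depth are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)

    def forward_scale(init, width=100, depth=10, n=256):
        """Push standardized random inputs through `depth` ReLU layers and
        return the standard deviation of the activations after each layer."""
        x = rng.standard_normal((n, width))
        scales = []
        for _ in range(depth):
            W = init(width, width)
            x = np.maximum(0.0, x @ W)   # ReLU(x W)
            scales.append(x.std())
        return scales

    uniform_01 = lambda fan_in, fan_out: rng.uniform(0.0, 1.0, (fan_in, fan_out))
    zero_mean  = lambda fan_in, fan_out: rng.normal(0.0, 1.0, (fan_in, fan_out))
    he_style   = lambda fan_in, fan_out: rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out))

    print("uniform [0,1]:", [f"{s:.1e}" for s in forward_scale(uniform_01)])  # explodes fast
    print("N(0,1)       :", [f"{s:.1e}" for s in forward_scale(zero_mean)])   # still grows, ~sqrt(fan_in/2) per layer
    print("He-style     :", [f"{s:.1e}" for s in forward_scale(he_style)])    # stays around 1

On top of that, unnormalized input features start the very first matrix multiply at a large scale, which helps explain why the unscaled-data runs blow up even sooner.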


raML - Near Goals

Big:
1. Model compilation
2. Validation
3. Optimizers

Small:
1. Lambda Layer
2. Data normalization as a layer (maybe?)

Progress so far: implemented model compilation. Now, creating a deep neural network is as easy as it is in Keras:

    model = Sequential([
        Dense(size=3, input_shape=X.shape),
        Dense(size=1, activation=Sigmoid)
    ])
    model.compile(cost=MSE(), metrics=[RMSE()])

Looks just like Keras, you say? Well, good, because Keras does model creation the right way! (For comparison, a rough tf.keras equivalent is sketched at the end of this post.) I've also added ReLU, but I'm still testing to make sure it's working right. This actually made me realize I should organize the optimizers!

Update: After investigating, I found that the problem is most likely exploding gradients. I didn't expect them to appear this early!

Update 2: Oh, this is so cool! After finding the exploding gradient in a relatively small network, I knew it probably wasn't due to the learning rate (although making it smaller did help), but rather due to weight initialization. That's actually worth a separate blog post, but in short: I used to sample weights uniformly from [0, 1], and it's much better to sample from a (normal) distribution centered at 0. (Note: that doesn't fully solve it; for best performance, one also needs to take the variance into account, which should depend on the layer's depth.)
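
For comparison, and purely as a sketch rather than anything raML-specific, here is roughly the same model spelled out in actual tf.keras. The 11-feature input shape and the SGD optimizer are my assumptions (the raML snippet doesn't specify an optimizer yet):

    # Rough tf.keras equivalent of the raML snippet above, for comparison only.
    # The 11-feature input shape and the SGD optimizer are assumptions.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(11,)),
        tf.keras.layers.Dense(3),                        # ~ Dense(size=3, input_shape=X.shape)
        tf.keras.layers.Dense(1, activation="sigmoid"),  # ~ Dense(size=1, activation=Sigmoid)
    ])
    model.compile(
        optimizer="sgd",
        loss="mse",                                          # ~ cost=MSE()
        metrics=[tf.keras.metrics.RootMeanSquaredError()],   # ~ metrics=[RMSE()]
    )
    model.summary()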


Morning Goal

Let's see if I can stick to this new rubric. (Ramil from the future: "No")

raML:
1. Lambda Layer
2. Implement more datasets
3. Add more cost functions (RMSE)
4. Come up with a better DNN model creation procedure

Update. Progress so far: added more datasets, added Metrics, and improved the tqdm progress bar (the thing in the terminal that tracks training progress); a rough sketch of that pattern is below. Here we have the MSE loss and the RMSE metric tracked while training a model on the Swedish Auto Insurance dataset. So beautiful.
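
For anyone curious what that tqdm tracking looks like, here is a minimal sketch of the pattern. The throwaway linear model, synthetic data, and learning rate are placeholders of mine, not raML's actual training loop:

    # Minimal sketch of tracking a loss and a metric in a tqdm progress bar.
    # The linear model and synthetic data are placeholders, not raML internals.
    import numpy as np
    from tqdm import tqdm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(63, 1))                 # stand-in data, one feature
    y = 3.4 * X[:, 0] + rng.normal(scale=0.5, size=63)

    w, b, lr = 0.0, 0.0, 0.05
    pbar = tqdm(range(1000), desc="training")
    for _ in pbar:
        pred = w * X[:, 0] + b
        err = pred - y
        mse = np.mean(err ** 2)                  # cost being minimized
        rmse = np.sqrt(mse)                      # tracked metric
        w -= lr * np.mean(2 * err * X[:, 0])     # plain gradient descent step
        b -= lr * np.mean(2 * err)
        pbar.set_postfix(MSE=f"{mse:.3f}", RMSE=f"{rmse:.3f}")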


Human Interactions are Hard!

It's never clear what the right thing to say is in the moment, especially when your words can and will be used against you. Stress, man, stress!

On a different note, the raML project (yep, what an awesome name) is going great! Here is a sigmoid trained with MSE. Yeah, yeah, I shouldn't use MSE for a logistic output, but that's not the point!

Update: Alright, fine! To all the (non-existent) haters, I've added the CrossEntropy loss. Kids, the demo below is why you should use the appropriate loss function. We can see that MSE after 100k epochs is only about as good as CrossEntropy after 10k! Woah, that's cool! (Note to skeptics: yes, I've also compared both at 10k, and MSE is much worse. Note to skeptics^2: yes, all initial conditions were the same, stop doubting!)
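
A quick aside on why CrossEntropy pulls ahead (my own illustration, not raML code): with a sigmoid output p = sigmoid(z) and target y, the MSE gradient with respect to the pre-activation z carries an extra p(1 - p) factor, which vanishes when the sigmoid saturates, while the cross-entropy gradient is simply p - y.

    # Illustration only (not raML code): gradients w.r.t. the pre-activation z
    # for a sigmoid output p = sigmoid(z) with target y.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y = 1.0        # true label
    z = -6.0       # confidently wrong, saturated prediction: p is about 0.0025
    p = sigmoid(z)

    # MSE: L = (p - y)^2                      ->  dL/dz = 2 * (p - y) * p * (1 - p)
    grad_mse = 2 * (p - y) * p * (1 - p)
    # CE:  L = -(y*log(p) + (1-y)*log(1-p))   ->  dL/dz = p - y
    grad_ce = p - y

    print(f"p = {p:.4f}, MSE grad = {grad_mse:.4f}, CE grad = {grad_ce:.4f}")
    # MSE grad is about -0.005 while CE grad is about -0.998: with MSE the
    # update is tiny exactly when the model is confidently wrong, which is
    # why it needs far more epochs to catch up.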
