Revealing the Dark Magic Behind Deep Learning

Introduction

Tricks Testing Checklist

Training Schemes:

  • Which learning rate regime to choose
  • Which optimizer to choose
  • Should we always pretrain a model on ImageNet
  • Which batch size to use (-)

Regularization Tricks:

  • AutoAugment
  • Weight decay
  • Label smoothing
  • Scheduled regularizations
  • Mixup
  • Auxiliary loss (-)
  • Crop factor, padding, and resizing schemes (-)
  • Drop path (-)
  • CutOut (-)

Architecture Enhancements:

  • Squeeze-and-Excitation (SE) layers
  • Stem activation functions
  • Attention pooling (-)

Training Schemes

Which Learning Rate Regime to Choose

Which Optimizer to Choose

Should We Always Pretrain a Model on ImageNet

Regularization Tricks

AutoAugment

Weight Decay

Label Smoothing

Scheduled Regularizations

Mixup

Variations of mixup examined:

  • Using more epochs.
  • With and without label smoothing.
  • Mixup with reduced weight decay.
  • Mixup without AutoAugment.
  • Mixup with a constant mix factor of 0.5 instead of one drawn from a beta distribution (a minimal sketch of mixup itself follows this list).
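
For reference, here is a minimal NumPy sketch of mixup on a single batch, assuming images x and one-hot labels y; the mix factor lam is drawn from a Beta(alpha, alpha) distribution, or held fixed (e.g. at 0.5) for the constant-factor variant above. The alpha value of 0.2 is illustrative, not a value reported here.

import numpy as np

def mixup_batch(x, y, alpha=0.2, const_lam=None):
    # Mix factor: a Beta(alpha, alpha) draw, or a fixed value (e.g. 0.5) if const_lam is set
    lam = const_lam if const_lam is not None else np.random.beta(alpha, alpha)
    # Pair each sample with a randomly permuted partner and interpolate inputs and labels
    idx = np.random.permutation(len(x))
    x_mixed = lam * x + (1.0 - lam) * x[idx]
    y_mixed = lam * y + (1.0 - lam) * y[idx]
    return x_mixed, y_mixed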

Architecture Enhancements

Squeeze-And-Excitation (SE) Layers

from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Reshape, multiply

def se_block(in_block, ch, ratio=16):
    # Squeeze: collapse the spatial dimensions into a per-channel descriptor
    x = GlobalAveragePooling2D()(in_block)
    # Excite: bottleneck MLP producing per-channel weights in (0, 1)
    x = Dense(ch // ratio, activation='relu')(x)
    x = Dense(ch, activation='sigmoid')(x)
    # Rescale the input feature map channel-wise
    x = Reshape((1, 1, ch))(x)
    return multiply([in_block, x])
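
As a usage sketch (the input shape and channel count below are illustrative, not taken from the article), the block drops in after any convolution whose output has ch channels:

from tensorflow.keras.layers import Input, Conv2D

inputs = Input(shape=(224, 224, 3))
x = Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = se_block(x, ch=64)  # recalibrate the 64 channels produced by the convolution above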

Downscaling Layers Activation Functions
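
The recommendation in the summary below is to use LeakyReLU instead of ReLU in downscaling layers. A minimal Keras sketch of such a layer, with an illustrative negative slope of 0.1 (not a value reported here):

from tensorflow.keras.layers import Conv2D, LeakyReLU

def downscale_block(x, filters):
    # Strided convolution halves the spatial resolution (a downscaling layer)
    x = Conv2D(filters, kernel_size=3, strides=2, padding='same')(x)
    # LeakyReLU in place of ReLU; the 0.1 slope is illustrative
    return LeakyReLU(alpha=0.1)(x)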

Summary

  • Learning rate regime: cycles of cosine power annealing, with decaying amplitude and an initial warm-up (a minimal training-scheme sketch combining these recommendations appears after the summary).
  • Optimizer: SGD with Nesterov momentum.
  • Pretraining on ImageNet: only for large models, or when sufficient training data is lacking.
  • AutoAugment: use the AutoAugment regime instead of standard color augmentations.
  • Weight decay: don't use a "generic" value; tune it for the task and model at hand.
  • Label smoothing: use in classification tasks, with a smoothing factor of 0.1.
  • Mixup: don't use for small models.
  • Squeeze-and-excitation layers: use them.
  • Activation functions: use LeakyReLU instead of ReLU for downscaling layers.

Tricks from the checklist above (marked "(-)") that were left without a recommendation:

  • Which batch size to use
  • Auxiliary loss
  • Crop factor, padding, and resizing schemes
  • Drop path
  • CutOut
  • Attention pooling
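
To make the training-scheme recommendations concrete, here is a minimal Keras-style sketch: SGD with Nesterov momentum, label smoothing of 0.1, and a warm-up followed by cosine-annealing cycles with decaying amplitude (a plain-cosine approximation of the cosine power annealing mentioned above). All numeric constants other than the 0.1 smoothing factor are illustrative, and model/train_ds are assumed to exist elsewhere.

import math
import tensorflow as tf

BASE_LR, WARMUP_EPOCHS, CYCLE_LEN, AMPLITUDE_DECAY = 0.1, 5, 30, 0.5  # illustrative values

def lr_schedule(epoch, lr=None):
    # Linear warm-up, then cosine-annealing cycles whose peak decays each cycle
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    cycle, pos = divmod(epoch - WARMUP_EPOCHS, CYCLE_LEN)
    amplitude = BASE_LR * (AMPLITUDE_DECAY ** cycle)
    return 0.5 * amplitude * (1 + math.cos(math.pi * pos / CYCLE_LEN))

optimizer = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True)     # SGD with Nesterov momentum
loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)  # smoothing factor of 0.1

# model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# model.fit(train_ds, epochs=100,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
# Weight decay (e.g. per-layer L2 regularization) should be tuned per task
# rather than set to a single generic value.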

