
This is a blog post for Assignment 1 from my Data Mining class.

Jupyter Notebook can be found here.

TL;DR: It’s hard, we got some improvements, state-of-the-art is unbeatable.

The CIFAR-10 Dataset

In summary, CIFAR-10 is a set of 60,000 tiny images in 10 object classes, which is very popular with beginners learning neural networks. More information is available on the CIFAR-10 and CIFAR-100 website.

Sample CIFAR-10 images

As you can see from the sample images, this is not an easy task. The label “cat” could mean either just the face of a cat or a whole cat. Likewise, the label “ship” is used to describe a picture that is mostly background with only a very small ship in it.

We will be using Python, or more specifically, PyTorch to create, train, and evaluate models on this data set.

The Challenges

The bulk of it is how to define a model that works “well”. I put “well” in quotation marks because there are so many things to tweak, e.g. number of layers, channels per layer, kernel size, activation functions, etc.

Ironically enough, the answer always seems to be more complexity, because that’s exactly what I did to overcome the default model’s underfitting. Overfitting has to be avoided as well, which I mitigated by trying out different training parameters.

There’s also the issue of implementation: you need to calculate the number of features passed from the last CONV layer to the first FC layer, which was expedited with the help of ptrblck.
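The CONV-to-FC transition can also be worked out by hand with the usual output-size formula for convolution and pooling layers. The snippet below is a minimal sketch of that arithmetic, not necessarily the approach ptrblck suggested; the layer sizes (two 5x5 convolutions, each followed by 2x2 max pooling, ending in 16 channels) are assumed to match the PyTorch tutorial network, and `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1: 5x5 kernel   -> 28
size = conv_out(size, kernel=2, stride=2)   # pool: 2x2, stride 2 -> 14
size = conv_out(size, kernel=5)             # conv2: 5x5 kernel   -> 10
size = conv_out(size, kernel=2, stride=2)   # pool: 2x2, stride 2 -> 5

out_channels = 16                           # channels leaving the last CONV layer (assumed)
fc_in_features = out_channels * size * size
print(fc_in_features)                       # 400
```

The resulting number is what the first FC layer needs as its input size. Another common way to get it is to push a dummy tensor through the CONV layers and read off the output shape.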

The Contributions

I made 3 models, 2 of which outperformed the reference model. They’re not state-of-the-art, but they’re something… I’ve also found that a batch size of 64 works well (32 and 128 closely mimic 64’s performance), given both time and hardware constraints.

The Models

Here we have a few models, including the one given to us by the PyTorch Tutorial.

PyTorch Net

Default Net network architecture

As you can see, this model is simple, with not much going on, and as expected, it does not perform that well. Training did not take long either, but the model seems to overfit after approximately 7-10 epochs.

Simple Net

A modified version of Default Net to test the idea that a somewhat similar model could somehow miraculously work. It didn’t. More on its performance below. This is basically shooting in the dark hoping to hit a fly 20 miles away.

Simple Net network architecture

Since simplicity isn’t the answer, let’s try complexity.

Referencing the layering pattern suggested by the CS231n CNN lecture, I decided to try out a few different configurations of hidden layers. This includes changing the number of channels and the number of layers, adding more pooling, and introducing dropout and normalization. Data augmentation will be introduced later.
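For context, the CS231n pattern is roughly INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC. Below is a minimal PyTorch sketch of that pattern with batch normalization and dropout added. The channel counts, dropout rate, and the helper names `conv_block` and `make_cnn` are illustrative assumptions, not the actual Stacked, Channel, or Wobbly Net definitions.

```python
import torch
from torch import nn

def conv_block(in_ch, out_ch, n_convs=2):
    """[CONV -> BN -> RELU] * n_convs followed by one 2x2 max pool."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(2))  # halves width and height
    return nn.Sequential(*layers)

def make_cnn(channels=(32, 64), num_classes=10, image_size=32):
    """Stack conv blocks, then flatten into a small FC classifier."""
    blocks, in_ch = [], 3
    for out_ch in channels:
        blocks.append(conv_block(in_ch, out_ch))
        in_ch = out_ch
    side = image_size // (2 ** len(channels))  # spatial size after all pools
    return nn.Sequential(
        *blocks,
        nn.Flatten(),
        nn.Dropout(0.5),
        nn.Linear(in_ch * side * side, 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, num_classes),
    )

model = make_cnn()
logits = model(torch.randn(4, 3, 32, 32))  # a fake batch of 4 CIFAR-sized images
print(logits.shape)
```

Changing the `channels` tuple is enough to try deeper or wider variants, which is the kind of tweaking described above.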

The following few models have mostly similar architecture, just tweaked differently.

Stacked Net

Stacked Net network architecture

Basically stacking layers.

Channel Net

Channel Net network architecture

More channels? Maybe?

Wobbly Net

Wobbly Net network architecture

This is just a guess to see if it does anything. I’m calling it Wobbly because the model’s number of channels increases, then decreases, then increases again, for both CONV and FC layers.

Performance Comparison

In this section we will train them somewhat equally, both to establish a baseline performance measure and to see whether we can improve beyond what was given to us. Also, CHARTS!

Performance Metrics

Ultimately, we only care about accuracy, but loss is also provided to give some insight into how well each model trains given its configuration, optimizer, and other hyper-parameters. All models are trained with the same settings:

  • either SGD or Adam optimizer
  • learning rate = \(0.001\)
  • batch size = \(64\)
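As a rough illustration of how these settings fit together, here is a minimal sketch of one training epoch that reports the mean loss. The random tensors stand in for CIFAR-10 and the single linear layer is a placeholder model; both are assumptions made to keep the example self-contained, and `train_one_epoch` is a hypothetical helper, not the notebook's code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, loss_fn):
    """Run one pass over the data and return the mean loss per sample."""
    model.train()
    total_loss, seen = 0.0, 0
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        seen += labels.size(0)
    return total_loss / seen

# Stand-in data: 256 random "images" with random labels (not real CIFAR-10).
torch.manual_seed(0)
data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)       # or torch.optim.SGD
loss_fn = nn.CrossEntropyLoss()

losses = [train_one_epoch(model, loader, optimizer, loss_fn) for _ in range(3)]
print(losses)
```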


Final accuracy comparison

Simple Net performed similarly to the reference model. Nothing to say there. Hovering around 60% on its best day.

Stacked Net increases accuracy by about 10 percentage points, averaging about 70%.

Channel Net and Wobbly Net have a significantly higher number of channels, which leads to much better results.

Channel Net SGD gives 78% accuracy, while Adam gives 80%.

Wobbly Net turns out to be the most accurate of them all at 79% for SGD and 83% for Adam.

It should be noted, though, that the models all started off at somewhat different points. I believe this is influenced by model complexity rather than by the optimizer.

Train accuracy is omitted since they all overfit anyway (runs to around 98%)


Not quite sure what the loss values mean. In theory they just keep decreasing regardless of whether our model is fitting well or not. The overfitting detection part is already taken care of by the train/test split.
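To make the train/test idea concrete: the same accuracy function can be run on both splits, and a large gap (for example, train accuracy near 98% against a much lower test accuracy) points to overfitting. The sketch below uses a toy loader and `nn.Identity` as a stand-in model purely so that the result can be checked by hand; `accuracy` is a hypothetical helper, not the notebook's code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of samples whose highest-scoring class matches the label."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        predictions = model(inputs).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Toy check: the "model" passes scores through unchanged,
# so each prediction is simply the argmax of the input row.
scores = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 1.0], [1.0, 0.0]])
labels = torch.tensor([0, 1, 1, 0])
toy_loader = DataLoader(TensorDataset(scores, labels), batch_size=2)

toy_acc = accuracy(nn.Identity(), toy_loader)
print(toy_acc)  # 0.75 (three of the four rows are classified correctly)
```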

Loss decreasing over multiple passes

One thing to notice is that Adam significantly outperforms SGD for my models. Also, it would appear that hugely complex models often start at a lower loss than simpler models, suggesting that highly complex models might be a good fit for such a classification task: they account for more features and hence reduce loss across the board.


Mostly a hardware consideration. Boils down to “Do you have a gigantic GPU memory to accommodate your model?”.

Optimizer Options

While there are many optimizers available in PyTorch, and the program does implement parameters to accommodate switching out the optimizer, I decided to focus on just two: SGD and Adam. Perhaps at a later time I can run all models on all optimizers.
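One simple way to make the optimizer switchable is a name-to-constructor lookup. This is a hedged sketch rather than the notebook's actual code: `OPTIMIZERS` and `make_optimizer` are hypothetical names, and the SGD momentum of 0.9 is an assumed value.

```python
import torch
from torch import nn

# Map a short name to a function that builds the optimizer.
OPTIMIZERS = {
    "sgd": lambda params, lr: torch.optim.SGD(params, lr=lr, momentum=0.9),  # momentum assumed
    "adam": lambda params, lr: torch.optim.Adam(params, lr=lr),
    "adagrad": lambda params, lr: torch.optim.Adagrad(params, lr=lr),
}

def make_optimizer(name, model, lr=0.001):
    """Build the optimizer registered under `name` for the model's parameters."""
    try:
        factory = OPTIMIZERS[name.lower()]
    except KeyError:
        raise ValueError(f"unknown optimizer: {name!r}") from None
    return factory(model.parameters(), lr)

model = nn.Linear(8, 2)  # placeholder model
optimizer = make_optimizer("adam", model)
print(type(optimizer).__name__)  # Adam
```

Running every model on every optimizer then becomes a loop over the dictionary keys.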

Batch Size Options

Through experimentation, 64 seems to be a good number for my models. Again, different models behave differently with different batch sizes, so it is very important to tweak it accordingly.

No, I did not just magically land on the number 64. Batch sizes \(\leq16\) took way too long to train, while batch sizes \(\geq128\) make little to no difference to performance. In fact, some optimizers, e.g. Adagrad, seem to perform worse at larger batch sizes.
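One plausible reason small batches train slowly is the sheer number of optimizer steps per epoch. The arithmetic below is a quick sketch assuming the standard 50,000-image CIFAR-10 training split; the notebook's actual split may differ.

```python
import math

TRAIN_IMAGES = 50_000  # standard CIFAR-10 training split (assumed)
batch_sizes = [4, 8, 16, 32, 64, 128, 256]

# Each epoch needs ceil(N / batch_size) optimizer steps.
steps_per_epoch = {bs: math.ceil(TRAIN_IMAGES / bs) for bs in batch_sizes}

for bs, steps in steps_per_epoch.items():
    print(f"batch size {bs:>3}: {steps:>5} steps per epoch")
```

A batch size of 4 needs 12,500 steps per epoch while 64 needs 782, so the small batches pay far more per-step overhead for the same amount of data.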

Why do I use powers of 2, i.e. \([4, 8, 16, 32, 64, 128, 256]\), for batch sizes? Because most people use them. Is there any practical difference compared to, say, \([5, 45, 90, 150, 300]\)? Probably not. (refer to the “tune my parameter” point above)

Learning Rate Options

Oddly enough, nothing seems to work above or below \(0.001\). Maybe it’s worth investigating the options PyTorch has to offer, such as a learning rate scheduler.
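As a starting point for that investigation, here is a minimal sketch using `torch.optim.lr_scheduler.StepLR`, which multiplies the learning rate by `gamma` every `step_size` epochs. The values `step_size=10` and `gamma=0.1` are arbitrary assumptions for illustration, and this was not part of the experiments above.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(20):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    # ... one epoch of training would go here ...
    optimizer.step()   # stands in for the real training steps
    scheduler.step()   # advance the schedule once per epoch

print(lrs[0], lrs[10])  # starts at 0.001, drops by 10x from epoch 10 onward
```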

Training Time Consideration

Each training run (see the Notebook) is timed and compared. While I trained my models locally on an NVIDIA Quadro P4000, training alone still took a significant amount of time. I can see why simpler models and well-prepared data are favorable, since they can drastically affect training time.
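For reference, timing GPU work needs a little care because CUDA calls return before the work finishes. The sketch below shows one way to time a function fairly on either CPU or GPU; `timed` is a hypothetical helper and the matrix multiplication is only a stand-in for a training run.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds), waiting for pending GPU work."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish anything queued before starting the clock
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the timed work itself to finish
    return result, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(512, 512, device=device)
product, seconds = timed(torch.matmul, x, x)
print(f"{seconds:.4f} s on {device}")
```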

Comparison of training time

Imagine training without GPU… 😱


References

[1] Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

[2] CS231n: Convolutional Neural Networks for Visual Recognition, Fei-Fei Li

[3] PyTorch Discussion Layer Dimension Output, ptrblck