Inferno is a utility library built around [PyTorch](http://pytorch.org/), designed to help you train and even build complex pytorch models. And in this tutorial, we’ll see how! If you’re new to PyTorch, I highly recommended you work through the [Pytorch tutorials](http://pytorch.org/tutorials/) first.
Building a PyTorch Model¶
Inferno’s training machinery works with just about any valid [Pytorch module](http://pytorch.org/docs/master/nn.html#torch.nn.Module). However, to make things even easier, we also provide pre-configured layers that work out-of-the-box. Let’s use them to build a convolutional neural network for Cifar-10.
import torch.nn as nn from inferno.extensions.layers.convolutional import ConvELU2D from inferno.extensions.layers.reshape import Flatten
ConvELU2D is a 2-dimensional convolutional layer with orthogonal weight initialization and [ELU](http://pytorch.org/docs/master/nn.html#torch.nn.ELU) activation. Flatten reshapes the 4 dimensional activation tensor to a matrix. Let’s use the Sequential container to chain together a bunch of convolutional and pooling layers, followed by a linear and softmax layer.
model = nn.Sequential( ConvELU2D(in_channels=3, out_channels=256, kernel_size=3), nn.MaxPool2d(kernel_size=2, stride=2), ConvELU2D(in_channels=256, out_channels=256, kernel_size=3), nn.MaxPool2d(kernel_size=2, stride=2), ConvELU2D(in_channels=256, out_channels=256, kernel_size=3), nn.MaxPool2d(kernel_size=2, stride=2), Flatten(), nn.Linear(in_features=(256 * 4 * 4), out_features=10), nn.Softmax() )
Models this size don’t win competitions anymore, but it’ll do for our purpose.
With our model built, it’s time to worry about the data generators. Or is it?
from inferno.io.box.cifar import get_cifar10_loaders train_loader, validate_loader = get_cifar10_loaders('path/to/cifar10', download=True, train_batch_size=128, test_batch_size=100)
CIFAR-10 works out-of-the-box (pun very much intended) with all the fancy data-augmentation and normalization. Of course, it’s perfectly fine if you have your own [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader).
Preparing the Trainer¶
With our model and data loaders good to go, it’s finally time to build the trainer. To start, let’s initialize one.
from inferno.trainers.basic import Trainer trainer = Trainer(model) # Tell trainer about the data loaders trainer.bind_loader('train', train_loader).bind_loader('validate', validate_loader)
Now to the things we could do with it.
Setting up Checkpointing¶
When training a model for days, it’s usually a good idea to store the current training state to disk every once in a while. To set this up, we tell trainer where to store these checkpoints and how often.
So we’re saving once every 25 epochs. But what if an epoch takes forever, and you don’t wish to wait that long?
In this setting, you’re saving once every 1000 iterations (= batches). But we might also want to create a checkpoint when the validation score is the best. Easy as 1, 2,
Remember that a checkpoint contains the entire training state, and not just the model. Everything is included in the checkpoint file, including optimizer, criterion, and callbacks but __not the data loaders__.
Setting up Validation¶
Let’s say you wish to validate once every 2 epochs.
To be able to validate, you’ll need to specify a validation metric.
Inferno looks for a metric ‘CategoricalError’ in inferno.extensions.metrics. To specify your own metric, subclass inferno.extensions.metrics.base.Metric and implement the forward method. With that done, you could:
Note that the metric applies to `torch.Tensor`s, and not on `torch.autograd.Variable`s. Also, a metric might be way too expensive to evaluate every training iteration without slowing down the training. If this is the case and you’d like to evaluate the metric every (say) 10 training iterations:
However, while validating, the metric is evaluated once every iteration.
Setting up the Criterion and Optimizer¶
With that out of the way, let’s set up a training criterion and an optimizer.
# set up the criterion trainer.build_criterion('CrossEntropyLoss')
The trainer looks for a ‘CrossEntropyLoss’ in torch.nn, which it finds. But any of the following would have worked:
What this means is that if you have your own loss criterion that has the same API as any of the criteria found in torch.nn, you should be fine by just plugging it in.
The same holds for the optimizer:
Like for criteria, the trainer looks for a ‘Adam’ in torch.optim (among other places), and initializes it with model’s parameters. Any keywords you might use for torch.optim.Adam, you could pass them to the build_optimizer method.
Or alternatively, you could use:
from torch.optim import Adam trainer.build_optimizer(Adam, weight_decay=0.0005)
If you implemented your own optimizer (by subclassing torch.optim.Optimizer), you should be able to use it instead of Adam. Alternatively, if you already have an optimizer instance, you could do:
optimizer = MyOptimizer(model.parameters(), **optimizer_kwargs) trainer.build_optimizer(optimizer)
Setting up Training Duration¶
You probably don’t want to train forever, in which case you must specify:
If you like to train indefinitely (or until you’re happy with the results), use:
In this case, you’ll need to interrupt the training manually with a KeyboardInterrupt.
Setting up Callbacks¶
Callbacks are pretty handy when it comes to interacting with the Trainer. More precisely: Trainer defines a number of events as ‘triggers’ for callbacks. Currently, these are:
BEGIN_OF_FIT, END_OF_FIT, BEGIN_OF_TRAINING_RUN, END_OF_TRAINING_RUN, BEGIN_OF_EPOCH, END_OF_EPOCH, BEGIN_OF_TRAINING_ITERATION, END_OF_TRAINING_ITERATION, BEGIN_OF_VALIDATION_RUN, END_OF_VALIDATION_RUN, BEGIN_OF_VALIDATION_ITERATION, END_OF_VALIDATION_ITERATION, BEGIN_OF_SAVE, END_OF_SAVE
As an example, let’s build a simple callback to interrupt the training on NaNs. We check at the end of every training iteration whether the training loss is NaN, and accordingly raise a RuntimeError.
import numpy as np from inferno.trainers.callbacks.base import Callback class NaNDetector(Callback): def end_of_training_iteration(self, **_): # The callback object has the trainer as an attribute. # The trainer populates its 'states' with torch tensors (NOT VARIABLES!) training_loss = self.trainer.get_state('training_loss') # Extract float from torch tensor training_loss = training_loss if np.isnan(training_loss): raise RuntimeError("NaNs detected!")
With the callback defined, all we need to do is register it with the trainer:
So the next time you get RuntimeError: “NaNs detected!, you know the drill.
Inferno supports logging scalars and images to Tensorboard out-of-the-box, though this requires you have at least [tensorflow-cpu](https://github.com/tensorflow/tensorflow) installed. Let’s say you want to log scalars every iteration and images every 20 iterations:
from inferno.trainers.callbacks.logging.tensorboard import TensorboardLogger trainer.build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'), log_images_every=(20, 'iterations')), log_directory='/path/to/log/directory')
After you’ve started training, use a bash shell to fire up tensorboard with:
$ tensorboard --logdir=/path/to/log/directory --port=6007
and navigate to localhost:6007 with your favorite browser.
Fine print: missing the log_images_every keyword argument to TensorboardLogger will result in images being logged every iteration. If you don’t have a fast hard drive, this might actually slow down the training. To not log images, just use log_images_every=’never’.
To use just one GPU:
For multi-GPU data-parallel training, simply pass trainer.cuda a list of devices:
trainer.cuda(devices=[0, 1, 2, 3])
__Pro-tip__: Say you only want to use GPUs 0, 3, 5 and 7 (your colleagues might love you for this). Before running your training script, simply:
$ export CUDA_VISIBLE_DEVICES=0,3,5,7 $ python train.py
This maps device 0 to 0, 3 to 1, 5 to 2 and 7 to 3.
One more thing¶
Once you have everything configured, use
to commence training! This last step is kinda important. :wink: