Welcome to the PyTorch Docker Assignment, which is designed to build familiarity with Docker and PyTorch.
This project trains a neural network on the MNIST dataset using PyTorch. The project is containerized with Docker, making the environment easy to reproduce. The assignment consists of the following tasks:
- Create a Dockerfile for a PyTorch (CPU version) environment.
- Keep the size of your Docker image under 1GB (uncompressed).
- Train any model on the MNIST dataset inside the Docker container.
- Save the trained model checkpoint to the host operating system.
- Add an option to resume model training from a checkpoint.
The starter code in train.py provides a basic structure for loading data, defining a model, and running training and testing loops. This submission completes that code.
Below are the instructions to build and run the code using Docker. The only prerequisite is:
- Docker installed on your machine.
The provided Dockerfile does the following:
- Base Image: Uses python:3.9-slim as the base image.
- Working Directory: Sets /workspace as the working directory inside the container.
- Package Installation: Installs specific versions of numpy, torch, and torchvision using pip.
- Copy Files: Copies train.py to the working directory.
- Command to Execute: The default command to run the training script is python train.py.
Navigate to the directory containing the Dockerfile and run the following command to build the Docker image:
docker build -t mnist-trainer:latest .
This command builds the Docker image and tags it as mnist-trainer:latest.
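Because the assignment caps the image at 1GB uncompressed, it can help to check the size right after building. The following is a minimal sketch and not part of the provided code. It assumes the docker CLI is on the PATH, reads the size in bytes via docker image inspect, and treats 1GB as 10^9 bytes, which is an assumption (the grading script may define the limit differently). The function names are hypothetical.

```python
import subprocess

# Assumption: "1GB" means 10**9 bytes; adjust if the grader uses another definition.
SIZE_LIMIT_BYTES = 10**9


def image_size_bytes(image: str) -> int:
    """Return the uncompressed image size in bytes as reported by the Docker CLI."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,           # raise if the image does not exist or docker fails
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def is_under_limit(size_bytes: int, limit_bytes: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if the size is strictly below the limit."""
    return size_bytes < limit_bytes
```

After a successful build, calling is_under_limit(image_size_bytes("mnist-trainer:latest")) would report whether the image fits within the assumed limit.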
Once the image is built, you can run the container using the following command:
docker run --rm -it -v $(pwd)/data:/workspace/data mnist-trainer:latest
Explanation:
- --rm: Automatically removes the container once it exits.
- -it: Runs the container interactively, allowing you to see the training output in real time.
- -v $(pwd)/data:/workspace/data: Mounts the data directory from your host system into the container at /workspace/data, allowing MNIST data and model checkpoints to persist between runs.
- mnist-trainer:latest: Specifies the Docker image to run.
To resume training from a checkpoint, first make sure a model checkpoint exists at ./model_checkpoint.pth. Then, add the --resume flag when running the container:
docker run --rm -it -v $(pwd)/data:/workspace/data mnist-trainer:latest --resume
This will load the existing checkpoint and continue training.
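The resume behavior described here could be implemented along the following lines. This is a minimal sketch under stated assumptions, not the actual contents of train.py: the function names, the keys of the checkpoint dictionary, and the choice to store the last finished epoch are all hypothetical. Only the --resume flag and the model_checkpoint.pth file name come from this README.

```python
import argparse
import os

import torch
import torch.nn as nn
import torch.optim as optim

# File name taken from this README; the directory it lives in is an assumption.
CHECKPOINT_PATH = "model_checkpoint.pth"


def parse_args(argv=None):
    """Parse the command line; --resume is a simple on/off flag."""
    parser = argparse.ArgumentParser(description="MNIST training (sketch)")
    parser.add_argument("--resume", action="store_true",
                        help="continue training from an existing checkpoint")
    return parser.parse_args(argv)


def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_PATH):
    """Store model weights, optimizer state, and the last finished epoch."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(model, optimizer, path=CHECKPOINT_PATH):
    """Restore state in place and return the next epoch to run."""
    checkpoint = torch.load(path, map_location="cpu")  # CPU-only environment
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1


def starting_epoch(args, model, optimizer, path=CHECKPOINT_PATH):
    """Return 1 for a fresh run, or the epoch after the checkpointed one."""
    if args.resume and os.path.exists(path):
        return load_checkpoint(model, optimizer, path)
    return 1
```

Saving the optimizer state alongside the weights means that optimizer settings and any accumulated state, such as momentum buffers, carry over, so a resumed run continues from where the previous one stopped rather than restarting the optimizer from scratch.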
- To view the logs: Use the following command to check the logs of the running container:
docker logs <container-id>
- To save the model: After training, the model checkpoint will be saved in ./model_checkpoint.pth on your local machine.
- The model architecture and training script can be modified in train.py.
- The container will automatically download the MNIST dataset during the training process if not already present.
All tests run by the script tests/grading.sh completed successfully on Gitpod.
After completing the assignment, push the code to the GitHub repository. The GitHub Actions workflow will automatically build the Docker image, run the training script, and check whether the assignment requirements have been met. Check the GitHub Actions tab for the results of these checks. All checks were passing before this assignment was submitted.