This repository contains an implementation of the Pix2Pix model using PyTorch. The Pix2Pix model leverages Conditional Generative Adversarial Networks (CGANs) to learn a mapping from an input image (e.g., a semantic segmentation map) to a corresponding output image (e.g., a photorealistic cityscape). This approach was first introduced in the paper:
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2017). Image-to-Image Translation with Conditional Adversarial Networks. CVPR.
- Introduction
- Theoretical Background
- Dataset
- Installation & Requirements
- Usage
- Hyperparameters and Settings
- Results
- Loss Plots
- References
This project implements Pix2Pix to translate segmented images of the kind used in self-driving perception pipelines (e.g., Cityscapes label maps) into realistic street-scene images. Such a model can aid in tasks like data augmentation, improved visualization, and sim-to-real adaptation in autonomous driving pipelines.
Unlike traditional GANs that generate data from random noise, Conditional GANs (CGANs) incorporate conditional inputs. The model takes a given input (such as a segmentation map) and tries to produce an output that looks realistic and aligns with the provided condition. The discriminator thus evaluates pairs of (input, output) to determine if they are "real" or "fake."
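As a minimal illustration of this pairing (tensor names and shapes here are placeholders, not the notebook's variables), the condition and the candidate image are concatenated along the channel dimension before being passed to the discriminator:

```python
import torch

# Hypothetical tensors: a 3-channel segmentation map and a 3-channel photo,
# batch size 1, resolution 256x256.
segmentation = torch.randn(1, 3, 256, 256)   # condition
photo        = torch.randn(1, 3, 256, 256)   # real or generated image

# The discriminator judges the (input, output) pair, so the two images are
# concatenated along the channel dimension into a 6-channel tensor.
pair = torch.cat([segmentation, photo], dim=1)
print(pair.shape)  # torch.Size([1, 6, 256, 256])
```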
The generator architecture is based on a U-Net: an encoder-decoder network with skip connections (see the sketch after this list).
- Encoder: Extracts features and reduces spatial resolution.
- Decoder: Reconstructs the image from latent features back to the original spatial size.
- Skip Connections: Preserve fine-grained details from early layers, improving the quality and sharpness of generated images.
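The sketch below is a deliberately shallow U-Net meant only to illustrate the encoder/decoder/skip-connection pattern; the generator in the notebook follows the deeper architecture from the paper, so layer counts and channel sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyUNetGenerator(nn.Module):
    """Illustrative 3-level U-Net; the actual Pix2Pix generator is much deeper."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        # Encoder: each block halves the spatial resolution.
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                   nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.down3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                   nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2))
        # Decoder: transposed convolutions double the resolution again.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                 nn.BatchNorm2d(base * 2), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                 nn.BatchNorm2d(base), nn.ReLU())
        self.up3 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)                           # (B, 64,  H/2, W/2)
        d2 = self.down2(d1)                          # (B, 128, H/4, W/4)
        d3 = self.down3(d2)                          # (B, 256, H/8, W/8)
        u1 = self.up1(d3)                            # (B, 128, H/4, W/4)
        u2 = self.up2(torch.cat([u1, d2], dim=1))    # skip connection from d2
        return self.up3(torch.cat([u2, d1], dim=1))  # skip connection from d1
```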
Instead of evaluating the entire image holistically, the PatchGAN discriminator classifies each N×N patch of the image as real or fake. This helps the model focus on local texture details, leading to sharper and more realistic outputs.
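A rough sketch of this idea, reusing the 6-channel (input, output) pairing from above; the exact layer configuration in the notebook may differ.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a PatchGAN: maps the 6-channel (input, output) pair to a grid
    of per-patch real/fake logits instead of a single score."""
    def __init__(self, in_ch=6, base=64):
        super().__init__()
        def block(ci, co, stride, norm=True):
            layers = [nn.Conv2d(ci, co, 4, stride, 1)]
            if norm:
                layers.append(nn.BatchNorm2d(co))
            layers.append(nn.LeakyReLU(0.2))
            return layers

        self.model = nn.Sequential(
            *block(in_ch, base, 2, norm=False),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1),  # one logit per N x N patch
        )

    def forward(self, segmentation, image):
        return self.model(torch.cat([segmentation, image], dim=1))

# For 256x256 inputs this yields a 30x30 grid of patch logits.
logits = PatchDiscriminator()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 1, 30, 30])
```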
- Adversarial Loss (GAN Loss): Encourages the generator to produce outputs indistinguishable from real images.
- L1 Loss: Ensures that the generated image is closely aligned with the target image at a pixel level, improving structural fidelity.
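A minimal sketch of how these two terms are typically combined in Pix2Pix-style training; the criterion classes and the L1 weight mirror the defaults described later, but the variable names are assumptions rather than the notebook's exact code.

```python
import torch
import torch.nn as nn

adversarial_loss = nn.BCEWithLogitsLoss()  # GAN loss on the PatchGAN logits
l1_loss = nn.L1Loss()                      # pixel-level reconstruction loss
lambda_l1 = 100                            # L1 weight recommended by the paper

def generator_loss(disc_logits_on_fake, fake_image, real_image):
    # The generator wants the discriminator to label its output as real (target = 1)...
    gan_term = adversarial_loss(disc_logits_on_fake,
                                torch.ones_like(disc_logits_on_fake))
    # ...while staying close to the ground-truth photo pixel by pixel.
    l1_term = l1_loss(fake_image, real_image)
    return gan_term + lambda_l1 * l1_term
```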
We use a Cityscapes-based Pix2Pix dataset, which contains pairs of:
- Input (Segmented) Images: Semantic label maps.
- Target (Real) Images: Corresponding realistic cityscape photographs.
Download the dataset from Kaggle. Extract it into a directory like:
```
cityscapes_pix2pix/
├── train/
│   └── {image_number}.jpg
└── val/
    └── {image_number}.jpg
```

- `train/`: Training pairs of images (segmented and real).
- `val/`: Validation pairs of images.
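Below is a minimal loader sketch for this layout. It assumes each `.jpg` stores the photo and the segmentation map side by side in one file, as in the common Pix2Pix Cityscapes release; if your copy separates or reverses the halves, adjust the crops accordingly.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class CityscapesPairs(Dataset):
    """Loads (segmentation, photo) pairs from side-by-side .jpg files."""
    def __init__(self, root, split="train"):
        self.dir = os.path.join(root, split)
        self.files = sorted(os.listdir(self.dir))
        self.to_tensor = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        pair = Image.open(os.path.join(self.dir, self.files[idx])).convert("RGB")
        w, h = pair.size
        photo = pair.crop((0, 0, w // 2, h))   # left half (assumed: real photo)
        label = pair.crop((w // 2, 0, w, h))   # right half (assumed: label map)
        return self.to_tensor(label), self.to_tensor(photo)
```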
1. Clone the repository:

   ```bash
   git clone https://github.com/pooriyasafaei/cityscapes_pix2pix.git
   ```

2. Install dependencies (Python 3.7+ recommended):

   ```bash
   pip install -r requirements.txt
   ```
Key Dependencies:
- PyTorch
- Torchvision
- PIL (Pillow)
- Matplotlib
- NumPy
3. Ensure you have GPU support for training, as it will be significantly faster than training on CPU.
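A quick way to check that a GPU is visible to PyTorch before starting training:

```python
import torch

# Use a GPU if one is available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
```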
- Download and prepare the `cityscapes` dataset and load the images by running the first cells.
- Adjust hyperparameters and paths in the `train` section as needed; default parameters are already set in the notebook.
- Run the training cells. The training loop periodically displays generated samples and saves model checkpoints (a condensed sketch of one training step follows this list).
- After training, use the trained generator to translate new segmented images with the `show_generated_images` function, which produces realistic images corresponding to your segmented inputs.
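For orientation, here is a condensed sketch of what one training step looks like; the model and optimizer names are assumptions about the notebook's variables rather than its exact code.

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()
l1 = torch.nn.L1Loss()

def train_step(generator, discriminator, opt_g, opt_d,
               label_map, real_photo, lambda_l1=100):
    # --- Discriminator: real pair -> 1, fake pair -> 0 ---
    fake_photo = generator(label_map)
    real_logits = discriminator(label_map, real_photo)
    fake_logits = discriminator(label_map, fake_photo.detach())
    d_loss = 0.5 * (bce(real_logits, torch.ones_like(real_logits)) +
                    bce(fake_logits, torch.zeros_like(fake_logits)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator: fool the discriminator and stay close to the target ---
    fake_logits = discriminator(label_map, fake_photo)
    g_loss = (bce(fake_logits, torch.ones_like(fake_logits)) +
              lambda_l1 * l1(fake_photo, real_photo))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return g_loss.item(), d_loss.item()
```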
- Learning Rate: `2e-4`
- Batch Size: `4`
- Epochs: `50`
- Lambda_L1 (L1 Loss Weight): `100`
- Optimizer: Adam (β1=0.5, β2=0.999)
These values follow recommendations from the Pix2Pix paper and are known to produce stable training dynamics and realistic outputs.
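Wired into PyTorch, these settings look roughly as follows; the models here are placeholders standing in for the notebook's U-Net generator and PatchGAN discriminator.

```python
import torch
import torch.nn as nn

# Placeholder models; in the notebook these are the U-Net generator
# and the PatchGAN discriminator.
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminator = nn.Conv2d(6, 1, 4, stride=2, padding=1)

# Hyperparameters listed above, as recommended in the Pix2Pix paper.
lr, betas, lambda_l1, epochs, batch_size = 2e-4, (0.5, 0.999), 100, 50, 4

opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=betas)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=betas)
```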
During training, generated images are periodically displayed alongside the input segmentation map and the real target image (a small display sketch follows the list below). Over time, the generated outputs should gain detail and more closely resemble the target distribution.
You can expect results where:
- Early epochs: Blurry and less detailed outputs.
- Later epochs: Increasingly realistic images with sharper boundaries and textures.
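A small helper along these lines can render the (input, generated, target) triplet; it assumes 3×H×W tensors normalized to [-1, 1], which is how the dataset sketch above scales images.

```python
import matplotlib.pyplot as plt

def show_triplet(label_map, fake_photo, real_photo):
    """Display (input, generated, target) side by side."""
    titles = ["Segmented input", "Generated", "Real target"]
    images = [label_map, fake_photo, real_photo]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, img, title in zip(axes, images, titles):
        img = (img.detach().cpu() * 0.5 + 0.5).clamp(0, 1)  # back to [0, 1]
        ax.imshow(img.permute(1, 2, 0).numpy())
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```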
After training completes, the Generator and Discriminator losses can be plotted using the last three cells in the notebook. You should see the Discriminator loss stabilizing and the Generator loss converging.
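A minimal plotting sketch, assuming the per-step losses were collected into two Python lists during training (the notebook's cells may differ in naming and granularity):

```python
import matplotlib.pyplot as plt

def plot_losses(g_losses, d_losses):
    # g_losses and d_losses: one value per training step (or per epoch).
    plt.figure(figsize=(8, 4))
    plt.plot(g_losses, label="Generator loss")
    plt.plot(d_losses, label="Discriminator loss")
    plt.xlabel("Training step")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
```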
Here you can see a sample generated from a segmented input using the default hyperparameters after 50 epochs.
- Pix2Pix Paper: Image-to-Image Translation with Conditional Adversarial Networks, Isola et al., CVPR 2017.
- Related Works:
If you find this repository helpful or use it in your research, consider citing the original Pix2Pix paper.
Enjoy experimenting with Pix2Pix!