Fast Prototyping of Deep Learning Multimodal Experiments with Hydra, PyTorch Lightning, and W&B

By CamelEdge
Updated on Thu Sep 18 2025
Introduction
As a computer vision and deep learning researcher, I often want to quickly prototype models and pipelines, including vision, language, and multimodal systems, without rewriting boilerplate every time. In this post, I’ll walk through building a flexible Python library for rapid prototyping using Hydra, PyTorch Lightning, and Weights & Biases (W&B).
Overview
The goal of this library is to:
- Easily switch between datasets, models, and loggers using Hydra configs.
- Support vision models (like ResNet) and multimodal pipelines (CLIP + LLM).
- Track experiments and outputs in W&B, including images, labels, prompts, and generated text.
- Batch process images for research experiments.
1. Project Structure
We started with a clean Python package:
myproto/
├── README.md
├── requirements.txt
├── src/
│   └── myproto/
│       ├── __init__.py
│       ├── datasets/
│       ├── models/
│       ├── trainers/
│       ├── vlm/
│       ├── llm/
│       ├── pipelines/
│       └── utils/
├── configs/
│   ├── config.yaml
│   ├── dataset/
│   ├── model/
│   ├── trainer/
│   └── logger/
├── scripts/
│   └── train.py
└── tests/
The design principle is config-driven modularity: datasets, models, and loggers are all swappable via Hydra.
2. Setting Up Dependencies
We used pip for simplicity. Core packages:
pip install torch torchvision torchaudio
pip install pytorch-lightning hydra-core omegaconf
pip install wandb timm einops albumentations opencv-python-headless
pip install transformers datasets
Optional for dev: pytest, black, and isort.
3. Config-Driven Design
Example Hydra configs:
configs/config.yaml
defaults:
  - dataset: fake
  - model: resnet18
  - trainer: default
  - logger: wandb
  - input: input

seed: 42
You can override any option from the CLI:
python scripts/train.py model=resnet18 dataset=fake
python scripts/train.py model=multimodal input.image_path=example.jpg
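The post doesn't reproduce scripts/train.py itself; as a minimal sketch (reusing the build_model factory that appears in the batch runner later on, and leaving the dataset and trainer wiring to the sections below), the Hydra entrypoint could look like this:

# scripts/train.py -- minimal sketch of the Hydra entrypoint, not the full script
import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig
from myproto.models import build_model  # same factory used in the batch runner below

@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig):
    pl.seed_everything(cfg.seed)     # seed: 42 from config.yaml
    model = build_model(cfg.model)   # resolves model=resnet18, model=multimodal, ...
    # ...build the dataloader from cfg.dataset and the trainer from cfg.trainer,
    # then call trainer.fit(...) as in the Lightning section below

if __name__ == "__main__":
    main()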
4. Building a Vision Model
We added a model factory for ResNet18:
# src/myproto/models/resnet.py
import torchvision.models as tv
import torch.nn as nn

def build_resnet18(num_classes=10, pretrained=False):
    model = tv.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
    # Swap the final fully connected layer to match the target number of classes
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model
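As a quick sanity check (a scratch snippet, not part of the library), the factory can be exercised on a random batch:

import torch
model = build_resnet18(num_classes=10)
logits = model(torch.randn(2, 3, 224, 224))  # batch of 2 random RGB images
print(logits.shape)                          # torch.Size([2, 10])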
5. PyTorch Lightning Trainer
A LightningModule for training classifiers:
import torch
from torch import nn
import torchmetrics
import pytorch_lightning as pl

class ImageClassifier(pl.LightningModule):
    def __init__(self, model, lr=1e-3):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()
        # Read the class count from the model's final layer
        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=model.fc.out_features)
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.criterion(logits, y)
        self.train_acc(logits, y)
        self.log("train_loss", loss)
        self.log("train_acc", self.train_acc, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
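The dataset module behind dataset: fake isn't shown in this post, but to make the wiring concrete, here is a minimal end-to-end sketch that uses torchvision's FakeData as a stand-in and trains the classifier for one epoch:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl

# FakeData yields random images and labels -- a stand-in for the real `dataset: fake` module
train_ds = datasets.FakeData(size=256, image_size=(3, 224, 224), num_classes=10,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

model = ImageClassifier(build_resnet18(num_classes=10), lr=1e-3)
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(model, train_loader)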
6. Multimodal Pipeline (CLIP + LLM)
We created a reusable pipeline to:
- Run CLIP zero-shot on an image to get top-k labels.
- Build a prompt using the labels.
- Generate text using an LLM.
from transformers import CLIPProcessor, CLIPModel, pipeline
from PIL import Image
import torch

class MultiModalPipeline:
    def __init__(self, clip_model_name="openai/clip-vit-base-patch32",
                 clip_device=-1, llm_model_name="gpt2", llm_device=-1):
        self.clip_device = "cpu" if clip_device == -1 else f"cuda:{clip_device}"
        self.clip_model = CLIPModel.from_pretrained(clip_model_name).to(self.clip_device)
        self.clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
        self.llm = pipeline("text-generation", model=llm_model_name, device=llm_device)

    def generate_from_image(self, image_path, candidate_labels, top_k=3, prompt_template=None):
        if prompt_template is None:
            # Fallback prompt so the method also works without an explicit template
            prompt_template = "Describe an image that contains: {labels}."
        image = Image.open(image_path).convert("RGB")
        inputs = self.clip_processor(text=candidate_labels, images=image,
                                     return_tensors="pt", padding=True).to(self.clip_device)
        with torch.no_grad():
            # CLIP similarity scores between the image and each candidate label
            logits = self.clip_model(**inputs).logits_per_image[0]
        probs = logits.softmax(dim=0)
        # Keep the top-k most likely labels with their probabilities
        label_probs = sorted(zip(candidate_labels, probs.tolist()), key=lambda x: x[1], reverse=True)[:top_k]
        prompt = prompt_template.format(labels=", ".join(f"{l} ({p:.2f})" for l, p in label_probs))
        generated_text = self.llm(prompt)[0]["generated_text"]
        return {"label_probs": label_probs, "prompt": prompt, "generated_text": generated_text, "image": image}
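A single-image call could look like this (the labels and prompt template are placeholders, not values from the post):

pipe = MultiModalPipeline()  # CLIP + GPT-2 on CPU by default
out = pipe.generate_from_image(
    "example.jpg",
    candidate_labels=["a dog", "a cat", "a car"],
    top_k=2,
    prompt_template="Write a short caption for an image containing: {labels}.",
)
print(out["label_probs"])     # top-k (label, probability) pairs from CLIP
print(out["generated_text"])  # prompt plus the LLM continuation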
7. W&B Logging
We integrated W&B logging for both training and multimodal runs:
from pytorch_lightning.loggers import WandbLogger
import wandb

logger = WandbLogger(project="myproto", log_model=True)

# Example: logging a multimodal run
logger.experiment.log({
    "input_image": wandb.Image(out["image"]),
    "top_labels": {l: p for l, p in out["label_probs"]},
    "prompt": out["prompt"],
    "generated_text": out["generated_text"],
})
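The same WandbLogger also plugs straight into the Lightning Trainer, so training metrics and multimodal outputs land in one project (a sketch, reusing the model and dataloader from the Lightning section above):

trainer = pl.Trainer(max_epochs=5, logger=logger)  # metrics go to the same "myproto" W&B project
trainer.fit(model, train_loader)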
8. Batch Processing Images
We added a batch runner to process a folder of images:
import os
from myproto.models import build_model

pipeline = build_model(cfg.model)
image_files = [os.path.join(cfg.input_batch.image_folder, f)
               for f in os.listdir(cfg.input_batch.image_folder)
               if f.lower().endswith((".png", ".jpg"))]

for img_path in image_files:
    out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                       top_k=cfg.input_batch.top_k,
                                       prompt_template=cfg.input_batch.prompt_template)
    logger.experiment.log({
        "input_image": wandb.Image(out["image"], caption=os.path.basename(img_path)),
        "top_labels": {l: p for l, p in out["label_probs"]},
        "prompt": out["prompt"],
        "generated_text": out["generated_text"],
    })
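As an optional alternative to per-image logging, the results can also be collected into one run-level summary table using the standard wandb.Table API (a sketch, not something the batch runner above does):

# Alternative: one summary table instead of per-image log calls
table = wandb.Table(columns=["image", "top_label", "generated_text"])
for img_path in image_files:
    out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                       top_k=cfg.input_batch.top_k,
                                       prompt_template=cfg.input_batch.prompt_template)
    top_label = out["label_probs"][0][0] if out["label_probs"] else ""
    table.add_data(wandb.Image(out["image"]), top_label, out["generated_text"])
logger.experiment.log({"batch_results": table})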
9. Running Experiments
Train CV model:
python scripts/train.py model=resnet18 dataset=fake
Run multimodal single image:
python scripts/train.py model=multimodal input.image_path=example.jpg
Run multimodal batch:
python scripts/multimodal_batch.py model=multimodal input_batch.image_folder=./images
All outputs are logged to W&B, so you can track experiments, generated texts, and images in one dashboard.
10. Key Takeaways
- Hydra configs allow fully modular, swappable components.
- PyTorch Lightning simplifies training loops and logging.
- W&B integration captures both training metrics and multimodal outputs.
- Multimodal pipelines are fully reusable and batchable.
- This setup is perfect for rapid prototyping of CV and multimodal research ideas.
This library now serves as a single entrypoint for experiments, letting you easily switch between CV training, LLM generation, and multimodal pipelines, all while keeping everything reproducible and tracked.