Fast Prototyping of Deep Learning Multimodal Experiments with Hydra, PyTorch Lightning, and W&B

By CamelEdge
Updated on Thu Sep 18 2025
Introduction
As a computer vision and deep learning researcher, I often want to quickly prototype models and pipelines, including vision, language, and multimodal systems, without rewriting boilerplate every time. In this post, I’ll walk through building a flexible Python library for rapid prototyping using Hydra, PyTorch Lightning, and Weights & Biases (W&B).
Overview
The goal of this library is to:
- Easily switch between datasets, models, and loggers using Hydra configs.
- Support vision models (like ResNet) and multimodal pipelines (CLIP + LLM).
- Track experiments and outputs in W&B, including images, labels, prompts, and generated text.
- Batch process images for research experiments.
1. Project Structure
We started with a clean Python package:
myproto/
├── README.md
├── requirements.txt
├── src/
│   └── myproto/
│       ├── __init__.py
│       ├── datasets/
│       ├── models/
│       ├── trainers/
│       ├── vlm/
│       ├── llm/
│       ├── pipelines/
│       └── utils/
├── configs/
│   ├── config.yaml
│   ├── dataset/
│   ├── model/
│   ├── trainer/
│   └── logger/
├── scripts/
│   └── train.py
└── tests/
The design principle is config-driven modularity: datasets, models, and loggers are all swappable via Hydra.
2. Setting Up Dependencies
We used pip for simplicity. Core packages:
pip install torch torchvision torchaudio
pip install pytorch-lightning hydra-core omegaconf
pip install wandb timm einops albumentations opencv-python-headless
pip install transformers datasets
Optional for dev: pytest, black, and isort.
3. Config-Driven Design
Example Hydra configs:
configs/config.yaml
defaults:
  - dataset: fake
  - model: resnet18
  - trainer: default
  - logger: wandb
  - input: input

seed: 42
You can override any option from the CLI:
python scripts/train.py model=resnet18 dataset=fake
python scripts/train.py model=multimodal input.image_path=example.jpg
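The post doesn't reproduce scripts/train.py itself; as a minimal sketch (reusing the build_model factory that appears in the batch runner later on, and leaving the dataset and trainer wiring to the sections below), the Hydra entrypoint could look like this:

# scripts/train.py -- minimal sketch of the Hydra entrypoint, not the full script
import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig
from myproto.models import build_model  # same factory used in the batch runner below

@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig):
    pl.seed_everything(cfg.seed)     # seed: 42 from config.yaml
    model = build_model(cfg.model)   # resolves model=resnet18, model=multimodal, ...
    # ...build the dataloader from cfg.dataset and the trainer from cfg.trainer,
    # then call trainer.fit(...) as in the Lightning section below

if __name__ == "__main__":
    main()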
4. Building a Vision Model
We added a model factory for ResNet18:
# src/myproto/models/resnet.py
import torchvision.models as tv
import torch.nn as nn

def build_resnet18(num_classes=10, pretrained=False):
    model = tv.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
    # Swap the final fully connected layer to match the target number of classes
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model
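As a quick sanity check (a scratch snippet, not part of the library), the factory can be exercised on a random batch:

import torch
model = build_resnet18(num_classes=10)
logits = model(torch.randn(2, 3, 224, 224))  # batch of 2 random RGB images
print(logits.shape)                          # torch.Size([2, 10])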
5. PyTorch Lightning Trainer
A LightningModule for training classifiers:
import torch
from torch import nn
import torchmetrics
import pytorch_lightning as pl

class ImageClassifier(pl.LightningModule):
    def __init__(self, model, lr=1e-3):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()
        # Read the class count from the model's final layer
        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=model.fc.out_features)
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.criterion(logits, y)
        self.train_acc(logits, y)
        self.log("train_loss", loss)
        self.log("train_acc", self.train_acc, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
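The dataset module behind dataset: fake isn't shown in this post, but to make the wiring concrete, here is a minimal end-to-end sketch that uses torchvision's FakeData as a stand-in and trains the classifier for one epoch:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl

# FakeData yields random images and labels -- a stand-in for the real `dataset: fake` module
train_ds = datasets.FakeData(size=256, image_size=(3, 224, 224), num_classes=10,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

model = ImageClassifier(build_resnet18(num_classes=10), lr=1e-3)
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(model, train_loader)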
6. Multimodal Pipeline (CLIP + LLM)
We created a reusable pipeline to:
- Run CLIP zero-shot on an image to get top-k labels.
- Build a prompt using the labels.
- Generate text using an LLM.
from transformers import CLIPProcessor, CLIPModel, pipeline
from PIL import Image
import torch

class MultiModalPipeline:
    def __init__(self, clip_model_name="openai/clip-vit-base-patch32",
                 clip_device=-1, llm_model_name="gpt2", llm_device=-1):
        self.clip_device = "cpu" if clip_device == -1 else f"cuda:{clip_device}"
        self.clip_model = CLIPModel.from_pretrained(clip_model_name).to(self.clip_device)
        self.clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
        self.llm = pipeline("text-generation", model=llm_model_name, device=llm_device)

    def generate_from_image(self, image_path, candidate_labels, top_k=3, prompt_template=None):
        if prompt_template is None:
            # Fallback prompt so the method also works without an explicit template
            prompt_template = "Describe an image that contains: {labels}."
        image = Image.open(image_path).convert("RGB")
        inputs = self.clip_processor(text=candidate_labels, images=image,
                                     return_tensors="pt", padding=True).to(self.clip_device)
        with torch.no_grad():
            # CLIP similarity scores between the image and each candidate label
            logits = self.clip_model(**inputs).logits_per_image[0]
        probs = logits.softmax(dim=0)
        # Keep the top-k most likely labels with their probabilities
        label_probs = sorted(zip(candidate_labels, probs.tolist()), key=lambda x: x[1], reverse=True)[:top_k]
        prompt = prompt_template.format(labels=", ".join(f"{l} ({p:.2f})" for l, p in label_probs))
        generated_text = self.llm(prompt)[0]["generated_text"]
        return {"label_probs": label_probs, "prompt": prompt, "generated_text": generated_text, "image": image}
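A single-image call could look like this (the labels and prompt template are placeholders, not values from the post):

pipe = MultiModalPipeline()  # CLIP + GPT-2 on CPU by default
out = pipe.generate_from_image(
    "example.jpg",
    candidate_labels=["a dog", "a cat", "a car"],
    top_k=2,
    prompt_template="Write a short caption for an image containing: {labels}.",
)
print(out["label_probs"])     # top-k (label, probability) pairs from CLIP
print(out["generated_text"])  # prompt plus the LLM continuation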
7. W&B Logging
We integrated W&B logging for both training and multimodal runs:
from pytorch_lightning.loggers import WandbLogger
import wandb

logger = WandbLogger(project="myproto", log_model=True)

# Example: logging a multimodal run
logger.experiment.log({
    "input_image": wandb.Image(out["image"]),
    "top_labels": {l: p for l, p in out["label_probs"]},
    "prompt": out["prompt"],
    "generated_text": out["generated_text"],
})
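The same WandbLogger also plugs straight into the Lightning Trainer, so training metrics and multimodal outputs land in one project (a sketch, reusing the model and dataloader from the Lightning section above):

trainer = pl.Trainer(max_epochs=5, logger=logger)  # metrics go to the same "myproto" W&B project
trainer.fit(model, train_loader)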
8. Batch Processing Images
We added a batch runner to process a folder of images:
import os
from myproto.models import build_model

pipeline = build_model(cfg.model)
image_files = [os.path.join(cfg.input_batch.image_folder, f)
               for f in os.listdir(cfg.input_batch.image_folder)
               if f.lower().endswith((".png", ".jpg"))]

for img_path in image_files:
    out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                       top_k=cfg.input_batch.top_k,
                                       prompt_template=cfg.input_batch.prompt_template)
    logger.experiment.log({
        "input_image": wandb.Image(out["image"], caption=os.path.basename(img_path)),
        "top_labels": {l: p for l, p in out["label_probs"]},
        "prompt": out["prompt"],
        "generated_text": out["generated_text"],
    })
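As an optional alternative to per-image logging, the results can also be collected into one run-level summary table using the standard wandb.Table API (a sketch, not something the batch runner above does):

# Alternative: one summary table instead of per-image log calls
table = wandb.Table(columns=["image", "top_label", "generated_text"])
for img_path in image_files:
    out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                       top_k=cfg.input_batch.top_k,
                                       prompt_template=cfg.input_batch.prompt_template)
    top_label = out["label_probs"][0][0] if out["label_probs"] else ""
    table.add_data(wandb.Image(out["image"]), top_label, out["generated_text"])
logger.experiment.log({"batch_results": table})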
9. Running Experiments
Train CV model:
python scripts/train.py model=resnet18 dataset=fake
Run multimodal single image:
python scripts/train.py model=multimodal input.image_path=example.jpg
Run multimodal batch:
python scripts/multimodal_batch.py model=multimodal input_batch.image_folder=./images
All outputs are logged to W&B, so you can track experiments, generated texts, and images in one dashboard.
10. Key Takeaways
- Hydra configs allow fully modular, swappable components.
- PyTorch Lightning simplifies training loops and logging.
- W&B integration captures both training metrics and multimodal outputs.
- Multimodal pipelines are fully reusable and batchable.
- This setup is perfect for rapid prototyping of CV and multimodal research ideas.
This library now serves as a single entrypoint for experiments, letting you easily switch between CV training, LLM generation, and multimodal pipelines, all while keeping everything reproducible and tracked.