
Fast Prototyping of Deep Learning Multimodal Experiments with Hydra, PyTorch Lightning, and W&B


    By CamelEdge

    Updated on Thu Sep 18 2025


    As a computer vision and deep learning researcher, I often want to quickly prototype models and pipelines, including vision, language, and multimodal systems, without rewriting boilerplate every time. In this post, I’ll walk through building a flexible Python library for rapid prototyping using Hydra, PyTorch Lightning, and Weights & Biases (W&B).


    Overview

    The goal of this library is to:

    1. Easily switch between datasets, models, and loggers using Hydra configs.
    2. Support vision models (like ResNet) and multimodal pipelines (CLIP + LLM).
    3. Track experiments and outputs in W&B, including images, labels, prompts, and generated text.
    4. Batch process images for research experiments.

    1. Project Structure

    We started with a clean Python package:

    myproto/
    ├── README.md
    ├── requirements.txt
    ├── src/
    │   └── myproto/
    │       ├── __init__.py
    │       ├── datasets/
    │       ├── models/
    │       ├── trainers/
    │       ├── vlm/
    │       ├── llm/
    │       ├── pipelines/
    │       └── utils/
    ├── configs/
    │   ├── config.yaml
    │   ├── dataset/
    │   ├── model/
    │   ├── trainer/
    │   └── logger/
    ├── scripts/
    │   └── train.py
    └── tests/
    

    The design principle is config-driven modularity: datasets, models, and loggers are all swappable via Hydra.
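
    Concretely, each config group entry names a constructor through Hydra's _target_ key, and the entrypoint builds whatever the current config selects. A minimal sketch of the instantiation side (the exact module paths may differ from the repo):

    from hydra.utils import instantiate

    # cfg.model, cfg.dataset, etc. each carry a _target_ pointing at a callable,
    # so swapping model=resnet18 for model=multimodal on the CLI changes which
    # object is built here, with no code changes.
    model = instantiate(cfg.model)
    dataset = instantiate(cfg.dataset)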

    2. Setting Up Dependencies

    We used pip for simplicity. Core packages:

    pip install torch torchvision torchaudio
    pip install pytorch-lightning hydra-core omegaconf
    pip install wandb timm einops albumentations opencv-python-headless
    pip install transformers datasets
    

    Optional for dev: pytest, black, isort.


    3. Config-Driven Design

    Example Hydra configs:

    configs/config.yaml

    defaults:
      - dataset: fake
      - model: resnet18
      - trainer: default
      - logger: wandb
      - input: input
    
    seed: 42
    

    You can override any option from the CLI:

    python scripts/train.py model=resnet18 dataset=fake
    python scripts/train.py model=multimodal input.image_path=example.jpg
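
    Hydra also makes sweeps cheap: passing -m (or --multirun) launches one run per value in a comma-separated override, for example:

    python scripts/train.py -m model=resnet18 dataset=fake seed=0,1,2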
    

    4. Building a Vision Model

    We added a model factory for ResNet18:

    # src/myproto/models/resnet.py
    import torchvision.models as tv
    import torch.nn as nn
    
    def build_resnet18(num_classes=10, pretrained=False):
        model = tv.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        in_features = model.fc.in_features
        model.fc = nn.Linear(in_features, num_classes)
        return model
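
    Wiring this factory into the config system is then a matter of pointing a config group entry at it. A sketch of what configs/model/resnet18.yaml might contain (the field names mirror the factory's arguments; treat this as an assumption about the repo):

    # configs/model/resnet18.yaml (illustrative)
    _target_: myproto.models.resnet.build_resnet18
    num_classes: 10
    pretrained: false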
    

    5. PyTorch Lightning Trainer

    A LightningModule for training classifiers:

    import torch
    from torch import nn
    import torchmetrics
    import pytorch_lightning as pl
    
    class ImageClassifier(pl.LightningModule):
        def __init__(self, model, lr=1e-3):
            super().__init__()
            self.model = model
            self.criterion = nn.CrossEntropyLoss()
            self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=model.fc.out_features)
            self.lr = lr
    
        def forward(self, x):
            return self.model(x)
    
        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = self.criterion(logits, y)
            acc = self.train_acc(logits, y)
            self.log("train_loss", loss)
            self.log("train_acc", acc, prog_bar=True)
            return loss
    
        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)
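
    For reference, here is a rough sketch of how scripts/train.py can wire these pieces together with Hydra and Lightning. The module paths and the FakeData stand-in for the fake dataset are assumptions, not the exact code in the repo:

    import hydra
    import pytorch_lightning as pl
    from hydra.utils import instantiate
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import FakeData

    from myproto.trainers.classifier import ImageClassifier  # module path assumed

    @hydra.main(config_path="../configs", config_name="config", version_base=None)
    def main(cfg):
        pl.seed_everything(cfg.seed)
        backbone = instantiate(cfg.model)          # e.g. build_resnet18(num_classes=10)
        module = ImageClassifier(backbone, lr=1e-3)

        # Stand-in for the dataset=fake config: random images with 10 classes.
        data = FakeData(size=256, image_size=(3, 224, 224), num_classes=10,
                        transform=transforms.ToTensor())
        loader = DataLoader(data, batch_size=32, shuffle=True)

        logger = instantiate(cfg.logger)           # e.g. a WandbLogger, per logger=wandb
        trainer = pl.Trainer(max_epochs=1, logger=logger)
        trainer.fit(module, loader)

    if __name__ == "__main__":
        main()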
    

    6. Multimodal Pipeline (CLIP + LLM)

    We created a reusable pipeline to:

    1. Run CLIP zero-shot on an image to get top-k labels.
    2. Build a prompt using the labels.
    3. Generate text using an LLM.

    from transformers import CLIPProcessor, CLIPModel, pipeline
    from PIL import Image
    import torch
    
    class MultiModalPipeline:
        def __init__(self, clip_model_name="openai/clip-vit-base-patch32",
                     clip_device=-1, llm_model_name="gpt2", llm_device=-1):
            self.clip_device = "cpu" if clip_device == -1 else f"cuda:{clip_device}"
            self.clip_model = CLIPModel.from_pretrained(clip_model_name).to(self.clip_device)
            self.clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
            self.llm = pipeline("text-generation", model=llm_model_name, device=llm_device)
    
        def generate_from_image(self, image_path, candidate_labels, top_k=3, prompt_template=None):
            if prompt_template is None:
                prompt_template = "This image likely shows: {labels}. Describe it in one sentence."
            image = Image.open(image_path).convert("RGB")
            inputs = self.clip_processor(text=candidate_labels, images=image, return_tensors="pt", padding=True).to(self.clip_device)
            with torch.no_grad():
                logits = self.clip_model(**inputs).logits_per_image[0]
                probs = logits.softmax(dim=0)
            label_probs = sorted(zip(candidate_labels, probs.tolist()), key=lambda x: x[1], reverse=True)[:top_k]
            prompt = prompt_template.format(labels=", ".join([f"{l} ({p:.2f})" for l, p in label_probs]))
            generated_text = self.llm(prompt)[0]["generated_text"]
            return {"label_probs": label_probs, "prompt": prompt, "generated_text": generated_text, "image": image}
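
    Used on its own (outside Hydra), the pipeline looks like this; the labels and prompt template below are just example values:

    pipe = MultiModalPipeline()  # CPU defaults; pass clip_device=0 / llm_device=0 for GPU
    out = pipe.generate_from_image(
        "example.jpg",
        candidate_labels=["a dog", "a cat", "a city street", "a mountain landscape"],
        top_k=3,
        prompt_template="This image likely shows: {labels}. Write a one-sentence caption.",
    )
    print(out["label_probs"])
    print(out["generated_text"])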
    

    7. W&B Logging

    We integrated W&B logging for both training and multimodal runs:

    from pytorch_lightning.loggers import WandbLogger
    import wandb
    
    logger = WandbLogger(project="myproto", log_model=True)
    
    # Example: logging a multimodal run
    logger.experiment.log({
        "input_image": wandb.Image(out["image"]),
        "top_labels": {l: p for l, p in out["label_probs"]},
        "prompt": out["prompt"],
        "generated_text": out["generated_text"],
    })
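
    For training runs, the same WandbLogger is simply passed to the Lightning Trainer (for example pl.Trainer(logger=logger)), so the metrics logged with self.log in the LightningModule land in the same W&B project as the multimodal outputs.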
    

    8. Batch Processing Images

    We added a batch runner to process a folder of images:

    import os
    import wandb
    from myproto.models import build_model
    
    pipeline = build_model(cfg.model)
    image_files = [os.path.join(cfg.input_batch.image_folder, f)
                   for f in os.listdir(cfg.input_batch.image_folder)
                   if f.lower().endswith((".png", ".jpg"))]
    
    for img_path in image_files:
        out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                           top_k=cfg.input_batch.top_k,
                                           prompt_template=cfg.input_batch.prompt_template)
        logger.experiment.log({
            "input_image": wandb.Image(out["image"], caption=os.path.basename(img_path)),
            "top_labels": {l: p for l, p in out["label_probs"]},
            "prompt": out["prompt"],
            "generated_text": out["generated_text"],
        })
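
    As an alternative to logging each image separately, the per-image results can be collected into a single wandb.Table, which makes side-by-side comparison in the dashboard easier (a sketch; the column names are arbitrary):

    table = wandb.Table(columns=["image", "top_labels", "prompt", "generated_text"])
    for img_path in image_files:
        out = pipeline.generate_from_image(img_path, cfg.input_batch.candidate_labels,
                                           top_k=cfg.input_batch.top_k,
                                           prompt_template=cfg.input_batch.prompt_template)
        table.add_data(wandb.Image(out["image"], caption=os.path.basename(img_path)),
                       ", ".join(f"{l} ({p:.2f})" for l, p in out["label_probs"]),
                       out["prompt"],
                       out["generated_text"])
    logger.experiment.log({"batch_results": table})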
    

    9. Running Experiments

    Train a vision model:

    python scripts/train.py model=resnet18 dataset=fake
    

    Run the multimodal pipeline on a single image:

    python scripts/train.py model=multimodal input.image_path=example.jpg
    

    Run the multimodal pipeline on a batch of images:

    python scripts/multimodal_batch.py model=multimodal input_batch.image_folder=./images
    

    All outputs are logged to W&B, so you can track experiments, generated texts, and images in one dashboard.


    10. Key Takeaways

    • Hydra configs allow fully modular, swappable components.
    • PyTorch Lightning simplifies training loops and logging.
    • W&B integration captures both training metrics and multimodal outputs.
    • Multimodal pipelines are fully reusable and batchable.
    • This setup is perfect for rapid prototyping of CV and multimodal research ideas.

    This library now serves as a single entrypoint for experiments, letting you easily switch between CV training, LLM generation, and multimodal pipelines, all while keeping everything reproducible and tracked.

    GitHub repo Link