DistilCLIP

Overview

What this is

The compact read before the technical details.

A from-scratch CLIP implementation pairing a Vision Transformer with DistilBERT for multimodal image-text understanding. Trained on the Naruto BLIP Captions dataset for 25 epochs, it learns joint representations of images and text, enabling zero-shot image classification and text-to-image retrieval. Built entirely in PyTorch to demystify contrastive learning and multimodal architectures.

Capabilities

What it actually does

The useful parts, pulled out of the paragraph wall.

From-scratch CLIP implementation pairing a Vision Transformer (ViT) with DistilBERT in place of the heavier encoders used in the original CLIP, reducing parameter count while preserving multimodal alignment

Trained on Flickr8K with a contrastive loss that aligns image-text pairs in a shared latent space, enabling cross-modal retrieval without relying on OpenAI's pretrained weights

Demonstrates an end-to-end multimodal pipeline: image preprocessing, text tokenization, dual-encoder forward pass, contrastive loss computation, and cosine-similarity inference
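
The last two pipeline stages, contrastive loss and cosine-similarity inference, can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clip_contrastive_loss` and `rank_captions` are hypothetical, the temperature of 0.07 is an assumed value, and the embeddings are taken to be already projected into the shared latent space.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_emb, text_emb: (batch, dim) tensors where row i of each is a
    matching pair. The temperature value is an assumption.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity of image i and caption j, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal, so the target class for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


def rank_captions(image_emb, text_emb):
    """Return caption indices sorted by cosine similarity to one image.

    image_emb: (dim,) embedding of a single image.
    text_emb: (num_captions, dim) candidate caption embeddings.
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), text_emb, dim=-1)
    return sims.argsort(descending=True)
```

Zero-shot classification follows the same pattern as `rank_captions`: encode one prompt per class label as the candidate captions and take the top-ranked index as the prediction.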

Implementation

Technology with jobs attached

Names are less useful than responsibilities. This is what each piece is doing.

PyTorch

Core deep learning framework for model definition, training loop, and GPU-accelerated tensor computation

Vision Transformer (ViT)

Image encoder: splits each image into a sequence of patches, adds positional embeddings, and passes the sequence through transformer blocks
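
One common way to build the patch sequence is sketched below. The class name `PatchEmbedding`, the default sizes, the class token, and the learned (rather than fixed) positional embeddings are all assumptions for illustration; the project may do this differently.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Turn an image batch into a sequence of patch tokens for a ViT."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=256):
        super().__init__()
        assert image_size % patch_size == 0, "image must divide evenly into patches"
        self.num_patches = (image_size // patch_size) ** 2
        # A conv whose kernel and stride both equal the patch size applies one
        # linear projection to every non-overlapping patch.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned positional embeddings, one per token (patches + class token).
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, images):
        x = self.proj(images)             # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)    # (B, num_patches + 1, dim)
        return x + self.pos_embed
```

The resulting token sequence is what the transformer blocks consume; the class-token output (or a pooled average of the patch tokens) can then serve as the image embedding.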

DistilBERT

Text encoder: pretrained lightweight transformer that encodes tokenized captions into dense vectors

AdamW

Optimizer with decoupled weight decay regularization, paired with a cosine annealing LR schedule for stable contrastive training
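
The shape of a cosine annealing schedule can be seen from its closed form. The sketch below is framework-free so the arithmetic is visible; the function name `cosine_annealing_lr`, the base learning rate of 1e-4, and the per-epoch stepping are assumptions, while the 25-epoch horizon comes from the overview.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` on a half-cosine curve from base_lr to min_lr.

    base_lr and min_lr are illustrative values, not the project's settings.
    """
    # Clamp so that steps past the horizon stay at min_lr.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One learning rate per epoch boundary for a 25-epoch run.
schedule = [cosine_annealing_lr(epoch, total_steps=25) for epoch in range(26)]
```

In PyTorch the same curve is available through `torch.optim.lr_scheduler.CosineAnnealingLR`, with `T_max` playing the role of `total_steps` and `eta_min` the role of `min_lr`.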

HuggingFace Datasets

Data loading and management for the Flickr8K image-caption dataset