DistilCLIP

Python | PyTorch | Vision Transformer | Multimodal AI
DistilCLIP is a from-scratch implementation of a CLIP-like model that pairs a Vision Transformer (ViT) image encoder with a pretrained DistilBERT text encoder. It was trained for 25 epochs on the Naruto BLIP Captions dataset to learn joint representations of anime images and their captions.
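
To illustrate what "CLIP-like" typically means in training terms, here is a minimal sketch of the symmetric contrastive loss used by CLIP-style models. It is written in plain Python for readability rather than PyTorch, and it is not taken from this repository: the function names, the `temperature` value of 0.07, and the toy embeddings are all assumptions made for illustration. In practice the embeddings would come from the ViT and DistilBERT encoders after projection into a shared space.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index,
    computed with a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    The temperature is a hypothetical default, not this project's setting.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarity between every image and every caption, scaled by 1/temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should peak on its own caption (the diagonal).
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n

    # Text-to-image: same idea over the columns of the similarity matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy(columns[k], k) for k in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


# Toy batch of two image/caption pairs in a 2-dimensional shared space.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_contrastive_loss(toy_embeddings, toy_embeddings)
loss_swapped = clip_contrastive_loss(toy_embeddings, toy_embeddings[::-1])
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large, which is the signal that pulls matching image and caption embeddings together during training.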