Llama3 took the world by storm, outperforming GPT-3.5 on almost every benchmark and GPT-4 on several. Then GPT-4o arrived and reclaimed the throne with its multimodal finesse. Today, we're releasing something to change that: Llama3-V, the first-ever multimodal model built on top of Llama3. As a bonus, we trained the whole thing for under $500.
On performance, Llama3-V delivers a 10–20% boost over LLaVA, the current SOTA and most popular model for multimodal understanding. It also fares comparably to closed-source models roughly 100x its size on every metric except MMMU.
🤗 Check us out on Huggingface

### Model Architecture
The bulk of our engineering effort goes into making Llama3 understand visual information. To do so, we take an input image and embed it into a series of patch embeddings using the SigLIP model. These embeddings are aligned with the textual tokens via a projection block, which applies two self-attention blocks to bring the visual embeddings into the same embedding space as the textual tokens.
Finally, the visual tokens from the projection block are prepended to the textual tokens, and the joint representation is passed into Llama3.
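To make the data flow concrete, here is a minimal PyTorch sketch of the projection block and the prepend step. The module names, dimensions, and exact layer choices are our own illustrative assumptions based on the description above, not the released code.

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """Maps SigLIP patch embeddings into Llama3's embedding space (illustrative sketch)."""

    def __init__(self, vision_dim=1152, text_dim=4096, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)
        # Two self-attention blocks, per the description above.
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patch_embeds):            # (batch, num_patches, vision_dim)
        x = self.proj(patch_embeds)             # (batch, num_patches, text_dim)
        return self.blocks(x)                   # visual tokens in the text embedding space

def build_multimodal_input(visual_tokens, text_token_embeds):
    """Prepend the projected visual tokens to the text token embeddings before feeding Llama3."""
    return torch.cat([visual_tokens, text_token_embeds], dim=1)
```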
#### SigLIP: A Deep Dive

Image: SigLIP encodes an image into a lower-dimensional embedding space for alignment with the textual embeddings.
SigLIP (Sigmoid Loss for Language Image Pre-Training) is an image embedding model that is similar to CLIP. However, unlike CLIP, which uses a contrastive loss with softmax normalization, SigLIP utilizes a pairwise sigmoid loss.
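For intuition, here is a minimal sketch of that pairwise sigmoid loss on a batch of image/text embedding pairs. The temperature and bias initializations follow the SigLIP paper, but the function itself is an illustration rather than the actual training code.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=torch.log(torch.tensor(10.0)), b=torch.tensor(-10.0)):
    """Pairwise sigmoid loss: each (image, text) pair is an independent binary example.

    img_emb, txt_emb: (batch, dim), assumed L2-normalized.
    t, b: log-temperature and bias (learnable in the paper; fixed here for illustration).
    """
    logits = img_emb @ txt_emb.T * t.exp() + b      # (batch, batch) similarity logits
    labels = 2 * torch.eye(logits.size(0)) - 1      # +1 on the diagonal (matched pairs), -1 elsewhere
    # -log sigmoid(label * logit), summed over all pairs, averaged over the batch
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```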
At a high level, SigLIP's vision encoder splits the image into a sequence of non-overlapping patches and projects each one into a lower-dimensional linear embedding space, producing a sequence of patch embeddings. These patch embeddings then pass through the encoder's transformer layers, which apply self-attention to capture long-range dependencies and extract higher-level visual features.
For our purposes, we directly use the original SigLIP model trained by Google DeepMind.
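Pulling patch embeddings from the off-the-shelf checkpoint might look like the sketch below; the specific checkpoint name and the assumption of a recent transformers release with SigLIP support are ours, so treat it as a sketch rather than our exact pipeline.

```python
import torch
from PIL import Image
from transformers import SiglipVisionModel, SiglipImageProcessor

# Illustrative choice of checkpoint; any pretrained SigLIP vision tower works the same way.
ckpt = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(ckpt)
vision_tower = SiglipVisionModel.from_pretrained(ckpt).eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs)

patch_embeds = outputs.last_hidden_state   # (1, num_patches, hidden_dim), fed to the projection block
```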
### Inference Optimizations
Training these models is expensive. To make the most of our compute, we made two major optimizations:
Caching Mechanism: Since the SigLIP model is much smaller than Llama3, running the two serially leaves the GPU underutilized while SigLIP is active. Instead, we pre-compute the image embeddings up front, which improves GPU utilization and gives us more flexibility with batch sizes (a sketch of this step follows below).
MPS/MLX Optimizations: By running inference with an MPS-optimized SigLIP model on our MacBooks, we achieve a throughput of 32 images/second.
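Here is a minimal sketch of the precompute-and-cache step, assuming the SigLIP loading code above and a hypothetical on-disk cache layout of our own choosing:

```python
import os
import torch
from PIL import Image

@torch.no_grad()
def cache_image_embeddings(image_paths, vision_tower, processor, cache_dir="siglip_cache",
                           batch_size=64, device="mps"):
    """Precompute SigLIP patch embeddings once and reuse them on every later training pass."""
    os.makedirs(cache_dir, exist_ok=True)
    vision_tower = vision_tower.to(device).eval()
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch_paths]
        inputs = processor(images=images, return_tensors="pt").to(device)
        embeds = vision_tower(**inputs).last_hidden_state.cpu()   # (B, num_patches, hidden_dim)
        for path, emb in zip(batch_paths, embeds):
            torch.save(emb, os.path.join(cache_dir, os.path.basename(path) + ".pt"))
```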
### How It Was Trained
Precompute the embeddings from SigLIP: The process begins with passing images into the SigLIP embedding model to obtain a vector representation or embedding of the image. We then apply a learned weight matrix to get the projected multimodal vision embedding.
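Concretely, this stage updates only the projection while the cached SigLIP embeddings stay fixed. The sketch below assumes the ProjectionBlock defined earlier, a frozen Llama3, and standard next-token cross-entropy; that split is our reading of the setup rather than the released code.

```python
import torch

# Assumes: `projection` is the ProjectionBlock from above, `llama3` is a causal LM that accepts
# inputs_embeds, and `loader` yields cached SigLIP embeddings, caption token embeddings, and
# labels already padded with -100 over the visual-token positions.
projection.train()
for p in llama3.parameters():
    p.requires_grad_(False)                     # keep the language model frozen in this stage

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

for cached_patch_embeds, text_embeds, labels in loader:
    visual_tokens = projection(cached_patch_embeds)             # project into Llama3's space
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    loss = llama3(inputs_embeds=inputs_embeds, labels=labels).loss   # next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```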
#### Supervised Finetuning
Following pretraining, we perform supervised finetuning to enhance the model's performance. In this step, we freeze the computed embeddings from the projection layer and run instruction finetuning on 1M examples, strengthening the model's multimodal text output.
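A minimal sketch of that stage, under the same assumptions as above: the projection is frozen and the Llama3 weights are updated on the instruction data (whether the full model or an adapter is trained is not specified, so full finetuning here is purely illustrative).

```python
# Freeze the projection learned during pretraining; finetune Llama3 on instruction data.
projection.eval()
for p in projection.parameters():
    p.requires_grad_(False)
for p in llama3.parameters():
    p.requires_grad_(True)

optimizer = torch.optim.AdamW(llama3.parameters(), lr=2e-5)

for cached_patch_embeds, text_embeds, labels in instruction_loader:
    with torch.no_grad():
        visual_tokens = projection(cached_patch_embeds)          # frozen multimodal embeddings
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    loss = llama3(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```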
### In Summary
Written by Aksh Garg
ML Researcher at Stanford, Point72, and Jump. Ex-Tesla, SpaceX, D. E. Shaw
Remember these 3 key ideas for your startup:
Cost-Effective Training: The ability to train a powerful multimodal model like Llama3-V for less than $500 opens new doors for smaller enterprises looking to implement advanced AI without breaking the bank.
Optimized Computational Resources: Pre-computed embeddings and optimizations like MPS/MLX help startups make the most of their computational resources, putting high-level AI models within reach.
Enhanced Performance through Finetuning: Leveraging supervised finetuning to adapt the model for specific tasks means startups can deploy AI technologies effectively, bolstering their operational capabilities and their product offerings.
For more details, see the original source.