Revolutionary Llama3-V: Budget-Friendly GPT-4V Competitor

BY Mark Howell · 1 year ago · 4 MIN READ

Llama3 took the world by storm, outperforming GPT-3.5 on almost all benchmarks and GPT-4 on several. Then GPT-4o arrived, reclaiming the throne with its multimodal finesse. Today, we're releasing something to change that: Llama3-V, the first-ever multimodal model built on top of Llama3. As a bonus, the entire training run cost under $500.
On benchmarks, Llama3-V shows a 10–20% boost over Llava, the current open-source SOTA and most popular model for multimodal understanding. It also performs comparably to closed-source models roughly 100x its size on all metrics except MMMU.
🤗 Check us out on Hugging Face

### Model Architecture
The bulk of our engineering effort goes into making Llama3 understand visual information. To do so, we embed an input image into a sequence of patch embeddings using the SigLIP model. These embeddings are aligned with the textual tokens via a projection block, which applies two self-attention blocks to bring the textual and visual embeddings into the same representational space.
Finally, the visual tokens from the projection block are prepended to the textual tokens, and the joint representation is passed into Llama3.
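
A minimal sketch of this architecture in PyTorch. The layer sizes, head count, and exact attention configuration here are our assumptions; the post does not publish the actual hyperparameters:

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """Maps SigLIP patch embeddings into Llama3's token-embedding space.

    Layer sizes and attention config are illustrative assumptions;
    the post does not publish the exact hyperparameters.
    """
    def __init__(self, vision_dim=1152, llama_dim=4096, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llama_dim)
        # The two self-attention blocks described above.
        self.attn = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=llama_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(2)
        ])

    def forward(self, patch_embeds):          # (batch, num_patches, vision_dim)
        x = self.proj(patch_embeds)           # (batch, num_patches, llama_dim)
        for layer in self.attn:
            x = layer(x)
        return x                              # visual "tokens" for Llama3

# The visual tokens are then prepended to the text-token embeddings:
# joint = torch.cat([visual_tokens, text_embeds], dim=1)
```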

### SigLIP: A Deep Dive

Image: The SigLIP model embeds an image into a lower-dimensional space for effective embedding alignment.
SigLIP (Sigmoid Loss for Language Image Pre-Training) is an image embedding model similar to CLIP. Unlike CLIP, however, which uses a contrastive loss with softmax normalization, SigLIP uses a pairwise sigmoid loss.
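
Concretely, this turns every image-text pair in a batch into an independent binary classification problem. A minimal sketch of the loss, following the SigLIP paper (`t` and `b` are its learned temperature and bias):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss over every image-text pair in the batch.

    img_emb, txt_emb: L2-normalized (batch, dim) embeddings.
    t, b: learned scalar temperature and bias (per the SigLIP paper).
    """
    logits = img_emb @ txt_emb.T * t + b                       # (batch, batch)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is an independent binary problem, so no softmax
    # normalization across the batch is required.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```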
At a high level, SigLIP's vision encoder splits the image into a sequence of non-overlapping patches and projects each one into a lower-dimensional linear embedding space, producing a sequence of patch embeddings. These patch embeddings then pass through transformer layers that apply self-attention to capture long-range dependencies and extract higher-level visual features.
For our purposes, we directly use the original SigLIP model trained by Google DeepMind.
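
For reference, here is one way to load a released SigLIP checkpoint with Hugging Face transformers. The specific checkpoint name is our assumption; the post does not say which SigLIP variant is used:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

# Checkpoint choice is an assumption; any released SigLIP variant works similarly.
ckpt = "google/siglip-so400m-patch14-384"
model = SiglipVisionModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    patch_embeds = model(**inputs).last_hidden_state  # (1, num_patches, hidden)
```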

### Inference Optimizations

Training these models is expensive, so to conserve compute we make two major optimizations:

  1. Caching Mechanism: Since the SigLIP model is much smaller than Llama3, running the two serially leaves the GPU underutilized while SigLIP is working. Instead, we precompute the image embeddings once, which improves GPU utilization and simplifies batch-size management (see the sketch after this list).

  2. MPS/MLX Optimizations: By running inference with an MPS-optimized SigLIP model on our MacBooks, we achieve a throughput of 32 images/second.
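
A minimal sketch of the caching idea. The function name, batch size, and file layout are illustrative, not the authors' actual pipeline:

```python
import torch

@torch.no_grad()
def precompute_image_embeddings(images, siglip, processor,
                                batch_size=64, out_path="image_embeds.pt"):
    """Run SigLIP once over the whole dataset and cache the patch
    embeddings, so later steps never wait on the vision tower."""
    # On Apple silicon, `siglip.to("mps")` moves the model to the Metal backend.
    chunks = []
    for i in range(0, len(images), batch_size):
        batch = processor(images=images[i:i + batch_size], return_tensors="pt")
        batch = {k: v.to(siglip.device) for k, v in batch.items()}
        chunks.append(siglip(**batch).last_hidden_state.cpu())
    embeds = torch.cat(chunks)            # (num_images, num_patches, hidden)
    torch.save(embeds, out_path)          # reload later with torch.load()
    return embeds
```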

### How It Was Trained

Precompute the embeddings from SigLIP: Pretraining begins with passing images through the SigLIP embedding model to obtain a vector representation (embedding) of each image. We then apply a learned projection to get the projected multimodal vision embedding.
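
In this reading, pretraining learns only the projection while the SigLIP outputs stay cached and Llama3 stays frozen. A hypothetical sketch, reusing the cached embeddings and the ProjectionBlock from earlier (the checkpoint name and learning rate are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical setup: Llama3 stays frozen; only the projection is trained.
llama3 = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama3.requires_grad_(False)

projection = ProjectionBlock()                    # from the sketch above
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)
patch_embeds = torch.load("image_embeds.pt")      # cached SigLIP outputs

def pretrain_step(image_idx, input_ids, labels):
    """One step on a batch of (image, caption) pairs. `labels` should be
    -100 over the visual-token positions so loss is computed on text only."""
    visual_tokens = projection(patch_embeds[image_idx])
    text_embeds = llama3.get_input_embeddings()(input_ids)
    joint = torch.cat([visual_tokens, text_embeds], dim=1)
    loss = llama3(inputs_embeds=joint, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```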

### Supervised Finetuning

Following pretraining, we perform supervised finetuning to enhance the model's performance. In this step, we freeze the computed embeddings from the projection layer and run instruction finetuning on 1M examples. This strengthens the model's multimodal, instruction-following text output.
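
In code, that amounts to swapping which component is trainable. A hypothetical sketch continuing the setup above (the learning rate is an assumption):

```python
# Hypothetical SFT setup: the projection is now frozen and Llama3 itself
# is instruction-tuned; the loop otherwise mirrors pretrain_step() above,
# with instruction-formatted (image, prompt, response) examples.
projection.requires_grad_(False)
llama3.requires_grad_(True)
optimizer = torch.optim.AdamW(llama3.parameters(), lr=2e-5)
```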

### In Summary

Original post written by Aksh Garg, ML researcher at Stanford, Point72, and Jump; formerly at Tesla, SpaceX, and D. E. Shaw.

Remember these 3 key ideas for your startup:

  1. Cost-Effective Training: The ability to train a powerful multimodal model like Llama3-V for less than $500 opens new doors for smaller enterprises looking to adopt advanced AI without breaking the bank.

  2. Optimized Computational Resources: Pre-computed embeddings and optimization techniques like MPS/MLX ensure that startups maximize their computational resources, making high-level AI models more accessible.

  3. Enhanced Performance through Finetuning: Leveraging supervised finetuning to adapt the model for specific tasks means startups can deploy AI technologies effectively, bolstering their operational capabilities and their product offerings.

For more details, see the original source.

