Researchers from Meta, University of Southern California, Carnegie Mellon University, and University of California San Diego recently open-sourced MEGALODON, a large language model (LLM) with an unlimited context length. MEGALODON has linear computational complexity and outperforms a similarly-sized Llama 2 model on a range of benchmarks. The model addresses several shortcomings of the Transformer neural architecture that underlies most LLMs: instead of conventional multi-head attention, MEGALODON employs a chunk-wise attention mechanism, and the research team introduced sequence-based parallelism during training, which improves scalability for long-context training.
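To make the scaling benefit of chunk-wise attention concrete, here is a minimal sketch in which attention is restricted to fixed-size chunks so that compute grows linearly with sequence length; the function, chunk size, and tensor layout are illustrative assumptions, not MEGALODON's actual implementation.

```python
import torch
import torch.nn.functional as F

def chunkwise_attention(q, k, v, chunk_size=2048):
    """Toy chunk-wise attention: each fixed-size chunk attends only within
    itself, so cost grows linearly with sequence length instead of
    quadratically. (A sketch of the general idea, not MEGALODON's code.)"""
    batch, seq_len, dim = q.shape
    assert seq_len % chunk_size == 0, "pad the sequence to a multiple of chunk_size"

    def split(x):
        # (batch, seq_len, dim) -> (batch * num_chunks, chunk_size, dim)
        return x.reshape(batch, seq_len // chunk_size, chunk_size, dim).reshape(-1, chunk_size, dim)

    out = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
    return out.reshape(batch, seq_len, dim)
```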
When evaluated on standard LLM benchmarks such as WinoGrande and MMLU, MEGALODON demonstrated superior performance compared to a Llama 2 model with the same parameter count, training data, and compute budget. The researchers noted:
> "MEGALODON achieves impressive improvements on both training perplexity and across downstream benchmarks. Importantly, experimental results on long-context modeling demonstrate MEGALODON’s ability to model sequences of unlimited length."
Additional experiments across various data modalities showed robust improvements from MEGALODON, pointing to a promising direction for future work on large-scale multi-modal pretraining.
Challenges with Transformer Architecture
While the Transformer architecture has become the de facto standard for most generative AI models, it has some notable drawbacks. In particular, its self-attention mechanism has quadratic computational and memory complexity in the sequence length, which limits the input context length. As a result, several alternatives to standard self-attention have been developed recently, including structured state space models (SSMs) like Mamba, which scale linearly with context length. Another noteworthy approach is the RWKV Project's attention-free Transformer model, which has no maximum input context length.
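A rough back-of-the-envelope comparison illustrates the difference in scaling; the FLOP formulas and chunk size below are simplifying assumptions chosen for illustration, not figures from the paper.

```python
def full_attention_flops(seq_len, dim):
    # The attention score matrix is seq_len x seq_len, so cost is quadratic in length.
    return 2 * seq_len * seq_len * dim

def chunked_attention_flops(seq_len, dim, chunk_size=2048):
    # Attention restricted to fixed-size chunks: cost grows linearly with length.
    return (seq_len // chunk_size) * 2 * chunk_size * chunk_size * dim

for n in (4_096, 32_768, 262_144):
    ratio = full_attention_flops(n, 4096) / chunked_attention_flops(n, 4096)
    print(f"{n:>7} tokens: full attention costs ~{ratio:.0f}x more than chunked")
```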
Innovations in MEGALODON
MEGALODON builds on the research team's previous model, MEGA (moving average equipped gated attention), with several new features. While MEGA uses a "classical" exponential moving average (EMA) within its attention mechanism, MEGALODON computes a complex EMA (CEMA). Mathematically, the CEMA component makes MEGALODON equivalent to a simplified state space model with a diagonal state matrix.
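A minimal sketch of a complex EMA recurrence is shown below; it assumes scalar parameters and omits MEGA's input expansion and output projection, so it only approximates the paper's formulation. Rotating the decay term in the complex plane is what makes the recurrence behave like a diagonal-state SSM.

```python
import numpy as np

def complex_ema(x, alpha=0.5, delta=0.8, theta=0.3):
    """Simplified complex exponential moving average (CEMA) sketch.
    The decay coefficient (1 - alpha * delta) is rotated by exp(i * theta),
    giving a recurrence equivalent to a state space model with a diagonal
    (complex) state matrix. Shapes and projections are simplified."""
    rotation = np.exp(1j * theta)              # per-step rotation in the complex plane
    decay = (1.0 - alpha * delta) * rotation   # complex decay coefficient
    state = 0.0 + 0.0j
    out = np.empty_like(x, dtype=np.float64)
    for t, x_t in enumerate(x):
        # h_t = alpha * e^{i*theta} * x_t + (1 - alpha*delta) * e^{i*theta} * h_{t-1}
        state = alpha * rotation * x_t + decay * state
        out[t] = state.real                    # keep the real part as the output
    return out

y = complex_ema(np.random.randn(16))
```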
The research team trained a seven-billion-parameter model, MEGALODON-7B, using the same 2-trillion-token dataset as Llama 2-7B and the same training hyperparameters. MEGALODON-7B proved more computationally efficient: when the context length was scaled up to 32k, it was significantly faster than the Llama 2 model.
Besides standard LLM benchmarks, the researchers also tested MEGALODON-7B's performance on the SCROLLS long-context question-answering benchmark and compared its results with several baseline models, including a modified Llama 2 model with a 32k context length. MEGALODON outperformed all baseline models on the NarrativeQA subtask and achieved results "competitive" with Llama 2 across all tasks.
In a discussion about MEGALODON on Hacker News, one user questioned the model's performance on recall tasks, as non-Transformer models often underperform in this area. Another user responded:
> "For what it's worth, RWKV's website mentions that while it's bad on recall, for the vast majority of tasks, you can simply ask the question before the content, and it'll handle the task just fine."

Image description: Visualization of the MEGALODON neural network structure.
Remember these 3 key ideas for your startup:
- Scalability in Model Training: The sequence-based parallelism introduced in MEGALODON's training process enhances scalability, making it efficient for handling longer context lengths. This advancement can lead to significant improvements in AI-driven applications for startups by reducing computational costs and improving model performance.
- Innovative Attention Mechanisms: By utilizing chunk-wise attention instead of the standard multihead attention, MEGALODON addresses the complexity challenges of Transformer architectures. This innovation can be particularly useful for startups working on AI and machine learning projects, leading to more efficient and powerful models.
- Practical Applications and Benchmarks: MEGALODON's superior performance on benchmarks such as WinoGrande, MMLU, and SCROLLS highlights its potential for practical applications. Startups aiming to develop advanced AI systems should consider leveraging MEGALODON to enhance their product offerings and stay competitive in the market.
For more details, see the original source.