Scaling Notion’s Data Lake: From Postgres to Massive S3 Lake

BY Mark Howell · 14 July 2024 · 4 mins read

Today in Edworking News we want to talk about building and scaling Notion’s data lake.
In the past three years Notion’s data has expanded 10x due to user and content growth, with a doubling rate of 6-12 months. Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake. Here’s how we did it.

Notion’s Data Model and Growth

Everything you see in Notion (text, images, headings, lists, database rows, pages, and so on), despite differing front-end representations and behaviors, is modeled as a "block" entity in the back end and stored in the Postgres database with a consistent structure, schema, and associated metadata. Due to active user engagement and content creation, block data has been doubling every 6 to 12 months. By the start of 2021, we had over 20 billion block rows in Postgres. As of now, this figure has skyrocketed to over 200 billion blocks, translating to hundreds of terabytes of data, even after compression.
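
Conceptually, every block is a small, uniformly shaped record. The sketch below illustrates that idea in Python; the field names are assumptions for clarity, not Notion's actual Postgres schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """Illustrative block record; field names are assumptions,
    not Notion's actual schema."""
    id: str                       # unique block ID
    block_type: str               # "text", "image", "heading", "page", ...
    properties: dict              # type-specific content, e.g. {"title": "..."}
    parent_id: Optional[str]      # blocks form a tree rooted at a page
    content: list = field(default_factory=list)  # ordered child block IDs
```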

Challenges Faced in Scaling

Operability

The sheer volume of data meant that managing 480 Fivetran connectors, especially during periods of Postgres re-sharding and maintenance, was overly burdensome.

Speed, Data Freshness, and Cost

Ingesting data into Snowflake started to lag, becoming slower and costlier due to Notion’s predominantly update-heavy workload. Traditional data warehouses optimized for insert-heavy tasks struggled with Notion's unique requirements.

Use Case Support

The complexity and volume of data transformation needed to support Notion's AI and search operations became too complicated for standard SQL interfaces.

Building and Scaling Notion’s In-House Data Lake

Given these challenges, we decided to build our own in-house data lake with the following objectives:

  • A repository capable of storing both raw and processed data at scale.

  • Efficient, scalable data ingestion, especially for update-heavy workloads.

  • Enabling AI, search, and other use cases requiring denormalized data.

Data Lake Design Principles

  1. Choosing a Data Repository and Lake
    We chose S3 as it aligned with our existing AWS stack and offered scalability, cost efficiency, and integration with various data processing engines.

  2. Processing Engine Choice
    Spark was selected as the processing engine due to its built-in functions for complex data processing, scalable architecture, and cost efficiency as an open-source framework (see the processing sketch after this list).

  3. Ingestion Approach
    We opted for a hybrid design: incrementally ingesting changed data most of the time, and periodically taking full snapshots to bootstrap tables in S3 (sketched after this list).

  4. Streamlining Incremental Ingestion
    Using Kafka with Debezium CDC connectors and Apache Hudi for ingestion, we focused on handling our update-heavy workload effectively (a sample connector registration follows this list).

  5. Ingest Raw Data Before Processing
    By first ingesting raw data into S3, we established a singular source of truth and simplified debugging. Post-ingestion, this raw data was processed and denormalized as needed before transferring cleaned and structured data to downstream systems.
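
To make these pieces concrete, here is roughly what registering a Debezium Postgres connector against the Kafka Connect REST API can look like. This is a hedged sketch in Python: the hostnames, credentials, table list, topic prefix, and slot name are placeholders rather than Notion's actual configuration.

```python
import json
import requests

# Hedged sketch: register a Debezium Postgres connector with the Kafka
# Connect REST API. All names and hosts below are placeholders.
connector = {
    "name": "blocks-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # Postgres logical decoding plugin
        "database.hostname": "pg-shard-01.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "notion",
        "topic.prefix": "postgres",             # prefixes the Kafka topic names
        "table.include.list": "public.blocks",  # capture changes to the block table
        "slot.name": "blocks_cdc_slot",         # one replication slot per shard
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",     # placeholder Connect endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```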
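Once changes are flowing, the hybrid snapshot-plus-incremental pattern from point 3 might be expressed with the Spark DataFrame API and Hudi as below; the `snapshot_df` and `changes_df` DataFrames, paths, and key fields are assumptions for illustration.

```python
# Hedged sketch of the hybrid ingestion pattern. `snapshot_df` and
# `changes_df` are assumed pre-loaded Spark DataFrames; paths and key
# fields are illustrative, not Notion's actual layout.
hudi_opts = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "id",           # dedupe key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
}

# One-time bootstrap: load a full Postgres snapshot with a bulk insert.
(
    snapshot_df.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("overwrite")
    .save("s3://my-lake/raw/blocks/")
)

# Steady state: apply each incremental change batch as an upsert.
(
    changes_df.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-lake/raw/blocks/")
)
```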
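Finally, per points 2 and 5, raw data lands in S3 first and Spark denormalizes it afterward. The sketch below shows one illustrative PySpark job that reads the raw Hudi table, follows block parent pointers with iterative self-joins, and writes a denormalized result to a processed zone; the table layout and column names are hypothetical.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("block-denorm").getOrCreate()

# Read the raw, ingested-as-is Hudi table: the single source of truth.
blocks = spark.read.format("hudi").load("s3://my-lake/raw/blocks/")

# Iteratively follow parent pointers so each block ends up linked to its
# root page; root pages point at themselves. Bounded depth keeps the
# sketch simple.
lineage = blocks.select(
    "id", F.coalesce(F.col("parent_id"), F.col("id")).alias("ancestor_id")
)
parents = blocks.select(
    F.col("id").alias("ancestor_id"),
    F.col("parent_id").alias("next_ancestor"),
)
for _ in range(10):
    lineage = (
        lineage.join(parents, "ancestor_id", "left")
        # A null next_ancestor means we already reached a root page.
        .withColumn("ancestor_id", F.coalesce("next_ancestor", "ancestor_id"))
        .drop("next_ancestor")
    )

# Publish the denormalized result to the processed zone that downstream
# analytics, search, and AI pipelines consume.
(
    blocks.join(lineage.withColumnRenamed("ancestor_id", "root_page_id"), "id")
    .select("id", "block_type", "parent_id", "root_page_id", "properties")
    .write.mode("overwrite")
    .parquet("s3://my-lake/processed/blocks_denormalized/")
)
```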

Operating Our Data Lake

After extensive testing and performance tuning, our setup using AWS EKS clusters, Kafka clusters, Deltastreamer, and Spark jobs has proven scalable and effective. Our CDC and Kafka pipelines stream incremental Postgres changes, and the Hudi setup keeps the data state in S3 up to date with minimal delay.
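
As a rough picture of what launching one such job can look like, the sketch below wraps a Hudi DeltaStreamer invocation in Python; the jar path, ordering field, and S3 paths are placeholders, not Notion's actual job configuration.

```python
import subprocess

# Hedged sketch: launch Hudi's DeltaStreamer via spark-submit to continuously
# pull CDC events from Kafka and upsert them into the S3 table. Paths and
# field names below are placeholders.
subprocess.run(
    [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "hudi-utilities-bundle.jar",
        "--table-type", "MERGE_ON_READ",   # suits update-heavy workloads
        "--op", "UPSERT",
        "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
        "--source-ordering-field", "updated_at",
        "--target-base-path", "s3://my-lake/raw/blocks/",
        "--target-table", "blocks",
        "--continuous",                    # run as a long-lived streaming job
    ],
    check=True,
)
```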

The Payoff

The transformation to using an in-house data lake has brought substantial financial and operational benefits:

  • Cost Savings: Moving critical datasets to the data lake saved over a million dollars in 2022 alone, with even higher savings projected for 2023 and 2024.

  • Efficiency: The time required for data ingestion decreased significantly—from over a day to just minutes for small tables and a couple of hours for larger ones.

  • Resilience: Re-syncing can be completed smoothly within 24 hours without impacting live databases.

Looking Ahead

This change revolutionized our data storage and handling, facilitating the successful launch of Notion AI features in recent years. We continue to refine and enhance our data lake to support more advanced functionalities efficiently.

Remember these 3 key ideas for your startup:

  • Data Management: Efficiently managing and scaling your data infrastructure can transform operational efficiency and support advanced product capabilities.

  • Cost Efficiency: Exploring open-source solutions like Apache Spark and Hudi can lead to substantial cost reductions without compromising performance.

  • Future-Proofing: Building scalable infrastructure creates a robust foundation for future innovation, such as integrating AI and machine learning features.

Edworking is the best and smartest decision for SMEs and startups to be more productive. Edworking is a FREE superapp of productivity that includes all you need for work powered by AI in the same superapp, connecting Task Management, Docs, Chat, Videocall, and File Management. Save money today by not paying for Slack, Trello, Dropbox, Zoom, and Notion.

For more details, see the original source.

Further Reading

Stay tuned for additional posts detailing advanced functionalities built on top of our data lake, including our Search and AI Embedding RAG Infra!

About the Author: Mark Howell

Mark Howell is a talented content writer for Edworking's blog, consistently producing high-quality articles on a daily basis. As a Sales Representative, he brings a unique perspective to his writing, providing valuable insights and actionable advice for readers in the education industry. With a keen eye for detail and a passion for sharing knowledge, Mark is an indispensable member of the Edworking team. His expertise in task management ensures that he is always on top of his assignments and meets strict deadlines. Furthermore, Mark's skills in project management enable him to collaborate effectively with colleagues, contributing to the team's overall success and growth. As a reliable and diligent professional, Mark Howell continues to elevate Edworking's blog and brand with his well-researched and engaging content.
