Today in Edworking News we want to talk about building and scaling Notion’s data lake.
In the past three years Notion’s data has expanded 10x due to user and content growth, with a doubling rate of 6-12 months. Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake. Here’s how we did it.
Notion’s Data Model and Growth
Everything you see in Notion—text, images, headings, lists, database rows, pages, and so on—despite differing front-end representations and behaviors, is modeled as a "block" entity in the back end and stored in Postgres with a consistent structure, schema, and associated metadata. Due to active user engagement and content creation, block data has been doubling every 6 to 12 months. By the start of 2021, we had over 20 billion block rows in Postgres. As of now, this figure has skyrocketed to over 200 billion blocks, translating to hundreds of terabytes of data even after compression.
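As a rough sketch, a block record can be pictured like the following. The field names are illustrative assumptions based on Notion’s public description of its block model, not the exact production schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """Illustrative sketch of a Notion-style block record (field names are assumed)."""
    id: str                                        # UUID primary key
    type: str                                      # e.g. "text", "image", "heading", "collection_row"
    properties: dict                               # type-specific payload, e.g. {"title": "Hello"}
    content: list = field(default_factory=list)    # ordered child block IDs (the block tree)
    parent_id: Optional[str] = None                # pointer back up the tree
    space_id: Optional[str] = None                 # workspace the block belongs to
    last_edited_time: Optional[int] = None         # useful for incremental processing downstream
```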
Challenges Faced in Scaling
Operability
The sheer volume of data meant that managing 480 Fivetran connectors, especially during periods of Postgres re-sharding and maintenance, was overly burdensome.
Speed, Data Freshness, and Cost
Ingesting data into Snowflake became slower and costlier because Notion’s workload is predominantly update-heavy, while traditional data warehouses are optimized for insert-heavy workloads and struggled with this pattern.
Use Case Support
The volume and complexity of the data transformations needed to support Notion’s AI and search features outgrew what standard SQL interfaces could handle efficiently.
Building and Scaling Notion’s In-House Data Lake
Given these challenges, we decided to build our own in-house data lake with the following objectives:
A repository capable of storing both raw and processed data at scale.
Efficient, scalable data ingestion, especially for update-heavy workloads.
Enabling AI, search, and other use cases requiring denormalized data.
Data Lake Design Principles
Choosing a Data Repository and Lake
We chose S3 as it aligned with our existing AWS stack and offered scalability, cost efficiency, and integration with various data processing engines.
Processing Engine Choice
Spark was selected as the processing engine due to its built-in functions for complex data processing, scalable architecture, and cost-efficiency as an open-source framework.
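As a minimal illustration of the kind of processing Spark handles well here, the sketch below joins block rows to their parents and aggregates per workspace. The bucket paths and column names are assumptions for the example, not Notion’s actual job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("block-denormalization-sketch").getOrCreate()

# Read raw block rows previously landed in S3 (path and columns are assumed).
blocks = spark.read.parquet("s3://example-bucket/raw/blocks/")

# Build a lookup of parent blocks so each block can be joined to its parent's type.
parents = blocks.select(
    F.col("id").alias("parent_id"),
    F.col("type").alias("parent_type"),
)

# Example of heavier-than-SQL work: join the block tree to itself and
# aggregate block counts per workspace and parent type.
denormalized = (
    blocks.join(parents, on="parent_id", how="left")
          .groupBy("space_id", "parent_type")
          .agg(F.count("id").alias("block_count"))
)

denormalized.write.mode("overwrite").parquet("s3://example-bucket/processed/block_stats/")
```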
Ingestion Approach
We opted for a hybrid design: incrementally ingesting changed data most of the time, and periodically taking full snapshots to bootstrap tables in S3.
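A hedged PySpark sketch of how this hybrid approach can map onto Apache Hudi write operations: a one-off bulk_insert to bootstrap a table from a full snapshot, then upserts for the incremental changes. Table names, record keys, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-ingestion-sketch").getOrCreate()

# Hudi options shared by both modes (record key and precombine field are assumed).
hudi_opts = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_edited_time",
}

def bootstrap_from_snapshot(snapshot_path: str, target_path: str) -> None:
    """One-off: load a full Postgres table snapshot into a fresh Hudi table."""
    df = spark.read.parquet(snapshot_path)
    (df.write.format("hudi")
       .options(**hudi_opts)
       .option("hoodie.datasource.write.operation", "bulk_insert")
       .mode("overwrite")
       .save(target_path))

def apply_incremental_changes(changes_df, target_path: str) -> None:
    """Steady state: upsert only the rows that changed since the last run."""
    (changes_df.write.format("hudi")
       .options(**hudi_opts)
       .option("hoodie.datasource.write.operation", "upsert")
       .mode("append")
       .save(target_path))
```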
Streamlining Incremental Ingestion
Utilizing Kafka Debezium CDC connectors and Apache Hudi for data ingestion, we focused on managing our update-heavy workload effectively.
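For reference, registering a Debezium Postgres connector with Kafka Connect looks roughly like the sketch below. Host names, database and table names, credentials, and topic prefixes are placeholders; Notion’s actual connector settings are not public.

```python
import json
import requests  # third-party: pip install requests

# Minimal Debezium Postgres connector registration (all values are placeholders).
connector = {
    "name": "blocks-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                  # Postgres logical decoding plugin
        "database.hostname": "postgres.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "notion_shard_0",
        "topic.prefix": "postgres.shard0",          # Kafka topic namespace
        "table.include.list": "public.block",       # stream only the block table
        "slot.name": "blocks_cdc_slot",             # replication slot for this connector
    },
}

# Submit the connector to the Kafka Connect REST API (URL is a placeholder).
resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```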
Ingest Raw Data Before Processing
By first ingesting raw data into S3, we established a single source of truth and simplified debugging. Post-ingestion, this raw data was processed and denormalized as needed before transferring cleaned and structured data to downstream systems.
Operating Our Data Lake
After extensive testing and performance tuning, our setup using AWS EKS clusters, Kafka clusters, Deltastreamer, and Spark jobs has proven scalable and effective. The Debezium CDC connectors and Kafka continuously stream Postgres changes, and the Hudi setup keeps the table state in S3 current with minimal delay.
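To make the moving parts concrete, a per-table Hudi DeltaStreamer job consuming Debezium events from Kafka can be launched roughly as sketched below. The jar path, ordering field, and properties file are assumptions; the actual jobs run on EKS with Notion-specific configuration.

```python
import subprocess

# Illustrative spark-submit for a continuously running Hudi DeltaStreamer job
# that consumes Debezium change events from Kafka and upserts them into S3.
cmd = [
    "spark-submit",
    "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "hudi-utilities-bundle.jar",                       # placeholder jar path
    "--table-type", "COPY_ON_WRITE",
    "--op", "UPSERT",
    "--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource",
    "--payload-class", "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload",
    "--source-ordering-field", "_event_lsn",           # assumed ordering column from Debezium
    "--target-base-path", "s3://example-bucket/raw/block/",
    "--target-table", "block",
    "--props", "blocks-kafka.properties",              # Kafka brokers, topic, schema registry
    "--continuous",                                    # keep streaming instead of one batch
]
subprocess.run(cmd, check=True)
```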
The Payoff
The transformation to using an in-house data lake has brought substantial financial and operational benefits:
Cost Savings: Moving critical datasets to the data lake saved over a million dollars in 2022 alone, with even higher savings projected for 2023 and 2024.
Efficiency: The time required for data ingestion decreased significantly—from over a day to just minutes for small tables and a couple of hours for larger ones.
Resilience: Re-syncing can be completed within 24 hours without impacting live databases.
Looking Ahead
This change revolutionized our data storage and handling, facilitating the successful launch of Notion AI features in recent years. We continue to refine and enhance our data lake to support more advanced functionalities efficiently.
Remember these 3 key ideas for your startup:
Data Management: Efficiently managing and scaling your data infrastructure can transform operational efficiency and support advanced product capabilities.
Cost Efficiency: Exploring open-source solutions like Apache Spark and Hudi can lead to substantial cost reductions without compromising performance.
Future-Proofing: Building scalable infrastructure creates a robust foundation for future innovation, such as integrating AI and machine learning features.
Edworking is the best and smartest decision for SMEs and startups to be more productive. Edworking is a FREE superapp of productivity that includes all you need for work powered by AI in the same superapp, connecting Task Management, Docs, Chat, Videocall, and File Management. Save money today by not paying for Slack, Trello, Dropbox, Zoom, and Notion.
For more details, see the original source.
Further Reading
Stay tuned for additional posts detailing advanced functionalities built on top of our data lake, including our Search and AI Embedding RAG Infra!