Scaling Multi-Terabyte Datasets: Key Insights and Strategies

BY Mark Howell, 1 year ago, 6 MINS READ

Today in Edworking News we want to talk about lessons learned from working with multi-terabyte datasets. This post walks through some of those lessons, focusing on the problems you may face as your dataset grows and some of the things I’ve done to overcome them. I hope you’re waiting for something to finish running while reading this!
Remember, this is not a rigid guide. It’s about introducing concepts and explaining why you should start applying them. Numerous other tools may surpass the ones I’ve used, and I strongly encourage you to explore them on your own. That active exploration is key to your professional growth.
I’ve divided this post into two sections: scaling on a single machine and scaling to multiple machines. The goal is to make the most of your available resources and reach your goals as quickly as possible. Lastly, I want to emphasize that no optimization or scaling can compensate for a flawed algorithm, so evaluating your algorithm should always be your first step before scaling up.

Scaling on a Single Machine

Joblib
Compute is the first bottleneck that comes to mind when scaling, and there are several practical ways to scale up computations. If you’re a data scientist or a machine learning engineer, you might already be familiar with Joblib, a library used to run code in parallel (among other things). It is often used inside other libraries, such as scikit-learn or XGBoost. Parallelizing something with Joblib is simple, as follows (modified for clarity from the Joblib docs):
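A minimal sketch of that pattern, assuming the `Parallel`/`delayed` idiom documented by Joblib. The function `slow_square` and the input range are illustrative placeholders, not taken from the original post:

```python
from joblib import Parallel, delayed


def slow_square(x):
    """Stand-in for an expensive, independent computation (hypothetical)."""
    return x * x


def parallel_squares(values, n_jobs=-1):
    # n_jobs=-1 asks Joblib to use all available CPU cores.
    # delayed(...) captures the call so Parallel can dispatch it to a worker.
    return Parallel(n_jobs=n_jobs)(delayed(slow_square)(v) for v in values)


if __name__ == "__main__":
    print(parallel_squares(range(10)))
```

Results come back in input order, so the parallel call can usually stand in for a plain list comprehension.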
Joblib is a great way to scale up your parallel workloads. It’s used in scikit-learn and other tools, and it has proven reliable for many workloads. That isn’t even counting its other excellent features, such as memoization and fast compressed persistence. Joblib is helpful when you just need to make a function run in parallel across all your CPU cores.
GNU Parallel is a powerful tool for preprocessing or extracting data from the CLI. Unlike Joblib, it can be used outside a script, which makes it versatile: you can even run other Python scripts in parallel. One of the most common use cases is decompressing many files simultaneously. Here’s how I would do it:

Simple CLI commands being executed simultaneously using GNU Parallel.
These commands are pretty straightforward if you have used a Linux terminal before. The main part to focus on is piping the file names to parallel so that unzip can decompress them. For any task, once you have a bash command set to run on a single file, you can parallelize it by modifying your command slightly.
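For readers who would rather stay inside Python, the same fan-out idea (feed a list of file names to a worker that decompresses each one) can be sketched with the standard library. This is an analogue of the pattern, not the GNU Parallel commands themselves. The directory layout, the `.zip` extension, and the names `unzip_one` and `unzip_all` are assumptions made for illustration:

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def unzip_one(archive: Path, out_root: Path) -> Path:
    """Extract a single .zip archive into its own sub-directory."""
    target = out_root / archive.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def unzip_all(src_dir: Path, out_root: Path, max_workers: int = 8) -> list[Path]:
    """Extract every .zip file in src_dir, several at a time."""
    archives = sorted(src_dir.glob("*.zip"))
    # A thread pool keeps the sketch simple; for CPU-heavy per-file work
    # a process pool may be a better fit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: unzip_one(a, out_root), archives))
```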
By default, parallel uses all available CPU cores, and it can also execute commands on multiple machines over ssh, meaning it could serve as an ad-hoc computing cluster. Another use case is downloading a large number of files. Given wget, parallel, and a list of files to download, it is easy to write a quick one-liner that downloads all the files in parallel.
> A quick note: While you can use this to download many files, be aware that opening many connections at once can strain servers, leading to network congestion and reduced performance for other users, or even being mistaken for a DoS attack.
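One way to respect that caution is to cap the number of simultaneous connections. Below is a rough standard-library sketch of a bounded parallel downloader in Python. It is an analogue of the idea and not the wget one-liner; the helper names and the default of four workers are arbitrary assumptions:

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse


def download_one(url: str, dest_dir: Path) -> Path:
    """Stream one URL to dest_dir, named after the last path component."""
    dest = dest_dir / Path(urlparse(url).path).name
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return dest


def download_all(urls, dest_dir: Path, max_workers: int = 4) -> list[Path]:
    """Download all URLs with at most max_workers connections open at once."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # max_workers is the politeness knob: keep it small for servers
    # you do not control.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: download_one(u, dest_dir), urls))
```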

Scaling to Multiple Machines

When to Start Using Multiple Machines
One key indicator that it makes sense to switch to multiple machines (think Spark or, my favourite, Dask) is when computing takes too long for your use case. This could be experiments, data processing, or anything else. The worst timeframe I’ve estimated was for jobs that would have taken months or even a year to finish on a single instance, even on AWS’s u-24tb1.112xlarge (a beast of a machine).
By switching to multiple smaller machines, you gain several performance benefits over a single larger instance. Depending on your scaling solution, horizontal scaling lets CPU, memory, and network capacity grow almost linearly with the number of instances you use. Most reasonably large EC2 instances offer up to 10 Gbit/s of network bandwidth, which can help alleviate IO bottlenecks, especially if you’re rapidly streaming data to or from S3.
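To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes ideal linear scaling, decimal units (1 TB = 10^12 bytes), and that every instance sustains its full link speed. None of this is guaranteed in practice, so treat the result as a lower bound:

```python
def transfer_seconds(num_bytes: float, gbit_per_s: float = 10.0,
                     n_instances: int = 1) -> float:
    """Idealized time to move num_bytes when each instance has its own link."""
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    bits = num_bytes * 8
    aggregate_bits_per_s = gbit_per_s * 1e9 * n_instances
    return bits / aggregate_bits_per_s


if __name__ == "__main__":
    one_tb = 1e12  # bytes
    for n in (1, 4, 16):
        minutes = transfer_seconds(one_tb, 10.0, n) / 60
        print(f"{n:>2} instance(s): {minutes:.1f} min per TB")
```

Under these assumptions, one terabyte over a single 10 Gbit/s link takes about 800 seconds, and eight instances working on separate shards bring that down to about 100 seconds.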

Different Computing Models

For Embarrassingly Parallel Workloads

Embarrassingly parallel workloads are generally the easiest kind to scale. We’ve already touched on how to scale up computing with Joblib or GNU Parallel, but what about scaling to multiple machines? There are quite a few tools for this. For one-off embarrassingly parallel workloads, I would recommend AWS Batch or AWS Lambda.
Batch is scalable, and with spot pricing you can get most of your tasks done at a fraction of the cost of on-demand instances, and in a fraction of the time it would take to run them in parallel on a single machine.
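As a rough sketch of how an embarrassingly parallel job might split its inputs across workers: AWS Batch array jobs expose each child job's position through the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, and each child can use it to pick its own slice of the work. The helper names, the `num_shards` parameter (assumed to match the array size you submit), and the example file names below are hypothetical:

```python
import os


def shard(items, index, num_shards):
    """Return every num_shards-th item, starting at position index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return list(items)[index::num_shards]


def shard_for_this_job(items, num_shards, env=None):
    """Pick this worker's slice based on the array index in the environment."""
    env = os.environ if env is None else env
    # Falls back to 0 so the same script also runs locally as a single worker.
    index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return shard(items, index, num_shards)


if __name__ == "__main__":
    files = [f"part-{i:04d}.parquet" for i in range(10)]  # hypothetical inputs
    print(shard_for_this_job(files, num_shards=3))
```

Because the shards are disjoint and together cover every input, the workers never need to coordinate with one another, which is what makes this kind of workload embarrassingly parallel.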

Conclusion

In conclusion, managing and scaling multi-terabyte datasets requires a deep understanding of both your data and the tools at your disposal. By leveraging Joblib and GNU Parallel for single-machine scaling, you can maximize the efficiency of your computational resources. When scaling beyond a single machine is necessary, AWS Batch, Dask, and Spark provide robust solutions for various workloads, from embarrassingly parallel tasks to complex analytical operations.
The key takeaway is to start by optimizing your algorithms before scaling, ensuring you’re not merely amplifying inefficiencies. Actively exploring and adapting new tools can significantly enhance your performance and cost-effectiveness. Successful scaling is as much about strategic planning and resource management as raw computational power. Embrace the learning curve; you’ll be well-equipped to handle even the largest datasets confidently and skillfully.
Remember these 3 key ideas for your startup:

  1. Optimize Algorithms Before Scaling: Always begin by refining your algorithm to ensure it performs efficiently. Scaling will only amplify any inefficiencies present.

  2. Choose the Right Tools: Leveraging tools like GNU Parallel, Dask, and AWS Batch can significantly enhance the performance of your data operations. Each has its unique strengths and use cases.

  3. Consider Cost-Effective Scaling: Compute resources like those provided by AWS allow scalable solutions that balance cost with performance. Embrace horizontal scaling to maximize available CPUs, RAM, and network bandwidth effectively.
For more details, see the original source.
