High-Speed IPC in Rust: A Ping Pong Analysis

BY Mark Howell · 19 June 2024 · 9 mins read

Today in Edworking News we wanted to explore different ways of communicating between processes executing on the same machine, and doing so as fast as possible. We're focusing on high-speed inter-process communication (IPC), but some of these approaches can be extended across a network. We'll do this exploration in Rust. A reminder: since these are independent processes, most of the approaches you'd take within a process are unavailable to us. Rather than communicating between threads, or between asynchronous routines, these are techniques to share data between different programs, which might not even both be written in Rust. The code will mostly be snippets, but the full source is available here, with benchmark results at the end.

The Problem

We want to send a message ("ping") from one process to another, and when it's received, reply with an acknowledgment ("pong"). This cycle gives us an opportunity to time how long it takes to round trip between two processes. Timing is complicated, and there is a note on it below, but we'll run lots of cycles and calculate the average time from there. We'll set up all the experiments as similarly as possible, with a producer process sending a ping and a consumer process replying with a pong.
Performance profiling can lead one down a deep rabbit hole, but hopefully this experiment is simple enough that we can largely isolate the effect of the communication itself. Though I don't doubt keen-eyed readers will point out missed optimizations, or operations outside the communication path that are not as computationally free as I've assumed.
Note that we're focusing on low latency here, not high throughput. In high-performance computing the two are related, but they are distinct goals. To illustrate the difference, imagine a piece of software that performs linear algebra by outsourcing the computation to the GPU. For some problem sets (like training neural networks), the work completes significantly faster on the GPU - the calculations performed per second (the throughput) are much higher. However, there is a cost to marshalling and shipping the data onto the GPU in the first place, and it will never be quicker to multiply two small matrices together that way.

A Note on Timing

We tend to assume that computers keep precise and synchronized clocks. And compared to humans, they largely do. However, when trying to measure very quick operations, there are limits to this. For starters, I'm working on a PC with a 4GHz processor. To measure something down to single cycles, I need a clock capable of 0.25ns time resolution - or the time taken for light to travel roughly 10cm. Electrical signals move significantly slower than this, so for a clock outside the processor, even the time taken to sample the timer will dwarf the time taken to perform a few cycles of calculation. Consider the clock attached to the coin battery on your motherboard, which allows the system to be disconnected from the mains, plugged back in, and still know the current time. This is known as a Real Time Clock (RTC). These mostly run at 32.768 kHz (2^15 Hz), giving them a theoretical resolution of only about 30 µs - or about a hundred thousand clock cycles. What's more, they are often configured to report time at a resolution much lower than that. Clearly that clock isn't going to cut it.
The traditional solution is to use the Time Stamp Counter, or TSC. This is a processor register that ticks up at the rate of the processor clock, so it should offer sufficient resolution for our needs. Traditionally the x86 instruction RDTSC is used to read its value. However, given the nature of modern CPUs - varying clock speeds, hyperthreading, deep pipelines, and the fear of cache/timing attacks - it's more complicated than that. On Windows, Microsoft suggests using the `QueryPerformanceCounter` function, and on Linux `clock_gettime(CLOCK_MONOTONIC)` - but note these are both system calls (more on those later). Their resolution is also hardware dependent: on reasonably recent hardware (within the last 10 years) the underlying source may be an HPET, or a modern incarnation of the TSC. Either way, these benchmarks are going to yield different results on different hardware and different operating systems, even when running identical code.
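For the curious, reading the TSC directly from Rust is a one-liner via the stable intrinsic. Here's a minimal sketch (my own, not from the benchmark code), x86-64 only:

```rust
fn read_tsc() -> u64 {
    // SAFETY: _rdtsc has no memory-safety preconditions; the caveats are
    // about interpretation (frequency scaling, out-of-order execution),
    // not soundness. x86-64 only.
    unsafe { std::arch::x86_64::_rdtsc() }
}

fn main() {
    let start = read_tsc();
    let work: u64 = std::hint::black_box((0..1_000).sum());
    let end = read_tsc();
    println!("sum = {work}, took ~{} reference cycles", end - start);
}
```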
The takeaway: timing short-duration events is difficult. If in doubt, run enough iterations of your event that the total elapsed time reaches milliseconds, and then whatever timing source your benchmarking suite relies on should lead to an accurate result.
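In Rust, that batching pattern looks something like this (a sketch; `do_ping_pong` is a hypothetical stand-in for whatever you're measuring):

```rust
use std::time::Instant;

// Hypothetical stand-in for one round trip of the operation under test.
fn do_ping_pong() -> u32 {
    42
}

fn main() {
    const ITERS: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        // black_box stops the compiler optimising the work away.
        std::hint::black_box(do_ping_pong());
    }
    // The total elapsed time is comfortably in the milliseconds-to-seconds
    // range, so dividing down gives a trustworthy per-operation average.
    println!("avg: {:?} per op", start.elapsed() / ITERS);
}
```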

Benchmarking

This was an opportunity for me to try out Divan for benchmarking. It's a comfy benchmarking tool, with the goal of being more ergonomic than Criterion. I'll hold off on offering judgment as I haven't used it a whole lot, but it seems to do what I need it to do. For each approach we will run many ping-pong cycles and report the average time per round trip. Fortunately, Divan gives us the tools for that, allowing us to annotate the benches with how many operations were executed and then produce averages based on that (with some caveats).
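As a sketch of what that annotation looks like (assuming the `divan` crate as a dev-dependency; the real bench bodies are in the linked source):

```rust
fn main() {
    // Runs all #[divan::bench] functions registered in this binary.
    divan::main();
}

#[divan::bench]
fn ping_pong(bencher: divan::Bencher) {
    const CYCLES: u32 = 1_000;
    bencher
        // Tell Divan each timed closure covers CYCLES operations, so it
        // reports a per-operation average rather than per-closure.
        .counter(divan::counter::ItemsCount::new(CYCLES))
        .bench(|| {
            for _ in 0..CYCLES {
                // one ping-pong round trip goes here
            }
        });
}
```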

Approach 1: Pipes

This is the first thing that comes to mind to connect processes on the same machine. Like `cat | grep`, we'll just connect `stdout` of the producer to `stdin` of the consumer, and vice-versa. This works on Windows and Linux, and presumably macOS. The consumer process reads five bytes from `stdin` into an array, checks whether they equal "ping" followed by a newline, and then responds appropriately. It'll also respond to "pong". The producer process is a little more complex, as it has to spawn the consumer and wire up its pipes first, but it pushes out a "ping", waits for a response, and panics if the reply isn't "pong". Aside from some fiddly `ref mut` treatment for the pipes, this was pretty easy to write.
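The consumer's loop might look roughly like this - a sketch reconstructed from the description above rather than the post's exact code (what to send back for a received "pong" is my guess):

```rust
use std::io::{self, Read, Write};

fn main() -> io::Result<()> {
    let mut stdin = io::stdin().lock();
    let mut stdout = io::stdout().lock();
    let mut buf = [0u8; 5]; // exactly "ping\n" or "pong\n"
    loop {
        stdin.read_exact(&mut buf)?;
        match &buf {
            b"ping\n" => stdout.write_all(b"pong\n")?,
            b"pong\n" => stdout.write_all(b"ping\n")?, // assumption: reply with the opposite
            other => panic!("unexpected message: {other:?}"),
        }
        stdout.flush()?; // pipes are buffered; push the reply out immediately
    }
}
```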

Approach 2: TCP

A natural approach would be to try a client and server connected via HTTP. This felt dangerously like benchmarking HTTP servers though, so instead, I just went straight to TCP. All in all, this was fairly simple. Currently "ping" is written to the socket, copied off, and then checked. "Pong" is then written back.
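One cycle on the producer side could look like this (a sketch; the port and the `set_nodelay` call are my assumptions, though disabling Nagle's algorithm is the usual trick for small-message latency):

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Assumed address; the consumer is listening here.
    let mut stream = TcpStream::connect("127.0.0.1:9000")?;
    // Don't let Nagle's algorithm hold back our tiny writes.
    stream.set_nodelay(true)?;
    let mut buf = [0u8; 4];
    stream.write_all(b"ping")?;
    stream.read_exact(&mut buf)?;
    assert_eq!(&buf, b"pong");
    Ok(())
}
```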

Approach 3: UDP

Naturally, the next approach was to try UDP. UDP is traditionally used in these contexts as a "fire and forget" mechanism. Unlike TCP, the protocol doesn't offer a way of recovering lost or out-of-order packets. This can be an advantage because it keeps the connection from getting too "chatty", but if delivery guarantees are important, those layers need to be implemented manually - either in or out of band.
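The standard library keeps the producer side just as short (again a sketch with assumed addresses):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind to an ephemeral local port; the consumer's address is assumed.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.connect("127.0.0.1:9001")?;
    let mut buf = [0u8; 4];
    socket.send(b"ping")?;
    // If the datagram is lost this blocks forever - a real setup would
    // set a read timeout and retry, implementing reliability by hand.
    socket.recv(&mut buf)?;
    assert_eq!(&buf, b"pong");
    Ok(())
}
```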

Approach 4: Shared Memory

Shared memory is well known as a fast way of sharing data between processes. One process allocates a block of memory and passes a handle to it to another process. Each process is then free to read from or write to that block independently. If your first instinct is to fear synchronization bugs and race conditions, you'd be absolutely correct. What's worse, out of the box Rust doesn't help us here, despite usually being very helpful with exactly this kind of thing. We're on our own, and it's going to be `unsafe`.
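To make that concrete, here's a minimal sketch of one way to do it - a single atomic byte used as a mailbox, via the `shared_memory` crate. The post's actual layout and synchronization scheme may well differ:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use shared_memory::ShmemConf;

const PING: u8 = 1;
const PONG: u8 = 2;

fn main() {
    // Producer: create a tiny region; the consumer opens the same file
    // link and spins, replacing each PING with a PONG.
    let shmem = ShmemConf::new()
        .size(1)
        .flink("ping_pong.shm")
        .create()
        .expect("failed to create shared memory");
    // SAFETY: both processes must agree to touch this byte only through
    // atomic operations - Rust can't check anything across the process
    // boundary for us.
    let flag = unsafe { &*(shmem.as_ptr() as *const AtomicU8) };
    for _ in 0..1_000_000 {
        flag.store(PING, Ordering::Release);
        while flag.load(Ordering::Acquire) != PONG {
            std::hint::spin_loop();
        }
    }
}
```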

Results

I've added results for Windows and Linux, but take these with a significant grain of salt as they were run on different machines. It's probably fair to compare approaches within a platform, though. The time per operation is similar for most of the approaches, apart from shared memory. With shared memory, we can perform a ping-pong round trip in under 200ns, or around 1000 processor cycles. I have to admit, I still found this a little disappointing - moving a few bytes around should be faster than that - but I'm going to resist digging too deep yet. Preparing an environment with core pinning and the correct thread priority is tricky, and given we have to do this with two concurrently running processes, it's even more difficult.

Conclusion

I was surprised at how similarly most things performed. I did a cursory investigation into Linux-specific approaches like `dbus` and Unix Domain Sockets, but they seemed to be in the same ballpark as the non-shared memory approaches. The only other thing to try would be memory-mapped files, but I thought I'd save that for when I wanted to try something similar with larger blocks of data.
If I had to do this in production, for the majority of workloads I'd probably still use an HTTP / TCP connection. It's portable, handles message failures reliably, and I could split it across machines if need be. However, for the cases where latency really matters, the maintenance overhead of shared memory is worth it.

Remember these 3 key ideas for your startup:

  1. Low-Latency Communications: When designing products that require fast and efficient data transfer between processes or services, **shared memory** can be the most effective solution, especially for latency-sensitive applications. Ensure that you allocate ample resources for developing and maintaining this communication method to avoid synchronization issues.

  2. Benchmarking and Timing: Always perform detailed benchmarking and testing to ensure the reliability and efficiency of inter-process communication methods. This includes understanding the complexities of system calls and clock synchronization to provide accurate performance metrics. Tools like Divan can assist in this process, providing ergonomic solutions for benchmarking.

  3. Consider Scalability and Maintenance: While low-level efficient communication methods like shared memory offer excellent performance, consider the maintenance overhead and scalability when used in production environments. Easy-to-implement solutions like TCP/HTTP might be more suitable for general use cases.
For anyone who wants a deeper dive, or to offer critiques and improvements, the code is available at the original source.

