High-Speed IPC in Rust: A Ping Pong Analysis

BY Mark Howell · 19 June 2024 · 9 mins read

Today in Edworking News we wanted to explore different ways of communicating between processes executing on the same machine, and doing so as fast as possible. We're focusing on high-speed inter-process communication (IPC), but some of these approaches can be extended across a network. We'll do this exploration in Rust. A reminder: since these are independent processes, most of the approaches you'd take within a process are unavailable to us. Rather than communicating between threads, or between asynchronous routines, these are techniques to share data between different programs, which might not even both be written in Rust. The code will mostly be snippets, but the full source is available here, with benchmark results at the end.

The Problem

We want to send a message ("ping") from one process to another, and when it's received, reply with an acknowledgment ("pong"). This cycle gives us an opportunity to time how long it takes to round trip between two processes. Timing is complicated, and there is a note on it below, but we'll run lots of cycles and calculate the average time from there. We'll set up all the experiments as similarly as possible, with a producer process sending a ping and a consumer process replying with a pong.
Performance profiling can lead one down a deep rabbit hole, but hopefully this experiment is simple enough that we can largely isolate the effect of the communication itself. Though I don't doubt keen-eyed readers will point out missed optimizations, or operations outside the communication path that are not as computationally free as I've assumed.
Note that we're focusing on low latency here, not high throughput. In high-performance computing the two are related, but they are distinct goals. To illustrate the difference, imagine a piece of software that performs linear algebra by outsourcing the computation to the GPU. For some problem sets (like training neural networks), the work completes significantly faster on the GPU - the calculations performed per second (the throughput) are much higher. However, there is a cost to marshalling and shipping the data onto the GPU in the first place, and it will never be quicker to multiply two small matrices together that way.

A Note on Timing

We tend to assume that computers keep precise and synchronized clocks. And compared to humans, they largely do. However, when trying to measure very quick operations, there are limits to this. For starters, I'm working on a PC with a 4GHz processor. To measure something down to single cycles, I need a clock capable of 0.25ns time resolution - or the time taken for light to travel roughly 10cm. Electrical signals move significantly slower than this, so for a clock outside the processor, even the time taken to sample the timer will dwarf the time taken to perform a few cycles of calculation. Consider the clock attached to the coin battery on your motherboard, which allows the system to be disconnected from the mains, plugged back in, and still know the current time. This is known as a Real Time Clock (RTC). These mostly run at 32.768 kHz (2^15 Hz), giving them a theoretical resolution of only about 30 µs - or about a hundred thousand clock cycles. What's more, they are often configured to report time at a resolution much lower than that. Clearly that clock isn't going to cut it.
The traditional solution is to use the Time Stamp Counter, or TSC. This is a processor register that ticks up at the rate of the processor clock, so it should offer sufficient resolution for our needs. Traditionally the x86 instruction RDTSC is used to read its value. However, given the nature of modern CPUs - varying clock speeds, hyperthreading, deep pipelines, and the fear of cache/timing attacks - it's more complicated than that. On Windows, Microsoft suggests using the `QueryPerformanceCounter` function, and on Linux `clock_gettime(CLOCK_MONOTONIC)` - but note these are both system calls (more on those later). Their resolution is also hardware dependent: on reasonably recent hardware (within the last 10 years) the underlying source may be an HPET, or a modern incarnation of the TSC. Either way, these benchmarks are going to yield different results on different hardware and different operating systems, even when running identical code.
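For the curious, reading the TSC directly from Rust is a one-liner via the stable intrinsic. Here's a minimal sketch (my own, not from the benchmark code), x86-64 only:

```rust
fn read_tsc() -> u64 {
    // SAFETY: _rdtsc has no memory-safety preconditions; the caveats are
    // about interpretation (frequency scaling, out-of-order execution),
    // not soundness. x86-64 only.
    unsafe { std::arch::x86_64::_rdtsc() }
}

fn main() {
    let start = read_tsc();
    let work: u64 = std::hint::black_box((0..1_000).sum());
    let end = read_tsc();
    println!("sum = {work}, took ~{} reference cycles", end - start);
}
```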
The takeaway: timing short-duration events is difficult. If in doubt, run enough iterations of your event that the total elapsed time reaches milliseconds, and then whatever timing source your benchmarking suite relies on should lead to an accurate result.
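In Rust, that batching pattern looks something like this (a sketch; `do_ping_pong` is a hypothetical stand-in for whatever you're measuring):

```rust
use std::time::Instant;

// Hypothetical stand-in for one round trip of the operation under test.
fn do_ping_pong() -> u32 {
    42
}

fn main() {
    const ITERS: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        // black_box stops the compiler optimising the work away.
        std::hint::black_box(do_ping_pong());
    }
    // The total elapsed time is comfortably in the milliseconds-to-seconds
    // range, so dividing down gives a trustworthy per-operation average.
    println!("avg: {:?} per op", start.elapsed() / ITERS);
}
```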

Benchmarking

This was an opportunity for me to try out Divan for benchmarking. It's a comfy benchmarking tool, with the goal of being more ergonomic than Criterion. I'll hold off on offering judgment as I haven't used it a whole lot, but it seems to do what I need it to do. For each approach we will run many ping-pong cycles and report the average time per round trip. Fortunately, Divan gives us the tools for that, allowing us to annotate the benches with how many operations were executed and then produce averages based on that (with some caveats).
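As a sketch of what that annotation looks like (assuming the `divan` crate as a dev-dependency; the real bench bodies are in the linked source):

```rust
fn main() {
    // Runs all #[divan::bench] functions registered in this binary.
    divan::main();
}

#[divan::bench]
fn ping_pong(bencher: divan::Bencher) {
    const CYCLES: u32 = 1_000;
    bencher
        // Tell Divan each timed closure covers CYCLES operations, so it
        // reports a per-operation average rather than per-closure.
        .counter(divan::counter::ItemsCount::new(CYCLES))
        .bench(|| {
            for _ in 0..CYCLES {
                // one ping-pong round trip goes here
            }
        });
}
```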

Approach 1: Pipes

This is the first thing that comes to mind to connect processes on the same machine. Like `cat | grep`, we'll just connect `stdout` of the producer to `stdin` of the consumer, and vice-versa. This works on Windows and Linux, and presumably macOS. The consumer process reads five bytes from `stdin` into an array, checks whether they equal "ping" followed by a newline, and then responds appropriately. It'll also respond to "pong". The producer process is a little more complex, as it has to spawn the consumer and wire up its pipes first, but it pushes out a "ping", waits for a response, and panics if the reply isn't "pong". Aside from some fiddly `ref mut` treatment for the pipes, this was pretty easy to write.
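The consumer's loop might look roughly like this - a sketch reconstructed from the description above rather than the post's exact code (what to send back for a received "pong" is my guess):

```rust
use std::io::{self, Read, Write};

fn main() -> io::Result<()> {
    let mut stdin = io::stdin().lock();
    let mut stdout = io::stdout().lock();
    let mut buf = [0u8; 5]; // exactly "ping\n" or "pong\n"
    loop {
        stdin.read_exact(&mut buf)?;
        match &buf {
            b"ping\n" => stdout.write_all(b"pong\n")?,
            b"pong\n" => stdout.write_all(b"ping\n")?, // assumption: reply with the opposite
            other => panic!("unexpected message: {other:?}"),
        }
        stdout.flush()?; // pipes are buffered; push the reply out immediately
    }
}
```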

Approach 2: TCP

A natural approach would be to try a client and server connected via HTTP. This felt dangerously like benchmarking HTTP servers though, so instead, I just went straight to TCP. All in all, this was fairly simple. Currently "ping" is written to the socket, copied off, and then checked. "Pong" is then written back.
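One cycle on the producer side could look like this (a sketch; the port and the `set_nodelay` call are my assumptions, though disabling Nagle's algorithm is the usual trick for small-message latency):

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Assumed address; the consumer is listening here.
    let mut stream = TcpStream::connect("127.0.0.1:9000")?;
    // Don't let Nagle's algorithm hold back our tiny writes.
    stream.set_nodelay(true)?;
    let mut buf = [0u8; 4];
    stream.write_all(b"ping")?;
    stream.read_exact(&mut buf)?;
    assert_eq!(&buf, b"pong");
    Ok(())
}
```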

Approach 3: UDP

Naturally, the next approach was to try UDP. UDP is traditionally used in these contexts as a "fire and forget" mechanism. Unlike TCP, the protocol doesn't offer a way of recovering lost or out-of-order packets. This can be an advantage because it keeps the connection from getting too "chatty", but if delivery guarantees are important, those layers need to be implemented manually - either in or out of band.
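The standard library keeps the producer side just as short (again a sketch with assumed addresses):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind to an ephemeral local port; the consumer's address is assumed.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.connect("127.0.0.1:9001")?;
    let mut buf = [0u8; 4];
    socket.send(b"ping")?;
    // If the datagram is lost this blocks forever - a real setup would
    // set a read timeout and retry, implementing reliability by hand.
    socket.recv(&mut buf)?;
    assert_eq!(&buf, b"pong");
    Ok(())
}
```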

Approach 4: Shared Memory

Shared memory is well known as a fast way of sharing data between processes. One process allocates a block of memory and passes a handle to it to another process. Each process is then free to read from or write to that block independently. If your first instinct is to fear synchronization bugs and race conditions, you'd be absolutely correct. What's worse, out of the box Rust doesn't help us here, despite usually being very helpful with exactly this kind of thing. We're on our own, and it's going to be `unsafe`.
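To make that concrete, here's a minimal sketch of one way to do it - a single atomic byte used as a mailbox, via the `shared_memory` crate. The post's actual layout and synchronization scheme may well differ:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use shared_memory::ShmemConf;

const PING: u8 = 1;
const PONG: u8 = 2;

fn main() {
    // Producer: create a tiny region; the consumer opens the same file
    // link and spins, replacing each PING with a PONG.
    let shmem = ShmemConf::new()
        .size(1)
        .flink("ping_pong.shm")
        .create()
        .expect("failed to create shared memory");
    // SAFETY: both processes must agree to touch this byte only through
    // atomic operations - Rust can't check anything across the process
    // boundary for us.
    let flag = unsafe { &*(shmem.as_ptr() as *const AtomicU8) };
    for _ in 0..1_000_000 {
        flag.store(PING, Ordering::Release);
        while flag.load(Ordering::Acquire) != PONG {
            std::hint::spin_loop();
        }
    }
}
```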

Results

I've added results for Windows and Linux, but take these with a significant grain of salt as they were run on different machines. It's probably fair to compare approaches within a platform, though. The time per operation is similar for most of the approaches, apart from shared memory. With shared memory, we can perform a ping-pong round trip in under 200ns, or around 1000 processor cycles. I have to admit, I still found this a little disappointing - moving a few bytes around should be faster than that - but I'm going to resist digging too deep yet. Preparing an environment with core pinning and the correct thread priority is tricky, and given we have to do this with two concurrently running processes, it's even more difficult.

Conclusion

I was surprised at how similarly most things performed. I did a cursory investigation into Linux-specific approaches like `dbus` and Unix Domain Sockets, but they seemed to be in the same ballpark as the non-shared memory approaches. The only other thing to try would be memory-mapped files, but I thought I'd save that for when I wanted to try something similar with larger blocks of data.
If I had to do this in production, for the majority of workloads I'd probably still use an HTTP / TCP connection. It's portable, handles message failures reliably, and I could split it across machines if need be. However, for the cases where latency really matters, the maintenance overhead of shared memory is worth it.

Remember these 3 key ideas for your startup:

  1. Low-Latency Communications: When designing products that require fast and efficient data transfer between processes or services, **shared memory** can be the most effective solution, especially for latency-sensitive applications. Ensure that you allocate ample resources for developing and maintaining this communication method to avoid synchronization issues.

  2. Benchmarking and Timing: Always perform detailed benchmarking and testing to ensure the reliability and efficiency of inter-process communication methods. This includes understanding the complexities of system calls and clock synchronization to provide accurate performance metrics. Tools like Divan can assist in this process, providing ergonomic solutions for benchmarking.

  3. Consider Scalability and Maintenance: While low-level efficient communication methods like shared memory offer excellent performance, consider the maintenance overhead and scalability when used in production environments. Easy-to-implement solutions like TCP/HTTP might be more suitable for general use cases.
For anyone who wants a deeper dive, or to offer critiques and improvements, the code is available at the original source.

