Using Rust to corrode insane Python run-times
There are times when adopting a standard approach just isn’t good enough. This post is about making minimal changes for maximum effect where it matters.
The problem domain
At Vortexa we use Python extensively. Great for prototyping and great for data science, Python often makes its way into production, and although it is not the fastest of languages, it is often perfectly good enough.
Recently, however, we found that one particular Python task was taking 30 hours to run. Invoked whenever we want to recalculate our world due to model changes, this run time was impacting our QA feedback cycle, making it harder to bring model updates to production. If we could address just this, we would accelerate model improvements, bringing real benefits to the team and our customers.
Without too much irrelevant detail, the task in question is processing GPS signals from ships, and filtering them through a set of polygons before applying other algorithms.
So why is this code so slow?
No assumptions: the starting point has to be measuring it.
I created a copy of the code (copy/paste), adapted so I could process a small dataset and, later, compare different techniques. The code structure for this test still faithfully reproduced the processing load of the production code. I used the excellent pyinstrument module to drill down into what was going on in Python. To avoid distorting the results over such a short run, I started profiling only after all the one-off initialization work was complete.
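For illustration, the profiling harness looked something like this. It is a minimal sketch: run_filtering is a hypothetical stand-in for the real post-initialization workload.

```python
from pyinstrument import Profiler

def run_filtering():
    # Hypothetical stand-in for the real post-initialization workload
    sum(i * i for i in range(10_000_000))

profiler = Profiler()
profiler.start()   # start only once the one-off initialization is done
run_filtering()
profiler.stop()
print(profiler.output_text(unicode=True))
```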
Here are the results:
The units are in seconds.
The main method represents the whole post-initialization processing of the algorithm. The test_python method exercises the part of the logic I believed was slow; all the other logic is still present, however.
So out of 34.3 seconds of run-time, 29.8 seconds is in the filtering logic I alluded to earlier, and 25.1 seconds of that is in matplotlib doing point-in-polygon math. Here we go again?
So what’s wrong?
The test I ran used nearly 8M vessel positions. We are looking at a few hundred zones within the world, i.e. a few hundred polygons to filter with. We're using pandas, and the vessel positions sit in a dataframe, but we pass this dataframe to matplotlib separately for every polygon we want to test.
We're making hundreds of calls into a library, passing millions of records in each call. In production we're processing perhaps 2500x more data, so one begins to see where the 30-hour figure comes from.
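The shape of that hot loop is roughly as follows. This is a sketch with synthetic data rather than the production code, assuming matplotlib's Path.contains_points as the point-in-polygon call:

```python
import numpy as np
from matplotlib.path import Path

rng = np.random.default_rng(0)
positions = rng.uniform(-10.0, 10.0, size=(8_000_000, 2))  # lon/lat pairs
polygons = [rng.uniform(-10.0, 10.0, size=(6, 2)) for _ in range(300)]

# Hundreds of library calls, each one scanning all ~8M positions
hits = [Path(poly).contains_points(positions) for poly in polygons]
```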
What can we do about it?
Perhaps matplotlib isn’t the right tool for heavy processing in a production task? The code is already using pandas, so why not try geopandas? Then we can test all the polygons in one library call.
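Reworked that way, the test looked roughly like this (a sketch with hypothetical stand-in data; a single spatial join matches every point against every polygon):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical stand-ins for the vessel positions and zone polygons
df = pd.DataFrame({"lon": [0.5, 3.0], "lat": [0.5, 3.0]})
zones = gpd.GeoDataFrame(
    {"zone": ["dock_a"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
)

points = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))

# One spatial join replaces the per-polygon loop
matched = gpd.sjoin(points, zones, how="left", predicate="within")
```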
Here are the results:
Again, results in seconds.
So this is a bit of a disaster: we've increased our runtime by a factor of ten! Geopandas (and the libraries it in turn calls) uses 423 stack frames vs 5 for matplotlib, which I find stunning. The trace also shows that even the creation of the GeoDataFrames takes longer than the entire matplotlib-based processing.
So we have a choice here. We could:
1. Try breaking the data into chunks, and then using multi-processing (ugly in Python) to leverage a more powerful cloud virtual machine, sticking with matplotlib
2. Write a very small native custom library to do the math we want, using threads
The first approach could work, but it is unlikely to be very cost-effective: we'd just be running multiple copies of slow code in parallel. I decided to try the second option.
Planning our native library
Taking into account some lessons learnt in the earlier Java point-in-polygon experience, we can apply techniques such as:
- Avoid calling the library for every polygon, make one call per dataframe to reduce the library-call overhead massively
- Avoid heavyweight geometry libraries when the actual problem is very simple; the overhead hurts performance significantly
- Do bounding-box tests on every polygon
- Use integer-based math (faster than floating point), 32 bits where possible; this and the bounding-box test are sketched after this list
- Use threads
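To make that concrete, here is the core test sketched in Python; the real library implements the same logic in Rust. Coordinates are assumed to be pre-scaled to integer microdegrees, and the classic ray-casting comparison is rearranged to avoid division so it stays exact in integer math:

```python
def point_in_polygon(x, y, xs, ys, bbox):
    """Ray-casting test with a cheap bounding-box pre-check.

    x, y, xs, ys are integers, e.g. microdegrees (degrees * 1_000_000);
    bbox is (min_x, min_y, max_x, max_y) for the polygon.
    """
    min_x, min_y, max_x, max_y = bbox
    if x < min_x or x > max_x or y < min_y or y > max_y:
        return False  # most polygons are rejected here, cheaply

    inside = False
    j = len(xs) - 1
    for i in range(len(xs)):
        if (ys[i] > y) != (ys[j] > y):
            # Division-free edge-crossing test: exact in integer math
            cross = (xs[j] - xs[i]) * (y - ys[i]) - (x - xs[i]) * (ys[j] - ys[i])
            if (cross > 0) == (ys[j] > ys[i]):
                inside = not inside
        j = i
    return inside

# Unit square in microdegrees; the point (0.5, 0.5) degrees lies inside
xs = [0, 1_000_000, 1_000_000, 0]
ys = [0, 0, 1_000_000, 1_000_000]
print(point_in_polygon(500_000, 500_000, xs, ys,
                       (0, 0, 1_000_000, 1_000_000)))  # True
```

The intent is that the bounding-box check rejects most of the few hundred polygons after four comparisons, so the full ray cast runs only for a handful of candidates per position.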
It should be noted that Java is certainly not the answer here: Java's integration with Python is truly horrible. Let's not fight the Python ethos of native libraries.
Rust
Recently I’d been doing some experimental work with PyO3, which allows two-way Rust / Python integration. Here we’ll focus on Python importing and using a module implemented in Rust.
Here are the specifications:
- Implement a Python class in Rust
- Take an array of geojson strings in the constructor, our polygons
- Take numpy arrays of lat/lon coordinates from the vessel positions dataframe
- Return a numpy array of results (for easy integration) showing for each coordinate pair, which polygon (if any) was hit
The whole enchilada takes about 300 lines of Rust code, including documentation and unit tests, and it replaces approximately 30 lines of Python code (plus the matplotlib calls). PyO3 works well with the numpy and ndarray crates (Rust libraries), allowing easy integration with numpy arrays and thus pandas. We use rayon to parallelize the processing.
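From the Python side, the integration then collapses to something like this. The module and method names below are hypothetical; the real API simply follows the specification above.

```python
import numpy as np

# Hypothetical names for the Rust extension and its class; `df` is the
# vessel-positions dataframe and `geojson_polygons` the list of polygon strings
from geo_filter import PolygonFilter

pf = PolygonFilter(geojson_polygons)          # one-off: parse the polygons
lats = df["lat"].to_numpy(dtype=np.float64)
lons = df["lon"].to_numpy(dtype=np.float64)

# One call tests every position against every polygon, across threads,
# returning a numpy array of polygon indices (or a miss marker)
df["zone"] = pf.find_polygons(lats, lons)
```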
Does it work?
It does. Well, this blog post would be pretty boring otherwise, right?
The test data is the same.
“We have gone from 29.8 seconds of processing with matplotlib, to 2.9 seconds with Rust.”
Python used a single thread, and Rust eight threads (four hyperthreaded cores on an Intel i7, so call it 4-5x more compute available). This includes the Python overhead of plugging the results back into the pandas dataframe. Comparing the raw matplotlib calls against the Rust library calls shows a 24x improvement. The output has been examined and shown to be exactly the same.
Our new solution is, at the functional level (dataframe in, dataframe out), 10x faster. Upping the core count to 4 for code which ultimately runs in a Kubernetes cluster within AWS is perfectly reasonable when the goal is a faster QA cycle. Taking into account the post-filter algorithms, the Rust processing accounts for about 20% of the task's total running time, so there is little point adding further threads unless other parts of the task can benefit.
Production
This surgical, very specific code change is now in production.
“The process which used to take 30 hours now takes 6 hours: 500% of the original speed.”
This improvement isn’t just academic, and it isn’t just about decreasing processing costs.
“Our internal process for bringing model changes to customers, including QA, is now a day faster – every time.”
This is a considered, targeted optimisation.
One has to take into consideration that we have added a new technology here, effectively complicating the codebase and making it harder for someone to maintain the source repository. However, this is mitigated by keeping the new library very specific in scope. The business logic has not changed; only one method has. So long as point-in-polygon “just works” (and we have the unit tests to prove it), this new code causes no harm.
Vortexa
I have been working at Vortexa for over five years, and the exciting technical challenges keep coming down the firehose. I love this work. If you think you would too, be it Python, data science, Java, or even Rust, see our job vacancies in London, Singapore and Houston today.