Site Reliability Engineering at Vortexa

A look into how we use site reliability engineering to keep data trustworthy, recent, and accessible.

16 September, 2020
Vortexa Analysts

Vortexa provides the most complete data on global waterborne oil and gas flows, harnessing powerful near-real-time analytics that, for the first time, give the energy market full transparency.

We apply a range of practices to make sure this data can be trusted. In this article, we'll focus on how we, the Data Services Team, keep the data recent and always accessible by adopting the mindset of Site Reliability Engineering.

Being able to peek inside every tanker vessel in the world, or to understand when a ship diverts due to external factors rather than simply providing an automated en-route destination update, is hard. To do that and more, we maintain over 200 software components and more than 20 machine learning models. To keep things reliable, we combine operational skills with software engineering, which is how Site Reliability Engineering can be defined:

SREs are like DevOps, but with an in-depth understanding of the software they operate.

Understanding

Understanding the purpose of each software component and its high-level design makes keeping the system reliable much easier. Knowing whether a component is a ship-to-ship transfer detection model or an internal report, and what will happen if it is, for example, restarted, helps us make informed and measured decisions.

To make the task of learning the system easier, we in the Data Services Team broadly split the system into two parts:

  • things that keep the data recent (e.g. various data processing pipelines and producers)
  • things that keep the data accessible (e.g. APIs, web frontends, the Excel add-in)

This split alone allows for better incident response because it makes the impact easier to judge: if an API fails, the negative impact is immediate; if a scheduled job fails, the impact is generally a gradual degradation of the data over time.
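
To make this concrete, here is a minimal, hypothetical sketch of that classification in Python; the component examples and triage messages are illustrative, not an inventory of Vortexa's actual services.

    from enum import Enum

    class Category(Enum):
        """The two broad halves of the system, as described above."""
        KEEPS_DATA_RECENT = "recent"          # pipelines, producers, scheduled jobs
        KEEPS_DATA_ACCESSIBLE = "accessible"  # APIs, web frontends, Excel add-in

    def incident_urgency(category: Category) -> str:
        """Rough triage rule: accessibility failures hurt immediately,
        freshness failures degrade the data gradually."""
        if category is Category.KEEPS_DATA_ACCESSIBLE:
            return "page now: clients notice within seconds"
        return "investigate soon: data degrades gradually over time"

    print(incident_urgency(Category.KEEPS_DATA_ACCESSIBLE))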

Monitoring

To know whether things are running smoothly, you need to monitor them. We use a combination of CloudWatch, Prometheus, and Grafana to achieve that. In fact, monitoring became so prevalent at one point that we were overwhelmed with noise. That's where having a good system for classifying how essential each component is proves invaluable. Some parts of the system could be taken offline for a day or two without any noticeable impact on clients, while an outage of a few seconds in other parts would be noticed quickly. To distinguish between failures of essential and non-essential components, we route their alerts to separate, high-signal Slack channels.
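
As an illustration of that essential vs non-essential split, a component could expose Prometheus metrics labelled with its criticality, so alerting rules can route failures to the right channel. The metric, label, and component names below are hypothetical, not Vortexa's actual ones.

    from prometheus_client import Counter, start_http_server

    # Hypothetical metric: failed runs per component, labelled by how essential it is.
    PIPELINE_FAILURES = Counter(
        "pipeline_failures_total",
        "Number of failed runs per component",
        ["component", "criticality"],  # criticality: "essential" or "non-essential"
    )

    # Expose /metrics for Prometheus to scrape; alerting rules can then send
    # "essential" failures to a high-signal Slack channel and the rest elsewhere.
    start_http_server(9100)

    # Record a failure of an essential component.
    PIPELINE_FAILURES.labels(component="sts_detection", criticality="essential").inc()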

Codifying the knowledge

Fixing production issues used to be very distinct from software engineering: developers would frantically run various commands to bring the system back, and once the incident was resolved, the knowledge of the fix would stay in their heads. The operator pattern helps break down that knowledge silo. In its shortest form, it can be explained as:

That thing that you did to fix production? Write code to do it for you next time.

It sounds like it would be complex to implement in practice, but there is a surprising number of simple things you can do to keep a distributed system operational, starting with the humble automated retry:

A majority of failures in a distributed system can be fixed with a retry

Network errors, hardware failures, spot instances being taken away from you – all of these tend not to be a problem the second time around.
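
A minimal sketch of such an automated retry in Python, assuming exponential backoff with jitter; the function, parameter, and call names are illustrative rather than Vortexa's actual tooling.

    import random
    import time

    def retry(func, attempts=3, base_delay=1.0,
              retriable=(ConnectionError, TimeoutError)):
        """Call func(), retrying transient failures with exponential backoff and jitter."""
        for attempt in range(attempts):
            try:
                return func()
            except retriable:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the error to a human
                # Back off exponentially, with jitter to avoid thundering herds.
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

    # Usage: wrap a flaky network call (fetch_vessel_positions is hypothetical).
    # positions = retry(lambda: fetch_vessel_positions())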

Advising

There are short-term mitigations and there are long-term solutions. As the Data Services Team, we have developed a good understanding of the entire Vortexa ecosystem. We also track a variety of metrics, such as:

  • Mean time between failures: how 'intense' a problem period is. A good proxy for stress levels in difficult times.
  • Failure frequency: how often we need to intervene to keep a piece of Vortexa running. A good proxy for the operational workload.
  • Failure count / invocation count: how much we can trust a particular service. A good proxy for overall reliability.

Based on these metrics, we can spot the worst offenders and work closely with the other development teams to advise them on long-term solutions.
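
For illustration, a sketch of how these metrics could be computed from a list of failure timestamps; the numbers below are made up, and in practice the inputs would come from our monitoring stack rather than being hard-coded.

    from datetime import datetime, timedelta

    # Illustrative inputs: when a service failed, and how often it ran.
    failures = [
        datetime(2020, 9, 1, 3, 15),
        datetime(2020, 9, 3, 14, 0),
        datetime(2020, 9, 10, 22, 45),
    ]
    invocations = 1200  # total runs over the same period

    gaps = [later - earlier for earlier, later in zip(failures, failures[1:])]
    mtbf = sum(gaps, timedelta()) / len(gaps)                # mean time between failures
    window_days = (failures[-1] - failures[0]).days or 1
    failure_frequency = len(failures) / window_days          # interventions per day
    failure_rate = len(failures) / invocations               # failures per invocation

    print(f"MTBF: {mtbf}, failures/day: {failure_frequency:.2f}, "
          f"failure rate: {failure_rate:.2%}")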

Final thoughts

If putting your software engineering skills to use in operations is something you enjoy, or if you are a DevOps engineer who loves understanding the software they're automating, don't hesitate to drop us a line at careers@vortexa.com.

Vortexa Analysts