Thursday, 16 January, 2020
It is easy to be annoyed by strange anomalies when they are sighted within otherwise clean (or perhaps not-quite-so-clean) datasets. This annoyance is immediately followed by eagerness to filter them out and move on. Even though having clean, well-curated datasets is an important step in the process of creating robust models, one should resist the urge to purge all anomalies immediately — in doing so, there is a real risk of throwing away valuable insights that could lead to significant improvements in your models, products, or even business processes.
So what exactly do I mean by “data anomalies”? There is no single definition of what constitutes an anomaly, as it depends both on the nature of the data and on one’s understanding of the processes generating that data (i.e., anomaly is in the eye of the beholder). Anomalies are essentially patterns that deviate significantly from the expected behaviour, leading one to believe that there’s either (1) an error somewhere or (2) a new, unknown cause for the observed deviation. Either possibility should give one pause before hitting the delete button and moving on. If it’s an error, is it random, inconsequential noise or a systematic issue somewhere in the process? Could the underlying reason be causing other, less visible issues in the data? If it’s not an error but a new phenomenon, what are its implications? Does it herald a new trend in the market which the business would otherwise miss out on? If some of these questions could apply to your data, then anomalies may actually be valuable and deserve to be examined with due care.
At Vortexa we obtain vessel and cargo data from multiple sources in order to generate the most complete view into waterborne oil flows around the world. As in other industries, data quality can vary considerably across different sources, and thus to avoid the infamous GIGO (garbage in, garbage out) we have set up a process to clean and curate each training dataset used by our Machine Learning models. In this post, I describe some lessons we’ve learned as we’ve grappled with some anomalies in our datasets.
Anomalies can be detected using model-free or model-based approaches. Model-free methods rely on a distance metric to identify samples that are “far away” in some sense from other observations within a dataset. Some examples of model-free methods are clustering, nearest-neighbour, and information-theoretic approaches. These methods do not assume a particular structure or distribution in the data, other than the existence of groups of points that are relatively close to one another (clusters) and points that do not seem to belong to any cluster (anomalies). In contrast, model-based methods are based on a set of assumptions about the process generating the data. I will focus on model-based anomaly detection for the remainder of this post.
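For contrast with what follows, here is a minimal sketch of a model-free, nearest-neighbour style score, using only the Python standard library. The function name `knn_distance_scores`, the choice of k, and the toy 2-D points are all made up for illustration; this is one simple way to express "far away from other observations", not a description of any particular production method.

```python
import math

def knn_distance_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour.

    Larger scores mean the point sits further away from its neighbours.
    """
    scores = []
    for i, p in enumerate(points):
        # Euclidean distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy data (hypothetical): two tight clusters plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.05, 5.1),
    (2.5, 9.0),
]
scores = knn_distance_scores(points, k=3)

# Index of the point with the largest k-th neighbour distance.
most_isolated = max(range(len(points)), key=scores.__getitem__)
```

Note that nothing here assumes a distribution for the data; the only ingredients are a distance metric and a choice of k, both of which would need tuning on real data.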
Let’s start by looking at a classic textbook example of a model-based anomaly detector. In this example, our observations are univariate real numbers which we represent as the variable x. If we assume that x is generated as independent random samples from a normal distribution with mean μ and standard deviation σ (i.e., x ∼ N(μ, σ²)), then we can define as anomalous all observations that are more than 3 standard deviations away from the mean (i.e., |x-μ| > 3σ). If our assumption is correct, the probability of any single observation being flagged by chance is about 0.27% (less than 0.3%). If the fraction of flagged observations turns out to be significantly larger than this, we have strong evidence that at least some of them were generated by a different kind of process than the one represented by our model, and that they need further investigation.
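The rule above fits in a few lines of standard-library Python. This is only a sketch under stated assumptions: μ and σ are estimated from a reference sample that is presumed to be clean, and the function name `three_sigma_anomalies` and the simulated data are invented for the example.

```python
import math
import random
import statistics

def three_sigma_anomalies(reference, new_points, k=3.0):
    """Return the new points lying more than k standard deviations
    from the mean of the (assumed clean) reference sample."""
    mu = statistics.fmean(reference)
    sigma = statistics.stdev(reference)
    return [x for x in new_points if abs(x - mu) > k * sigma]

# Probability that a normal variable falls more than 3 sigma from its mean:
# P(|Z| > 3) = erfc(3 / sqrt(2)), roughly 0.0027.
tail_prob = math.erfc(3 / math.sqrt(2))

# Simulated reference data (hypothetical): normal with mean 10, std dev 2.
rng = random.Random(0)
reference = [rng.gauss(10.0, 2.0) for _ in range(10_000)]

new_points = [9.5, 10.2, 17.0, 3.1, 25.0]
anomalies = three_sigma_anomalies(reference, new_points)

# Fraction of the reference sample itself that the rule would flag;
# under the normality assumption this should be close to tail_prob.
flagged_fraction = len(three_sigma_anomalies(reference, reference)) / len(reference)
```

One practical caveat: if μ and σ are estimated from data that already contains extreme values, those values inflate σ and can mask themselves, which is one reason to estimate the parameters from a curated sample.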
Machine Learning methods can be used to build efficient anomaly detectors. Assuming that one starts with a curated, anomaly-free training dataset D comprising data points (xᵢ, yᵢ), where xᵢ are feature vectors and yᵢ are class labels, supervised learning methods such as logistic regression, Bayesian networks, and neural networks (among many others) can be used to estimate P(y|x), the conditional probability distribution of class labels given a set of features. This estimated distribution will reflect the patterns in D as well as the underlying assumptions of the chosen supervised learning algorithm. The resulting model can be used to detect potential anomalies among new, unseen data points (xᵢ’, yᵢ’) by checking for samples that carry an unlikely class label. In other words, for a given probability threshold τ, anomalies are defined as data points whose observed class label has a probability below the threshold: P(y=yᵢ’ | xᵢ’) < τ.
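To make the thresholding step concrete, here is a small standard-library sketch. As a stand-in for the classifiers listed above it uses a categorical naive Bayes model with Laplace smoothing (arguably the simplest Bayesian network); any model that outputs P(y|x) could take its place. The feature values, labels, function names, and the threshold τ = 0.05 are all invented for illustration and are not taken from any real dataset.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, alpha=1.0):
    """Fit a categorical naive Bayes model.

    rows: list of (features_tuple, label) from a curated training set.
    alpha: Laplace smoothing constant.
    """
    label_counts = Counter(y for _, y in rows)
    feat_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)           # feature index -> values seen in training
    for x, y in rows:
        for i, v in enumerate(x):
            feat_counts[(i, y)][v] += 1
            values[i].add(v)
    return label_counts, feat_counts, values, alpha

def predict_proba(model, x):
    """Return the estimated P(y | x) as a dict over the training labels."""
    label_counts, feat_counts, values, alpha = model
    total = sum(label_counts.values())
    joint = {}
    for y, n_y in label_counts.items():
        p = n_y / total  # prior P(y)
        for i, v in enumerate(x):
            # Smoothed estimate of P(x_i = v | y).
            p *= (feat_counts[(i, y)][v] + alpha) / (n_y + alpha * len(values[i]))
        joint[y] = p
    z = sum(joint.values())
    return {y: p / z for y, p in joint.items()}

def find_anomalies(model, new_rows, tau=0.05):
    """Flag rows whose observed label has estimated probability below tau.

    Labels never seen in training get probability 0 and are always flagged.
    """
    return [(x, y) for x, y in new_rows
            if predict_proba(model, x).get(y, 0.0) < tau]

# Hypothetical curated training set: (size, region) -> product label.
train = (
    [(("large", "gulf"), "crude")] * 40
    + [(("large", "north_sea"), "crude")] * 30
    + [(("small", "gulf"), "gasoline")] * 15
    + [(("small", "north_sea"), "gasoline")] * 25
)
model = fit_naive_bayes(train)

new_rows = [
    (("large", "north_sea"), "crude"),
    (("small", "north_sea"), "gasoline"),
    (("large", "gulf"), "gasoline"),  # label rarely fits these features
]
anomalies = find_anomalies(model, new_rows, tau=0.05)
```

The choice of τ trades off false alarms against missed anomalies, and it only makes sense if the predicted probabilities are reasonably well calibrated, so it is worth checking on held-out curated data before relying on it.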
Anomaly detection is an old problem in statistics, and a multitude of algorithms have been created over the years to address it, some of which are more appropriate in specific domains than others. The advantage of the model-based approach proposed above is that it can be readily applied if one has already built a classification model from a curated dataset. If, however, you do not have an anomaly-free training dataset, or your data does not contain categorical output labels, then you may try modifying the approach above (e.g., by using a density estimation method) or using a model-free approach.
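As one possible reading of the density estimation variant, the sketch below fits a one-dimensional Gaussian kernel density estimate to unlabelled data and flags new points that fall in low-density regions. The bandwidth, the threshold τ, the function name `gaussian_kde`, and the sample values are assumptions chosen for the example, not recommendations.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density

# Hypothetical unlabelled sample with two modes, around 10 and around 20.
sample = [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 20.2, 19.8, 20.0, 20.1]
density = gaussian_kde(sample, bandwidth=0.5)

tau = 0.01  # density threshold, an arbitrary choice for this toy data
new_points = [10.2, 15.0, 19.9]
low_density = [x for x in new_points if density(x) < tau]
```

In this toy sample the point 15.0 lies between the two modes, not far from the overall mean, so a single-Gaussian 3σ rule would not flag it, while the density estimate does. That illustrates how the choice of model shapes what counts as an anomaly.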
Diagnosing the underlying issue(s) causing the anomalies is the most valuable step in the clean-up process, but also the hardest. In some cases, it may require deep expertise in the industry or process generating the data, as well as a solid understanding of statistics and the assumptions inherent in your model. If you used the model-based approach proposed in the previous section, then all you know at this point is that the detected anomalies deviate from the patterns in the training dataset as captured by the supervised learning method. The next task is to understand what may be causing this deviation; there are several possibilities:
Author: Gustavo Santos, Head of Predictions and Marketing Modelling