Outlier analysis
Aggarwal (2017)
This is quite a long and dense book, so while I read the first seven chapters in detail, I put more time into some of the later chapters than others depending on how applicable they were to what I currently work on. These notes cover some of the main points from the first ten of the thirteen chapters (this corresponds to roughly the first 350 pages), leaving out spatial outlier detection, outliers in graphs (networks), and the final chapter on applications.
Chapter 1: Introduction
Outliers contrast with normal data points (or inliers). Some authors differentiate between weak outliers (noise) and strong outliers (anomalies).
Some outlier detectors, such as z-scores, are designed to detect extreme values, but these are only a subset of possible outliers. For example, in a dataset such as {1, 2, 2, 50, 98, 98, 99}, the value 50 would be an outlier even though it is not an extreme value.
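As a quick numerical check (a minimal sketch using numpy on the small illustrative dataset above), a z-score test misses the value 50 because it sits exactly at the mean:

```python
import numpy as np

# Small bimodal dataset: most values cluster near the two extremes,
# so 50 is an outlier even though it is not an extreme value.
x = np.array([1, 2, 2, 50, 98, 98, 99], dtype=float)

# Standard z-scores: (value - mean) / standard deviation.
z = (x - x.mean()) / x.std()
print(dict(zip(x, z.round(2))))
# The mean of the data is exactly 50, so the z-score of 50 is 0.0
# and an extreme-value test does not flag it at all.
```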
Contextual (as opposed to global) outliers are values that would be expected in some cases, but not after conditioning on other information. For example, snow might be expected in winter, but not in summer. Another example is linkage outliers in social networks, which connect nodes or subgraphs that wouldn't usually be connected.
Instance-based methods, also known as lazy learners or memory-based methods, do not construct a training model upfront. Instead, they assess each test instance by comparing it to the most relevant training data. Examples include k-nearest neighbor detectors and the Local Outlier Factor (LOF). While they are flexible and don't require a distinct training phase, instance-based approaches can be computationally expensive and are vulnerable to the curse of dimensionality.
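As an illustration of an instance-based detector, here is a minimal LOF sketch (assuming scikit-learn; the data and parameter values are made up):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Mostly "normal" points plus two manually placed outliers.
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6, 6], [-7, 5]]])

# LOF is a lazy learner: there is no explicit training model, and each
# point is scored by comparing its local density to that of its neighbours.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # larger = more anomalous
print(np.argsort(scores)[-2:])           # indices of the two most anomalous points
```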
Explicit generalization methods create a summarized model of normal data upfront and score each data point based on its deviation from this model, for example by constructing a linear hyperplane to describe normal data and measuring the distance of each point from this hyperplane as an outlier score. One challenge is the potential for overfitting, as the same dataset is used for both training and scoring, but this is usually only an issue if the model is very flexible.
Information-theoretic approaches look at the impact of removing a point on the compression size of data. They can be useful when quantifying coding costs is more convenient than directly measuring deviations.
Feature selection in outlier detection is challenging due to its (usually) unsupervised nature, though you can use kurtosis or remove features with low correlation (outliers tend to violate normal data dependencies). Subspace methods (discussed in chapter 5) can be used as an alternative.
Chapter 2: Probabilistic models
Gaussian mixture models assume data are generated from processes where each point belongs to one of a fixed number of Gaussian clusters, with parameters estimated using maximum likelihood. These can be viewed as probabilistic versions of clustering algorithms.
Using a single cluster results in the Mahalanobis method, which is useful for detecting extreme values. Other approaches include Tukey's method, which is based on the interquartile range.
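A minimal sketch of the single-cluster (Mahalanobis) case, assuming scikit-learn for the covariance estimate; the toy data are made up:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(1)
# Strongly correlated Gaussian data plus one point that breaks the correlation.
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=300)
X = np.vstack([X, [[2.5, -2.5]]])

# Fit a single Gaussian and score each point by its squared Mahalanobis
# distance from the mean.
cov = EmpiricalCovariance().fit(X)
scores = cov.mahalanobis(X)
print(np.argmax(scores))  # the appended point should receive the highest score
```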
In depth-based methods, extreme values are found based on iteratively removing points lying on the convex hull of the data. This can be computationally expensive for large datasets.
Deviation-based methods look at the impact of removing points on the variance; the reduction in variance achieved by removing a set of points is known as its smoothing factor.
Angle-based methods are based on the idea that points lying far from the centre of the data tend to see the remaining points within a smaller range of angles.
Limitations:
Difficulty determining the right assumptions
Potential for overfitting
Chapter 3: Linear models
Principal component analysis (PCA) identifies lower-dimensional subspaces that minimize reconstruction error after projection. PCA helps reveal directions of global correlation and is more stable in the presence of outliers compared to methods focusing on dependent variables.
For non-linear data, PCA can be extended using kernel methods, transforming data so that nonlinear structures align along linear hyperplanes in the transformed space.
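A sketch of PCA-based scoring via reconstruction error (assuming scikit-learn; the synthetic data and the choice of two components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Data lying close to a 2-dimensional subspace of a 5-dimensional space.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))
X[0] += 3.0  # perturb one point away from the subspace

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)

# Outlier score = squared distance between a point and its projection
# onto the principal subspace (the reconstruction error).
recon = pca.inverse_transform(pca.transform(Xs))
scores = ((Xs - recon) ** 2).sum(axis=1)
print(np.argmax(scores))  # the perturbed point should score highest
```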
One-class support vector machines construct models assuming all provided examples are normal, separating them from the origin in a feature space transformed using kernel functions. In the support-vector data description, data are enclosed in a hypersphere rather than being linearly separated from the origin.
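A sketch of a one-class SVM with an RBF kernel (assuming scikit-learn; the value of nu and the data are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, size=(300, 2))     # assumed to be (mostly) normal data
X_test = np.array([[0.1, -0.2], [5.0, 5.0]])  # one typical point, one far away

# nu roughly bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_test))            # 1 = inlier, -1 = outlier
print(ocsvm.decision_function(X_test))  # lower values = more anomalous
```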
Regression can also be applied in an unsupervised context by predicting each feature from all of the others and aggregating the errors over all of the features.
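A sketch of this attribute-wise regression idea (assuming scikit-learn; in practice you would probably want cross-validation and feature scaling rather than this bare-bones version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_outlier_scores(X):
    """Score each row by the total squared error when predicting each
    feature from all of the other features."""
    n, d = X.shape
    errors = np.zeros(n)
    for j in range(d):
        rest = np.arange(d) != j
        model = LinearRegression().fit(X[:, rest], X[:, j])
        errors += (X[:, j] - model.predict(X[:, rest])) ** 2
    return errors

rng = np.random.default_rng(4)
# Three highly correlated features plus one point that violates the correlation.
X = rng.normal(size=(300, 1)) @ np.ones((1, 3)) + 0.1 * rng.normal(size=(300, 3))
X[0] = [1.0, -1.0, 1.0]
print(np.argmax(regression_outlier_scores(X)))  # most likely index 0
```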
I'm not certain why neural networks were put into the chapter on linear models, but autoencoders learn compressed representations of data, with reconstruction error indicating potential outliers. These can work well if you have a lot of data and aren't concerned about interpretability.
As with linear regression, you can also apply replicator networks in which each attribute is predicted from all the others. Techniques like dropout training prevent overfitting by randomly dropping units during training, discouraging neurons from co-adapting.
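As a toy sketch of the autoencoder/replicator idea, here is a tiny network with a two-unit bottleneck, using scikit-learn's MLPRegressor as a stand-in for a proper deep autoencoder (no dropout or other refinements; the data and settings are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Data lying close to a 2-dimensional plane inside a 4-dimensional space.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(500, 4))
X[0] += np.array([2.0, -2.0, 2.0, -2.0])  # push one point off the plane

Xs = StandardScaler().fit_transform(X)

# Train the network to reproduce its own input through a narrow hidden
# layer; the reconstruction error is then used as the outlier score.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                  max_iter=2000, random_state=0).fit(Xs, Xs)
scores = ((Xs - ae.predict(Xs)) ** 2).sum(axis=1)
print(np.argmax(scores))  # the perturbed point should stand out
```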
Limitations:
Correlations may not be global
Potential for overfitting
Computational cost of some of the approaches on large datasets
Interpretability
Chapter 4: Proximity-based outlier detection
Proximity-based methods model outliers as isolated points based on similarity or distance functions. Clustering methods partition data points, while density-based methods partition the data space. Incorporating cluster size and averaging scores from multiple clusterings can improve performance (clustering approaches can be very sensitive to the algorithm used and initial conditions).
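A sketch of a simple clustering-based score: the distance to the nearest centroid, averaged over several clusterings to reduce sensitivity to initialisation (assuming scikit-learn; the number of clusters and runs are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_scores(X, k=2, runs=5):
    """Average distance to the nearest k-means centroid over several runs."""
    scores = np.zeros(len(X))
    for seed in range(runs):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores += km.transform(X).min(axis=1)  # distance to the closest centroid
    return scores / runs

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(8, 1, size=(150, 2)),
               [[4, 12]]])  # a point far from both clusters
print(np.argmax(clustering_scores(X)))  # most likely the appended point (index 300)
```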
Isolation forests recursively partition data to isolate instances, with outliers requiring fewer partitions due to their sparsity. These scale well and are better than many alternatives at handling high-dimensional data, but are not very interpretable.
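A minimal isolation forest sketch (assuming scikit-learn; the data and parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(500, 5)), [[6, 6, 6, 6, 6]]])

# Each tree isolates points using random axis-parallel splits;
# outliers tend to be isolated after fewer splits (shorter path lengths).
iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = -iso.score_samples(X)  # higher = more anomalous
print(np.argmax(scores))        # most likely the appended point (index 500)
```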
Limitations:
Difficulty distinguishing between anomalies and noise
Curse of dimensionality
Chapter 5: High-dimensional outlier detection
While this is challenging due to issues such as the curse of dimensionality, subspace-based methods (which look for outliers in lower-dimensional projections of the data) can help and are often more interpretable.
Chapter 6: Outlier ensembles
In sequential ensembles, models are applied one after another, with later models influenced by the results of earlier ones. The final score is either taken from the last model or computed as a weighted combination of the scores produced along the sequence.
Independent ensembles are based on multiple algorithms or multiple instantiations of the same algorithm. They are particularly useful in higher dimensions as they allow the exploration of multiple subspaces.
Variance-reduction techniques include feature bagging, subsampling, and isolation forests.
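A minimal feature-bagging sketch: run a base detector (LOF here) on random feature subsets and average the scores (assuming scikit-learn; the subspace sizes and score combination are simplified relative to the published algorithm):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_components=10, seed=0):
    """Average LOF scores over randomly chosen feature subspaces."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_components):
        k = rng.integers(d // 2, d + 1)          # random subspace size
        features = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=20).fit(X[:, features])
        scores += -lof.negative_outlier_factor_  # higher = more anomalous
    return scores / n_components

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, size=(300, 8)), [np.full(8, 4.0)]])
print(np.argmax(feature_bagging_scores(X)))  # most likely the appended point (index 300)
```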
In two-phase algorithms, obvious outliers are removed based on an algorithm trained on the entire dataset, then a more robust model is trained on the cleaned dataset.
In weighted bagging (or wagging), training examples are weighted randomly, while in randomized feature weighting, the features are weighted/scaled randomly to introduce diversity.
While ensembles tend to be more robust and offer superior performance, they may not always be worth the additional complexity and computational cost.
Chapter 7: Supervised outlier detection
Class imbalance can be addressed using cost-sensitive learning, where errors in classifying the anomalous class are penalized more heavily, or adaptive resampling techniques like SMOTE, which synthetically oversamples minority classes.
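A sketch of oversampling with SMOTE, assuming the imbalanced-learn package is available (the toy data are made up):

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package

rng = np.random.default_rng(9)
# 500 normal points and 15 labelled anomalies.
X = np.vstack([rng.normal(0, 1, size=(500, 4)), rng.normal(4, 1, size=(15, 4))])
y = np.array([0] * 500 + [1] * 15)

# SMOTE synthesises new minority-class points by interpolating between
# existing minority points and their nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # the minority class is oversampled to parity
```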
Alternatively, you can apply one-sided selection, where the normal class is undersampled. This may not hurt the model much provided that the remaining normal instances are representative, and the resulting computational savings make it feasible to train an ensemble over multiple subsamples.
Positive-unlabeled classification deals with settings where the negative class contains unknown proportions of outliers, requiring specialized approaches.
Chapter 8: Categorical, text, and mixed-attribute data
Feature normalization is important when dealing with mixed-attribute data containing both numerical and categorical attributes.

Conventional models define outliers as the points expressed least precisely by a fixed compression of the data, while information-theoretic models focus on the differential impact on the compression size when removing an outlier.
Chapter 9: Time series and multidimensional streaming
We often work in a semi-supervised context, where we know about some types of outliers but not about novelties that haven't been seen before. Sometimes, you may also see infrequently recurring outliers.
Collective anomalies (where there is a cluster of observations that look fine individually, but are unlikely to occur together) are particularly relevant here.
Algorithms should account for any seasonality.
Processes need to account for data drift and concept drift. A common approach is to work with windows over the stream, either fixed or sliding. The simplest option is tumbling windows, which are fixed-size and non-overlapping (e.g., 1:00-1:05, 1:05-1:10, etc.).
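A toy sketch of tumbling-window detection on a stream: statistics from the previous window are used to score points in the current one (the window size and threshold are arbitrary choices):

```python
import numpy as np

def tumbling_window_outliers(stream, window=100, threshold=4.0):
    """Flag points whose z-score relative to the previous tumbling
    window exceeds the threshold."""
    flagged, prev = [], None
    for start in range(0, len(stream), window):
        current = stream[start:start + window]
        if prev is not None:
            mu, sigma = prev.mean(), prev.std() + 1e-9
            z = np.abs((current - mu) / sigma)
            flagged.extend(int(i) for i in start + np.where(z > threshold)[0])
        prev = current  # the reference window "tumbles" forward each step
    return flagged

rng = np.random.default_rng(10)
stream = rng.normal(0, 1, size=1000)
stream[750] = 8.0  # inject a spike
print(tumbling_window_outliers(stream))  # most likely [750]
```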
Chapter 10: Discrete sequences
Sequence-generation processes are often modelled using Markov chains or Hidden Markov Models (HMMs), where states are hidden, and observations are generated by state transitions. The Viterbi algorithm can be used to determine the most likely sequence of hidden states underlying observed sequences.
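A compact Viterbi sketch for a toy two-state HMM with discrete observations, computed in log space (the transition and emission matrices are made up for illustration):

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete-observation HMM."""
    n_states, T = len(start_p), len(obs)
    logd = np.full((T, n_states), -np.inf)    # best log-probability so far
    back = np.zeros((T, n_states), dtype=int)
    logd[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            cand = logd[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(cand)
            logd[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
    # Backtrack from the best final state.
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two hidden states and two observable symbols; parameters are illustrative.
start = np.array([0.8, 0.2])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])  # trans[i, j] = P(next state j | state i)
emit = np.array([[0.9, 0.1], [0.3, 0.7]])   # emit[s, o] = P(symbol o | state s)
print(viterbi([0, 0, 1, 1, 1, 0], start, trans, emit))
```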
Alternatively, you can apply approaches based on frequent patterns or use neural sequence models such as RNNs or Transformers, which can be better at capturing long-range dependencies. When I tried modelling a binary sequence in practice, I found it easier to get good performance out of an RNN (specifically an LSTM) than an HMM.
We can distinguish between local and global sequence anomalies, with the former referring to anomalous subsequences and the latter to sequences that are anomalous as a whole. Pattern-based approaches may be more appropriate for detecting local anomalies and distance-based approaches for global ones.