Outlier Detection
A practical approach to flagging anomalies in streaming signals
This is part one of a two-part series. Part one covers outlier detection in one-dimensional signal streams. Part two covers power law noise whitening, which can serve as a useful preprocessing step before detection.
This topic assumes familiarity with basic statistics (mean, variance, quartiles). No prior DSP knowledge required.
A moth’s tympanal organ is a simple threshold detector: it fires when it senses the ultrasonic calls of an approaching bat, triggering evasive maneuvers (Roeder 1962). No spectral analysis, no pattern matching: just “is this signal abnormally loud?” The streaming outlier detectors in this topic work on the same principle: flag values that exceed a robust threshold, with minimal computation and no memory of the full signal history.
Unusual things stick out. Whether it is a sudden spike in a sensor reading, an unexpected pause in network traffic, or a single data point far from the rest. Something in us immediately flags it as worth attention. The identification of rare or unexpected observations is relevant to almost every area of measurement and monitoring, and can be considered a prerequisite for making reliable inferences from data.
Although humans have an intuitive feel for outliers, the concept is genuinely hard to define precisely. Depending on the type of data, its context, and the observer’s intentions, the problem is variously called outlier detection, anomaly detection, change detection, or novelty detection. An outlier might arise from measurement variability, experimental error, or from a genuinely different underlying process. In general terms, outliers are observations that deviate significantly from the bulk of the data. But what counts as significant, and significant relative to what, is exactly what we need to pin down.
This discussion focuses on one-dimensional digital signal processing (DSP): univariate, time-dependent data. We want to detect outliers in a data stream (a time series of samples) and do so in something close to real time.
Defining normality from descriptive statistics
One practical approach assumes that signal values originate from a single distribution, or a mixture of distributions. If we can characterise that distribution statistically, we can define normality and flag values that fall far outside it.
Useful univariate descriptive statistics include:
- Quartiles divide the data into four equal parts. \(Q_1\) is the value below which 25% of samples fall, \(Q_2\) (the median) divides the data in half, and \(Q_3\) is the value below which 75% of samples fall.
- Measures of central tendency: the arithmetic mean (\(\mu\)), median (\(Q_2\)), and geometric mean. The geometric mean is most appropriate for data expressed as ratios or growth rates, and is only defined for positive values.
- Measures of spread: standard deviation (\(\sigma\)), coefficient of variation (CV), interquartile range (IQR), and median absolute deviation (MAD). CV is the ratio of \(\sigma\) to \(\mu\). IQR is \(Q_3 - Q_1\), the spread of the middle half of the data. MAD is the median of the absolute deviations from the data’s median.
A note on terminology: in mathematical statistics, the Gaussian distribution is also called the normal distribution. We deliberately avoid that term here, since “normality” is exactly what we are trying to define for the purposes of outlier detection. Calling the reference distribution “normal” risks circular reasoning.
For a Gaussian distribution, the median equals the mean, \(Q_1 \approx \mu - 0.6745\sigma\) and \(Q_3 \approx \mu + 0.6745\sigma\), and \(\sigma \approx 1.4826 \cdot \text{MAD}\). The relationship between these statistics changes significantly for other distributions: for a skewed Beta distribution, for instance, the mean, median, and quartiles can be spread quite differently.
Threshold-based detection
Given a description of the distribution, we can define fences, thresholds beyond which a value is flagged as an outlier.
Three-sigma rule. For a zero-mean Gaussian process, roughly 4.6% of samples fall outside \(\mu \pm 2\sigma\), and only 0.3% fall outside \(\mu \pm 3\sigma\). Designating samples beyond three sigma as outliers is a common starting point:
\[\text{outlier if } x \notin [\mu - 3\sigma,\; \mu + 3\sigma]\]
Tukey’s fences. Based on the IQR, which is more robust than \(\sigma\) to heavy tails and non-Gaussian distributions. Values are flagged as outliers if they fall outside:
\[[Q_1 - 1.5 \cdot \text{IQR},\;\; Q_3 + 1.5 \cdot \text{IQR}]\]
Note that for some distributions (the Cauchy distribution, also called the Lorentzian in physics, being a notable example) the mean and variance are not even defined, making \(\sigma\)-based methods inappropriate. IQR and MAD are safer choices in practice.
MAD-based rule. Similar in structure to the three-sigma rule, but with \(\sigma\) estimated robustly from the MAD:
\[\sigma \approx k \cdot \text{MAD}\]
where \(k\) depends on the assumed distribution. For Gaussian data, \(k \approx 1.4826\). This gives fences at \(\text{median} \pm k \cdot \text{MAD}\), and is more resistant to the influence of outliers on the estimate of spread, which is exactly what you want when trying to detect outliers.
Estimating statistics from a sliding window
For a streaming signal, we cannot know the global distribution in advance. We need to estimate statistics locally, updating them as new samples arrive. A natural approach is a sliding window (or circular buffer): keep the last \(N\) samples in memory and compute statistics over them.
See detector.py for the implementations. The key classes are:
OutlierDetector: IQR-based sliding windowOutlierDetectorMAD: MAD-based sliding windowOutlierDetectorFrugalMAD: recursive frugal estimator (see below)
One subtlety worth noting: the current sample \(x\) should be tested against fences computed from the previous buffer state, before appending. Otherwise the sample influences its own outlier test, biasing fences slightly outward and reducing sensitivity. The effect is small for large windows but can matter when \(N\) is small.
A memory-efficient alternative: recursive estimation
Sliding windows are simple and interpretable, but they require storing \(N\) samples. For resource-constrained systems, or very large \(N\), this can be a problem.
An interesting alternative is an adaptation of the Frugal-1U-Median algorithm (Ma, Muthukrishnan, and Sandler 2014), which estimates the median using a single memory cell. The algorithm drifts toward the new sample by a small step if the sample exceeds the current estimate, and drifts away if it does not. We apply the same idea to both the median and the MAD.
The step size is tied to the MAD estimate itself: as the algorithm converges, the step size shrinks, giving finer adjustments around the settled estimate. Despite using only two memory cells (median and MAD), after convergence the results are comparable to a 100-sample sliding window, a surprisingly good trade-off.
The main limitation is convergence time. During the warm-up period the estimates can be unreliable, and the choice of the initial step size is somewhat arbitrary. The \(k\) multiplier also needs more tuning for non-Gaussian signals, because the frugal estimator does not know the underlying distribution.
Summary
Three practical outlier detectors for streaming one-dimensional signals, all based on robust statistics (IQR or MAD) rather than mean and standard deviation:
| Method | Memory | Latency | Notes |
|---|---|---|---|
| IQR sliding window | \(O(N)\) | Low | Simple, slight self-contamination bias |
| MAD sliding window | \(O(N)\) | Low | Less biased, noisier fences |
| Frugal MAD | \(O(1)\) | Low | Minimal memory, needs tuning, slow convergence |
Detection is only part of the problem. Once an outlier is flagged, you still need to decide what to do with it: discard it, replace it with an interpolated value, or trigger an alert. That is a separate problem not covered here, but it is worth designing for from the start.
In practice, the signal arriving at the detector may not be white noise; it may exhibit correlated samples and a non-flat power spectrum. This can cause any of the above methods to behave poorly. Part two discusses power law noise, how to characterise it, and how to whiten it before detection.