Smoothing on Hardware

Moving average and exponential smoothing on ESP32-S3 and STM32F4

Smoothing is often the first DSP operation applied to sensor data on a microcontroller — before peak detection, zero-crossing analysis, or any downstream processing. The moving average and exponential moving average (EMA) are trivially cheap to implement and cover the majority of embedded smoothing needs.

For the theory (frequency response, Savitzky-Golay, kernel smoothing, trade-offs), see the main smoothing page.

Exponential moving average (EMA)

The EMA is the simplest recursive smoother and the most commonly used on microcontrollers. As written below it costs two multiplies and one addition (it can be factored to $y[n] = x[n] + \alpha\,(y[n-1] - x[n])$, which needs one multiply, one subtraction, and one addition), with a single state variable:

\[y[n] = \alpha \cdot y[n-1] + (1 - \alpha) \cdot x[n]\]

Note the convention: here $\alpha$ weights the previous output, so $\alpha$ near 1 means heavy smoothing, the opposite of the main page’s smoothing-factor $\alpha$ (which weights the new sample, small $\alpha$ = heavy smoothing). They relate by $\alpha_\text{here} = 1 - \alpha_\text{main}$.

typedef struct {
    float alpha;
    float state;
    int initialised;
} EmaFilter;

void ema_init(EmaFilter *f, float alpha) {
    f->alpha = alpha;
    f->state = 0.0f;
    f->initialised = 0;
}

float ema_process(EmaFilter *f, float x) {
    if (!f->initialised) {
        f->state = x;
        f->initialised = 1;
    } else {
        f->state = f->alpha * f->state + (1.0f - f->alpha) * x;
    }
    return f->state;
}

The time constant is $\tau = -1 / \ln(\alpha)$ samples. For $\alpha = 0.9$ at 100 Hz sample rate, $\tau \approx 9.5$ samples (95 ms). For $\alpha = 0.99$, $\tau \approx 100$ samples (1 s).

Tip

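On targets without a hardware FPU (e.g., Cortex-M0), the EMA can be implemented in fixed-point with a power-of-two scaling: $y[n] = y[n-1] - (y[n-1] \gg K) + (x[n] \gg K)$, where $\alpha = 1 - 2^{-K}$. This requires only shifts and additions, with no multiplications. Shifting $x[n]$ and $y[n-1]$ separately discards their low $K$ bits, so in practice the state is usually kept scaled by $2^K$ and shifted down only when the output is read.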
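A minimal sketch of the shift-only EMA, assuming unsigned 16-bit samples such as raw ADC counts. The `EmaFixed` type and its functions are hypothetical names introduced for illustration. The accumulator stores the output scaled by $2^K$, which is one common way to avoid losing the low bits of each sample.

```c
#include <stdint.h>

/* Hypothetical fixed-point EMA for unsigned samples (e.g. raw ADC counts).
 * acc holds the filter output scaled by 2^k, so alpha = 1 - 2^-k.
 * Assumes k <= 15 and 16-bit inputs so acc cannot overflow 32 bits. */
typedef struct {
    uint32_t acc;
    uint8_t  k;
    uint8_t  primed;
} EmaFixed;

static void ema_fixed_init(EmaFixed *f, uint8_t k) {
    f->acc = 0;
    f->k = k;
    f->primed = 0;
}

static uint16_t ema_fixed_process(EmaFixed *f, uint16_t x) {
    if (!f->primed) {
        f->acc = (uint32_t)x << f->k;   /* start at the first sample */
        f->primed = 1;
    } else {
        /* acc * (1 - 2^-k) + x, using one shift, one subtract, one add */
        f->acc = f->acc - (f->acc >> f->k) + x;
    }
    return (uint16_t)(f->acc >> f->k);  /* scale back down for the output */
}
```

With `k = 4` the filter behaves like $\alpha = 0.9375$; a larger `k` smooths more heavily but leaves less headroom in the 32-bit accumulator.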

Moving average with circular buffer

The $N$-point moving average requires storing the last $N$ samples. A circular buffer avoids shifting the entire array on every sample:

typedef struct {
    float *buffer;
    float sum;        // running sum for O(1) update
    int head;
    int N;
    int count;        // samples received so far (for startup)
} MovingAverage;

void ma_init(MovingAverage *f, float *buffer, int N) {
    f->buffer = buffer;
    f->sum = 0.0f;
    f->head = 0;
    f->N = N;
    f->count = 0;
    for (int i = 0; i < N; i++) buffer[i] = 0.0f;
}

float ma_process(MovingAverage *f, float x) {
    // Subtract the oldest sample, add the new one
    f->sum -= f->buffer[f->head];
    f->buffer[f->head] = x;
    f->sum += x;

    f->head++;
    if (f->head >= f->N) f->head = 0;

    if (f->count < f->N) f->count++;

    return (f->count > 0) ? f->sum / f->count : 0.0f;
}

The running-sum trick makes the moving average O(1) per sample regardless of window size: one subtraction, one addition, and one division. The per-sample cost is constant like the EMA's, although the division is typically slower than a multiply, and the filter requires $N$ floats of memory.

Floating-point drift

The running sum accumulates rounding errors over millions of samples. For long-running embedded systems (days or weeks), periodically recompute the sum from scratch: sum = 0; for (i = 0; i < N; i++) sum += buffer[i];. Once per second is sufficient and costs negligible CPU.
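The recompute step can be wrapped in a small function. This is a sketch with a hypothetical name, `ma_recompute_sum`; it takes the buffer and length directly rather than the filter struct, so it can be compiled on its own.

```c
/* Hypothetical helper: rebuild the running sum from the stored samples,
 * discarding any rounding error accumulated by incremental updates. */
static float ma_recompute_sum(const float *buffer, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += buffer[i];
    }
    return sum;
}
```

With the `MovingAverage` struct above, the call would be `f->sum = ma_recompute_sum(f->buffer, f->N);`. It is also safe during startup, because `ma_init` zeroes the slots that have not been written yet.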

STM32F4 (NUCLEO-F446RE): sensor noise reduction

A typical use case: smooth noisy ADC readings before threshold comparison or feature extraction.

#include "arm_math.h"

#define ADC_FS        1000   // 1 kHz ADC sample rate
#define MA_LEN        16     // 16-sample moving average (16 ms window)

static float ma_buf[MA_LEN];
static MovingAverage ma;
static EmaFilter ema;

void sensor_init(void) {
    ma_init(&ma, ma_buf, MA_LEN);
    ema_init(&ema, 0.995f);  // ~200 ms time constant at 1 kHz (tau = -1/ln(0.995) ~= 200 samples)
}

// Called from ADC DMA callback at 1 kHz
void process_adc_sample(uint16_t raw) {
    float voltage = (float)raw * 3.3f / 4096.0f;

    // Moving average for noise reduction
    float smoothed_ma = ma_process(&ma, voltage);

    // EMA for slow-varying baseline tracking
    float baseline = ema_process(&ema, voltage);

    // Use smoothed_ma for threshold detection,
    // baseline for drift compensation
}

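For CMSIS-DSP block processing, arm_mean_f32 computes the mean of a block. Calling it once per block gives one output per block, which is equivalent to a moving average evaluated on non-overlapping windows (and so also decimates by the block length):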

float32_t block[64];
float32_t mean;
arm_mean_f32(block, 64, &mean);
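To make "non-overlapping" concrete, here is a portable sketch of the same idea without CMSIS-DSP. `block_mean_decimate` is a hypothetical helper, not a library function; on a Cortex-M target the inner loop could be replaced by a call to `arm_mean_f32`.

```c
#include <stddef.h>

/* Hypothetical portable helper (not a CMSIS-DSP function): writes one mean
 * per complete block of `block` input samples and returns the number of
 * outputs. Trailing samples that do not fill a block are ignored. */
static size_t block_mean_decimate(const float *in, size_t n_in,
                                  size_t block, float *out) {
    if (block == 0) return 0;
    size_t n_out = n_in / block;
    for (size_t b = 0; b < n_out; b++) {
        float sum = 0.0f;
        for (size_t i = 0; i < block; i++) {
            sum += in[b * block + i];
        }
        out[b] = sum / (float)block;
    }
    return n_out;
}
```

Eight input samples with a block length of 4 produce two outputs, so the output rate is the input rate divided by the block length.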

ESP32-S3: temperature / accelerometer smoothing

A common pattern on ESP32-S3: smooth a sensor reading before displaying or transmitting via BLE.

static EmaFilter temp_filter;
static float ma_buf[32];
static MovingAverage accel_filter;

void sensor_task(void *param) {
    ema_init(&temp_filter, 0.998f);         // slow: ~5 s time constant at 100 Hz (tau = -1/ln(0.998) ~= 500 samples)
    ma_init(&accel_filter, ma_buf, 32);     // 32-sample window (320 ms at 100 Hz)

    while (true) {
        float temperature = read_temperature_sensor();
        float accel_z = read_accelerometer_z();

        float temp_smooth = ema_process(&temp_filter, temperature);
        float accel_smooth = ma_process(&accel_filter, accel_z);

        // Update BLE characteristics or display
        vTaskDelay(pdMS_TO_TICKS(10));  // ~100 Hz; the delay is relative, so loop time adds to the period (vTaskDelayUntil gives a fixed rate)
    }
}

Performance budget

Smoothing is negligible on any modern MCU:

Filter	Operations per sample	Cycles (est.)	Memory
EMA	2 multiply + 1 add	~5	4 bytes (state)
MA (N=16, running sum)	1 add + 1 sub + 1 div	~10	4N bytes (buffer) + 20 bytes (struct, 32-bit target)
MA (N=64, running sum)	Same	~10	4N bytes (buffer) + 20 bytes (struct, 32-bit target)

At 1 kHz sample rate on a 180 MHz Cortex-M4F, even a 64-point MA uses about 0.006% of the CPU budget (~10 cycles/sample × 1 kHz ÷ 180 MHz). The smoothing filter itself is almost never the bottleneck; the sensor read (I2C, SPI, ADC) usually dominates.
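The same budget arithmetic can be reused for other sample rates or clock speeds. The helper below is a hypothetical sketch; the cycle counts fed into it are still estimates, as in the table above.

```c
/* Hypothetical helper: percentage of CPU time spent in a per-sample routine,
 * given its estimated cost in cycles, the sample rate and the core clock. */
static double cpu_load_percent(double cycles_per_sample,
                               double fs_hz, double cpu_hz) {
    return 100.0 * cycles_per_sample * fs_hz / cpu_hz;
}
```

`cpu_load_percent(10.0, 1000.0, 180e6)` reproduces the figure above (about 0.006%); at a 48 kHz sample rate the same filter would still take only about 0.27%.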

When to use which

Criterion	EMA	Moving average
Memory	4 bytes (fixed)	4N bytes (scales with window)
Startup behaviour	Immediate (uses first sample)	Averages only the samples received so far until the window fills
Frequency response	First-order IIR (gradual rolloff)	Sinc-like (nulls at multiples of $f_s/N$)
Step response	Exponential rise	Linear rise over N samples
Best for	Baseline tracking, rate smoothing	Noise reduction, pre-filtering
Avoid when	Sharp cutoff needed	Memory-constrained and N is large
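The step-response row can be checked numerically. The sketch below is a simplification: it assumes both filters start from a zero state and that the moving average always divides by $N$, which differs from the startup handling in the code earlier on this page. The function names are hypothetical.

```c
/* Unit-step response after n samples, both filters starting at rest (zero). */

/* EMA: exponential rise towards 1, equal to 1 - alpha^n. */
static float ema_step_response(float alpha, int n) {
    float y = 0.0f;
    for (int i = 0; i < n; i++) {
        y = alpha * y + (1.0f - alpha) * 1.0f;
    }
    return y;
}

/* Moving average with a fixed divisor: linear rise that reaches exactly 1
 * once the window holds `window` samples of the step. */
static float ma_step_response(int window, int n) {
    int filled = (n < window) ? n : window;
    return (float)filled / (float)window;
}
```

With `alpha = 0.5` the EMA reaches 0.5 and then 0.75 after one and two samples and never quite reaches 1, while a 4-point moving average reaches 0.5 after two samples and settles at exactly 1 after four.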
