Multirate on Hardware

Decimation, CIC filters, and polyphase FIR on ESP32-S3 and STM32F4

Multirate processing is where embedded DSP shines. A sigma-delta ADC oversampling at 3.072 MHz (64 × 48 kHz) must be decimated to 48 kHz before any meaningful processing can happen, and the first decimation stage runs at the input rate, so its per-sample cost dominates the budget. The CIC filter exists precisely because of this constraint: it achieves high decimation ratios using only additions and subtractions, with no multiplications.

This page covers practical implementations of decimation, CIC, and polyphase FIR on microcontrollers. For the theory (spectral folding, polyphase decomposition, multi-stage design), see the main multirate page.


CIC decimator: the multiply-free filter

The CIC (Cascaded Integrator-Comb) filter is the quintessential “exists because of hardware constraints” algorithm (Hogenauer 1981). A \(K\)-th order CIC decimating by \(M\) has the transfer function, written at the input rate:

\[H(z) = \left(\frac{1 - z^{-M}}{1 - z^{-1}}\right)^K\]

It decomposes into \(K\) integrators running at the high rate followed by \(K\) comb sections running at the low rate. The entire structure requires zero multiplications: only additions, subtractions, and delays.
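The split into integrators and combs can be made explicit. The following is a short derivation sketch that relies on the standard noble identity for decimators; the low-rate variable \(z_M = z^M\) is notation introduced here for illustration.

```latex
\[
H(z) =
\underbrace{\left(\frac{1}{1 - z^{-1}}\right)^{K}}_{K\ \text{integrators at the input rate}}
\cdot
\underbrace{\left(1 - z^{-M}\right)^{K}}_{K\ \text{combs}}
\]
\[
\left(1 - z^{-M}\right)^{K}\ \text{followed by}\ \downarrow M
\;\;\equiv\;\;
\downarrow M\ \text{followed by}\ \left(1 - z_M^{-1}\right)^{K}
\]
\[
\frac{1 - z^{-M}}{1 - z^{-1}} = \sum_{n=0}^{M-1} z^{-n}
\quad\Rightarrow\quad
H(1) = M^{K}
\]
```

The second line is why each comb needs only a single delay element at the low rate, and the third shows that every stage is a length-\(M\) moving sum, which gives the DC gain \(M^K\).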

Bare-metal CIC implementation

A first-order CIC (\(K=1\)) decimating by \(M\):

#include <stdint.h>

typedef struct {
    uint32_t integrator;  // runs at high rate; unsigned so wrap-around is well defined
    uint32_t comb_delay;  // runs at low rate
    int count;
    int M;                // decimation factor
} CicDecimator;

void cic_init(CicDecimator *cic, int M) {
    cic->integrator = 0;
    cic->comb_delay = 0;
    cic->count = 0;
    cic->M = M;
}

// Call at the HIGH sample rate. Returns 1 when a decimated output is ready.
int cic_process(CicDecimator *cic, int32_t in, int32_t *out) {
    // Integrator: accumulate every input sample (wraps modulo 2^32)
    cic->integrator += (uint32_t)in;

    if (++cic->count >= cic->M) {
        cic->count = 0;
        // Comb: difference at the low rate, reinterpreted as signed
        *out = (int32_t)(cic->integrator - cic->comb_delay);
        cic->comb_delay = cic->integrator;
        return 1;  // output ready
    }
    return 0;  // no output this sample
}

For a higher-order CIC (\(K > 1\)), cascade \(K\) integrators at the high rate and \(K\) combs at the low rate:

#define CIC_ORDER 3

typedef struct {
    uint32_t integrators[CIC_ORDER];  // unsigned: wrap-around is well defined in C
    uint32_t comb_delays[CIC_ORDER];
    int count;
    int M;
} CicDecimatorN;

void cic_n_init(CicDecimatorN *cic, int M) {
    for (int k = 0; k < CIC_ORDER; k++) {
        cic->integrators[k] = 0;
        cic->comb_delays[k] = 0;
    }
    cic->count = 0;
    cic->M = M;
}

int cic_n_process(CicDecimatorN *cic, int32_t in, int32_t *out) {
    // K integrators at the high rate, each feeding the next
    uint32_t val = (uint32_t)in;
    for (int k = 0; k < CIC_ORDER; k++) {
        cic->integrators[k] += val;
        val = cic->integrators[k];
    }

    if (++cic->count >= cic->M) {
        cic->count = 0;
        // K combs at the low rate: each outputs (input - previous input)
        for (int k = 0; k < CIC_ORDER; k++) {
            uint32_t tmp = val;
            val = val - cic->comb_delays[k];
            cic->comb_delays[k] = tmp;
        }
        *out = (int32_t)val;
        return 1;
    }
    return 0;
}
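As a quick sanity check of the cascade, the sketch below feeds a constant input into a self-contained order-3 CIC and returns a settled output, which should equal the input times \(M^K\). The names `Cic3`, `cic3_step`, and `cic3_dc_response` are illustrative and not part of the listings above. State is held in `uint32_t` so that integrator wrap-around is well defined.

```c
#include <stdint.h>

#define ORDER 3  /* K: number of integrator/comb pairs (illustrative) */

typedef struct {
    uint32_t integ[ORDER];  /* integrator state, high rate */
    uint32_t comb[ORDER];   /* comb delay state, low rate  */
    int count;
    int M;
} Cic3;

/* Process one high-rate sample; returns 1 and writes *out every M-th call. */
static int cic3_step(Cic3 *c, int32_t in, int32_t *out) {
    uint32_t v = (uint32_t)in;
    for (int k = 0; k < ORDER; k++) {
        c->integ[k] += v;          /* may wrap modulo 2^32 */
        v = c->integ[k];
    }
    if (++c->count < c->M) return 0;
    c->count = 0;
    for (int k = 0; k < ORDER; k++) {
        uint32_t prev = c->comb[k];
        c->comb[k] = v;            /* remember this stage's input */
        v -= prev;                 /* difference against the previous one */
    }
    *out = (int32_t)v;
    return 1;
}

/* Feed a constant input and return the n_out-th decimated output. */
int32_t cic3_dc_response(int M, int32_t dc, int n_out) {
    Cic3 c = {{0}, {0}, 0, M};
    int32_t y = 0;
    int produced = 0;
    while (produced < n_out) {
        if (cic3_step(&c, dc, &y)) produced++;
    }
    return y;
}
```

With \(M = 16\) and a constant input of 1000, the settled output is \(1000 \times 16^3 = 4096000\), and it stays there even after the integrators have wrapped many times.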

Integer overflow in CIC

The register width must accommodate the worst-case output, which grows by the filter gain \(M^K\). For a \(K\)-th order CIC decimating by \(M\) with \(B_{\text{in}}\)-bit input, a sufficient register width is:

\[B_{\text{out}} = B_{\text{in}} + K \cdot \lceil\log_2 M\rceil\]

For 16-bit input, \(K=3\), \(M=16\): \(B_{\text{out}} = 16 + 3 \times 4 = 28\) bits, so a 32-bit register suffices. Hogenauer's exact bound, \(B_{\text{in}} + \lceil K \log_2 M\rceil\), is never larger than this. When \(B_{\text{out}}\) exceeds 32 (larger \(M\) or \(K\)), use 64-bit accumulators.
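The width calculation is easy to automate at design time. The helper below is a minimal sketch of the formula \(B_{\text{in}} + K \cdot \lceil\log_2 M\rceil\) using integer arithmetic only; the function names are hypothetical.

```c
#include <stdint.h>

/* ceil(log2(m)) for m >= 1, using integer arithmetic only. */
static int ceil_log2(uint32_t m) {
    int bits = 0;
    uint64_t p = 1;            /* 64-bit so the shift cannot overflow */
    while (p < m) {
        p <<= 1;
        bits++;
    }
    return bits;
}

/* Sufficient register width: B_in + K * ceil(log2(M)). */
int cic_register_bits(int b_in, int K, uint32_t M) {
    return b_in + K * ceil_log2(M);
}

/* Returns 1 if 32-bit state is wide enough, 0 if 64-bit state is needed. */
int cic_fits_32(int b_in, int K, uint32_t M) {
    return cic_register_bits(b_in, K, M) <= 32;
}
```

For example, `cic_register_bits(16, 3, 16)` gives 28, while a 16-bit input with \(K=3\), \(M=64\) needs 34 bits and therefore 64-bit state.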

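The overflows in the integrators are intentional: CIC relies on modular (wrap-around) arithmetic, so the comb differences come out right even after the integrators have wrapped, provided the registers are at least \(B_{\text{out}}\) bits wide. Saturating arithmetic or overflow-trapping builds will break it. In C, signed overflow is undefined behaviour, so either keep the state in unsigned integers (which wrap by definition) and cast the result back to signed, or compile with a wrapping option such as GCC's -fwrapv.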
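The wrap-around property can be seen in isolation with a first-order stage. The sketch below starts the integrator just below \(2^{32}\) so that it overflows during the block, then takes the comb difference. The function name and the starting value are illustrative assumptions.

```c
#include <stdint.h>

/* Sum M copies of `sample` into a wrapping integrator that starts at `start`,
 * then take the comb difference against the value M samples earlier. */
int32_t wrapped_block_sum(uint32_t start, int32_t sample, int M) {
    uint32_t integrator = start;
    uint32_t comb_delay = start;         /* integrator value M samples ago */
    for (int i = 0; i < M; i++) {
        integrator += (uint32_t)sample;  /* may wrap past 2^32 */
    }
    /* Unsigned subtraction is modulo 2^32, so the wrap cancels out. */
    return (int32_t)(integrator - comb_delay);
}
```

Starting at `0xFFFFFFF0` and adding eight samples of 100 wraps the integrator, yet the difference is still 800, the same result as starting from zero.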

Why CIC first, FIR second

A CIC filter has a sinc-like frequency response with passband droop and poor stopband rejection. It is not suitable as a standalone filter for most applications. The standard approach is a two-stage pipeline:

  1. CIC for coarse decimation (e.g., \(\div 16\) or \(\div 32\)) — cheap, runs at the high rate
  2. FIR for fine decimation and passband compensation — more expensive per tap, but runs at the reduced rate

This combination gives the sharp cutoff of an FIR filter with the efficiency of a CIC for the high-rate heavy lifting.
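To put a number on the droop, the magnitude response follows directly from the transfer function given earlier. This is a worked sketch in which \(f\) is frequency normalised to the input sample rate (a symbol introduced here), evaluated at the output Nyquist frequency \(f = 1/(2M)\) for \(K=3\), \(M=16\):

```latex
\[
\left|H(e^{j2\pi f})\right| = \left|\frac{\sin(\pi f M)}{\sin(\pi f)}\right|^{K}
\]
\[
\left.\frac{\left|H(e^{j2\pi f})\right|}{M^{K}}\right|_{f = \frac{1}{2M}}
= \left(\frac{1}{16\,\sin(\pi/32)}\right)^{3}
\approx (0.6376)^{3}
\approx 0.259
\quad (\approx -11.7\ \text{dB})
\]
```

A droop of this size at the band edge is what the FIR stage has to compensate, which is one reason the usable passband is normally kept well below the output Nyquist frequency.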


STM32F4 (NUCLEO-F446RE): CMSIS-DSP decimation

CMSIS-DSP provides arm_fir_decimate_f32, which combines an FIR anti-aliasing filter with downsampling in a single optimised call. It computes only the output samples that will be kept, which is the efficiency gain of polyphase decomposition without the user having to implement it.
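What "computing only the kept samples" means can be shown with a plain-C reference. This is a minimal sketch, not CMSIS-DSP's implementation: it processes a single block, assumes zero history before the first sample, and its output alignment may differ from the library's. The name `fir_decimate_ref` is hypothetical.

```c
#include <stddef.h>

/* Reference decimating FIR over one block, assuming zero history before x[0].
 * Computes y[m] = sum_k h[k] * x[m*M - k] for m = 0 .. n_in/M - 1,
 * i.e. only the outputs that survive downsampling. Returns the output count. */
size_t fir_decimate_ref(const float *h, size_t n_taps,
                        const float *x, size_t n_in,
                        size_t M, float *y) {
    size_t n_out = n_in / M;
    for (size_t m = 0; m < n_out; m++) {
        size_t n = m * M;              /* index of the newest input sample */
        float acc = 0.0f;
        /* Stop at the start of the block: earlier samples are taken as zero. */
        for (size_t k = 0; k < n_taps && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[m] = acc;
    }
    return n_out;
}
```

The inner loop runs once per output rather than once per input, so the MAC count drops by the decimation factor compared with filtering every sample and then discarding most of the results.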

FIR decimation

#include "arm_math.h"

#define BLOCK_SIZE_IN   256    // input block (high rate)
#define DECIM_FACTOR    4
#define BLOCK_SIZE_OUT  (BLOCK_SIZE_IN / DECIM_FACTOR)
#define NUM_TAPS        32     // anti-aliasing FIR length

static float32_t fir_state[NUM_TAPS + BLOCK_SIZE_IN - 1];
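static float32_t fir_coeffs[NUM_TAPS];  // lowpass; cutoff must sit below fs/(2*DECIM_FACTOR)
static arm_fir_decimate_instance_f32 decim;

static float32_t input[BLOCK_SIZE_IN];
static float32_t output[BLOCK_SIZE_OUT];

void init_decimator(void) {
    // Design coefficients in Python and copy them into fir_coeffs[].
    // firwin normalises the cutoff to Nyquist (fs/2), so 1.0/8 places it at
    // fs/16, leaving room for the transition band below the new Nyquist (fs/8):
    // scipy.signal.firwin(32, 1.0/8, window='hamming')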
    arm_fir_decimate_init_f32(&decim, NUM_TAPS, DECIM_FACTOR,
                               fir_coeffs, fir_state, BLOCK_SIZE_IN);
}

void process_block(void) {
    arm_fir_decimate_f32(&decim, input, output, BLOCK_SIZE_IN);
    // output[] now contains BLOCK_SIZE_OUT = 64 samples at fs/4
}

FIR interpolation

The inverse operation uses arm_fir_interpolate_f32:

#define INTERP_FACTOR   4
#define BLOCK_SIZE_LO   64
#define BLOCK_SIZE_HI   (BLOCK_SIZE_LO * INTERP_FACTOR)
#define INTERP_TAPS     32   // per phase
#define INTERP_TOTAL    (INTERP_TAPS * INTERP_FACTOR)   // total filter length

static float32_t interp_state[INTERP_TAPS + BLOCK_SIZE_LO - 1];
static float32_t interp_coeffs[INTERP_TOTAL];
static arm_fir_interpolate_instance_f32 interp;

static float32_t lo_rate[BLOCK_SIZE_LO];
static float32_t hi_rate[BLOCK_SIZE_HI];

void init_interpolator(void) {
    // numTaps is the TOTAL length and must be a multiple of INTERP_FACTOR
    arm_fir_interpolate_init_f32(&interp, INTERP_FACTOR, INTERP_TOTAL,
                                  interp_coeffs, interp_state, BLOCK_SIZE_LO);
}

void upsample_block(void) {
    arm_fir_interpolate_f32(&interp, lo_rate, hi_rate, BLOCK_SIZE_LO);
    // hi_rate[] now contains 256 samples at 4x the input rate
}

Tip

The numTaps argument of arm_fir_interpolate_init_f32 is the total filter length, not the number of taps per polyphase phase, and it must be a multiple of the interpolation factor. Here INTERP_TAPS is the per-phase length, so the init call passes INTERP_TAPS * INTERP_FACTOR. The state buffer holds (numTaps / L) + blockSize - 1 samples. Design the full prototype filter in Python, then pass all coefficients; CMSIS-DSP handles the polyphase decomposition internally.
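The polyphase decomposition that the library performs internally can be sketched in plain C. This is an illustrative reference under stated assumptions (single block, zero history, coefficients in natural order), not CMSIS-DSP's implementation or its coefficient layout. The name `fir_interpolate_ref` is hypothetical.

```c
#include <stddef.h>

/* Polyphase interpolation by L over one block, assuming zero history.
 * h holds the full prototype filter (n_taps must be a multiple of L).
 * Phase p uses coefficients h[p], h[p + L], h[p + 2L], ...
 * y must have room for n_in * L samples. */
void fir_interpolate_ref(const float *h, size_t n_taps, size_t L,
                         const float *x, size_t n_in, float *y) {
    size_t phase_len = n_taps / L;     /* taps per polyphase branch */
    for (size_t n = 0; n < n_in; n++) {
        for (size_t p = 0; p < L; p++) {
            float acc = 0.0f;
            for (size_t k = 0; k < phase_len && k <= n; k++) {
                acc += h[k * L + p] * x[n - k];
            }
            y[n * L + p] = acc;        /* L outputs per input sample */
        }
    }
}
```

With the four-tap linear-interpolation prototype `{0.5, 1, 0.5, 0}` and \(L = 2\), the input `{2, 4}` produces `{1, 2, 3, 4}`, the same result as zero-stuffing and then filtering, but without ever multiplying by the stuffed zeros.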

CIC + FIR two-stage pipeline on STM32F4

For high decimation factors, combine a hand-written CIC first stage with CMSIS-DSP FIR second stage:

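// Example: decimate from 1 MHz to 8 kHz (factor 125 = 25 x 5)
// Stage 1: CIC decimate by 25 (1 MHz -> 40 kHz)
// Stage 2: FIR decimate by 5  (40 kHz -> 8 kHz)

#define CIC_GAIN        15625.0f  // M^K = 25^3
#define FIR_DECIM       5
#define FIR_BLOCK_SIZE  80        // example value; must be a multiple of FIR_DECIM

CicDecimatorN cic_stage;    // order-3 CIC, M=25: cic_n_init(&cic_stage, 25)
arm_fir_decimate_instance_f32 fir_stage;  // set up with arm_fir_decimate_init_f32()

static float32_t fir_input_buf[FIR_BLOCK_SIZE];
static float32_t fir_output_buf[FIR_BLOCK_SIZE / FIR_DECIM];
static uint32_t fir_idx = 0;

void process_sample_high_rate(int32_t adc_sample) {
    int32_t cic_out;
    if (cic_n_process(&cic_stage, adc_sample, &cic_out)) {
        // CIC output at 40 kHz: normalise and feed to the FIR buffer
        fir_input_buf[fir_idx++] = (float32_t)cic_out / CIC_GAIN;
        if (fir_idx >= FIR_BLOCK_SIZE) {
            arm_fir_decimate_f32(&fir_stage, fir_input_buf,
                                  fir_output_buf, FIR_BLOCK_SIZE);
            fir_idx = 0;
            // fir_output_buf[] now holds FIR_BLOCK_SIZE / FIR_DECIM samples at 8 kHz
        }
    }
}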

The CIC gain for order \(K\) and decimation \(M\) is \(M^K\) (here \(25^3 = 15625\)). Divide by this after the CIC stage to normalise.
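The normalisation constant can be computed rather than hard-coded. The helpers below are a minimal sketch with hypothetical names; `cic_gain` returns 0 when \(M^K\) would not fit in 64 bits, and `cic_normalise` assumes the gain is non-zero.

```c
#include <stdint.h>

/* DC gain of a K-th order CIC decimating by M: M^K.
 * Returns 0 if the result would not fit in 64 bits. */
uint64_t cic_gain(uint32_t M, unsigned K) {
    uint64_t g = 1;
    for (unsigned i = 0; i < K; i++) {
        if (M != 0 && g > UINT64_MAX / M) return 0;  /* overflow guard */
        g *= M;
    }
    return g;
}

/* Scale a raw CIC output back to the input's amplitude range.
 * Assumes cic_gain(M, K) is non-zero. */
float cic_normalise(int32_t cic_out, uint32_t M, unsigned K) {
    return (float)cic_out / (float)cic_gain(M, K);
}
```

When \(M\) is a power of two the gain is also a power of two, so the division can be replaced by a right shift of \(K \log_2 M\) bits if the output stays in fixed point.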

Performance budget (NUCLEO-F446RE, 180 MHz)

| Stage | Rate | Operations | Est. cycles | Time |
|---|---|---|---|---|
| CIC order-3, \(\div 25\) | 1 MHz in, 40 kHz out | 3 adds per input = 75 adds per CIC output | ~100 per CIC output | 0.6 us |
| FIR 32-tap, \(\div 5\) | 40 kHz in, 8 kHz out | 32 MACs per output (6-7 taps per phase \(\times\) 5 phases) | ~40 per FIR output | 0.2 us |
| Total per 8 kHz output | | 5 CIC outputs + 1 FIR output | ~540 | 3.0 us |
| Available per output (8 kHz) | | | 22,500 | 125 us |
| Utilisation | | | | ~2.4% |

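Without the CIC, a single-stage 32-tap FIR evaluated at the full 1 MHz input rate would cost 32 MACs per input sample = 32M MACs/s. With the CIC first, the FIR takes its input at 40 kHz and produces 8 kHz output at 32 MACs per output, only 256K MACs/s: a 125x reduction. In practice a single-stage design would also need far more than 32 taps to reach a comparable transition band, so the real gap is larger.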
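The arithmetic behind that comparison is simple enough to keep as a design-time helper. This is a small sketch with hypothetical function names; it assumes non-zero tap counts and rates.

```c
#include <stdint.h>

/* MACs per second for an FIR that evaluates `taps` MACs at `rate_hz`. */
uint64_t macs_per_second(uint32_t taps, uint32_t rate_hz) {
    return (uint64_t)taps * rate_hz;
}

/* Ratio between evaluating the FIR at the input rate and at the output rate.
 * Assumes taps and out_hz are non-zero. */
uint64_t mac_reduction(uint32_t taps, uint32_t in_hz, uint32_t out_hz) {
    return macs_per_second(taps, in_hz) / macs_per_second(taps, out_hz);
}
```

For 32 taps, `macs_per_second(32, 1000000)` is 32,000,000 and `macs_per_second(32, 8000)` is 256,000, giving the 125x ratio quoted above.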


ESP32-S3: CIC + ESP-DSP FIR

CMSIS-DSP targets Arm cores, so the ESP32-S3 (Xtensa LX7) uses Espressif's ESP-DSP library instead, which provides optimised FIR functions. The CIC code is plain C and runs unchanged on both platforms.

ESP-DSP FIR decimation

#include "dsps_fir.h"

#define FIR_TAPS    32
#define DECIM       4
#define BLOCK_IN    256
#define BLOCK_OUT   (BLOCK_IN / DECIM)

static fir_f32_t fir;
static float coeffs[FIR_TAPS];
static float delay[FIR_TAPS];
static float input[BLOCK_IN];
static float output[BLOCK_IN];  // filtered at high rate

void init_fir(void) {
    dsps_fir_init_f32(&fir, coeffs, delay, FIR_TAPS);
}

void decimate_block(float *in, float *out, int n_in) {
    // Filter at high rate
    dsps_fir_f32(&fir, in, output, n_in);

    // Downsample: keep every DECIM-th sample
    for (int i = 0; i < n_in / DECIM; i++)
        out[i] = output[i * DECIM];
}

Note

ESP-DSP’s dsps_fir_f32 has no decimation mode like CMSIS-DSP’s arm_fir_decimate_f32. In the listing above the filter runs at the full input rate and downsampling is done separately, which wastes work on samples that are then discarded; for moderate decimation factors the overhead is acceptable. ESP-DSP also ships a separate decimating FIR, dsps_fird_f32, that avoids this; check dsps_fir.h in the version you build against for its exact signature. For large factors, use the CIC + FIR two-stage approach.

Two-stage pipeline on ESP32-S3

// Decimate from 48 kHz to 8 kHz (factor 6 = 3 x 2)
// Stage 1: CIC order-3 (CIC_ORDER), decimate by 3 (48 kHz -> 16 kHz)
// Stage 2: FIR 32-tap, decimate by 2 (16 kHz -> 8 kHz)

CicDecimatorN cic;   // initialise once with cic_n_init(&cic, 3)
fir_f32_t fir;       // initialise once with dsps_fir_init_f32()

void i2s_process_block(int32_t *samples, int n) {
    static float fir_buf[64];
    static float fir_out[64];
    static int fir_idx = 0;  // static: persists across calls

    for (int i = 0; i < n; i++) {
        // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame.
        // Keep the top 16 bits as the CIC input.
        int32_t s16 = samples[i] >> 16;
        int32_t cic_out;
        if (cic_n_process(&cic, s16, &cic_out)) {
            // Normalise by the CIC gain M^K = 3^3 = 27
            fir_buf[fir_idx++] = (float)cic_out / 27.0f;

            if (fir_idx >= 64) {
                // FIR filter at 16 kHz
                dsps_fir_f32(&fir, fir_buf, fir_out, 64);
                // Downsample by 2
                for (int j = 0; j < 32; j++) {
                    float out_sample = fir_out[j * 2];
                    // out_sample is now at 8 kHz: process further
                    (void)out_sample;
                }
                fir_idx = 0;
            }
        }
    }
}
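The sample-unpacking step is worth isolating, since getting the shift wrong silently costs dynamic range. The helpers below are a sketch that assumes a 24-bit sample left-justified in bits 31:8 of the 32-bit word (as described for the INMP441 above) and an arithmetic right shift for negative values, which GCC and Clang provide. The function names are hypothetical.

```c
#include <stdint.h>

/* 24-bit signed sample from a 32-bit I2S word (data in bits 31:8). */
int32_t i2s_to_s24(int32_t frame) {
    return frame >> 8;
}

/* Top 16 bits of the sample, suitable as a 16-bit CIC input. */
int16_t i2s_to_q15(int32_t frame) {
    return (int16_t)(frame >> 16);
}

/* Sample scaled to the range [-1.0, 1.0). 8388608 = 2^23. */
float i2s_to_float(int32_t frame) {
    return (float)(frame >> 8) / 8388608.0f;
}
```

Feeding the CIC a 16-bit value keeps the register-width budget small: with \(K=3\), \(M=3\) the state needs \(16 + 3 \times 2 = 22\) bits.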

Performance budget (ESP32-S3, 240 MHz)

| Stage | Rate | Operation | Est. cycles | Time |
|---|---|---|---|---|
| CIC order-3, \(\div 3\) | 48 kHz in | 3 adds per sample | ~6/sample | 0.025 us |
| FIR 32-tap (ESP-DSP) | 16 kHz | 32 MACs per sample | ~60/sample | 0.25 us |
| Downsample \(\div 2\) | 16 kHz → 8 kHz | Copy | ~2 | 0.01 us |
| Total per 8 kHz output | | | | ~0.66 us |
| Available per output (8 kHz) | | | | 125 us |
| Utilisation | | | | ~0.5% |
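The total in the budget comes from weighting each stage by how often it runs per 8 kHz output: six CIC input samples, two FIR samples, and one copy. The sketch below reproduces that arithmetic from the per-sample estimates in the table; the function names are hypothetical and the cycle counts are the table's estimates, not measurements.

```c
#include <stdint.h>

/* Cycles spent per final output sample, given each stage's per-sample cost
 * and how many times that stage runs per final output. */
uint32_t cycles_per_output(const uint32_t *cost, const uint32_t *runs,
                           int n_stages) {
    uint32_t total = 0;
    for (int i = 0; i < n_stages; i++) {
        total += cost[i] * runs[i];
    }
    return total;
}

/* Cycles available between consecutive outputs. */
uint32_t cycles_available(uint32_t cpu_hz, uint32_t out_hz) {
    return cpu_hz / out_hz;
}

/* Utilisation in hundredths of a percent, truncated (52 means 0.52%). */
uint32_t utilisation_centipercent(uint32_t used, uint32_t available) {
    return (uint32_t)(((uint64_t)used * 10000u) / available);
}
```

With costs `{6, 60, 2}` and run counts `{6, 2, 1}` the total is 158 cycles, which at 240 MHz is about 0.66 us out of the 30,000 cycles (125 us) available per output.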

Platform comparison

| Feature | STM32F4 (NUCLEO-F446RE) | ESP32-S3 |
|---|---|---|
| FIR decimation library | arm_fir_decimate_f32 (polyphase, optimised) | dsps_fir_f32 + manual downsample |
| FIR interpolation library | arm_fir_interpolate_f32 | No built-in (manual implementation) |
| Fixed-point support | Q15/Q31 variants with SIMD | s16 FIR with SIMD (~2.4x speedup) |
| CIC implementation | Hand-written (same C on both) | Hand-written (same C on both) |
| ADC for high-rate input | Up to 2.4 Msps (12-bit, DMA) | ~100 ksps practical (noisy ADC) |
| Best for | High-rate sensor decimation, SDR front-ends | Audio-rate multirate, I2S codec interfacing |

Recommendation

  • STM32F4 for high-rate decimation: the STM32F4’s on-chip ADC can sample at MHz rates, and CMSIS-DSP’s polyphase FIR decimation avoids computing discarded samples. Choose this for sigma-delta post-processing, vibration analysis, or SDR.
  • ESP32-S3 for audio-rate conversion: 48 → 16 → 8 kHz pipelines for voice processing, or upsampling for DAC output. The s16 FIR SIMD gives a 2.4x speedup over float.

References

Hogenauer, Eugene B. 1981. “An Economical Class of Digital Filters for Decimation and Interpolation.” IEEE Transactions on Acoustics, Speech, and Signal Processing 29 (2): 155–62.