Multirate on Hardware

Decimation, CIC filters, and polyphase FIR on ESP32-S3 and STM32F4

Multirate processing is where embedded DSP shines. A sigma-delta ADC oversampling at 3.072 MHz (64 × 48 kHz) must be decimated to 48 kHz before any meaningful processing can happen, and the first decimation stage runs at the input rate, so efficiency matters enormously. The CIC filter exists precisely because of this constraint: it achieves high decimation ratios using only additions and subtractions, with no multiplications.

This page covers practical implementations of decimation, CIC, and polyphase FIR on microcontrollers. For the theory (spectral folding, polyphase decomposition, multi-stage design), see the main multirate page.

CIC decimator: the multiply-free filter

The CIC (Cascaded Integrator-Comb) filter is the quintessential “exists because of hardware constraints” algorithm (Hogenauer 1981). A $K$-th order CIC decimating by $M$:

\[H(z) = \left(\frac{1 - z^{-M}}{1 - z^{-1}}\right)^K\]

decomposes into integrators running at the high rate and comb sections at the low rate. The entire structure requires zero multiplications — only additions, subtractions, and delays.
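The split can be made explicit. As a sketch (writing $\downarrow M$ for the downsampler, a notation assumed here rather than taken from this page), factor the transfer function and move the combs across the downsampler with the noble identity:

```latex
\[
H(z) =
\underbrace{\left(\frac{1}{1 - z^{-1}}\right)^{K}}_{K\ \text{integrators at } f_s}
\cdot
\underbrace{\left(1 - z^{-M}\right)^{K}}_{K\ \text{combs}}
\]
\[
\left(1 - z^{-M}\right)^{K} \;\rightarrow\; \downarrow M
\qquad\equiv\qquad
\downarrow M \;\rightarrow\; \left(1 - z^{-1}\right)^{K}
\]
```

After the swap each comb needs only a single delay element clocked at $f_s/M$, which is why the C code on this page stores one comb delay per stage.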

Bare-metal CIC implementation

A first-order CIC ($K=1$) decimating by $M$:

#include <stdint.h>

typedef struct {
    int32_t integrator;   // runs at high rate
    int32_t comb_delay;   // runs at low rate
    int count;
    int M;                // decimation factor
} CicDecimator;

void cic_init(CicDecimator *cic, int M) {
    cic->integrator = 0;
    cic->comb_delay = 0;
    cic->count = 0;
    cic->M = M;
}

// Call at the HIGH sample rate. Returns 1 when a decimated output is ready.
int cic_process(CicDecimator *cic, int32_t in, int32_t *out) {
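    // Integrator: accumulate every input sample. The sum is formed in
    // uint32_t so the intended wrap-around is well defined in C.
    cic->integrator = (int32_t)((uint32_t)cic->integrator + (uint32_t)in);

    if (++cic->count >= cic->M) {
        cic->count = 0;
        // Comb: difference at the low rate (also modulo 2^32)
        *out = (int32_t)((uint32_t)cic->integrator - (uint32_t)cic->comb_delay);
        cic->comb_delay = cic->integrator;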
        return 1;  // output ready
    }
    return 0;  // no output this sample
}

For a higher-order CIC ($K > 1$), cascade $K$ integrators at the high rate and $K$ combs at the low rate:

#define CIC_ORDER 3

typedef struct {
    int32_t integrators[CIC_ORDER];
    int32_t comb_delays[CIC_ORDER];
    int count;
    int M;
} CicDecimatorN;

void cic_n_init(CicDecimatorN *cic, int M) {
    for (int k = 0; k < CIC_ORDER; k++) {
        cic->integrators[k] = 0;
        cic->comb_delays[k] = 0;
    }
    cic->count = 0;
    cic->M = M;
}

int cic_n_process(CicDecimatorN *cic, int32_t in, int32_t *out) {
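    // K integrators at the high rate. The sums are formed in uint32_t so
    // the intended wrap-around is well defined in C.
    uint32_t val = (uint32_t)in;
    for (int k = 0; k < CIC_ORDER; k++) {
        val += (uint32_t)cic->integrators[k];
        cic->integrators[k] = (int32_t)val;
    }

    if (++cic->count >= cic->M) {
        cic->count = 0;
        // K combs at the low rate (differences are also modulo 2^32)
        for (int k = 0; k < CIC_ORDER; k++) {
            uint32_t tmp = val;
            val -= (uint32_t)cic->comb_delays[k];
            cic->comb_delays[k] = (int32_t)tmp;
        }
        *out = (int32_t)val;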
        return 1;
    }
    return 0;
}
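One way to validate a CIC implementation is to compare it against the FIR filter it is equivalent to: $K$ cascaded length-$M$ boxcars. The following is a host-side sketch, not part of the original code; the names `cic_equivalent_taps`, `cic_dc_gain_from_taps` and `CIC_REF_MAX_TAPS` are hypothetical, and the fixed buffer size is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CIC_REF_MAX_TAPS 256  /* hypothetical upper bound for this sketch */

// Fill taps[] with the impulse response of (1 + z^-1 + ... + z^-(M-1))^K,
// the FIR filter that a K-th order, decimate-by-M CIC is equivalent to.
// Returns the number of taps, K*(M-1)+1, or 0 if it does not fit.
int cic_equivalent_taps(int K, int M, int64_t taps[CIC_REF_MAX_TAPS]) {
    assert(taps != NULL);
    if (K < 1 || M < 1 || K * (M - 1) + 1 > CIC_REF_MAX_TAPS) return 0;

    int len = 1;
    memset(taps, 0, sizeof(int64_t) * CIC_REF_MAX_TAPS);
    taps[0] = 1;
    for (int k = 0; k < K; k++) {
        int64_t next[CIC_REF_MAX_TAPS] = {0};
        // Convolve the running response with a length-M boxcar of ones.
        for (int i = 0; i < len; i++)
            for (int j = 0; j < M; j++)
                next[i + j] += taps[i];
        len += M - 1;
        memcpy(taps, next, sizeof(next));
    }
    return len;
}

// The DC gain is the sum of the taps; it should equal M^K.
int64_t cic_dc_gain_from_taps(const int64_t *taps, int len) {
    int64_t sum = 0;
    for (int i = 0; i < len; i++) sum += taps[i];
    return sum;
}
```

For $K=3$, $M=4$ this yields the ten taps 1, 3, 6, 10, 12, 12, 10, 6, 3, 1, whose sum is 64. Feeding a constant 1 into the order-3 decimator should therefore settle to 64 from the third output onward.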

Integer overflow in CIC

The integrator word width must accommodate the worst-case accumulation. For a $K$-th order CIC decimating by $M$ with $B_{\text{in}}$-bit input, a sufficient register width is:

\[B_{\text{out}} = B_{\text{in}} + K \cdot \lceil\log_2 M\rceil\]

For 16-bit input, $K=3$, $M=16$: $B_{\text{out}} = 16 + 3 \times 4 = 28$ bits. A 32-bit int32_t suffices. For larger $M$ or $K$, use 64-bit accumulators.
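The width check is easy to automate at design time. A minimal sketch of the formula above in integer arithmetic; the function names `ceil_log2` and `cic_register_bits` are hypothetical helpers, not part of any library.

```c
#include <assert.h>

// Smallest n such that 2^n >= m, i.e. ceil(log2(m)), for m >= 1.
static int ceil_log2(unsigned m) {
    int n = 0;
    while ((1ull << n) < m) n++;
    return n;
}

// Register width from the formula above: B_in + K * ceil(log2(M)).
int cic_register_bits(int b_in, int K, int M) {
    assert(b_in > 0 && K > 0 && M > 0);
    return b_in + K * ceil_log2((unsigned)M);
}
```

For example, `cic_register_bits(16, 3, 16)` gives 28, matching the worked figure, and a 12-bit input with $K=3$, $M=25$ gives 27, which still fits an `int32_t`.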

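The overflows in the integrators are intentional: a CIC relies on modular arithmetic (two’s complement wrap-around), and the combs cancel the wrapped bits as long as the registers are at least $B_{\text{out}}$ bits wide. In C, signed overflow is undefined behaviour, so perform the additions and subtractions in an unsigned type of the same width (or build with -fwrapv on GCC/Clang) and reinterpret the result as signed. Saturating arithmetic or overflow-trapping builds will break the filter.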
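To see why wrap-around is harmless, here is a small sketch of a first-order CIC over a single block, with the integrator deliberately started close to $2^{32}$ so that it wraps. Unsigned arithmetic is used because C defines it to wrap modulo $2^{32}$. The function name `cic1_block_with_wrap` and the `start` parameter are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

// First-order CIC over one block of M samples, starting from an arbitrary
// integrator value to force a wrap. Returns the comb output for the block.
int32_t cic1_block_with_wrap(uint32_t start, const int32_t *x, int M) {
    assert(x != 0 && M > 0);
    uint32_t integrator = start;
    uint32_t comb_delay = start;  // integrator value at the previous output
    for (int i = 0; i < M; i++)
        integrator += (uint32_t)x[i];           // may wrap past 2^32
    return (int32_t)(integrator - comb_delay);  // the wrap cancels here
}
```

Starting from `0xFFFFFFF0` with inputs 10, 20, 30, the integrator wraps to `0x2C`, yet the comb difference is still 60, the true block sum.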

Why CIC first, FIR second

A CIC filter has a sinc-like frequency response with passband droop and poor stopband rejection. It is not suitable as a standalone filter for most applications. The standard approach is a two-stage pipeline:

CIC for coarse decimation (e.g., $\div 16$ or $\div 32$) — cheap, runs at the high rate
FIR for fine decimation and passband compensation — more expensive per tap, but runs at the reduced rate

This combination gives the sharp cutoff of an FIR filter with the efficiency of a CIC for the high-rate heavy lifting.
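The droop can be quantified from the magnitude response of the CIC, normalised to unity gain at DC. As an illustration, take $M=25$, $K=3$ and a 1 MHz input rate (the figures used in the STM32F4 pipeline on this page), and evaluate at 4 kHz, the Nyquist frequency of the final 8 kHz output. The numbers are rounded hand calculations, not measurements:

```latex
\[
|H(f)| = \left|\frac{\sin(\pi M f / f_s)}{M \sin(\pi f / f_s)}\right|^{K}
\]
\[
M = 25,\ K = 3,\ f_s = 1\ \text{MHz},\ f = 4\ \text{kHz}:\quad
|H| \approx \left(\frac{0.30902}{0.31415}\right)^{3} \approx 0.952 \approx -0.43\ \text{dB}
\]
```

The response has nulls at multiples of $f_s/M$ (40 kHz here), which are exactly the frequencies that alias onto DC after decimation. The second-stage FIR can be designed with a slightly rising passband to compensate for the droop.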

STM32F4 (NUCLEO-F446RE): CMSIS-DSP decimation

CMSIS-DSP provides arm_fir_decimate_f32, which combines an FIR anti-aliasing filter with downsampling in a single optimised call. It computes only the output samples that will be kept, which is the efficiency gain of polyphase decomposition without the user having to implement it.
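What "computes only the output samples that will be kept" means can be shown with a portable reference routine. This is a sketch for checking results on a host, not CMSIS-DSP's implementation; the name `fir_decimate_ref`, the zero initial history, and the choice of which sample phase to keep (indices 0, M, 2M, ...) are all assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

// Direct-form decimating FIR: evaluates the convolution only at the input
// indices that survive downsampling (n = 0, M, 2M, ...). Samples before the
// start of x[] are treated as zero. Returns the number of outputs written.
size_t fir_decimate_ref(const float *x, size_t n_in,
                        const float *h, size_t num_taps,
                        size_t M, float *y) {
    assert(x && h && y && M > 0);
    size_t n_out = 0;
    for (size_t n = 0; n < n_in; n += M) {
        float acc = 0.0f;
        for (size_t k = 0; k < num_taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n_out++] = acc;
    }
    return n_out;
}
```

The inner loop runs once per output rather than once per input, so the cost is roughly `num_taps * n_in / M` multiply-accumulates instead of `num_taps * n_in`.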

FIR decimation

#include "arm_math.h"

#define BLOCK_SIZE_IN   256    // input block (high rate)
#define DECIM_FACTOR    4
#define BLOCK_SIZE_OUT  (BLOCK_SIZE_IN / DECIM_FACTOR)
#define NUM_TAPS        32     // anti-aliasing FIR length

static float32_t fir_state[NUM_TAPS + BLOCK_SIZE_IN - 1];
static float32_t fir_coeffs[NUM_TAPS];  // lowpass, cutoff <= fs/(2*DECIM_FACTOR); fill from an offline design
static arm_fir_decimate_instance_f32 decim;

static float32_t input[BLOCK_SIZE_IN];
static float32_t output[BLOCK_SIZE_OUT];

void init_decimator(void) {
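    // Design the coefficients in Python and copy them into fir_coeffs[], e.g.
    //   scipy.signal.firwin(32, 1.0/8, window='hamming')
    // firwin normalises the cutoff to Nyquist, so 1.0/8 means fs/16, half of
    // the post-decimation Nyquist frequency (fs/8). This leaves room for the
    // transition band of a 32-tap filter.
    // BLOCK_SIZE_IN must be a multiple of DECIM_FACTOR.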
    arm_fir_decimate_init_f32(&decim, NUM_TAPS, DECIM_FACTOR,
                               fir_coeffs, fir_state, BLOCK_SIZE_IN);
}

void process_block(void) {
    arm_fir_decimate_f32(&decim, input, output, BLOCK_SIZE_IN);
    // output[] now contains BLOCK_SIZE_OUT = 64 samples at fs/4
}

FIR interpolation

The inverse operation uses arm_fir_interpolate_f32:

#define INTERP_FACTOR   4
#define BLOCK_SIZE_LO   64
#define BLOCK_SIZE_HI   (BLOCK_SIZE_LO * INTERP_FACTOR)
#define INTERP_TAPS     32   // total filter length; must be a multiple of INTERP_FACTOR
#define PHASE_LENGTH    (INTERP_TAPS / INTERP_FACTOR)   // taps per polyphase branch

static float32_t interp_state[BLOCK_SIZE_LO + PHASE_LENGTH - 1];
static float32_t interp_coeffs[INTERP_TAPS];
static arm_fir_interpolate_instance_f32 interp;

static float32_t lo_rate[BLOCK_SIZE_LO];
static float32_t hi_rate[BLOCK_SIZE_HI];

void init_interpolator(void) {
    arm_fir_interpolate_init_f32(&interp, INTERP_FACTOR, INTERP_TAPS,
                                  interp_coeffs, interp_state, BLOCK_SIZE_LO);
}

void upsample_block(void) {
    arm_fir_interpolate_f32(&interp, lo_rate, hi_rate, BLOCK_SIZE_LO);
    // hi_rate[] now contains 256 samples at 4x the input rate
}

Tip

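In CMSIS-DSP, the numTaps argument of arm_fir_interpolate_init_f32 (INTERP_TAPS above) is the total length of the prototype filter, not the length of one polyphase branch, and it must be a multiple of INTERP_FACTOR. Each branch has INTERP_TAPS / INTERP_FACTOR taps, and the state buffer needs BLOCK_SIZE_LO + INTERP_TAPS / INTERP_FACTOR - 1 elements. Design the full prototype filter in Python and pass all of its coefficients; CMSIS-DSP handles the polyphase decomposition internally.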
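The polyphase decomposition that the library performs can be sketched in portable C. This is a host-side reference under stated assumptions (zero initial history, no gain correction), not CMSIS-DSP's code; `fir_interpolate_ref` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

// Polyphase interpolation by L with a prototype filter h[] of num_taps
// coefficients (num_taps must be a multiple of L). Phase p uses the taps
// h[p], h[p + L], h[p + 2L], ... Samples before x[0] are treated as zero.
// Writes n_in * L output samples to y[].
void fir_interpolate_ref(const float *x, size_t n_in,
                         const float *h, size_t num_taps,
                         size_t L, float *y) {
    assert(x && h && y && L > 0 && num_taps % L == 0);
    size_t phase_len = num_taps / L;
    for (size_t n = 0; n < n_in; n++) {
        for (size_t p = 0; p < L; p++) {
            float acc = 0.0f;
            for (size_t k = 0; k < phase_len && k <= n; k++)
                acc += h[p + L * k] * x[n - k];
            y[n * L + p] = acc;
        }
    }
}
```

This produces the same result as inserting L - 1 zeros after every input sample and filtering with h[], but never multiplies by the inserted zeros. With four unit taps and L = 4 each branch has one tap, so the routine degenerates to a zero-order hold.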

CIC + FIR two-stage pipeline on STM32F4

For high decimation factors, combine a hand-written CIC first stage with a CMSIS-DSP FIR second stage:

// Example: decimate from 1 MHz to 8 kHz (factor 125 = 25 x 5)
// Stage 1: CIC decimate by 25 (1 MHz -> 40 kHz)
// Stage 2: FIR decimate by 5  (40 kHz -> 8 kHz)

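#define CIC_GAIN        15625.0f   // M^K = 25^3
#define FIR_BLOCK_SIZE  160        // example value: 40 kHz samples per FIR call, multiple of 5

CicDecimatorN cic_stage;    // order-3 CIC; call cic_n_init(&cic_stage, 25) at startup
arm_fir_decimate_instance_f32 fir_stage;  // initialise with arm_fir_decimate_init_f32, M = 5

static float32_t fir_input_buf[FIR_BLOCK_SIZE];
static float32_t fir_output_buf[FIR_BLOCK_SIZE / 5];
static uint32_t fir_idx = 0;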

void process_sample_high_rate(int32_t adc_sample) {
    int32_t cic_out;
    if (cic_n_process(&cic_stage, adc_sample, &cic_out)) {
        // CIC output at 40 kHz — feed to FIR buffer
        fir_input_buf[fir_idx++] = (float32_t)cic_out / CIC_GAIN;
        if (fir_idx >= FIR_BLOCK_SIZE) {
            arm_fir_decimate_f32(&fir_stage, fir_input_buf,
                                  fir_output_buf, FIR_BLOCK_SIZE);
            fir_idx = 0;
            // fir_output_buf[] is now at 8 kHz
        }
    }
}

The CIC gain for order $K$ and decimation $M$ is $M^K$ (here $25^3 = 15625$). Divide by this after the CIC stage to normalise.
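The gain and the normalisation step can be written as two small helpers. This is a sketch with hypothetical names (`cic_gain`, `cic_normalise`); in a real-time loop the reciprocal would normally be precomputed once rather than recomputed per sample.

```c
#include <assert.h>
#include <stdint.h>

// CIC DC gain M^K as a 64-bit integer.
int64_t cic_gain(int M, int K) {
    assert(M > 0 && K >= 0);
    int64_t g = 1;
    for (int k = 0; k < K; k++) g *= M;
    return g;
}

// Convert a raw CIC output to float with unity DC gain.
float cic_normalise(int32_t cic_out, int M, int K) {
    return (float)cic_out / (float)cic_gain(M, K);
}
```

`cic_gain(25, 3)` returns 15625, so a raw CIC output of 15625 normalises to 1.0. When $M$ is a power of two the division reduces to a right shift by $K \log_2 M$ bits.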

Performance budget (NUCLEO-F446RE, 180 MHz)

Stage	Rate	Operations per output sample	Est. cycles	Time
CIC order-3, $\div 25$	1 MHz in, 40 kHz out	3 adds per input = 75 adds/out	~100	0.6 us
FIR 32-tap, $\div 5$	40 kHz in, 8 kHz out	32 MACs per output (6-7 taps per phase $\times$ 5 phases)	~40	0.2 us
Total per 8 kHz output			~140	0.8 us
Available per output (8 kHz)			22,500	125 us
Utilisation				0.6%

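Without the CIC, a direct 32-tap FIR evaluated at every 1 MHz input sample would cost 32 MACs per input sample = 32M MACs/s, and 32 taps would be far too short to give useful anti-aliasing for a 125:1 rate change. With the CIC first, the FIR sees a 40 kHz input and, computing only the 8 kHz outputs, costs 256K MACs/s: a 125x reduction.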
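The budget figures are simple ratios and can be reproduced in a few lines. The helper names below are hypothetical, and the inputs are the clock and rates quoted in the table.

```c
#include <assert.h>
#include <stdint.h>

// CPU cycles available between consecutive output samples.
uint32_t cycles_per_output(uint32_t cpu_hz, uint32_t out_rate_hz) {
    assert(out_rate_hz > 0);
    return cpu_hz / out_rate_hz;
}

// MACs per second for an FIR that performs `taps` MACs per evaluation
// and is evaluated `evals_per_s` times per second.
uint64_t fir_macs_per_second(uint32_t taps, uint32_t evals_per_s) {
    return (uint64_t)taps * evals_per_s;
}
```

At 180 MHz and an 8 kHz output rate, `cycles_per_output` gives the 22,500-cycle budget. `fir_macs_per_second(32, 1000000)` and `fir_macs_per_second(32, 8000)` give 32,000,000 and 256,000 MACs/s, whose ratio is the quoted 125.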

ESP32-S3: CIC + ESP-DSP FIR

The ESP32-S3 lacks CMSIS-DSP but provides ESP-DSP with optimised FIR functions. The CIC code is pure C and runs identically on both platforms.

ESP-DSP FIR decimation

#include "dsps_fir.h"

#define FIR_TAPS    32
#define DECIM       4
#define BLOCK_IN    256
#define BLOCK_OUT   (BLOCK_IN / DECIM)

static fir_f32_t fir;
static float coeffs[FIR_TAPS];
static float delay[FIR_TAPS];
static float input[BLOCK_IN];
static float output[BLOCK_IN];  // filtered at high rate

void init_fir(void) {
    dsps_fir_init_f32(&fir, coeffs, delay, FIR_TAPS);
}

void decimate_block(float *in, float *out, int n_in) {
    // output[] is the scratch buffer and holds at most BLOCK_IN samples
    if (n_in > BLOCK_IN)
        n_in = BLOCK_IN;

    // Filter at high rate
    dsps_fir_f32(&fir, in, output, n_in);

    // Downsample: keep every DECIM-th sample
    for (int i = 0; i < n_in / DECIM; i++)
        out[i] = output[i * DECIM];
}

Note

ESP-DSP’s dsps_fir_f32 has no decimation mode, unlike CMSIS-DSP’s arm_fir_decimate_f32. The filter runs at the full input rate and downsampling is done separately, so DECIM - 1 of every DECIM filtered samples are computed and then discarded. For moderate decimation factors the overhead is acceptable; for large factors, use the CIC + FIR two-stage approach. ESP-DSP also ships a decimating variant, dsps_fird_f32, which is not used on this page; check its exact signature in the dsps_fir.h of your installed version before relying on it.

Two-stage pipeline on ESP32-S3

// Decimate from 48 kHz to 8 kHz (factor 6 = 3 x 2)
// Stage 1: CIC order-3 (CIC_ORDER), decimate by 3 (48 kHz -> 16 kHz)
// Stage 2: FIR 32-tap, decimate by 2 (16 kHz -> 8 kHz)

CicDecimatorN cic;
fir_f32_t fir;

void i2s_process_block(int32_t *samples, int n) {
    static float fir_buf[256];
    static float fir_out[256];
    static int fir_idx = 0;  // static: persists across calls

    for (int i = 0; i < n; i++) {
        // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame.
        // Keep the top 16 bits as the integer CIC input.
        int32_t s16 = samples[i] / 65536;
        int32_t cic_out;
        if (cic_n_process(&cic, s16, &cic_out)) {
            if (fir_idx < 256) {  // bounds guard
                fir_buf[fir_idx++] = (float)cic_out / (3 * 3 * 3);  // CIC gain = M^K = 3^3 = 27
            }

            if (fir_idx >= 64) {
                // FIR filter
                dsps_fir_f32(&fir, fir_buf, fir_out, 64);
                // Downsample by 2
                for (int j = 0; j < 32; j++) {
                    float out_sample = fir_out[j * 2];
                    // out_sample is now at 8 kHz — process further
                }
                fir_idx = 0;
            }
        }
    }
}

Performance budget (ESP32-S3, 240 MHz)

Stage	Rate	Operation	Est. cycles	Time
CIC order-3, $\div 3$	48 kHz in	3 adds per sample	~6/sample	0.025 us
FIR 32-tap (ESP-DSP)	16 kHz	32 MACs per sample	~60/sample	0.25 us
Downsample $\div 2$	16 kHz → 8 kHz	Copy	~2	0.01 us
Total per 8 kHz output				~0.66 us
Available per output (8 kHz)				125 us
Utilisation				~0.5%
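The per-output total combines stages that run at different rates: each 8 kHz output consumes six CIC input samples and two FIR evaluations. A sketch of that arithmetic follows, with hypothetical helper names and the estimated cycle counts from the table as inputs:

```c
#include <assert.h>
#include <stdint.h>

// Total cycles spent per final output sample: each stage's per-sample cost
// multiplied by how many times it runs per output, plus the copy.
uint32_t pipeline_cycles_per_output(uint32_t cic_cycles, uint32_t cic_runs,
                                    uint32_t fir_cycles, uint32_t fir_runs,
                                    uint32_t copy_cycles) {
    return cic_cycles * cic_runs + fir_cycles * fir_runs + copy_cycles;
}

// Convert a cycle count to nanoseconds for a clock given in MHz.
uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_mhz) {
    assert(cpu_mhz > 0);
    return cycles * 1000u / cpu_mhz;
}
```

With the table's estimates, `pipeline_cycles_per_output(6, 6, 60, 2, 2)` is 158 cycles, and `cycles_to_ns(158, 240)` is 658 ns, which is the ~0.66 us total shown above.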

Platform comparison

Feature	STM32F4 (NUCLEO-F446RE)	ESP32-S3
FIR decimation library	`arm_fir_decimate_f32` (polyphase, optimised)	`dsps_fir_f32` + manual downsample
FIR interpolation library	`arm_fir_interpolate_f32`	No built-in (manual implementation)
Fixed-point support	Q15/Q31 variants with SIMD	s16 FIR with SIMD (~2.4x speedup)
CIC implementation	Hand-written (same C on both)	Hand-written (same C on both)
ADC for high-rate input	Up to 2.4 Msps (12-bit, DMA)	~100 ksps practical (noisy ADC)
Best for	High-rate sensor decimation, SDR front-ends	Audio-rate multirate, I2S codec interfacing

Recommendation

STM32F4 for high-rate decimation: the STM32F4’s on-chip ADC can sample at MHz rates, and CMSIS-DSP’s polyphase FIR decimation avoids computing discarded samples. Choose this for sigma-delta post-processing, vibration analysis, or SDR.
ESP32-S3 for audio-rate conversion: 48 → 16 → 8 kHz pipelines for voice processing, or upsampling for DAC output. The s16 FIR SIMD gives a 2.4x speedup over float.

References

Hogenauer, Eugene B. 1981. “An Economical Class of Digital Filters for Decimation and Interpolation.” IEEE Transactions on Acoustics, Speech, and Signal Processing 29 (2): 155–62.
