Biquad on Hardware

Real-time biquad filters on STM32F4 and ESP32

The biquad is the workhorse of embedded audio and sensor DSP. Parametric equalizers, crossover networks, and many feedback control loops on microcontrollers are built from cascaded second-order sections. The structure is minimal (five coefficients and two state variables), which makes it well suited to platforms with tight memory and cycle budgets.

This page covers practical biquad implementations for STM32F4 and ESP32 targets. For the theory, filter structures, and coefficient design, see the main biquad page. For why we cascade biquads instead of implementing high-order filters directly, see the SOS discussion in the filter design chapter.


STM32F4: Direct Form II biquad

The STM32F4 series (Cortex-M4F, up to 180 MHz on the NUCLEO-F446RE, single-precision FPU) is the natural home for biquad filters. A single biquad section takes roughly 10 cycles with hardware FPU, leaving room for dozens of cascaded sections at audio sample rates.

Bare-metal implementation

The Direct Form II implementation uses two state variables (w[1] and w[2]) and computes the output in two steps:

// Direct Form II biquad — single section, single sample
// b0, b1, b2: numerator coefficients
// a1, a2: denominator coefficients (negated, see note below)
// w[]: state variables, persisted between calls

void biquad_df2(float x, float *y,
                float b0, float b1, float b2,
                float a1, float a2, float w[3])
{
    w[0] = x + a1*w[1] + a2*w[2];
    *y   = b0*w[0] + b1*w[1] + b2*w[2];
    w[2] = w[1];
    w[1] = w[0];
}
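
Written as difference equations, the two steps look as follows. Here \(a_1\) and \(a_2\) are the transfer-function coefficients, so the values stored in the code are \(-a_1\) and \(-a_2\):

\[
\begin{aligned}
w[n] &= x[n] - a_1\, w[n-1] - a_2\, w[n-2] \\
y[n] &= b_0\, w[n] + b_1\, w[n-1] + b_2\, w[n-2]
\end{aligned}
\]
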
Negated denominator coefficients

In this implementation, a1 and a2 are stored with their signs flipped relative to the transfer function \(H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}\). The code computes w[0] = x + a1*w[1] rather than w[0] = x - a1*w[1], so the stored a1 is \(-a_1\) from the transfer function.

This convention matches CMSIS-DSP and many DSP textbooks. It replaces a subtraction with an addition in the inner loop, saving one cycle on architectures without a fused negate-multiply-accumulate instruction. When porting coefficients from SciPy (which uses the un-negated convention), negate a1 and a2 before loading them into the filter.
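
As a sketch of that porting step, the helper below converts rows in SciPy's second-order-section layout, {b0, b1, b2, a0, a1, a2}, into the five-coefficient negated layout used on this page. The name sos_to_negated is hypothetical, and the division by a0 is a precaution for rows that are not already normalised (SciPy designs normally have a0 = 1).

```c
#include <stddef.h>

/* Convert SOS rows {b0, b1, b2, a0, a1, a2} into {b0, b1, b2, -a1, -a2}.
 * Every coefficient is divided by a0 so the stored section has a0 == 1.
 * sos_to_negated() is an illustrative helper, not a library function. */
void sos_to_negated(const float *sos, size_t n_sections, float *out)
{
    for (size_t k = 0; k < n_sections; k++) {
        const float *row = &sos[6 * k];
        float inv_a0 = 1.0f / row[3];
        out[5 * k + 0] =  row[0] * inv_a0;
        out[5 * k + 1] =  row[1] * inv_a0;
        out[5 * k + 2] =  row[2] * inv_a0;
        out[5 * k + 3] = -row[4] * inv_a0;   /* negated a1 */
        out[5 * k + 4] = -row[5] * inv_a0;   /* negated a2 */
    }
}
```

The output array has the same per-section ordering as the CMSIS-DSP coefficient array shown later on this page.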

Cascading multiple sections

A higher-order filter is a cascade of biquad sections. Each section’s output feeds the next section’s input:

#define MAX_SECTIONS 6

typedef struct {
    float b0, b1, b2;
    float a1, a2;       // negated convention
    float w[3];         // state variables
} biquad_section_t;

typedef struct {
    biquad_section_t sections[MAX_SECTIONS];
    int n_sections;
} biquad_cascade_t;

float biquad_cascade_process(biquad_cascade_t *cascade, float x)
{
    float signal = x;
    for (int k = 0; k < cascade->n_sections; k++) {
        biquad_section_t *s = &cascade->sections[k];
        s->w[0] = signal + s->a1 * s->w[1] + s->a2 * s->w[2];
        signal   = s->b0 * s->w[0] + s->b1 * s->w[1] + s->b2 * s->w[2];
        s->w[2]  = s->w[1];
        s->w[1]  = s->w[0];
    }
    return signal;
}

Each section costs 5 multiplies and 4 adds. A 6th-order Butterworth (3 sections) therefore needs 15 multiplies and 12 adds per sample. At roughly 10 cycles per section that is about 30 cycles, or under 0.2 µs at 180 MHz with the hardware FPU.

CMSIS-DSP alternative

ARM’s CMSIS-DSP library provides biquad implementations optimised for the Cortex-M4 FPU; the fixed-point Q15/Q31 variants additionally use SIMD integer instructions:

#include "arm_math.h"

#define NUM_SECTIONS  3
#define BLOCK_SIZE    1

// Coefficient array: {b0, b1, b2, a1, a2} per section (a1, a2 negated)
static float32_t coeffs[5 * NUM_SECTIONS];
// State array: 4 state variables per section
static float32_t state[4 * NUM_SECTIONS];
static arm_biquad_casc_df1_inst_f32 filter;

void init_filter(void) {
    arm_biquad_cascade_df1_init_f32(&filter, NUM_SECTIONS,
                                     coeffs, state);
}

void process_sample(float32_t *in, float32_t *out) {
    arm_biquad_cascade_df1_f32(&filter, in, out, BLOCK_SIZE);
}

The CMSIS-DSP implementation uses Direct Form I (not DF-II) with four state variables per section. This is deliberate: DF-I is more robust for fixed-point variants (Q15, Q31) because the separate input and output delay lines prevent feedback overflow from corrupting input history. The floating-point version uses the same structure for API consistency.
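
To make the structural difference concrete, here is a minimal Direct Form I section in plain C. It is a sketch for comparison with the DF-II code above, not the CMSIS-DSP source; df1_section_t and df1_process are illustrative names, and a1, a2 follow the negated convention used on this page.

```c
/* Direct Form I biquad section: separate input and output delay lines,
 * giving four state variables instead of the two used by Direct Form II. */
typedef struct {
    float b0, b1, b2, a1, a2;   /* a1, a2 negated */
    float x1, x2;               /* previous two inputs  */
    float y1, y2;               /* previous two outputs */
} df1_section_t;

float df1_process(df1_section_t *s, float x)
{
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
            + s->a1 * s->y1 + s->a2 * s->y2;
    s->x2 = s->x1;  s->x1 = x;   /* shift the input history  */
    s->y2 = s->y1;  s->y1 = y;   /* shift the output history */
    return y;
}
```

The delay lines hold only input and output samples, which are bounded by the signal range, whereas the DF-II state w[n] can grow well beyond it near a resonance.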

Tip

For block processing (e.g., processing 64 samples at a time from a DMA buffer), set BLOCK_SIZE to the block length. CMSIS-DSP will process all samples in a single call with loop-unrolled inner code, significantly faster than calling once per sample.
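
The same idea applies to a hand-written filter. The sketch below processes a whole block through one Direct Form II section, loading coefficients and state into locals once per block instead of once per sample. The function name biquad_df2_block and the {w1, w2} state layout are assumptions made for this example.

```c
#include <stddef.h>

/* Process n samples through one Direct Form II section.
 * coeffs = {b0, b1, b2, a1, a2} with a1, a2 negated; state = {w1, w2}.
 * Local copies let the compiler keep everything in registers
 * for the duration of the block. */
void biquad_df2_block(const float coeffs[5], float state[2],
                      const float *in, float *out, size_t n)
{
    const float b0 = coeffs[0], b1 = coeffs[1], b2 = coeffs[2];
    const float a1 = coeffs[3], a2 = coeffs[4];
    float w1 = state[0], w2 = state[1];

    for (size_t i = 0; i < n; i++) {
        float w0 = in[i] + a1 * w1 + a2 * w2;
        out[i]   = b0 * w0 + b1 * w1 + b2 * w2;
        w2 = w1;
        w1 = w0;
    }

    state[0] = w1;   /* write the state back once per block */
    state[1] = w2;
}
```

For a cascade, call the function once per section and feed each section's output buffer to the next; in and out may point to the same buffer.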


ESP32: C++ biquad for real-time audio

The ESP32-S3 (dual-core Xtensa LX7, 240 MHz, single-precision FPU) is well-suited for audio biquad processing. It lacks CMSIS-DSP, but the biquad is simple enough to implement directly. The built-in I2S peripheral connects directly to MEMS microphones and DAC codecs without external ADC hardware.

Biquad class

A C++ biquad class suitable for real-time audio on ESP32:

class BiquadSection {
public:
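    // Pass-through by default so process() is safe before set_coefficients()
    float b0 = 1.0f, b1 = 0.0f, b2 = 0.0f, a1 = 0.0f, a2 = 0.0f;  // a1, a2 negated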
    float w1 = 0.0f, w2 = 0.0f;

    void set_coefficients(float _b0, float _b1, float _b2,
                          float _a1, float _a2) {
        b0 = _b0; b1 = _b1; b2 = _b2;
        a1 = _a1; a2 = _a2;
    }

    float process(float x) {
        float w0 = x + a1 * w1 + a2 * w2;
        float y  = b0 * w0 + b1 * w1 + b2 * w2;
        w2 = w1;
        w1 = w0;
        return y;
    }

    void reset() { w1 = 0.0f; w2 = 0.0f; }
};

class BiquadCascade {
public:
    static constexpr int MAX_SECTIONS = 8;
    BiquadSection sections[MAX_SECTIONS];
    int n_sections = 0;

    void add_section(float b0, float b1, float b2,
                     float a1, float a2) {
        if (n_sections < MAX_SECTIONS) {
            sections[n_sections].set_coefficients(b0, b1, b2, a1, a2);
            n_sections++;
        }
    }

    float process(float x) {
        float signal = x;
        for (int k = 0; k < n_sections; k++) {
            signal = sections[k].process(signal);
        }
        return signal;
    }

    void reset() {
        for (int k = 0; k < n_sections; k++)
            sections[k].reset();
    }
};

I2S microphone setup

Legacy API

The code below uses the legacy ESP-IDF v4.x I2S API (driver/i2s.h), which is deprecated in ESP-IDF v5.x and replaced by driver/i2s_std.h. For the v5.x API, see the pitch detection or beamforming embedded pages.

The ESP32 I2S peripheral connects directly to digital MEMS microphones like the INMP441:

#include "driver/i2s.h"

i2s_config_t i2s_config = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .dma_buf_count = 4,
    .dma_buf_len = 64,
    .use_apll = true,
};

// Example pin assignment; choose GPIOs that exist on your ESP32 variant and board
i2s_pin_config_t pin_config = {
    .bck_io_num = 26,
    .ws_io_num = 25,
    .data_out_num = I2S_PIN_NO_CHANGE,
    .data_in_num = 22,
};

void audio_init() {
    i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL);
    i2s_set_pin(I2S_NUM_0, &pin_config);
}

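The audio processing loop reads I2S samples, converts them to float, and runs the biquad cascade. What happens to the filtered output (DAC, I2S TX, or a buffer for further processing) is left to the application: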

BiquadCascade eq;

void audio_task(void *param) {
    int32_t raw_samples[64];
    size_t bytes_read;

    while (true) {
        i2s_read(I2S_NUM_0, raw_samples, sizeof(raw_samples),
                 &bytes_read, portMAX_DELAY);

        int n_samples = bytes_read / sizeof(int32_t);
        if (n_samples > 64) n_samples = 64;  // clamp to buffer size
        for (int i = 0; i < n_samples; i++) {
            // Convert 32-bit I2S to float [-1, 1]
            float x = (float)raw_samples[i] / 2147483648.0f;
            float y = eq.process(x);
            // Output y to DAC, I2S TX, or buffer for further processing
        }
    }
}

Pin the audio task to core 1 (xTaskCreatePinnedToCore(audio_task, "audio", 4096, NULL, 5, NULL, 1)) and keep Wi-Fi on core 0 to avoid jitter from wireless stack interrupts.
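
If the filtered samples go back out over I2S, they have to be converted to integers again, and an EQ boost can push values outside \([-1, 1)\). A minimal sketch of a clamping conversion in plain C follows; float_to_i32 is a hypothetical helper name, and the scale factor mirrors the input conversion in the loop above.

```c
#include <stdint.h>

/* Convert a float sample back to a 32-bit signed integer.
 * Out-of-range values are clamped instead of wrapping, and NaN maps to 0. */
int32_t float_to_i32(float y)
{
    if (y != y)     return 0;           /* NaN check without math.h */
    if (y >= 1.0f)  return INT32_MAX;
    if (y <= -1.0f) return INT32_MIN;
    return (int32_t)(y * 2147483648.0f);
}
```

Clamping turns an overflow into ordinary clipping distortion, which is far less objectionable than the full-scale wrap-around produced by an unchecked cast.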


Fixed-point considerations

On platforms without hardware FPU (or where power consumption demands fixed-point), biquad coefficients and state must be represented in integer format.

Q-format representation

Format | Bits | Range | Resolution | Use case
Q15 | 16 | \([-1, 1 - 2^{-15}]\) | \(3.05 \times 10^{-5}\) | Low-power, small filters
Q31 | 32 | \([-1, 1 - 2^{-31}]\) | \(4.66 \times 10^{-10}\) | High-quality audio

A coefficient stored in Q15 is an integer in \([-32768, 32767]\) that represents the value \(\text{integer} / 32768\). All arithmetic stays in the integer domain.
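
As an illustration, the conversion in both directions can be sketched as below. The helper names float_to_q15 and q15_to_float are made up for this example; the rounding is done by hand to avoid a dependency on the math library.

```c
#include <stdint.h>

/* Quantize a float to Q15 with rounding and saturation. */
int16_t float_to_q15(float v)
{
    float scaled = v * 32768.0f;
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (scaled >= 32767.0f)  return INT16_MAX;   /* +1.0 is not representable */
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;                      /* cast truncates toward zero */
}

/* Recover the value a Q15 integer represents. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Note the asymmetry: \(-1.0\) maps exactly to \(-32768\), while \(+1.0\) saturates to \(32767\), one step below full scale.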

Why accumulator width matters

A single biquad multiply-accumulate step multiplies a Q15 coefficient by a Q15 sample, producing a Q30 result that fits in 32 bits. Summing five such products, however, can overflow a 32-bit accumulator unless the signal is scaled down to leave guard bits. A 64-bit accumulator removes that risk for Q15, and it is required for Q31, where each product alone is already 64 bits wide:

Coefficient format | Multiply result | Accumulator width needed
Q15 × Q15 | Q30 (32 bits) | 32 bits only with guard-bit scaling; CMSIS-DSP uses a 64-bit accumulator for safety
Q31 × Q31 | Q62 (64 bits) | 64 bits

ARM Cortex-M4 provides the SMLAL instruction (signed multiply-accumulate long) that multiplies two 32-bit values and accumulates into a 64-bit register pair, exactly what is needed for Q31 biquad processing.
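
A portable sketch of a Q15 section with a wide accumulator is shown below. It uses Direct Form I, rounds before the final shift, and saturates the result. The type and function names are illustrative, and the code assumes the compiler implements right shift of negative values as an arithmetic shift, as GCC and Clang do.

```c
#include <stdint.h>

typedef struct {
    int16_t b0, b1, b2, a1, a2;   /* Q15 coefficients, a1/a2 negated */
    int16_t x1, x2, y1, y2;       /* Q15 delay lines */
} biquad_q15_t;

static int16_t sat_q15(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

int16_t biquad_q15_process(biquad_q15_t *s, int16_t x)
{
    /* Each product is Q30 and fits in 32 bits, but five of them can
     * reach 5 * 2^30, so the sum is kept in 64 bits. */
    int64_t acc = 0;
    acc += (int32_t)s->b0 * x;
    acc += (int32_t)s->b1 * s->x1;
    acc += (int32_t)s->b2 * s->x2;
    acc += (int32_t)s->a1 * s->y1;
    acc += (int32_t)s->a2 * s->y2;

    acc += 1 << 14;                    /* round to nearest before the shift */
    int16_t y = sat_q15(acc >> 15);    /* Q30 -> Q15, arithmetic shift assumed */

    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}
```

On a Cortex-M4 the compiler can map the accumulation onto multiply-accumulate-long instructions, but that depends on the toolchain and optimisation level.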

Limit cycles

In fixed-point biquad implementations, rounding errors in the feedback path can cause limit cycles: small persistent oscillations at the output even when the input is zero. Using a wider accumulator and rounding (rather than truncating) the output reduces this problem. CMSIS-DSP’s Q15 biquad uses a 64-bit accumulator internally for this reason.

Coefficients greater than 1

Biquad coefficients can exceed \(\pm 1\) (e.g., a peaking EQ with high gain). Since Q-format represents only \([-1, 1)\), these coefficients cannot be stored directly. Solutions:

  1. Pre-scale all coefficients of a section by a power of two so they fit in \([-1, 1)\), and apply the compensating gain (a left shift) to the accumulator result
  2. Use Q14 or Q30 format with a wider range \([-2, 2)\) at the cost of one bit of precision
  3. Store the integer and fractional parts separately (rarely done for biquads)
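
The first option can be sketched as follows: find the smallest power-of-two shift that brings every coefficient of a section below 1 in magnitude, scale the coefficients by it, and remember the shift. In a Direct Form I fixed-point section the accumulator is then shifted left by the same amount before the result is stored in the output delay line. CMSIS-DSP's fixed-point biquad init functions take a postShift argument that plays this role. The function name prescale_coeffs is hypothetical.

```c
#include <stddef.h>

/* Scale n coefficients in place by 2^-shift so that every |c[i]| < 1,
 * and return the shift. Apply it to all five coefficients of a section
 * (b0, b1, b2, a1, a2) so that feedforward and feedback stay consistent. */
int prescale_coeffs(float *c, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = (c[i] < 0.0f) ? -c[i] : c[i];
        if (a > max_abs) max_abs = a;
    }

    int shift = 0;
    float scale = 1.0f;
    while (max_abs * scale >= 1.0f) {   /* 1.0 itself does not fit either */
        scale *= 0.5f;
        shift++;
    }

    for (size_t i = 0; i < n; i++)
        c[i] *= scale;
    return shift;
}
```

For a section with coefficients {1.5, -1.9, 0.5, 0.25, -0.9} the function returns 1: the stored values are halved, and the accumulator must be shifted left by one bit. Each bit of shift costs one bit of coefficient precision, the same trade-off as the Q14/Q30 option.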

The coefficient quantization demo on the theory page shows how Q15 quantization shifts the response of a narrow bandpass filter.


Audio EQ on embedded

A parametric equalizer on a microcontroller is a cascade of biquad sections with user-adjustable parameters. A typical 3-band EQ (low shelf, parametric mid, high shelf) requires just three biquad sections, under 100 bytes of memory and less than 1 µs of processing per sample.

Coefficient updates and click avoidance

When the user adjusts an EQ knob, the biquad coefficients change. Swapping coefficients instantaneously between samples produces discontinuities in the output, audible as clicks or pops. Two approaches to smooth updates:

Crossfade. Maintain two filter instances, each with its own coefficients and state. When coefficients change, start feeding input to the new filter while fading out the old one over a short window (typically 5–10 ms, i.e., 80–160 samples at 16 kHz). This doubles the memory and computation during the crossfade but guarantees a smooth transition.

// Simplified crossfade between old and new biquad
for (int i = 0; i < fade_len; i++) {
    float alpha = (float)i / fade_len;
    float y_old = biquad_cascade_process(&old_filter, x[i]);
    float y_new = biquad_cascade_process(&new_filter, x[i]);
    output[i] = (1.0f - alpha) * y_old + alpha * y_new;
}

Parameter smoothing. Instead of updating coefficients in one step, interpolate the design parameters (frequency, Q, gain) smoothly and recompute coefficients each block. This is more expensive (trigonometric functions in the coefficient formulas) but produces more natural-sounding transitions. Many audio plugin frameworks use this approach, recomputing coefficients once per block (e.g., every 64 samples) with smoothed parameters.
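
A minimal sketch of the smoothing half of this approach is a one-pole smoother that moves a parameter a fixed fraction of the way toward its target on every block. The struct, the function name, and the rate value are assumptions for illustration. Recomputing the coefficients from the smoothed parameters is left out, since it depends on the design formulas on the theory page.

```c
/* One-pole parameter smoother, stepped once per audio block. */
typedef struct {
    float current;   /* value used for the next coefficient update */
    float target;    /* value requested by the user (knob position) */
    float rate;      /* fraction of the remaining distance per block, 0..1 */
} param_smoother_t;

float param_smoother_step(param_smoother_t *p)
{
    p->current += p->rate * (p->target - p->current);
    return p->current;
}
```

One smoother per design parameter (frequency, Q, gain) is stepped at the start of each block, and the coefficients are recomputed from the three smoothed values. A smaller rate gives a slower, smoother transition.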

Note

Linear interpolation of biquad coefficients does not correspond to linear interpolation of the frequency response. Intermediate coefficient sets may have unexpected resonance peaks; see the open questions on the theory page.


Platform comparison

Feature | STM32F4 | ESP32
Clock | 180 MHz (NUCLEO-F446RE) | 240 MHz
FPU | Yes (Cortex-M4F) | Yes (Xtensa LX7)
CMSIS-DSP biquad | Yes (arm_biquad_cascade_df1_f32, Q15, Q31) | No
I2S | Via external codec | Built-in
Memory (SRAM) | 192 KB | 512 KB
Audio latency (typical) | < 1 ms (bare-metal, sample-by-sample) | 4–8 ms (FreeRTOS, DMA blocks)
Wi-Fi/BT | No (needs external module) | Built-in
Real-time determinism | Excellent (bare-metal or RTOS) | Good (with core pinning)
Unit cost | ~EUR 10 | ~EUR 5
Best for | Deterministic DSP, production audio | Prototyping, IoT audio

Recommendation

  • STM32F4 for production audio DSP: deterministic timing, CMSIS-DSP library with optimised fixed-point variants, and established use in professional audio equipment. Choose this when latency, reproducibility, or fixed-point performance matter.
  • ESP32 for prototyping and connected devices: cheaper, built-in I2S and wireless, easier to get audio flowing with minimal external hardware. Choose this for EQ demos, IoT sensor filtering, or wireless audio experiments.

FIR comparison

For contrast, here is a general-purpose FIR filter using a circular buffer, the standard embedded technique that avoids shifting the entire delay line on every sample:

// FIR filter with circular buffer
// N: number of taps, buf[]: circular sample buffer
// coeffs[]: filter coefficients, ind: current buffer index

void fir_filter(float x, float *y, float *buf, const float *coeffs,
                int N, int *ind)
{
    buf[*ind] = x;
    if (*ind < N - 1) (*ind)++; else *ind = 0;

    float acc = 0.0f;
    int k = *ind;
    for (int i = 0; i < N; i++) {
        acc += buf[k] * coeffs[N - i - 1];
        if (k < N - 1) k++; else k = 0;
    }
    *y = acc;
}

The circular buffer wraps the index instead of shifting data: each sample requires \(N\) multiplies but zero memory moves. The memmove-based approach (shift the buffer on every sample) is simpler but costs \(O(N)\) memory operations on top of the \(N\) multiplies.

When to use FIR vs biquad

Criterion | FIR | Biquad (IIR)
Phase | Linear (symmetric coefficients) | Nonlinear
Memory | \(N\) coefficients + \(N\) samples | 5 coefficients + 2 state variables per section
Computation | \(N\) MACs per sample | ~5 MACs per section per sample
Stability | Always stable | Can be unstable if poles outside unit circle
Sharp cutoff | Requires many taps (100+) | A few sections suffice
Typical use | Anti-aliasing, matched filtering, linear-phase EQ | Audio EQ, control loops, sensor filtering

Rule of thumb for microcontrollers: if you need linear phase or very specific impulse response shapes (e.g., matched filtering), use FIR. If you need a sharp frequency-selective filter with minimal memory and computation (e.g., audio crossover, DC removal, bandpass for sensor data), use cascaded biquads. On a Cortex-M4F at 180 MHz (NUCLEO-F446RE) with an 8 kHz sample rate, either a 256-tap FIR or a 12th-order IIR (6 biquad sections) fits comfortably in the cycle budget. The difference is memory: the biquad cascade uses ~48 bytes of state (6 sections × 2 floats), versus ~1 KB for the FIR’s 256-sample delay line (2 KB including its coefficients).
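
The numbers behind this comparison can be worked out directly, assuming 4-byte floats and the clock and sample rate quoted above:

\[
\frac{180 \times 10^{6}\ \text{cycles/s}}{8000\ \text{samples/s}} = 22\,500\ \text{cycles per sample}
\]

\[
\text{biquad state: } 6 \times 2 \times 4\ \text{B} = 48\ \text{B}, \qquad
\text{FIR delay line: } 256 \times 4\ \text{B} = 1024\ \text{B}
\]

Against a budget of 22 500 cycles, the 256 MACs of the FIR and the roughly 60 cycles of the six-section cascade (at about 10 cycles per section) are both small, which is why memory rather than computation separates the two in this example.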