Biquad on Hardware
Real-time biquad filters on STM32F4 and ESP32
The biquad is the workhorse of embedded audio and sensor DSP. Parametric equalizers, crossover networks, and many feedback control loops on microcontrollers are built from cascaded second-order sections. The structure is minimal (five coefficients and two state variables per section), which makes it well suited to platforms with tight memory and cycle budgets.
This page covers practical biquad implementations for STM32F4 and ESP32 targets. For the theory, filter structures, and coefficient design, see the main biquad page. For why we cascade biquads instead of implementing high-order filters directly, see the SOS discussion in the filter design chapter.
STM32F4: Direct Form II biquad
The STM32F4 series (Cortex-M4F, up to 180 MHz on the NUCLEO-F446RE, single-precision FPU) is the natural home for biquad filters. A single biquad section takes roughly 10 cycles with hardware FPU, leaving room for dozens of cascaded sections at audio sample rates.
Bare-metal implementation
The Direct Form II implementation uses two state variables (w[1] and w[2]) and computes the output in two steps:
// Direct Form II biquad — single section, single sample
// b0, b1, b2: numerator coefficients
// a1, a2: denominator coefficients (negated, see note below)
// w[]: state variables, persisted between calls
void biquad_df2(float x, float *y,
float b0, float b1, float b2,
float a1, float a2, float w[3])
{
w[0] = x + a1*w[1] + a2*w[2];
*y = b0*w[0] + b1*w[1] + b2*w[2];
w[2] = w[1];
w[1] = w[0];
}
In this implementation, a1 and a2 are stored with their signs flipped relative to the transfer function \(H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}\). The code computes w[0] = x + a1*w[1] rather than w[0] = x - a1*w[1], so the stored a1 is \(-a_1\) from the transfer function.
This convention matches CMSIS-DSP and many DSP textbooks. It replaces a subtraction with an addition in the inner loop, saving one cycle on architectures without a fused negate-multiply-accumulate instruction. When porting coefficients from SciPy (which uses the un-negated convention), negate a1 and a2 before loading them into the filter.
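The sign flip described above can be sketched as a small conversion routine. This is a minimal illustration, not part of any library: the function name sos_row_to_negated is hypothetical, and it assumes the input row uses the six-column SOS layout {b0, b1, b2, a0, a1, a2} that SciPy produces. It also divides by a0 in case the row is not already normalised.

```c
// Hypothetical helper: convert one un-negated SOS row
// {b0, b1, b2, a0, a1, a2} into the {b0, b1, b2, a1, a2} layout used
// on this page, with a1 and a2 negated.
// Returns 0 on success, -1 if a0 is zero (row cannot be normalised).
int sos_row_to_negated(const float sos[6], float out[5])
{
    float a0 = sos[3];
    if (a0 == 0.0f) {
        return -1;
    }
    out[0] = sos[0] / a0;    // b0
    out[1] = sos[1] / a0;    // b1
    out[2] = sos[2] / a0;    // b2
    out[3] = -sos[4] / a0;   // stored a1 = -a1 of the transfer function
    out[4] = -sos[5] / a0;   // stored a2 = -a2 of the transfer function
    return 0;
}
```

Running the conversion once at start-up, or offline when generating a coefficient table, keeps the sign handling out of the per-sample loop.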
Cascading multiple sections
A higher-order filter is a cascade of biquad sections. Each section’s output feeds the next section’s input:
#define MAX_SECTIONS 6
typedef struct {
float b0, b1, b2;
float a1, a2; // negated convention
float w[3]; // state variables
} biquad_section_t;
typedef struct {
biquad_section_t sections[MAX_SECTIONS];
int n_sections;
} biquad_cascade_t;
float biquad_cascade_process(biquad_cascade_t *cascade, float x)
{
float signal = x;
for (int k = 0; k < cascade->n_sections; k++) {
biquad_section_t *s = &cascade->sections[k];
s->w[0] = signal + s->a1 * s->w[1] + s->a2 * s->w[2];
signal = s->b0 * s->w[0] + s->b1 * s->w[1] + s->b2 * s->w[2];
s->w[2] = s->w[1];
s->w[1] = s->w[0];
}
return signal;
}
Each section costs 5 multiplies and 4 adds. A 6th-order Butterworth (3 sections) therefore takes 15 multiplies and 12 adds, or about 30 cycles at the roughly 10 cycles per section quoted above, well under 1 µs at 180 MHz with hardware FPU.
CMSIS-DSP alternative
ARM’s CMSIS-DSP library provides biquad implementations optimised for the Cortex-M4 FPU; the fixed-point Q15/Q31 variants additionally use SIMD integer instructions:
#include "arm_math.h"
#define NUM_SECTIONS 3
#define BLOCK_SIZE 1
// Coefficient array: {b0, b1, b2, a1, a2} per section (a1, a2 negated)
static float32_t coeffs[5 * NUM_SECTIONS];
// State array: 4 state variables per section
static float32_t state[4 * NUM_SECTIONS];
static arm_biquad_casc_df1_inst_f32 filter;
void init_filter(void) {
arm_biquad_cascade_df1_init_f32(&filter, NUM_SECTIONS,
coeffs, state);
}
void process_sample(float32_t *in, float32_t *out) {
arm_biquad_cascade_df1_f32(&filter, in, out, BLOCK_SIZE);
}
The CMSIS-DSP implementation uses Direct Form I (not DF-II) with four state variables per section. This is deliberate: DF-I is more robust for fixed-point variants (Q15, Q31) because the separate input and output delay lines prevent feedback overflow from corrupting input history. The floating-point version uses the same structure for API consistency.
For block processing (e.g., processing 64 samples at a time from a DMA buffer), set BLOCK_SIZE to the block length. CMSIS-DSP will process all samples in a single call with loop-unrolled inner code, significantly faster than calling once per sample.
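To illustrate why block processing helps, here is a plain-C sketch of a single Direct Form II section that processes a whole block per call. It is not the CMSIS-DSP implementation; the function name biquad_df2_block and the {w1, w2} state layout are assumptions made for this example. The point is that coefficients and state are loaded once per block rather than once per sample.

```c
// Process n samples through one Direct Form II section.
// coeffs: {b0, b1, b2, a1, a2} with a1, a2 negated (as on this page).
// state:  {w1, w2}, persisted between calls.
void biquad_df2_block(const float *in, float *out, int n,
                      const float coeffs[5], float state[2])
{
    // Copy coefficients and state into locals so the compiler can keep
    // them in FPU registers for the duration of the block.
    const float b0 = coeffs[0], b1 = coeffs[1], b2 = coeffs[2];
    const float a1 = coeffs[3], a2 = coeffs[4];
    float w1 = state[0];
    float w2 = state[1];

    for (int i = 0; i < n; i++) {
        float w0 = in[i] + a1 * w1 + a2 * w2;
        out[i] = b0 * w0 + b1 * w1 + b2 * w2;
        w2 = w1;
        w1 = w0;
    }

    // Write the state back once, after the whole block.
    state[0] = w1;
    state[1] = w2;
}
```

Called with n equal to the DMA buffer length, this amortises the function-call and load/store overhead across the block, which is the same effect the CMSIS-DSP block interface exploits.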
ESP32: C++ biquad for real-time audio
The ESP32-S3 (dual-core Xtensa LX7, 240 MHz, single-precision FPU) is well-suited for audio biquad processing. It lacks CMSIS-DSP, but the biquad is simple enough to implement directly. The built-in I2S peripheral connects directly to MEMS microphones and DAC codecs without external ADC hardware.
Biquad class
A C++ biquad class suitable for real-time audio on ESP32:
class BiquadSection {
public:
float b0, b1, b2, a1, a2; // a1, a2 negated
float w1 = 0.0f, w2 = 0.0f;
void set_coefficients(float _b0, float _b1, float _b2,
float _a1, float _a2) {
b0 = _b0; b1 = _b1; b2 = _b2;
a1 = _a1; a2 = _a2;
}
float process(float x) {
float w0 = x + a1 * w1 + a2 * w2;
float y = b0 * w0 + b1 * w1 + b2 * w2;
w2 = w1;
w1 = w0;
return y;
}
void reset() { w1 = 0.0f; w2 = 0.0f; }
};
class BiquadCascade {
public:
static constexpr int MAX_SECTIONS = 8;
BiquadSection sections[MAX_SECTIONS];
int n_sections = 0;
void add_section(float b0, float b1, float b2,
float a1, float a2) {
if (n_sections < MAX_SECTIONS) {
sections[n_sections].set_coefficients(b0, b1, b2, a1, a2);
n_sections++;
}
}
float process(float x) {
float signal = x;
for (int k = 0; k < n_sections; k++) {
signal = sections[k].process(signal);
}
return signal;
}
void reset() {
for (int k = 0; k < n_sections; k++)
sections[k].reset();
}
};
I2S microphone setup
The code below uses the legacy ESP-IDF v4.x I2S API (driver/i2s.h), which has been deprecated since ESP-IDF v5.0. For the v5.x API (driver/i2s_std.h), see the pitch detection or beamforming embedded pages.
The ESP32 I2S peripheral connects directly to digital MEMS microphones like the INMP441:
#include "driver/i2s.h"
i2s_config_t i2s_config = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
.sample_rate = 16000,
.bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,
.channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
.communication_format = I2S_COMM_FORMAT_STAND_I2S,
.dma_buf_count = 4,
.dma_buf_len = 64,
.use_apll = true,
};
i2s_pin_config_t pin_config = {
.bck_io_num = 26,
.ws_io_num = 25,
.data_out_num = I2S_PIN_NO_CHANGE,
.data_in_num = 22,
};
void audio_init() {
i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL);
i2s_set_pin(I2S_NUM_0, &pin_config);
}
The audio processing loop reads I2S samples, runs the biquad cascade, and writes the filtered output:
BiquadCascade eq;
void audio_task(void *param) {
int32_t raw_samples[64];
size_t bytes_read;
while (true) {
i2s_read(I2S_NUM_0, raw_samples, sizeof(raw_samples),
&bytes_read, portMAX_DELAY);
int n_samples = bytes_read / sizeof(int32_t);
if (n_samples > 64) n_samples = 64; // clamp to buffer size
for (int i = 0; i < n_samples; i++) {
// Convert 32-bit I2S to float [-1, 1]
float x = (float)raw_samples[i] / 2147483648.0f;
float y = eq.process(x);
// Output y to DAC, I2S TX, or buffer for further processing
}
}
}
Pin the audio task to core 1 (xTaskCreatePinnedToCore(audio_task, "audio", 4096, NULL, 5, NULL, 1)) and keep Wi-Fi on core 0 to avoid jitter from wireless stack interrupts.
Fixed-point considerations
On platforms without hardware FPU (or where power consumption demands fixed-point), biquad coefficients and state must be represented in integer format.
Q-format representation
| Format | Bits | Range | Resolution | Use case |
|---|---|---|---|---|
| Q15 | 16 | \([-1, 1 - 2^{-15}]\) | \(3.05 \times 10^{-5}\) | Low-power, small filters |
| Q31 | 32 | \([-1, 1 - 2^{-31}]\) | \(4.66 \times 10^{-10}\) | High-quality audio |
A coefficient stored in Q15 is an integer in \([-32768, 32767]\) that represents the value \(\text{integer} / 32768\). All arithmetic stays in the integer domain.
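As a concrete illustration of the Q15 mapping, the following sketch converts between float and Q15 with rounding and saturation. The helper names float_to_q15 and q15_to_float are made up for this example and are not part of CMSIS-DSP or any other library mentioned on this page.

```c
#include <stdint.h>

// Convert a float to Q15, rounding to nearest and saturating to the
// representable range [-32768, 32767].
int16_t float_to_q15(float v)
{
    float scaled = v * 32768.0f;
    if (scaled >= 32767.0f) {
        return 32767;        // +1.0 is not representable, so clamp
    }
    if (scaled <= -32768.0f) {
        return -32768;
    }
    // Add half an LSB away from zero, then truncate.
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

// Convert a Q15 integer back to its real value.
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

For example, 0.5 maps to the integer 16384, while 1.0 saturates to 32767 because the largest representable value is \(1 - 2^{-15}\).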
Why accumulator width matters
A single biquad multiply-accumulate step multiplies a Q15 coefficient by a Q15 sample, producing a Q30 result that fits in 32 bits. Summing five such products can overflow a 32-bit accumulator, because the Q30 products leave almost no headroom. A 32-bit accumulator is therefore only marginal for Q15, and Q31 products require a 64-bit accumulator:
| Coefficient format | Multiply result | Accumulator width needed |
|---|---|---|
| Q15 × Q15 | Q30 (32 bits) | 32 bits (tight, CMSIS-DSP uses 64-bit accumulator for safety) |
| Q31 × Q31 | Q62 (64 bits) | 64 bits |
ARM Cortex-M4 provides the SMLAL instruction (signed multiply-accumulate long) that multiplies two 32-bit values and accumulates into a 64-bit register pair, exactly what is needed for Q31 biquad processing.
In fixed-point biquad implementations, rounding errors in the feedback path can cause limit cycles: small persistent oscillations at the output even when the input is zero. Using a wider accumulator and rounding (rather than truncating) the output reduces this problem. CMSIS-DSP’s Q15 biquad uses a 64-bit accumulator internally for this reason.
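The wide-accumulator and rounding ideas can be sketched in portable C as follows. This is an illustrative Direct Form I section written for this page, not the CMSIS-DSP source; the type biquad_q15_t and the function biquad_q15_process are hypothetical names. It assumes that right-shifting a negative int64_t is an arithmetic shift, which holds on the compilers commonly used for these targets but is implementation-defined in C.

```c
#include <stdint.h>

typedef struct {
    int16_t b0, b1, b2;  // numerator coefficients, Q15
    int16_t a1, a2;      // denominator coefficients, Q15, negated convention
    int16_t x1, x2;      // previous two inputs
    int16_t y1, y2;      // previous two outputs
} biquad_q15_t;

// One sample through a Direct Form I Q15 section.
int16_t biquad_q15_process(biquad_q15_t *s, int16_t x)
{
    // Each product is Q30; the 64-bit accumulator cannot overflow
    // when summing five of them.
    int64_t acc = 0;
    acc += (int64_t)s->b0 * x;
    acc += (int64_t)s->b1 * s->x1;
    acc += (int64_t)s->b2 * s->x2;
    acc += (int64_t)s->a1 * s->y1;
    acc += (int64_t)s->a2 * s->y2;

    // Round to nearest instead of truncating: add half an output LSB
    // (2^14 in Q30) before shifting back to Q15.
    acc += (int64_t)1 << 14;
    acc >>= 15;

    // Saturate to the Q15 range.
    if (acc > 32767) acc = 32767;
    if (acc < -32768) acc = -32768;
    int16_t y = (int16_t)acc;

    // Update the separate input and output delay lines.
    s->x2 = s->x1;
    s->x1 = x;
    s->y2 = s->y1;
    s->y1 = y;
    return y;
}
```

Saturating only once, at the output, means intermediate sums may exceed the Q15 range without wrapping, which is the practical benefit of the wider accumulator.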
Coefficients greater than 1
Biquad coefficients can exceed \(\pm 1\) (e.g., a peaking EQ with high gain). Since Q-format represents only \([-1, 1)\), these coefficients cannot be stored directly. Solutions:
- Pre-scale coefficients so all values fit in \([-1, 1)\), and apply a compensating gain at the output
- Use Q14 or Q30 format with a wider range \([-2, 2)\) at the cost of one bit of precision
- Store the integer and fractional parts separately (rarely done for biquads)
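The first two options amount to dividing every coefficient by a power of two and undoing that scaling after accumulation. The sketch below picks the smallest such power and quantizes the scaled coefficients to 16-bit integers; the function name scale_coeffs_q15 is hypothetical. CMSIS-DSP's fixed-point biquad init functions take a postShift argument that serves the same purpose.

```c
#include <stdint.h>

// Scale five biquad coefficients by 2^-shift so that all of them fit
// in the Q15 range, then quantize them.
// Returns the shift that the filter must undo after accumulation.
int scale_coeffs_q15(const float c[5], int16_t q[5])
{
    // Find the largest coefficient magnitude.
    float max_abs = 0.0f;
    for (int i = 0; i < 5; i++) {
        float a = (c[i] < 0.0f) ? -c[i] : c[i];
        if (a > max_abs) max_abs = a;
    }

    // Double the divisor until the largest value fits below +1.0.
    int shift = 0;
    float scale = 1.0f;
    while (max_abs / scale > 32767.0f / 32768.0f) {
        scale *= 2.0f;
        shift++;
    }

    // Quantize with round-to-nearest.
    for (int i = 0; i < 5; i++) {
        float scaled = c[i] / scale * 32768.0f;
        q[i] = (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    }
    return shift;
}
```

With shift equal to 1 the stored values are effectively Q14, covering \([-2, 2)\). The filter then shifts its accumulator right by 15 - shift bits instead of 15 to restore the original gain.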
The coefficient quantization demo on the theory page shows how Q15 quantization shifts the response of a narrow bandpass filter.
Audio EQ on embedded
A parametric equalizer on a microcontroller is a cascade of biquad sections with user-adjustable parameters. A typical 3-band EQ (low shelf, parametric mid, high shelf) requires just three biquad sections, under 100 bytes of memory and less than 1 µs of processing per sample.
Coefficient updates and click avoidance
When the user adjusts an EQ knob, the biquad coefficients change. Swapping coefficients instantaneously between samples produces discontinuities in the output, audible as clicks or pops. Two approaches to smooth updates:
Crossfade. Maintain two copies of the filter state. When coefficients change, start feeding input to the new filter while fading out the old one over a short window (typically 5–10 ms, i.e., 80–160 samples at 16 kHz). This doubles the memory and computation during the crossfade but guarantees a smooth transition.
// Simplified crossfade between old and new biquad
for (int i = 0; i < fade_len; i++) {
float alpha = (float)i / fade_len;
float y_old = biquad_cascade_process(&old_filter, x[i]);
float y_new = biquad_cascade_process(&new_filter, x[i]);
output[i] = (1.0f - alpha) * y_old + alpha * y_new;
}
Parameter smoothing. Instead of updating coefficients in one step, interpolate the design parameters (frequency, Q, gain) smoothly and recompute coefficients each block. This is more expensive (trigonometric functions in the coefficient formulas) but produces more natural-sounding transitions. Many audio plugin frameworks use this approach, recomputing coefficients once per block (e.g., every 64 samples) with smoothed parameters.
Linear interpolation of biquad coefficients does not correspond to linear interpolation of the frequency response. Intermediate states during a coefficient crossfade may have unexpected resonance peaks; see the open questions on the theory page.
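One simple way to realise parameter smoothing is a one-pole smoother applied to each design parameter once per block. The sketch below shows only the smoothing step; the type param_smoother_t and the function param_smoother_step are hypothetical names, and the coefficient recomputation is left to whichever design formulas are in use.

```c
// One-pole smoother for a single design parameter (e.g., gain in dB).
typedef struct {
    float current;  // smoothed value used to design the filter
    float target;   // value most recently requested by the user
    float k;        // fraction of the remaining distance covered
                    // per block, 0 < k <= 1
} param_smoother_t;

// Advance the smoother by one block and return the new value.
// Call once per block, then recompute the biquad coefficients from
// the returned value before processing the block.
float param_smoother_step(param_smoother_t *p)
{
    p->current += p->k * (p->target - p->current);
    return p->current;
}
```

With k = 0.5, a jump from 0 dB to 8 dB is applied as 4, 6, 7, ... dB over successive blocks, so each coefficient update is a small step rather than one large jump. Smaller values of k give slower, smoother transitions at the cost of a longer settling time.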
Platform comparison
| Feature | STM32F4 | ESP32 |
|---|---|---|
| Clock | 180 MHz (NUCLEO-F446RE) | 240 MHz |
| FPU | Yes (Cortex-M4F) | Yes (Xtensa LX7) |
| CMSIS-DSP biquad | Yes (arm_biquad_cascade_df1_f32, Q15, Q31) | No |
| I2S | Via external codec | Built-in |
| Memory (SRAM) | 128 KB (STM32F446RE) | 512 KB |
| Audio latency (typical) | < 1 ms (bare-metal, sample-by-sample) | 4–8 ms (FreeRTOS, DMA blocks) |
| Wi-Fi/BT | No (needs external module) | Built-in |
| Real-time determinism | Excellent (bare-metal or RTOS) | Good (with core pinning) |
| Unit cost | ~EUR 10 | ~EUR 5 |
| Best for | Deterministic DSP, production audio | Prototyping, IoT audio |
Recommendation
- STM32F4 for production audio DSP: deterministic timing, CMSIS-DSP library with optimised fixed-point variants, and established use in professional audio equipment. Choose this when latency, reproducibility, or fixed-point performance matter.
- ESP32 for prototyping and connected devices: cheaper, built-in I2S and wireless, easier to get audio flowing with minimal external hardware. Choose this for EQ demos, IoT sensor filtering, or wireless audio experiments.
FIR comparison
For contrast, here is a general-purpose FIR filter using a circular buffer, the standard embedded technique that avoids shifting the entire delay line on every sample:
// FIR filter with circular buffer
// N: number of taps, buf[]: circular sample buffer
// coeffs[]: filter coefficients, ind: current buffer index
void fir_filter(float x, float *y, float *buf, const float *coeffs,
int N, int *ind)
{
buf[*ind] = x;
if (*ind < N - 1) (*ind)++; else *ind = 0;
float acc = 0.0f;
int k = *ind;
for (int i = 0; i < N; i++) {
acc += buf[k] * coeffs[N - i - 1];
if (k < N - 1) k++; else k = 0;
}
*y = acc;
}
The circular buffer wraps the index instead of shifting data: each sample requires \(N\) multiplies but zero memory moves. The memmove-based approach (shift the buffer on every sample) is simpler but costs \(O(N)\) memory operations on top of the \(N\) multiplies.
When to use FIR vs biquad
| Criterion | FIR | Biquad (IIR) |
|---|---|---|
| Phase | Linear (symmetric coefficients) | Nonlinear |
| Memory | \(N\) coefficients + \(N\) samples | 5 coefficients + 2 state variables per section |
| Computation | \(N\) MACs per sample | ~5 MACs per section per sample |
| Stability | Always stable | Can be unstable if poles outside unit circle |
| Sharp cutoff | Requires many taps (100+) | A few sections suffice |
| Typical use | Anti-aliasing, matched filtering, linear-phase EQ | Audio EQ, control loops, sensor filtering |
Rule of thumb for microcontrollers: if you need linear phase or very specific impulse response shapes (e.g., matched filtering), use FIR. If you need a sharp frequency-selective filter with minimal memory and computation (e.g., audio crossover, DC removal, bandpass for sensor data), use cascaded biquads. On a Cortex-M4F at 180 MHz (NUCLEO-F446RE) with 8 kHz sample rate, you can run a 256-tap FIR or a 12th-order IIR (6 biquad sections), but the biquad cascade uses ~48 bytes of state (6 sections × 2 floats) versus ~1 KB for the FIR’s 256-sample delay line (2 KB including its coefficients).