Matched Filtering on Hardware
Chirp generation and echo detection on STM32F4 and ESP32
Matched filtering is computationally a cross-correlation: an FIR filter with the time-reversed template as coefficients. For short templates (< 256 taps), direct convolution on a microcontroller is practical. For longer templates, frequency-domain processing via FFT is more efficient. For the theory, chirp design, and Python prototypes, see the main matched filtering page.
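To make the equivalence concrete, the sketch below computes the same quantity two ways: as a sliding cross-correlation and as an FIR filter whose coefficients are the template reversed. The function names (`xcorr_at`, `fir_at`) are illustrative assumptions and are not part of CMSIS-DSP or ESP-DSP; samples outside the buffer are treated as zero.

```c
#include <stddef.h>

/* Cross-correlation at a given lag: sum_k tmpl[k] * x[lag + k].
 * Samples outside x[0..n-1] are treated as zero. */
float xcorr_at(const float *x, size_t n, const float *tmpl, size_t m, long lag)
{
    float sum = 0.0f;
    for (size_t k = 0; k < m; k++) {
        long idx = lag + (long)k;
        if (idx >= 0 && (size_t)idx < n)
            sum += tmpl[k] * x[idx];
    }
    return sum;
}

/* FIR output y[i] = sum_k h[k] * x[i - k], with h[k] = tmpl[m - 1 - k],
 * i.e. the time-reversed template used as filter coefficients. */
float fir_at(const float *x, size_t n, const float *tmpl, size_t m, long i)
{
    float sum = 0.0f;
    for (size_t k = 0; k < m; k++) {
        long idx = i - (long)k;
        if (idx >= 0 && (size_t)idx < n)
            sum += tmpl[m - 1 - k] * x[idx];
    }
    return sum;
}
```

The two agree up to an index shift: `fir_at(..., i)` equals `xcorr_at(..., i - (m - 1))`. The FIR form therefore peaks at the last sample of the embedded template, `m - 1` samples after its onset, which matters when a peak index is converted to a delay.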
STM32F4: correlation via CMSIS-DSP
The CMSIS-DSP library provides arm_correlate_f32 for direct correlation and arm_rfft_fast_f32 for FFT-based processing.
Direct correlation
#include "arm_math.h"
#define TEMPLATE_LEN 128
#define SIGNAL_LEN 2048
// arm_correlate_f32 always writes 2 * max(srcALen, srcBLen) - 1 samples
#define OUTPUT_LEN (2 * SIGNAL_LEN - 1)
static float32_t template[TEMPLATE_LEN];
static float32_t received[SIGNAL_LEN];
static float32_t output[OUTPUT_LEN];
void run_matched_filter(void) {
    arm_correlate_f32(received, SIGNAL_LEN, template, TEMPLATE_LEN,
                      output);
    // Find peak. Zero lag sits at index SIGNAL_LEN - 1, so the echo
    // delay in samples is peak_idx - (SIGNAL_LEN - 1).
    float32_t peak_val;
    uint32_t peak_idx;
    arm_max_f32(output, OUTPUT_LEN, &peak_val, &peak_idx);
}
FFT-based correlation (for longer templates)
#include <string.h> // memset, memcpy
#include "arm_math.h"
#include "arm_const_structs.h"
#define NFFT 4096
static float32_t fft_template[NFFT]; // packed real-FFT output
static float32_t fft_template_conj[NFFT]; // conjugated template spectrum
static float32_t fft_signal[NFFT];
static float32_t fft_output[NFFT];
static arm_rfft_fast_instance_f32 rfft;
void init_fft_matched_filter(float32_t *tmpl, uint32_t len) {
    arm_rfft_fast_init_f32(&rfft, NFFT);
    // Zero-pad and FFT the template (done once)
    static float32_t pad_buf[NFFT]; // static: 16 KB, too large for stack
    memset(pad_buf, 0, sizeof(pad_buf));
    memcpy(pad_buf, tmpl, len * sizeof(float32_t));
    arm_rfft_fast_f32(&rfft, pad_buf, fft_template, 0);
    // arm_rfft_fast_f32 packs its output as:
    // [DC_real, Nyquist_real, Re[1], Im[1], Re[2], Im[2], ...]
    // DC and Nyquist are purely real (imaginary = 0), packed into
    // the first two elements. The remaining bins are interleaved
    // complex pairs from bin 1 to NFFT/2-1.
    // Conjugate the template spectrum for matched filtering.
    // Correlation = IFFT{ conj(H) * X }, so we need conj(H).
    // DC and Nyquist are real-valued, so they stay unchanged.
    fft_template_conj[0] = fft_template[0]; // DC (real)
    fft_template_conj[1] = fft_template[1]; // Nyquist (real)
    // Negate imaginary parts of bins 1..NFFT/2-1
    for (uint32_t i = 2; i < NFFT; i += 2) {
        fft_template_conj[i] = fft_template[i];          // real
        fft_template_conj[i + 1] = -fft_template[i + 1]; // -imag
    }
}
// Note: arm_rfft_fast_f32 modifies the input buffer in place.
// Copy signal to a scratch buffer first if you need the original data.
// Both signal and output must hold NFFT floats. Zero-pad the signal so that
// signal length + template length - 1 <= NFFT; otherwise the circular
// correlation wraps around.
void correlate_fft(float32_t *signal, float32_t *output) {
    arm_rfft_fast_f32(&rfft, signal, fft_signal, 0);
    // Handle DC and Nyquist bins separately: they are packed as two
    // real values in elements [0] and [1], not as a complex pair.
    // For real-valued signals, DC * DC and Nyq * Nyq are just scalar
    // multiplies (imaginary parts are zero).
    fft_output[0] = fft_template_conj[0] * fft_signal[0]; // DC
    fft_output[1] = fft_template_conj[1] * fft_signal[1]; // Nyquist
    // Complex multiply the remaining bins (interleaved complex pairs)
    // conj(template) * signal, starting from bin 1
    arm_cmplx_mult_cmplx_f32(&fft_template_conj[2], &fft_signal[2],
                             &fft_output[2], NFFT / 2 - 1);
    arm_rfft_fast_f32(&rfft, fft_output, output, 1); // inverse FFT
    // output[n] now holds the correlation at lag n (echo delay in samples)
}
arm_rfft_fast_f32 exploits the conjugate symmetry of real-valued signals. It only stores bins 0 through \(N/2\) and packs the output as [DC, Nyquist, Re[1], Im[1], Re[2], Im[2], ...]. The DC and Nyquist bins are both purely real (their imaginary parts are exactly zero), so they share the first two elements. This means you cannot pass the full output directly to arm_cmplx_mult_cmplx_f32: those first two elements would be misinterpreted as a single complex number. The code above handles this by multiplying DC and Nyquist separately as scalars, then calling the complex multiply on the remaining bins.
A second subtlety: cross-correlation requires multiplying by the conjugate of the template spectrum. arm_cmplx_mult_cmplx_f32 performs a straight complex multiply (not conjugate multiply), so the template must be explicitly conjugated first.
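One way to fold both subtleties into a single pass is to conjugate on the fly instead of storing a separate conjugated spectrum. The helper below is a plain-C sketch under that assumption: the name `packed_conj_mult` is hypothetical, it is not a CMSIS-DSP function, and it expects both spectra in the packed layout described above.

```c
#include <stddef.h>

/* Compute conj(h) * x for two spectra in the packed real-FFT layout
 * [DC, Nyquist, Re[1], Im[1], ..., Re[N/2-1], Im[N/2-1]].
 * nfft is the FFT length, which is also the number of floats per array. */
void packed_conj_mult(const float *h, const float *x, float *out, size_t nfft)
{
    out[0] = h[0] * x[0];   /* DC: real times real */
    out[1] = h[1] * x[1];   /* Nyquist: real times real */

    for (size_t i = 2; i + 1 < nfft; i += 2) {
        float hr = h[i], hi = h[i + 1];
        float xr = x[i], xi = x[i + 1];
        /* (hr - j*hi) * (xr + j*xi) */
        out[i]     = hr * xr + hi * xi;   /* real part */
        out[i + 1] = hr * xi - hi * xr;   /* imaginary part */
    }
}
```

With a 4096-point FFT this saves one 16 KB spectrum buffer, at the cost of not using the optimised `arm_cmplx_mult_cmplx_f32` kernel. Whether that trade is worthwhile depends on how tight RAM is on the target.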
Performance
| Method | Template length | Cycles/sample | Time per sample (180 MHz) |
|---|---|---|---|
| Direct correlation | 128 | ~130 | 0.7 µs |
| Direct correlation | 512 | ~520 | 2.9 µs |
| FFT (4096-point) | any ≤ 4096 | ~50 (amortised) | 0.3 µs |
The crossover where FFT becomes faster is around template length 50 on Cortex-M4 (the ~50-cycle amortised FFT cost equals direct correlation at roughly 50 taps). Times assume NUCLEO-F446RE at 180 MHz.
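The crossover and the timing column can be reproduced from the cycle counts in the table. The working below assumes that direct correlation scales linearly with template length \(L\), which the two direct rows support at about 1.02 cycles per tap; \(L_\text{cross}\) and \(t_\text{sample}\) are names introduced here for illustration.

```latex
\[
\frac{130}{128} \approx \frac{520}{512} \approx 1.02\ \text{cycles/tap}
\quad\Rightarrow\quad
1.02\,L_\text{cross} \approx 50
\quad\Rightarrow\quad
L_\text{cross} \approx 49
\]
\[
t_\text{sample} = \frac{\text{cycles}}{f_\text{clk}}:\qquad
\frac{130}{180\ \text{MHz}} \approx 0.72\ \mu\text{s},\qquad
\frac{520}{180\ \text{MHz}} \approx 2.89\ \mu\text{s},\qquad
\frac{50}{180\ \text{MHz}} \approx 0.28\ \mu\text{s}
\]
```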
ESP32: ultrasonic ranging demo
A practical matched filtering demo on ESP32-S3: transmit an ultrasonic chirp, record the echo, matched-filter it, and measure distance. This is a simplified bat sonar.
Hardware
| Component | Purpose | Notes |
|---|---|---|
| ESP32-S3-DevKitC | Processing | 240 MHz, FPU |
| HC-SR04 or piezo transducer pair | Ultrasonic TX/RX | 40 kHz center frequency |
| Op-amp + envelope detector | Signal conditioning | For piezo-based setup |
| OLED display (SSD1306) | Show distance | Optional, I2C |
The HC-SR04 module is the simplest option. It has built-in TX and RX transducers and a trigger/echo interface. However, it uses a simple threshold detector internally. For a true matched filter demo, use separate piezo transducers so you can access the raw received waveform.
Approach
- Generate chirp: compute a chirp waveform in memory (e.g., 38–42 kHz over 1 ms)
- Transmit: output the chirp via DAC or PWM to the TX transducer
- Record: sample the RX transducer via ADC (needs ~100 kHz sample rate)
- Matched filter: correlate received signal with stored chirp template
- Detect peak: find the correlation peak and convert to distance
#include <math.h>
#include <string.h>
#define FS 100000 // 100 kHz sample rate
#define CHIRP_LEN 100 // 1 ms chirp at 100 kHz
#define RX_LEN 1000 // 10 ms recording (max ~1.7 m range)
static float chirp_template[CHIRP_LEN];
static float rx_buffer[RX_LEN];
static float mf_output[RX_LEN];
// duration must match the template length: CHIRP_LEN / FS seconds (1 ms here)
void generate_chirp(float f0, float f1, float duration) {
    for (int i = 0; i < CHIRP_LEN; i++) {
        float t = (float)i / FS;
        // Phase is integral of instantaneous frequency: 2*pi*(f0*t + 0.5*(f1-f0)*t^2/T)
        chirp_template[i] = sinf(2.0f * (float)M_PI * (f0 * t + 0.5f * (f1 - f0) * t * t / duration));
    }
}
void matched_filter_direct(void) {
    for (int n = 0; n < RX_LEN; n++) {
        float sum = 0;
        for (int k = 0; k < CHIRP_LEN && (n - k) >= 0; k++) {
            // Cross-correlation: access template in reversed order
            // (convolution uses chirp_template[k], but matched filtering
            // requires correlation, which time-reverses the template)
            sum += chirp_template[CHIRP_LEN - 1 - k] * rx_buffer[n - k];
        }
        mf_output[n] = sum;
    }
}
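float estimate_distance(void) {
    // Find peak in matched filter output. The FIR output peaks at the last
    // sample of an echo, CHIRP_LEN - 1 samples after its onset, so start
    // past the transmit interval to skip the direct path.
    float max_val = 0;
    int max_idx = CHIRP_LEN - 1; // no echo found -> 0 m
    for (int i = 2 * CHIRP_LEN - 1; i < RX_LEN; i++) {
        if (fabsf(mf_output[i]) > max_val) {
            max_val = fabsf(mf_output[i]);
            max_idx = i;
        }
    }
    // Remove the CHIRP_LEN - 1 sample filter delay before converting to time
    float delay_s = (float)(max_idx - (CHIRP_LEN - 1)) / FS;
    return 343.0f * delay_s / 2.0f; // distance in metres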
}
Performance on ESP32
A 100-tap matched filter on 1000 samples requires 100,000 multiply-accumulates. At 240 MHz on the ESP32-S3 this takes ~0.5 ms, well within real-time constraints for a 10 ms measurement cycle.
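The operation count and the timing figure can be cross-checked with a few lines of arithmetic. The helpers below are illustrative (both names are hypothetical), and the cycles-per-MAC value is an input that has to be measured on the target rather than a fixed property of the chip.

```c
#include <stddef.h>

/* Multiply-accumulates executed by a direct FIR loop that skips taps
 * reaching before the start of the buffer:
 * output n uses min(n + 1, taps) products. */
unsigned long count_macs(size_t rx_len, size_t taps)
{
    unsigned long total = 0;
    for (size_t n = 0; n < rx_len; n++)
        total += (n + 1 < taps) ? (n + 1) : taps;
    return total;
}

/* Execution time in microseconds for a given MAC count, assuming a
 * sustained cost of cycles_per_mac on a core clocked at clock_hz. */
double macs_to_us(unsigned long macs, double cycles_per_mac, double clock_hz)
{
    return (double)macs * cycles_per_mac / clock_hz * 1e6;
}
```

`count_macs(1000, 100)` gives 95,050, slightly under the 100,000 upper bound because the first 99 outputs use fewer taps. The quoted ~0.5 ms corresponds to roughly 1.2 cycles per MAC (`macs_to_us(100000, 1.2, 240e6)` is 500 µs); an unoptimised C loop may well be slower, so treat the figure as a target to verify.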
For longer chirps or higher sample rates, use the ESP-DSP library which provides optimised FFT functions:
#include "dsps_fft2r.h"
// ESP-DSP provides dsps_fft2r_fc32() for radix-2 FFT
// On ESP32-S3: 1024-point f32 FFT ~98K cycles (~408 us at 240 MHz)
Platform comparison for matched filtering
| Feature | STM32F4 (NUCLEO-F446RE) | ESP32-S3 |
|---|---|---|
| CMSIS-DSP correlation | Yes | No (use ESP-DSP) |
| FFT library | arm_rfft_fast_f32 | dsps_fft2r_fc32 |
| Max practical template | ~1024 (direct), ~4096 (FFT) | ~512 (direct), ~2048 (FFT) |
| ADC sample rate | Up to 2.4 Msps | Up to 2 Msps (noisy above 100 ksps) |
| DAC for chirp TX | 12-bit, 1 Msps | No on-chip DAC on the ESP32-S3 (use PWM or an external DAC) |
| Best for | High-precision ranging, radar | Simple ultrasonic demos, IoT integration |
Recommendation
- ESP32-S3 for a quick ultrasonic ranging demo: cheap, self-contained, with Wi-Fi for remote display
- STM32F4 (NUCLEO-F446RE) for serious pulse compression work: better ADC/DAC, deterministic timing, and the CMSIS-DSP library