Pitch Detection on Hardware
Real-time fundamental frequency estimation on ESP32-S3 and STM32F4
A real-time pitch detector must estimate \(f_0\) from a short audio frame, decide whether speech is present, and deliver the result, all within the time it takes for the next frame to arrive. At 16 kHz with 32 ms (512-sample) frames, that budget is 5.76 million cycles on a 180 MHz STM32F4 and 7.68 million on a 240 MHz ESP32-S3. Both the autocorrelation and periodogram methods fit comfortably, using under 2% of the budget in the estimates below.
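The budget is simply the frame duration multiplied by the core clock. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of either SDK.

```c
#include <stdint.h>

/* Cycles available between two consecutive frames:
 * frame duration (frame_size / fs_hz seconds) times the core clock. */
static uint64_t cycles_per_frame(uint32_t f_cpu_hz, uint32_t frame_size,
                                 uint32_t fs_hz)
{
    return (uint64_t)f_cpu_hz * frame_size / fs_hz;
}

/* Share of the per-frame budget consumed by a workload, in percent. */
static double utilisation_percent(uint64_t used_cycles, uint64_t budget_cycles)
{
    return 100.0 * (double)used_cycles / (double)budget_cycles;
}
```

With 512-sample frames at 16 kHz, `cycles_per_frame(180000000, 512, 16000)` gives 5,760,000 and `cycles_per_frame(240000000, 512, 16000)` gives 7,680,000.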
For the theory, comparison of methods, and Python prototypes, see the main pitch detection page.
STM32F4 (NUCLEO-F446RE): autocorrelation with CMSIS-DSP
The Cortex-M4F at 180 MHz with hardware FPU and CMSIS-DSP makes the autocorrelation method practical even for longer frames. The key CMSIS-DSP function is arm_correlate_f32, which computes the full cross-correlation — when applied to a signal with itself, it gives the autocorrelation.
Autocorrelation-based pitch detection
#include "arm_math.h"
#define FRAME_SIZE 512 // 32 ms at 16 kHz
#define FS 16000
#define F0_MIN 80 // Hz
#define F0_MAX 500 // Hz
#define LAG_MIN (FS / F0_MAX) // 32 samples
#define LAG_MAX (FS / F0_MIN) // 200 samples
static float32_t frame[FRAME_SIZE];
static float32_t acf[2 * FRAME_SIZE - 1];
float estimate_pitch_acf(float32_t *frame) {
// Compute autocorrelation via CMSIS-DSP
arm_correlate_f32(frame, FRAME_SIZE, frame, FRAME_SIZE, acf);
// ACF is symmetric; centre peak is at index FRAME_SIZE - 1
float32_t *acf_pos = &acf[FRAME_SIZE - 1]; // positive lags only
float32_t acf_zero = acf_pos[0]; // normalisation factor
if (acf_zero < 1e-6f) return 0.0f; // silence
// Find highest peak in the lag range [LAG_MIN, LAG_MAX]
float32_t max_val = 0;
uint32_t best_lag = 0;
for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
float32_t val = acf_pos[lag] / acf_zero; // normalised
if (val > max_val) {
max_val = val;
best_lag = lag;
}
}
if (max_val < 0.3f || best_lag == 0) return 0.0f; // unvoiced or no valid lag
return (float)FS / best_lag;
}
arm_correlate_f32 computes the full \((2N-1)\)-point correlation. For pitch detection you only need lags 32–200, so a direct loop over that range is actually faster than the full CMSIS-DSP call. Use the direct approach for production; the CMSIS-DSP call is shown here for clarity.
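To put a rough number on that saving, the sketch below counts multiply-accumulates for both approaches. It assumes a direct time-domain evaluation with no FFT shortcut, and the helper names are hypothetical.

```c
#include <stdint.h>

/* MACs for a full direct-form autocorrelation of an n-sample frame:
 * lag l overlaps n - |l| samples; summed over l = -(n-1)..(n-1) this is n*n. */
static uint32_t full_acf_macs(uint32_t n)
{
    return n * n;
}

/* MACs when only lags lag_min..lag_max (inclusive) are evaluated. */
static uint32_t restricted_acf_macs(uint32_t n, uint32_t lag_min,
                                    uint32_t lag_max)
{
    uint32_t total = 0;
    for (uint32_t lag = lag_min; lag <= lag_max; lag++)
        total += n - lag;   /* overlap between the frame and its shifted copy */
    return total;
}
```

For a 512-sample frame, the full correlation needs 262,144 MACs, while lags 32–200 need 66,924, roughly a fourfold reduction before any library-level optimisation is considered.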
Restricted-lag autocorrelation (faster)
For real-time use, computing only the lags you need is significantly cheaper:
float estimate_pitch_acf_fast(float32_t *frame) {
float32_t acf_zero = 0;
arm_dot_prod_f32(frame, frame, FRAME_SIZE, &acf_zero);
if (acf_zero < 1e-6f) return 0.0f;
float32_t max_val = 0;
uint32_t best_lag = 0;
for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
float32_t dot;
arm_dot_prod_f32(frame, &frame[lag], FRAME_SIZE - lag, &dot);
float32_t val = dot / acf_zero;
if (val > max_val) {
max_val = val;
best_lag = lag;
}
}
if (max_val < 0.3f || best_lag == 0) return 0.0f;
return (float)FS / best_lag;
}
Each arm_dot_prod_f32 call is optimised by CMSIS-DSP using the Cortex-M4’s MAC instructions. For 169 lags (80–500 Hz range, inclusive) and a 512-sample frame, this requires at most \(169 \times 512 \approx 87\text{K}\) multiply-accumulates (fewer in practice, because each dot product covers only FRAME_SIZE - lag samples). Budgeting about 100K cycles on a Cortex-M4F keeps the whole search well under 1 ms at 180 MHz.
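Before flashing, the restricted-lag logic can be sanity-checked on a PC. The sketch below is a plain-C stand-in, not the CMSIS-DSP build: `dot_f32`, `host_estimate_pitch`, `fill_square` and the `HOST_` constants are hypothetical names, and the test signal is a square wave so that no math library is needed.

```c
#include <stddef.h>

#define HOST_FS         16000
#define HOST_FRAME_SIZE 512
#define HOST_LAG_MIN    (HOST_FS / 500)  /* 32 samples  -> 500 Hz */
#define HOST_LAG_MAX    (HOST_FS / 80)   /* 200 samples -> 80 Hz  */

/* Plain-C replacement for arm_dot_prod_f32 so the logic runs off-target. */
static float dot_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Restricted-lag search with the same thresholds as the embedded version:
 * returns 0 for silence or when the normalised peak stays below 0.3. */
static float host_estimate_pitch(const float *x)
{
    float r0 = dot_f32(x, x, HOST_FRAME_SIZE);
    if (r0 < 1e-6f) return 0.0f;

    float best = 0.0f;
    unsigned best_lag = 0;
    for (unsigned lag = HOST_LAG_MIN; lag <= HOST_LAG_MAX; lag++) {
        float v = dot_f32(x, x + lag, HOST_FRAME_SIZE - lag) / r0;
        if (v > best) { best = v; best_lag = lag; }
    }
    if (best < 0.3f || best_lag == 0) return 0.0f;
    return (float)HOST_FS / (float)best_lag;
}

/* Fill a frame with a square wave of the given period (in samples). */
static void fill_square(float *x, int period, float amplitude)
{
    for (int n = 0; n < HOST_FRAME_SIZE; n++)
        x[n] = ((n % period) < period / 2) ? amplitude : -amplitude;
}
```

A square wave with a 40-sample period should come back as 400 Hz, one with a 160-sample period as 100 Hz, and an all-zero frame as 0 (no pitch).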
I2S audio input
The NUCLEO-F446RE connects to an INMP441 MEMS microphone via I2S. Configure SAI (Serial Audio Interface) or I2S with DMA for continuous acquisition:
// I2S DMA double-buffer: ping-pong between two frame buffers
static volatile int32_t dma_buf[2][FRAME_SIZE]; // volatile: written by DMA
static volatile uint8_t buf_ready = 0;
// DMA half-transfer and transfer-complete callbacks
void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s) {
buf_ready = 1;
}
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s) {
buf_ready = 2;
}
void process_audio(void) {
HAL_I2S_Receive_DMA(&hi2s2, (uint16_t *)dma_buf, FRAME_SIZE * 2);
while (1) {
// Read and clear atomically to avoid ISR race
__disable_irq();
uint8_t b = buf_ready;
buf_ready = 0;
__enable_irq();
if (b) {
volatile int32_t *buf = dma_buf[b - 1];
// INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame
for (int i = 0; i < FRAME_SIZE; i++)
frame[i] = (float)(buf[i] >> 8) / 8388608.0f;
float f0 = estimate_pitch_acf_fast(frame);
if (f0 > 0) {
// Output via UART, display, or DAC
}
}
}
}
Performance budget (NUCLEO-F446RE, 180 MHz)
| Stage | Operation | Est. cycles | Time |
|---|---|---|---|
| I2S → float conversion | 512 multiplies | ~1K | 5.6 us |
| Restricted ACF (169 lags) | 169 dot products | ~100K | 556 us |
| Peak search | 169 comparisons | ~500 | 2.8 us |
| Total per 32 ms frame | | ~102K | ~564 us |
| Available per frame | | 5,760K | 32 ms |
| Utilisation | | | ~1.8% |
ESP32-S3: FFT-based pitch detection with BLE output
The ESP32-S3 is well-suited for a wireless pitch detection system: built-in I2S for the microphone, ESP-DSP for FFT, and BLE for transmitting the result to a phone or display.
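Architecture
The signal path, as implemented in the sections below: INMP441 microphone (I2S) → biquad bandpass → Hann window and FFT peak search → VAD gate → BLE output.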
Biquad bandpass pre-filter
A 4th-order Butterworth bandpass (80–500 Hz) removes DC, low-frequency rumble, and high harmonics:
typedef struct {
float b0, b1, b2, a1, a2;
float x1, x2, y1, y2;
} BiquadState;
static BiquadState sos[2]; // 2 sections = 4th-order
float biquad_process(BiquadState *s, float x) {
float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
- s->a1 * s->y1 - s->a2 * s->y2;
s->x2 = s->x1; s->x1 = x;
s->y2 = s->y1; s->y1 = y;
return y;
}
float bandpass_filter(float sample) {
float out = sample;
for (int i = 0; i < 2; i++)
out = biquad_process(&sos[i], out);
return out;
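}
This uses Direct Form I with un-negated a1, a2 (SciPy convention), so coefficients can be pasted directly from scipy.signal.butter(..., output='sos'). The sos array above is zero-initialised, so it must be filled with the designed coefficients before the first call; until then the filter outputs zeros. See ADR-005 for the project’s coefficient convention.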
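One way to load the coefficients is to copy each SciPy SOS row, which is laid out as [b0, b1, b2, a0, a1, a2], into a section. The sketch below repeats the struct and filter step so that it stands alone; `biquad_load_sos_row` is a hypothetical helper, and no real filter coefficients are shown because they depend on your design call.

```c
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients with a0 normalised to 1 */
    float x1, x2, y1, y2;       /* Direct Form I delay line */
} BiquadState;

/* Copy one SciPy SOS row [b0, b1, b2, a0, a1, a2] into a section.
 * a1 and a2 keep SciPy's sign, so the difference equation subtracts them. */
static void biquad_load_sos_row(BiquadState *s, const float row[6])
{
    float a0 = row[3];          /* SciPy emits a0 = 1; divide anyway */
    s->b0 = row[0] / a0;
    s->b1 = row[1] / a0;
    s->b2 = row[2] / a0;
    s->a1 = row[4] / a0;
    s->a2 = row[5] / a0;
    s->x1 = s->x2 = s->y1 = s->y2 = 0.0f;   /* clear filter history */
}

/* Direct Form I step, same convention as the filter above. */
static float biquad_process(BiquadState *s, float x)
{
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
            - s->a1 * s->y1 - s->a2 * s->y2;
    s->x2 = s->x1; s->x1 = x;
    s->y2 = s->y1; s->y1 = y;
    return y;
}
```

A quick sign-convention check: the row {1, 0, 0, 1, -0.5, 0} describes y[n] = x[n] + 0.5 y[n-1], so a unit impulse should produce 1, 0.5, 0.25. If the output alternates in sign instead, the a coefficients were negated somewhere.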
FFT periodogram pitch estimation
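As background for this section, here is a sketch of how a frequency band maps onto FFT bins, assuming a 512-point FFT at 16 kHz. The helper names are illustrative only.

```c
/* Width of one FFT bin in Hz. */
static float bin_width_hz(int fs, int nfft)
{
    return (float)fs / (float)nfft;
}

/* First bin whose centre frequency is at or above f_lo_hz (rounds up). */
static int first_bin_at_or_above(int f_lo_hz, int fs, int nfft)
{
    return (f_lo_hz * nfft + fs - 1) / fs;
}

/* Last bin whose centre frequency is at or below f_hi_hz (rounds down). */
static int last_bin_at_or_below(int f_hi_hz, int fs, int nfft)
{
    return (f_hi_hz * nfft) / fs;
}
```

With these settings each bin is 31.25 Hz wide, and the 80–500 Hz band covers bins 3 (93.75 Hz) to 16 (500 Hz), 14 bins in total. A pitch reported as a bare bin index times the bin width is therefore quantised to multiples of 31.25 Hz, which is worth keeping in mind when comparing against the autocorrelation method.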
ESP-DSP provides dsps_fft2r_fc32 for radix-2 FFT:
#include "dsps_fft2r.h"
#include "dsps_wind_hann.h"
#define NFFT 512
static float fft_buf[NFFT * 2]; // interleaved complex
static float window[NFFT];
void pitch_init(void) {
dsps_fft2r_init_fc32(NULL, NFFT);
dsps_wind_hann_f32(window, NFFT);
}
float estimate_pitch_fft(float *frame) {
// Apply Hann window and pack as complex (imaginary = 0)
for (int i = 0; i < NFFT; i++) {
fft_buf[2*i] = frame[i] * window[i];
fft_buf[2*i + 1] = 0.0f;
}
dsps_fft2r_fc32(fft_buf, NFFT);
dsps_bit_rev_fc32(fft_buf, NFFT);
// Find peak in 80-500 Hz band
int k_min = (80 * NFFT + FS - 1) / FS; // round up: first bin at or above 80 Hz (bin 3)
int k_max = (500 * NFFT) / FS; // round down: last bin at or below 500 Hz (bin 16)
float max_power = 0;
int peak_bin = k_min;
for (int k = k_min; k <= k_max; k++) {
float re = fft_buf[2*k];
float im = fft_buf[2*k + 1];
float power = re*re + im*im;
if (power > max_power) {
max_power = power;
peak_bin = k;
}
}
// Reject frames whose peak power is negligible (absolute floor, not a median test)
if (max_power < 1e-6f) return 0.0f;
return (float)peak_bin * FS / NFFT;
}
Voice activity detection
The VAD gate prevents pitch output during silence and unvoiced consonants:
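#include <stdbool.h>
typedef struct {
float noise_floor; // running estimate of background power
float alpha; // smoothing factor for noise-floor tracking (0..1, closer to 1 = slower)
} VadState;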
bool is_voiced(VadState *vad, float power, float centroid, float zcr) {
// Adaptive noise floor tracking
if (power < vad->noise_floor * 1.5f)
vad->noise_floor = vad->alpha * vad->noise_floor
+ (1.0f - vad->alpha) * power;
return (power > vad->noise_floor * 3.0f)
&& (centroid > 80.0f)
&& (centroid < 2000.0f)
&& (zcr < 0.15f);
}
I2S microphone setup
#include "driver/i2s_std.h"
static i2s_chan_handle_t rx_chan;
void i2s_mic_init(void) {
i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
I2S_NUM_0, I2S_ROLE_MASTER);
i2s_new_channel(&chan_cfg, NULL, &rx_chan);
i2s_std_config_t std_cfg = {
.clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(16000),
.slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_MONO),
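.gpio_cfg = {
.mclk = I2S_GPIO_UNUSED, // INMP441 needs no master clock; left unset this field would be 0 (GPIO 0)
.bclk = GPIO_NUM_4, // example pins (assumed): the ESP32-S3 has no GPIO 22-25
.ws = GPIO_NUM_5, // verify against your DevKitC variant pinout
.din = GPIO_NUM_6,
.dout = I2S_GPIO_UNUSED,
},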
};
i2s_channel_init_std_mode(rx_chan, &std_cfg);
i2s_channel_enable(rx_chan);
}
This uses the ESP-IDF v5.x I2S driver API (i2s_std), which replaces the legacy driver/i2s.h API used in the older adaptive filtering and biquad embedded pages.
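Each 32-bit slot read from this channel carries the INMP441's 24-bit sample in bits 31:8. The conversion to a normalised float can be isolated and checked on a host; `inmp441_to_float` is a hypothetical helper name.

```c
#include <stdint.h>

/* Convert one raw 32-bit I2S slot to a float in [-1, 1).
 * Assumes the 24-bit sample is left-justified in bits 31:8 and that >> on a
 * negative int32_t is an arithmetic shift (true for GCC and Clang, but
 * implementation-defined in ISO C). */
static float inmp441_to_float(int32_t raw)
{
    return (float)(raw >> 8) / 8388608.0f;   /* 8388608 = 2^23 */
}
```

Half of positive full scale (0x40000000) maps to 0.5, and the most negative slot value maps to exactly -1.0.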
Spectral centroid
The spectral centroid is the power-weighted mean frequency — a key VAD feature:
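In symbols, for an \(N\)-point FFT with bins \(X[k]\) and sample rate \(f_s\), and skipping the DC bin (an assumption that matches a loop starting at \(k = 1\)):

\[
C = \frac{\sum_{k=1}^{N/2-1} f_k \, |X[k]|^2}{\sum_{k=1}^{N/2-1} |X[k]|^2},
\qquad f_k = \frac{k \, f_s}{N}
\]

If the total power in the denominator is negligible, the centroid is undefined and is best reported as 0.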
float compute_spectral_centroid(float *fft_buf, int nfft) {
float weighted_sum = 0, power_sum = 0;
for (int k = 1; k < nfft / 2; k++) {
float re = fft_buf[2*k];
float im = fft_buf[2*k + 1];
float power = re*re + im*im;
float freq = (float)k * FS / nfft;
weighted_sum += freq * power;
power_sum += power;
}
return (power_sum > 1e-10f) ? weighted_sum / power_sum : 0.0f;
}
Main processing task
#define FRAME_SIZE 512
#define FS 16000
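// VAD state used by is_voiced(). Starting values are placeholders (assumed): noise_floor
// must be non-zero or it never adapts; ideally seed it from a measured silent frame.
static VadState vad = { .noise_floor = 1e-6f, .alpha = 0.95f };
void pitch_task(void *param) {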
int32_t raw_samples[FRAME_SIZE];
float frame[FRAME_SIZE];
size_t bytes_read;
pitch_init();
while (true) {
i2s_channel_read(rx_chan, raw_samples, sizeof(raw_samples),
&bytes_read, portMAX_DELAY);
int n = bytes_read / sizeof(int32_t);
if (n > FRAME_SIZE) n = FRAME_SIZE; // clamp to buffer size
if (n < FRAME_SIZE) continue; // short read: skip, the FFT expects a full frame
// INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame
float power = 0;
float zcr_count = 0;
float prev = 0;
for (int i = 0; i < n; i++) {
float sample = (float)(raw_samples[i] >> 8) / 8388608.0f;
frame[i] = bandpass_filter(sample);
power += frame[i] * frame[i];
if (i > 0 && ((prev < 0 && frame[i] >= 0) ||
(prev >= 0 && frame[i] < 0)))
zcr_count++;
prev = frame[i];
}
power /= n;
float zcr = zcr_count / (2.0f * n);
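// Pitch estimate; the FFT leaves the spectrum in fft_buf
float f0 = estimate_pitch_fft(frame);
// Spectral centroid reuses that spectrum, so it must run after estimate_pitch_fft
float centroid = compute_spectral_centroid(fft_buf, NFFT);
// VAD gate: report pitch only for voiced frames
if (is_voiced(&vad, power, centroid, zcr) && f0 > 0) {
// Send via BLE or display (update_ble_characteristic is application code, not shown)
update_ble_characteristic(f0);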
}
}
}
void app_main(void) {
i2s_mic_init();
ble_init();
xTaskCreatePinnedToCore(pitch_task, "pitch", 8192, NULL, 5, NULL, 1);
}
Performance budget (ESP32-S3, 240 MHz)
| Stage | Operation | Est. cycles | Time |
|---|---|---|---|
| I2S read + conversion | 512 samples | ~2K | 8 us |
| Biquad bandpass (2 SOS) | 512 x 10 MACs | ~10K | 42 us |
| Hann window | 512 multiplies | ~1K | 4 us |
| 512-point FFT (ESP-DSP) | radix-2 f32 | ~30K | 125 us |
| Peak search (80-500 Hz) | ~14 bins | ~100 | 0.4 us |
| VAD features | power, centroid, ZCR | ~2K | 8 us |
| Total per 32 ms frame | | ~45K | ~187 us |
| Available per frame | | 7,680K | 32 ms |
| Utilisation | | | ~0.6% |
Platform comparison
| Feature | STM32F4 (NUCLEO-F446RE) | ESP32-S3 |
|---|---|---|
| Clock | 180 MHz (Cortex-M4F) | 240 MHz (Xtensa LX7) |
| Pitch method | Autocorrelation (CMSIS-DSP dot product) | FFT periodogram (ESP-DSP) |
| FFT library | arm_rfft_fast_f32 | dsps_fft2r_fc32 |
| Wireless output | External module (UART → BLE dongle) | Built-in BLE |
| I2S microphone | I2S or SAI peripheral + DMA | Built-in I2S driver |
| Real-time determinism | Excellent (bare-metal or RTOS) | Good (with core pinning) |
| Unit cost | ~EUR 18 (Nucleo) + EUR 2 (mic) | ~EUR 8 (DevKitC) + EUR 2 (mic) |
| Best for | Low-latency pitch tracking, industrial voice analysis | Wireless pitch feedback, speech therapy apps |
Recommendation
- ESP32-S3 for wireless applications: speech therapy tools, musical tuners with phone display, voice coaching apps. The built-in BLE eliminates external modules.
- STM32F4 for deterministic, low-latency applications: real-time voice processing pipelines, studio-grade pitch tracking, integration with existing audio hardware.
Bill of materials
ESP32-S3 wireless pitch detector
| Component | Purpose | Approx. cost |
|---|---|---|
| ESP32-S3-DevKitC | Processing + BLE | EUR 8 |
| INMP441 breakout | I2S MEMS microphone | EUR 2 |
| SSD1306 OLED (128x64) | Local pitch display (optional) | EUR 3 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 16 |
STM32F4 pitch tracker
| Component | Purpose | Approx. cost |
|---|---|---|
| NUCLEO-F446RE | Processing | EUR 18 |
| INMP441 breakout | I2S MEMS microphone | EUR 2 |
| USB-UART adapter | Output to PC (or use Nucleo’s built-in ST-Link UART) | EUR 0 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 23 |