Pitch Detection on Hardware

Real-time fundamental frequency estimation on ESP32-S3 and STM32F4

A real-time pitch detector must estimate \(f_0\) from a short audio frame, decide whether speech is present, and deliver the result, all before the next frame arrives. At a 16 kHz sampling rate with 32 ms (512-sample) frames, that budget is roughly 6 to 8 million cycles: 5.76 million on a 180 MHz STM32F4 and 7.68 million on a 240 MHz ESP32-S3. Both the autocorrelation and periodogram methods fit comfortably.
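The budget figures follow from clock rate times frame duration. A minimal sketch of that arithmetic is shown here; the helper names `cycles_per_frame` and `utilisation` are illustrative and do not come from any library.

```c
#include <stdint.h>

/* Cycles available while one frame of audio is being captured:
 * cpu_hz * (frame_len / fs_hz). */
uint64_t cycles_per_frame(uint32_t cpu_hz, uint32_t frame_len, uint32_t fs_hz)
{
    return (uint64_t)cpu_hz * frame_len / fs_hz;
}

/* Fraction of the per-frame budget consumed by the processing chain. */
double utilisation(uint64_t cycles_used, uint64_t cycles_available)
{
    return (double)cycles_used / (double)cycles_available;
}
```

For example, `cycles_per_frame(180000000u, 512u, 16000u)` gives 5,760,000 and `cycles_per_frame(240000000u, 512u, 16000u)` gives 7,680,000, matching the numbers quoted above.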

For the theory, comparison of methods, and Python prototypes, see the main pitch detection page.


STM32F4 (NUCLEO-F446RE): autocorrelation with CMSIS-DSP

The Cortex-M4F at 180 MHz with hardware FPU and CMSIS-DSP makes the autocorrelation method practical even for longer frames. The key CMSIS-DSP function is arm_correlate_f32, which computes the full cross-correlation — when applied to a signal with itself, it gives the autocorrelation.

Autocorrelation-based pitch detection

#include "arm_math.h"

#define FRAME_SIZE   512    // 32 ms at 16 kHz
#define FS           16000
#define F0_MIN       80     // Hz
#define F0_MAX       500    // Hz
#define LAG_MIN      (FS / F0_MAX)   // 32 samples
#define LAG_MAX      (FS / F0_MIN)   // 200 samples

static float32_t frame[FRAME_SIZE];
static float32_t acf[2 * FRAME_SIZE - 1];

float estimate_pitch_acf(float32_t *frame) {
    // Compute autocorrelation via CMSIS-DSP
    arm_correlate_f32(frame, FRAME_SIZE, frame, FRAME_SIZE, acf);

    // ACF is symmetric; centre peak is at index FRAME_SIZE - 1
    float32_t *acf_pos = &acf[FRAME_SIZE - 1];  // positive lags only
    float32_t acf_zero = acf_pos[0];             // normalisation factor

    if (acf_zero < 1e-6f) return 0.0f;  // silence

    // Find highest peak in the lag range [LAG_MIN, LAG_MAX]
    float32_t max_val = 0;
    uint32_t best_lag = 0;
    for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
        float32_t val = acf_pos[lag] / acf_zero;  // normalised
        if (val > max_val) {
            max_val = val;
            best_lag = lag;
        }
    }

    if (max_val < 0.3f || best_lag == 0) return 0.0f;  // unvoiced or no valid lag

    return (float)FS / best_lag;
}
Tip

arm_correlate_f32 computes the full \((2N-1)\)-point correlation. For pitch detection you only need lags 32–200, so a direct loop over that range is actually faster than the full CMSIS-DSP call. Use the direct approach for production; the CMSIS-DSP call is shown here for clarity.

Restricted-lag autocorrelation (faster)

For real-time use, computing only the lags you need is significantly cheaper:

float estimate_pitch_acf_fast(float32_t *frame) {
    float32_t acf_zero = 0;
    arm_dot_prod_f32(frame, frame, FRAME_SIZE, &acf_zero);
    if (acf_zero < 1e-6f) return 0.0f;

    float32_t max_val = 0;
    uint32_t best_lag = 0;

    for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
        float32_t dot;
        arm_dot_prod_f32(frame, &frame[lag], FRAME_SIZE - lag, &dot);
        float32_t val = dot / acf_zero;
        if (val > max_val) {
            max_val = val;
            best_lag = lag;
        }
    }

    if (max_val < 0.3f || best_lag == 0) return 0.0f;
    return (float)FS / best_lag;
}

Each arm_dot_prod_f32 call is optimised by CMSIS-DSP using the Cortex-M4’s MAC instructions. For the 169 lags from 32 to 200 (the 80–500 Hz range, inclusive) and a 512-sample frame, the upper bound is \(169 \times 512 \approx 87\text{K}\) multiply-accumulates; because each dot product covers only \(512 - \text{lag}\) samples, the exact count is about 67K. Budgeting roughly 100K cycles on a Cortex-M4F keeps the total well under 1 ms at 180 MHz.
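For checking the algorithm on a PC before flashing, the same restricted-lag search can be written in plain C without CMSIS-DSP. The sketch below is an assumption-laden reference, not the production routine: the name `estimate_pitch_acf_ref` and the `REF_` constants are hypothetical, and the thresholds simply mirror the listing above.

```c
#include <stddef.h>

#define REF_FS       16000.0f  /* sampling rate in Hz, as in the listing above */
#define REF_LAG_MIN  32        /* lag for 500 Hz */
#define REF_LAG_MAX  200       /* lag for 80 Hz */

/* Plain-C restricted-lag autocorrelation pitch estimate.
 * Returns f0 in Hz, or 0.0f for silent or unvoiced frames. */
float estimate_pitch_acf_ref(const float *x, size_t n)
{
    if (n <= REF_LAG_MAX) return 0.0f;   /* frame too short for the lag range */

    /* Zero-lag energy, used for normalisation and as a silence gate. */
    float r0 = 0.0f;
    for (size_t i = 0; i < n; i++) r0 += x[i] * x[i];
    if (r0 < 1e-6f) return 0.0f;

    float best = 0.0f;
    size_t best_lag = 0;
    for (size_t lag = REF_LAG_MIN; lag <= REF_LAG_MAX; lag++) {
        /* Dot product of the frame with itself shifted by `lag`. */
        float r = 0.0f;
        for (size_t i = 0; i + lag < n; i++) r += x[i] * x[i + lag];

        float val = r / r0;              /* normalised autocorrelation */
        if (val > best) {
            best = val;
            best_lag = lag;
        }
    }

    if (best < 0.3f || best_lag == 0) return 0.0f;   /* unvoiced */
    return REF_FS / (float)best_lag;
}
```

Feeding it a 512-sample synthetic tone with an 80-sample period should return a value close to 200 Hz, and an all-zero frame should return 0.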

I2S audio input

The NUCLEO-F446RE connects to an INMP441 MEMS microphone via I2S. Configure SAI (Serial Audio Interface) or I2S with DMA for continuous acquisition:

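// I2S handle, defined elsewhere (for example in the STM32CubeMX-generated main.c)
extern I2S_HandleTypeDef hi2s2;

// I2S DMA double-buffer: ping-pong between two frame buffers
static volatile int32_t dma_buf[2][FRAME_SIZE];  // volatile: written by DMA
static volatile uint8_t buf_ready = 0;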

// DMA half-transfer and transfer-complete callbacks
void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s) {
    buf_ready = 1;
}
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s) {
    buf_ready = 2;
}

void process_audio(void) {
    HAL_I2S_Receive_DMA(&hi2s2, (uint16_t *)dma_buf, FRAME_SIZE * 2);

    while (1) {
        // Read and clear atomically to avoid ISR race
        __disable_irq();
        uint8_t b = buf_ready;
        buf_ready = 0;
        __enable_irq();

        if (b) {
            volatile int32_t *buf = dma_buf[b - 1];
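            // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame.
            // Inspect a raw capture first: depending on the I2S configuration the buffer
            // may also hold the unused stereo slot, and on STM32F4 the two 16-bit halves
            // of each 32-bit word can arrive swapped.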
            for (int i = 0; i < FRAME_SIZE; i++)
                frame[i] = (float)(buf[i] >> 8) / 8388608.0f;

            float f0 = estimate_pitch_acf_fast(frame);
            if (f0 > 0) {
                // Output via UART, display, or DAC
            }
        }
    }
}
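The sample conversion in the loop above can be isolated into a small helper, which makes the bit layout easier to test on a PC. This is a minimal sketch; `inmp441_to_float` is a hypothetical name, and it assumes the sample is left-justified in bits 31:8 as described in the comment above.

```c
#include <stdint.h>

/* Convert one raw 32-bit I2S word (24-bit sample in bits 31:8)
 * to a float in the range [-1.0, 1.0). */
float inmp441_to_float(int32_t raw)
{
    /* Right-shifting a negative value is implementation-defined in C;
     * GCC and Clang perform an arithmetic shift, which keeps the sign. */
    int32_t s24 = raw >> 8;
    return (float)s24 / 8388608.0f;   /* divide by 2^23 */
}
```

A raw word of `0x40000000` maps to 0.5, and the low eight padding bits are discarded.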

Performance budget (NUCLEO-F446RE, 180 MHz)

| Stage | Operation | Est. cycles | Time |
|---|---|---|---|
| I2S → float conversion | 512 multiplies | ~1K | 5.6 µs |
| Restricted ACF (169 lags) | 169 dot products | ~100K | 556 µs |
| Peak search | 169 comparisons | ~500 | 2.8 µs |
| Total per 32 ms frame | | ~102K | ~564 µs |
| Available per frame | | 5,760K | 32 ms |
| Utilisation | | | ~1.8% |

ESP32-S3: FFT-based pitch detection with BLE output

The ESP32-S3 is well-suited for a wireless pitch detection system: built-in I2S for the microphone, ESP-DSP for FFT, and BLE for transmitting the result to a phone or display.

Architecture

[Diagram: INMP441 (I2S) → biquad bandpass (80–500 Hz) → three parallel branches: ZCR → rough f₀; FFT → spectral peak f₀; VAD (power + centroid). All three branches feed the BLE characteristic.]
Figure 1: On-device architecture: the I2S mic feeds a biquad bandpass, then parallel time-domain, spectral, and voice-activity branches whose result is published as a BLE characteristic.
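The time-domain branch turns the zero-crossing count into a rough \(f_0\): a periodic signal crosses zero about twice per cycle, so \(f_0 \approx \tfrac{1}{2} \cdot \text{crossings} \cdot f_s / N\). A minimal sketch, assuming the input has already been band-limited by the bandpass stage; the function name `rough_f0_from_zcr` is illustrative.

```c
#include <stddef.h>

/* Rough f0 estimate from zero crossings.
 * x: band-limited frame, n: frame length, fs_hz: sampling rate in Hz. */
float rough_f0_from_zcr(const float *x, size_t n, float fs_hz)
{
    if (n < 2) return 0.0f;

    size_t crossings = 0;
    for (size_t i = 1; i < n; i++) {
        /* A crossing occurs whenever consecutive samples differ in sign. */
        if ((x[i - 1] < 0.0f) != (x[i] < 0.0f)) crossings++;
    }

    /* Two crossings per cycle; n / fs_hz is the frame duration in seconds. */
    return 0.5f * (float)crossings * fs_hz / (float)n;
}
```

The estimate is quantised to steps of \(f_s / (2N)\), about 15.6 Hz for a 512-sample frame at 16 kHz, so it is only suitable as a sanity check against the spectral estimate.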

Biquad bandpass pre-filter

A 4th-order Butterworth bandpass (80–500 Hz) removes DC, low-frequency rumble, and high harmonics:

typedef struct {
    float b0, b1, b2, a1, a2;
    float x1, x2, y1, y2;
} BiquadState;

static BiquadState sos[2];  // 2 sections = 4th-order

float biquad_process(BiquadState *s, float x) {
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
                        - s->a1 * s->y1 - s->a2 * s->y2;
    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}

float bandpass_filter(float sample) {
    float out = sample;
    for (int i = 0; i < 2; i++)
        out = biquad_process(&sos[i], out);
    return out;
}
Note

This uses Direct Form I with un-negated a1, a2 (SciPy convention). Coefficients can be pasted directly from scipy.signal.butter(..., output='sos'). See ADR-005 for the project’s coefficient convention.
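One way to load a SciPy-style row into the filter state is sketched below. It assumes each row is laid out as `[b0, b1, b2, a0, a1, a2]`, which is the order `output='sos'` produces; the helper `biquad_load_sos` is hypothetical and not part of the project code, and the struct is repeated only so that the sketch compiles on its own.

```c
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients, with a0 normalised to 1 */
    float x1, x2, y1, y2;       /* delay-line state */
} BiquadState;

/* Load one second-order section given as [b0, b1, b2, a0, a1, a2].
 * Normalises by a0 and clears the delay line.
 * Returns 0 on success, -1 if a0 is zero. */
int biquad_load_sos(BiquadState *s, const float row[6])
{
    if (row[3] == 0.0f) return -1;

    float inv_a0 = 1.0f / row[3];
    s->b0 = row[0] * inv_a0;
    s->b1 = row[1] * inv_a0;
    s->b2 = row[2] * inv_a0;
    s->a1 = row[4] * inv_a0;   /* kept un-negated, matching biquad_process */
    s->a2 = row[5] * inv_a0;

    s->x1 = s->x2 = s->y1 = s->y2 = 0.0f;
    return 0;
}
```

Calling it once per section at start-up, before the first sample is filtered, also guarantees that the delay line starts from zero.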

FFT periodogram pitch estimation

ESP-DSP provides dsps_fft2r_fc32 for radix-2 FFT:

#include "dsps_fft2r.h"
#include "dsps_wind_hann.h"

#define NFFT 512
#define FS   16000   // sampling rate in Hz (same value as in the main task)

static float fft_buf[NFFT * 2];  // interleaved complex
static float window[NFFT];

void pitch_init(void) {
    dsps_fft2r_init_fc32(NULL, NFFT);
    dsps_wind_hann_f32(window, NFFT);
}

float estimate_pitch_fft(float *frame) {
    // Apply Hann window and pack as complex (imaginary = 0)
    for (int i = 0; i < NFFT; i++) {
        fft_buf[2*i]     = frame[i] * window[i];
        fft_buf[2*i + 1] = 0.0f;
    }

    dsps_fft2r_fc32(fft_buf, NFFT);
    dsps_bit_rev_fc32(fft_buf, NFFT);

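    // Find peak in the 80-500 Hz band (bin spacing FS / NFFT = 31.25 Hz)
    int k_min = (80 * NFFT + FS - 1) / FS;  // round up: bin 3 (93.75 Hz) is the first bin inside the band
    int k_max = (500 * NFFT) / FS;          // bin 16 (500 Hz)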
    float max_power = 0;
    int peak_bin = k_min;

    for (int k = k_min; k <= k_max; k++) {
        float re = fft_buf[2*k];
        float im = fft_buf[2*k + 1];
        float power = re*re + im*im;
        if (power > max_power) {
            max_power = power;
            peak_bin = k;
        }
    }

    // Energy floor: reject frames with negligible power in the band.
    // (A stricter test would compare the peak against the median band power.)
    if (max_power < 1e-6f) return 0.0f;

    return (float)peak_bin * FS / NFFT;
}
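The listing gates only on an absolute power floor. If a relative confidence test is wanted, one option is to require the peak to exceed a multiple of the median power across the searched bins. The sketch below is one possible form, not the project's implementation: `peak_above_median`, `MAX_BAND_BINS`, and the choice of `ratio` are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BAND_BINS 32   /* assumed upper limit on bins in the search band */

/* Returns true when the largest value in band_power[0..n-1]
 * exceeds `ratio` times the median of those values. */
bool peak_above_median(const float *band_power, size_t n, float ratio)
{
    if (n == 0 || n > MAX_BAND_BINS) return false;

    /* Insertion sort into a local copy; n is small (about 14 bins here). */
    float sorted[MAX_BAND_BINS];
    for (size_t i = 0; i < n; i++) {
        float v = band_power[i];
        size_t j = i;
        while (j > 0 && sorted[j - 1] > v) {
            sorted[j] = sorted[j - 1];
            j--;
        }
        sorted[j] = v;
    }

    float median = (n % 2) ? sorted[n / 2]
                           : 0.5f * (sorted[n / 2 - 1] + sorted[n / 2]);
    float peak = sorted[n - 1];
    return peak > ratio * median;
}
```

To use it, the per-bin powers computed in the peak-search loop would be stored in a small array and passed in after the loop; a suitable `ratio` has to be found experimentally.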

Voice activity detection

The VAD gate prevents pitch output during silence and unvoiced consonants:

#include <stdbool.h>

typedef struct {
    float noise_floor;
    float alpha;  // noise floor tracking
} VadState;

bool is_voiced(VadState *vad, float power, float centroid, float zcr) {
    // Adaptive noise floor tracking
    if (power < vad->noise_floor * 1.5f)
        vad->noise_floor = vad->alpha * vad->noise_floor
                         + (1.0f - vad->alpha) * power;

    return (power > vad->noise_floor * 3.0f)
        && (centroid > 80.0f)
        && (centroid < 2000.0f)
        && (zcr < 0.15f);
}
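The gate depends on how `noise_floor` is initialised: if it starts at zero, the condition `power < noise_floor * 1.5f` is never true, so the floor never adapts and every non-silent frame passes the power test. A minimal sketch of an initialiser that seeds the floor from a measured quiet frame follows; `vad_init` and the `1e-9f` lower bound are assumptions, and `VadState` and `is_voiced` are repeated from above only so that the sketch compiles on its own.

```c
#include <stdbool.h>

typedef struct {
    float noise_floor;
    float alpha;          /* noise floor smoothing factor, 0..1 */
} VadState;

/* Seed the noise floor from the power of a frame known to be quiet. */
void vad_init(VadState *vad, float initial_power, float alpha)
{
    /* Keep the floor strictly positive so that it can adapt (assumed bound). */
    vad->noise_floor = (initial_power > 1e-9f) ? initial_power : 1e-9f;
    vad->alpha = alpha;
}

/* Same gate as in the listing above. */
bool is_voiced(VadState *vad, float power, float centroid, float zcr)
{
    if (power < vad->noise_floor * 1.5f)
        vad->noise_floor = vad->alpha * vad->noise_floor
                         + (1.0f - vad->alpha) * power;

    return (power > vad->noise_floor * 3.0f)
        && (centroid > 80.0f)
        && (centroid < 2000.0f)
        && (zcr < 0.15f);
}
```

With the floor seeded at the quiet-frame power, a frame at that same power is rejected, while a frame well above three times the floor with a centroid and ZCR inside the limits is accepted.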

I2S microphone setup

#include "driver/i2s_std.h"

static i2s_chan_handle_t rx_chan;

void i2s_mic_init(void) {
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
        I2S_NUM_0, I2S_ROLE_MASTER);
    i2s_new_channel(&chan_cfg, NULL, &rx_chan);

    i2s_std_config_t std_cfg = {
        .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(16000),
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_MONO),
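        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,   // the INMP441 needs no master clock
            .bclk = GPIO_NUM_4,    // example pins only: verify against your DevKitC variant pinout
            .ws   = GPIO_NUM_5,    // (the ESP32-S3 has no GPIO22-25)
            .din  = GPIO_NUM_6,
            .dout = I2S_GPIO_UNUSED,
        },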
    };
    i2s_channel_init_std_mode(rx_chan, &std_cfg);
    i2s_channel_enable(rx_chan);
}
Note

This uses the ESP-IDF v5.x I2S driver API (i2s_std), which replaces the legacy driver/i2s.h API used in the older adaptive filtering and biquad embedded pages.

Spectral centroid

The spectral centroid is the power-weighted mean frequency — a key VAD feature:

float compute_spectral_centroid(float *fft_buf, int nfft) {
    float weighted_sum = 0, power_sum = 0;
    for (int k = 1; k < nfft / 2; k++) {
        float re = fft_buf[2*k];
        float im = fft_buf[2*k + 1];
        float power = re*re + im*im;
        float freq = (float)k * FS / nfft;
        weighted_sum += freq * power;
        power_sum += power;
    }
    return (power_sum > 1e-10f) ? weighted_sum / power_sum : 0.0f;
}

Main processing task

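#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define FRAME_SIZE 512
#define FS         16000

// Application-specific BLE helpers, implemented elsewhere
void ble_init(void);
void update_ble_characteristic(float f0);

// Illustrative starting values (assumptions): tune for your microphone and room
static VadState vad = { .noise_floor = 1e-6f, .alpha = 0.95f };

void pitch_task(void *param) {
    // static: keeps 4 KB of buffers off the 8 KB task stack
    static int32_t raw_samples[FRAME_SIZE];
    static float frame[FRAME_SIZE];
    size_t bytes_read;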

    pitch_init();

    while (true) {
        i2s_channel_read(rx_chan, raw_samples, sizeof(raw_samples),
                         &bytes_read, portMAX_DELAY);
        int n = bytes_read / sizeof(int32_t);
        if (n != FRAME_SIZE) continue;  // skip short reads: the FFT needs a full frame

        // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame
        float power = 0;
        float zcr_count = 0;
        float prev = 0;
        for (int i = 0; i < n; i++) {
            float sample = (float)(raw_samples[i] >> 8) / 8388608.0f;
            frame[i] = bandpass_filter(sample);
            power += frame[i] * frame[i];
            if (i > 0 && ((prev < 0 && frame[i] >= 0) ||
                          (prev >= 0 && frame[i] < 0)))
                zcr_count++;
            prev = frame[i];
        }
        power /= n;
        float zcr = zcr_count / (2.0f * n);

        // Pitch estimate (leaves the frame's spectrum in fft_buf)
        float f0 = estimate_pitch_fft(frame);

        // Spectral centroid from that spectrum, then the VAD check
        float centroid = compute_spectral_centroid(fft_buf, NFFT);
        if (is_voiced(&vad, power, centroid, zcr) && f0 > 0) {
            // Send via BLE or display
            update_ble_characteristic(f0);
        }
    }
}

void app_main(void) {
    i2s_mic_init();
    ble_init();
    xTaskCreatePinnedToCore(pitch_task, "pitch", 8192, NULL, 5, NULL, 1);
}

Performance budget (ESP32-S3, 240 MHz)

| Stage | Operation | Est. cycles | Time |
|---|---|---|---|
| I2S read + conversion | 512 samples | ~2K | 8 µs |
| Biquad bandpass (2 SOS) | 512 x 10 MACs | ~10K | 42 µs |
| Hann window | 512 multiplies | ~1K | 4 µs |
| 512-point FFT (ESP-DSP) | radix-2 f32 | ~30K | 125 µs |
| Peak search (80-500 Hz) | ~14 bins | ~100 | 0.4 µs |
| VAD features | power, centroid, ZCR | ~2K | 8 µs |
| Total per 32 ms frame | | ~45K | ~187 µs |
| Available per frame | | 7,680K | 32 ms |
| Utilisation | | | ~0.6% |

Platform comparison

| Feature | STM32F4 (NUCLEO-F446RE) | ESP32-S3 |
|---|---|---|
| Clock | 180 MHz (Cortex-M4F) | 240 MHz (Xtensa LX7) |
| Pitch method | Autocorrelation (CMSIS-DSP dot product) | FFT periodogram (ESP-DSP) |
| FFT function | arm_rfft_fast_f32 | dsps_fft2r_fc32 |
| Wireless output | External module (UART → BLE dongle) | Built-in BLE |
| I2S microphone | I2S (or SAI) peripheral + DMA | Built-in I2S driver |
| Real-time determinism | Excellent (bare-metal or RTOS) | Good (with core pinning) |
| Unit cost | ~EUR 18 (Nucleo) + EUR 2 (mic) | ~EUR 8 (DevKitC) + EUR 2 (mic) |
| Best for | Low-latency pitch tracking, industrial voice analysis | Wireless pitch feedback, speech therapy apps |

Recommendation

  • ESP32-S3 for wireless applications: speech therapy tools, musical tuners with phone display, voice coaching apps. The built-in BLE eliminates external modules.
  • STM32F4 for deterministic, low-latency applications: real-time voice processing pipelines, studio-grade pitch tracking, integration with existing audio hardware.

Bill of materials

ESP32-S3 wireless pitch detector

| Component | Purpose | Approx. cost |
|---|---|---|
| ESP32-S3-DevKitC | Processing + BLE | EUR 8 |
| INMP441 breakout | I2S MEMS microphone | EUR 2 |
| SSD1306 OLED (128x64) | Local pitch display (optional) | EUR 3 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 16 |

STM32F4 pitch tracker

| Component | Purpose | Approx. cost |
|---|---|---|
| NUCLEO-F446RE | Processing | EUR 18 |
| INMP441 breakout | I2S MEMS microphone | EUR 2 |
| USB-UART adapter | Output to PC (or use Nucleo’s built-in ST-Link UART) | EUR 0 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 23 |