Pitch Detection on Hardware

Real-time fundamental frequency estimation on ESP32-S3 and STM32F4

A real-time pitch detector must estimate $f_0$ from a short audio frame, decide whether speech is present, and deliver the result, all within the time it takes for the next frame to arrive. At 16 kHz with 32 ms (512-sample) frames, that budget is 5.76 million cycles on a 180 MHz STM32F446 and 7.68 million on a 240 MHz ESP32-S3. Both the autocorrelation and periodogram methods fit comfortably.
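The budget figures follow from frame duration times core clock. A minimal sketch of that arithmetic (the helper name `cycles_per_frame` is illustrative, not part of either vendor library):

```c
#include <stdint.h>

// Cycles available per frame = (frame_len / fs_hz) seconds * cpu_hz.
// Multiplying before dividing keeps the integer result exact.
static uint64_t cycles_per_frame(uint32_t frame_len, uint32_t fs_hz,
                                 uint64_t cpu_hz) {
    return (cpu_hz * frame_len) / fs_hz;
}
```

Calling it with 512 samples at 16 kHz gives 5,760,000 cycles for a 180 MHz core and 7,680,000 for a 240 MHz core.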

For the theory, comparison of methods, and Python prototypes, see the main pitch detection page.

STM32F4 (NUCLEO-F446RE): autocorrelation with CMSIS-DSP

The Cortex-M4F at 180 MHz with hardware FPU and CMSIS-DSP makes the autocorrelation method practical even for longer frames. The key CMSIS-DSP function is arm_correlate_f32, which computes the full cross-correlation: when applied to a signal with itself, it gives the autocorrelation.

Autocorrelation-based pitch detection

#include "arm_math.h"

#define FRAME_SIZE   512    // 32 ms at 16 kHz
#define FS           16000
#define F0_MIN       80     // Hz
#define F0_MAX       500    // Hz
#define LAG_MIN      (FS / F0_MAX)   // 32 samples
#define LAG_MAX      (FS / F0_MIN)   // 200 samples

static float32_t frame[FRAME_SIZE];
static float32_t acf[2 * FRAME_SIZE - 1];

float estimate_pitch_acf(float32_t *frame) {
    // Compute autocorrelation via CMSIS-DSP
    arm_correlate_f32(frame, FRAME_SIZE, frame, FRAME_SIZE, acf);

    // ACF is symmetric; centre peak is at index FRAME_SIZE - 1
    float32_t *acf_pos = &acf[FRAME_SIZE - 1];  // positive lags only
    float32_t acf_zero = acf_pos[0];             // normalisation factor

    if (acf_zero < 1e-6f) return 0.0f;  // silence

    // Find highest peak in the lag range [LAG_MIN, LAG_MAX]
    float32_t max_val = 0;
    uint32_t best_lag = 0;
    for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
        float32_t val = acf_pos[lag] / acf_zero;  // normalised
        if (val > max_val) {
            max_val = val;
            best_lag = lag;
        }
    }

    if (max_val < 0.3f || best_lag == 0) return 0.0f;  // unvoiced or no valid lag

    return (float)FS / best_lag;
}

Tip

arm_correlate_f32 computes the full $(2N-1)$-point correlation, which for $N = 512$ costs $N^2 \approx 262\text{K}$ multiply-accumulates. For pitch detection you only need lags 32 to 200, so a direct loop over that range is faster than the full CMSIS-DSP call. Use the direct approach for production; the CMSIS-DSP call is shown here for clarity.

Restricted-lag autocorrelation (faster)

For real-time use, computing only the lags you need is significantly cheaper:

float estimate_pitch_acf_fast(float32_t *frame) {
    float32_t acf_zero = 0;
    arm_dot_prod_f32(frame, frame, FRAME_SIZE, &acf_zero);
    if (acf_zero < 1e-6f) return 0.0f;

    float32_t max_val = 0;
    uint32_t best_lag = 0;

    for (uint32_t lag = LAG_MIN; lag <= LAG_MAX; lag++) {
        float32_t dot;
        arm_dot_prod_f32(frame, &frame[lag], FRAME_SIZE - lag, &dot);
        float32_t val = dot / acf_zero;
        if (val > max_val) {
            max_val = val;
            best_lag = lag;
        }
    }

    if (max_val < 0.3f || best_lag == 0) return 0.0f;
    return (float)FS / best_lag;
}

Each arm_dot_prod_f32 call is optimised by CMSIS-DSP using the Cortex-M4’s MAC instructions. For 169 lags (80 to 500 Hz range, inclusive) and a 512-sample frame, the upper bound is $169 \times 512 \approx 87\text{K}$ multiply-accumulates; because each dot product covers only the $512 - \text{lag}$ overlapping samples, the exact count is about 67K. Either way the loop costs on the order of 100K cycles on a Cortex-M4F, well under 1 ms at 180 MHz.
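The multiply-accumulate count can be checked by summing the overlap length per lag. A small sketch (the function name `restricted_acf_macs` is hypothetical; it excludes the single 512-sample lag-0 energy term):

```c
#include <stdint.h>

// Exact MAC count for the restricted-lag loop: the dot product at each lag
// runs over the (frame_len - lag) samples where the two copies overlap.
static uint32_t restricted_acf_macs(uint32_t frame_len,
                                    uint32_t lag_min, uint32_t lag_max) {
    uint32_t macs = 0;
    for (uint32_t lag = lag_min; lag <= lag_max; lag++)
        macs += frame_len - lag;
    return macs;
}
```

For a 512-sample frame and lags 32 to 200 this returns 66,924, against the simpler upper bound of 169 × 512 = 86,528.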

I2S audio input

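The NUCLEO-F446RE connects to an INMP441 MEMS microphone via I2S. Configure the SPI/I2S peripheral (or SAI, the Serial Audio Interface) with DMA for continuous acquisition. Two details are worth verifying on hardware before trusting the samples: an I2S frame carries both left and right slots, so a single microphone typically fills only every other 32-bit word, and the STM32F4 SPI/I2S peripheral moves 32-bit frames as two 16-bit half-words, which can leave the halves swapped in memory. The listing below assumes one sample per word in native order: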

// I2S DMA double-buffer: ping-pong between two frame buffers
static volatile int32_t dma_buf[2][FRAME_SIZE];  // volatile: written by DMA
static volatile uint8_t buf_ready = 0;

// DMA half-transfer and transfer-complete callbacks
void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s) {
    buf_ready = 1;
}
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s) {
    buf_ready = 2;
}

void process_audio(void) {
    HAL_I2S_Receive_DMA(&hi2s2, (uint16_t *)dma_buf, FRAME_SIZE * 2);

    while (1) {
        // Read and clear atomically to avoid ISR race
        __disable_irq();
        uint8_t b = buf_ready;
        buf_ready = 0;
        __enable_irq();

        if (b) {
            volatile int32_t *buf = dma_buf[b - 1];
            // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame
            for (int i = 0; i < FRAME_SIZE; i++)
                frame[i] = (float)(buf[i] >> 8) / 8388608.0f;

            float f0 = estimate_pitch_acf_fast(frame);
            if (f0 > 0) {
                // Output via UART, display, or DAC
            }
        }
    }
}

Performance budget (NUCLEO-F446RE, 180 MHz)

Stage	Operation	Est. cycles	Time
I2S → float conversion	512 multiplies	~1K	5.6 us
Restricted ACF (169 lags)	169 dot products	~100K	556 us
Peak search	169 comparisons	~500	2.8 us
Total per 32 ms frame		~102K	~564 us
Available per frame		5,760K	32 ms
Utilisation			~1.8%

ESP32-S3: FFT-based pitch detection with BLE output

The ESP32-S3 is well-suited for a wireless pitch detection system: built-in I2S for the microphone, ESP-DSP for FFT, and BLE for transmitting the result to a phone or display.

Architecture

Figure 1: On-device architecture: the I2S mic feeds a biquad bandpass, then parallel time-domain, spectral, and voice-activity branches whose result is published as a BLE characteristic.

Biquad bandpass pre-filter

A 4th-order Butterworth bandpass (80 to 500 Hz) removes DC, low-frequency rumble, and high harmonics:

typedef struct {
    float b0, b1, b2, a1, a2;
    float x1, x2, y1, y2;
} BiquadState;

static BiquadState sos[2];  // 2 sections = 4th-order

float biquad_process(BiquadState *s, float x) {
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
                        - s->a1 * s->y1 - s->a2 * s->y2;
    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}

float bandpass_filter(float sample) {
    float out = sample;
    for (int i = 0; i < 2; i++)
        out = biquad_process(&sos[i], out);
    return out;
}

Note

This uses Direct Form I with un-negated a1, a2 (SciPy convention). Coefficients can be pasted directly from scipy.signal.butter(..., output='sos'). See ADR-005 for the project’s coefficient convention.
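The listing above declares `sos[2]` but does not show how it is filled. One way to load a SciPy SOS row is sketched below; `biquad_load_sos_row` is a hypothetical helper, and the struct is repeated only so the fragment compiles on its own. No actual 80 to 500 Hz coefficients are given here; they must come from your own `scipy.signal.butter` call.

```c
// Same layout as the BiquadState above, repeated so this sketch is standalone.
typedef struct {
    float b0, b1, b2, a1, a2;
    float x1, x2, y1, y2;
} BiquadState;

// Load one SciPy SOS row [b0, b1, b2, a0, a1, a2] and clear the delay line.
// SciPy emits a0 == 1, so the division is a no-op there; it is kept so rows
// from other tools with a0 != 1 are normalised correctly.
static void biquad_load_sos_row(BiquadState *s, const float row[6]) {
    const float a0 = row[3];
    s->b0 = row[0] / a0;
    s->b1 = row[1] / a0;
    s->b2 = row[2] / a0;
    s->a1 = row[4] / a0;   // stored un-negated; biquad_process subtracts it
    s->a2 = row[5] / a0;
    s->x1 = s->x2 = s->y1 = s->y2 = 0.0f;
}
```

At start-up, calling this once per section (for example `biquad_load_sos_row(&sos[i], rows[i])` with `rows` pasted from the SciPy output) initialises both coefficients and filter state before the first sample is processed.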

FFT periodogram pitch estimation

ESP-DSP provides dsps_fft2r_fc32 for radix-2 FFT:

#include "dsps_fft2r.h"
#include "dsps_wind_hann.h"

#define NFFT 512

static float fft_buf[NFFT * 2];  // interleaved complex
static float window[NFFT];

void pitch_init(void) {
    dsps_fft2r_init_fc32(NULL, NFFT);
    dsps_wind_hann_f32(window, NFFT);
}

float estimate_pitch_fft(float *frame) {
    // Apply Hann window and pack as complex (imaginary = 0)
    for (int i = 0; i < NFFT; i++) {
        fft_buf[2*i]     = frame[i] * window[i];
        fft_buf[2*i + 1] = 0.0f;
    }

    dsps_fft2r_fc32(fft_buf, NFFT);
    dsps_bit_rev_fc32(fft_buf, NFFT);

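    // Find peak in 80-500 Hz band (bin spacing FS / NFFT = 31.25 Hz).
    // Round the lower edge up so bin 2 (62.5 Hz, below the band) is skipped.
    int k_min = (80 * NFFT + FS - 1) / FS;   // ceil  -> bin 3  (93.75 Hz)
    int k_max = (500 * NFFT) / FS;           // floor -> bin 16 (500 Hz)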
    float max_power = 0;
    int peak_bin = k_min;

    for (int k = k_min; k <= k_max; k++) {
        float re = fft_buf[2*k];
        float im = fft_buf[2*k + 1];
        float power = re*re + im*im;
        if (power > max_power) {
            max_power = power;
            peak_bin = k;
        }
    }

    // Crude energy gate only: reject frames whose peak power is negligible.
    // (No peak-to-median confidence test is implemented here.)
    if (max_power < 1e-6f) return 0.0f;

    return (float)peak_bin * FS / NFFT;
}
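With NFFT = 512 at 16 kHz the bins are 31.25 Hz apart, so how the band edges are rounded matters: truncating the lower edge would give bin 2 (62.5 Hz), just outside the band, while rounding up gives bin 3 (93.75 Hz). A sketch of inward rounding with integer arithmetic (`band_to_bins` is an illustrative name, not an ESP-DSP function):

```c
// Map a band [f_lo, f_hi] in Hz to inclusive FFT bin indices, rounding inward
// so every searched bin lies inside the band. Integer math avoids float
// rounding surprises at exact bin edges.
static void band_to_bins(int f_lo, int f_hi, int nfft, int fs,
                         int *k_min, int *k_max) {
    *k_min = (f_lo * nfft + fs - 1) / fs;   // ceil(f_lo * nfft / fs)
    *k_max = (f_hi * nfft) / fs;            // floor(f_hi * nfft / fs)
}
```

For 80 to 500 Hz this yields bins 3 through 16, which is 14 bins. The same spacing also limits the returned pitch to multiples of 31.25 Hz.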

Cepstral pitch estimation (two FFTs)

The cepstrum is the other frequency-domain method, and it lands on the metal with a small trick. The real cepstrum is $c[n] = \text{IDFT}\{\log|\text{DFT}\{x\}|\}$, but $\log|X[k]|$ is real and even for a real frame, so its inverse DFT equals $1/N$ times its forward DFT. That means a second forward FFT gives the cepstrum directly: no inverse transform, and no need to unpack a real-FFT’s interleaved format. It reuses the exact dsps_fft2r_fc32 machinery the periodogram already set up.
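The forward-for-inverse substitution can be checked numerically on a toy case. The sketch below is a naive 8-point DFT with tabulated twiddles, written only for illustration (it is not how ESP-DSP computes anything): for a real, even sequence the forward and inverse-direction transforms agree, and for a non-symmetric one they do not.

```c
// cos and sin of 2*pi*m/8 for m = 0..7, tabulated so no math library is needed.
static const float COS8[8] = { 1.0f,  0.70710678f,  0.0f, -0.70710678f,
                              -1.0f, -0.70710678f,  0.0f,  0.70710678f };
static const float SIN8[8] = { 0.0f,  0.70710678f,  1.0f,  0.70710678f,
                               0.0f, -0.70710678f, -1.0f, -0.70710678f };

// Naive 8-point DFT of a real sequence. sign = -1 is the forward transform;
// sign = +1 is the inverse direction, left unnormalised (no 1/N factor).
static void dft8(const float x[8], float re[8], float im[8], int sign) {
    for (int k = 0; k < 8; k++) {
        re[k] = 0.0f;
        im[k] = 0.0f;
        for (int n = 0; n < 8; n++) {
            int m = (k * n) % 8;
            re[k] += x[n] * COS8[m];
            im[k] += (float)sign * x[n] * SIN8[m];
        }
    }
}

// Largest absolute difference between the forward and inverse-direction DFTs.
static float forward_inverse_gap(const float x[8]) {
    float rf[8], jf[8], ri[8], ji[8];
    dft8(x, rf, jf, -1);
    dft8(x, ri, ji, +1);
    float gap = 0.0f;
    for (int k = 0; k < 8; k++) {
        float dr = rf[k] - ri[k];
        float di = jf[k] - ji[k];
        if (dr < 0.0f) dr = -dr;
        if (di < 0.0f) di = -di;
        if (dr > gap) gap = dr;
        if (di > gap) gap = di;
    }
    return gap;
}
```

An even sequence such as {3, 1, 4, 1, 5, 1, 4, 1}, where x[n] equals x[8 − n], gives a gap at rounding-error level, whereas a shifted impulse {0, 1, 0, ...} does not. The log-magnitude spectrum of a real frame has exactly this even symmetry, which is why the second forward FFT is sufficient.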

#include <math.h>

// Cepstral pitch: two forward FFTs, reusing fft_buf / window / pitch_init above.
// dsps_fft2r_init_fc32 built the twiddle table once for NFFT in pitch_init; both
// FFT calls below reuse it, so no re-initialisation is needed between them.
float estimate_pitch_cepstrum(float *frame) {
    // 1) Windowed frame -> complex, forward FFT.
    for (int i = 0; i < NFFT; i++) {
        fft_buf[2*i]     = frame[i] * window[i];
        fft_buf[2*i + 1] = 0.0f;
    }
    dsps_fft2r_fc32(fft_buf, NFFT);
    dsps_bit_rev_fc32(fft_buf, NFFT);

    // 2) Replace the spectrum with its log-magnitude, repacked as a real
    //    sequence (imag = 0). log|X| = 0.5*log(re^2 + im^2); eps guards silence.
    for (int k = 0; k < NFFT; k++) {
        float re = fft_buf[2*k], im = fft_buf[2*k + 1];
        fft_buf[2*k]     = 0.5f * logf(re*re + im*im + 1e-12f);
        fft_buf[2*k + 1] = 0.0f;
    }

    // 3) Second forward FFT; the cepstrum is its real part divided by NFFT.
    dsps_fft2r_fc32(fft_buf, NFFT);
    dsps_bit_rev_fc32(fft_buf, NFFT);

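    // 4) Peak-pick the quefrency band. Higher pitch means a shorter period,
    //    hence a smaller quefrency index, so the bounds flip. The cepstrum of
    //    a real frame is symmetric about NFFT/2, so indices above NFFT/2 only
    //    mirror lower ones: clamp q_max there. With NFFT = 512 the lowest
    //    searchable pitch is FS / 256 = 62.5 Hz, not the nominal 50 Hz.
    int q_min = (int)(FS / 400.0f);          // 40 samples  -> 400 Hz
    int q_max = (int)(FS / 50.0f);           // 320 samples -> 50 Hz (nominal)
    if (q_max > NFFT / 2) q_max = NFFT / 2;  // stay out of the mirrored half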
    float max_c = -1e30f;
    int peak_q = q_min;
    for (int q = q_min; q <= q_max; q++) {
        float c = fft_buf[2*q] / (float)NFFT;   // real part of the cepstrum
        if (c > max_c) { max_c = c; peak_q = q; }
    }
    // This returns the best pitch candidate unconditionally, with no voicing
    // test of its own. Before trusting it, gate it with the is_voiced() check
    // from the Voice activity detection section (the same test the periodogram
    // path uses), so silent or unvoiced frames are not assigned a spurious pitch.
    return FS / (float)peak_q;
}
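The quefrency bounds in step 4 interact with the frame length: a 512-point cepstrum is symmetric about index 256, so a nominal 50 Hz limit (320 samples) reaches into the mirrored half. A sketch of the bound calculation with that clamp applied (`quefrency_band` is a hypothetical helper name):

```c
// Quefrency search band, in samples, for a pitch range f0_min..f0_max (Hz).
// The upper bound is clamped to nfft/2 because cepstral indices above that
// point mirror the lower half for a real input frame.
static void quefrency_band(int fs, int f0_max, int f0_min, int nfft,
                           int *q_min, int *q_max) {
    *q_min = fs / f0_max;                       // shortest period of interest
    *q_max = fs / f0_min;                       // longest period of interest
    if (*q_max > nfft / 2) *q_max = nfft / 2;   // avoid the mirrored half
}
```

At 16 kHz with a 512-sample frame, a 50 to 400 Hz request becomes indices 40 to 256, so the lowest pitch actually searched is 16000 / 256 = 62.5 Hz. Reaching 50 Hz would need a longer frame.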

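Two NFFT-point FFTs cost roughly twice a single periodogram FFT. A radix-2 transform has $(N/2)\log_2 N = 256 \times 9 = 2304$ butterflies, so the pair needs about 4600; at the ~30K cycles per 512-point FFT estimated in the ESP32-S3 performance budget, that is roughly 60K cycles, under 100K cycles at 240 MHz and comparable to the autocorrelation budget on the NUCLEO. The 512 logf calls between the transforms come on top of that and are worth profiling on the target. In exchange the cepstrum separates pitch from the vocal-tract envelope, so the same two transforms also hand you the low-quefrency coefficients that describe what is being said, the seed of an MFCC front end.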
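For reference, the butterfly count of a radix-2 FFT is $(N/2)\log_2 N$. A short sketch of that count (the helper is illustrative and assumes `n` is a power of two):

```c
// Radix-2 butterflies in one n-point complex FFT: (n / 2) * log2(n).
// n is assumed to be a power of two.
static unsigned fft_radix2_butterflies(unsigned n) {
    unsigned stages = 0;
    for (unsigned m = n; m > 1; m >>= 1)
        stages++;                 // log2(n) stages
    return (n / 2) * stages;      // n/2 butterflies per stage
}
```

A 512-point transform has 2304 butterflies, so the two transforms of the cepstral path total 4608. Cycle counts per butterfly depend on the library and memory placement, so treat this as a scaling guide, not a timing.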

The identical recipe runs on the NUCLEO-F446RE with CMSIS-DSP, with one API difference worth stating: each ESP-DSP FFT-plus-bit-reverse pair becomes a single CMSIS call, arm_cfft_f32(&arm_cfft_sR_f32_len512, fft_buf, 0, 1), where the last two arguments select the forward transform and in-call bit-reversal. One arm_cfft_f32 call replaces both dsps_ calls, so do not call it twice. Both libraries compute an unnormalised forward FFT, so the single divide by NFFT at the peak-pick step is the right and only scaling on either target; everything between the transforms is portable float C.

Voice activity detection

The VAD gate prevents pitch output during silence and unvoiced consonants:

typedef struct {
    float noise_floor;
    float alpha;  // noise floor tracking
} VadState;

bool is_voiced(VadState *vad, float power, float centroid, float zcr) {
    // Adaptive noise floor tracking
    if (power < vad->noise_floor * 1.5f)
        vad->noise_floor = vad->alpha * vad->noise_floor
                         + (1.0f - vad->alpha) * power;

    return (power > vad->noise_floor * 3.0f)
        && (centroid > 80.0f)
        && (centroid < 2000.0f)
        && (zcr < 0.15f);
}
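One detail the listing leaves open is how `VadState` starts out. If `noise_floor` begins at zero, the test `power < noise_floor * 1.5f` can never be true, so the floor never adapts and any non-zero power passes the `3.0f` threshold. The sketch below isolates the tracking rule to show this; `vad_init` and `vad_track_floor` are hypothetical names, and the struct is repeated so the fragment stands alone.

```c
// Same fields as the VadState above, repeated so this sketch is standalone.
typedef struct {
    float noise_floor;
    float alpha;   // smoothing factor for noise floor tracking
} VadState;

// Seed the tracker. initial_floor should be a small positive power estimate.
static void vad_init(VadState *vad, float initial_floor, float alpha) {
    vad->noise_floor = initial_floor;
    vad->alpha = alpha;
}

// Same update rule as in is_voiced(): follow the frame power only while it
// stays below 1.5x the current floor, so loud speech does not raise it.
static void vad_track_floor(VadState *vad, float power) {
    if (power < vad->noise_floor * 1.5f)
        vad->noise_floor = vad->alpha * vad->noise_floor
                         + (1.0f - vad->alpha) * power;
}
```

Seeding the floor from the mean power of the first few frames after power-up is one option; the right value and `alpha` depend on the microphone gain and should be tuned on real recordings.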

I2S microphone setup

#include "driver/i2s_std.h"

static i2s_chan_handle_t rx_chan;

void i2s_mic_init(void) {
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
        I2S_NUM_0, I2S_ROLE_MASTER);
    i2s_new_channel(&chan_cfg, NULL, &rx_chan);

    i2s_std_config_t std_cfg = {
        .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(16000),
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_MONO),
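        .gpio_cfg = {
            // Example pins only: verify against your DevKitC variant pinout.
            // GPIO22-25 do not exist on the ESP32-S3, and GPIO26-32 are
            // usually reserved for flash/PSRAM.
            .mclk = I2S_GPIO_UNUSED,  // INMP441 needs no master clock
            .bclk = GPIO_NUM_4,
            .ws   = GPIO_NUM_5,
            .din  = GPIO_NUM_6,
            .dout = I2S_GPIO_UNUSED,
        },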
    };
    i2s_channel_init_std_mode(rx_chan, &std_cfg);
    i2s_channel_enable(rx_chan);
}

Note

This uses the ESP-IDF v5.x I2S driver API (i2s_std), which replaces the legacy driver/i2s.h API used in the older adaptive filtering and biquad embedded pages.

Spectral centroid

The spectral centroid is the power-weighted mean frequency, a key VAD feature:

float compute_spectral_centroid(float *fft_buf, int nfft) {
    float weighted_sum = 0, power_sum = 0;
    for (int k = 1; k < nfft / 2; k++) {
        float re = fft_buf[2*k];
        float im = fft_buf[2*k + 1];
        float power = re*re + im*im;
        float freq = (float)k * FS / nfft;
        weighted_sum += freq * power;
        power_sum += power;
    }
    return (power_sum > 1e-10f) ? weighted_sum / power_sum : 0.0f;
}

Main processing task

#define FRAME_SIZE 512
#define FS         16000

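// VAD state used by pitch_task. The starting values are assumptions, not
// measured figures: a small non-zero floor so tracking can adapt, and slow
// smoothing. Tune both on real recordings.
static VadState vad = { .noise_floor = 1e-6f, .alpha = 0.95f };

void pitch_task(void *param) {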
    int32_t raw_samples[FRAME_SIZE];
    float frame[FRAME_SIZE];
    size_t bytes_read;

    pitch_init();

    while (true) {
        i2s_channel_read(rx_chan, raw_samples, sizeof(raw_samples),
                         &bytes_read, portMAX_DELAY);
        int n = bytes_read / sizeof(int32_t);
        if (n > FRAME_SIZE) n = FRAME_SIZE;  // clamp to buffer size
        if (n == 0) continue;                // nothing read: skip, avoids /0 below

        // INMP441: 24-bit audio in bits 31:8 of the 32-bit I2S frame
        float power = 0;
        float zcr_count = 0;
        float prev = 0;
        for (int i = 0; i < n; i++) {
            float sample = (float)(raw_samples[i] >> 8) / 8388608.0f;
            frame[i] = bandpass_filter(sample);
            power += frame[i] * frame[i];
            if (i > 0 && ((prev < 0 && frame[i] >= 0) ||
                          (prev >= 0 && frame[i] < 0)))
                zcr_count++;
            prev = frame[i];
        }
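        // Zero-pad after a short read so the FFT never sees stale samples
        for (int i = n; i < FRAME_SIZE; i++)
            frame[i] = 0.0f;
        power /= n;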
        float zcr = zcr_count / (2.0f * n);

        // Pitch estimate; this also leaves the frame's spectrum in fft_buf,
        // which the centroid calculation below reuses
        float f0 = estimate_pitch_fft(frame);

        // VAD check
        float centroid = compute_spectral_centroid(fft_buf, NFFT);
        if (is_voiced(&vad, power, centroid, zcr) && f0 > 0) {
            // Send via BLE or display
            update_ble_characteristic(f0);
        }
    }
}

void app_main(void) {
    i2s_mic_init();
    ble_init();
    xTaskCreatePinnedToCore(pitch_task, "pitch", 8192, NULL, 5, NULL, 1);
}
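The `zcr` value computed in `pitch_task` is normalised by 2n, which makes it approximately $f / f_s$ for a sinusoid of frequency $f$ (two sign changes per cycle). The VAD threshold `zcr < 0.15f` therefore corresponds to roughly 0.15 × 16000 = 2400 Hz. A standalone sketch of the same measure (`zero_crossing_rate` is an illustrative name):

```c
// Count sign changes in x[0..n-1] and normalise by 2n, matching the inline
// calculation in pitch_task. Zero is treated as non-negative.
static float zero_crossing_rate(const float *x, int n) {
    int count = 0;
    for (int i = 1; i < n; i++)
        if ((x[i - 1] < 0.0f) != (x[i] < 0.0f))
            count++;
    return (float)count / (2.0f * (float)n);
}
```

An alternating sequence of eight samples has seven sign changes and returns 7/16 = 0.4375, close to the 0.5 expected for a tone at the Nyquist frequency; a sequence that never changes sign returns 0.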

Performance budget (ESP32-S3, 240 MHz)

Stage	Operation	Est. cycles	Time
I2S read + conversion	512 samples	~2K	8 us
Biquad bandpass (2 SOS)	512 x 10 MACs	~10K	42 us
Hann window	512 multiplies	~1K	4 us
512-point FFT (ESP-DSP)	radix-2 f32	~30K	125 us
Peak search (80-500 Hz)	~14 bins	~100	0.4 us
VAD features	power, centroid, ZCR	~2K	8 us
Total per 32 ms frame		~45K	~187 us
Available per frame		7,680K	32 ms
Utilisation			~0.6%

Platform comparison

Feature	STM32F4 (NUCLEO-F446RE)	ESP32-S3
Clock	180 MHz (Cortex-M4F)	240 MHz (Xtensa LX7)
Pitch method	Autocorrelation (CMSIS-DSP dot product)	FFT periodogram (ESP-DSP)
FFT library	`arm_cfft_f32` (CMSIS-DSP; not needed by the autocorrelation path)	`dsps_fft2r_fc32`
Wireless output	External module (UART → BLE dongle)	Built-in BLE
I2S microphone	Via SPI/I2S (or SAI) peripheral + DMA	Built-in I2S driver
Real-time determinism	Excellent (bare-metal or RTOS)	Good (with core pinning)
Unit cost	~EUR 18 (Nucleo) + EUR 2 (mic)	~EUR 8 (DevKitC) + EUR 2 (mic)
Best for	Low-latency pitch tracking, industrial voice analysis	Wireless pitch feedback, speech therapy apps

Recommendation

ESP32-S3 for wireless applications: speech therapy tools, musical tuners with phone display, voice coaching apps. The built-in BLE eliminates external modules.
STM32F4 for deterministic, low-latency applications: real-time voice processing pipelines, studio-grade pitch tracking, integration with existing audio hardware.

Bill of materials

ESP32-S3 wireless pitch detector

Component	Purpose	Approx. cost
ESP32-S3-DevKitC	Processing + BLE	EUR 8
INMP441 breakout	I2S MEMS microphone	EUR 2
SSD1306 OLED (128x64)	Local pitch display (optional)	EUR 3
Breadboard + wires		EUR 3
Total		~EUR 16

STM32F4 pitch tracker

Component	Purpose	Approx. cost
NUCLEO-F446RE	Processing	EUR 18
INMP441 breakout	I2S MEMS microphone	EUR 2
USB-UART adapter	Output to PC (or use Nucleo’s built-in ST-Link UART)	EUR 0
Breadboard + wires		EUR 3
Total		~EUR 23
