Beamforming on Hardware

Multi-microphone direction-of-arrival estimation on ESP32-S3 and STM32F4

A microphone array on a microcontroller turns the delay-and-sum beamformer from a simulation into a real-time direction finder. The challenge is multi-channel synchronous acquisition — all microphones must be sampled at the same instant, with known inter-channel timing. I2S handles this naturally for 2 channels (stereo); more channels require either multiple I2S peripherals or TDM mode.

For the theory (array geometry, beam patterns, TDOA estimation, frequency-domain beamforming), see the main beamforming page.


Array hardware

Microphone array options

Configuration   Mics         Channels needed      I2S requirement     Use case
2-mic stereo    2x INMP441   2 (L+R on one I2S)   1 I2S peripheral    Left/right DOA, noise reduction
4-mic linear    4x INMP441   4 (2 stereo I2S)     2 I2S peripherals   Azimuth DOA, moderate resolution
4-mic square    4x INMP441   4                    2 I2S peripherals   2D DOA (azimuth + elevation)

The INMP441 has a left/right channel select pin (L/R): tie it low for left channel, high for right. Two INMP441 modules on one I2S bus give synchronous stereo — the simplest array.

2-mic array on ESP32-S3

    d = 50 mm
  |<-------->|
[MIC0 (L)]  [MIC1 (R)]
  |             |
  |-- I2S_0 ---|
       |
   ESP32-S3

With \(d = 50\) mm spacing and sound speed \(c = 343\) m/s:

  • Maximum inter-mic delay: \(\tau_\text{max} = d/c = 146\) us
  • At \(f_s = 16\) kHz: \(\tau_\text{max} \approx 2.3\) samples
  • Angular resolution limited by array size — adequate for left/right/centre classification

4-mic linear array on ESP32-S3

The ESP32-S3 has 2 I2S peripherals. Each runs in stereo, giving 4 synchronous channels:

    d = 50 mm
  |<-------->|
[M0(L)] [M1(R)] [M2(L)] [M3(R)]
  |         |      |         |
  |-- I2S0 -|      |-- I2S1 -|
       |                |
       ESP32-S3

Inter-peripheral synchronisation

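The two I2S peripherals derive their clocks from the same on-chip source (the ESP32-S3 has no APLL, so this is a PLL or crystal clock rather than the audio PLL of the original ESP32), which keeps their sample clocks frequency-locked. Their WS phases and DMA start times are not guaranteed to coincide, however. For precise TDOA estimation, calibrate the inter-peripheral offset by correlating a known reference signal at startup, or share one BCLK/WS pair between both microphone pairs (one peripheral as clock master, the other as slave) so that all four microphones latch on the same clock edges.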
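The startup calibration can be sketched as follows. This is a minimal illustration, not a tested ESP32 routine: it assumes a reference sound (for example a click) reaches one microphone of each I2S pair at the same instant, so any lag found between the two recordings comes from the peripherals and not from acoustics. `estimate_offset_samples` and `remove_offset` are hypothetical helper names.

```c
/* Integer lag (in samples) by which `b` trails `a`.
 * Positive result: the same event appears later in `b` than in `a`. */
int estimate_offset_samples(const float *a, const float *b, int n, int max_lag) {
    float best = -1e30f;
    int best_lag = 0;
    for (int lag = -max_lag; lag <= max_lag; lag++) {
        float sum = 0.0f;
        int start = (lag > 0) ? lag : 0;
        int end   = (lag > 0) ? n : n + lag;
        for (int i = start; i < end; i++)
            sum += b[i] * a[i - lag];
        if (sum > best) { best = sum; best_lag = lag; }
    }
    return best_lag;
}

/* Shift `b` in place by `offset` samples so it lines up with `a`.
 * Samples shifted in from outside the frame are set to zero. */
void remove_offset(float *b, int n, int offset) {
    if (offset >= 0) {
        /* Read ahead of the write position, so iterate forwards. */
        for (int i = 0; i < n; i++)
            b[i] = (i + offset < n) ? b[i + offset] : 0.0f;
    } else {
        /* Read behind the write position, so iterate backwards. */
        for (int i = n - 1; i >= 0; i--)
            b[i] = (i + offset >= 0) ? b[i + offset] : 0.0f;
    }
}
```

The offset would be measured once after both channels are enabled and then applied to every frame from the second peripheral. This only corrects whole-sample offsets; a sub-sample skew needs a fractional delay.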


ESP32-S3: delay-and-sum beamformer

I2S stereo microphone setup (2-mic)

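#include "driver/i2s_std.h"
#include "driver/gpio.h"
#include "esp_err.h"

#define FS          16000
#define FRAME_SIZE  256

// Example pins only: the ESP32-S3 has no GPIO22-25, and GPIO26-32 are
// usually taken by flash/PSRAM. Pick any free GPIOs on your board.
#define MIC_BCLK_GPIO  GPIO_NUM_4
#define MIC_WS_GPIO    GPIO_NUM_5
#define MIC_DIN_GPIO   GPIO_NUM_6

static i2s_chan_handle_t rx_chan;

void i2s_mic_array_init(void) {
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
        I2S_NUM_0, I2S_ROLE_MASTER);
    // RX only: pass NULL for the TX handle
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, NULL, &rx_chan));

    i2s_std_config_t std_cfg = {
        .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(FS),
        // INMP441 sends 24-bit samples left-justified in 32-bit slots
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,  // if omitted, defaults to 0 and drives GPIO0
            .bclk = MIC_BCLK_GPIO,
            .ws   = MIC_WS_GPIO,
            .din  = MIC_DIN_GPIO,
            .dout = I2S_GPIO_UNUSED,
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(rx_chan));
}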

Deinterleave stereo to two channels

I2S stereo data arrives interleaved: L, R, L, R, … Deinterleave into separate buffers:

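// INMP441 samples are 24-bit, left-justified in each 32-bit slot:
// shift right by 8 to drop the padding byte (keeping the sign), then
// scale by 2^23 to get floats in [-1, 1).
void deinterleave(const int32_t *interleaved, float *ch0, float *ch1, int n_frames) {
    for (int i = 0; i < n_frames; i++) {
        ch0[i] = (float)(interleaved[2*i]     >> 8) / 8388608.0f;
        ch1[i] = (float)(interleaved[2*i + 1] >> 8) / 8388608.0f;
    }
}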

TDOA estimation via cross-correlation

The time-difference of arrival between two microphones is the lag of the cross-correlation peak. For a 2-mic array, one TDOA value gives the angle of arrival:

#include <math.h>

#define MAX_LAG  4   // max lag in samples (~250 us at 16 kHz)

// Cross-correlation over a small lag range (brute force is fast for small MAX_LAG).
// Sign convention: a positive result means ch0 lags ch1, i.e. the wavefront
// reached MIC1 first. Resolution is one sample (62.5 us at 16 kHz).
float estimate_tdoa(const float *ch0, const float *ch1, int n) {
    float best_corr = -1e30f;
    int best_lag = 0;

    for (int lag = -MAX_LAG; lag <= MAX_LAG; lag++) {
        float sum = 0.0f;
        // Only sum indices where both ch0[i] and ch1[i - lag] are in range
        int start = (lag > 0) ? lag : 0;
        int end   = (lag > 0) ? n : n + lag;
        for (int i = start; i < end; i++) {
            sum += ch0[i] * ch1[i - lag];
        }
        if (sum > best_corr) {
            best_corr = sum;
            best_lag = lag;
        }
    }

    return (float)best_lag / FS;  // TDOA in seconds
}

// Convert TDOA to angle of arrival (positive angles point towards MIC1)
float tdoa_to_angle(float tdoa, float mic_spacing) {
    float sin_theta = tdoa * 343.0f / mic_spacing;
    // Clamp to the valid range: noise can select a lag larger than d/c allows
    // (for d = 50 mm at 16 kHz, lags of 3 or 4 samples give |sin_theta| > 1)
    if (sin_theta > 1.0f) sin_theta = 1.0f;
    if (sin_theta < -1.0f) sin_theta = -1.0f;
    return asinf(sin_theta) * 180.0f / (float)M_PI;  // degrees from broadside
}
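Because the peak lag is a whole number of samples, the angle estimate is coarse. One common refinement, which is not part of the code above, is to fit a parabola through the correlation value at the peak and its two neighbours and take the vertex as a sub-sample correction. The sketch below assumes the caller has kept those three values; `parabolic_peak_offset` is a hypothetical name.

```c
/* Sub-sample peak correction from three correlation values around the
 * integer peak k: y_m1 = R[k-1], y_0 = R[k], y_p1 = R[k+1].
 * Returns the vertex position relative to k, in samples. When y_0 is the
 * largest of the three, the result lies within +/-0.5. */
float parabolic_peak_offset(float y_m1, float y_0, float y_p1) {
    float denom = y_m1 - 2.0f * y_0 + y_p1;
    if (denom >= 0.0f) return 0.0f;   /* flat or not a maximum: no correction */
    return 0.5f * (y_m1 - y_p1) / denom;
}
```

The refined lag is the integer peak lag plus this offset, divided by the sample rate to give seconds. A peak at the edge of the search range (±MAX_LAG) has only one neighbour, so the correction cannot be applied there.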

Delay-and-sum with fractional delay

Integer-sample delay is simple (index offset), but for angles that produce fractional-sample TDOA, linear interpolation improves accuracy:

// Apply fractional delay using linear interpolation.
// delay_samples can be positive or negative.
void apply_fractional_delay(float *in, float *out, int n, float delay_samples) {
    int d_int = (int)floorf(delay_samples);
    float frac = delay_samples - d_int;

    for (int i = 0; i < n; i++) {
        int idx = i - d_int;
        if (idx >= 1 && idx < n) {
            out[i] = (1.0f - frac) * in[idx] + frac * in[idx - 1];
        } else if (idx >= 0 && idx < n) {
            out[i] = (1.0f - frac) * in[idx];  // no past sample available
        } else {
            out[i] = 0.0f;
        }
    }
}

// Delay-and-sum beamformer for 2 channels
void beam_steer(float *ch0, float *ch1, float *output, int n,
                float angle_deg, float mic_spacing) {
    if (n > FRAME_SIZE) n = FRAME_SIZE;  // bounds guard

    float theta = angle_deg * M_PI / 180.0f;
    float delay_s = mic_spacing * sinf(theta) / 343.0f;
    float delay_samples = delay_s * FS;

    static float ch1_delayed[FRAME_SIZE];  // static: avoid stack overflow in FreeRTOS task
    apply_fractional_delay(ch1, ch1_delayed, n, delay_samples);

    for (int i = 0; i < n; i++)
        output[i] = 0.5f * (ch0[i] + ch1_delayed[i]);
}

Beam scanning for DOA estimation

Scan across angles and find the steering direction that maximises output power:

float estimate_doa(float *ch0, float *ch1, int n, float mic_spacing) {
    float best_power = 0;
    float best_angle = 0;
    static float output[FRAME_SIZE];  // static: avoid stack overflow

    // Scan from -90 to +90 degrees in 5-degree steps
    for (int angle = -90; angle <= 90; angle += 5) {
        beam_steer(ch0, ch1, output, n, (float)angle, mic_spacing);

        // Compute output power
        float power = 0;
        for (int i = 0; i < n; i++)
            power += output[i] * output[i];

        if (power > best_power) {
            best_power = power;
            best_angle = (float)angle;
        }
    }
    return best_angle;
}

Main task

#include <stdbool.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

void beamforming_task(void *param) {
    int32_t raw[FRAME_SIZE * 2];  // stereo interleaved
    float ch0[FRAME_SIZE], ch1[FRAME_SIZE];
    size_t bytes_read;

    float mic_spacing = 0.05f;  // 50 mm

    while (true) {
        esp_err_t err = i2s_channel_read(rx_chan, raw, sizeof(raw),
                                         &bytes_read, portMAX_DELAY);
        if (err != ESP_OK) continue;  // drop the frame on a read error
        int n = bytes_read / (2 * sizeof(int32_t));
        if (n > FRAME_SIZE) n = FRAME_SIZE;  // clamp to buffer size

        deinterleave(raw, ch0, ch1, n);

        // Both methods are shown for comparison; pick one in practice.
        // Method 1: TDOA via cross-correlation (fast, integer-sample resolution)
        float tdoa = estimate_tdoa(ch0, ch1, n);
        float angle_xcorr = tdoa_to_angle(tdoa, mic_spacing);

        // Method 2: Beam scan (more robust, fractional-sample, but slower)
        float angle_scan = estimate_doa(ch0, ch1, n, mic_spacing);

        // Output via UART, OLED, or BLE
        (void)angle_xcorr;  // placeholders until an output path is added
        (void)angle_scan;
    }
}

void app_main(void) {
    i2s_mic_array_init();
    // 8 KB stack: the three frame buffers above already use 4 KB of it
    xTaskCreatePinnedToCore(beamforming_task, "beam", 8192,
                            NULL, 5, NULL, 1);
}

STM32F4 (NUCLEO-F446RE): CMSIS-DSP cross-correlation

The STM32F4 approach uses arm_correlate_f32 for the TDOA estimation and multi-channel ADC with DMA for sensor acquisition.

Multi-channel ADC setup

For non-audio sensor arrays (vibration sensors, ultrasonic transducers), use the on-chip ADC in scan mode with DMA:

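#include "stm32f4xx_hal.h"

// ADC1 scans 4 channels in sequence via DMA. A single ADC converts one
// channel at a time, so channel k is sampled k conversion times after
// channel 0 (~1 us per 12-bit conversion, depending on the ADC clock and
// sampling-time settings).
// Scan rate: configure a timer trigger for the desired sample rate.

#define N_CHANNELS  4
#define FRAME_SIZE  256

// Double buffer; assumes circular DMA over both halves
static uint16_t adc_dma_buf[2][FRAME_SIZE * N_CHANNELS];
static float channels[N_CHANNELS][FRAME_SIZE];

// Deinterleave scan data: ch0, ch1, ch2, ch3, ch0, ch1, ...
// Subtracting mid-scale (2048) removes the DC bias of a sensor centred on
// VREF/2 (an assumption about the analogue front end). Without it the
// offset dominates the cross-correlation and pulls the peak to zero lag.
static void deinterleave_adc(const uint16_t *buf) {
    for (int i = 0; i < FRAME_SIZE; i++) {
        for (int ch = 0; ch < N_CHANNELS; ch++) {
            channels[ch][i] = ((float)buf[i * N_CHANNELS + ch] - 2048.0f) / 2048.0f;
        }
    }
}

// First half of the DMA buffer is full
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc) {
    deinterleave_adc(adc_dma_buf[0]);
}

// Second half of the DMA buffer is full
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc) {
    deinterleave_adc(adc_dma_buf[1]);
}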
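A single ADC in scan mode converts its channels one after another, so two channels of the same frame are never sampled at exactly the same instant. That fixed skew adds directly to any measured TDOA. The sketch below removes it, assuming a constant time between consecutive conversions and the sign convention of the ESP32 cross-correlation code above (a positive TDOA means the reference channel lags the test channel). `correct_scan_skew` and its parameter names are hypothetical.

```c
/* Remove the skew that sequential scan conversion adds to a measured TDOA.
 *   tdoa_s     measured TDOA in seconds (positive: ref lags test)
 *   rank_ref   position of the reference channel in the scan sequence (0-based)
 *   rank_test  position of the test channel in the scan sequence (0-based)
 *   t_conv_s   time between consecutive conversions (assumed constant) */
float correct_scan_skew(float tdoa_s, int rank_ref, int rank_test, float t_conv_s) {
    /* A channel converted later sees the signal slightly "in the future",
     * which makes the earlier channel appear to lag by the same amount. */
    return tdoa_s - (float)(rank_test - rank_ref) * t_conv_s;
}
```

At audio rates the effect is small: 1 us of skew is 1/62.5 of a sample at 16 kHz. It becomes significant when the sample period approaches the conversion time, as in ultrasonic arrays.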

TDOA via CMSIS-DSP

#include "arm_math.h"

#define FS        16000   // sample rate in Hz (match your timer-trigger rate)
#define MAX_LAG   4       // search range in samples
#define CORR_LEN  (2 * FRAME_SIZE - 1)

static float32_t corr_output[CORR_LEN];

// Requires MAX_LAG < n <= FRAME_SIZE
float estimate_tdoa_cmsis(float32_t *ch_ref, float32_t *ch_test, int n) {
    if (n <= MAX_LAG || n > FRAME_SIZE) return 0.0f;  // bounds guard

    arm_correlate_f32(ch_ref, n, ch_test, n, corr_output);

    // With equal-length inputs, zero lag sits at index (n - 1).
    // Search +/-MAX_LAG around it.
    int centre = n - 1;
    float32_t max_val;
    uint32_t max_idx;
    arm_max_f32(&corr_output[centre - MAX_LAG],
                2 * MAX_LAG + 1, &max_val, &max_idx);

    int lag = (int)max_idx - MAX_LAG;
    return (float)lag / FS;  // TDOA in seconds
}

Tip

For a 2-mic system where you only need lags \(\pm 4\), the brute-force loop (9 dot products of 256 samples each) is faster than arm_correlate_f32 (which computes all 511 lags). Use arm_dot_prod_f32 in a loop, similar to the pitch detection restricted-lag approach.
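The restricted-lag approach in the tip can be sketched as a loop of dot products over the overlapping part of the two frames. On the STM32 each dot product would call `arm_dot_prod_f32`; a portable fallback is included so the sketch also builds on a PC. `USE_CMSIS_DSP` is a hypothetical build flag and `restricted_lag_peak` a hypothetical name, and the lag sign convention matches the brute-force ESP32 version.

```c
#include <stdint.h>

#ifdef USE_CMSIS_DSP            /* hypothetical build flag */
#include "arm_math.h"
/* Older CMSIS-DSP releases declare the inputs without const; cast if needed. */
static float dot(const float *a, const float *b, uint32_t len) {
    float32_t r;
    arm_dot_prod_f32(a, b, len, &r);
    return r;
}
#else
/* Portable stand-in with the same behaviour. */
static float dot(const float *a, const float *b, uint32_t len) {
    float r = 0.0f;
    for (uint32_t i = 0; i < len; i++) r += a[i] * b[i];
    return r;
}
#endif

/* Returns the lag in [-max_lag, +max_lag] that maximises
 * sum_i ref[i] * test[i - lag]. Assumes 0 <= max_lag < n.
 * A positive result means ref lags test. */
int restricted_lag_peak(const float *ref, const float *test, int n, int max_lag) {
    float best = -1e30f;
    int best_lag = 0;
    for (int lag = -max_lag; lag <= max_lag; lag++) {
        /* Point both operands at the start of their overlapping region. */
        const float *a = (lag > 0) ? ref + lag : ref;
        const float *b = (lag > 0) ? test : test - lag;
        int len = n - (lag > 0 ? lag : -lag);
        float c = dot(a, b, (uint32_t)len);
        if (c > best) { best = c; best_lag = lag; }
    }
    return best_lag;
}
```

With `max_lag = 4` and 256-sample frames this evaluates 9 dot products, where the full correlation evaluates 511 lags.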

Performance budget (NUCLEO-F446RE, 180 MHz)

Stage                                Operation                 Est. cycles   Time
ADC DMA deinterleave (4 ch)          1024 conversions + copy   ~3K           17 us
TDOA: 9 dot products (256 samples)   2304 MACs                 ~3K           17 us
Angle computation                    asinf                     ~100          0.6 us
Total per 16 ms frame                                          ~6K           ~35 us
Available per frame                                            2,880K        16 ms
Utilisation                                                    ~0.2%

For beam scanning (37 angles at 5-degree steps), assume each steering angle costs about as much as the TDOA stage (~3K cycles for one delay, sum and power pass over 256 samples). That gives 37 × 3K ≈ 111K cycles per frame, or about 4% of the 2,880K-cycle budget.
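The budget figures follow from three small conversions, sketched below so they can be rerun for a different clock, frame size or sample rate. The function names are illustrative, and the cycle counts fed in are the estimates from the table, not measurements.

```c
/* Convert a cycle count to microseconds at a given core clock. */
float cycles_to_us(float cycles, float clock_hz) {
    return cycles / clock_hz * 1e6f;
}

/* Cycles available during one frame of frame_size samples at fs_hz. */
float frame_budget_cycles(int frame_size, float fs_hz, float clock_hz) {
    return (float)frame_size / fs_hz * clock_hz;
}

/* CPU load as a percentage of the frame budget. */
float utilisation_pct(float cycles_used, float budget_cycles) {
    return 100.0f * cycles_used / budget_cycles;
}
```

At 180 MHz with 256-sample frames at 16 kHz, the budget is 2,880,000 cycles, 3K cycles take about 16.7 us, and the 111K-cycle beam scan comes to roughly 3.9% of the frame.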


Platform comparison

Feature                   STM32F4 (NUCLEO-F446RE)                          ESP32-S3
Multi-channel input       ADC scan mode (4+ channels, DMA)                 2x I2S stereo (4 channels max)
Correlation library       arm_correlate_f32, arm_dot_prod_f32              Manual or ESP-DSP dot product
Max channels (practical)  8+ (ADC scan)                                    4 (2 stereo I2S)
Sample rate               Up to 2.4 Msps (ultrasonic arrays)               16–48 kHz (audio arrays)
Wireless output           External module                                  Built-in WiFi/BLE
Best for                  Ultrasonic/vibration arrays, high channel count  Audio mic arrays, smart speaker prototypes

Recommendation

  • ESP32-S3 for audio microphone arrays: voice direction detection, smart speaker prototyping, noise reduction. The I2S interface handles MEMS microphones directly, and BLE/WiFi enables wireless DOA output.
  • STM32F4 for non-audio sensor arrays: ultrasonic transducer arrays (ranging), vibration or seismic sensors, or any application needing more than 4 channels or MHz-rate sampling.

Bill of materials

ESP32-S3 stereo mic array (2-mic DOA)

Component               Purpose                    Approx. cost
ESP32-S3-DevKitC        Processing + BLE/WiFi      EUR 8
2x INMP441 breakout     I2S MEMS stereo mic pair   EUR 4
SSD1306 OLED (128x64)   DOA angle display          EUR 3
3D-printed mic mount    Fixed 50 mm spacing        EUR 1
Breadboard + wires                                 EUR 3
Total                                              ~EUR 19

ESP32-S3 quad mic array (4-mic DOA)

Component                 Purpose                 Approx. cost
ESP32-S3-DevKitC          Processing + BLE/WiFi   EUR 8
4x INMP441 breakout       2x stereo I2S pairs     EUR 8
SSD1306 OLED (128x64)     DOA display             EUR 3
3D-printed linear mount   Fixed 50 mm spacing     EUR 2
Breadboard + wires                                EUR 3
Total                                             ~EUR 24