Beamforming on Hardware
Multi-microphone direction-of-arrival estimation on ESP32-S3 and STM32F4
A microphone array on a microcontroller turns the delay-and-sum beamformer from a simulation into a real-time direction finder. The challenge is multi-channel synchronous acquisition — all microphones must be sampled at the same instant, with known inter-channel timing. I2S handles this naturally for 2 channels (stereo); more channels require either multiple I2S peripherals or TDM mode.
For the theory (array geometry, beam patterns, TDOA estimation, frequency-domain beamforming), see the main beamforming page.
Array hardware
Microphone array options
| Configuration | Mics | Channels needed | I2S requirement | Use case |
|---|---|---|---|---|
| 2-mic stereo | 2x INMP441 | 2 (L+R on one I2S) | 1 I2S peripheral | Left/right DOA, noise reduction |
| 4-mic linear | 4x INMP441 | 4 (2 stereo I2S) | 2 I2S peripherals | Azimuth DOA, moderate resolution |
| 4-mic square | 4x INMP441 | 4 | 2 I2S peripherals | 2D DOA (azimuth + elevation) |
The INMP441 has a left/right channel select pin (L/R): tie it low for left channel, high for right. Two INMP441 modules on one I2S bus give synchronous stereo — the simplest array.
2-mic array on ESP32-S3
         d = 50 mm
        |<-------->|
    [MIC0 (L)] [MIC1 (R)]
        |          |
       |-- I2S_0 ---|
             |
         ESP32-S3
With \(d = 50\) mm spacing and sound speed \(c = 343\) m/s:
- Maximum inter-mic delay: \(\tau_\text{max} = d/c = 146\) us
- At \(f_s = 16\) kHz: \(\tau_\text{max} \approx 2.3\) samples
- Angular resolution limited by array size — adequate for left/right/centre classification
4-mic linear array on ESP32-S3
The ESP32-S3 has 2 I2S peripherals. Each runs in stereo, giving 4 synchronous channels:
           d = 50 mm
       |<-------->|
    [M0(L)] [M1(R)] [M2(L)] [M3(R)]
       |       |       |       |
      |-- I2S0 -|     |-- I2S1 -|
           |               |
               ESP32-S3
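Both I2S peripherals can be driven from the same clock source (the ESP32-S3 has no APLL; the I2S default is the 160 MHz PLL), so their sample clocks run at the same rate and do not drift apart. However, the two channels are enabled by separate driver calls, so their DMA streams may not start at exactly the same instant. For precise TDOA estimation, calibrate the inter-peripheral offset by correlating a known reference signal at startup, or share the BCLK and word-select (WS) lines between the two peripherals (one as master, one as slave) and enable them back to back.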
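The startup calibration can be sketched as a cross-correlation between one channel from each peripheral while a reference sound is playing, for example a click from a source placed equidistant from the two microphones. This is a minimal illustration: `estimate_channel_offset`, `CAL_MAX_LAG` and the equidistant-source setup are assumptions, not part of ESP-IDF.

```c
#include <stddef.h>

#define CAL_MAX_LAG 8  /* assumed search range in samples */

/* Returns the lag k (in samples) that maximises sum_i a[i] * b[i - k].
 * A positive result means stream a is late relative to stream b. */
int estimate_channel_offset(const float *a, const float *b, size_t n) {
    float best = -1e30f;
    int best_lag = 0;
    for (int lag = -CAL_MAX_LAG; lag <= CAL_MAX_LAG; lag++) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            long j = (long)i - lag;        /* index into b for this lag */
            if (j >= 0 && j < (long)n)
                sum += a[i] * b[j];
        }
        if (sum > best) {
            best = sum;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

The returned offset would then be subtracted from any lag measured between microphones that sit on different peripherals.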
ESP32-S3: delay-and-sum beamformer
I2S stereo microphone setup (2-mic)
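#include "driver/i2s_std.h"
#include "esp_err.h"

#define FS 16000
#define FRAME_SIZE 256

static i2s_chan_handle_t rx_chan;

void i2s_mic_array_init(void) {
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
        I2S_NUM_0, I2S_ROLE_MASTER);
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, NULL, &rx_chan));  // RX only

    i2s_std_config_t std_cfg = {
        .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(FS),
        // INMP441 sends 24-bit samples in the upper bits of each 32-bit slot
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
        .gpio_cfg = {
            // Example pins: adjust to your wiring. GPIO22-25 do not exist
            // on the ESP32-S3, and GPIO26 is typically wired to flash/PSRAM.
            .mclk = I2S_GPIO_UNUSED,  // INMP441 needs no master clock
            .bclk = GPIO_NUM_4,
            .ws   = GPIO_NUM_5,
            .din  = GPIO_NUM_6,
            .dout = I2S_GPIO_UNUSED,
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(rx_chan));
}

Deinterleave stereo to two channels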
I2S stereo data arrives interleaved: L, R, L, R, … Deinterleave into separate buffers:
void deinterleave(int32_t *interleaved, float *ch0, float *ch1, int n_frames) {
for (int i = 0; i < n_frames; i++) {
ch0[i] = (float)(interleaved[2*i] >> 8) / 8388608.0f;
ch1[i] = (float)(interleaved[2*i + 1] >> 8) / 8388608.0f;
}
}

TDOA estimation via cross-correlation
The time-difference of arrival between two microphones is the lag of the cross-correlation peak. For a 2-mic array, one TDOA value gives the angle of arrival:
#include <math.h>
#define MAX_LAG 4 // max lag in samples (~250 us at 16 kHz)
// Cross-correlation for small lag range (brute-force, fast for small MAX_LAG)
float estimate_tdoa(float *ch0, float *ch1, int n) {
float best_corr = -1e30f;
int best_lag = 0;
for (int lag = -MAX_LAG; lag <= MAX_LAG; lag++) {
float sum = 0;
int start = (lag > 0) ? lag : 0;
int end = (lag > 0) ? n : n + lag;
for (int i = start; i < end; i++) {
sum += ch0[i] * ch1[i - lag];
}
if (sum > best_corr) {
best_corr = sum;
best_lag = lag;
}
}
return (float)best_lag / FS; // TDOA in seconds
}
// Convert TDOA to angle of arrival
float tdoa_to_angle(float tdoa, float mic_spacing) {
float sin_theta = tdoa * 343.0f / mic_spacing;
// Clamp to valid range (numerical errors can push slightly outside [-1,1])
if (sin_theta > 1.0f) sin_theta = 1.0f;
if (sin_theta < -1.0f) sin_theta = -1.0f;
return asinf(sin_theta) * 180.0f / M_PI; // degrees from broadside
}

Delay-and-sum with fractional delay
Integer-sample delay is simple (index offset), but for angles that produce fractional-sample TDOA, linear interpolation improves accuracy:
// Apply fractional delay using linear interpolation.
// delay_samples can be positive or negative.
void apply_fractional_delay(float *in, float *out, int n, float delay_samples) {
int d_int = (int)floorf(delay_samples);
float frac = delay_samples - d_int;
for (int i = 0; i < n; i++) {
int idx = i - d_int;
if (idx >= 1 && idx < n) {
out[i] = (1.0f - frac) * in[idx] + frac * in[idx - 1];
} else if (idx >= 0 && idx < n) {
out[i] = (1.0f - frac) * in[idx]; // no past sample available
} else {
out[i] = 0.0f;
}
}
}
// Delay-and-sum beamformer for 2 channels
void beam_steer(float *ch0, float *ch1, float *output, int n,
float angle_deg, float mic_spacing) {
if (n > FRAME_SIZE) n = FRAME_SIZE; // bounds guard
float theta = angle_deg * M_PI / 180.0f;
float delay_s = mic_spacing * sinf(theta) / 343.0f;
float delay_samples = delay_s * FS;
static float ch1_delayed[FRAME_SIZE]; // static: avoid stack overflow in FreeRTOS task
apply_fractional_delay(ch1, ch1_delayed, n, delay_samples);
for (int i = 0; i < n; i++)
output[i] = 0.5f * (ch0[i] + ch1_delayed[i]);
}

Beam scanning for DOA estimation
Scan across angles and find the steering direction that maximises output power:
float estimate_doa(float *ch0, float *ch1, int n, float mic_spacing) {
float best_power = 0;
float best_angle = 0;
static float output[FRAME_SIZE]; // static: avoid stack overflow
// Scan from -90 to +90 degrees in 5-degree steps
for (int angle = -90; angle <= 90; angle += 5) {
beam_steer(ch0, ch1, output, n, (float)angle, mic_spacing);
// Compute output power
float power = 0;
for (int i = 0; i < n; i++)
power += output[i] * output[i];
if (power > best_power) {
best_power = power;
best_angle = (float)angle;
}
}
return best_angle;
}

Main task
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

void beamforming_task(void *param) {
    // static: keeps 4 KB of frame buffers off the task stack
    static int32_t raw[FRAME_SIZE * 2];  // stereo interleaved
    static float ch0[FRAME_SIZE], ch1[FRAME_SIZE];
    size_t bytes_read;
    const float mic_spacing = 0.05f;     // 50 mm

    while (1) {
        if (i2s_channel_read(rx_chan, raw, sizeof(raw),
                             &bytes_read, portMAX_DELAY) != ESP_OK) {
            continue;  // skip this frame on a read error
        }
        int n = bytes_read / (2 * sizeof(int32_t));
        if (n > FRAME_SIZE) n = FRAME_SIZE;  // clamp to buffer size
        deinterleave(raw, ch0, ch1, n);

        // Choose one method; both are shown for comparison.
        // Method 1: TDOA via cross-correlation (fast, integer-sample resolution)
        float tdoa = estimate_tdoa(ch0, ch1, n);
        float angle_xcorr = tdoa_to_angle(tdoa, mic_spacing);

        // Method 2: beam scan (more robust, fractional-sample, but slower)
        float angle_scan = estimate_doa(ch0, ch1, n, mic_spacing);

        // Output via UART here; an OLED or BLE sink would replace printf
        printf("xcorr: %.1f deg, scan: %.1f deg\n", angle_xcorr, angle_scan);
    }
}

void app_main(void) {
    i2s_mic_array_init();
    xTaskCreatePinnedToCore(beamforming_task, "beam", 8192,
                            NULL, 5, NULL, 1);
}

STM32F4 (NUCLEO-F446RE): CMSIS-DSP cross-correlation
The STM32F4 approach uses arm_correlate_f32 for the TDOA estimation and multi-channel ADC with DMA for sensor acquisition.
Multi-channel ADC setup
For non-audio sensor arrays (vibration sensors, ultrasonic transducers), use the on-chip ADC in scan mode with DMA:
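// ADC1 scans 4 channels in sequence via DMA. Scan mode converts one
// channel after another, so the channels are NOT sampled simultaneously:
// each is offset from the previous one by one conversion time (~1 us at
// 12-bit resolution). Account for this fixed skew when estimating TDOA.
// Scan rate: configure a timer trigger for the desired sample rate.
#define N_CHANNELS 4
#define FRAME_SIZE 256

static uint16_t adc_dma_buf[2][FRAME_SIZE * N_CHANNELS];  // double buffer
static float channels[N_CHANNELS][FRAME_SIZE];

// DMA transfer-complete callback: assuming circular DMA over both halves,
// adc_dma_buf[1] is ready here. Handle adc_dma_buf[0] the same way in
// HAL_ADC_ConvHalfCpltCallback.
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc) {
    uint16_t *buf = adc_dma_buf[1];
    // Deinterleave scan data: ch0, ch1, ch2, ch3, ch0, ch1, ...
    for (int i = 0; i < FRAME_SIZE; i++) {
        for (int ch = 0; ch < N_CHANNELS; ch++) {
            // Centre on mid-scale and scale to [-1, 1); assumes the sensor
            // is biased at VREF/2. A DC offset would bias the correlation.
            channels[ch][i] =
                ((float)buf[i * N_CHANNELS + ch] - 2048.0f) / 2048.0f;
        }
    }
}

TDOA via CMSIS-DSP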
#include "arm_math.h"

// Example values matching the ESP32-S3 code; FS must match the ADC trigger rate.
#define FS 16000
#define MAX_LAG 4
#define CORR_LEN (2 * FRAME_SIZE - 1)

static float32_t corr_output[CORR_LEN];

// n must not exceed FRAME_SIZE, so that the 2n - 1 outputs fit corr_output
float estimate_tdoa_cmsis(float32_t *ch_ref, float32_t *ch_test, int n) {
    arm_correlate_f32(ch_ref, n, ch_test, n, corr_output);

    // For equal-length inputs, zero lag is at index (n - 1).
    // Search around zero lag within +/-MAX_LAG.
    int centre = n - 1;
    float32_t max_val;
    uint32_t max_idx;
    arm_max_f32(&corr_output[centre - MAX_LAG],
                2 * MAX_LAG + 1, &max_val, &max_idx);

    int lag = (int)max_idx - MAX_LAG;
    return (float)lag / FS;
}

For a 2-mic system where you only need lags \(\pm 4\), the brute-force loop (9 dot products of up to 256 samples each) is faster than arm_correlate_f32, which computes all 511 lags. Use arm_dot_prod_f32 in a loop, similar to the restricted-lag approach used for pitch detection.
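A restricted-lag version might look like the sketch below. To stay portable it uses a plain C helper, `dot_f32` (hypothetical), where a Cortex-M4 build would call `arm_dot_prod_f32(pSrcA, pSrcB, blockSize, &result)` instead. `LAG_RANGE` and `SAMPLE_RATE` are stand-ins for the `MAX_LAG` and `FS` values used earlier, and `n` is assumed to be larger than `LAG_RANGE`.

```c
#include <stdint.h>

#define LAG_RANGE   4         /* stand-in for MAX_LAG */
#define SAMPLE_RATE 16000.0f  /* stand-in for FS */

/* Portable dot product; swap for arm_dot_prod_f32 on Cortex-M4. */
static float dot_f32(const float *a, const float *b, uint32_t len) {
    float acc = 0.0f;
    for (uint32_t i = 0; i < len; i++)
        acc += a[i] * b[i];
    return acc;
}

/* TDOA in seconds from 2*LAG_RANGE+1 dot products.
 * A positive result means ref[i] matches test[i - lag]: ref lags test. */
float estimate_tdoa_restricted(const float *ref, const float *test, int n) {
    float best = -1e30f;
    int best_lag = 0;
    for (int lag = -LAG_RANGE; lag <= LAG_RANGE; lag++) {
        /* Overlapping region of ref[i] and test[i - lag] */
        const float *pa = (lag > 0) ? ref + lag : ref;
        const float *pb = (lag > 0) ? test : test - lag;
        int len = n - (lag > 0 ? lag : -lag);
        float c = dot_f32(pa, pb, (uint32_t)len);
        if (c > best) {
            best = c;
            best_lag = lag;
        }
    }
    return (float)best_lag / SAMPLE_RATE;
}
```

Each lag costs one dot product over the overlapping samples, so the work scales with the lag range and not with the full correlation length.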
Performance budget (NUCLEO-F446RE, 180 MHz)
| Stage | Operation | Est. cycles | Time |
|---|---|---|---|
| ADC DMA deinterleave (4 ch) | 1024 conversions + copy | ~3K | 17 us |
| TDOA: 9 dot products (256 samples) | 2304 MACs | ~3K | 17 us |
| Angle computation | asinf | ~100 | 0.6 us |
| Total per 16 ms frame | | ~6K | ~35 us |
| Available per frame | | 2,880K | 16 ms |
| Utilisation | | | ~0.2% |
For beam scanning (37 angles at 5-degree steps), assume each steering angle costs about as much as the TDOA stage (~3K cycles): 37 × 3K ≈ 111K cycles per frame, or about 4% of the 2,880K-cycle budget.
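The budget arithmetic can be reproduced with a few lines of C. The helper names below are illustrative, and the cycle counts fed into them are the estimates from the table, not measurements.

```c
/* Cycles available per frame: core clock times frame duration. */
double cycles_per_frame(double core_hz, int frame_size, double fs) {
    return core_hz * (double)frame_size / fs;
}

/* Fraction of the per-frame cycle budget consumed by a processing stage. */
double utilisation(double stage_cycles, double core_hz,
                   int frame_size, double fs) {
    return stage_cycles / cycles_per_frame(core_hz, frame_size, fs);
}
```

With a 180 MHz core, 256-sample frames and a 16 kHz sample rate, `cycles_per_frame` gives 2,880,000 cycles. The ~6K-cycle pipeline then comes to about 0.2% utilisation and the ~111K-cycle beam scan to just under 4%.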
Platform comparison
| Feature | STM32F4 (NUCLEO-F446RE) | ESP32-S3 |
|---|---|---|
| Multi-channel input | ADC scan mode (4+ channels, DMA) | 2x I2S stereo (4 channels max) |
| Correlation library | arm_correlate_f32, arm_dot_prod_f32 | Manual or ESP-DSP dot product |
| Max channels (practical) | 8+ (ADC scan) | 4 (2 stereo I2S) |
| Sample rate | Up to 2.4 Msps (ultrasonic arrays) | 16–48 kHz (audio arrays) |
| Wireless output | External module | Built-in WiFi/BLE |
| Best for | Ultrasonic/vibration arrays, high channel count | Audio mic arrays, smart speaker prototypes |
Recommendation
- ESP32-S3 for audio microphone arrays: voice direction detection, smart speaker prototyping, noise reduction. The I2S interface handles MEMS microphones directly, and BLE/WiFi enables wireless DOA output.
- STM32F4 for non-audio sensor arrays: ultrasonic transducer arrays (vibration, ranging), seismic sensors, or any application needing more than 4 channels or MHz-rate sampling.
Bill of materials
ESP32-S3 stereo mic array (2-mic DOA)
| Component | Purpose | Approx. cost |
|---|---|---|
| ESP32-S3-DevKitC | Processing + BLE/WiFi | EUR 8 |
| 2x INMP441 breakout | I2S MEMS stereo mic pair | EUR 4 |
| SSD1306 OLED (128x64) | DOA angle display | EUR 3 |
| 3D-printed mic mount | Fixed 50 mm spacing | EUR 1 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 19 |
ESP32-S3 quad mic array (4-mic DOA)
| Component | Purpose | Approx. cost |
|---|---|---|
| ESP32-S3-DevKitC | Processing + BLE/WiFi | EUR 8 |
| 4x INMP441 breakout | 2x stereo I2S pairs | EUR 8 |
| SSD1306 OLED (128x64) | DOA display | EUR 3 |
| 3D-printed linear mount | Fixed 50 mm spacing | EUR 2 |
| Breadboard + wires | | EUR 3 |
| Total | | ~EUR 24 |