Gammatone Bank on Hardware

A real-time auditory front end on STM32F4 and ESP32

A gammatone filter bank is one of the cheapest useful auditory front ends you can put on a microcontroller, and the reason is the result from the theory page: every channel is a cascade of four biquads. We already know how to run biquads in real time on these boards (see Biquad on Hardware), so a gammatone bank is just that same inner loop run once per channel. Nothing new has to be written for the filtering itself; the only new questions are how to lay out the coefficients and how many channels fit in the time budget.

This page targets the ESP32-S3 and the NUCLEO-F446RE, following the embedded conventions in ADR-005: single-precision float first, and the negated denominator-coefficient convention so the inner loop adds instead of subtracts. For the mathematics and the Python reference, see the main page.

Exporting coefficients from Python

The bank is designed once, offline, in Python. Each channel produces a (4, 6) SOS matrix; we flatten it into the five coefficients per section that the C code expects, negating a1 and a2 to match the embedded convention. The result is a plain C table that you paste into your firmware.

import numpy as np
from gammatone import erb_space, make_filterbank

fs = 16000
cfs = erb_space(200, 7000, 8)        # 8 channels for a compact example
bank = make_filterbank(fs, cfs)

print(f"// Gammatone bank: {len(cfs)} channels x 4 biquads, fs = {fs} Hz")
print(f"// Coefficients {{b0, b1, b2, -a1, -a2}} per section (negated convention)")
print(f"#define N_CHANNELS {len(cfs)}")
print(f"#define N_SECTIONS 4")
print("static const float gt_coeffs[N_CHANNELS][N_SECTIONS][5] = {")
for ch, sos in enumerate(bank):
    rows = []
    for s in sos:
        b0, b1, b2, _, a1, a2 = s
        rows.append(f"{{{b0:+.6e},{b1:+.6e},{b2:+.6e},{-a1:+.6e},{-a2:+.6e}}}")
    print(f"  {{ {', '.join(rows)} }},  // ch {ch}: {cfs[ch]:.0f} Hz")
print("};")

// Gammatone bank: 8 channels x 4 biquads, fs = 16000 Hz
// Coefficients {b0, b1, b2, -a1, -a2} per section (negated convention)
#define N_CHANNELS 8
#define N_SECTIONS 4
static const float gt_coeffs[N_CHANNELS][N_SECTIONS][5] = {
  { {+2.268354e-07,+0.000000e+00,+0.000000e+00,+1.951824e+00,-9.583074e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.956777e+00,-9.635936e-01}, {+1.000000e+00,-1.957244e+00,+9.231267e-01,+1.957711e+00,-9.636719e-01}, {+1.000000e+00,-1.957244e+00,+9.566830e-01,+1.962664e+00,-9.689875e-01} },  // ch 0: 200 Hz
  { {+1.118939e-06,+0.000000e+00,+0.000000e+00,+1.918181e+00,-9.445324e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.919820e+00,-9.458686e-01}, {+1.000000e+00,-1.919751e+00,+7.774661e-01,+1.919682e+00,-9.462298e-01}, {+1.000000e+00,-1.919751e+00,+9.171250e-01,+1.921321e+00,-9.475687e-01} },  // ch 1: 413 Hz
  { {+5.469281e-06,+0.000000e+00,+0.000000e+00,+1.839020e+00,-9.191707e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.838520e+00,-9.193106e-01}, {+1.000000e+00,-1.839873e+00,+4.147455e-01,+1.841229e+00,-9.213402e-01}, {+1.000000e+00,-1.839873e+00,+8.335794e-01,+1.840721e+00,-9.214722e-01} },  // ch 2: 732 Hz
  { {+2.637444e-05,-4.407743e-05,-1.000227e-05,+1.670574e+00,-8.823718e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.670753e+00,-8.829975e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.671683e+00,-8.832211e-01}, {+1.000000e+00,-1.671218e+00,+6.665241e-01,+1.671861e+00,-8.838470e-01} },  // ch 3: 1210 Hz
  { {+1.246564e-04,-1.652592e-04,-2.291707e-04,+1.325519e+00,-8.297585e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.325237e+00,-8.299960e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+1.326197e+00,-8.303843e-01}, {+1.000000e+00,-1.325717e+00,+3.723292e-01,+1.325915e+00,-8.306217e-01} },  // ch 4: 1925 Hz
  { {+5.719015e-04,-3.820978e-04,-2.086938e-03,+6.680680e-01,-7.566249e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+6.678808e-01,-7.567934e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,+6.683557e-01,-7.568714e-01}, {+1.000000e+00,-6.681182e-01,+8.903422e-04,+6.681684e-01,-7.570399e-01} },  // ch 5: 2996 Hz
  { {+2.511275e-03,+9.512850e-04,-9.029674e-03,-3.787851e-01,-6.587651e-01}, {+1.000000e+00,+3.788055e-01,-7.102874e-02,-3.785829e-01,-6.589187e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,-3.790282e-01,-6.589708e-01}, {+1.000000e+00,+0.000000e+00,+0.000000e+00,-3.788259e-01,-6.591244e-01} },  // ch 6: 4600 Hz
  { {+1.056254e-02,+0.000000e+00,+0.000000e+00,-1.351565e+00,-5.350786e-01}, {+1.000000e+00,+1.352209e+00,+0.000000e+00,-1.352323e+00,-5.354433e-01}, {+1.000000e+00,+4.890505e-15,+0.000000e+00,-1.352093e+00,-5.356484e-01}, {+1.000000e+00,+1.352209e+00,+4.436607e-01,-1.352852e+00,-5.360135e-01} },  // ch 7: 7000 Hz
};

Negated denominator coefficients

As on the biquad page, the stored a1, a2 are the negatives of the transfer-function coefficients, so the update reads w0 = x + a1*w1 + a2*w2. The export above already applies the sign flip. This matches CMSIS-DSP and saves a negate in the inner loop.

STM32F4 / ESP32: the bank as nested biquad cascades

A single channel is the biquad cascade you have already seen. A bank is an array of them. Because every channel reads the same input sample and writes its own output, the state stays per channel and the structure is a simple nested loop.

#define N_CHANNELS  32
#define N_SECTIONS  4

typedef struct {
    float b0, b1, b2, a1, a2;   // a1, a2 negated
    float w1, w2;               // state, persisted between samples
} biquad_t;

typedef struct {
    biquad_t section[N_SECTIONS];
} channel_t;

static channel_t bank[N_CHANNELS];

// One input sample in, one value per channel out.
void gammatone_bank_process(float x, float out[N_CHANNELS])
{
    for (int c = 0; c < N_CHANNELS; c++) {
        float s = x;
        for (int k = 0; k < N_SECTIONS; k++) {
            biquad_t *bq = &bank[c].section[k];
            float w0 = s + bq->a1 * bq->w1 + bq->a2 * bq->w2;  // DF-II
            s        = bq->b0 * w0 + bq->b1 * bq->w1 + bq->b2 * bq->w2;
            bq->w2   = bq->w1;
            bq->w1   = w0;
        }
        out[c] = s;
    }
}

Loading the coefficients is a flat copy from the exported table:

void gammatone_bank_init(const float coeffs[N_CHANNELS][N_SECTIONS][5])
{
    for (int c = 0; c < N_CHANNELS; c++) {
        for (int k = 0; k < N_SECTIONS; k++) {
            biquad_t *bq = &bank[c].section[k];
            bq->b0 = coeffs[c][k][0];
            bq->b1 = coeffs[c][k][1];
            bq->b2 = coeffs[c][k][2];
            bq->a1 = coeffs[c][k][3];   // already -a1
            bq->a2 = coeffs[c][k][4];   // already -a2
            bq->w1 = bq->w2 = 0.0f;
        }
    }
}

That is the entire filter bank. The state cost is N_CHANNELS * N_SECTIONS * 2 floats: for 32 channels, 256 floats, about 1 KB. The coefficient table is N_CHANNELS * N_SECTIONS * 5 floats, about 2.5 KB of read-only data. Both fit comfortably in the on-chip SRAM of either board.

From subbands to a feature: channel energy

The raw subband outputs are rarely the end product. For most listening tasks you want the energy in each channel over a short window, which is the live equivalent of the cochleagram on the theory page. A leaky integrator (a one-pole smoother) on the squared output is enough and costs almost nothing:

// N_CHANNELS and gammatone_bank_process() are from the bank snippet above.
#define N_CHANNELS 32
void gammatone_bank_process(float x, float out[N_CHANNELS]);

static float energy[N_CHANNELS];
static const float alpha = 0.99f;   // ~ one-pole smoother, tune to frame length

void gammatone_bank_step(float x)
{
    float out[N_CHANNELS];
    gammatone_bank_process(x, out);
    for (int c = 0; c < N_CHANNELS; c++) {
        energy[c] = alpha * energy[c] + (1.0f - alpha) * (out[c] * out[c]);
    }
}

The energy[] vector, sampled every few milliseconds, is a compact auditory feature ready for a wake-word detector, a pitch tracker, or a simple sound classifier. The pitch detection page shows the I2S microphone plumbing that feeds samples into a loop like this one on the ESP32-S3.

How many channels fit?

This is the honest engineering question. One channel is four biquad sections; on a Cortex-M4F a section is roughly 10 cycles with the hardware FPU (see the biquad timing note), so a channel is about 40 cycles per input sample. At a sample rate \(f_s\) on a core clocked at \(f_\text{clk}\), each sample must be processed within \(f_\text{clk}/f_s\) cycles.

At \(f_s = 16\) kHz, the per-sample budget is 11250 cycles on the 180 MHz NUCLEO-F446RE and 15000 cycles on the 240 MHz ESP32-S3. The filter bank uses a small fraction of it:

Channels	Cycles/sample	NUCLEO-F446RE (11250)	ESP32-S3 (15000)
16	640	5.7 %	4.3 %
32	1280	11.4 %	8.5 %
64	2560	22.8 %	17.1 %

So a 32-channel bank, a common choice for speech, leaves roughly 90 % of the core free for the energy integration, the feature consumer, and the I/O. The filtering is not the bottleneck; the DMA buffering and whatever runs downstream usually are.

These are filtering cycles only

The table counts the biquad arithmetic. It excludes I2S DMA servicing, the energy smoother (one multiply-add per channel per sample), and any feature computation. Measure on hardware with DWT->CYCCNT (Cortex-M) or esp_cpu_get_cycle_count() (ESP32) before committing to a channel count near the budget.

CMSIS-DSP alternative on the NUCLEO

On the STM32, each channel maps directly onto arm_biquad_cascade_df1_f32, the same function the biquad page uses, instantiated once per channel:

#include "arm_math.h"
#include <string.h>

#define N_CHANNELS  32
#define N_SECTIONS  4
#define BLOCK       64   // samples per processing block (e.g. a DMA half-buffer)

static arm_biquad_casd_df1_inst_f32 ch_inst[N_CHANNELS];   // CMSIS type is "casd", not "casc"
static float32_t ch_coeffs[N_CHANNELS][5 * N_SECTIONS];   // {b0,b1,b2,-a1,-a2} x 4
static float32_t ch_state[N_CHANNELS][4 * N_SECTIONS];    // DF-I: 4 states/section

// gt_coeffs is the table exported from Python above. Its [N_SECTIONS][5]
// rows are row-major identical to ch_coeffs' [5 * N_SECTIONS] layout, and it
// already stores a1, a2 negated, exactly the CMSIS convention. Without this
// load step ch_coeffs stays zero (static) and the bank would output silence.
void bank_init_cmsis(const float gt_coeffs[N_CHANNELS][N_SECTIONS][5]) {
    for (int c = 0; c < N_CHANNELS; c++) {
        memcpy(ch_coeffs[c], gt_coeffs[c], 5 * N_SECTIONS * sizeof(float32_t));
        arm_biquad_cascade_df1_init_f32(&ch_inst[c], N_SECTIONS,
                                        ch_coeffs[c], ch_state[c]);
    }
}

void bank_process_block_cmsis(const float32_t *in, float32_t out[N_CHANNELS][BLOCK],
                              uint32_t block) {
    for (int c = 0; c < N_CHANNELS; c++) {
        arm_biquad_cascade_df1_f32(&ch_inst[c], in, out[c], block);
    }
}

CMSIS-DSP uses Direct Form I (four states per section instead of two), which costs a little more state but is the more robust structure if you later move to the Q15 or Q31 fixed-point variants for a power-constrained build. Note the memcpy load step: arm_biquad_cascade_df1_init_f32 only stores the pointer to the coefficient buffer, so the buffer must be filled from the exported table first, otherwise the static (zero) coefficients make the bank emit silence. Process in blocks (e.g. 64 samples from a DMA half-buffer) so the loop-unrolled inner code earns its keep.

ESP-DSP on the ESP32-S3

The ESP32-S3 has no CMSIS-DSP, but ESP-DSP provides dsps_biquad_f32_ae32, an assembly-optimised single biquad. Call it four times per channel, or just use the portable C cascade above. At 32 channels the portable version is already under 9 % of the budget, so the optimised path is rarely worth the extra complexity here.

Fixed-point note

Everything above is single-precision float, which is the right default on both boards (ADR-005). If you target an FPU-less or power-critical part, the gammatone bank inherits the biquad’s fixed-point story unchanged: Q15 or Q31 coefficients, a wide accumulator, and gain staging so no coefficient exceeds the representable range. The one wrinkle specific to the gammatone is that low-frequency channels have poles very close to the unit circle (they ring for a long time), so they are the most sensitive to coefficient quantisation. If a fixed-point build sounds wrong, check the low channels first. The biquad fixed-point section covers the mechanics.