Gabor Filters on Hardware

One 2-D convolution, four very different microcontrollers

A Gabor filter and a Difference of Gaussians are both, at the metal, a 2-D convolution: slide a small kernel over an image and accumulate a weighted sum at every pixel. That single operation is the whole embedded story, and it is a far harsher test of a microcontroller than the 1-D biquads of the gammatone bank. An image has thousands of pixels, the kernel is two-dimensional, and the intermediate results may not even fit in RAM.

This page takes one portable C convolution and runs it, in principle, across a deliberately wide capability ladder: an 8-bit AVR with no FPU, a Cortex-M0+, a Cortex-M4F, and a Cortex-M33 with a neural-processing unit. The point is exactly that the same algorithm is trivial on one end of the ladder and impossible on the other. The board set and the shared-library, multi-toolchain style are borrowed from the HAN Embedded Machine Learning course (Arends and Veen), where a single portable C feature library is compiled unchanged onto every target.

For the mathematics and the Python reference, see the main page. Conventions follow ADR-005: single-precision float first, fixed-point as the optimisation for the parts that need it.

The key idea: separable convolution

A general 2-D convolution with a $K \times K$ kernel costs $K^2$ multiply-accumulates per pixel. For a $64 \times 64$ image and a $15 \times 15$ kernel that is about 0.92 million MACs per orientation, which is a lot to ask of a small MCU.

The escape is separability. A 2-D Gaussian factors exactly into a horizontal 1-D Gaussian followed by a vertical one, so the DoG is separable with no approximation. The 2-D Gabor is separable along its own axes, so for an axis-aligned orientation it is exact, and for an off-axis orientation a separable (row-then-column) pass is a close approximation. Separating the convolution turns $K^2$ into $2K$ MACs per pixel:

Kernel form	MACs / pixel	MACs / orientation (64×64, K=15)
Full 2-D	$K^2 = 225$	921,600
Separable (H then V)	$2K = 30$	122,880

That is a 7.5× reduction at $K = 15$ ($K/2$ in general), and it is the difference between feasible and not on the lower rungs of the ladder.
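The counts in the table can be reproduced with a few lines of C. This is only a sanity-check sketch; the function names `macs_full_2d` and `macs_separable` are made up for illustration and are not part of the course library.

```c
#include <stdint.h>

// MACs for one orientation with a full K x K kernel: W * H * K^2.
uint32_t macs_full_2d(uint32_t W, uint32_t H, uint32_t K)
{
    return W * H * K * K;
}

// MACs for one orientation with a separable kernel: W * H * 2K
// (one horizontal pass plus one vertical pass of K taps each).
uint32_t macs_separable(uint32_t W, uint32_t H, uint32_t K)
{
    return W * H * 2u * K;
}
```

For `W = H = 64` and `K = 15` these return 921,600 and 122,880, matching the table.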

Portable separable convolution in C

The kernel below is plain C99 with no platform dependencies. It compiles unchanged on every target in the table; only the surrounding glue (how a frame is captured, where the buffers live) changes per board. This mirrors the EML course’s shared lib/ approach.

// Separable 2-D convolution: one horizontal pass (kx), one vertical pass (ky).
// img    : W*H input, row-major
// tmp    : W*H scratch (same size as img)
// out    : W*H output
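// kx, ky : 1-D kernels of length K (odd). For the DoG, or any axis-aligned
//          symmetric kernel, pass the same pointer for both. For a Gabor
//          channel, kx and ky are the two distinct 1-D factors of the
//          oriented kernel.
// Borders are clamped (edge replication). Taps are applied without flipping
// (correlation): identical to convolution for symmetric kernels, and only a
// sign change of the response for an odd-symmetric kernel.
// tmp must not overlap img or out.
// Float-first per ADR-005; see the fixed-point section for the no-FPU parts.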

void conv_separable(const float *img, float *tmp, float *out,
                    int W, int H, const float *kx, const float *ky, int K)
{
    int r = K / 2;  // half support

    // Horizontal pass: img -> tmp (uses kx)
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float acc = 0.0f;
            for (int i = -r; i <= r; i++) {
                int xx = x + i;
                if (xx < 0) xx = 0; else if (xx >= W) xx = W - 1;  // clamp
                acc += img[y * W + xx] * kx[i + r];
            }
            tmp[y * W + x] = acc;
        }
    }

    // Vertical pass: tmp -> out (uses ky)
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float acc = 0.0f;
            for (int i = -r; i <= r; i++) {
                int yy = y + i;
                if (yy < 0) yy = 0; else if (yy >= H) yy = H - 1;  // clamp
                acc += tmp[yy * W + x] * ky[i + r];
            }
            out[y * W + x] = acc;
        }
    }
}

For the DoG, pass the same 1-D Gaussian as both kx and ky, run the function twice (a narrow $\sigma_1$ and a wide $\sigma_2$), and subtract the two outputs. For a Gabor channel, kx and ky are the two 1-D factors of the oriented kernel: for an axis-aligned Gabor, a Gaussian-windowed sinusoid along the carrier axis and a plain Gaussian along the other. The complex response needs two such convolutions, one with the even (cosine) factor and one with the odd (sine) factor, and the magnitude is sqrt(even*even + odd*odd) per pixel. The 1-D kernel tables are generated offline in Python (the same gabor.py from the theory page) and baked into the firmware as const arrays, exactly as the gammatone coefficients were.

Separability is exact only on-axis

The row-then-column trick is mathematically exact for the Gaussian (hence the DoG) and for Gabor filters oriented along the image axes. For a 30° or 45° Gabor it is an approximation; the principled fix is a steerable or oriented-separable filter design (Freeman and Adelson 1991), which handles arbitrary orientations with a small, fixed set of separable basis filters, so the per-pixel cost stays proportional to $K$ rather than $K^2$. For a teaching front end the plain separable approximation is usually good enough; flag it if your application is orientation-critical.

The same code on four very different targets

Here is where the ladder bites. The arithmetic above is fixed, but the resources are not. A $64 \times 64$ image is 16 KB in float and 4 KB in int8, and an FPU changes the cost of a MAC by an order of magnitude. The “frame” column below counts one buffer; the separable routine above uses three (img, tmp, out), about 48 KB in float, unless you reduce it with an in-place or two-buffer variant.

Target	Core	Clock	FPU	RAM	A 64×64 float frame	Verdict for a Gabor/DoG pass
ATmega328P (Xplained Mini)	AVR 8-bit	16 MHz	none	2 KB	does not fit (16 KB ≫ 2 KB)	Full-frame float is impossible. Needs `int8`, a much smaller tile, and line-buffered streaming. The lower bound of the ladder.
FRDM-KL25Z	Cortex-M0+	48 MHz	none	16 KB	fills all of SRAM (16 KB), no room for stack or scratch	Feasible only in fixed-point (Q15) with a shared in-place buffer. Software float would dominate the budget.
NUCLEO-F411RE	Cortex-M4F	100 MHz	single-prec	128 KB	comfortable	The natural home. `float` separable conv, ~0.12 M MAC/orientation runs in roughly a millisecond; CMSIS-DSP accelerates the inner pass.
FRDM-MCXN947	Cortex-M33 ×2 + NPU	150 MHz	single-prec	512 KB	trivial	Float on the M33, or offload the convolution to the eIQ Neutron NPU, which is built for exactly this 2-D MAC workload. The top of the ladder.

The lesson is the one the EML course makes with feature extraction: a portable algorithm is necessary but not sufficient. What decides whether a vision front end ships is the memory to hold a frame and the arithmetic to convolve it in time, and those vary enormously across parts that all run the same C: RAM alone spans 2 KB to 512 KB, a factor of 256.
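The buffer arithmetic behind the table is easy to check mechanically. The sketch below uses hypothetical helpers (`frame_bytes`, `frames_fit`); the `reserve` argument stands for stack and other data, and its value is whatever the caller assumes for a given firmware.

```c
#include <stddef.h>
#include <stdbool.h>

// Bytes needed for n_buffers full W x H frames of elem_size-byte pixels.
size_t frame_bytes(size_t W, size_t H, size_t elem_size, size_t n_buffers)
{
    return W * H * elem_size * n_buffers;
}

// True if `need` bytes fit in ram_bytes after setting aside `reserve` bytes
// for stack and other data (the reserve figure is the caller's assumption).
bool frames_fit(size_t need, size_t ram_bytes, size_t reserve)
{
    return reserve <= ram_bytes && need <= ram_bytes - reserve;
}
```

With 4-byte floats, `frame_bytes(64, 64, 4, 3)` gives 49,152 bytes for the three-buffer routine, which fits the F411RE's 128 KB but none of the smaller parts.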

Clock figures

The NUCLEO-F411RE figure (100 MHz) is taken from the EML project’s CubeMX clock configuration. The others are the standard maximum core clocks for these parts (ATmega328P 16 MHz on the Xplained Mini, KL25Z 48 MHz, MCX-N947 dual M33 at 150 MHz). The millisecond estimate for the F411 (122,880 MACs at 100 MHz is about 1.2 ms) assumes roughly one single-precision MAC per cycle on the M4F FPU and ignores load, store, and loop overhead, so it is a best case; measure on hardware with DWT->CYCCNT before relying on it.
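The estimate above is a one-line calculation, sketched here so the assumption is explicit. `est_time_us` is a hypothetical helper, and `cycles_per_mac` is a guess the caller supplies rather than a measured figure.

```c
// Estimated time in microseconds for `macs` multiply-accumulates at clock_hz,
// assuming cycles_per_mac cycles each. cycles_per_mac = 1 is the optimistic
// figure used in the text; real loops also pay for loads, stores, and
// branching, so treat the result as a lower bound until measured.
double est_time_us(unsigned long macs, double clock_hz, double cycles_per_mac)
{
    return (double)macs * cycles_per_mac / clock_hz * 1e6;
}
```

At one cycle per MAC, 122,880 MACs at 100 MHz come to about 1,229 µs. Plugging in an illustrative 30 cycles per MAC at 48 MHz, in the spirit of the "tens of cycles" software-float cost, gives about 77 ms for the same pass.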

Walking down the ladder

Cortex-M4F (NUCLEO-F411RE): the comfortable case. With a hardware single-precision FPU, the float kernel above is already real-time for a small frame. The horizontal pass runs over contiguous rows, so it maps onto arm_conv_f32 or a hand-written MAC loop; note that arm_conv_f32 returns the full $W + K - 1$ samples with zero-padded edges, so the output must be trimmed and the clamped borders handled separately. The vertical pass reads tmp with a stride of $W$, so it needs either a column copy or a hand-unrolled loop. This M4F rung plays the same role (real-time float filtering) as the project’s existing biquad and gammatone embedded pages, though those target the default two-platform NUCLEO-F446RE rather than this ladder’s F411RE.

Cortex-M0+ (FRDM-KL25Z): integer or nothing. The M0+ has no FPU and no DSP instructions, so every float MAC is a software-library call costing tens of cycles. The fix is to drop to Q15 fixed-point: store the image and kernels as 16-bit integers, accumulate in a 32-bit int32_t, and shift down at the end of each pass. Two full int16 buffers (an 8 KB frame plus an 8 KB scratch) would consume the entire 16 KB SRAM and leave nothing for the stack, so an in-place, single-buffer separable pass with only a small line scratch is the realistic design.
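A minimal sketch of the in-place idea for the horizontal pass follows, assuming the taps are already in Q15 and scaled so the sum of their absolute values does not exceed 32768. The function name `conv_row_q15_inplace` is hypothetical. The vertical pass can be made in-place in the same spirit, but it has to keep a few unfiltered rows of history rather than one, which is not shown here.

```c
#include <stdint.h>

// In-place horizontal Q15 pass over one row of W pixels.
// row  : W samples in Q15, overwritten with the filtered row
// k    : K taps in Q15 (K odd), assumed scaled so sum(|k|) <= 32768
// line : scratch of W samples, reused for every row
void conv_row_q15_inplace(int16_t *row, int W, const int16_t *k, int K,
                          int16_t *line)
{
    int r = K / 2;
    for (int x = 0; x < W; x++) line[x] = row[x];   // save the unfiltered row

    for (int x = 0; x < W; x++) {
        int32_t acc = 0;                             // Q30 accumulator
        for (int i = -r; i <= r; i++) {
            int xx = x + i;
            if (xx < 0) xx = 0; else if (xx >= W) xx = W - 1;  // clamp
            acc += (int32_t)line[xx] * k[i + r];
        }
        // Round and return to Q15. Right-shifting a negative value is
        // implementation-defined in C; common embedded compilers shift
        // arithmetically, which is what this sketch assumes.
        acc = (acc + (1 << 14)) >> 15;
        if (acc > INT16_MAX) acc = INT16_MAX;        // saturate
        if (acc < INT16_MIN) acc = INT16_MIN;
        row[x] = (int16_t)acc;
    }
}
```

For a 64-pixel row the scratch is 128 bytes, so the whole horizontal pass costs one frame plus one line of RAM.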

AVR 8-bit (ATmega328P): the honest lower bound. With 2 KB of SRAM, a full image never lives in memory at once. A real implementation processes the image as a stream of rows, keeping only a sliding window of $K$ rows in a circular line buffer and emitting one output row at a time, in int8. For $W = 64$ and $K = 15$ that window is 960 bytes, which fits with room left for the stack. This is the same line-buffer technique used by camera and FPGA pipelines, and it is the most instructive rung precisely because the constraints force the streaming structure into the open. A full-frame, multi-orientation Gabor bank is simply out of scope here, and saying so plainly is the point.
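The streaming structure can be sketched as a ring of the last K rows. Everything here is illustrative: the type `line_buf_t`, the functions `lb_init`, `lb_push`, and `lb_emit`, the fixed sizes, and the Q7 tap scaling are assumptions, and only the vertical pass is shown. Rows near the top and bottom edges, which lack a full window, are simply not emitted.

```c
#include <stdint.h>

#define LB_W 64   // row width in pixels (assumed, matches the 64x64 example)
#define LB_K 15   // vertical taps, odd (assumed, matches K = 15)

// Sliding window of the last LB_K rows: LB_K * LB_W = 960 bytes in int8.
typedef struct {
    int8_t   ring[LB_K][LB_W];
    uint16_t rows_in;           // rows pushed since lb_init (reset per frame)
} line_buf_t;

void lb_init(line_buf_t *lb) { lb->rows_in = 0; }

// Store one incoming row, overwriting the oldest slot once the ring is full.
void lb_push(line_buf_t *lb, const int8_t *row)
{
    int8_t *dst = lb->ring[lb->rows_in % LB_K];
    for (int x = 0; x < LB_W; x++) dst[x] = row[x];
    lb->rows_in++;
}

// Returns 1 and writes one output row once LB_K rows are buffered, else 0.
// k holds LB_K taps in Q7, assumed scaled so sum(|k|) <= 128, which keeps
// the accumulator within int16_t. k[0] weights the oldest buffered row.
int lb_emit(const line_buf_t *lb, const int8_t *k, int8_t *out)
{
    if (lb->rows_in < LB_K) return 0;
    uint16_t oldest = lb->rows_in % LB_K;   // slot holding the oldest row
    for (int x = 0; x < LB_W; x++) {
        int16_t acc = 0;                    // Q14 accumulator
        for (int i = 0; i < LB_K; i++) {
            const int8_t *src = lb->ring[(oldest + i) % LB_K];
            acc += (int16_t)(src[x] * k[i]);
        }
        acc = (int16_t)((acc + 64) >> 7);   // round, back to Q7
        if (acc > INT8_MAX) acc = INT8_MAX; // saturate the positive extreme
        out[x] = (int8_t)acc;
    }
    return 1;
}
```

The caller pushes each captured row and calls `lb_emit` afterwards; the first output row appears once 15 rows have arrived, and one more follows for every row after that.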

Cortex-M33 + NPU (FRDM-MCXN947): offload it. The M33 alone handles float convolution comfortably, but the MCX-N947 also carries the eIQ Neutron NPU, a fixed-function accelerator for the multiply-accumulate-heavy convolutions of neural networks. A Gabor or DoG bank is structurally identical to a depthwise convolution layer, so it can be expressed as a tiny fixed-weight conv and run on the NPU while the cores do other work. This is the bridge from classical DSP to the TinyML world the EML course targets.

Fixed-point note (the no-FPU rungs)

On the AVR and the M0+, fixed-point is not an optimisation, it is the only option. The mechanics are the same as the biquad fixed-point section:

Store the image and the 1-D kernels in Q15 (16-bit, value = integer / 32768).
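Normalise the kernel first: scale the taps so that the sum of their absolute values is at most 1. A Gaussian normalised to sum to 1 satisfies this automatically because all its taps are positive; a Gabor kernel is zero-mean (it rejects DC), but zero mean alone does not bound the output, so it needs the absolute-sum scaling explicitly. This is what bounds each pass output to Q15 range, and it is the reason a 32-bit accumulator is enough: the accumulated $Q15 \times Q15$ products then never exceed about $2^{30}$ in magnitude. The guarantee comes from the kernel gain, not the tap count: the raw worst case of 15 full-scale $Q15 \times Q15$ products is about $2^{34}$, which would overflow int32_t, so an un-normalised or high-gain kernel needs a 64-bit accumulator or pre-scaling (the same accumulator-width caution as the biquad page).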
Shift the accumulator back to Q15 at the end of each pass, rounding rather than truncating to limit the bias.

The DoG is gentler than the Gabor here: both Gaussians are strictly positive and sum to one, so the only care needed is the subtraction at the end, which can produce negative values and therefore wants a signed output.
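The normalisation check can be done when the tables are generated. The sketch below quantises float taps to Q15 and reports the kernel's absolute-sum gain; `quantise_taps_q15` is a hypothetical helper, and in this project the equivalent step would more likely live in the offline Python that produces the const arrays.

```c
#include <stdint.h>

// Quantise K float taps to Q15 with round-to-nearest and saturation.
// Returns the sum of |taps| in Q15 units. A result <= 32768 means a pass
// over Q15 pixels stays within about 2^30 in magnitude, so an int32_t
// accumulator is safe and the output fits Q15 after the final shift.
int32_t quantise_taps_q15(const float *taps, int16_t *q, int K)
{
    int32_t l1 = 0;
    for (int i = 0; i < K; i++) {
        float s = taps[i] * 32768.0f;
        int32_t v = (int32_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
        if (v > INT16_MAX) v = INT16_MAX;   // +1.0 is not representable
        if (v < INT16_MIN) v = INT16_MIN;
        q[i] = (int16_t)v;
        l1 += (v < 0) ? -v : v;
    }
    return l1;
}
```

Rounding can push the returned sum a few counts above 32768 for a kernel that was exactly normalised in float, so it is worth scaling the float taps down slightly, or relying on output saturation, when the result sits on the boundary.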

References

Freeman, William T., and Edward H. Adelson. 1991. “The Design and Use of Steerable Filters.” IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (9): 891–906. https://doi.org/10.1109/34.93808.
