PSO on Hardware

Running the swarm on a microcontroller for online adaptation

PSO is normally a design-time tool: you run the swarm on a workstation, find good coefficients, and ship those to the device. So what would it mean to run PSO on the microcontroller? The answer is online adaptation: when the system being filtered drifts or behaves nonlinearly, the device periodically re-optimises its own filter against a freshly captured buffer of data. This is the niche where on-device PSO earns its keep, and where it differs fundamentally from the LMS adaptive filter.

The contrast is the whole point. LMS spends a tiny, fixed cost on every sample and tracks slowly. PSO spends nothing per sample, then occasionally fires a burst of computation to re-design the filter from scratch. LMS is a trickle; PSO is a scheduled flood. On an MCU that distinction decides feasibility.

This page targets the same capability ladder as the Gabor page (AVR through Cortex-M33 + NPU, ADR-005 as amended), because the interesting question is identical: the algorithm is portable, but which rungs can actually afford to run it? For the mathematics and the Python reference, see the main page.

The swarm loop in C

The PSO update itself is trivial: a handful of multiply-adds per particle per dimension. Almost all the cost is in the fitness, which for filter design means running the candidate filter over the captured data buffer and measuring the error. Here is a compact, self-contained step for one biquad design (D = 5 coefficients), float-first per ADR-005.

#include <stdint.h>
#include <stdlib.h>   // rand
#include <math.h>     // fabsf
#include <float.h>    // FLT_MAX

#define N_PARTICLES 20
#define D           5      // [b0, b1, b2, a1, a2]
#define BUF_LEN     256    // captured data buffer

// Swarm state, persisted across optimisation rounds.
static float x[N_PARTICLES][D];      // position (candidate coefficients)
static float v[N_PARTICLES][D];      // velocity
static float pbest[N_PARTICLES][D];  // each particle's best position
static float pbest_f[N_PARTICLES];   // and its fitness
static float gbest[D];
static float gbest_f;

// Uniform random float in [0, 1). Replace rand() with a better PRNG for real
// use, and seed it (srand) once at start-up so a run is reproducible.
static inline float rnd(void) { return rand() / ((float)RAND_MAX + 1.0f); }

// Fitness: filter the captured buffer with this candidate biquad and return
// the error against a reference. This is where the cycles go.
static float fitness(const float *c, const float *in, const float *ref)
{
    // Stability wall: reject poles outside the unit circle (Schur-Cohn).
    float a1 = c[3], a2 = c[4];
    if (!(fabsf(a2) < 1.0f && fabsf(a1) < 1.0f + a2))
        return 1e3f;

    float w1 = 0.0f, w2 = 0.0f, sse = 0.0f;
    for (int n = 0; n < BUF_LEN; n++) {
        float w0 = in[n] - a1 * w1 - a2 * w2;          // DF-II
        float y  = c[0] * w0 + c[1] * w1 + c[2] * w2;
        w2 = w1; w1 = w0;
        float e = y - ref[n];
        sse += e * e;
    }
    return sse / BUF_LEN;
}

// Per-coefficient search bounds: [b0,b1,b2] and a1 in [-2,2], a2 in [-1,1].
static const float lo[D] = {-2.0f, -2.0f, -2.0f, -2.0f, -1.0f};
static const float hi[D] = { 2.0f,  2.0f,  2.0f,  2.0f,  1.0f};

// Seed the swarm before the first pso_step. Without this, the static state is
// all zeros: every pbest_f is 0.0f, fitness is always >= 0, so no particle
// ever improves on its best and the swarm never moves. Set the best-fitness to
// a large value and evaluate the initial random positions.
void pso_init(const float *in, const float *ref)
{
    gbest_f = FLT_MAX;
    for (int i = 0; i < N_PARTICLES; i++) {
        for (int d = 0; d < D; d++) {
            x[i][d] = lo[d] + rnd() * (hi[d] - lo[d]);
            v[i][d] = 0.1f * (2.0f * rnd() - 1.0f) * (hi[d] - lo[d]);
            pbest[i][d] = x[i][d];
        }
        pbest_f[i] = fitness(x[i], in, ref);
        if (pbest_f[i] < gbest_f) {
            gbest_f = pbest_f[i];
            for (int d = 0; d < D; d++) gbest[d] = x[i][d];
        }
    }
}

// One PSO iteration over the whole swarm. Call pso_init() once first.
void pso_step(const float *in, const float *ref,
              float w_inertia, float c1, float c2)
{
    for (int i = 0; i < N_PARTICLES; i++) {
        for (int d = 0; d < D; d++) {
            float r1 = rnd(), r2 = rnd();
            v[i][d] = w_inertia * v[i][d]
                    + c1 * r1 * (pbest[i][d] - x[i][d])
                    + c2 * r2 * (gbest[d]    - x[i][d]);
            x[i][d] += v[i][d];
            if (x[i][d] < lo[d]) x[i][d] = lo[d];      // clip to bounds
            else if (x[i][d] > hi[d]) x[i][d] = hi[d];
        }
        float f = fitness(x[i], in, ref);
        if (f < pbest_f[i]) {
            pbest_f[i] = f;
            for (int d = 0; d < D; d++) pbest[i][d] = x[i][d];
            if (f < gbest_f) {
                gbest_f = f;
                for (int d = 0; d < D; d++) gbest[d] = x[i][d];
            }
        }
    }
}

A full optimisation is pso_init() once, then pso_step repeated for, say, 50 iterations; gbest then holds the coefficients to load into the running filter. The swarm state is N_PARTICLES*D*3 + N_PARTICLES + D + 1 floats (about 1.3 KB) plus the two captured buffers the fitness needs (the input and the reference, 256 samples each, 2 KB more), so roughly 3.3 KB in float, or 1.7 KB if stored as int16.

The coefficient sign convention here

This page keeps coefficients in the SciPy / direct-form convention (w0 = in - a1*w1 - a2*w2, a not negated), because the swarm optimises the same [b0, b1, b2, a1, a2] vector the Python module designs and the stability wall is written in that convention. That is a deliberate departure from ADR-005 convention 6 (negated a coefficients). If you load gbest into the biquad or gammatone cascade code, negate a1 and a2 first.

What one optimisation costs

The cost is dominated by fitness evaluations: N_PARTICLES particles times n_iters iterations, each filtering a BUF_LEN buffer at roughly 5 MACs per sample. For 20 particles, 50 iterations, and a 256-sample buffer that is $20 \times 50 \times 256 \times 5 = 1.28$ million MACs, plus about 25 k more for the swarm-velocity updates, so roughly 1.31 million MACs per full optimisation, the same on every board. What differs is how long those MACs take, and whether the state even fits.

Target	Core	FPU	RAM	Swarm state	One optimisation (~1.31 M MAC)
ATmega328P	AVR 8-bit, 16 MHz	none	2 KB	does not fit in `float` (3.3 KB); needs `int16` (1.7 KB) and a smaller swarm	~1.6 s, fixed-point. Only the slowest adaptation.
FRDM-KL25Z	Cortex-M0+, 48 MHz	none	16 KB	fits	~220 ms, fixed-point. Background re-tuning every few seconds is plausible.
NUCLEO-F411RE	Cortex-M4F, 100 MHz	yes	128 KB	trivial	~13 ms, `float`. Comfortable online adaptation.
FRDM-MCXN947	Cortex-M33 + NPU, 150 MHz	yes	512 KB	trivial	~9 ms on the M33; the per-particle filtering could offload to the NPU. The top of the ladder.

These are order-of-magnitude estimates

The fitness MAC count (1.28 M) is exact for the stated swarm size; the ~25 k swarm-update term and the ~1.31 M total are approximate. The times assume roughly 1 cycle per MAC on the FPU parts and a larger, hand-estimated multiplier for software fixed-point on the no-FPU parts (about 8 cycles/MAC on the M0+, ~20 on the AVR); these are not datasheet guarantees. The F411 clock (100 MHz) is from the EML project’s CubeMX configuration. Measure on hardware with DWT->CYCCNT (Cortex-M) before trusting any of them.

The shape of the table is the lesson. A 13 ms burst on the M4F can run every second or two without disturbing the audio or control loop, giving genuine online adaptation. The same burst is 1.6 s on the AVR, useful only if the environment drifts over minutes, and the swarm barely fits in memory. PSO on-device is feasible, but it lives on the upper rungs; the lower rungs are better served by the per-sample LMS filter, which never needs more than a few coefficients of state.

Fixed-point on the no-FPU rungs

On the AVR and M0+ the swarm has to run in fixed-point, with the same care as every other fixed-point filter in this workshop:

Store the data buffer in Q15, but note the coefficients do not fit Q15’s $[-1, 1)$: the search bounds run to $\pm 2$, so the positions and velocities need a format with integer bits (for example Q12, range $\pm 8$) or an explicit scale carried consistently through both the fitness filter and the swarm update. Run the fitness filter with a 32-bit accumulator (the biquad fixed-point section covers the accumulator-width reasoning).
The stability wall is the same Schur-Cohn test, and it is cheap: two comparisons before the expensive filtering loop, so unstable particles cost almost nothing.
Reduce N_PARTICLES and BUF_LEN to fit the SRAM and the time budget. Fewer particles search less thoroughly, which is an honest accuracy-for-cost trade, not a free lunch.

When not to do this

The honest engineering answer is that on-device PSO is a special-purpose tool. If the filtering problem is convex (an FIR error surface), LMS or NLMS adapts continuously for a fraction of the cost and should be preferred. On-device PSO is worth its burst cost only when the surface is genuinely multimodal or nonlinear, the environment is non-stationary, and the board sits high enough on the ladder to afford the swarm. On the M4F and M33 that is a real and useful capability; on the AVR it is mostly a demonstration of where the limit is.

References

--- title: "PSO on Hardware" subtitle: "Running the swarm on a microcontroller for online adaptation" --- PSO is normally a **design-time** tool: you run the swarm on a workstation, find good coefficients, and ship those to the device. So what would it mean to run PSO *on* the microcontroller? The answer is **online adaptation**: when the system being filtered drifts or behaves nonlinearly, the device periodically re-optimises its own filter against a freshly captured buffer of data. This is the niche where on-device PSO earns its keep, and where it differs fundamentally from the [LMS adaptive filter](../adaptive-filtering/embedded.qmd). The contrast is the whole point. LMS spends a tiny, fixed cost on **every sample** and tracks slowly. PSO spends nothing per sample, then occasionally fires a **burst** of computation to re-design the filter from scratch. LMS is a trickle; PSO is a scheduled flood. On an MCU that distinction decides feasibility. This page targets the same capability ladder as the [Gabor page](../gabor-filters/embedded.qmd) (AVR through Cortex-M33 + NPU, [ADR-005](../../docs/adr/005-embedded-platforms.md) as amended), because the interesting question is identical: the algorithm is portable, but which rungs can actually afford to run it? For the mathematics and the Python reference, see the [main page](index.qmd). <hr> ## The swarm loop in C The PSO update itself is trivial: a handful of multiply-adds per particle per dimension. Almost all the cost is in the **fitness**, which for filter design means running the candidate filter over the captured data buffer and measuring the error. Here is a compact, self-contained step for one biquad design (`D = 5` coefficients), float-first per ADR-005. ```c #include <stdint.h> #include <stdlib.h> // rand #include <math.h> // fabsf #include <float.h> // FLT_MAX #define N_PARTICLES 20 #define D 5 // [b0, b1, b2, a1, a2] #define BUF_LEN 256 // captured data buffer // Swarm state, persisted across optimisation rounds. static float x[N_PARTICLES][D]; // position (candidate coefficients) static float v[N_PARTICLES][D]; // velocity static float pbest[N_PARTICLES][D]; // each particle's best position static float pbest_f[N_PARTICLES]; // and its fitness static float gbest[D]; static float gbest_f; // Uniform random float in [0, 1). Replace rand() with a better PRNG for real // use, and seed it (srand) once at start-up so a run is reproducible. static inline float rnd(void) { return rand() / ((float)RAND_MAX + 1.0f); } // Fitness: filter the captured buffer with this candidate biquad and return // the error against a reference. This is where the cycles go. static float fitness(const float *c, const float *in, const float *ref) { // Stability wall: reject poles outside the unit circle (Schur-Cohn). float a1 = c[3], a2 = c[4]; if (!(fabsf(a2) < 1.0f && fabsf(a1) < 1.0f + a2)) return 1e3f; float w1 = 0.0f, w2 = 0.0f, sse = 0.0f; for (int n = 0; n < BUF_LEN; n++) { float w0 = in[n] - a1 * w1 - a2 * w2; // DF-II float y = c[0] * w0 + c[1] * w1 + c[2] * w2; w2 = w1; w1 = w0; float e = y - ref[n]; sse += e * e; } return sse / BUF_LEN; } // Per-coefficient search bounds: [b0,b1,b2] and a1 in [-2,2], a2 in [-1,1]. static const float lo[D] = {-2.0f, -2.0f, -2.0f, -2.0f, -1.0f}; static const float hi[D] = { 2.0f, 2.0f, 2.0f, 2.0f, 1.0f}; // Seed the swarm before the first pso_step. Without this, the static state is // all zeros: every pbest_f is 0.0f, fitness is always >= 0, so no particle // ever improves on its best and the swarm never moves. Set the best-fitness to // a large value and evaluate the initial random positions. void pso_init(const float *in, const float *ref) { gbest_f = FLT_MAX; for (int i = 0; i < N_PARTICLES; i++) { for (int d = 0; d < D; d++) { x[i][d] = lo[d] + rnd() * (hi[d] - lo[d]); v[i][d] = 0.1f * (2.0f * rnd() - 1.0f) * (hi[d] - lo[d]); pbest[i][d] = x[i][d]; } pbest_f[i] = fitness(x[i], in, ref); if (pbest_f[i] < gbest_f) { gbest_f = pbest_f[i]; for (int d = 0; d < D; d++) gbest[d] = x[i][d]; } } } // One PSO iteration over the whole swarm. Call pso_init() once first. void pso_step(const float *in, const float *ref, float w_inertia, float c1, float c2) { for (int i = 0; i < N_PARTICLES; i++) { for (int d = 0; d < D; d++) { float r1 = rnd(), r2 = rnd(); v[i][d] = w_inertia * v[i][d] + c1 * r1 * (pbest[i][d] - x[i][d]) + c2 * r2 * (gbest[d] - x[i][d]); x[i][d] += v[i][d]; if (x[i][d] < lo[d]) x[i][d] = lo[d]; // clip to bounds else if (x[i][d] > hi[d]) x[i][d] = hi[d]; } float f = fitness(x[i], in, ref); if (f < pbest_f[i]) { pbest_f[i] = f; for (int d = 0; d < D; d++) pbest[i][d] = x[i][d]; if (f < gbest_f) { gbest_f = f; for (int d = 0; d < D; d++) gbest[d] = x[i][d]; } } } } ``` A full optimisation is `pso_init()` once, then `pso_step` repeated for, say, 50 iterations; `gbest` then holds the coefficients to load into the running filter. The swarm state is `N_PARTICLES*D*3 + N_PARTICLES + D + 1` floats (about 1.3 KB) plus the two captured buffers the fitness needs (the input and the reference, 256 samples each, 2 KB more), so roughly **3.3 KB** in `float`, or 1.7 KB if stored as `int16`. ::: {.callout-important title="The coefficient sign convention here"} This page keeps coefficients in the **SciPy / direct-form** convention (`w0 = in - a1*w1 - a2*w2`, `a` not negated), because the swarm optimises the same `[b0, b1, b2, a1, a2]` vector the Python module designs and the stability wall is written in that convention. That is a deliberate departure from [ADR-005 convention 6](../../docs/adr/005-embedded-platforms.md) (negated `a` coefficients). If you load `gbest` into the [biquad](../../basics/09-biquad/embedded.qmd) or [gammatone](../gammatone-filters/embedded.qmd) cascade code, negate `a1` and `a2` first. ::: <hr> ## What one optimisation costs The cost is dominated by fitness evaluations: `N_PARTICLES` particles times `n_iters` iterations, each filtering a `BUF_LEN` buffer at roughly 5 MACs per sample. For 20 particles, 50 iterations, and a 256-sample buffer that is $20 \times 50 \times 256 \times 5 = 1.28$ million MACs, plus about 25 k more for the swarm-velocity updates, so roughly **1.31 million MACs** per full optimisation, the same on every board. What differs is how long those MACs take, and whether the state even fits. | Target | Core | FPU | RAM | Swarm state | One optimisation (~1.31 M MAC) | |---|---|---|---:|---|---| | **ATmega328P** | AVR 8-bit, 16 MHz | none | 2 KB | does not fit in `float` (3.3 KB); needs `int16` (1.7 KB) and a smaller swarm | ~1.6 s, fixed-point. Only the slowest adaptation. | | **FRDM-KL25Z** | Cortex-M0+, 48 MHz | none | 16 KB | fits | ~220 ms, fixed-point. Background re-tuning every few seconds is plausible. | | **NUCLEO-F411RE** | Cortex-M4F, 100 MHz | yes | 128 KB | trivial | ~13 ms, `float`. Comfortable online adaptation. | | **FRDM-MCXN947** | Cortex-M33 + NPU, 150 MHz | yes | 512 KB | trivial | ~9 ms on the M33; the per-particle filtering could offload to the NPU. The top of the ladder. | ::: {.callout-important title="These are order-of-magnitude estimates"} The fitness MAC count (1.28 M) is exact for the stated swarm size; the ~25 k swarm-update term and the ~1.31 M total are approximate. The times assume roughly 1 cycle per MAC on the FPU parts and a larger, hand-estimated multiplier for software fixed-point on the no-FPU parts (about 8 cycles/MAC on the M0+, ~20 on the AVR); these are not datasheet guarantees. The F411 clock (100 MHz) is from the EML project's CubeMX configuration. Measure on hardware with `DWT->CYCCNT` (Cortex-M) before trusting any of them. ::: The shape of the table is the lesson. A 13 ms burst on the M4F can run every second or two without disturbing the audio or control loop, giving genuine online adaptation. The same burst is 1.6 s on the AVR, useful only if the environment drifts over minutes, and the swarm barely fits in memory. PSO on-device is feasible, but it lives on the **upper rungs**; the lower rungs are better served by the per-sample LMS filter, which never needs more than a few coefficients of state. <hr> ## Fixed-point on the no-FPU rungs On the AVR and M0+ the swarm has to run in fixed-point, with the same care as every other fixed-point filter in this workshop: - Store the data buffer in **Q15**, but note the coefficients do not fit Q15's $[-1, 1)$: the search bounds run to $\pm 2$, so the positions and velocities need a format with integer bits (for example **Q12**, range $\pm 8$) or an explicit scale carried consistently through both the fitness filter and the swarm update. Run the fitness filter with a **32-bit accumulator** (the [biquad fixed-point section](../../basics/09-biquad/embedded.qmd#fixed-point-considerations) covers the accumulator-width reasoning). - The **stability wall** is the same Schur-Cohn test, and it is cheap: two comparisons before the expensive filtering loop, so unstable particles cost almost nothing. - Reduce `N_PARTICLES` and `BUF_LEN` to fit the SRAM and the time budget. Fewer particles search less thoroughly, which is an honest accuracy-for-cost trade, not a free lunch. <hr> ## When not to do this The honest engineering answer is that on-device PSO is a **special-purpose** tool. If the filtering problem is convex (an FIR error surface), LMS or NLMS adapts continuously for a fraction of the cost and should be preferred. On-device PSO is worth its burst cost only when the surface is genuinely multimodal or nonlinear, the environment is non-stationary, and the board sits high enough on the ladder to afford the swarm. On the M4F and M33 that is a real and useful capability; on the AVR it is mostly a demonstration of where the limit is. <hr> ## References ::: {#refs} :::