Voice is one of the most natural interfaces for embedded devices, but streaming audio to the cloud for recognition wastes bandwidth, introduces latency, and raises privacy concerns. A wake word detector running entirely on the ESP32 listens continuously, classifies short audio windows locally, and only takes action when it hears the target phrase. This lesson builds that system end to end: microphone hardware, audio capture firmware, feature extraction, model training, and a real-time inference loop that triggers an LED and publishes an MQTT message. #KeywordSpotting #TinyML #ESP32
What We Are Building
Hey Device Wake Word Detector
A voice-activated trigger that listens for the keyword “yes” (from Google’s Speech Commands dataset) using an INMP441 I2S MEMS microphone connected to an ESP32. When the keyword is detected with confidence above a threshold, the ESP32 lights an LED and optionally publishes an MQTT message. All inference runs on the MCU with no cloud dependency.
Project specifications:

| Parameter | Value |
| --- | --- |
| MCU | ESP32 DevKitC |
| Microphone | INMP441 I2S MEMS |
| Audio format | 16 kHz, 16-bit mono |
| Window length | 1 second (16,000 samples) |
| Feature extraction | 13 MFCCs, 32 ms frames, 16 ms hop |
| Model | Small CNN (Conv1D), ~20 KB quantized |
| Classes | "yes", "no", "unknown", "silence" |
| Inference time | < 100 ms on ESP32 at 240 MHz |
| Detection LED | GPIO 2 (onboard) |
| MQTT topic | `device/wakeword/detected` |
Bill of Materials
| Ref | Component | Quantity | Notes |
| --- | --- | --- | --- |
| U1 | ESP32 DevKitC | 1 | Reuse from previous lessons |
| M1 | INMP441 I2S MEMS microphone breakout | 1 | I2S output, 3.3V compatible |
| — | Breadboard + jumper wires | 1 set | — |
Audio Feature Extraction Pipeline
```
──────────────────────────────────────────
INMP441 Mic         Framing              MFCC
 (16 kHz)   ──►  32 ms windows  ──►  13 coefficients
                 16 ms hop           per frame
                                         │
                                         ▼
                                ┌──────────────┐
                                │ Spectrogram  │
                                │ 62 frames x  │
                                │ 13 MFCCs     │
                                │ = 806 values │
                                └──────┬───────┘
                                       │
                                       ▼
                                ┌──────────────┐
                                │ CNN (Conv1D) │
                                │ ~20 KB int8  │
                                └──────┬───────┘
                                       │
                              ┌────────┼────────┐
                              ▼        ▼        ▼
                            "yes"    "no"   "unknown"
```
Audio ML on Microcontrollers
Sliding Window Detection Loop
```
──────────────────────────────────────────
Audio stream (continuous):
──────────────────────────────────────►
│    1s window    │
├─────────────────┤
      │    1s window    │    (hop = 500 ms)
      ├─────────────────┤
            │    1s window    │
            ├─────────────────┤

Each window: extract MFCCs ──► classify
If P("yes") > 0.85 for 2 consecutive
windows ──► DETECTED  (reduces false alarms)
```
Speech and audio classification on MCUs follows a consistent pipeline. Raw audio samples come in from a digital microphone. A feature extraction stage converts the time-domain waveform into a compact spectral representation (typically MFCCs or log-mel spectrograms). A small neural network classifies the feature matrix into one of several categories. The entire pipeline must complete within the audio window duration to maintain real-time operation.
The key constraints on an ESP32:
- RAM: ~520 KB total SRAM. The audio buffer, feature matrix, model weights, and interpreter arena all compete for this space.
- Flash: up to 4 MB. The quantized model and firmware share this.
- CPU: dual-core Xtensa LX6 at 240 MHz. MFCC extraction is compute-intensive but feasible at 16 kHz sample rates.
We use 16 kHz sampling because the Speech Commands dataset is recorded at 16 kHz and human speech energy relevant to keyword detection sits below 8 kHz (the Nyquist frequency at this sample rate).
I2S Microphone Hardware Setup
The INMP441 is a digital MEMS microphone with an I2S output interface. Unlike analog microphones that require an ADC, the INMP441 outputs a digital bitstream directly, which the ESP32’s I2S peripheral reads without any external codec.
Wiring
| ESP32 Pin | INMP441 Pin | Notes |
| --- | --- | --- |
| GPIO 26 | WS (Word Select) | Left/right channel select |
| GPIO 25 | SCK (Serial Clock) | Bit clock |
| GPIO 33 | SD (Serial Data) | Audio data output |
| 3.3V | VDD | Power supply |
| GND | GND | Common ground |
| GND | L/R | Tie to GND for left channel |
The L/R pin selects which channel the microphone outputs on. Tying it to GND selects the left channel. If you tie it to VDD, the mic outputs on the right channel instead. For a single-mic setup, left channel is the convention.
Why I2S Instead of Analog?
An analog electret microphone connected to the ESP32’s ADC would work for basic audio capture, but it introduces noise from the ADC conversion, requires an amplifier circuit, and the ADC on the ESP32 is only 12-bit. The INMP441 outputs 24-bit digital audio with a built-in anti-aliasing filter and much higher signal-to-noise ratio (SNR of 61 dBA). For ML applications where clean audio directly affects classification accuracy, the I2S mic is the right choice.
Audio Capture Firmware
The following ESP-IDF code configures the I2S peripheral, captures 1-second audio windows, and stores them in a buffer for processing.
The INMP441 outputs 24-bit samples inside a 32-bit I2S frame. We read the full 32-bit values and shift right by 16 to keep the top 16 bits, which yields a signed 16-bit PCM sample. This is a common pattern with I2S MEMS microphones on the ESP32.
The DMA configuration uses 8 descriptors with 512 frames each, providing enough buffering to prevent underruns during the feature extraction phase. At 16 kHz, one second of audio is 16000 samples (32 KB of int16 data), which fits comfortably in the ESP32’s SRAM.
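A condensed sketch of this setup using the ESP-IDF 5.x `i2s_std` driver follows. Pin numbers come from the wiring table above; treat the struct fields as a guide and check them against your IDF version.

```c
#include "freertos/FreeRTOS.h"
#include "driver/i2s_std.h"

#define SAMPLE_RATE_HZ 16000
#define WINDOW_SAMPLES 16000   /* 1 second at 16 kHz */

static i2s_chan_handle_t s_rx_chan;
static int16_t s_audio[WINDOW_SAMPLES];

static void mic_init(void)
{
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM_AUTO, I2S_ROLE_MASTER);
    chan_cfg.dma_desc_num  = 8;    /* 8 DMA descriptors ...  */
    chan_cfg.dma_frame_num = 512;  /* ... of 512 frames each */
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, NULL, &s_rx_chan));

    i2s_std_config_t std_cfg = {
        .clk_cfg  = I2S_STD_CLK_DEFAULT_CONFIG(SAMPLE_RATE_HZ),
        /* 24-bit data in a 32-bit slot, standard I2S framing */
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT,
                                                        I2S_SLOT_MODE_MONO),
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,
            .bclk = GPIO_NUM_25,   /* SCK */
            .ws   = GPIO_NUM_26,   /* WS  */
            .dout = I2S_GPIO_UNUSED,
            .din  = GPIO_NUM_33,   /* SD  */
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(s_rx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(s_rx_chan));
}

/* Capture one 1-second window, converting 32-bit slots to 16-bit PCM. */
static void capture_window(void)
{
    static int32_t raw[WINDOW_SAMPLES];
    size_t bytes_read = 0;
    ESP_ERROR_CHECK(i2s_channel_read(s_rx_chan, raw, sizeof(raw),
                                     &bytes_read, portMAX_DELAY));
    for (size_t i = 0; i < bytes_read / sizeof(int32_t); i++)
        s_audio[i] = (int16_t)(raw[i] >> 16);  /* keep the top 16 bits */
}
```

`I2S_SLOT_MODE_MONO` with the standard-mode defaults reads the left slot, matching the L/R pin tied to GND in the wiring table.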
Feature Extraction: MFCC on Device
Raw audio waveforms are not suitable as direct input to small neural networks. The standard approach for audio ML is to extract Mel-frequency cepstral coefficients (MFCCs), which compress the spectral content of each short frame into a small vector of typically 13 coefficients.
MFCC Pipeline
The MFCC computation follows these steps for each frame:
1. Pre-emphasis: apply a high-pass filter to boost high frequencies: `y[n] = x[n] - 0.97 * x[n-1]`
2. Windowing: multiply each frame by a Hann window to reduce spectral leakage
3. FFT: compute the magnitude spectrum using a 512-point FFT
4. Mel filterbank: apply 26 triangular filters spaced on the mel scale
5. Log energy: take the logarithm of each filterbank output
6. DCT: apply a discrete cosine transform and keep the first 13 coefficients
C Implementation
```c
/* MFCC parameters */
#define FRAME_LEN   512   /* 32 ms at 16 kHz */
#define FRAME_STEP  256   /* 16 ms hop (50% overlap) */
```
A naive textbook FFT is fine for prototyping, but for real-time performance use the ESP-DSP library, which provides FFT routines optimized for the Xtensa core:
```c
#include "dsps_fft2r.h"
#include "dsps_wind_hann.h"

/* Initialize once at startup */
dsps_fft2r_init_fc32(NULL, FFT_SIZE);

/* In the MFCC loop, replace the manual FFT with: */
dsps_fft2r_fc32(fft_buf, FFT_SIZE);      /* fft_buf: interleaved complex input */
dsps_bit_rev_fc32(fft_buf, FFT_SIZE);
dsps_cplx2reC_fc32(fft_buf, FFT_SIZE);
```
Add esp-dsp to your component dependencies in idf_component.yml:
```yml
dependencies:
  espressif/esp-dsp: "~1.4.0"
```
Training the Wake Word Model
We train the model on your PC using TensorFlow and Google’s Speech Commands v2 dataset. The dataset contains 105,000 one-second utterances of 35 keywords, recorded by thousands of contributors. We select “yes” as our wake word and group the remaining keywords into “unknown” and “silence” classes.
The classifier uses a small 1D CNN that takes the MFCC matrix (62 frames x 13 coefficients) as input. The architecture is designed to fit within the ESP32’s memory budget after int8 quantization.
You should see test accuracy above 90% after training. Precision on the “yes” class is typically high because its utterances are acoustically consistent, while the catch-all “unknown” class produces most of the confusion.
The mel filterbank generator script ends by printing each row of the `filterbank` array as a C initializer:

```python
        values = ', '.join(f'{v:.6f}' for v in filterbank[m])
        print(f"    {{{values}}},")
    print("};")

generate_mel_filterbank_c()
```
Real-Time Keyword Detection Loop
This is the main inference loop that ties everything together. It continuously captures audio, extracts features, runs the model, and acts on detections.
```c
/* Detection logic with consecutive confirmation */
if (predicted == 0 && confidence > DETECTION_THRESHOLD) {
    consecutive_detections++;
    if (consecutive_detections >= 2) {
        ESP_LOGW(TAG, "WAKE WORD DETECTED! Confidence: %.2f",
                 confidence);
        led_on();
        /* Trigger action here (MQTT publish, etc.) */
        vTaskDelay(pdMS_TO_TICKS(1000));
        led_off();
        consecutive_detections = 0;
    }
} else {
    consecutive_detections = 0;
}
```
The detection loop uses a consecutive confirmation strategy: it requires two back-to-back positive classifications before triggering an action. This dramatically reduces false positives while adding only one second of latency. For a wake word detector that runs continuously, this tradeoff is well worth it.
Handling False Positives
False positives are the primary usability concern for always-listening keyword detectors. Several techniques reduce them:
1. Consecutive detection requirement. As shown in the code above, requiring two (or three) consecutive positive windows eliminates most spurious triggers. A random noise burst might briefly match the keyword, but it is unlikely to persist for two full seconds.
2. Confidence threshold tuning. The DETECTION_THRESHOLD value determines the sensitivity/specificity tradeoff. Start at 0.75 and adjust based on your deployment environment:
| Threshold | Behavior |
| --- | --- |
| 0.5 | High sensitivity, more false positives. Good for quiet rooms. |
| 0.75 | Balanced. Recommended starting point. |
| 0.9 | Very few false positives, but may miss some valid keywords. |
3. Noise-aware training. During training, augment the dataset by mixing background noise into the clips at various SNR levels; the Speech Commands dataset ships with a `_background_noise_` folder for exactly this purpose.
4. Silence gating. Before running inference, check the RMS energy of the audio window. If it falls below a threshold (indicating silence or very low ambient noise), skip inference entirely to save power.
```c
/* Inside the duty-cycled listening loop */
int predicted = run_inference(s_mfcc_features, &confidence);
if (predicted == 0 && confidence > DETECTION_THRESHOLD) {
    /* Handle detection */
    led_on();
    publish_detection("yes", confidence);
    vTaskDelay(pdMS_TO_TICKS(1000));
    led_off();
}

/* Disable I2S and sleep */
i2s_channel_disable(s_rx_chan);

/* Light sleep for 500 ms */
esp_sleep_enable_timer_wakeup(500 * 1000);
esp_light_sleep_start();
```
This approach reduces average current from ~80 mA to ~20 mA at the cost of potentially missing keywords that occur during sleep intervals. For applications where immediate response is not critical (e.g., a voice-controlled thermostat), this is an acceptable tradeoff.
Project File Structure
```
wake-word-detector/
├── CMakeLists.txt
├── main/
│   ├── CMakeLists.txt
│   ├── main.cc
│   ├── idf_component.yml
│   ├── wake_word_model.h
│   └── mel_filterbank.h
└── managed_components/
    └── esp-dsp/        (managed component)
        └── …
```
CMakeLists.txt (Top Level)
```cmake
cmake_minimum_required(VERSION 3.16)
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(wake-word-detector)
```
main/CMakeLists.txt
```cmake
idf_component_register(SRCS "main.cc"
                       INCLUDE_DIRS "."
                       REQUIRES driver esp_timer)
```
idf_component.yml
```yml
dependencies:
  espressif/esp-dsp: "~1.4.0"
  espressif/esp-tflite-micro: "~1.3.1"
```
Building and Testing
Create the project directory:
```shell
mkdir -p wake-word-detector/main
```
Train the model on your PC using the Python scripts from the training section. This produces wake_word_model.tflite.
Generate the mel filterbank C array using the Python script and paste it into mel_filterbank.h.
Set the target and build:
```shell
cd wake-word-detector
idf.py set-target esp32
idf.py build
```
Flash and monitor:
```shell
idf.py -p /dev/ttyUSB0 flash monitor
```
Speak “yes” near the INMP441 microphone. You should see:
```
I (12345) wake_word: Class: yes (0.92), MFCC: 45000 us, Infer: 32000 us
I (12345) wake_word: Class: yes (0.89), MFCC: 44000 us, Infer: 31000 us
W (12345) wake_word: WAKE WORD DETECTED! Confidence: 0.89
```
The onboard LED lights up for 1 second after two consecutive detections.
Exercises
1. Add a custom wake phrase. Record 50 or more samples of a custom phrase (e.g., “Hey Lamp”) using a Python script that captures 1-second clips from your PC microphone. Add these to the training set as class 0 and retrain the model. Evaluate how many samples are needed for reliable detection.
2. Implement a sliding window with overlap. Instead of capturing non-overlapping 1-second windows, use a ring buffer that advances by 500 ms (50% overlap). This reduces the worst-case detection latency from 2 seconds to 1.5 seconds. Measure the impact on CPU utilization.
3. Add a log-mel spectrogram display over serial. After extracting features, print the mel spectrogram as a simple ASCII heatmap over the serial monitor. This helps debug audio quality issues and visualize the difference between keyword utterances and background noise.
4. Port the inference to STM32. Using the STM32 HAL and CMSIS-DSP for FFT, implement the same MFCC extraction and TFLM inference on an STM32F4 board. Compare inference latency and memory usage between the two platforms.
Summary
You built an end-to-end keyword spotting system: an INMP441 I2S MEMS microphone captures 16 kHz audio on the ESP32, firmware extracts 13 MFCCs per frame, a small Conv1D classifier (trained on Google’s Speech Commands dataset) runs inference in under 100 ms, and a consecutive-detection filter eliminates false positives before triggering an LED or publishing an MQTT event. The quantized model occupies roughly 20 KB of flash, and the entire inference pipeline fits within the ESP32’s 520 KB of SRAM. Power management techniques like silence gating and duty-cycled listening extend battery life for portable deployments.