
Camera Image Classification on ESP32

ESP32-CAM Memory Layout
──────────────────────────────────────────
Internal SRAM (~520 KB)
┌─────────────────────────────────────┐
│ FreeRTOS heap + task stacks 200 KB │
│ TFLM interpreter + ops 50 KB │
│ Tensor arena 100 KB │
│ Misc (WiFi, drivers) 170 KB │
└─────────────────────────────────────┘
PSRAM (4 MB, SPI-connected)
┌─────────────────────────────────────┐
│ Camera frame buffer (QVGA) 150 KB │
│ Resized input (96x96) 9 KB │
│ Model weights (flash-mapped)250 KB │
│ Available ~3.5 MB │
└─────────────────────────────────────┘
Key: prefer internal SRAM for the
tensor arena (roughly 30% faster than
PSRAM). Image buffers can use PSRAM
(slower but larger).

Image classification is the most demanding TinyML task you can run on a microcontroller. A single 96x96 grayscale image is 9,216 bytes; a 96x96 RGB image is 27,648 bytes. The model weights, intermediate activations, and the image buffer all compete for the same limited RAM. The ESP32-CAM module solves part of this problem with 4 MB of PSRAM, but careful memory management is still needed to make everything fit. This capstone lesson puts together everything from the course: model training, quantization, TFLM deployment, and real-time embedded application logic, all applied to the hardest modality.
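Those buffer sizes are simple products, but keeping them in your head pays off when budgeting RAM. A quick sanity check in plain Python:

```python
# Raw image buffer size: width * height * channels * bytes per pixel.
def buffer_bytes(width, height, channels, bytes_per_pixel=1):
    return width * height * channels * bytes_per_pixel

print(buffer_bytes(96, 96, 1))       # 9216   -- 96x96 grayscale
print(buffer_bytes(96, 96, 3))       # 27648  -- 96x96 RGB888
print(buffer_bytes(320, 240, 1, 2))  # 153600 -- QVGA RGB565, ~150 KB
```

The last line is why a full QVGA frame goes to PSRAM while a 96x96 grayscale model input fits comfortably anywhere.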

What We Are Building

ESP32-CAM Image Classifier

An ESP32-CAM module (AI-Thinker variant with OV2640 camera and 4 MB PSRAM) that captures images, preprocesses them to 96x96 grayscale, runs a quantized MobileNetV1 (alpha=0.25) or custom CNN through TFLite Micro, and outputs the classification result over serial. Two demo applications: person detection (binary: person/no-person) and object classification (multi-class: 5 to 10 categories). The system runs inference at approximately 2 to 5 frames per second.

Project specifications:

| Parameter | Value |
| --- | --- |
| Module | ESP32-CAM (AI-Thinker) |
| Camera | OV2640, 2 MP |
| PSRAM | 4 MB |
| Internal SRAM | ~520 KB |
| Input resolution | 96 x 96 grayscale |
| Model (person detection) | Custom CNN, ~30 KB quantized |
| Model (object classification) | MobileNetV1 0.25, ~250 KB quantized |
| Inference rate | 2 to 5 FPS (model dependent) |
| Output | Serial (class + confidence), onboard LED flash |

Bill of Materials

| Ref | Component | Quantity | Notes |
| --- | --- | --- | --- |
| U1 | ESP32-CAM (AI-Thinker) | 1 | OV2640 camera, 4 MB PSRAM |
| — | USB-to-serial adapter (e.g., FTDI FT232RL) | 1 | For programming and serial output |
| — | Jumper wires | 4 | For serial connection |

Image Processing Pipeline (on ESP32)
──────────────────────────────────────────
OV2640 Camera
1600x1200 JPEG
▼ decode + resize
96 x 96 grayscale (9,216 bytes)
▼ normalize to [0, 1] or [-128, 127]
96 x 96 x 1 int8 tensor
▼ TFLM inference (~200 ms)
┌────────────────────────┐
│ person: 0.87 │
│ no_person: 0.13 │
└────────────────────────┘
▼ if person > 0.7
LED flash + serial output

Image Classification on Constrained Hardware



Running a CNN on a microcontroller is fundamentally different from running one on a GPU. On a GPU, you load the entire model into VRAM, process a batch of images, and the framework handles memory allocation transparently. On an MCU, every byte matters.

| Resource | ESP32-CAM Available | Person Detection Model | MobileNetV1 0.25 |
| --- | --- | --- | --- |
| Flash (model storage) | 4 MB | ~30 KB | ~250 KB |
| SRAM (tensor arena) | ~200 KB usable | ~100 KB | ~180 KB |
| PSRAM (image buffer) | 4 MB | 9 KB (96x96x1) | 9 KB (96x96x1) |
| Inference time | N/A | ~200 ms | ~800 ms |

The tensor arena is the memory block where TFLite Micro stores input/output tensors and intermediate activations during inference. For image models, the arena must be large enough to hold the largest intermediate activation map. In a CNN, the first few layers often produce the largest feature maps (e.g., 96x96x8 = 73,728 bytes for 8 filters at input resolution). The arena size is the primary constraint on model complexity.
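Because the first layers dominate, you can estimate the arena requirement from layer shapes alone, before training anything. A rough sketch in Python (the shapes mirror the person-detection CNN built later in this lesson; int8 activations, one byte per element):

```python
# Estimate int8 activation-map sizes for a simple CNN.
# Each entry is (name, height, width, channels) after the layer runs.
activation_shapes = [
    ("conv1 out", 96, 96, 8),
    ("pool1 out", 48, 48, 8),
    ("conv2 out", 48, 48, 16),
    ("pool2 out", 24, 24, 16),
    ("conv3 out", 24, 24, 16),
    ("pool3 out", 12, 12, 16),
    ("conv4 out", 12, 12, 32),
    ("pool4 out", 6, 6, 32),
]
sizes = {name: h * w * c for name, h, w, c in activation_shapes}
largest = max(sizes, key=sizes.get)
print(largest, sizes[largest])  # conv1 out 73728
```

The first convolution's output alone is ~72 KB, which is why the arena budget, not the weight count, usually decides whether a vision model fits.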

ESP32-CAM Hardware Overview



The AI-Thinker ESP32-CAM module integrates:

  • ESP32-S (no USB): Dual-core Xtensa LX6 at 240 MHz, 520 KB SRAM, Wi-Fi, Bluetooth
  • OV2640 camera: 2 MP, supports JPEG and raw (YUV, RGB) output, connected via DVP 8-bit parallel interface
  • 4 MB PSRAM: Connected via SPI, accessible as external RAM through the ESP32’s cache
  • 4 MB Flash: For firmware and model storage
  • Onboard LED flash: GPIO 4, useful for low-light image capture
  • MicroSD card slot: GPIO 2, 12, 13, 14, 15 (shared with some camera pins in certain modes)

Pin Connections for Programming

| USB-Serial Adapter | ESP32-CAM | Notes |
| --- | --- | --- |
| TX | U0R (GPIO 3) | Adapter TX to ESP32 RX |
| RX | U0T (GPIO 1) | Adapter RX to ESP32 TX |
| GND | GND | Common ground |
| 5V | 5V | Power supply (5 V input, onboard 3.3 V regulator) |

To enter flash mode, connect GPIO 0 to GND before powering on or pressing the reset button. Remove the GPIO 0 to GND connection after flashing.

Camera Driver Setup



The ESP-IDF camera driver handles the OV2640 initialization, frame capture, and pixel format conversion. The key configuration parameters are the resolution, pixel format, and frame buffer location.

#include <stdio.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"
#include "esp_timer.h"
#include "esp_camera.h"
#include "driver/gpio.h"
static const char *TAG = "cam_classify";
/* AI-Thinker ESP32-CAM pin definitions */
#define CAM_PIN_PWDN 32
#define CAM_PIN_RESET -1
#define CAM_PIN_XCLK 0
#define CAM_PIN_SIOD 26
#define CAM_PIN_SIOC 27
#define CAM_PIN_D7 35
#define CAM_PIN_D6 34
#define CAM_PIN_D5 39
#define CAM_PIN_D4 36
#define CAM_PIN_D3 21
#define CAM_PIN_D2 19
#define CAM_PIN_D1 18
#define CAM_PIN_D0 5
#define CAM_PIN_VSYNC 25
#define CAM_PIN_HREF 23
#define CAM_PIN_PCLK 22
/* Onboard flash LED */
#define FLASH_LED_PIN 4
/* Image parameters */
#define IMG_WIDTH 96
#define IMG_HEIGHT 96
static esp_err_t camera_init(void)
{
    camera_config_t config = {
        .pin_pwdn = CAM_PIN_PWDN,
        .pin_reset = CAM_PIN_RESET,
        .pin_xclk = CAM_PIN_XCLK,
        .pin_sccb_sda = CAM_PIN_SIOD,
        .pin_sccb_scl = CAM_PIN_SIOC,
        .pin_d7 = CAM_PIN_D7,
        .pin_d6 = CAM_PIN_D6,
        .pin_d5 = CAM_PIN_D5,
        .pin_d4 = CAM_PIN_D4,
        .pin_d3 = CAM_PIN_D3,
        .pin_d2 = CAM_PIN_D2,
        .pin_d1 = CAM_PIN_D1,
        .pin_d0 = CAM_PIN_D0,
        .pin_vsync = CAM_PIN_VSYNC,
        .pin_href = CAM_PIN_HREF,
        .pin_pclk = CAM_PIN_PCLK,
        .xclk_freq_hz = 20000000,
        .ledc_timer = LEDC_TIMER_0,
        .ledc_channel = LEDC_CHANNEL_0,
        .pixel_format = PIXFORMAT_GRAYSCALE,
        .frame_size = FRAMESIZE_96X96,
        .jpeg_quality = 12,
        .fb_count = 1,
        .fb_location = CAMERA_FB_IN_PSRAM,
        .grab_mode = CAMERA_GRAB_WHEN_EMPTY,
    };

    esp_err_t err = esp_camera_init(&config);
    if (err != ESP_OK) {
        ESP_LOGE(TAG, "Camera init failed: 0x%x", err);
        return err;
    }

    /* Adjust camera settings for indoor use */
    sensor_t *s = esp_camera_sensor_get();
    s->set_brightness(s, 1);
    s->set_contrast(s, 1);
    s->set_gainceiling(s, GAINCEILING_4X);

    ESP_LOGI(TAG, "Camera initialized: 96x96 grayscale");
    return ESP_OK;
}

Key configuration choices:

  • PIXFORMAT_GRAYSCALE: Outputs 1 byte per pixel instead of 2 (RGB565) or 3 (RGB888). Grayscale is sufficient for person detection and many classification tasks, and it cuts the image buffer and model input to half the size of RGB565 and a third of RGB888.
  • FRAMESIZE_96X96: Matches the model input size directly, avoiding the need for a resize step. The OV2640 supports this resolution natively.
  • CAMERA_FB_IN_PSRAM: Stores the frame buffer in PSRAM, keeping the internal SRAM free for the TFLite Micro tensor arena.

Image Preprocessing Pipeline



Even with the camera configured to output 96x96 grayscale, the raw pixel values need normalization before they can be fed to the model.

/* Preprocess a grayscale frame buffer for model input.
 * The model expects int8 values in [-128, 127].
 * Raw pixels are uint8 [0, 255].
 * Mapping: int8_value = pixel - 128 */
static void preprocess_grayscale(const uint8_t *src,
                                 int8_t *dst,
                                 size_t width,
                                 size_t height)
{
    size_t total = width * height;
    for (size_t i = 0; i < total; i++) {
        dst[i] = (int8_t)(src[i] - 128);
    }
}
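For host-side validation it helps to have the identical transform in Python. A vectorized NumPy equivalent of the C loop:

```python
import numpy as np

def preprocess_grayscale_np(img_u8):
    """uint8 [0, 255] -> int8 [-128, 127], same mapping as the C loop."""
    # Widen to int16 first so the subtraction cannot wrap around.
    return (img_u8.astype(np.int16) - 128).astype(np.int8)

pixels = np.array([0, 128, 255], dtype=np.uint8)
print(preprocess_grayscale_np(pixels).tolist())  # [-128, 0, 127]
```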

If you use a higher camera resolution and need to resize:

/* Bilinear interpolation resize from src (src_w x src_h)
 * to dst (dst_w x dst_h), grayscale */
static void resize_bilinear(const uint8_t *src, int src_w, int src_h,
                            uint8_t *dst, int dst_w, int dst_h)
{
    float x_ratio = (float)(src_w - 1) / (dst_w - 1);
    float y_ratio = (float)(src_h - 1) / (dst_h - 1);
    for (int y = 0; y < dst_h; y++) {
        float src_y = y * y_ratio;
        int y0 = (int)src_y;
        int y1 = y0 + 1;
        if (y1 >= src_h) y1 = src_h - 1;
        float fy = src_y - y0;
        for (int x = 0; x < dst_w; x++) {
            float src_x = x * x_ratio;
            int x0 = (int)src_x;
            int x1 = x0 + 1;
            if (x1 >= src_w) x1 = src_w - 1;
            float fx = src_x - x0;
            float val =
                src[y0 * src_w + x0] * (1 - fx) * (1 - fy) +
                src[y0 * src_w + x1] * fx * (1 - fy) +
                src[y1 * src_w + x0] * (1 - fx) * fy +
                src[y1 * src_w + x1] * fx * fy;
            dst[y * dst_w + x] = (uint8_t)(val + 0.5f);
        }
    }
}
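A NumPy reference makes it easy to unit-test the C routine on the host against known inputs. This sketch uses the same corner-aligned mapping as the C version:

```python
import numpy as np

def resize_bilinear_np(src, dst_h, dst_w):
    """Corner-aligned bilinear resize for a 2-D uint8 array."""
    src_h, src_w = src.shape
    ys = np.linspace(0, src_h - 1, dst_h)
    xs = np.linspace(0, src_w - 1, dst_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, src_h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, src_w - 1)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    a = src[np.ix_(y0, x0)]
    b = src[np.ix_(y0, x1)]
    c = src[np.ix_(y1, x0)]
    d = src[np.ix_(y1, x1)]
    out = a*(1-fx)*(1-fy) + b*fx*(1-fy) + c*(1-fx)*fy + d*fx*fy
    return (out + 0.5).astype(np.uint8)

src = np.arange(16, dtype=np.uint8).reshape(4, 4)
out = resize_bilinear_np(src, 2, 2)
# Corner pixels survive the resize exactly: out[0,0]==0, out[1,1]==15
```

Feeding the same test images through both implementations and comparing outputs is a cheap way to catch off-by-one errors in the C code.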

Training a Small Classifier



We train two models: a custom small CNN for person detection (binary) and a MobileNetV1 0.25 for multi-class object classification (transfer learning). Both use grayscale 96x96 input.

Person Detection CNN

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import os
IMG_SIZE = 96
def build_person_detector():
    """Small CNN for binary person/no-person classification."""
    model = keras.Sequential([
        layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
        # Block 1: 96x96 -> 48x48
        layers.Conv2D(8, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        # Block 2: 48x48 -> 24x24
        layers.Conv2D(16, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        # Block 3: 24x24 -> 12x12
        layers.Conv2D(16, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        # Block 4: 12x12 -> 6x6
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        # Classification head
        layers.GlobalAveragePooling2D(),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(2, activation='softmax'),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model

model = build_person_detector()
model.summary()

Expected model summary:

Layer (type) Output Shape Param #
================================================================
conv2d (Conv2D) (None, 96, 96, 8) 80
max_pooling2d (None, 48, 48, 8) 0
conv2d_1 (Conv2D) (None, 48, 48, 16) 1168
max_pooling2d_1 (None, 24, 24, 16) 0
conv2d_2 (Conv2D) (None, 24, 24, 16) 2320
max_pooling2d_2 (None, 12, 12, 16) 0
conv2d_3 (Conv2D) (None, 12, 12, 32) 4640
max_pooling2d_3 (None, 6, 6, 32) 0
global_average_pooling2d (None, 32) 0
dense (Dense) (None, 16) 528
dropout (None, 16) 0
dense_1 (Dense) (None, 2) 34
================================================================
Total params: 8,770
Trainable params: 8,770
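Those parameter counts can be verified by hand: a k x k convolution has k*k*c_in*c_out weights plus c_out biases, and a dense layer has n_in*n_out weights plus n_out biases. In Python:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution with bias."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Parameters of a fully connected layer with bias."""
    return n_in * n_out + n_out

total = (conv_params(3, 1, 8)      # 80
         + conv_params(3, 8, 16)   # 1168
         + conv_params(3, 16, 16)  # 2320
         + conv_params(3, 16, 32)  # 4640
         + dense_params(32, 16)    # 528
         + dense_params(16, 2))    # 34
print(total)  # 8770
```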

Dataset Preparation

The Visual Wake Words (VWW) dataset is a subset of MS COCO with binary labels (person present / no person). It is specifically designed for TinyML benchmarking.

import tensorflow_datasets as tfds
# Load Visual Wake Words dataset
# If VWW is not available, use a person/no-person subset of CIFAR or a custom dataset
def load_vww_dataset():
    """Load and preprocess the Visual Wake Words dataset."""
    try:
        ds_train, ds_val = tfds.load(
            'visual_wake_words',
            split=['train', 'val'],
            as_supervised=True)
    except Exception:
        print("VWW not available in tfds. Using alternative approach.")
        return create_custom_person_dataset()

    def preprocess(image, label):
        image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
        image = tf.image.rgb_to_grayscale(image)
        image = tf.cast(image, tf.float32) / 255.0
        return image, label

    ds_train = ds_train.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    ds_val = ds_val.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    return ds_train, ds_val

def create_custom_person_dataset():
    """Alternative: create dataset from a directory of images."""
    ds_train = keras.utils.image_dataset_from_directory(
        'person_dataset/train',
        labels='inferred',
        label_mode='int',
        color_mode='grayscale',
        image_size=(IMG_SIZE, IMG_SIZE),
        batch_size=None)
    ds_val = keras.utils.image_dataset_from_directory(
        'person_dataset/val',
        labels='inferred',
        label_mode='int',
        color_mode='grayscale',
        image_size=(IMG_SIZE, IMG_SIZE),
        batch_size=None)

    def normalize(image, label):
        return tf.cast(image, tf.float32) / 255.0, label

    return (ds_train.map(normalize, num_parallel_calls=tf.data.AUTOTUNE),
            ds_val.map(normalize, num_parallel_calls=tf.data.AUTOTUNE))

ds_train, ds_val = load_vww_dataset()

# Data augmentation
data_augmentation = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomBrightness(0.2),
    layers.RandomContrast(0.2),
])
ds_train_aug = ds_train.map(
    lambda x, y: (data_augmentation(x, training=True), y),
    num_parallel_calls=tf.data.AUTOTUNE)

ds_train_batched = ds_train_aug.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
ds_val_batched = ds_val.batch(32).prefetch(tf.data.AUTOTUNE)

# Train
history = model.fit(
    ds_train_batched,
    validation_data=ds_val_batched,
    epochs=30,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    ])

test_loss, test_acc = model.evaluate(ds_val_batched)
print(f"Validation accuracy: {test_acc:.4f}")

MobileNetV1 0.25 with Transfer Learning

For multi-class classification (e.g., 5 object categories), we use MobileNetV1 with width multiplier 0.25 and transfer learning from ImageNet weights.

def build_mobilenet_classifier(num_classes=5):
    """MobileNetV1 alpha=0.25 with transfer learning."""
    # MobileNetV1 expects RGB input, but we convert grayscale to 3-channel
    base_model = keras.applications.MobileNet(
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
        alpha=0.25,
        depth_multiplier=1,
        include_top=False,
        weights='imagenet',
        pooling='avg')
    # Freeze base layers initially
    base_model.trainable = False

    model = keras.Sequential([
        layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
        # Convert grayscale to 3-channel for MobileNet
        layers.Conv2D(3, 1, padding='same', activation='linear',
                      name='gray_to_rgb'),
        base_model,
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model, base_model

# Train in two phases
# Phase 1: Train only the new layers (feature extraction)
model_mob, base = build_mobilenet_classifier(num_classes=5)
# Assume ds_train_5class and ds_val_5class are prepared
# with 5 object categories from a custom dataset
model_mob.fit(ds_train_batched, validation_data=ds_val_batched,
              epochs=10)

# Phase 2: Fine-tune the last few layers
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False
model_mob.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
model_mob.fit(ds_train_batched, validation_data=ds_val_batched,
              epochs=20,
              callbacks=[
                  keras.callbacks.EarlyStopping(patience=5,
                                                restore_best_weights=True)])

Quantization for ESP32’s Memory Budget



def quantize_model(keras_model, representative_data, output_path):
    """Full int8 quantization for ESP32 deployment."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    tflite_model = converter.convert()
    with open(output_path, 'wb') as f:
        f.write(tflite_model)
    print(f"Model saved: {output_path} ({len(tflite_model)} bytes)")
    return tflite_model

# Representative dataset for calibration
def make_representative_dataset(dataset, num_samples=100):
    def gen():
        for img, _ in dataset.take(num_samples):
            yield [tf.expand_dims(img, 0)]
    return gen

# Quantize person detector
person_tflite = quantize_model(
    model,
    make_representative_dataset(ds_val),
    'person_detect_model.tflite')

# Convert to C header
import subprocess
subprocess.run(['xxd', '-i', 'person_detect_model.tflite',
                'person_detect_model.h'])
print("C header generated: person_detect_model.h")
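xxd is not installed by default on all platforms (notably Windows). If it is unavailable, a small pure-Python function produces an equivalent C array; the variable and length names below follow xxd's convention of replacing dots with underscores, and the exact formatting is illustrative:

```python
def tflite_to_c_header(tflite_bytes, var_name):
    """Emit C source declaring the model as a byte array, xxd -i style."""
    lines = [f"unsigned char {var_name}[] = {{"]
    for i in range(0, len(tflite_bytes), 12):
        chunk = tflite_bytes[i:i + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"unsigned int {var_name}_len = {len(tflite_bytes)};")
    return "\n".join(lines)

# Usage (model bytes from the converter above):
# header = tflite_to_c_header(person_tflite, "person_detect_model_tflite")
# with open("person_detect_model.h", "w") as f:
#     f.write(header)
```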

Verifying Quantized Model Accuracy

# Load and test the quantized model
interpreter = tf.lite.Interpreter(model_path='person_detect_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(f"Input: {input_details[0]['shape']}, "
      f"dtype: {input_details[0]['dtype']}")
print(f"Output: {output_details[0]['shape']}, "
      f"dtype: {output_details[0]['dtype']}")
print(f"Input scale: {input_details[0]['quantization'][0]:.6f}, "
      f"zero_point: {input_details[0]['quantization'][1]}")

# Test accuracy
correct = 0
total = 0
for img, label in ds_val.take(200):
    # Quantize input
    scale = input_details[0]['quantization'][0]
    zp = input_details[0]['quantization'][1]
    img_np = img.numpy()
    quantized = np.clip(img_np / scale + zp, -128, 127).astype(np.int8)
    quantized = np.expand_dims(quantized, 0)
    interpreter.set_tensor(input_details[0]['index'], quantized)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]['index'])
    predicted = np.argmax(output)
    if predicted == label.numpy():
        correct += 1
    total += 1

print(f"Quantized model accuracy: {correct / total:.4f} ({correct}/{total})")

Deploying with TFLM on ESP32-CAM



Complete Inference Firmware

#include <stdio.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"
#include "esp_timer.h"
#include "esp_camera.h"
#include "esp_heap_caps.h"
#include "driver/gpio.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "person_detect_model.h"
static const char *TAG = "cam_classify";
/* Camera pins (AI-Thinker ESP32-CAM) */
#define CAM_PIN_PWDN 32
#define CAM_PIN_RESET -1
#define CAM_PIN_XCLK 0
#define CAM_PIN_SIOD 26
#define CAM_PIN_SIOC 27
#define CAM_PIN_D7 35
#define CAM_PIN_D6 34
#define CAM_PIN_D5 39
#define CAM_PIN_D4 36
#define CAM_PIN_D3 21
#define CAM_PIN_D2 19
#define CAM_PIN_D1 18
#define CAM_PIN_D0 5
#define CAM_PIN_VSYNC 25
#define CAM_PIN_HREF 23
#define CAM_PIN_PCLK 22
#define FLASH_LED_PIN 4
#define IMG_WIDTH 96
#define IMG_HEIGHT 96
#define NUM_CLASSES 2
static const char *CLASS_LABELS[] = {"no_person", "person"};
/* TFLite Micro arena - allocated in PSRAM for large models */
#define TENSOR_ARENA_SIZE (150 * 1024)
static uint8_t *s_tensor_arena = NULL;
/* ---- Camera initialization ---- */
static esp_err_t camera_init(void)
{
    camera_config_t config = {
        .pin_pwdn = CAM_PIN_PWDN,
        .pin_reset = CAM_PIN_RESET,
        .pin_xclk = CAM_PIN_XCLK,
        .pin_sccb_sda = CAM_PIN_SIOD,
        .pin_sccb_scl = CAM_PIN_SIOC,
        .pin_d7 = CAM_PIN_D7,
        .pin_d6 = CAM_PIN_D6,
        .pin_d5 = CAM_PIN_D5,
        .pin_d4 = CAM_PIN_D4,
        .pin_d3 = CAM_PIN_D3,
        .pin_d2 = CAM_PIN_D2,
        .pin_d1 = CAM_PIN_D1,
        .pin_d0 = CAM_PIN_D0,
        .pin_vsync = CAM_PIN_VSYNC,
        .pin_href = CAM_PIN_HREF,
        .pin_pclk = CAM_PIN_PCLK,
        .xclk_freq_hz = 20000000,
        .ledc_timer = LEDC_TIMER_0,
        .ledc_channel = LEDC_CHANNEL_0,
        .pixel_format = PIXFORMAT_GRAYSCALE,
        .frame_size = FRAMESIZE_96X96,
        .jpeg_quality = 12,
        .fb_count = 1,
        .fb_location = CAMERA_FB_IN_PSRAM,
        .grab_mode = CAMERA_GRAB_WHEN_EMPTY,
    };

    esp_err_t err = esp_camera_init(&config);
    if (err != ESP_OK) {
        ESP_LOGE(TAG, "Camera init failed: 0x%x", err);
        return err;
    }

    sensor_t *s = esp_camera_sensor_get();
    s->set_brightness(s, 1);
    s->set_contrast(s, 1);
    ESP_LOGI(TAG, "Camera initialized: %dx%d grayscale",
             IMG_WIDTH, IMG_HEIGHT);
    return ESP_OK;
}

/* ---- Flash LED ---- */
static void flash_led_init(void)
{
    gpio_config_t io_conf = {
        .pin_bit_mask = (1ULL << FLASH_LED_PIN),
        .mode = GPIO_MODE_OUTPUT,
    };
    gpio_config(&io_conf);
    gpio_set_level(FLASH_LED_PIN, 0);
}

static void flash_led_pulse(int duration_ms)
{
    gpio_set_level(FLASH_LED_PIN, 1);
    vTaskDelay(pdMS_TO_TICKS(duration_ms));
    gpio_set_level(FLASH_LED_PIN, 0);
}
/* ---- Classification task ---- */
static void classification_task(void *arg)
{
    /* Allocate tensor arena in PSRAM */
    s_tensor_arena = (uint8_t *)heap_caps_malloc(
        TENSOR_ARENA_SIZE, MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT);
    if (s_tensor_arena == NULL) {
        ESP_LOGE(TAG, "Failed to allocate tensor arena in PSRAM");
        /* Fall back to internal RAM */
        s_tensor_arena = (uint8_t *)heap_caps_malloc(
            TENSOR_ARENA_SIZE, MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT);
        if (s_tensor_arena == NULL) {
            ESP_LOGE(TAG, "Failed to allocate tensor arena");
            vTaskDelete(NULL);
            return;
        }
        ESP_LOGW(TAG, "Tensor arena allocated in internal RAM");
    } else {
        ESP_LOGI(TAG, "Tensor arena allocated in PSRAM (%d KB)",
                 TENSOR_ARENA_SIZE / 1024);
    }

    /* Initialize TFLite Micro */
    const tflite::Model *model =
        tflite::GetModel(person_detect_model_tflite);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        ESP_LOGE(TAG, "Model schema version mismatch: %lu vs %d",
                 (unsigned long)model->version(), TFLITE_SCHEMA_VERSION);
        vTaskDelete(NULL);
        return;
    }

    static tflite::MicroMutableOpResolver<8> resolver;
    resolver.AddConv2D();
    resolver.AddMaxPool2D();
    resolver.AddReshape();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    resolver.AddMean();            /* For GlobalAveragePooling */
    resolver.AddDepthwiseConv2D(); /* For MobileNet */
    resolver.AddPad();

    static tflite::MicroInterpreter interpreter(
        model, resolver, s_tensor_arena, TENSOR_ARENA_SIZE);
    if (interpreter.AllocateTensors() != kTfLiteOk) {
        ESP_LOGE(TAG, "AllocateTensors failed");
        vTaskDelete(NULL);
        return;
    }

    TfLiteTensor *input = interpreter.input(0);
    TfLiteTensor *output = interpreter.output(0);
    ESP_LOGI(TAG, "Model loaded. Arena used: %zu / %d bytes",
             interpreter.arena_used_bytes(), TENSOR_ARENA_SIZE);
    ESP_LOGI(TAG, "Input: [%d, %d, %d, %d], scale=%.6f, zp=%d",
             input->dims->data[0], input->dims->data[1],
             input->dims->data[2], input->dims->data[3],
             input->params.scale, input->params.zero_point);

    /* Classification loop */
    int frame_count = 0;
    int64_t total_inference_us = 0;
    while (1) {
        /* Capture frame */
        camera_fb_t *fb = esp_camera_fb_get();
        if (fb == NULL) {
            ESP_LOGE(TAG, "Camera capture failed");
            vTaskDelay(pdMS_TO_TICKS(100));
            continue;
        }

        /* Verify frame dimensions */
        if (fb->width != IMG_WIDTH || fb->height != IMG_HEIGHT) {
            ESP_LOGE(TAG, "Unexpected frame size: %dx%d",
                     fb->width, fb->height);
            esp_camera_fb_return(fb);
            continue;
        }

        /* Preprocess: uint8 [0,255] to int8 [-128,127] */
        int8_t *input_data = input->data.int8;
        for (size_t i = 0; i < fb->len; i++) {
            input_data[i] = (int8_t)(fb->buf[i] - 128);
        }
        esp_camera_fb_return(fb);

        /* Run inference */
        int64_t t_start = esp_timer_get_time();
        TfLiteStatus status = interpreter.Invoke();
        int64_t t_end = esp_timer_get_time();
        int64_t inference_us = t_end - t_start;
        if (status != kTfLiteOk) {
            ESP_LOGE(TAG, "Inference failed");
            continue;
        }

        /* Dequantize output and find best class */
        float out_scale = output->params.scale;
        int out_zp = output->params.zero_point;
        int8_t *out_data = output->data.int8;
        int best_class = 0;
        float best_score = -100.0f;
        for (int i = 0; i < NUM_CLASSES; i++) {
            float score = (out_data[i] - out_zp) * out_scale;
            if (score > best_score) {
                best_score = score;
                best_class = i;
            }
        }

        frame_count++;
        total_inference_us += inference_us;
        ESP_LOGI(TAG, "[%d] %s (%.2f), Infer: %lld ms",
                 frame_count, CLASS_LABELS[best_class], best_score,
                 (long long)(inference_us / 1000));

        /* Flash LED on person detection */
        if (best_class == 1 && best_score > 0.7f) {
            flash_led_pulse(50);
        }

        /* Print FPS every 10 frames */
        if (frame_count % 10 == 0) {
            float avg_ms = (total_inference_us / frame_count) / 1000.0f;
            float fps = 1000.0f / avg_ms;
            ESP_LOGI(TAG, "Avg inference: %.1f ms (%.1f FPS)",
                     avg_ms, fps);
        }
    }
}
/* ---- Entry point ---- */
extern "C" void app_main(void)
{
    ESP_LOGI(TAG, "ESP32-CAM Image Classifier starting");

    /* Print memory info */
    ESP_LOGI(TAG, "Free internal RAM: %zu bytes",
             heap_caps_get_free_size(MALLOC_CAP_INTERNAL));
    ESP_LOGI(TAG, "Free PSRAM: %zu bytes",
             heap_caps_get_free_size(MALLOC_CAP_SPIRAM));

    flash_led_init();
    if (camera_init() != ESP_OK) {
        ESP_LOGE(TAG, "Camera initialization failed. Halting.");
        return;
    }

    /* Start classification on core 1 (core 0 handles Wi-Fi if needed) */
    xTaskCreatePinnedToCore(classification_task, "classify",
                            8192, NULL, 5, NULL, 1);
}

Project Configuration

# Top-level CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(cam-classifier)

# main/CMakeLists.txt
idf_component_register(SRCS "main.cc"
                       INCLUDE_DIRS "."
                       REQUIRES esp_timer driver)

# main/idf_component.yml
dependencies:
  espressif/esp32-camera: "~2.0.0"
  espressif/esp-tflite-micro: "~1.3.1"

sdkconfig Defaults

Add these to sdkconfig.defaults to enable PSRAM and allocate sufficient memory:

CONFIG_ESP32_SPIRAM_SUPPORT=y
CONFIG_SPIRAM=y
CONFIG_SPIRAM_USE_MALLOC=y
CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=4096
CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=32768
CONFIG_CAMERA_TASK_STACK_SIZE=4096

Person Detection Demo



  1. Build and flash the firmware:

    cd cam-classifier
    idf.py set-target esp32
    idf.py build
    idf.py -p /dev/ttyUSB0 flash monitor
  2. Point the camera at a person. The serial monitor shows:

    I (1234) cam_classify: [1] person (0.91), Infer: 198 ms
    I (1432) cam_classify: [2] person (0.88), Infer: 195 ms
    I (1630) cam_classify: [3] person (0.85), Infer: 201 ms
  3. Point the camera at an empty room:

    I (1834) cam_classify: [4] no_person (0.94), Infer: 196 ms
    I (2032) cam_classify: [5] no_person (0.92), Infer: 199 ms
  4. The onboard flash LED blinks briefly each time a person is detected with confidence above 0.7.

Object Classification Demo



To switch from person detection to multi-class object classification, replace the model header with the MobileNetV1 0.25 model, update the class labels and count, and increase the tensor arena if needed:

/* For MobileNetV1 0.25 multi-class model */
#include "mobilenet_model.h"

#define NUM_CLASSES 5
static const char *CLASS_LABELS[] = {
    "cup", "book", "phone", "plant", "shoe"
};

/* MobileNet needs more arena space */
#define TENSOR_ARENA_SIZE (300 * 1024)

The MobileNetV1 0.25 model is larger (~250 KB) and slower (~800 ms per inference) but significantly more capable. With transfer learning from ImageNet, it can distinguish between 5 to 10 object categories with reasonable accuracy (85% or better) even at 96x96 grayscale resolution.

Memory Optimization Techniques



PSRAM Usage Strategy

The ESP32-CAM has 4 MB of PSRAM, but PSRAM access is slower than internal SRAM (roughly 3x slower for random access). The strategy is:

| Data | Location | Reason |
| --- | --- | --- |
| Camera frame buffer | PSRAM | Large (9 KB+), infrequently accessed |
| Tensor arena | PSRAM | Large (100-300 KB), accessed linearly |
| Model weights | Flash (mmap) | Read-only, accessed by TFLM |
| TFLite interpreter state | Internal SRAM | Small, frequently accessed |
| Stack and task memory | Internal SRAM | Performance critical |

/* Print detailed memory breakdown */
static void print_memory_info(void)
{
    ESP_LOGI(TAG, "=== Memory Report ===");
    ESP_LOGI(TAG, "Internal free: %6zu bytes",
             heap_caps_get_free_size(MALLOC_CAP_INTERNAL));
    ESP_LOGI(TAG, "Internal largest block: %6zu bytes",
             heap_caps_get_largest_free_block(MALLOC_CAP_INTERNAL));
    ESP_LOGI(TAG, "PSRAM free: %6zu bytes",
             heap_caps_get_free_size(MALLOC_CAP_SPIRAM));
    ESP_LOGI(TAG, "PSRAM largest block: %6zu bytes",
             heap_caps_get_largest_free_block(MALLOC_CAP_SPIRAM));
    ESP_LOGI(TAG, "====================");
}

Model Partitioning

For very large models that do not fit entirely in the tensor arena, you can split the model into two parts:

  1. Feature extractor (convolutional layers): runs first, outputs a small feature vector
  2. Classifier (dense layers): runs second, takes the feature vector as input

This reduces the peak memory usage because only one part’s intermediate activations exist at a time. However, it requires modifying the TFLite model export to produce two separate models, which adds complexity.
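The potential saving can be estimated before doing any model surgery. As a first-order approximation, the arena must hold at least the largest input-plus-output pair of consecutive activation tensors; with two sequential sub-models sharing one arena, the requirement drops to the larger of the two parts' peaks. A sketch with hypothetical activation sizes (bytes, int8):

```python
# Hypothetical int8 activation sizes (bytes) along a CNN, in order.
feature_extractor_acts = [9216, 73728, 36864, 9216, 4608, 1152]
classifier_acts = [1152, 512, 64, 8]  # starts from the feature vector

def peak_pair(acts):
    """Largest input+output pair of consecutive tensors."""
    return max(a + b for a, b in zip(acts, acts[1:]))

full = peak_pair(feature_extractor_acts + classifier_acts[1:])
split = max(peak_pair(feature_extractor_acts), peak_pair(classifier_acts))
print(full, split)  # the split peak is never larger than the full peak
```

In practice TFLM's memory planner already reuses buffers aggressively, so the split only pays off when the planner must keep tensors from both halves alive at once; run the numbers before committing to the added complexity.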

Reducing Arena Size

If the model barely fits, try these techniques:

  • Reduce input resolution. Going from 96x96 to 64x64 reduces the input by 55% and proportionally reduces the first few layers’ activation sizes.
  • Use depthwise separable convolutions. MobileNet already does this; for custom CNNs, replace standard Conv2D layers with DepthwiseConv2D + Conv2D(1x1).
  • Reduce the number of filters. The first convolutional layer’s filter count multiplied by the input resolution determines the first activation map size.
  • Use stride instead of pooling. Replace MaxPooling with stride-2 convolutions to reduce spatial dimensions and activation size simultaneously.
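The depthwise separable substitution in particular is worth quantifying: for a k x k convolution, the parameter count (and, to first order, the MACs per output pixel) drops by roughly a factor of k^2 * c_out / (k^2 + c_out). A quick check for the example layer sizes used in this lesson:

```python
def standard_conv_params(k, c_in, c_out):
    """k x k convolution with bias."""
    return k * k * c_in * c_out + c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k (one filter per channel) + pointwise 1x1, with biases."""
    depthwise = k * k * c_in + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise

std = standard_conv_params(3, 16, 32)   # 4640
sep = separable_conv_params(3, 16, 32)  # 160 + 544 = 704
print(std, sep)  # roughly a 6.6x reduction here
```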

Performance Results



| Model | Size (int8) | Arena | Inference | FPS | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Person CNN | 30 KB | 100 KB | ~200 ms | ~5 | 89% |
| MobileNetV1 0.25 (5 class) | 250 KB | 280 KB | ~800 ms | ~1.2 | 86% |
| MobileNetV1 0.25 (person) | 250 KB | 280 KB | ~800 ms | ~1.2 | 92% |

Notes:

  • Inference time measured at 240 MHz, tensor arena in PSRAM.
  • FPS includes camera capture time (~5 ms) plus inference time.
  • Accuracy measured on held-out validation sets after int8 quantization.
  • Moving the tensor arena to internal SRAM (when it fits) improves inference time by approximately 30%.

Project File Structure



cam-classifier/
├── CMakeLists.txt
├── sdkconfig.defaults
└── main/
    ├── CMakeLists.txt
    ├── idf_component.yml
    ├── main.cc
    ├── person_detect_model.h
    └── mobilenet_model.h

Course Wrap-Up and Next Steps



Over nine lessons this course takes you from deploying a pre-trained sine wave model to building a real-time image classifier on an ESP32-CAM, and finally to connecting edge inference to cloud infrastructure. Here is the full arc:

| Lesson | Skill Gained | Model Type | Platform |
| --- | --- | --- | --- |
| 1 | TinyML pipeline, first deployment | Sine predictor | ESP32 |
| 2 | Edge Impulse workflow | Motion classifier | ESP32 |
| 3 | TFLite Micro cross-platform | Gesture classifier | ESP32, STM32 |
| 4 | Quantization (PTQ, QAT) | Comparison bench | ESP32 |
| 5 | Audio ML, MFCC features | Wake word CNN | ESP32 |
| 6 | IMU data pipeline, dual-platform | Gesture dense NN | Pico, STM32 |
| 7 | Anomaly detection, autoencoders | Vibration autoencoder | ESP32 |
| 8 | Vision ML, PSRAM management | Person/object CNN | ESP32-CAM |
| 9 | Edge-cloud hybrid, OTA models | Tiered inference system | ESP32 + Cloud |

Where to Go From Here

Lesson 9: Edge-Cloud Hybrid Architectures. The next lesson ties edge inference to cloud infrastructure. You will build a system where the ESP32 runs local classification, escalates uncertain results to the cloud for a larger model, and receives updated models via OTA. If you completed the IoT Systems course, you already have the MQTT, dashboard, and REST API skills that Lesson 9 builds on.

Combine modalities. A device that runs both a wake word detector and a camera classifier can respond to voice commands with visual confirmation. The ESP32’s dual cores make this feasible: one core handles audio, the other handles vision.

Explore on-device learning. Transfer learning on the MCU itself is an active research area. Frameworks like TinyOL (Tiny On-device Learning) allow updating the last layer’s weights based on new data collected in the field, without sending data to the cloud.
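The core idea is simpler than it sounds: freeze the feature extractor and run plain SGD on the final dense layer as labeled samples arrive. A NumPy sketch of one update step (this is the underlying math, not TinyOL's actual API; the feature vector would come from the frozen model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def update_last_layer(W, b, features, label, lr=0.01):
    """One SGD step on a dense softmax layer (on-device learning sketch)."""
    probs = softmax(W @ features + b)
    grad = probs.copy()
    grad[label] -= 1.0            # d(loss)/d(logits) for cross-entropy
    W -= lr * np.outer(grad, features)
    b -= lr * grad
    return W, b

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 32)) * 0.1   # 2 classes, 32-dim frozen features
b = np.zeros(2)
feat = rng.normal(size=32)
W, b = update_last_layer(W, b, feat, label=1)
# Each step nudges the layer toward the observed label.
```

On an MCU the same arithmetic runs in a few thousand multiply-accumulates, which is why last-layer adaptation is feasible even where full backpropagation is not.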

Scale to production. For commercial IoT products, look into model signing (ensuring the model has not been tampered with), OTA model updates (without reflashing the entire firmware), and power profiling (measuring real battery life under realistic inference workloads).

Try different hardware. The ESP32-S3 has vector instructions that accelerate int8 MAC operations, potentially doubling inference speed for CNN models. The Nordic nRF5340 has a dual-core Cortex-M33 with DSP extensions. Each platform offers different tradeoffs between power, performance, and connectivity.

Exercises



  1. Add a web server for live classification. Extend the firmware to serve a simple web page over Wi-Fi that displays the latest camera frame (as JPEG) and the classification result. Use the ESP-IDF HTTP server component. The page should auto-refresh every 2 seconds using a meta refresh tag or JavaScript fetch.

  2. Build a custom 3-class classifier. Collect 200 images each of three objects on your desk (e.g., coffee mug, water bottle, keyboard) using the ESP32-CAM’s JPEG mode. Transfer the images to your PC, train a small CNN, quantize it, and deploy. Measure the accuracy difference between your custom CNN and MobileNetV1 0.25 for this specific task.

  3. Implement a person counter. Instead of just detecting person/no-person, count the number of times a person enters and exits the frame. Use a simple state machine: if the previous frame was “no_person” and the current frame is “person”, increment the entry counter. If the previous was “person” and current is “no_person”, increment the exit counter. Display the counts over serial.

  4. Optimize for speed with internal SRAM. For the person detection CNN (which needs ~100 KB arena), try allocating the tensor arena in internal SRAM instead of PSRAM. Measure the inference time improvement. Then try a hybrid approach: allocate the arena in PSRAM but copy the input tensor to internal SRAM before inference. Document the performance difference for each approach.

  5. Compare ESP32 and ESP32-S3. If you have access to an ESP32-S3-CAM module, port the person detection firmware and measure the inference time. The S3’s vector instructions should provide a measurable speedup for the int8 convolution operations. Document the exact speedup and whether it changes the practical FPS.

Summary



You deployed image classification models on the most constrained vision platform in this course: the ESP32-CAM with its OV2640 camera and 4 MB PSRAM. A custom 4-layer CNN for person detection achieved ~89% accuracy at 5 FPS with a 30 KB quantized model. A MobileNetV1 0.25 with transfer learning handled 5-class object classification at ~86% accuracy and 1.2 FPS with a 250 KB model. PSRAM management was the central engineering challenge: the camera frame buffer and tensor arena both reside in PSRAM to keep internal SRAM free for the interpreter and task stacks. Preprocessing was straightforward (grayscale pixel shift from uint8 to int8), and the OV2640’s native 96x96 output mode eliminated the need for software resizing. This lesson demonstrated that useful computer vision is possible on a microcontroller that costs a few dollars and runs on milliwatts. Lesson 9 takes this further by connecting edge inference to cloud infrastructure for tiered classification, model retraining, and OTA updates.



© 2021-2026 SiliconWit®. All rights reserved.