Double-Model Application Development Guide#

Overview#

In single_model_example.md, we introduced how to develop and run a single-model AI application on K230. Multi-model applications can be built on top of the same pattern. This document uses face recognition as the reference case to explain how to develop a double-model inference application.

Compared with a single-model application, the main difference is that a double-model application must load and coordinate multiple models and organize the inference flow between them.

Development Guide#

Convert `kmodel`#

First, prepare the required kmodel files. For the face-recognition example, you can reuse the models under:

src/rtsmart/examples/ai/face_recognition/utils

Typical files include:

face_detection_320.kmodel
face_recognition.kmodel

If you want to use your own model, train it first (for example, a PyTorch pt or pth model), export it to onnx or tflite, and then convert the exported model to kmodel.

For the conversion flow, refer to:

nncase_compile.md

See also the official nncase repositories.

Develop Deployment Code#

Modules and Flow#

Involved Modules:

vicap (video input capture): Configures camera sensor properties and channel attributes, including resolution, frame rate, and data format. It binds one camera channel to the display and supplies frames for AI inference.
vo (video output): Configures display device and layer attributes, including position, resolution, frame rate, and data format. It displays camera frames or other input in real time through the video and OSD layers. The video layer supports YUV format only, while the OSD layer supports RGB format only.
kpu: Loads kmodel, configures input/output tensors, and performs model inference.
ai2d: Performs preprocessing on model input images. See usage_ai2d for details.

Processing Flow:

The double-model AI application uses a single-camera, dual-channel approach:

Display Channel: One image stream is bound directly to the screen for real-time, low-latency display.
AI Channel: A second image stream feeds AI inference, which produces both detection and recognition results.

After inference completes, results are drawn onto a transparent OSD layer and merged with the live display. Users see the original image with both detection boxes and recognition results.

As with single-model applications, this structure avoids the bottleneck of the traditional serial pipeline, in which every frame passes through all of the following steps before it is displayed:

Capture image → Create tensor → Preprocess → Inference → Postprocess → Draw results → Display

If multi-model inference takes significant time, the traditional pipeline makes the displayed image stutter. Display is therefore separated from AI inference: the live image is shown first, and the multi-model inference results are drawn and merged asynchronously.

Serial vs. Parallel Inference:

The reference diagrams (double_model_pipeline) show two arrangements:

Serial double-model inference (detection, then recognition).
Parallel double-model inference.

Important Note

KPU inference is exclusive: only one inference can run on the KPU at a time. If you implement parallel inference with multiple threads, protect every KPU call with a synchronization lock so that threads never access the KPU simultaneously.
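The locking requirement can be sketched with a plain std::mutex. The following is a minimal illustration only: FakeKpu, guarded_run, and max_concurrency_with_lock are hypothetical names that stand in for the real KPU calls, and the fake device merely counts how many callers are inside it at once so the effect of the lock can be observed.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the shared KPU. It only records the highest
// number of callers that were inside run() at the same time.
struct FakeKpu
{
    std::atomic<int> active{0};
    std::atomic<int> max_active{0};

    void run()
    {
        int now = ++active;
        int prev = max_active.load();
        // Raise max_active to `now` if `now` is larger.
        while (now > prev && !max_active.compare_exchange_weak(prev, now)) {}
        std::this_thread::yield();
        --active;
    }
};

std::mutex g_kpu_mutex;  // one lock shared by every thread that uses the KPU

// Every inference call goes through this wrapper, so calls are serialized.
void guarded_run(FakeKpu &kpu)
{
    std::lock_guard<std::mutex> lock(g_kpu_mutex);
    kpu.run();
}

// Runs two worker threads and reports the peak number of concurrent callers.
int max_concurrency_with_lock(int iterations)
{
    FakeKpu kpu;
    auto worker = [&kpu, iterations]() {
        for (int i = 0; i < iterations; ++i)
            guarded_run(kpu);
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return kpu.max_active.load();
}
```

In a real application the body of guarded_run would be the task's inference call; preprocessing and postprocessing that do not touch the KPU can stay outside the lock.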

Code Structure#

Using the face-recognition reference project as an example, the code structure is:

face_recognition
├── cmake
├── src
│    ├── ai_base.cc         # Model inference wrapper implementation
│    ├── ai_base.h          # Model inference header file
│    ├── ai_utils.cc        # Utility methods for model inference
│    ├── ai_utils.h         # Utility methods header file
│    ├── anchors_320.cc     # Anchors for 320-input face detection model
│    ├── anchors_640.cc     # Anchors for 640-input face detection model
│    ├── face_detection.cc  # Face detection: preprocess, inference, postprocess, drawing
│    ├── face_detection.h   # Face detection header file
│    ├── face_recognition.cc # Face recognition: preprocess, inference, postprocess, drawing
│    │                        # Includes face database init, add face, count, matching interfaces
│    ├── face_recognition.h  # Face recognition header file
│    ├── main.cc            # Main function: orchestrates complete AI application
│    ├── scoped_timing.h    # Timing utility for debugging
│    ├── setting.h          # Configuration macros for display and AI frame resolution
│    ├── video_pipeline.cc  # Single-camera dual-channel implementation
│    ├── video_pipeline.h   # Video pipeline header file
│    └── CMakeLists.txt     # CMakeLists for this task
├── utils                   # Pre-built kmodel and scripts
├── CMakeLists.txt          # Root CMakeLists
├── build_app.sh            # Compilation script
└── Makefile                # Alternative build method

Code Responsibilities#

The responsibilities of the main files are:

File	Description
`ai_base.h`	Declares the common model-inference interfaces
`ai_base.cc`	Implements the common model-inference interfaces
`ai_utils.h`	Declares shared helper functions
`ai_utils.cc`	Implements shared helper functions
`scoped_timing.h`	Provides timing helpers for profiling and debugging
`setting.h`	Defines display and AI-frame configuration macros
`video_pipeline.h`	Declares the single-camera dual-channel pipeline interface
`video_pipeline.cc`	Implements the video pipeline
`face_detection.h`	Declares preprocess, inference, postprocess, and drawing for face detection
`face_detection.cc`	Implements face detection
`face_recognition.h`	Declares recognition, face database, add/query/match interfaces
`face_recognition.cc`	Implements face recognition
`anchors_320.cc`	Anchor data for the `320`-input detection model
`anchors_640.cc`	Anchor data for the `640`-input detection model
`main.cc`	Organizes the complete application logic

When developing a new AI application:

ai_base.* and scoped_timing.h are usually reused without modification.
ai_utils.* can be extended if the existing helper methods are not enough.
setting.h and video_pipeline.* implement camera, display, and OSD initialization together with AI-frame acquisition and OSD overlay. These files usually remain unchanged unless you need a new display type or another camera/display configuration.
face_detection.*, face_recognition.*, and main.cc are the files users usually adapt for a new task flow. The task headers and source files implement the task-specific preprocessing, postprocessing, and drawing logic while reusing the inference path from AIBase. main.cc initializes the task instances and controls the order of the multi-stage inference pipeline.

Code Details#

`setting.h` Configuration#

The macros in setting.h mainly configure camera output, display output, OSD, and AI-frame resolution.

Macro	Description
`ISP_WIDTH`	ISP output width
`ISP_HEIGHT`	ISP output height
`DISPLAY_MODE`	`0`: `1920 x 1080` LT9611, `1`: `800 x 480` ST7701
`DISPLAY_WIDTH`	Display width
`DISPLAY_HEIGHT`	Display height
`DISPLAY_ROTATE`	`0`: no rotation, `1`: rotate 90 degrees
`AI_FRAME_WIDTH`	AI frame width
`AI_FRAME_HEIGHT`	AI frame height
`AI_FRAME_CHANNEL`	AI frame channels
`USE_OSD`	Whether to enable OSD
`OSD_WIDTH`	OSD width
`OSD_HEIGHT`	OSD height
`OSD_CHANNEL`	OSD channels

Typical fragments are:

#define ISP_WIDTH 1920
#define ISP_HEIGHT 1080

This is the camera-side resolution. On top of this source, the image is split into a display branch and an AI branch, and the format and size of each branch can be adjusted independently.

#define DISPLAY_MODE 1
#define DISPLAY_WIDTH 800
#define DISPLAY_HEIGHT 480
#define DISPLAY_ROTATE 1

This branch is sent to the display channel. The exact configuration depends on the target screen and its orientation. The current code supports both LT9611 HDMI 1920 x 1080 and ST7701 LCD 800 x 480.

ST7701 is physically a 480 x 800 portrait panel. The required 90 degree rotation is already handled by the lower vo layer, so you can treat it as a landscape screen in the application code.

#define AI_FRAME_WIDTH 640
#define AI_FRAME_HEIGHT 360
#define AI_FRAME_CHANNEL 3

This is the camera branch used by AI preprocessing. The example outputs PIXEL_FORMAT_RGB_888_PLANAR with shape 3 x 360 x 640 in CHW layout.

Note: The AI channel resolution is the frame size before preprocessing, while the model input resolution is the tensor size after preprocessing. For example, the AI channel may output 640 x 360 while the model requires 320 x 320, so preprocessing is still required before inference.
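The gap between the AI-frame size and the model input size can be made concrete with a small calculation. The sketch below assumes an aspect-preserving resize with centered padding (a common letterbox scheme); the reference example performs preprocessing through ai2d and may pad differently, so LetterboxParams and compute_letterbox are illustrative names only.

```cpp
#include <algorithm>

// Hypothetical result type: how the source frame is scaled and where it is
// placed inside the model input.
struct LetterboxParams
{
    float scale;    // factor applied to both source dimensions
    int resized_w;  // source width after scaling
    int resized_h;  // source height after scaling
    int pad_left;   // padding columns added on the left
    int pad_top;    // padding rows added on the top
};

// Computes an aspect-preserving fit of (src_w x src_h) into (dst_w x dst_h),
// centering the resized image. Centered padding is an assumption of this sketch.
LetterboxParams compute_letterbox(int src_w, int src_h, int dst_w, int dst_h)
{
    float scale = std::min(static_cast<float>(dst_w) / src_w,
                           static_cast<float>(dst_h) / src_h);
    int rw = static_cast<int>(src_w * scale);
    int rh = static_cast<int>(src_h * scale);
    return {scale, rw, rh, (dst_w - rw) / 2, (dst_h - rh) / 2};
}
```

For a 640 x 360 AI frame and a 320 x 320 model input, this gives a scale of 0.5, a resized image of 320 x 180, and 70 rows of padding at the top.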

#define USE_OSD 1
#define OSD_WIDTH 800
#define OSD_HEIGHT 480
#define OSD_CHANNEL 4

This configures the transparent OSD layer. Its resolution must match the display resolution. The OSD frame itself contains only the drawn AI results such as boxes and landmarks. The final visible result comes from overlaying the OSD layer on top of the live display layer.
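Because results are computed in AI-frame coordinates but drawn on the OSD, boxes usually need to be rescaled before drawing. The following is a minimal sketch under that assumption; OsdBox and ai_box_to_osd are hypothetical names, and the reference draw functions may perform this mapping in their own way.

```cpp
// Hypothetical box type: top-left corner plus size, in pixels.
struct OsdBox
{
    int x;
    int y;
    int w;
    int h;
};

// Rescales a box from AI-frame coordinates (ai_w x ai_h) to OSD coordinates
// (osd_w x osd_h). Integer arithmetic truncates toward zero.
OsdBox ai_box_to_osd(const OsdBox &b, int ai_w, int ai_h, int osd_w, int osd_h)
{
    return {b.x * osd_w / ai_w,
            b.y * osd_h / ai_h,
            b.w * osd_w / ai_w,
            b.h * osd_h / ai_h};
}
```

With the values shown above (AI frame 640 x 360, OSD 800 x 480), a box at (64, 36) with size 128 x 72 maps to (80, 48) with size 160 x 96.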

`AIBase` Notes#

AIBase in ai_base.h is the base class used to wrap common model-inference behavior, including model initialization, input/output shape handling, tensor initialization, model execution, and output retrieval.

/**
 * @brief AI base class, wraps nncase-related operations.
 * Later application development mainly needs to focus on preprocess and postprocess.
 */
class AIBase
{
public:
    /**
     * @brief Constructor. Loads the kmodel and initializes model inputs and outputs.
     * @param kmodel_file Path to the kmodel file
     * @param model_name  Model name
     * @param debug_mode  0: no debug, 1: timing only, 2: full debug logs
     */
    AIBase(const char *kmodel_file, const string model_name, const int debug_mode = 1);

    ~AIBase();
    runtime_tensor get_input_tensor(size_t idx);
    void set_input_tensor(size_t idx, runtime_tensor &input_tensor);
    void run();
    void get_output();
    runtime_tensor get_output_tensor(int idx);

protected:
    string model_name_;
    int debug_mode_;
    vector<float *> p_outputs_;
    vector<vector<int>> input_shapes_;
    vector<vector<int>> output_shapes_;

private:
    void set_input_init();
    void set_output_init();

    interpreter kmodel_interp_;
    vector<unsigned char> kmodel_vec_;
};

In application development, the most commonly reused members are:

input_shapes_
output_shapes_
p_outputs_

For example, the pointer to the first output tensor can be obtained with:

float *output0 = p_outputs_[0];
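A raw output pointer is only useful together with its shape. As a hedged illustration, the helpers below show how the element count of an output could be derived from one entry of output_shapes_ and then used to scan the buffer; element_count and argmax are hypothetical helper names, not part of ai_base.h.

```cpp
#include <cstddef>
#include <vector>

// Number of elements described by a shape such as {1, 3, 4}.
// An empty shape is treated as a scalar and yields 1.
std::size_t element_count(const std::vector<int> &shape)
{
    std::size_t n = 1;
    for (int d : shape)
        n *= static_cast<std::size_t>(d);
    return n;
}

// Index of the largest value in a float buffer of `count` elements.
// Returns 0 when count is 0, so callers should check the count first.
std::size_t argmax(const float *data, std::size_t count)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; ++i)
        if (data[i] > data[best])
            best = i;
    return best;
}
```

Inside a task class, a call could look like argmax(p_outputs_[0], element_count(output_shapes_[0])), assuming the first output holds scores.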

Task Header and Source Files#

face_detection.h, face_detection.cc, face_recognition.h, and face_recognition.cc are the core files users usually implement or adapt in a real project.

In your own project, the task files can be renamed to match the scenario:

***.h
***.cc

Each task should define a task class that inherits from AIBase:

class YourTask : public AIBase

That means you inherit the common inference wrapper and complete the task-specific logic yourself.

The task class is usually responsible for four parts:

Module	Need to implement	Description
Preprocess	Yes	Convert the input image to the format required by the model
Inference	Reuse `AIBase`	The common run path is already wrapped
Postprocess	Yes	Convert raw model outputs into usable results
Draw	Yes	Draw the results on the image or OSD

A simplified task header can look like the following. For double-model applications, the same pattern applies to both stage-1 and stage-2 task classes:

typedef struct ExampleResults
{
    // define the task result structure here
} ExampleResults;

class MyApp : public AIBase
{
public:
    /**
     * @brief Constructor for video inference.
     * Loads the kmodel, initializes model inputs and outputs, and configures
     * application-specific parameters such as thresholds.
     */
    MyApp(const char *kmodel_file, /* other task parameters, */ FrameCHWSize image_size, int debug_mode);
    ~MyApp();

    void pre_process(runtime_tensor &input_tensor);
    void inference();
    void post_process(FrameCHWSize image_size, vector<ExampleResults> &results);
    void draw_result(cv::Mat &draw_frame, vector<ExampleResults> &results);

    std::unique_ptr<ai2d_builder> ai2d_builder_;
    runtime_tensor ai2d_out_tensor_;
    FrameCHWSize image_size_;
    FrameCHWSize input_size_;

    // Define other members used by this task as needed.
};

You can follow the implementation under src/rtsmart/examples/ai/face_recognition/src.

`main.cc` Changes#

Flow Overview#

main.cc contains the complete task logic:

get one frame from the camera or load one local image
create the input tensor
call preprocess
call inference
call postprocess
draw the result

The overall flow is illustrated in the double_model_inference_rtos diagram.

Video Inference#

The video loop can be organized as:

FrameCHWSize image_size = {AI_FRAME_CHANNEL, AI_FRAME_HEIGHT, AI_FRAME_WIDTH};
cv::Mat draw_frame(OSD_HEIGHT, OSD_WIDTH, CV_8UC4, cv::Scalar(0, 0, 0, 0));
runtime_tensor input_tensor;
dims_t in_shape {1, AI_FRAME_CHANNEL, AI_FRAME_HEIGHT, AI_FRAME_WIDTH};

PipeLine pl(debug_mode);
pl.Create();
DumpRes dump_res;

// The argv indices in this template are placeholders; map them to your own command line.
MyApp_1 my_app_1(argv[1], atof(argv[2]), atof(argv[3]), image_size, atoi(argv[8]));
vector<ExampleResults> results_1;

MyApp_2 my_app_2(argv[5], atof(argv[6]), atof(argv[7]), image_size, atoi(argv[8]));
vector<ExampleResults> results_2;

while (!isp_stop)
{
    ScopedTiming st("total time", 1);
    pl.GetFrame(dump_res);
    input_tensor = host_runtime_tensor::create(
        typecode_t::dt_uint8, in_shape,
        {(gsl::byte *)dump_res.virt_addr, compute_size(in_shape)},
        false, hrt::pool_shared, dump_res.phy_addr).expect("cannot create input tensor");
    hrt::sync(input_tensor, sync_op_t::sync_write_back, true).expect("sync write_back failed");

    results_1.clear();
    results_2.clear();

    my_app_1.pre_process(input_tensor);
    my_app_1.inference();
    my_app_1.post_process(image_size, results_1);

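    for (auto &result : results_1)
    {
        // In a real application, pass the stage-1 result (for example its ROI or
        // keypoints) to the stage-2 preprocess; this template omits that argument.
        my_app_2.pre_process(input_tensor);
        my_app_2.inference();
        my_app_2.post_process(image_size, results_2);
    }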

    draw_frame.setTo(cv::Scalar(0, 0, 0, 0));
    my_app_1.draw_result(draw_frame, results_1);
    my_app_2.draw_result(draw_frame, results_2);
    pl.InsertFrame(draw_frame.data);
    pl.ReleaseFrame();
}

pl.Destroy();

In practice, the first-stage task often provides ROIs or candidates for the second-stage task. When you rewrite the logic for your own application, the main changes are usually:

task instance initialization
the order of preprocess/inference/postprocess calls
how results from stage 1 are passed to stage 2
how the final drawing is merged on OSD
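The hand-off from stage 1 to stage 2 can be sketched independently of any model. In the illustration below, Roi, Labeled, and run_cascade are hypothetical names, and the recognizer is passed in as a callable so the control flow can be exercised without the KPU; in the real application the callable would wrap the stage-2 preprocess, inference, and postprocess calls.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stage-1 result: a region of interest in AI-frame coordinates.
struct Roi
{
    int x;
    int y;
    int w;
    int h;
};

// Hypothetical combined result: the ROI together with the stage-2 label.
struct Labeled
{
    Roi roi;
    std::string label;
};

// Runs the second stage once per stage-1 detection and keeps the results in
// the same order, so result i always belongs to detection i.
std::vector<Labeled> run_cascade(const std::vector<Roi> &detections,
                                 const std::function<std::string(const Roi &)> &recognize)
{
    std::vector<Labeled> out;
    out.reserve(detections.size());
    for (const Roi &roi : detections)
        out.push_back({roi, recognize(roi)});
    return out;
}
```

Keeping the two result lists index-aligned makes the final drawing step simpler, because each box can be drawn together with its own label.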

The video loop is typically run in a dedicated thread. When the user inputs q, set isp_stop to true so the loop can exit cleanly.

When adapting this logic for your own application, also update the thread-exit handling and the runtime prompt text so users know how to stop the program safely.
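One way to structure the clean exit is an atomic stop flag shared between the input handler and the worker thread. The sketch below is a simplified assumption of that pattern: run_until_quit is a hypothetical function, the worker body stands in for the video loop, and the local flag plays the role of isp_stop.

```cpp
#include <atomic>
#include <sstream>
#include <string>
#include <thread>

// Starts a worker loop, reads lines from `in` until "q" (or end of input),
// then stops the worker and returns how many iterations it completed.
long run_until_quit(std::istream &in)
{
    std::atomic<bool> stop{false};
    std::atomic<long> frames{0};

    std::thread worker([&stop, &frames]() {
        do
        {
            ++frames;  // placeholder for one pass of the video loop
            std::this_thread::yield();
        } while (!stop.load());
    });

    std::string line;
    while (std::getline(in, line))
        if (line == "q")
            break;

    stop.store(true);  // also reached when input ends without "q"
    worker.join();     // wait for the loop to finish before releasing resources
    return frames.load();
}
```

Joining the worker before tearing down the pipeline ensures that no frame is still in use when resources are released.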

Image Inference#

main.cc also contains an image-inference path. The main differences are:

load the input image with cv::imread
convert the source image from HWC to CHW
create a host tensor from the local image
write the final output to result.jpg

The simplified flow is:

cv::Mat ori_img = cv::imread(argv[4]);
FrameCHWSize image_size = {ori_img.channels(), ori_img.rows, ori_img.cols};

std::vector<uint8_t> chw_vec;
std::vector<cv::Mat> bgrChannels(3);
cv::split(ori_img, bgrChannels);
for (auto i = 2; i > -1; i--)
{
    std::vector<uint8_t> data = std::vector<uint8_t>(bgrChannels[i].reshape(1, 1));
    chw_vec.insert(chw_vec.end(), data.begin(), data.end());
}

dims_t in_shape {1, 3, ori_img.rows, ori_img.cols};
runtime_tensor input_tensor = host_runtime_tensor::create(
    typecode_t::dt_uint8, in_shape, hrt::pool_shared).expect("cannot create input tensor");
auto input_buf = input_tensor.impl()->to_host().unwrap()->buffer().as_host().unwrap().map(map_access_::map_write).unwrap().buffer();
memcpy(reinterpret_cast<char *>(input_buf.data()), chw_vec.data(), chw_vec.size());
hrt::sync(input_tensor, sync_op_t::sync_write_back, true).expect("write back input failed");

my_app_1.pre_process(input_tensor);
my_app_1.inference();
my_app_1.post_process(image_size, results_1);

for (auto &result : results_1)
{
    // As in the video path, a real application passes the stage-1 result to stage 2 here.
    my_app_2.pre_process(input_tensor);
    my_app_2.inference();
    my_app_2.post_process(image_size, results_2);
}

my_app_1.draw_result(ori_img, results_1);
my_app_2.draw_result(ori_img, results_2);
cv::imwrite("result.jpg", ori_img);
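The split-and-append loop above turns an interleaved BGR image (HWC) into planar RGB data (CHW). The same reordering can be written without OpenCV, which makes the index arithmetic explicit. This is a sketch for illustration; bgr_hwc_to_rgb_chw is a hypothetical helper, not a function from ai_utils.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts interleaved BGR pixels (HWC) into planar RGB (CHW).
// Returns an empty vector if the input size does not match height * width * 3.
std::vector<uint8_t> bgr_hwc_to_rgb_chw(const std::vector<uint8_t> &hwc, int height, int width)
{
    const std::size_t plane = static_cast<std::size_t>(height) * width;
    if (hwc.size() != plane * 3)
        return {};

    std::vector<uint8_t> chw(plane * 3);
    for (std::size_t p = 0; p < plane; ++p)
        for (int c = 0; c < 3; ++c)
            // Source channel c (B=0, G=1, R=2) goes to destination plane 2 - c.
            chw[(2 - c) * plane + p] = hwc[p * 3 + c];
    return chw;
}
```

For a 1 x 2 image with BGR pixels (1, 2, 3) and (4, 5, 6), the output is the R plane {3, 6}, followed by the G plane {2, 5} and the B plane {1, 4}.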

When you change the inference flow, also update the following parts in main.cc:

usage printing
argument count validation
task-specific argument parsing

For face recognition, the typical usage format is:

void print_usage(const char *name)
{
    // Note the leading space, which separates the program name from the first argument.
    cout << "Usage: " << name
         << " <kmodel_det> <det_thres> <nms_thres> <kmodel_recg> <recg_thres> <db_dir> <debug_mode>" << endl;
}

Besides usage printing, also update the argument-count validation and task-specific argument parsing to match the new inference chain. A typical startup check is:

std::cout << "case " << argv[0] << " built at " << __DATE__ << " " << __TIME__ << std::endl;
std::cout << "Press 'q+Enter' to exit." << std::endl;
if (argc != 8)
{
    print_usage(argv[0]);
    return -1;
}

`CMakeLists.txt` and `build_app.sh`#

At the source root:

add_subdirectory(src)

For the task subdirectory:

set(src main.cc ai_utils.cc video_pipeline.cc ai_base.cc face_detection.cc face_recognition.cc anchors_320.cc anchors_640.cc)
set(bin face_recognition.elf)

The build script usually defines the build environment and collects the generated elf, models, and helper files into k230_bin.

If you are not familiar with this part of the build system, you can keep the same split layout with a top-level CMakeLists.txt and a task-local src/CMakeLists.txt. This is also the organization used by the reference example.

Build#

Select Board and Build Firmware#

From the RTOS root:

make list-def
make ***_defconfig
make -j

After the build finishes, the image is generated in output.

The recommended application location is still under src/rtsmart/examples/ai, and the existing face_recognition example can be used as the reference structure for your own task.

Build Method 1#

After you finish the code changes, enter src/rtsmart/examples/ai/face_recognition and run:

./build_app.sh

The intermediate files are generated in build, and the deployment package is collected in k230_bin.

Build Method 2#

From the RTOS SDK root, run make menuconfig and enable:

RT-Smart UserSpace Examples Configuration
-> Enable build ai examples
-> Enable Build Face Recognition Programs

Then run:

make -j

This builds the deployment files directly into:

/sdcard/app/examples/ai/face_recognition

You can also enter the example directory and run:

make -j

This path also supports incremental build and collects the outputs in k230_bin.

Board Deployment#

Flash the firmware first. See:

how_to_flash

Then copy the generated files from k230_bin to:

CanMV/sdcard

Use the serial console to run run.sh or invoke face_recognition.elf directly with the correct arguments.

Typical runtime form:

face_recognition.elf <kmodel_det> <det_thres> <nms_thres> <kmodel_recg> <recg_thres> <db_dir> <debug_mode>

Parameter	Description	Range
`kmodel_det`	Face-detection `kmodel` path	file path
`det_thres`	Detection threshold	`0.0` to `1.0`
`nms_thres`	Detection NMS threshold	`0.0` to `1.0`
`kmodel_recg`	Face-recognition `kmodel` path	file path
`recg_thres`	Recognition threshold	`0` to `100`
`db_dir`	Face database directory	directory path
`debug_mode`	Debug level	`0`, `1`, `2`

The parameter order must match the application code. If you modify the inference pipeline or add more stages, update both the usage message and the runtime argument parsing accordingly.
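Keeping the usage text, the argument count, and the parsing in one place helps them stay consistent. The sketch below follows the parameter order and ranges from the table above; AppArgs and parse_args are hypothetical names, and the reference main.cc may parse its arguments differently.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical container for the seven command-line parameters.
struct AppArgs
{
    std::string kmodel_det;
    float det_thres;
    float nms_thres;
    std::string kmodel_recg;
    float recg_thres;
    std::string db_dir;
    int debug_mode;
};

static bool in_range(float v, float lo, float hi)
{
    return v >= lo && v <= hi;
}

// Fills `out` from argv and returns true only if the argument count is right
// and every numeric value is inside the range listed in the parameter table.
bool parse_args(int argc, const char *const argv[], AppArgs &out)
{
    if (argc != 8)  // program name + 7 parameters
        return false;

    out.kmodel_det = argv[1];
    out.det_thres = std::strtof(argv[2], nullptr);
    out.nms_thres = std::strtof(argv[3], nullptr);
    out.kmodel_recg = argv[4];
    out.recg_thres = std::strtof(argv[5], nullptr);
    out.db_dir = argv[6];
    out.debug_mode = std::atoi(argv[7]);

    return in_range(out.det_thres, 0.0f, 1.0f) &&
           in_range(out.nms_thres, 0.0f, 1.0f) &&
           in_range(out.recg_thres, 0.0f, 100.0f) &&
           out.debug_mode >= 0 && out.debug_mode <= 2;
}
```

In main, a failed parse would typically lead to the usage message and a non-zero return value, which matches the startup check shown earlier.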

Feature Support#

Feature	Supported	Command
Show help	Yes	`h` or `help`
Dump registration frame	Yes	`i`
Clear face database	Yes	`d`
Register face	Yes	input a face name
Query registration count	Yes	`n`
Exit	Yes	`q`

Notes:

When capturing a registration frame, keep only one clear face in the image.
Use recognizable English characters for names and avoid special symbols.
If registration or recognition is unstable, check whether the detection result fed into the recognition stage is using the expected ROI and preprocessing path.
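The command table can be mirrored by a small classifier that separates fixed commands from face names. This is a sketch under assumptions: Command and classify_command are hypothetical names, and the rule that a name may contain only ASCII letters, digits, and underscores is an illustrative reading of the note about avoiding special symbols, not the reference implementation's exact policy.

```cpp
#include <cctype>
#include <string>

enum class Command
{
    Help,
    DumpFrame,
    ClearDb,
    QueryCount,
    Exit,
    RegisterName,
    Invalid
};

// Maps one line of console input to a command. Anything that is not a fixed
// command is treated as a face name and validated.
Command classify_command(const std::string &line)
{
    if (line == "h" || line == "help") return Command::Help;
    if (line == "i") return Command::DumpFrame;
    if (line == "d") return Command::ClearDb;
    if (line == "n") return Command::QueryCount;
    if (line == "q") return Command::Exit;

    if (line.empty())
        return Command::Invalid;
    for (unsigned char ch : line)
        if (!std::isalnum(ch) && ch != '_')  // assumed name policy
            return Command::Invalid;
    return Command::RegisterName;
}
```

One consequence of this design is that single-letter names such as n or q collide with commands, so names should be longer than one character.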

Debugging Guide#

Check Model Input and Output Shapes#

Print input_shapes_ and output_shapes_ from AIBase to verify model I/O dimensions.
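A small formatting helper keeps such shape logs readable. The function below is a hypothetical sketch (shape_to_string is not part of ai_utils.h) that renders one shape vector as text.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Formats a shape such as {1, 3, 320, 320} as "[1, 3, 320, 320]".
std::string shape_to_string(const std::vector<int> &shape)
{
    std::ostringstream os;
    os << "[";
    for (std::size_t i = 0; i < shape.size(); ++i)
    {
        if (i != 0)
            os << ", ";
        os << shape[i];
    }
    os << "]";
    return os.str();
}
```

Inside a task class, this could be called in a loop over input_shapes_ and output_shapes_ and compared with the shapes reported when the kmodel was converted.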

Dump Raw Data#

Use helper APIs from ai_utils.h, such as:

dump_binary_file
dump_gray_image
dump_color_image

This is useful for checking layout, normalization, and channel-order issues.

It is also useful for checking whether stage-1 outputs are passed correctly into the stage-2 pipeline.

Locate the Bug with Prints#

Add logging around each processing stage and rebuild to locate the exact failing position.

Add Timing Statistics#

Use scoped_timing.h to measure latency:

{
    ScopedTiming st("test", 1);
    /*
     * code under test
     */
}
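For readers unfamiliar with the idiom, a scope-based timer can be sketched with std::chrono as follows. This is not the content of scoped_timing.h: SketchTimer is a hypothetical class, and the optional output pointer exists only so the measured value can be inspected by the caller.

```cpp
#include <chrono>
#include <iostream>
#include <string>

// Measures the time between construction and destruction and prints it when
// the enclosing scope ends.
class SketchTimer
{
public:
    explicit SketchTimer(const std::string &name, double *elapsed_ms_out = nullptr)
        : name_(name), out_(elapsed_ms_out), start_(std::chrono::steady_clock::now())
    {
    }

    ~SketchTimer()
    {
        const auto end = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(end - start_).count();
        if (out_ != nullptr)
            *out_ = ms;  // hand the value back to the caller if requested
        std::cout << name_ << " took " << ms << " ms" << std::endl;
    }

private:
    std::string name_;
    double *out_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each stage (stage-1 preprocess, stage-1 inference, stage-2 preprocess, and so on) in its own scope shows which stage dominates the per-frame latency.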

Check Memory Usage#

Common commands:

cat /proc/media-mem
cat /proc/umap/vicap
cat /proc/umap/vb
cat /proc/umap/vo
list_page

list_page shows free pages, used pages, and peak usage. Each page is 4 KB.

For multimedia-related memory issues, you can also inspect /proc/umap/vicap, /proc/umap/vb, and /proc/umap/vo to confirm that the display and frame-buffer paths are behaving as expected.

If Model Quality Is Not Good Enough#

If the code path is correct but the result is still poor, revisit:

source model quality
stage-to-stage preprocessing consistency
threshold values
recognition database quality
quantization and conversion settings

In practice, double-model applications are especially sensitive to consistency between the detection stage and the recognition stage. If the first stage crops an inaccurate ROI or uses a different coordinate transform than expected, the second stage may appear to fail even when the recognition model itself is correct.
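A quick way to check coordinate consistency is to verify that the forward and inverse transforms round-trip. The sketch below assumes the detection input was produced by uniform scaling plus padding offsets; PointF, frame_to_model, and model_to_frame are hypothetical names, and the actual scale and offsets must come from your own preprocessing configuration.

```cpp
// Hypothetical 2D point type.
struct PointF
{
    float x;
    float y;
};

// Maps a point from AI-frame coordinates into model-input coordinates.
PointF frame_to_model(PointF p, float scale, int pad_left, int pad_top)
{
    return {p.x * scale + pad_left, p.y * scale + pad_top};
}

// Inverse mapping: from model-input coordinates back to AI-frame coordinates.
PointF model_to_frame(PointF p, float scale, int pad_left, int pad_top)
{
    return {(p.x - pad_left) / scale, (p.y - pad_top) / scale};
}
```

For example, with a scale of 0.5 and 70 rows of top padding, the model-space point (160, 160) maps back to (320, 180) in the frame. If the recognition stage crops with a different scale or offset than the detection postprocess used, the ROI shifts and recognition quality drops even though both models are correct.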