KPU Application Development Guide#
Overview#
KPU (Knowledge Processing Unit) is the hardware accelerator on K230 designed for edge AI. It is a highly optimized deep-learning accelerator that can efficiently run dense computation in neural-network models. KPU supports a wide range of mainstream visual model structures and is suitable for many edge-vision AI scenarios.
The following diagram shows the position of KPU in the K230 system:
KPU Inference Flow#
When you use the KPU runtime API to run inference, the end-to-end flow is:
The main components involved in the inference process are AI2D and Interpreter.
AI2D#
Handles image preprocessing before model inference.
The preprocessing path is hardware-accelerated and significantly improves runtime efficiency.
Interpreter#
Executes model inference on KPU.
Handles model loading, tensor setup, model execution, and output retrieval.
The input and output data type used by both modules is host_runtime_tensor.
Model input tensors may be:
single-input
multi-input
In most applications, the AI2D output tensor is used directly as the model input tensor. During program initialization, applications usually initialize these items together:
ai2d_builderInterpreterinput tensors
output tensors
Their relationship is shown below:
For a single-input model, you can bind:
the AI2D output tensor
the Interpreter input tensor
to the same host_runtime_tensor. This:
avoids an extra copy
saves one tensor buffer
improves overall execution efficiency
If you do not use AI2D for preprocessing, you can also:
preprocess with OpenCV on the CPU
manually create the corresponding
host_runtime_tensoras the model input
The complete processing path is shown below:
Model Inference Example#
This section uses a YOLOv8 object-detection model to explain the overall deployment flow based on KPU.
The sample source code is located at:
src/rtsmart/examples/ai/usage_kpu
In that directory, run:
./build_app.sh
After the build completes, the generated executables are placed under:
k230_bin/
The output directory contains:
an image inference sample
a camera real-time inference sample
Copy the required executable to the board before running it.
Code Analysis Notes#
This section uses the image inference example and explains the main() function step by step to show the complete flow of model loading, preprocessing, inference, and postprocessing.
Image Inference Example#
The image sample accepts:
<kmodel> <image> <debug_mode>
debug_mode is typically:
0: no debug output1: timing-oriented output2: detailed debug output
The main implementation phases are:
Initialize
interpreterand load thekmodel.Initialize input tensors and record input shapes.
Initialize output tensors and record output shapes.
Read the image and convert it to
CHW + RGB.Compute
Pad + Resizeparameters with letterbox-style scaling.Create the AI2D input tensor and copy the image data into it.
Reuse the model input tensor as the AI2D output tensor.
Configure AI2D parameters and enable
pad + resize.Build and invoke AI2D.
Run KPU inference.
Read output pointers.
Run postprocessing and NMS.
Draw detection results and save the output image. Full
main()implementation:
int main(int argc, char *argv[])
{
std::cout << "case " << argv[0] << " build " << __DATE__ << " " << __TIME__ << std::endl;
if (argc < 4)
{
std::cerr << "Usage: " << argv[0] << " <kmodel> <image> <debug_mode>" << std::endl;
return -1;
}
int debug_mode = atoi(argv[3]);
// ==============================
// 1. Initialize Interpreter and load kmodel
// ==============================
interpreter interp;
std::ifstream ifs(argv[1], std::ios::binary);
interp.load_model(ifs).expect("Invalid kmodel");
vector<vector<int>> input_shapes;
vector<vector<int>> output_shapes;
vector<float *> p_outputs;
// ==============================
// 2. Initialize input tensors and record input shapes
// ==============================
for (int i = 0; i < interp.inputs_size(); i++)
{
auto desc = interp.input_desc(i);
auto shape = interp.input_shape(i);
auto tensor = host_runtime_tensor::create(desc.datatype, shape, hrt::pool_shared).expect("cannot create input tensor");
interp.input_tensor(i, tensor).expect("cannot set input tensor");
vector<int> in_shape;
if (debug_mode > 1)
std::cout << "input " << i << " datatype: " << desc.datatype << " , shape: ";
for (int j = 0; j < shape.size(); ++j)
{
in_shape.push_back(shape[j]);
if (debug_mode > 1)
std::cout << shape[j] << " ";
}
if (debug_mode > 1)
std::cout << std::endl;
input_shapes.push_back(in_shape);
}
// ==============================
// 3. Initialize output tensors and record output shapes
// ==============================
for (size_t i = 0; i < interp.outputs_size(); i++)
{
auto desc = interp.output_desc(i);
auto shape = interp.output_shape(i);
auto tensor = host_runtime_tensor::create(desc.datatype, shape, hrt::pool_shared).expect("cannot create output tensor");
interp.output_tensor(i, tensor).expect("cannot set output tensor");
vector<int> out_shape;
if (debug_mode > 1)
std::cout << "output " << i << " datatype: " << desc.datatype << " , shape: ";
for (int j = 0; j < shape.size(); ++j)
{
out_shape.push_back(shape[j]);
if (debug_mode > 1)
std::cout << shape[j] << " ";
}
if (debug_mode > 1)
std::cout << std::endl;
output_shapes.push_back(out_shape);
}
// ==============================
// 4. Read image and convert to CHW + RGB
// ==============================
cv::Mat ori_img = cv::imread(argv[2]);
int ori_w = ori_img.cols;
int ori_h = ori_img.rows;
std::vector<uint8_t> chw_vec;
std::vector<cv::Mat> bgrChannels(3);
cv::split(ori_img, bgrChannels);
for (auto i = 2; i > -1; i--)
{
std::vector<uint8_t> data = std::vector<uint8_t>(bgrChannels[i].reshape(1, 1));
chw_vec.insert(chw_vec.end(), data.begin(), data.end());
}
// ==============================
// 5. Compute Pad + Resize parameters (letterbox: scale by shorter side)
// ==============================
int width = input_shapes[0][3];
int height = input_shapes[0][2];
float ratiow = (float)width / ori_w;
float ratioh = (float)height / ori_h;
float ratio = ratiow < ratioh ? ratiow : ratioh;
int new_w = (int)(ratio * ori_w);
int new_h = (int)(ratio * ori_h);
float dw = (float)(width - new_w) / 2;
float dh = (float)(height - new_h) / 2;
int top = (int)(roundf(0));
int bottom = (int)(roundf(dh * 2 + 0.1));
int left = (int)(roundf(0));
int right = (int)(roundf(dw * 2 - 0.1));
// ==============================
// 6. Create AI2D input tensor and copy image data into it
// ==============================
dims_t ai2d_in_shape{1, 3, ori_h, ori_w};
runtime_tensor ai2d_in_tensor = host_runtime_tensor::create(typecode_t::dt_uint8, ai2d_in_shape, hrt::pool_shared).expect("cannot create input tensor");
auto input_buf = ai2d_in_tensor.impl()->to_host().unwrap()->buffer().as_host().unwrap().map(map_access_::map_write).unwrap().buffer();
memcpy(reinterpret_cast<char *>(input_buf.data()), chw_vec.data(), chw_vec.size());
hrt::sync(ai2d_in_tensor, sync_op_t::sync_write_back, true).expect("write back input failed");
// ==============================
// 7. Reuse model input tensor as the AI2D output tensor
// ==============================
runtime_tensor ai2d_out_tensor = interp.input_tensor(0).expect("cannot get input tensor");
dims_t out_shape = ai2d_out_tensor.shape();
// ==============================
// 8. Configure AI2D parameters (pad + resize)
// ==============================
ai2d_datatype_t ai2d_dtype{ai2d_format::NCHW_FMT, ai2d_format::NCHW_FMT, ai2d_in_tensor.datatype(), ai2d_out_tensor.datatype()};
ai2d_crop_param_t crop_param{false, 0, 0, 0, 0};
ai2d_shift_param_t shift_param{false, 0};
ai2d_pad_param_t pad_param{true, {{0, 0}, {0, 0}, {top, bottom}, {left, right}}, ai2d_pad_mode::constant, {114, 114, 114}};
ai2d_resize_param_t resize_param{true, ai2d_interp_method::tf_bilinear, ai2d_interp_mode::half_pixel};
ai2d_affine_param_t affine_param{false, ai2d_interp_method::cv2_bilinear, 0, 0, 127, 1, {0.5, 0.1, 0.0, 0.1, 0.5, 0.0}};
// ==============================
// 9. Build and invoke AI2D
// ==============================
ai2d_builder builder(ai2d_in_shape, out_shape, ai2d_dtype, crop_param, shift_param, pad_param, resize_param, affine_param);
builder.build_schedule();
builder.invoke(ai2d_in_tensor, ai2d_out_tensor).expect("error occurred in ai2d running");
// ==============================
// 10. Run KPU inference
// ==============================
interp.run().expect("error occurred in running model");
// ==============================
// 11. Read model outputs
// ==============================
p_outputs.clear();
for (int i = 0; i < interp.outputs_size(); i++)
{
auto out = interp.output_tensor(i).expect("cannot get output tensor");
auto buf = out.impl()->to_host().unwrap()->buffer().as_host().unwrap().map(map_access_::map_read).unwrap().buffer();
float *p_out = reinterpret_cast<float *>(buf.data());
p_outputs.push_back(p_out);
}
// ==============================
// 12. Postprocessing (decode + NMS)
// ==============================
std::vector<std::string> classes{"apple", "banana", "orange"};
float conf_thresh = 0.25;
float nms_thresh = 0.45;
int class_num = classes.size();
std::vector<cv::Scalar> class_colors = getColorsForClasses(class_num);
// Transpose output layout from C x N to N x C so each box's features are contiguous
float *output0 = p_outputs[0];
int f_len = class_num + 4;
int num_box =
((input_shapes[0][2] / 8) * (input_shapes[0][3] / 8) +
(input_shapes[0][2] / 16) * (input_shapes[0][3] / 16) +
(input_shapes[0][2] / 32) * (input_shapes[0][3] / 32));
float *output_det = new float[num_box * f_len];
for (int r = 0; r < num_box; r++)
for (int c = 0; c < f_len; c++)
output_det[r * f_len + c] = output0[c * num_box + r];
// Decode boxes and map back to original image coordinates
std::vector<Bbox> bboxes;
for (int i = 0; i < num_box; i++)
{
float *vec = output_det + i * f_len;
float box[4] = {vec[0], vec[1], vec[2], vec[3]};
float *class_scores = vec + 4;
auto max_class_score_ptr = std::max_element(class_scores, class_scores + class_num);
float score = *max_class_score_ptr;
int max_class_index = max_class_score_ptr - class_scores;
if (score > conf_thresh)
{
Bbox bbox;
float x_ = box[0] / ratio;
float y_ = box[1] / ratio;
float w_ = box[2] / ratio;
float h_ = box[3] / ratio;
int x = int(MAX(x_ - 0.5 * w_, 0));
int y = int(MAX(y_ - 0.5 * h_, 0));
int w = int(w_);
int h = int(h_);
if (w <= 0 || h <= 0) { continue; }
bbox.box = cv::Rect(x, y, w, h);
bbox.confidence = score;
bbox.index = max_class_index;
bboxes.push_back(bbox);
}
}
// Run Non-Maximum Suppression
std::vector<int> nms_result;
nms(bboxes, conf_thresh, nms_thresh, nms_result);
// ==============================
// 13. Draw detection results and save output image
// ==============================
for (int i = 0; i < nms_result.size(); i++)
{
int res = nms_result[i];
cv::Rect box = bboxes[res].box;
int idx = bboxes[res].index;
cv::rectangle(ori_img, box, class_colors[idx], 2, 8);
cv::putText(ori_img, classes[idx], cv::Point(box.x + 5, box.y - 10),
cv::FONT_HERSHEY_DUPLEX, 1, class_colors[idx], 2, 0);
}
cv::imwrite("result.jpg", ori_img);
delete[] output_det;
return 0;
}
Typical Image-Side Preprocessing#
The sample uses:
crop_param: disabledshift_param: disabledpad_param: enabledresize_param: enabledaffine_param: disabled
This is the standard detection-model letterbox path.
Postprocessing Logic#
The example decodes detection boxes from the first output tensor, performs confidence filtering and NMS, then draws results onto the original image.
Key operations include:
transpose the output layout from
C x NtoN x Cdecode bounding boxes
map coordinates back to the original image
perform NMS
draw boxes and labels
save
result.jpg
Camera Real-Time Inference Example#
The camera inference example uses the same model-inference logic, but the input frame comes from the camera pipeline rather than opencv::imread.
The main file tree is:
yolov8_run_camera
├── main.cc # core inference logic
├── scoped_timing.h # timing helper
├── setting.h # screen and resolution settings
├── video_pipeline.cc # display and frame pipeline implementation
├── video_pipeline.h # pipeline interface
└── CMakeLists.txt # build file
The camera_inference() function can be divided into these phases:
Parse command-line parameters and initialize basic variables.
Initialize the video pipeline and OSD frame.
Load the
kmodeland initializeInterpreter.Initialize model input and output tensors.
Compute
Resize + Paddingparameters for YOLO letterbox preprocessing.Configure AI2D input/output tensors and build the AI2D scheduler.
Enter the main loop:
get one ISP frame
create the AI2D input tensor with zero-copy binding to the ISP buffer
run AI2D
run KPU inference
read outputs
decode detections and run NMS
draw results onto the OSD layer
insert the OSD frame and release the current frame
Release resources and destroy the pipeline when exiting.
Full camera_inference() implementation:
int camera_inference(char *argv[])
{
/************************************************************
* Phase 0: Parse arguments and initialize basic variables
************************************************************/
int debug_mode = atoi(argv[4]);
// AI input image size (CHW)
FrameCHWSize image_size = {AI_FRAME_CHANNEL, AI_FRAME_HEIGHT, AI_FRAME_WIDTH};
// OSD layer (RGBA) for drawing bounding boxes and labels
cv::Mat draw_frame(OSD_HEIGHT, OSD_WIDTH, CV_8UC4, cv::Scalar(0, 0, 0, 0));
/************************************************************
* Phase 1: Initialize video pipeline (ISP -> DRM -> OSD)
************************************************************/
PipeLine pl(debug_mode);
pl.Create();
// Buffer for one ISP frame (virtual address + physical address)
DumpRes dump_res;
/************************************************************
* Phase 2: Load kmodel and initialize Interpreter
************************************************************/
interpreter interp;
std::ifstream ifs(argv[1], std::ios::binary);
interp.load_model(ifs).expect("Invalid kmodel");
/************************************************************
* Phase 3: Initialize input/output tensors and record shapes
************************************************************/
vector<vector<int>> input_shapes;
vector<vector<int>> output_shapes;
vector<float *> p_outputs;
for (int i = 0; i < interp.inputs_size(); i++)
{
auto desc = interp.input_desc(i);
auto shape = interp.input_shape(i);
auto tensor = host_runtime_tensor::create(desc.datatype, shape, hrt::pool_shared).expect("cannot create input tensor");
interp.input_tensor(i, tensor).expect("cannot set input tensor");
vector<int> in_shape;
if (debug_mode > 1)
std::cout << "input " << i << " datatype: " << desc.datatype << " , shape: ";
for (int j = 0; j < shape.size(); ++j)
{
in_shape.push_back(shape[j]);
if (debug_mode > 1)
std::cout << shape[j] << " ";
}
if (debug_mode > 1)
std::cout << std::endl;
input_shapes.push_back(in_shape);
}
for (size_t i = 0; i < interp.outputs_size(); i++)
{
auto desc = interp.output_desc(i);
auto shape = interp.output_shape(i);
auto tensor = host_runtime_tensor::create(desc.datatype, shape, hrt::pool_shared).expect("cannot create output tensor");
interp.output_tensor(i, tensor).expect("cannot set output tensor");
vector<int> out_shape;
if (debug_mode > 1)
std::cout << "output " << i << " datatype: " << desc.datatype << " , shape: ";
for (int j = 0; j < shape.size(); ++j)
{
out_shape.push_back(shape[j]);
if (debug_mode > 1)
std::cout << shape[j] << " ";
}
if (debug_mode > 1)
std::cout << std::endl;
output_shapes.push_back(out_shape);
}
/************************************************************
* Phase 4: Compute Resize + Padding parameters (YOLO LetterBox)
************************************************************/
int width = input_shapes[0][3];
int height = input_shapes[0][2];
float ratiow = (float)width / AI_FRAME_WIDTH;
float ratioh = (float)height / AI_FRAME_HEIGHT;
float ratio = ratiow < ratioh ? ratiow : ratioh;
int new_w = (int)(ratio * AI_FRAME_WIDTH);
int new_h = (int)(ratio * AI_FRAME_HEIGHT);
float dw = (float)(width - new_w) / 2;
float dh = (float)(height - new_h) / 2;
int top = (int)(roundf(0));
int bottom = (int)(roundf(dh * 2 + 0.1));
int left = (int)(roundf(0));
int right = (int)(roundf(dw * 2 - 0.1));
/************************************************************
* Phase 5: Configure AI2D tensors and builder
************************************************************/
dims_t ai2d_in_shape{1, AI_FRAME_CHANNEL, AI_FRAME_HEIGHT, AI_FRAME_WIDTH};
runtime_tensor ai2d_in_tensor;
// Reuse model input tensor as AI2D output to avoid extra copies
runtime_tensor ai2d_out_tensor = interp.input_tensor(0).expect("cannot get input tensor");
dims_t out_shape = ai2d_out_tensor.shape();
ai2d_datatype_t ai2d_dtype{
ai2d_format::NCHW_FMT, ai2d_format::NCHW_FMT,
typecode_t::dt_uint8, typecode_t::dt_uint8};
ai2d_crop_param_t crop_param{false, 0, 0, 0, 0};
ai2d_shift_param_t shift_param{false, 0};
ai2d_pad_param_t pad_param{true,
{{0, 0}, {0, 0}, {top, bottom}, {left, right}},
ai2d_pad_mode::constant, {114, 114, 114}};
ai2d_resize_param_t resize_param{true, ai2d_interp_method::tf_bilinear, ai2d_interp_mode::half_pixel};
ai2d_affine_param_t affine_param{false, ai2d_interp_method::cv2_bilinear, 0, 0, 127, 1,
{0.5, 0.1, 0.0, 0.1, 0.5, 0.0}};
ai2d_builder builder(ai2d_in_shape, out_shape, ai2d_dtype,
crop_param, shift_param, pad_param, resize_param, affine_param);
builder.build_schedule();
/************************************************************
* Phase 6: Initialize postprocessing parameters
************************************************************/
std::vector<std::string> classes{"apple", "banana", "orange"};
float conf_thresh = atof(argv[2]);
float nms_thresh = atof(argv[3]);
int class_num = classes.size();
std::vector<cv::Scalar> class_colors = getColorsForClasses(class_num);
float *output0;
int f_len = class_num + 4;
int num_box =
((input_shapes[0][2] / 8) * (input_shapes[0][3] / 8) +
(input_shapes[0][2] / 16) * (input_shapes[0][3] / 16) +
(input_shapes[0][2] / 32) * (input_shapes[0][3] / 32));
float *output_det = new float[num_box * f_len];
std::vector<Bbox> bboxes;
/************************************************************
* Phase 7: Main loop (capture -> preprocess -> infer -> postprocess -> display)
************************************************************/
while (!isp_stop)
{
// Get one ISP frame
pl.GetFrame(dump_res);
// Create AI2D input tensor with zero-copy binding to ISP buffer
ai2d_in_tensor = host_runtime_tensor::create(
typecode_t::dt_uint8, ai2d_in_shape,
{(gsl::byte *)dump_res.virt_addr, compute_size(ai2d_in_shape)},
false, hrt::pool_shared, dump_res.phy_addr)
.expect("cannot create input tensor");
hrt::sync(ai2d_in_tensor, sync_op_t::sync_write_back, true).expect("sync write_back failed");
// Run AI2D preprocessing
builder.invoke(ai2d_in_tensor, ai2d_out_tensor).expect("error occurred in ai2d running");
// Run KPU inference
interp.run().expect("error occurred in running model");
// Read model outputs
p_outputs.clear();
for (int i = 0; i < interp.outputs_size(); i++)
{
auto out = interp.output_tensor(i).expect("cannot get output tensor");
auto buf = out.impl()->to_host().unwrap()->buffer().as_host().unwrap().map(map_access_::map_read).unwrap().buffer();
p_outputs.push_back(reinterpret_cast<float *>(buf.data()));
}
// Transpose output layout (C x N -> N x C)
output0 = p_outputs[0];
for (int r = 0; r < num_box; r++)
for (int c = 0; c < f_len; c++)
output_det[r * f_len + c] = output0[c * num_box + r];
// Decode boxes, filter by confidence, run NMS
bboxes.clear();
for (int i = 0; i < num_box; i++)
{
float *vec = output_det + i * f_len;
float box[4] = {vec[0], vec[1], vec[2], vec[3]};
float *class_scores = vec + 4;
auto max_ptr = std::max_element(class_scores, class_scores + class_num);
float score = *max_ptr;
int max_class_index = max_ptr - class_scores;
if (score > conf_thresh)
{
Bbox bbox;
float x_ = box[0] / ratio;
float y_ = box[1] / ratio;
float w_ = box[2] / ratio;
float h_ = box[3] / ratio;
int x = int(MAX(x_ - 0.5 * w_, 0));
int y = int(MAX(y_ - 0.5 * h_, 0));
int w = int(w_);
int h = int(h_);
if (w <= 0 || h <= 0) continue;
bbox.box = cv::Rect(x, y, w, h);
bbox.confidence = score;
bbox.index = max_class_index;
bboxes.push_back(bbox);
}
}
std::vector<int> nms_result;
nms(bboxes, conf_thresh, nms_thresh, nms_result);
// Draw detections onto OSD layer
draw_frame.setTo(cv::Scalar(0, 0, 0, 0));
for (int i = 0; i < nms_result.size(); i++)
{
int res = nms_result[i];
cv::Rect box = bboxes[res].box;
int idx = bboxes[res].index;
float score = bboxes[res].confidence;
int x = int(box.x * float(OSD_WIDTH) / AI_FRAME_WIDTH);
int y = int(box.y * float(OSD_HEIGHT) / AI_FRAME_HEIGHT);
int w = int(box.width * float(OSD_WIDTH) / AI_FRAME_WIDTH);
int h = int(box.height * float(OSD_HEIGHT) / AI_FRAME_HEIGHT);
cv::Rect new_box(x, y, w, h);
cv::rectangle(draw_frame, new_box, class_colors[idx], 2, 8);
cv::putText(draw_frame,
classes[idx] + " " + std::to_string(score),
cv::Point(MIN(new_box.x + 5, OSD_HEIGHT), MAX(new_box.y - 10, 0)),
cv::FONT_HERSHEY_DUPLEX, 1, class_colors[idx], 2, 0);
}
// Composite OSD and release frame
pl.InsertFrame(draw_frame.data);
pl.ReleaseFrame(dump_res);
}
/************************************************************
* Phase 8: Release resources
************************************************************/
delete[] output_det;
pl.Destroy();
return 0;
}
Camera-Side Postprocessing and Drawing#
In the video path, the sample:
decodes YOLO output
filters by confidence threshold
performs NMS
maps box coordinates from AI-frame space to OSD space
draws class names and confidence values
inserts the OSD frame into the video pipeline
Build and Run#
After writing or modifying the code, build it with CMakeLists.txt or Makefile. For the sample above, you can run:
cd src/rtsmart/examples/ai/usage_kpu
./build_app.sh
The generated files are placed in k230_bin. Copy them to the TF card on the board, then run the corresponding command.
Run YOLOv8 on a Static Image#
./yolov8_image.elf best.kmodel test.jpg 2
The image inference result is saved as an output image. The reference result is shown below:
Run YOLOv8 on Camera Data#
./yolov8_camera.elf best.kmodel 0.5 0.45 2
The camera inference result is displayed on the screen in real time. The reference effect is shown below:
Notes#
The purpose of this guide is to explain the model-conversion and KPU-inference path. It is not intended to be a drop-in solution for every application scenario, but it can be used as a reference when developing KPU-based applications for other tasks.
