ZC.MuNiu on Renesas RH850 P1M-C: Software Algorithm Optimization in Practice (CyberSecurity Application of ZC.MuNiu on Renesas RH850 ICUM)
1 Introduction
In the field of
embedded secure communication, AES-CMAC (RFC 4493) is one of the core algorithms for message integrity verification. This article documents five rounds of performance optimization of the aes_cmac() function using the ABCA0 hardware accelerator on the Renesas RH850 P1H-C (R7F701372) platform. The measured time fell from the original 1.2 seconds to 90 milliseconds, an overall speedup of about 13×.

2 Algorithm Overview
At its core, AES-CMAC is a CBC-MAC variant: the input is split into 16-byte blocks, and each block goes through a chained XOR → AES-ECB encryption step. For the last block, one of two subkeys (subkey_1 or subkey_2) is XORed in, depending on whether padding was needed.

    M₁ ⊕ IV → AES → X₁
    M₂ ⊕ X₁ → AES → X₂
    M₃ ⊕ X₂ → AES → X₃   ← final MAC

The algorithm is inherently serial: each block's output is the next block's input, so it cannot be accelerated by parallelization. All of the optimization headroom lies in reducing the CPU-side instruction overhead per block and minimizing the number of hardware interaction rounds.

3 Performance Baseline Analysis
In the original implementation, aes_cmac() calls R_AES_HW_ECB_Encrypt() for every block inside the loop, and that function runs the full initialization sequence on every call:

    Per block: Init → Configure → LoadKey → ProcessBlock

Original per-block instruction breakdown:

| Operation | Instructions (est.) | Description |
|---|---|---|
| R_AES_HW_ECB_Encrypt call/return | 10 | call/ret, parameter passing |
| R_AES_HW_Init | 15 | per-block check of s_initialized, wait |
| R_AES_HW_Configure | 20 | parameter validation, MD register write |
| R_AES_HW_LoadKey | 60 | bytes_to_words (~45), di/ei, CTL, IDAT, wait |
| R_AES_HW_ProcessBlock | 90 | bytes_to_words (~45), di/ei, CTL, IDAT, wait, di/ei, ODAT, words_to_bytes (~45) |
| aes_cmac loop: block_xor_triple call | 16 | function call, 4×XOR |
| aes_cmac loop: if/else if branch | 6 | per-block last-block check |
| Total | ~250 | |

Measured time for the CMAC computation over the APP data: 1.2 seconds.

4 Optimization Process
4.1 Round 1: Perform hardware initialization once, outside the loop
Problem analysis: Every call to R_AES_HW_ECB_Encrypt() performs Init, Configure, and LoadKey, yet the results of these operations do not change across consecutive CMAC operations. The key is the same, the mode is the same, and the hardware needs to be configured only once.
Change: Init, Configure, and LoadKey are executed once before the loop; inside the loop only ProcessBlock runs for each block.
Effect: Eliminates Init (~15), Configure (~20), and LoadKey (~60), about 95 instructions per block. Measured time: 1.2 s → 0.75 s, about 1.6× faster.
Key point: This is the most obvious optimization, yet many hardware-accelerator-based crypto libraries perform a full initialization per block by default, usually for security reasons (to avoid leaving key material behind) or for API simplicity. In a controlled bare-metal environment, initializing outside the loop is a safe first step with a substantial payoff.

4.2 Round 2: bytes ↔ words conversion optimization
Problem analysis: Internally, R_AES_HW_ProcessBlock() processes data as follows:

    bytes → bytes_to_words() → write IDAT registers → HW encryption → read ODAT registers → words_to_bytes() → bytes

bytes_to_words() and words_to_bytes() each take about 45 instructions, roughly 90 in total, and exist only to adapt the 16-byte array to the ABCA0's 32-bit register interface.

RH850 is a little-endian architecture.
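As a rough illustration of where those ~45 instructions go, the following host-side sketch contrasts a byte-by-byte conversion with a plain 16-byte copy. The function names (bytes_to_words_bytewise, bytes_to_words_copy, host_is_little_endian, conversions_agree) are hypothetical, and the standard uint8_t/uint32_t types stand in for the project's uint8/uint32. The driver's real bytes_to_words() is not shown in this article and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the driver's bytes_to_words():
 * assembles four little-endian 32-bit words one byte at a time. */
static void bytes_to_words_bytewise(const uint8_t in[16], uint32_t out[4])
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)in[4 * i]
               | ((uint32_t)in[4 * i + 1] << 8)
               | ((uint32_t)in[4 * i + 2] << 16)
               | ((uint32_t)in[4 * i + 3] << 24);
    }
}

/* Word-wise alternative: copy the 16 bytes unchanged. On a little-endian
 * CPU this yields the same four words. memcpy is used here so the sketch
 * makes no alignment or aliasing assumptions. */
static void bytes_to_words_copy(const uint8_t in[16], uint32_t out[4])
{
    memcpy(out, in, 16);
}

/* Returns 1 when the host stores the least significant byte first. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1u;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1u;
}

/* Returns 1 if both conversions produce the same words for this block. */
static int conversions_agree(const uint8_t in[16])
{
    uint32_t a[4];
    uint32_t b[4];
    bytes_to_words_bytewise(in, a);
    bytes_to_words_copy(in, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

On a little-endian host the two functions agree for every block, which is the property this round exploits. On a big-endian host they would differ, and the byte-wise form would still be required.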
For a 4-byte-aligned uint32 pointer, a direct memory read gives exactly the same result as assembling the word by hand:

    /* Manual assembly in C (casts avoid shifting into the sign bit) */
    uint32 val_manual = (uint32)bytes[0] | ((uint32)bytes[1] << 8) |
                        ((uint32)bytes[2] << 16) | ((uint32)bytes[3] << 24);
    /* Direct read on a little-endian platform; bytes must be 4-byte aligned */
    uint32 val_direct = *(uint32 *)(bytes);  /* identical result */

Change: bytes_to_words and words_to_bytes were rewritten in RH850 assembly, taking full advantage of the RH850's 32-bit load/store instructions to eliminate the byte-by-byte shift-and-OR:

    ; bytes_to_words, assembly version (illustrative)
    ; use ld.w to read 32 bits directly instead of byte-wise shift/or
    ld.w  0[r10], r6    ; read 4 bytes directly
    st.w  r6, 0[r11]
    ld.w  4[r10], r6
    st.w  r6, 4[r11]
    ; ...

At the same time, the call in aes_cmac.c becomes:

    R_AES_HW_ProcessBlock_Aligned((const uint32 *)temp, (uint32 *)prev);

Effect: Eliminates the two conversion functions, about 90 instructions per block. Measured time: 0.75 s → 0.4 s, about 1.9× faster.

Round 2 rewrote the conversion functions in assembly rather than using a plain C pointer cast because the project needs its strict data-alignment assumption to be explicit:
- the assembly version explicitly controls the address alignment of each load/store;
- the assembly version controls the exact instruction sequence, so the compiler cannot generate redundant load/store pairs;
- in later rounds these assembly fragments can be inlined directly into larger optimized functions.

4.3 Round 3: Move the CMAC main loop down into the driver layer
Problem analysis: Even with the assembly-optimized ProcessBlock_Aligned, each block still carries substantial redundant overhead. The root cause is a poorly placed abstraction boundary: the per-block CMAC logic (XOR, branch decisions, HW interaction) is spread across two compilation units, aes_cmac.c and icum_d_aes_hw.c, so the compiler cannot inline or optimize across files.

Change: The core idea is to merge the entire CMAC main loop (XOR → HW ECB → ODAT read → last-block handling) into a single driver-layer function that receives all parameters at once.

    aes_hw_error_t R_AES_HW_CMAC_ProcessBlocks(const uint8 *p_input,
                                               uint32 length,
                                               const uint8 *p_subkey1,
                                               const uint8 *p_subkey2,
                                               uint8 *p_mac_out)

(1) Internal structure of the function
1. Parameter check (once only)
2. Compute n_blocks, remainder, prefix_count
3. Precompute the CTL values
   - ctrl_first = START | NEW_KEY (first block)
   - ctrl_rest = 0x0000 (subsequent blocks)
4. Process the first block separately (outside the loop), using ctrl_first
5. Tight loop for the subsequent blocks (branch-free), using ctrl_rest
6. Last block: subkey XOR and padding handling

(2) Actual per-block operations in the hot loop: XOR with the previous output, HW ECB, ODAT read.

(3) The main loop in aes_cmac.c is replaced by a single call:

    hw_ret = R_AES_HW_CMAC_ProcessBlocks(input, length, subkey_1, subkey_2, mac_value);
    if (hw_ret == AES_HW_OK) {
        return;
    }
    /* otherwise fall through to the pure-software fallback path */

Effect: Eliminates about 37 instructions per block. Measured time: 0.4 s → 0.2 s, about 2× faster.

Design considerations
Moving the CMAC logic down into the driver layer violates the traditional layering principle that a driver should only provide hardware abstraction.
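Before weighing that trade-off, it may help to see the shape such a fused function could take. The sketch below is a host-side approximation under several assumptions: mock_hw_ecb is a toy byte transform standing in for the ABCA0 (it is not AES), the CTL bit values are made up, prefix_count is taken to mean "blocks before the last one", and a ctl variable that switches after the first hardware call replaces the article's peeled first block to keep the code short. It is not the author's R_AES_HW_CMAC_ProcessBlocks().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK       16u
#define CTL_START   0x0001u  /* hypothetical bit values */
#define CTL_NEW_KEY 0x0002u

static unsigned g_hw_calls;       /* mock hardware invocations */
static unsigned g_new_key_calls;  /* invocations that reloaded the key */

/* Mock of "write IDAT, run, read ODAT". NOT AES: a toy transform. */
static void mock_hw_ecb(uint32_t ctl, const uint8_t in[BLOCK], uint8_t out[BLOCK])
{
    g_hw_calls++;
    if (ctl & CTL_NEW_KEY) { g_new_key_calls++; }
    for (unsigned i = 0; i < BLOCK; i++) {
        out[i] = (uint8_t)(in[(i + 1u) % BLOCK] * 5u + 0x3Cu + i);
    }
}

/* Fused CMAC loop following RFC 4493's block handling. Returns 0 on success. */
static int cmac_process_blocks(const uint8_t *p_input, uint32_t length,
                               const uint8_t *p_subkey1, const uint8_t *p_subkey2,
                               uint8_t *p_mac_out)
{
    /* 1. Parameter check, once per message. */
    if (!p_subkey1 || !p_subkey2 || !p_mac_out || (!p_input && length != 0u)) {
        return -1;
    }

    /* 2. Block bookkeeping. */
    uint32_t n_blocks = length / BLOCK + ((length % BLOCK) != 0u ? 1u : 0u);
    const int complete = (n_blocks != 0u) && ((length % BLOCK) == 0u);
    if (n_blocks == 0u) { n_blocks = 1u; }
    const uint32_t prefix_count = n_blocks - 1u;
    const uint32_t tail_len = length - prefix_count * BLOCK;  /* 0..16 */
    assert(tail_len <= BLOCK);

    /* 3. Control words: key reload only on the first hardware call. */
    const uint32_t ctrl_first = CTL_START | CTL_NEW_KEY;
    const uint32_t ctrl_rest = 0x0000u;
    uint32_t ctl = ctrl_first;

    uint8_t x[BLOCK] = {0};  /* chaining value */
    uint8_t y[BLOCK];

    /* 4 and 5. All blocks before the last: XOR, then hardware. */
    for (uint32_t b = 0; b < prefix_count; b++) {
        const uint8_t *m = p_input + b * BLOCK;
        for (unsigned i = 0; i < BLOCK; i++) { y[i] = (uint8_t)(x[i] ^ m[i]); }
        mock_hw_ecb(ctl, y, x);
        ctl = ctrl_rest;
    }

    /* 6. Last block: pad if incomplete, XOR in the matching subkey. */
    uint8_t last[BLOCK] = {0};
    if (tail_len != 0u) { memcpy(last, p_input + prefix_count * BLOCK, tail_len); }
    if (!complete) { last[tail_len] = 0x80u; }
    const uint8_t *subkey = complete ? p_subkey1 : p_subkey2;
    for (unsigned i = 0; i < BLOCK; i++) {
        y[i] = (uint8_t)(last[i] ^ subkey[i] ^ x[i]);
    }
    mock_hw_ecb(ctl, y, p_mac_out);
    return 0;
}
```

With everything in one function, the parameter check, the control-word selection, and the subkey choice each happen once per message rather than once per block, and the hardware is invoked exactly n_blocks times.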
In performance-sensitive embedded scenarios, however, this trade-off is reasonable:
- CMAC is the main use case of this hardware accelerator, not an edge case;
- in a bare-metal environment there is no contention between processes, so strict abstraction isolation is unnecessary;
- R_AES_HW_ProcessBlock_Aligned() is retained as a general-purpose interface that other algorithms can still use.

4.4 Round 4: Optimize PSW register reads inside the loop
Problem analysis: The DISABLE_INTERRUPT_WITH_CHECK macro expands as follows:

    #define DISABLE_INTERRUPT_WITH_CHECK(saved) do {                            \
        (saved) = STSR(PSW);          /* read system register, ~3 instructions */ \
        if (0 == ((saved) & 0x20)) {  /* conditional branch, ~2 instructions */   \
            __DI();                   /* ~1 instruction */                        \
        }                                                                         \
    } while (0)

The macro runs twice per block (before writing IDAT and before reading ODAT), so STSR(PSW) alone costs about 6 instructions per block. STSR is a privileged system-register read and costs more than an ordinary memory access. While the loop runs, the PSW.ID bit (the interrupt-disable flag) is not changed by external code: in a bare-metal environment, no other thread modifies the interrupt state in the middle of the loop.
Change: The per-block STSR(PSW) reads are removed from the loop, and the interrupt state is determined once, before the loop.
Effect: Eliminates two STSR(PSW) reads, about 8 instructions per block. Measured time: 0.2 s → 180 ms, about 1.1× faster.

4.5 Round 5: Enable compiler optimization (-Ospeed)
Problem analysis: After the first four rounds of manual optimization, the obvious CPU-side redundancy had largely been eliminated, and further manual refactoring offered diminishing returns. Looking back over the whole effort, however, one dimension had never been touched: the compiler optimization level. The .gpj files for the CMAC and AES hardware driver code may previously have used the default or a low optimization level (such as -O0 or -O1).
At low optimization levels, the compiler:
- does not inline functions (even within the same compilation unit);
- does not eliminate redundant load/store operations;
- does not perform loop unrolling or instruction scheduling;
- does not exploit the RH850's addressing modes.

Change: Add the -Ospeed compiler option to the following .gpj files:
- the CMAC-related .gpj (the project file containing aes_cmac.c)
- the AES hardware driver .gpj (the project file containing icum_d_aes_hw.c)

Effect: Measured time: 180 ms → 90 ms, about 2× faster.
In relative terms this matches round 3 as the largest single-round gain. It indicates that the code left after four rounds of manual optimization still contained inefficient patterns that the compiler could remove, in particular register save/restore at function-call boundaries, redundant conditional branches, and loop-invariant computations.
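Round 4 is described above without code, and one way to read it is as the same loop-invariant hoisting done by hand. The host-side sketch below is only an approximation under stated assumptions: mock_stsr_psw, mock_di, mock_ei, and mock_hw_step are hypothetical stand-ins for STSR(PSW), __DI(), the matching enable, and a register access, and the restore half of each critical section is assumed, since the article shows only the disable macro.

```c
#include <assert.h>
#include <stdint.h>

#define PSW_ID_BIT 0x20u  /* PSW.ID mask, as tested by the round 4 macro */

static uint32_t g_mock_psw;   /* simulated PSW value */
static unsigned g_psw_reads;  /* number of simulated STSR(PSW) reads */

static uint32_t mock_stsr_psw(void) { g_psw_reads++; return g_mock_psw; }
static void mock_di(void) { g_mock_psw |= PSW_ID_BIT; }
static void mock_ei(void) { g_mock_psw &= ~PSW_ID_BIT; }

/* Simulated register access; it must run with interrupts masked. */
static void mock_hw_step(void) { assert((g_mock_psw & PSW_ID_BIT) != 0u); }

/* Before: every critical section re-reads PSW (two sections per block). */
static void process_checked(unsigned n_blocks)
{
    for (unsigned b = 0; b < n_blocks; b++) {
        for (int phase = 0; phase < 2; phase++) {  /* IDAT write, ODAT read */
            const uint32_t saved = mock_stsr_psw();
            if ((saved & PSW_ID_BIT) == 0u) { mock_di(); }
            mock_hw_step();
            if ((saved & PSW_ID_BIT) == 0u) { mock_ei(); }
        }
    }
}

/* After: PSW is read once. This is valid only under the article's
 * assumption that nothing else changes PSW.ID while the loop runs. */
static void process_hoisted(unsigned n_blocks)
{
    const uint32_t saved = mock_stsr_psw();
    const int was_enabled = (saved & PSW_ID_BIT) == 0u;
    for (unsigned b = 0; b < n_blocks; b++) {
        for (int phase = 0; phase < 2; phase++) {
            if (was_enabled) { mock_di(); }
            mock_hw_step();
            if (was_enabled) { mock_ei(); }
        }
    }
}
```

For 10 blocks, the first version performs 20 simulated PSW reads and the second performs one, while both leave the interrupt state as they found it.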