# CANN Fused Causal 1D Convolution Operator
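Before the API details, the operator's core semantics (cache concatenation, kernel-width-3 causal convolution, cache update, optional residual) can be sketched as a host-side C++ reference for a single sequence and a single feature channel. This is an illustrative sketch, not the NPU implementation; the function name `CausalConv1dRef` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference of the fused causal 1D convolution for one sequence and
// one feature channel. K is the kernel width (fixed to 3 in the operator).
// `cache` holds the last K-1 inputs from the previous step; it is updated in
// place, mirroring the operator's in-place convStates update.
std::vector<float> CausalConv1dRef(const std::vector<float>& x,  // [L] input sequence
                                   const std::vector<float>& w,  // [K] kernel
                                   std::vector<float>& cache,    // [K-1], in/out state
                                   bool residual)
{
    const size_t K = w.size();
    const size_t L = x.size();
    // Cache concatenation: xExt = [cache | x], length L + K - 1.
    std::vector<float> xExt(cache);
    xExt.insert(xExt.end(), x.begin(), x.end());
    // Causal 1D convolution: y[i] = sum_k w[k] * xExt[i + k], so each output
    // depends only on the current input and its history.
    std::vector<float> y(L, 0.0f);
    for (size_t i = 0; i < L; ++i) {
        for (size_t k = 0; k < K; ++k) {
            y[i] += w[k] * xExt[i + k];
        }
        if (residual) {
            y[i] += x[i];  // optional residual connection
        }
    }
    // Cache update: keep the last K-1 inputs for the next call.
    for (size_t i = 0; i + 1 < K; ++i) {
        cache[i] = xExt[L + i];
    }
    return y;
}
```

With `w = {1, 1, 1}`, an all-ones input of length 3, and a zero-initialized cache, the outputs are `{1, 2, 3}` and the cache becomes `{1, 1}`, matching the formulas in the next section.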
## aclnnFusedCausalConv1d

> [Free download] ops-transformer: CANN's transformer-class large-model operator library, accelerating network computation on NPU. Project address: https://gitcode.com/cann/ops-transformer (view source)

### Product Support

| Product | Supported |
| --- | --- |
| Ascend 950PR / Ascend 950DT | √ |
| Atlas A3 Training Series / Atlas A3 Inference Series | × |
| Atlas A2 Training Series / Atlas A2 Inference Series | × |
| Atlas 200I/500 A2 Inference Products | × |
| Atlas Inference Series | × |
| Atlas Training Series | × |

### Function Description

- Interface function: performs a causal 1D convolution on sequences. Along the sequence dimension, the head of each sequence is padded with cached data of length kernel width minus 1, ensuring each output depends only on the current and historical inputs. After the convolution, the tail of the current sequence (length kernel width minus 1) is written back to the cache.
- Residual connection: optionally adds the original input to the causal convolution output.

Supported scenarios:

- **Scenario 1 (prefill)**: x: [cu_seq_len, dim]; weight: [K, dim] (K = 3); convStates: [-1, K-1, dim]; queryStartLoc: [batch+1]; cacheIndices: [batch]; initialStateMode: [batch]; bias: [dim] (unused); numAcceptedTokens: [batch] (unused); y: [cu_seq_len, dim]; runMode: 0. Here cu_seq_len is the total length of all variable-length sequences in the batch after concatenation.
- **Scenario 2 (decode, variable-length sequences)**: x: [cu_seq_len, dim]; weight: [K, dim] (K = 3); convStates: [-1, state_len, dim]; queryStartLoc: [batch+1]; cacheIndices: [batch]; initialStateMode: [batch]; bias: [dim] (unused); numAcceptedTokens: [batch] (used for speculative decoding); y: [cu_seq_len, dim]; runMode: 1. state_len must be greater than the maximum token count across the batch plus 1.
- **Scenario 3 (decode, fixed batch)**: x: [batch, m+1, dim]; weight: [K, dim] (K = 3); convStates: [-1, K-1+m, dim]; queryStartLoc: [batch+1] (unused); cacheIndices: [batch]; initialStateMode: [batch]; bias: [dim] (unused); numAcceptedTokens: [batch] (used for speculative decoding; m is the number of speculative tokens); y: [batch, m+1, dim]; runMode: 1.

### Computation Formulas

K is the kernel width (fixed to 3), L is the original sequence length, and dim is the feature dimension. Let $x'$ denote the cache-concatenated sequence.

Cache concatenation:

$$
x'[i, dim] = \begin{cases} cacheState[i, dim], & 0 \leq i < K-1 \\ x[i - (K-1), dim], & K-1 \leq i < L + K - 1 \end{cases}
$$

Causal 1D convolution:

$$
y[i, dim] = \sum_{k=0}^{K-1} w[k, dim] \cdot x'[i + k, dim]
$$

Cache update:

$$
cacheState[i, dim] = x'[L + i, dim], \quad i = 0, 1, \dots, K-2
$$

Residual connection (optional):

$$
y[i, dim] = y[i, dim] + x[i, dim]
$$

### Function Prototype

Each operator uses a two-phase API: first call aclnnFusedCausalConv1dGetWorkspaceSize to validate the inputs, compute the required workspace size, and obtain an executor containing the operator's computation flow; then call aclnnFusedCausalConv1d to execute the computation.

```cpp
aclnnStatus aclnnFusedCausalConv1dGetWorkspaceSize(
    const aclTensor *x, const aclTensor *weight, aclTensor *convStates,
    const aclTensor *queryStartLoc, const aclTensor *cacheIndices,
    const aclTensor *initialStateMode, const aclTensor *bias,
    const aclTensor *numAcceptedTokens, int64_t activationMode,
    int64_t padSlotId, int64_t runMode, int64_t residualConnection,
    const aclTensor *y, uint64_t *workspaceSize, aclOpExecutor **executor)

aclnnStatus aclnnFusedCausalConv1d(
    void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream)
```

### aclnnFusedCausalConv1dGetWorkspaceSize Parameters

| Name | In/Out | Description | Data Type | Format | Dims | Non-contiguous Tensor |
| --- | --- | --- | --- | --- | --- | --- |
| x (aclTensor*) | Input | The x in the formulas; the input sequence. Empty tensors are not supported. Prefill: [cu_seq_len, dim]. Decode: [cu_seq_len, dim] or [batch, seq_len, dim]. | FLOAT16, BFLOAT16 | ND | 2-3 | √ |
| weight (aclTensor*) | Input | The w in the formulas; the causal 1D convolution kernel. Empty tensors are not supported. Shape [K, dim]. | Same as x | ND | 2 | × |
| convStates (aclTensor*) | Input/Output | The cacheState in the formulas; the cache state tensor storing each sequence's historical token data, updated in place after each sequence's computation. Empty tensors are not supported. Shape [..., K-1, dim]. | Same as x | ND | 3 | √ |
| queryStartLoc (aclTensor*) | Input | Sequence start indices, recording each sequence's start offset in the concatenated tensor x. Empty tensors are not supported. Shape [batch+1]; queryStartLoc[i] is the start offset of sequence i. | INT32 | ND | 1 | × |
| cacheIndices (aclTensor*) | Input | Cache indices specifying which convStates slot each sequence uses. Empty tensors are not supported. Shape [batch]. | INT32 | ND | 1 | × |
| initialStateMode (aclTensor*) | Input | Per-sequence flag for whether cached data is used. Empty tensors are not supported. Shape [batch]; values 0, 1, 2: 0 = zero padding, 1 = use cache, 2 = use cache but zero the first K-1 outputs. | INT32 | ND | 1 | × |
| bias (aclTensor*) | Optional input | Convolution bias. Empty tensors are not supported. Shape [dim]. | Same as x | ND | 1 | × |
| numAcceptedTokens (aclTensor*) | Optional input | Number of speculative tokens in the decode scenario. Empty tensors are not supported. Shape [batch]. | INT32 | ND | 1 | × |
| activationMode (int64_t) | Attribute | Activation type; values 0, 1, 2: 0 = None, 1 = silu, 2 = swish. | INT64 | - | - | - |
| padSlotId (int64_t) | Attribute | Used to skip batches that should not participate in the computation: when cacheIndices[i] == padSlotId, batch i is skipped. | INT64 | - | - | - |
| runMode (int64_t) | Attribute | Selects the scenario; values 0, 1: 0 = prefill, 1 = decode. | INT64 | - | - | - |
| residualConnection (int64_t) | Attribute | Whether to apply the residual connection; values 0, 1: 0 = no residual, 1 = output y plus input x. | INT64 | - | - | - |
| y (aclTensor*) | Output | The y in the formulas; the output sequence. Shape and data type match x. | Same as x | ND | 2-3 | × |
| workspaceSize (uint64_t*) | Output | Returns the workspace size the user must allocate on the device. | - | - | - | - |
| executor (aclOpExecutor**) | Output | Returns the op executor containing the operator's computation flow. | - | - | - | - |

#### Return Value

Returns an aclnnStatus code; see the aclnn return codes. The first-phase API validates the inputs and reports errors in the following cases:

| Return Code | Error Code | Description |
| --- | --- | --- |
| ACLNN_ERR_PARAM_NULLPTR | 161001 | x, weight, convStates, or y is a null pointer. |
| ACLNN_ERR_INNER_TILING_ERROR | 561002 | Input or output data types are out of the supported range; x, weight, convStates, bias, and y have inconsistent data types; queryStartLoc, cacheIndices, initialStateMode, and numAcceptedTokens have inconsistent data types; input or output tensor shapes are out of the supported range; an attribute is out of the supported range; dim is out of the allowed range. |

### aclnnFusedCausalConv1d Parameters

| Name | In/Out | Description |
| --- | --- | --- |
| workspace | Input | Device-side workspace memory address. |
| workspaceSize | Input | Device-side workspace size, obtained from the first-phase API aclnnFusedCausalConv1dGetWorkspaceSize. |
| executor | Input | Op executor containing the operator's computation flow. |
| stream | Input | The stream on which the task executes. |

#### Return Value

Returns an aclnnStatus code; see the aclnn return codes.

### Constraints

- Deterministic computation: aclnnFusedCausalConv1d is deterministic by default.
- Input shape constraints:
  - Prefill: x is 2D [cu_seq_len, dim]; weight is 2D [K, dim] with K fixed to 3; convStates is 3D [..., K-1, dim], where dim 0 is not fixed but must be at least batch. cu_seq_len range [batch, 65536]; dim range [128, 16384] and a multiple of 128; batch range [1, 256].
  - Decode (fixed batch): x is 3D [batch, m+1, dim]; weight is 2D [K, dim] with K fixed to 3; convStates is 3D [..., K-1+m, dim], where dim 0 is not fixed but must be at least batch. m range [0, 5]; dim range [128, 16384] and a multiple of 128; batch range [1, 256].
  - Decode (variable-length): x is 2D [cu_seq_len, dim]; weight is 2D [K, dim] with K fixed to 3; convStates is 3D [..., state_len, dim], where dim 0 is not fixed but must be at least batch, and state_len must be greater than the maximum token count across the batch plus 1. cu_seq_len range [batch, batch*6]; each batch's token count range [1, 6]; dim range [128, 16384] and a multiple of 128; batch range [1, 256].
- Input value constraints:
  - queryStartLoc holds cumulative offsets in [0, cu_seq_len], with length batch+1; queryStartLoc[i] is the start offset of sequence i, and queryStartLoc[batch] is the end position of the last sequence.
  - cacheIndices has length batch and specifies each sequence's cache slot index.
  - numAcceptedTokens may be None; when non-None, its length is batch and each element is greater than 0 and no larger than the token count of that batch.

### Example

The following code is for reference only; for building and running, see "Compiling and Running Samples".

```cpp
/**
 * Copyright (c) 2025 Huawei Technologies Co., Ltd.
 * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
 * CANN Open Software License Agreement Version 2.0 (the "License").
 * Please refer to the License for details. You may not use this file except in compliance with the License.
 * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
 * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
 * See LICENSE in the root of the software repository for the full text of the License.
 */

/*!
 * \file test_aclnn_fused_causal_conv1d.cpp
 * \brief
 */

#include <iostream>
#include <vector>
#include "acl/acl.h"
#include "aclnn/opdev/fp16_t.h"
#include "aclnnop/aclnn_fused_causal_conv1d.h"

#define CHECK_RET(cond, return_expr) \
    do {                             \
        if (!(cond)) {               \
            return_expr;             \
        }                            \
    } while (0)

#define LOG_PRINT(message, ...)         \
    do {                                \
        printf(message, ##__VA_ARGS__); \
    } while (0)

int64_t GetShapeSize(const std::vector<int64_t>& shape)
{
    int64_t shapeSize = 1;
    for (auto i : shape) {
        shapeSize *= i;
    }
    return shapeSize;
}

int Init(int32_t deviceId, aclrtStream* stream)
{
    // Fixed boilerplate: resource initialization
    auto ret = aclInit(nullptr);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
    ret = aclrtSetDevice(deviceId);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
    ret = aclrtCreateStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
    return 0;
}

template <typename T>
int CreateAclTensor(const std::vector<T>& hostData, const std::vector<int64_t>& shape, void** deviceAddr,
                    aclDataType dataType, aclTensor** tensor, aclFormat format)
{
    auto size = GetShapeSize(shape) * sizeof(T);
    // Allocate device memory with aclrtMalloc
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
    // Copy host data to device memory with aclrtMemcpy
    ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    // Compute strides for a contiguous tensor
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    // Create the aclTensor with aclCreateTensor
    *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, format, shape.data(),
                              shape.size(), *deviceAddr);
    return 0;
}

int main()
{
    // 1. Fixed boilerplate: device/stream initialization; see the acl API reference.
    // Set deviceId according to your actual device.
    int32_t deviceId = 0;
    aclrtStream stream;
    auto ret = Init(deviceId, &stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);

    // 2. Construct inputs and outputs; adapt to the actual API.
    int64_t K = 3;
    int64_t dim = 128;
    int64_t batch = 4;
    int64_t numSlots = 8;
    // Prefill scenario: seq_lens = [5, 3, 7, 4], cu_seq_len = 19
    int64_t cuSeqLen = 19;
    int64_t stateLen = K - 1; // 2

    std::vector<int64_t> xShape = {cuSeqLen, dim};
    std::vector<int64_t> weightShape = {K, dim};
    std::vector<int64_t> convStatesShape = {numSlots, stateLen, dim};
    std::vector<int64_t> queryStartLocShape = {batch + 1};
    std::vector<int64_t> cacheIndicesShape = {batch};
    std::vector<int64_t> initialStateModeShape = {batch};
    std::vector<int64_t> biasShape = {dim};
    std::vector<int64_t> numAcceptedTokensShape = {batch};
    std::vector<int64_t> yShape = {cuSeqLen, dim};

    void* xDeviceAddr = nullptr;
    void* weightDeviceAddr = nullptr;
    void* convStatesDeviceAddr = nullptr;
    void* queryStartLocDeviceAddr = nullptr;
    void* cacheIndicesDeviceAddr = nullptr;
    void* initialStateModeDeviceAddr = nullptr;
    void* biasDeviceAddr = nullptr;
    void* numAcceptedTokensDeviceAddr = nullptr;
    void* yDeviceAddr = nullptr;
    aclTensor* x = nullptr;
    aclTensor* weight = nullptr;
    aclTensor* convStates = nullptr;
    aclTensor* queryStartLoc = nullptr;
    aclTensor* cacheIndices = nullptr;
    aclTensor* initialStateMode = nullptr;
    aclTensor* bias = nullptr;
    aclTensor* numAcceptedTokens = nullptr;
    aclTensor* y = nullptr;

    // Initialize host data
    std::vector<op::fp16_t> hostX(cuSeqLen * dim, 1.0f);
    std::vector<op::fp16_t> hostWeight(K * dim, 1.0f);
    std::vector<op::fp16_t> hostConvStates(numSlots * stateLen * dim, 0.0f);
    // query_start_loc = [0, 5, 8, 15, 19] (cumulative offsets)
    std::vector<int32_t> hostQueryStartLoc = {0, 5, 8, 15, 19};
    // cache_indices = [0, 3, 1, 5]
    std::vector<int32_t> hostCacheIndices = {0, 3, 1, 5};
    // initial_state_mode = [1, 0, 2, 1]
    std::vector<int32_t> hostInitialStateMode = {1, 0, 2, 1};
    std::vector<op::fp16_t> hostBias(dim, 0);
    std::vector<int32_t> hostNumAcceptedTokens(batch, 0);
    std::vector<op::fp16_t> hostY(cuSeqLen * dim, 0.0f);

    // Create the x aclTensor
    ret = CreateAclTensor(hostX, xShape, &xDeviceAddr, aclDataType::ACL_FLOAT16, &x, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the weight aclTensor
    ret = CreateAclTensor(hostWeight, weightShape, &weightDeviceAddr, aclDataType::ACL_FLOAT16, &weight,
                          aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the convStates aclTensor
    ret = CreateAclTensor(hostConvStates, convStatesShape, &convStatesDeviceAddr, aclDataType::ACL_FLOAT16,
                          &convStates, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the queryStartLoc aclTensor
    ret = CreateAclTensor(hostQueryStartLoc, queryStartLocShape, &queryStartLocDeviceAddr, aclDataType::ACL_INT32,
                          &queryStartLoc, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the cacheIndices aclTensor
    ret = CreateAclTensor(hostCacheIndices, cacheIndicesShape, &cacheIndicesDeviceAddr, aclDataType::ACL_INT32,
                          &cacheIndices, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the initialStateMode aclTensor
    ret = CreateAclTensor(hostInitialStateMode, initialStateModeShape, &initialStateModeDeviceAddr,
                          aclDataType::ACL_INT32, &initialStateMode, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the bias aclTensor
    ret = CreateAclTensor(hostBias, biasShape, &biasDeviceAddr, aclDataType::ACL_FLOAT16, &bias,
                          aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the numAcceptedTokens aclTensor
    ret = CreateAclTensor(hostNumAcceptedTokens, numAcceptedTokensShape, &numAcceptedTokensDeviceAddr,
                          aclDataType::ACL_INT32, &numAcceptedTokens, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the y aclTensor
    ret = CreateAclTensor(hostY, yShape, &yDeviceAddr, aclDataType::ACL_FLOAT16, &y, aclFormat::ACL_FORMAT_ND);
    CHECK_RET(ret == ACL_SUCCESS, return ret);

    // 3. Call the CANN operator-library API; replace with the actual API name.
    uint64_t workspaceSize = 0;
    aclOpExecutor* executor;
    // Attribute settings
    int64_t activationMode = 0;     // 0: None
    int64_t padSlotId = -1;         // -1: do not skip any batch
    int64_t runMode = 0;            // 0: prefill scenario
    int64_t residualConnection = 1; // 1: apply the residual connection

    // First phase: aclnnFusedCausalConv1dGetWorkspaceSize
    ret = aclnnFusedCausalConv1dGetWorkspaceSize(x, weight, convStates, queryStartLoc, cacheIndices,
                                                 initialStateMode, bias, numAcceptedTokens, activationMode,
                                                 padSlotId, runMode, residualConnection, y, &workspaceSize,
                                                 &executor);
    CHECK_RET(ret == ACL_SUCCESS,
              LOG_PRINT("aclnnFusedCausalConv1dGetWorkspaceSize failed. ERROR: %d\n", ret); return ret);
    // Allocate device memory for the workspace size computed by the first phase
    void* workspaceAddr = nullptr;
    if (workspaceSize > 0) {
        ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
    }
    // Second phase: aclnnFusedCausalConv1d
    ret = aclnnFusedCausalConv1d(workspaceAddr, workspaceSize, executor, stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnFusedCausalConv1d failed. ERROR: %d\n", ret); return ret);

    // 4. Fixed boilerplate: synchronize and wait for task completion
    ret = aclrtSynchronizeStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

    // 5. Copy the result from device memory back to the host; adapt to the actual API.
    auto size = GetShapeSize(yShape);
    std::vector<op::fp16_t> resultData(size, 0);
    ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), yDeviceAddr,
                      size * sizeof(resultData[0]), ACL_MEMCPY_DEVICE_TO_HOST);
    CHECK_RET(ret == ACL_SUCCESS,
              LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
    LOG_PRINT("First 10 output values:\n");
    for (int64_t i = 0; i < 10; i++) {
        std::cout << "index: " << i << ": " << static_cast<float>(resultData[i]) << std::endl;
    }

    // 6. Destroy the aclTensor objects; adapt to the actual API.
    aclDestroyTensor(x);
    aclDestroyTensor(weight);
    aclDestroyTensor(convStates);
    aclDestroyTensor(queryStartLoc);
    aclDestroyTensor(cacheIndices);
    aclDestroyTensor(initialStateMode);
    aclDestroyTensor(bias);
    aclDestroyTensor(numAcceptedTokens);
    aclDestroyTensor(y);

    // 7. Release device resources; adapt to the actual API.
    aclrtFree(xDeviceAddr);
    aclrtFree(weightDeviceAddr);
    aclrtFree(convStatesDeviceAddr);
    aclrtFree(queryStartLocDeviceAddr);
    aclrtFree(cacheIndicesDeviceAddr);
    aclrtFree(initialStateModeDeviceAddr);
    aclrtFree(biasDeviceAddr);
    aclrtFree(numAcceptedTokensDeviceAddr);
    aclrtFree(yDeviceAddr);
    if (workspaceSize > 0) {
        aclrtFree(workspaceAddr);
    }
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();

    LOG_PRINT("Test completed successfully!\n");
    return 0;
}
```

Authoring declaration: parts of this article were produced with AI assistance (AIGC) and are for reference only.