CANN/cannbot-skills: A2 double-bridge pattern

# a2 Cube-to-Vec-to-Cube Pattern (Double GM Bridge, One-Tile Lookahead)

cannbot-skills: CANNBot is a family of agents that improve development efficiency for CANN; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

- one cube stage that produces a tile
- vec logic that transforms that tile
- a later cube stage that consumes the vec result

Do not use this file for a5 kernels. The a5 path is materially different because it can publish vec output directly to `L1`.

## When to use

- the formula is structurally `cube -> vec -> cube`
- the vec result must be consumed by a later cube matmul
- the later cube stage naturally runs one iteration behind the producer stage
- the vec output is large enough that pretending it stays purely on chip would be misleading

Typical example:

- `score_j = q @ k_j^T`
- vec computes `p_j = exp(score_j - running_max).half()`
- delayed cube stage computes `pv_j = p_j.float() @ v_j.float()`

## Why a2 needs a special pattern

Two a2 hardware/software constraints dominate the design:

1. `l0c_to_ub` is unavailable. Cube output cannot go directly from `L0C` to `UB`.
2. `ub_to_l1_nd2nz` / `ub_to_l1_nz` are a5-only. Vec output cannot go directly from `UB` to `L1`.

Practical consequence:

- `cube -> vec` must bridge through GM workspace
- `vec -> cube` must also bridge through GM workspace

So the real a2 flow is:

```text
GM(q,k) -> L1 -> L0 -> L0C -> GM(score_ws) -> UB -> vec -> GM(p_ws) -> L1 -> L0 -> L0C -> GM(pv)
```

This is the core difference from a5.

## Stable schedule: warmup, steady state, drain

The clean control structure is a one-tile lookahead loop:

```python
for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        # stage 1: produce current tile p_j
        ...
    if ni > 0:
        # stage 2: consume previous tile p_{j-1}
        ...
```

Meaning:

- `ni < tiles_n`: produce tile `j = ni`
- `ni > 0`: consume tile `j = ni - 1`

This creates:

- warmup: the first iteration only produces
- steady state: middle iterations produce `j` while consuming `j-1`
- drain: the last iteration only consumes the final tile

Do not force both stages into the same tile index inside one iteration.
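
The warmup, steady-state, and drain phases can be checked with a small host-side sketch in plain Python. It does not use `easyasc`; the helper name `lookahead_schedule` is hypothetical and only records which tile each stage would touch in every iteration of the `tiles_n + 1` loop.

```python
def lookahead_schedule(tiles_n):
    """Return (ni, produced_tile, consumed_tile) for each loop iteration.

    None means the stage is skipped in that iteration.
    """
    steps = []
    for ni in range(0, tiles_n + 1):
        # stage 1 runs while there is still a tile to produce: j = ni
        produced = ni if ni < tiles_n else None
        # stage 2 runs one iteration behind: j = ni - 1
        consumed = ni - 1 if ni > 0 else None
        steps.append((ni, produced, consumed))
    return steps
```

For `tiles_n = 3`, the first entry has no consumed tile (warmup), the last entry has no produced tile (drain), and every middle entry consumes exactly the tile produced one iteration earlier.
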
The delayed consumer is the point of the pattern.

## Workspace layout

Use two separate GM workspaces:

- `score_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: bridge `L0C(score) -> UB`
- `p_ws`
  - dtype: `half`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: bridge `UB(p_j) -> L1`

Why two workspaces:

- stage-1 score is naturally `float`
- stage-2 cube input should consume `half` if the target contract is `p_j.half().float() @ v_j.float()`
- keeping them separate makes dtype intent explicit and avoids hidden casts

## Buffer ownership and reuse

### 1. Reuse one `L0C` family across both cube stages

On a2, `TILE_M = TILE_N = 128` with float accumulation already fills the entire `128 KB` `L0C`: one `[128, 128]` float tile is 64 KB, and a `DBuff` holds two of them. That leaves no room for a second full-size `L0C` family.

Stable rule for this specific pattern:

- reuse one physical `l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)`
- let both cube stages write into that same family
- advance one shared `l0c_cnt`

Why this is safe here:

- stage 1 and stage 2 do not need `L0C` simultaneously
- stage 1 publishes `L0C -> score_ws` before stage 2 reuses the slot
- stage 2 publishes `L0C -> output` before the next stage-1 reuse

Do not generalize this into "all delayed stages should share counters". This is a targeted, capacity-driven exception for one serially reused `L0C` family.

### 2. Keep other lifetimes separate

Even though `l0c_cnt` is shared, other stage-owned lifetimes should stay separate:

- `l1q/l1k` and `l1p/l1v` should not share one counter
- delayed slot ownership should use `stage1_cnt` and `stage2_cnt`

Recommended split:

- `l1qk_cnt`: stage-1 operand loads
- `l1pv_cnt`: stage-2 operand loads
- `l0c_cnt`: shared physical `L0C` family
- `stage1_cnt`: `score_ws` / `p_ws` producer slot rhythm
- `stage2_cnt`: delayed consumer slot rhythm

### 3. If a delayed consumer reuses a producer operand, match buffer depth to the overlap

Sometimes the delayed cube stage needs not only the vec result, but also one of the original stage-1 operands again.

Concrete example from dense attention backward:

- stage 1 loads `k_j` and computes `qk_j = q @ k_j^T`
- vec computes `dqk_j`
- delayed cube stage later computes `gq` from `dqk_j @ k_j`

If you want to avoid reloading `k_j` from GM, keep that operand family on chip and reuse it from the delayed stage.

Important overlap rule:

- for a one-tile lookahead loop, `DBuff` is often *not* enough for a reused producer operand
- while the delayed stage is still consuming tile `j`, the producer may already be starting tile `j+2`
- with only two slots, tile `j+2` can overwrite slot `j` before the delayed consumer is done

Stable rule for this case:

- promote only the reused delayed operand family to `TBuff`
- keep unrelated families such as `v` on `DBuff` if they are not reused by the delayed stage
- let the delayed consumer index that `TBuff` by its own delayed-stage lineage, not by the immediate producer slot

Practical outcome:

- `k` may need `TBuff`
- `v` may still stay `DBuff`
- the extra on-chip slot can be cheaper than a second GM read on every tile

This is a lifetime decision, not a micro-optimization accident. Choose the buffer depth from the real overlap window.

## Cross-side synchronization

This pattern has two ownership edges.

### Edge 1: `cube -> vec` (`score`)

Use:

```python
CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
```

Reason:

- producer ends with `l0c_to_gm_nz2nd` on `FIX`
- vec consumer starts with `gm_to_ub_pad` on `MTE2`

### Edge 2: `vec -> cube` (`p_j`)

Use:

```python
VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
```

Reason:

- vec producer ends with `ub_to_gm_pad` on `MTE3`
- cube consumer finishes the delayed use only after `gm_to_l1 -> l1_to_l0 -> mmad -> writeback`
- for this pattern, conservative release is safer: free the slot only after the cube stage finishes the tile

This conservative `dst_end_pipe=Pipe.FIX` matches the "do not release early" rule for delayed reuse.

## Two-sub-block publication rule

Each a2 cube core has 2 vec sub-blocks.
Each vec sub-block owns only `HALF_M` rows in `UB`.

So stage 1 should:

- read `HALF_M` rows from `score_ws`
- compute `p_j` for only those rows
- write those rows into the shared `p_ws` slot

Typical write pattern:

```python
sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)
p_ws[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N] <<= ub_p
```

Then stage 2 cube waits on the `VcMutex` and reads the full tile:

```python
l1p[...] <<= p_ws[cube_idx, slot, 0:TILE_M, 0:TILE_N]
```

Important simulator/runtime fact:

- cube-side `wait_vec()` completes only after both vec lanes have produced their tokens
- this makes the full-tile read safe without an extra manual barrier

## Row-max state rules

If the vec stage uses running row max across tiles:

- keep the running state in `[HALF_M, 1]` scalar format
- initialize with `dup(neg_large)`, where `neg_large` is a sufficiently large finite negative sentinel
- update with `vmax(ub_rmax_s, ub_rmax_s, ub_max_s)`
- broadcast only after the scalar update, using `brcb`

For `TILE_N = 128`, the stable sequence is:

1. `vmax` between the two 64-column halves
2. `cmax` to `[HALF_M, 1]`
3. `vmax` with running state in `[HALF_M, 1]`
4. `brcb` to `[HALF_M, 8]`
5. sliced `sub` on `[0:64]` and `[64:128]`
6. `exp`
7. `cast` to `half`

Do not:

- update running max in `[HALF_M, 8]` broadcast format
- subtract a narrow max buffer from an unsliced `[HALF_M, 128]` tile

## Stage ordering inside one loop iteration

For this a2 pattern, a stable order is:

1. stage 1 cube computes `score_j`
2. stage 1 vec computes `p_j` and writes `p_ws`
3. stage 2 cube consumes delayed `p_{j-1}`

In other words: produce the current tile first, then consume the previous tile.

Why this order is helpful:

- the reused `L0C` family is naturally free after `score_j -> score_ws`
- the delayed cube stage can then reuse that same `L0C` family safely
- one shared `l0c_cnt` remains easy to reason about

If the delayed stage also reuses stage-1 `k_j` on chip:

- the schedule is still "produce `j`, then consume `j-1`"
- but the `k` buffer family now lives longer than the immediate `v` family
- reflect that longer lifetime in the buffer depth (`TBuff`) and in the counter choice

## Output layout rule

For flattened GM output that preserves `[B, H, tiles_n, S1, D]`, a stable write index is:

```python
out_row = Var((bh * tiles_n + tile_n) * S1 + local_row)
```

That corresponds to the physical layout `[(bh * tiles_n + tile_n) * S1 + row, D]`.

Use this when the user wants to preserve the logical `[B, H, tile_n, S1, D]` grouping while still flattening `BH` in the kernel contract.

## Validation checklist

For the first runnable version, keep the contract narrow and explicit:

- `S1 % 128 == 0`
- `S2 % 128 == 0`
- `D == 128`
- `scale` passed in as an explicit kernel scalar

Reference formula to compare against:

```python
for j in range(0, S2, 128):
    score_j = q.float() @ k_j.float().t() * scale
    m = maximum(m, rowmax(score_j))
    p_j = exp(score_j - m).half()
    pv_j = p_j.float() @ v_j.float()
```

Good first validation order:

1. `(B,H,S1,S2,D) = (1,1,256,512,128)`
2. multi-head small case
3. full aligned case such as `(1,3,2048,4096,128)`

## Common mistakes

- trying to use `UB -> L1` directly on a2
- allocating separate full-size `L0C` families for both cube stages
- sharing every counter just because `l0c_cnt` is shared
- forgetting the `tiles_n + 1` warmup/drain loop
- consuming tile `j` in the same iteration that is supposed to produce tile `j`
- writing only one vec sub-block's rows into `p_ws`
- releasing the `vec -> cube` mutex before the delayed cube stage really finishes
- documenting the kernel as online softmax when it only keeps a running max and does not maintain a running sum

## Files to study

- `agent/example/kernels/a2/flash_attn_score_pv.py`
- `agent/example/kernels/a2/flash_attn_score_iter.py`
- `agent/references/patterns/a2-cube-vec.md`
- `agent/references/constraints/a2-device.md`
- `agent/references/constraints/vec-reduction-a2.md`
- `agent/references/constraints/vec-stride.md`

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
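
The reference formula from the validation checklist and the flattened output index can be sketched on the host with NumPy, to produce golden data for a single `(batch, head)` slice. This is an illustrative sketch, not the repository's test harness: the function names `reference_score_pv` and `flat_out_row` and the sentinel value `NEG_LARGE` are assumptions made for the example.

```python
import numpy as np

TILE_N = 128          # column tile width used by the reference loop
NEG_LARGE = -1.0e30   # assumed finite sentinel for the initial running max


def reference_score_pv(q, k, v, scale):
    """Per-tile pv_j for one (batch, head): q is [S1, D], k and v are [S2, D].

    Returns a float32 array of shape [tiles_n, S1, D].
    """
    s1, d = q.shape
    s2 = k.shape[0]
    # Narrow contract from the validation checklist.
    assert s1 % 128 == 0 and s2 % TILE_N == 0 and d == 128

    qf = q.astype(np.float32)
    m = np.full((s1, 1), NEG_LARGE, dtype=np.float32)  # running row max
    tiles = []
    for j in range(0, s2, TILE_N):
        k_j = k[j:j + TILE_N].astype(np.float32)
        v_j = v[j:j + TILE_N].astype(np.float32)
        score_j = (qf @ k_j.T) * np.float32(scale)
        m = np.maximum(m, score_j.max(axis=1, keepdims=True))
        p_j = np.exp(score_j - m).astype(np.float16)   # mirrors .half()
        tiles.append(p_j.astype(np.float32) @ v_j)
    return np.stack(tiles)


def flat_out_row(bh, tile_n, local_row, tiles_n, s1):
    """Row index in the flattened [(bh * tiles_n + tile_n) * S1 + row, D] layout."""
    return (bh * tiles_n + tile_n) * s1 + local_row
```

For a single head (`bh = 0`), `reference_score_pv(...).reshape(tiles_n * S1, D)` places tile `tile_n`, row `local_row` at `flat_out_row(0, tile_n, local_row, tiles_n, S1)`, which is the ordering the kernel output should match. Like the formula it mirrors, this keeps only a running max and no running sum, so it is not a full online softmax.
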