Fix +/-inf in LSE returned by forward #978

sgrigory · 2024-06-03T14:34:17Z

The forward op was returning +inf in LSE for queries which have no keys to attend to, e.g. when the K/V length happens to be 0. This diverges from the definition LSE = log(exp(L_1) + ... + exp(L_n)), which gives log(0) = -inf when the sum is empty.
This PR fixes it, which allows feeding the output LSE directly into ops like merge_attentions without postprocessing.
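
To illustrate why the sign matters for merging, here is a minimal pure-Python sketch. It is not the flash-attention or merge_attentions implementation: `logsumexp` and `merge_two` are hypothetical helpers, and the partial outputs are scalars for brevity. It only mimics how an LSE-weighted merge would treat a chunk that had no keys.

```python
import math

def logsumexp(scores):
    """Stable log(sum(exp(s))); an empty or all -inf input gives -inf."""
    if not scores:
        return -math.inf
    m = max(scores)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(s - m) for s in scores))

def merge_two(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs, weighting each by exp(lse)."""
    lse = logsumexp([lse_a, lse_b])
    if lse == -math.inf:
        # Neither chunk had any keys: output 0, LSE stays -inf.
        return 0.0, lse
    w_a = math.exp(lse_a - lse)  # exp(-inf) == 0.0 for an empty chunk
    w_b = math.exp(lse_b - lse)
    return w_a * out_a + w_b * out_b, lse

# Chunk A had no keys (output 0, LSE -inf); chunk B is a normal chunk.
merged_out, merged_lse = merge_two(0.0, -math.inf, 2.0, 0.5)
```

With LSE = -inf the empty chunk gets weight 0 and the merge reduces to chunk B. With LSE = +inf the same weights would involve inf - inf, which is NaN, so the caller would have to postprocess the LSE first.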

pytest tests/test_flash_attn.py
...
======================================================================================== 268004 passed, 152064 skipped in 4404.00s (1:13:23) =========================================================================================

tridao · 2024-06-27T09:40:00Z

One issue I can see is that in the backward pass, if lse = +inf then exp(qk - lse) returns 0, which is what we want. If lse = -inf then exp would blow up.
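
The concern can be reproduced with plain floating-point arithmetic. The sketch below is only an illustration in Python, not the kernel code; `backward_prob` is a hypothetical name for the recomputation exp(qk - lse) mentioned above.

```python
import math

def backward_prob(qk, lse):
    """Recompute a softmax probability from a score and the saved LSE."""
    return math.exp(qk - lse)

finite_score = 1.5
masked_score = -math.inf  # a score that was masked out

p_plus = backward_prob(finite_score, math.inf)     # exp(-inf): 0.0
p_minus = backward_prob(finite_score, -math.inf)   # exp(+inf): inf
p_masked = backward_prob(masked_score, -math.inf)  # -inf - (-inf) is NaN
```

So an unguarded exp(qk - lse) with lse = -inf yields inf for a finite score and NaN for a masked one, which is why a check for -inf is needed before the subtraction.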

GD06 · 2025-01-03T03:13:16Z

QQ: do we plan to merge this PR as it has been pending for months.

sgrigory · 2025-01-10T13:17:13Z

> QQ: do we plan to merge this PR as it has been pending for months.

Sorry, I didn't follow up on @tridao's comment above. Basically, I think there should be no NaNs after this change, because the code actually checks for -inf before computing exp(score - lse) in the backward pass:

flash-attention/csrc/flash_attn/src/softmax.h

Line 75 in 40fa35a

    
           const float max_scaled = max(mi) == -INFINITY ? 0.f : max(mi) * (Scale_max ? scale : float(M_LOG2E));
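
The effect of the guard quoted above can be mimicked in a few lines of Python. This is an illustrative sketch only: `row_probs` and `row_probs_unguarded` are hypothetical helpers, they use the natural exp rather than the kernel's scaled exp2, and they assume the value being subtracted is the row maximum, as in the quoted line.

```python
import math

def row_probs_unguarded(scores):
    """exp(score - max) with no special case for a fully masked row."""
    row_max = max(scores)
    return [math.exp(s - row_max) for s in scores]

def row_probs(scores):
    """Same, but a -inf maximum is replaced by 0 before subtracting."""
    row_max = max(scores)
    safe_max = 0.0 if row_max == -math.inf else row_max
    return [math.exp(s - safe_max) for s in scores]

fully_masked = [-math.inf, -math.inf]

bad = row_probs_unguarded(fully_masked)  # -inf - (-inf) is NaN in every slot
good = row_probs(fully_masked)           # exp(-inf - 0) is 0.0 in every slot
```

For rows with at least one finite score the two versions agree, so the guard only changes the fully masked case.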

Also, in the Hopper kernel we write -inf for out-of-bounds positions:

flash-attention/hopper/epilogue_fwd.hpp

Lines 379 to 385 in 40fa35a

        if (row < seqlen_o) { mLSE(row) = -INFINITY; }
    } else {
        if (row < seqlen_o * qhead_per_khead) {
            int m_idx, h_idx;
            m_idx = params.qhead_per_khead_divmod.divmod(h_idx, row);
            // mLSE shape shape ((qhead_per_khead, seqlen_q)) and it's unhappy with just 1 "make_coord"
            mLSE(make_coord(make_coord(h_idx, m_idx))) = -INFINITY;

flash-attention/hopper/flash_fwd_kernel_sm90.h

Line 406 in 40fa35a

// Write 0 to gO and -inf to gLSE.

If that makes sense and the FA2 code is still relevant, I can add a test that covers the backward behaviour in this situation to make the PR mergeable.

Fix inf in LSE

f8d63d6
