
[bug] use_sliding_window doesn't work as expected #38002


ZhiyuLi-Nvidia opened this issue May 7, 2025 · 2 comments · May be fixed by #38045
ZhiyuLi-Nvidia commented May 7, 2025

System Info

  • transformers: main
  • pytorch, cuda: any version

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM
import numpy as np
import torch


MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

input_ids = torch.randint(0, 100, (1, 8192))  # sequence longer than the 4096-token sliding window

with torch.no_grad():
    output_raw = model(input_ids)

# correct situation: should match the original model, since use_sliding_window is False
model_no_sliding = AutoModelForCausalLM.from_pretrained(MODEL_NAME, sliding_window=None)

with torch.no_grad():
    output_non_sliding = model_no_sliding(input_ids)


np.testing.assert_allclose(output_raw.logits[:, :4096], output_non_sliding.logits[:, :4096])

# bug: the logits beyond position 4096 unexpectedly differ, as if sliding_window=4096 were active
np.testing.assert_allclose(output_raw.logits[:, 4096:], output_non_sliding.logits[:, 4096:])

Expected behavior


What is expected:

  "sliding_window": 4096,
  "use_sliding_window": false,

use_sliding_window is set to false in the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B config, so we expect the sliding window to be disabled. In other words, the results should be identical regardless of the sliding_window value.

However, the results are different in the repro script.
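
As a quick sanity check (a hypothetical snippet; the expected values come from the checkpoint config quoted above), the two flags can be inspected directly:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
# per the config above, this should print: 4096 False
print(config.sliding_window, config.use_sliding_window)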

Root cause

The attention mask is modified according to sliding_window without checking use_sliding_window:

if config.get_text_config().sliding_window is not None:
    # if we have sliding window, we should not attend to tokens beyond sliding window length, so we mask them out also
    # the check is needed to verify is current checkpoint was trained with sliding window or not
    if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
        sliding_attend_mask = torch.arange(target_length, device=cache_position.device) <= (
            cache_position.reshape(-1, 1) - config.get_text_config().sliding_window
        )
        diagonal_attend_mask.bitwise_or_(sliding_attend_mask)

If we add some printing under this conditional block, we can clearly see that the attention mask is modified even with use_sliding_window=false.
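
For example, a print like the following inside that block (illustrative only, reusing the names from the snippet above) fires for the 8192-token input even though the checkpoint sets use_sliding_window to false:

# illustrative debug print, not part of the library code
print(
    "sliding mask applied; use_sliding_window =",
    config.get_text_config().use_sliding_window,
    "| positions newly masked:",
    sliding_attend_mask.sum().item(),
)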

@Rocketknight1
Member

Not sure who's the right code owner here - cc @gante @zucchini-nlp, but feel free to tag Arthur instead if you think he's more appropriate!

@zucchini-nlp
Member

Nice catch! Indeed, I think we should be checking both values, given that in the attention layer we pass the sliding window only when use_sliding_window is set. I can open a PR for that 🤗
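
For illustration, a minimal sketch of what checking both values could look like (an assumption on my part; the actual change in #38045 may differ):

# sketch only: additionally require use_sliding_window before masking beyond the window;
# getattr() guards configs that do not define the flag at all
text_config = config.get_text_config()
if text_config.sliding_window is not None and getattr(text_config, "use_sliding_window", True):
    if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
        sliding_attend_mask = torch.arange(target_length, device=cache_position.device) <= (
            cache_position.reshape(-1, 1) - text_config.sliding_window
        )
        diagonal_attend_mask.bitwise_or_(sliding_attend_mask)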

@zucchini-nlp linked a pull request on May 9, 2025 that will close this issue.