DeepSeek_v3 support #1735

Open · wants to merge 14 commits into main
Conversation

srajabos (Contributor)

What does this PR do?

Adds DeepSeek V3 support to OH (Optimum Habana).

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

srajabos marked this pull request as draft on January 30, 2025, 05:39
@regisss (Collaborator)

regisss commented Jan 30, 2025

@srajabos FYI there is an open PR to add Deepseek V3 to Transformers: huggingface/transformers#35926

We won't be able to rely on the Transformers implementation before Transformers v4.49 is released, but I thought this might be interesting to you.

@srajabos (Contributor, Author)

@regisss, I'll keep this as a draft until it's verified with Transformers v4.49.
Thanks for the update.

@anishagartia

anishagartia commented Feb 5, 2025

> @srajabos FYI there is an open PR to add Deepseek V3 to Transformers: huggingface/transformers#35926
>
> We won't be able to rely on the Transformers implementation before Transformers v4.49 is released, but I thought this might be interesting to you.

DeepSeek V3's (and hence R1's) requirements.txt says the minimum required Transformers version is 4.46.3.
OH currently uses 4.45.2 per its requirements.txt.
It could be an easier step to enable 4.46.3 on Gaudi than to wait for 4.49; then adding the model files on top of that could work.

@srajabos (Contributor, Author)

srajabos commented Feb 5, 2025

@anishagartia, currently we are adding the model files and optimizing for Gaudi. Once we have performance data, the plan is to get it in. Thanks for the link.

srajabos marked this pull request as ready for review on February 12, 2025, 16:15
@srajabos (Contributor, Author)

@yao-matrix @gyou2021 @IT-Forrest - kindly review the code.

@ssarkar2 (Collaborator) left a comment:

[explanatory] comments are just there to help follow the HPU code; no changes are required for those. Sorry for spamming comments in this category, but I thought it might be useful for future readers going through the change and for others looking to port similar models.

[clarification] comments are questions from my end; they are sometimes marked with [minor] when they are minor nitpicks.

tests/test_text_generation_example.py (review thread outdated, resolved)
try:
    from habana_frameworks.torch.hpex.kernels import FusedSDPA
except ImportError:
    print("Not using HPU fused scaled dot-product attention kernel.")
    FusedSDPA = None
Collaborator:

[explanatory] Import hpu fused ops
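For readers porting similar models, here is a minimal sketch of the same guarded-import pattern for the other fused HPU kernels referenced in this review (FusedRMSNorm, FusedRoPE). The module paths and aliases are assumptions taken from other OH modeling files such as Llama, not from this diff:

# Same fallback idea as the FusedSDPA import above: if the Habana kernels
# are unavailable (e.g. running on CPU), fall back to the eager code path.
try:
    from habana_frameworks.torch.hpex.normalization import FusedRMSNorm
except ImportError:
    print("Not using HPU fused kernel for RMSNorm.")
    FusedRMSNorm = None

try:
    from habana_frameworks.torch.hpex.kernels import RotaryPosEmbeddingHelperV2 as FusedRoPE
except ImportError:
    print("Not using HPU fused kernel for rotary position embeddings.")
    FusedRoPE = None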


def forward(self, hidden_states):
    if hidden_states.device.type == "hpu" and FusedRMSNorm:
        # mixed dtypes are not good for FusedRMSNorm, both inputs need to have same dtype
Collaborator:

[explanatory] use fused ops
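To make the dtype note concrete, here is a sketch of how such a forward typically continues in OH modeling code; the FusedRMSNorm.apply(x, weight, eps) signature and the attribute names are assumptions borrowed from the Llama port, not copied from this diff:

import torch  # for the eager fallback below

def forward(self, hidden_states):
    if hidden_states.device.type == "hpu" and FusedRMSNorm is not None:
        # mixed dtypes are not good for FusedRMSNorm, both inputs need the same dtype
        if hidden_states.dtype != self.weight.dtype:
            orig_dtype = hidden_states.dtype
            out = FusedRMSNorm.apply(hidden_states.to(self.weight.dtype), self.weight, self.variance_epsilon)
            return out.to(orig_dtype)
        return FusedRMSNorm.apply(hidden_states, self.weight, self.variance_epsilon)
    # eager fallback: plain RMSNorm computed in float32 for stability
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
    return self.weight * hidden_states.to(input_dtype)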

self.register_buffer("inv_freq", inv_freq, persistent=False)

# Build here to make `torch.jit.trace` work.
self.max_seq_len_cached = max_position_embeddings
Collaborator:

[explanatory] make it static (max_position_embeddings) instead of updating it based on the longest seq_len seen so far ("seq_len > self.max_seq_len_cached")
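A self-contained sketch of the static-cache idea (class and parameter names are hypothetical): build the cos/sin tables once for max_position_embeddings at construction time so graph shapes stay fixed on HPU, instead of re-building whenever a longer seq_len shows up:

import torch

class StaticRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        # Build here to make `torch.jit.trace` work and to keep shapes static.
        self.max_seq_len_cached = max_position_embeddings
        t = torch.arange(self.max_seq_len_cached, dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos(), persistent=False)
        self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, x, seq_len):
        # No dynamic re-allocation at runtime: just slice the pre-built cache.
        return self.cos_cached[:seq_len].to(x.dtype), self.sin_cached[:seq_len].to(x.dtype)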


def apply_customized_rope(q, k, cos, sin, position_ids):
    if q.device.type == "hpu" and FusedRoPE:
        return FusedRoPE.apply(
Collaborator:

[explanatory] fused hpu op
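For completeness, a sketch of the full dispatch; the FusedRoPE.apply argument layout and the unsqueeze calls are assumptions modeled on other OH files, and apply_rotary_pos_emb stands for the eager Transformers helper imported from the model's modeling code:

def apply_customized_rope(q, k, cos, sin, position_ids):
    if q.device.type == "hpu" and FusedRoPE:
        # fused HPU kernel, applied to q and k separately
        return (
            FusedRoPE.apply(q, cos.unsqueeze(0).unsqueeze(0), sin.unsqueeze(0).unsqueeze(0), position_ids),
            FusedRoPE.apply(k, cos.unsqueeze(0).unsqueeze(0), sin.unsqueeze(0).unsqueeze(0), position_ids),
        )
    # eager fallback for non-HPU devices
    return apply_rotary_pos_emb(q, k, cos, sin, position_ids)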

Collaborator:

[clarification][minor] Could we call apply_customized_rope here?

def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    return tensor.view(bsz, seq_len, self.num_heads, self.v_head_dim).transpose(1, 2).contiguous()

def split_kv_b_proj(self):
Collaborator:

[clarification] this is present only in deepseek attention (v2/v3). Can we add some comment about this?

self.q_absorb = kv_b_proj_weight[:, : self.qk_nope_head_dim, :].unsqueeze(0).transpose(0, 1)
self.out_absorb = kv_b_proj_weight[:, self.qk_nope_head_dim :, :].unsqueeze(0)

def compress_kv(
Collaborator:

[clarification] this is present only in deepseek attention (v2/v3). Can we add a comment about this? In the original deepseek code this is not a function; is there a particular reason for making it one? Just want to clarify whether this is a stylistic choice or there is some other reason.
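To make the absorb trick behind q_absorb / out_absorb concrete, here is a small standalone check (all names and sizes are hypothetical; this is an illustration of the algebra, not code from the PR): folding the kv_b_proj up-projection into the query gives the same attention scores as decompressing the latent KV cache first, which is why the compressed cache can be attended to directly.

import torch

b, h, s, r = 1, 2, 3, 8      # batch, heads, seq, kv_lora_rank
d_nope = 4                   # qk_nope_head_dim

q_nope = torch.randn(b, h, s, d_nope)
compressed_kv = torch.randn(b, s, r)     # latent (compressed) KV cache
w_uk = torch.randn(h, d_nope, r)         # per-head key up-projection from kv_b_proj

# 1) decompress then attend: materialize the full keys
k_nope = torch.einsum("bsr,hdr->bhsd", compressed_kv, w_uk)
scores_ref = torch.einsum("bhqd,bhkd->bhqk", q_nope, k_nope)

# 2) absorb: fold w_uk into the query and attend in the latent space
q_absorbed = torch.einsum("bhqd,hdr->bhqr", q_nope, w_uk)
scores_abs = torch.einsum("bhqr,bkr->bhqk", q_absorbed, compressed_kv)

print(torch.allclose(scores_ref, scores_abs, atol=1e-5))  # True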

key_states, value_states, self.layer_idx, cache_kwargs
)
# optimization
if use_flash_attention and FusedSDPA is not None:
Collaborator:

[explanatory] hpu specific, similar to other modelling files in OH
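A hedged sketch of the dispatch this comment describes; the FusedSDPA.apply argument order here is an assumption modeled on other OH modeling files, and the fallback is the stock PyTorch SDPA:

import torch.nn.functional as F

try:
    from habana_frameworks.torch.hpex.kernels import FusedSDPA
except ImportError:
    FusedSDPA = None

def attention_core(q, k, v, attn_mask, use_flash_attention):
    if use_flash_attention and FusedSDPA is not None:
        # fused HPU kernel; dropout_p=0.0 and is_causal=False because causality
        # is assumed to be encoded in attn_mask
        return FusedSDPA.apply(q, k, v, attn_mask, 0.0, False)
    # eager / non-HPU fallback
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0)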


past_key_values_length = 0
if past_key_values is not None:
    past_key_values_length = past_key_values[0][0].shape[2]
Collaborator:

[explanatory] hpu kv cache management, similar to other OH models
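For readers unfamiliar with the legacy tuple cache, a sketch of what this length is used for (helper and variable names are hypothetical): past_key_values[layer][0] is the cached key tensor of shape (batch, num_heads, past_seq_len, head_dim), so dim 2 is the number of tokens already processed, and new positions continue right after it.

import torch

def past_length_and_positions(past_key_values, seq_length, device):
    past_key_values_length = 0
    if past_key_values is not None:
        # dim 2 of the cached keys = number of tokens already in the cache
        past_key_values_length = past_key_values[0][0].shape[2]
    # positions of the new tokens start after the cached ones
    position_ids = torch.arange(
        past_key_values_length, past_key_values_length + seq_length, dtype=torch.long, device=device
    ).unsqueeze(0)
    return past_key_values_length, position_ids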

    and not self.training
    and (torch.distributed.is_initialized() is False or torch.distributed.get_world_size() == 1)
):
    htcore.mark_step()
Collaborator:

[clarification] are these mark_step breaks at layer boundaries meant to fit the model in memory, or do they bring some perf benefit?

Contributor:

The mark_step is a memory optimization copied from llama #875
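A minimal sketch of that optimization (loop and helper names are hypothetical): calling htcore.mark_step() at layer boundaries cuts the accumulated lazy-mode graph, which lowers peak device memory at the cost of a few extra launches.

import torch
import habana_frameworks.torch.core as htcore  # lazy-mode graph control

def run_decoder_layers(layers, hidden_states, training=False):
    single_device = (
        not torch.distributed.is_initialized() or torch.distributed.get_world_size() == 1
    )
    for layer in layers:
        hidden_states = layer(hidden_states)
        if not training and single_device:
            # break the lazy graph here, as in the Llama port (#875)
            htcore.mark_step()
    return hidden_states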
