Expert Creation Explanation #10

Closed · DebapriyaHazra opened this issue Aug 22, 2023 · 6 comments

DebapriyaHazra commented Aug 22, 2023

Hello,

When I execute:

python moefication/param_cluster_example.py --model_path bert-sst2-bsz32/epoch_1.bin --res_path results/bert-sst2 --num-layer 24 --num-expert 128 --templates bert.encoder.layer.{}.intermediate.dense.weight

My output displays 24 counters with values like this:
Counter({4: 32, 20: 32, 27: 32, 42: 32, 116: 32, 67: 32, 85: 32, 48: 32, 101: 32, 13: 32, 79: 32, 118: 32, 63: 32, 127: 32, 80: 32, 90: 32, 82: 32, 34: 32, 113: 32, 21: 32, 64: 32, 59: 32, 105: 32, 15: 32, 102: 32, 121: 32, 25: 32, 23: 32, 95: 32, 17: 32, 19: 32, 103: 32, 26: 32, 99: 32, 72: 32, 55: 32, 97: 32, 7: 32, 107: 32, 122: 32, 96: 32, 125: 32, 62: 32, 11: 32, 18: 32, 65: 32, 52: 32, 98: 32, 9: 32, 38: 32, 76: 32, 124: 32, 91: 32, 84: 32, 126: 32, 8: 32, 60: 32, 0: 32, 2: 32, 104: 32, 74: 32, 24: 32, 70: 32, 44: 32, 10: 32, 30: 32, 106: 32, 35: 32, 58: 32, 47: 32, 39: 32, 29: 32, 36: 32, 111: 32, 68: 32, 61: 32, 56: 32, 46: 32, 114: 32, 1: 32, 78: 32, 32: 32, 53: 32, 83: 32, 109: 32, 37: 32, 117: 32, 89: 32, 49: 32, 28: 32, 112: 32, 77: 32, 40: 32, 123: 32, 3: 32, 43: 32, 93: 32, 92: 32, 120: 32, 69: 32, 31: 32, 57: 32, 41: 32, 16: 32, 110: 32, 119: 32, 66: 32, 50: 32, 87: 32, 86: 32, 54: 32, 115: 32, 108: 32, 73: 32, 5: 32, 33: 32, 88: 32, 22: 32, 94: 32, 71: 32, 14: 32, 12: 32, 75: 32, 51: 32, 45: 32, 6: 32, 100: 32, 81: 32})
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [08:47<00:00, 21.97s/it]

What does this output represent? My guess is that 24 is the number of layers, the 128 entries in the counter (e.g., 4: 32, 20: 32, ...) represent the 128 experts, and 32 is the number of neurons. Correct me if I am wrong, and please let me know:

  1. What do the combinations in the counter (4: 32, 20: 32, ...) represent?
  2. How can we display the sub-matrices created in each layer and their dimensions? Is there a way?
  3. Can we see the activated neurons, or the sub-matrices that contain only zeros? For a weight matrix of large dimension, will there be only two sub-matrices (one for zeros and one for non-zeros), or will there be multiple sub-matrices?
  4. Can we apply this just to the second linear layer of the FFN instead of the first, or should it be applied to both linear layers of the FFN simultaneously?

Thank you

zzy14 commented Aug 22, 2023

  1. Yes, it represents the number of neurons in the corresponding expert.
  2. It is not implemented in the current code. You can add print(ffn.wi.weight.data[torch.LongTensor(labels), :].view(cluster_num, -1, ffn.wi.weight.data.shape[-1])) in this function to get the created matrix with the shape (expert_num, neuron_num, dim_model). A similar operation can be applied to ffn.wo; see the sketch after this list.
  3. The activated neurons and experts differ from input to input. You can print this variable to see the indices of the selected experts.
  4. It should be applied to both linear layers of the FFN simultaneously.
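
For concreteness, here is a minimal sketch of the reshaping in point 2 (my own illustration, not the repo's exact code), assuming BERT-large sizes (hidden size 1024, FFN intermediate size 4096) and a balanced clustering of the FFN rows into 128 experts of 32 neurons each:

import torch

expert_num = 128                      # --num-expert
dim_model = 1024                      # BERT-large hidden size (assumed)
dim_ff = 4096                         # BERT-large FFN intermediate size (assumed)

wi = torch.randn(dim_ff, dim_model)   # stands in for ffn.wi.weight.data

# labels is a flat list of row indices grouped expert by expert: the first
# 32 entries are the rows of expert 0, the next 32 those of expert 1, etc.
# A random permutation stands in for the k-means assignment here.
labels = torch.randperm(dim_ff).tolist()

# Gather the rows in expert order, then view them as one sub-matrix per expert.
experts = wi[torch.LongTensor(labels), :].view(expert_num, -1, wi.shape[-1])
print(experts.shape)  # torch.Size([128, 32, 1024])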

DebapriyaHazra commented Aug 22, 2023

Thank you for your reply.

  1. For points 2 and 3, where and what should we change to implement this in the BERT model?
  2. For point 1, what are 4 and 20 in (4: 32, 20: 32, ...)?
  3. Will there be only two sub-matrices per layer (one for zeros and one for activated neurons), or multiple sub-matrices per layer for active and inactive neurons?

zzy14 commented Aug 22, 2023

  1. In the BERT model, you can add print(ffn.dense.weight.data[torch.LongTensor(labels), :].view(cluster_num, -1, ffn.dense.weight.data.shape[-1])) in this function. This covers the first linear layer; the second linear layer is in layer.output instead of layer.intermediate (Line 66), and the other operations are similar. This refers to the indices of the selected experts.
  2. 4 and 20 are expert indices: 4: 32 means that there are 32 neurons in expert 4 (see the sketch after this list).
  3. There are multiple sub-matrices per layer, and each sub-matrix represents one expert. For a given input, some of the experts are activated, and which ones are activated is dynamic.
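
As a toy illustration of point 2 (mine, under the same balanced-clustering assumption), the printed Counter just counts how many neurons the clustering assigned to each expert; balanced k-means puts exactly 4096 / 128 = 32 neurons in every expert, which is why every value in the output is 32:

from collections import Counter

expert_num, dim_ff = 128, 4096

# label_per_neuron[i] is the expert index assigned to FFN neuron i
# (a made-up round-robin assignment, purely for illustration).
label_per_neuron = [i % expert_num for i in range(dim_ff)]

print(Counter(label_per_neuron))
# Counter({0: 32, 1: 32, ..., 127: 32}) -- an entry like 4: 32 means
# expert 4 holds 32 neurons.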

@DebapriyaHazra

Thank you so much! Can this approach be applied to the Llama-2 model as well? Can you suggest the changes needed to implement it?

zzy14 commented Aug 22, 2023

This approach cannot be directly applied to Llama-2 because its activation function is SiLU instead of ReLU, and with SiLU the activation sparsity is not high enough. We are working on converting Llama-2 into a ReLU version; if we make any new progress, we will update it here.
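
As a toy illustration of the sparsity argument (my own sketch, not from the repo): ReLU maps every negative pre-activation to exactly zero, so roughly half of the FFN neurons are exactly inactive for any given input, whereas SiLU is nonzero almost everywhere, leaving no exact sparsity for MoEfication to exploit:

import torch
import torch.nn.functional as F

x = torch.randn(10_000)  # simulated FFN pre-activations
print(f"ReLU exact zeros: {(F.relu(x) == 0).float().mean().item():.1%}")  # ~50%
print(f"SiLU exact zeros: {(F.silu(x) == 0).float().mean().item():.1%}")  # ~0%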

@DebapriyaHazra

Okay, thank you so much!

@zzy14 zzy14 closed this as completed Aug 22, 2023