Expert Creation Explanation #10

Closed · DebapriyaHazra opened this issue Aug 22, 2023 · 6 comments

DebapriyaHazra commented Aug 22, 2023

Hello,

When I execute:

python moefication/param_cluster_example.py --model_path bert-sst2-bsz32/epoch_1.bin --res_path results/bert-sst2 --num-layer 24 --num-expert 128 --templates bert.encoder.layer.{}.intermediate.dense.weight

My output displays 24 counters with values like this:
Counter({4: 32, 20: 32, 27: 32, 42: 32, 116: 32, 67: 32, 85: 32, 48: 32, 101: 32, 13: 32, 79: 32, 118: 32, 63: 32, 127: 32, 80: 32, 90: 32, 82: 32, 34: 32, 113: 32, 21: 32, 64: 32, 59: 32, 105: 32, 15: 32, 102: 32, 121: 32, 25: 32, 23: 32, 95: 32, 17: 32, 19: 32, 103: 32, 26: 32, 99: 32, 72: 32, 55: 32, 97: 32, 7: 32, 107: 32, 122: 32, 96: 32, 125: 32, 62: 32, 11: 32, 18: 32, 65: 32, 52: 32, 98: 32, 9: 32, 38: 32, 76: 32, 124: 32, 91: 32, 84: 32, 126: 32, 8: 32, 60: 32, 0: 32, 2: 32, 104: 32, 74: 32, 24: 32, 70: 32, 44: 32, 10: 32, 30: 32, 106: 32, 35: 32, 58: 32, 47: 32, 39: 32, 29: 32, 36: 32, 111: 32, 68: 32, 61: 32, 56: 32, 46: 32, 114: 32, 1: 32, 78: 32, 32: 32, 53: 32, 83: 32, 109: 32, 37: 32, 117: 32, 89: 32, 49: 32, 28: 32, 112: 32, 77: 32, 40: 32, 123: 32, 3: 32, 43: 32, 93: 32, 92: 32, 120: 32, 69: 32, 31: 32, 57: 32, 41: 32, 16: 32, 110: 32, 119: 32, 66: 32, 50: 32, 87: 32, 86: 32, 54: 32, 115: 32, 108: 32, 73: 32, 5: 32, 33: 32, 88: 32, 22: 32, 94: 32, 71: 32, 14: 32, 12: 32, 75: 32, 51: 32, 45: 32, 6: 32, 100: 32, 81: 32})
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [08:47<00:00, 21.97s/it]

What does this output represent? My guess is that 24 is the number of layers, the 128 entries in the counter (e.g., 4: 32, 20: 32, ...) represent the 128 experts, and 32 is the number of neurons. Correct me if I am wrong, and please let me know:

  1. What do the combinations in the counter (4: 32, 20: 32, ...) represent?
  2. How can we display the sub-matrices created in each layer and their dimensions? Is there a way?
  3. Can we see the activated neurons, or the sub-matrices that contain only zeros? For a weight matrix of large dimension, will there be only two sub-matrices (one for zeros and one for non-zeros), or will there be multiple sub-matrices?
  4. Can we apply this just to the second linear layer of the FFN instead of the first, or should it be applied to both linear layers of the FFN simultaneously?

Thank you

zzy14 commented Aug 22, 2023

  1. Yes, it represents the number of neurons in the corresponding expert.
  2. It is not implemented in the current code. You can add print(ffn.wi.weight.data[torch.LongTensor(labels), :].view(cluster_num, -1, ffn.wi.weight.data.shape[-1])) in this function to get the created matrix with the shape (expert_num, neuron_num, dim_model). A similar operation can be applied to ffn.wo; see the sketch after this list.
  3. The activated neurons and experts differ from input to input. You can print this variable to see the indices of the selected experts.
  4. It should be applied to both linear layers of the FFN simultaneously.
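
For concreteness, here is a minimal sketch of the reshaping in point 2 (my own illustration, not the repo's exact code), assuming BERT-large sizes (hidden size 1024, FFN intermediate size 4096) and a balanced clustering of the FFN rows into 128 experts of 32 neurons each:

import torch

expert_num = 128                      # --num-expert
dim_model = 1024                      # BERT-large hidden size (assumed)
dim_ff = 4096                         # BERT-large FFN intermediate size (assumed)

wi = torch.randn(dim_ff, dim_model)   # stands in for ffn.wi.weight.data

# labels is a flat list of row indices grouped expert by expert: the first
# 32 entries are the rows of expert 0, the next 32 those of expert 1, etc.
# A random permutation stands in for the k-means assignment here.
labels = torch.randperm(dim_ff).tolist()

# Gather the rows in expert order, then view them as one sub-matrix per expert.
experts = wi[torch.LongTensor(labels), :].view(expert_num, -1, wi.shape[-1])
print(experts.shape)  # torch.Size([128, 32, 1024])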

DebapriyaHazra commented Aug 22, 2023

Thank you for your reply.

  1. For points 2 and 3, where and what should we change to implement this in the BERT model?
  2. For point 1, what are 4 and 20 in (4: 32, 20: 32, ...)?
  3. Will there be only two sub-matrices per layer (one for zeros and one for activated neurons), or multiple sub-matrices per layer for active and inactive neurons?

zzy14 commented Aug 22, 2023

  1. In the BERT model, you can add print(ffn.dense.weight.data[torch.LongTensor(labels), :].view(cluster_num, -1, ffn.dense.weight.data.shape[-1])) in this function. This covers the first linear layer; the second linear layer is in layer.output instead of layer.intermediate (Line 66), and the other operations are similar. This refers to the indices of the selected experts.
  2. 4 and 20 are expert indices: 4: 32 means that there are 32 neurons in expert 4 (see the sketch after this list).
  3. There are multiple sub-matrices per layer, and each sub-matrix represents one expert. For a given input, some of the experts are activated, and which ones are activated is dynamic.
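
As a toy illustration of point 2 (mine, under the same balanced-clustering assumption), the printed Counter just counts how many neurons the clustering assigned to each expert; balanced k-means puts exactly 4096 / 128 = 32 neurons in every expert, which is why every value in the output is 32:

from collections import Counter

expert_num, dim_ff = 128, 4096

# label_per_neuron[i] is the expert index assigned to FFN neuron i
# (a made-up round-robin assignment, purely for illustration).
label_per_neuron = [i % expert_num for i in range(dim_ff)]

print(Counter(label_per_neuron))
# Counter({0: 32, 1: 32, ..., 127: 32}) -- an entry like 4: 32 means
# expert 4 holds 32 neurons.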

@DebapriyaHazra

Thank you so much! Can this approach be applied to the Llama-2 model as well? Can you suggest the changes needed to implement it?

zzy14 commented Aug 22, 2023

This approach cannot be directly applied to Llama-2 because its activation function is SiLU instead of ReLU, and with SiLU the activation sparsity is not high enough. We are working on converting Llama-2 into a ReLU version; if we make any new progress, we will update it here.
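
As a toy illustration of the sparsity argument (my own sketch, not from the repo): ReLU maps every negative pre-activation to exactly zero, so roughly half of the FFN neurons are exactly inactive for any given input, whereas SiLU is nonzero almost everywhere, leaving no exact sparsity for MoEfication to exploit:

import torch
import torch.nn.functional as F

x = torch.randn(10_000)  # simulated FFN pre-activations
print(f"ReLU exact zeros: {(F.relu(x) == 0).float().mean().item():.1%}")  # ~50%
print(f"SiLU exact zeros: {(F.silu(x) == 0).float().mean().item():.1%}")  # ~0%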

@DebapriyaHazra

Okay, thank you so much!

@zzy14 zzy14 closed this as completed Aug 22, 2023