Add RemovePadding and RestorePadding for BERT model (microsoft#13701)
Add two operators, RemovePadding and RestorePadding, based on the idea of
effective transformer (https://github.com/bytedance/effective_transformer), to improve
large-batch-size inference for the BERT model.
tianleiwu authored Nov 22, 2022
1 parent ba9a585 commit 8b0e0f4
Showing 16 changed files with 1,320 additions and 61 deletions.
85 changes: 85 additions & 0 deletions docs/ContribOperators.md
@@ -69,6 +69,8 @@ Do not modify directly.*
* <a href="#com.microsoft.QuickGelu">com.microsoft.QuickGelu</a>
* <a href="#com.microsoft.Range">com.microsoft.Range</a>
* <a href="#com.microsoft.ReduceSumInteger">com.microsoft.ReduceSumInteger</a>
* <a href="#com.microsoft.RemovePadding">com.microsoft.RemovePadding</a>
* <a href="#com.microsoft.RestorePadding">com.microsoft.RestorePadding</a>
* <a href="#com.microsoft.Rfft">com.microsoft.Rfft</a>
* <a href="#com.microsoft.SampleOp">com.microsoft.SampleOp</a>
* <a href="#com.microsoft.SkipLayerNormalization">com.microsoft.SkipLayerNormalization</a>
@@ -3595,6 +3597,89 @@ This version of the operator has been available since version 1 of the 'com.micr
</dl>


### <a name="com.microsoft.RemovePadding"></a><a name="com.microsoft.removepadding">**com.microsoft.RemovePadding**</a>

Compress transformer input by removing paddings. It assumes padding is on the right side of sequence.

The input has padding with shape (batch_size, sequence_length, hidden_size). This will generate two outputs:
output has shape (total_tokens, hidden_size); token_offset with shape (batch_size, sequence_length).

token_offset has offsets of all non-padding tokens first, then offset of all padding tokens. It is
a list of batch_size * sequence_length elements, which is reshaped to 2D for convenience of shape inference.
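
The behaviour described above can be sketched in plain Python. This is an illustrative reference only, not the CUDA kernel added by this commit. It assumes each offset is a flat index into the batch_size * sequence_length token grid, which is an interpretation of the description; the function name `remove_padding` and the sample data are hypothetical.

```python
def remove_padding(inp, sequence_token_count):
    """Reference sketch of RemovePadding (output and token_offset only).

    inp: nested list with shape (batch_size, sequence_length, hidden_size),
         padded on the right.
    sequence_token_count: list of batch_size ints (non-padding tokens per row).
    """
    batch_size = len(inp)
    sequence_length = len(inp[0])

    valid, padded = [], []
    for b in range(batch_size):
        for s in range(sequence_length):
            # Assumption: an offset is the flat position b * sequence_length + s.
            flat = b * sequence_length + s
            if s < sequence_token_count[b]:
                valid.append(flat)
            else:
                padded.append(flat)

    # Gather the non-padding rows: shape (total_tokens, hidden_size).
    output = [inp[i // sequence_length][i % sequence_length] for i in valid]

    # Non-padding offsets first, then padding offsets, reshaped to 2D.
    flat_offsets = valid + padded
    token_offset = [
        flat_offsets[b * sequence_length:(b + 1) * sequence_length]
        for b in range(batch_size)
    ]
    return output, token_offset


# Hypothetical input: batch_size=2, sequence_length=4, hidden_size=2.
inp = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 0.0]],
    [[4.0, 4.0], [5.0, 5.0], [0.0, 0.0], [0.0, 0.0]],
]
output, token_offset = remove_padding(inp, [3, 2])
print(output)        # 5 rows: the non-padding tokens in order
print(token_offset)  # [[0, 1, 2, 4], [5, 3, 6, 7]]
```

With token counts [3, 2], total_tokens is 5, so the first five entries of token_offset locate the kept rows and the remaining three locate the padding positions.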

#### Version

This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

#### Inputs

<dl>
<dt><tt>input</tt> : T</dt>
<dd>Input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dt><tt>sequence_token_count</tt> : M</dt>
<dd>Number of non-padding tokens in each sequence with shape (batch_size).</dd>
</dl>

#### Outputs

<dl>
<dt><tt>output</tt> : T</dt>
<dd>output tensor with shape (total_tokens, hidden_size)</dd>
<dt><tt>token_offset</tt> : M</dt>
<dd>Offset of non-padding tokens, and those of padding tokens. Its shape is (batch_size, sequence_length)</dd>
<dt><tt>cumulated_seq_len</tt> : M</dt>
<dd>Cumulated sequence lengths. Its shape is (batch_size + 1)</dd>
<dt><tt>max_seq_len</tt> : M</dt>
<dd>Max sequence length without padding. Its shape is (1)</dd>
</dl>
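
As a rough illustration of the last two outputs, the sketch below computes them from `sequence_token_count` in plain Python. It assumes `cumulated_seq_len` is a prefix sum with a leading zero (inferred only from its documented shape of batch_size + 1) and that `max_seq_len` is the largest per-sequence token count; the helper name `sequence_length_outputs` is hypothetical.

```python
from itertools import accumulate


def sequence_length_outputs(sequence_token_count):
    """Sketch of cumulated_seq_len and max_seq_len under the stated assumptions."""
    # Assumption: prefix sums starting at 0, giving batch_size + 1 entries.
    cumulated_seq_len = [0] + list(accumulate(sequence_token_count))
    # Assumption: longest sequence after padding is removed, as a 1-element list.
    max_seq_len = [max(sequence_token_count)]
    return cumulated_seq_len, max_seq_len


cumulated_seq_len, max_seq_len = sequence_length_outputs([3, 2])
print(cumulated_seq_len)  # [0, 3, 5]
print(max_seq_len)        # [3]
```

Under this reading, sequence `b` occupies rows `cumulated_seq_len[b]` up to (but not including) `cumulated_seq_len[b + 1]` of the compressed output.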

#### Type Constraints

<dl>
<dt><tt>T</tt> : tensor(float), tensor(float16)</dt>
<dd>Constrain input and output types to float tensors.</dd>
<dt><tt>M</tt> : tensor(int32)</dt>
<dd>Constrain sequence_token_count and token_offset to integer types</dd>
</dl>


### <a name="com.microsoft.RestorePadding"></a><a name="com.microsoft.restorepadding">**com.microsoft.RestorePadding**</a>

Restore paddings and fill padding with zeros.

The input has no padding and has shape (total_tokens, hidden_size); token_offset has shape (batch_size, sequence_length).
The output has shape (batch_size, sequence_length, hidden_size).
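
The inverse operation can be sketched in plain Python as a scatter into a zero-filled buffer. This is an illustrative reference, not the CUDA kernel from this commit. It assumes the first total_tokens entries of token_offset are flat indices of the non-padding positions; the function name `restore_padding` and the sample data are hypothetical.

```python
def restore_padding(inp, token_offset):
    """Reference sketch of RestorePadding.

    inp: nested list with shape (total_tokens, hidden_size).
    token_offset: nested list with shape (batch_size, sequence_length).
    """
    total_tokens = len(inp)
    hidden_size = len(inp[0])
    batch_size = len(token_offset)
    sequence_length = len(token_offset[0])

    flat_offsets = [o for row in token_offset for o in row]

    # Start from zeros so that padding positions stay zero-filled.
    flat_out = [[0.0] * hidden_size for _ in range(batch_size * sequence_length)]

    # Assumption: only the first total_tokens offsets refer to real tokens.
    for i in range(total_tokens):
        flat_out[flat_offsets[i]] = list(inp[i])

    # Reshape to (batch_size, sequence_length, hidden_size).
    return [
        flat_out[b * sequence_length:(b + 1) * sequence_length]
        for b in range(batch_size)
    ]


# Hypothetical input: 5 tokens from 2 sequences of padded length 4.
compressed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
offsets = [[0, 1, 2, 4], [5, 3, 6, 7]]
restored = restore_padding(compressed, offsets)
print(restored)
```

Here rows 3, 6 and 7 of the flattened output are never written, so they remain zero, matching the "fill padding with zeros" behaviour.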

#### Version

This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

#### Inputs

<dl>
<dt><tt>input</tt> : T</dt>
<dd>Input tensor with shape (total_tokens, hidden_size)</dd>
<dt><tt>token_offset</tt> : M</dt>
<dd>Offset of non-padding tokens and paddings. Its shape is (batch_size, sequence_length)</dd>
</dl>

#### Outputs

<dl>
<dt><tt>output</tt> : T</dt>
<dd>output tensor with shape (batch_size, sequence_length, hidden_size)</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T</tt> : tensor(float), tensor(float16)</dt>
<dd>Constrain input and output types to float tensors.</dd>
<dt><tt>M</tt> : tensor(int32)</dt>
<dd>Constrain token_offset to integer types</dd>
</dl>


### <a name="com.microsoft.Rfft"></a><a name="com.microsoft.rfft">**com.microsoft.Rfft**</a>

#### Version
2 changes: 2 additions & 0 deletions docs/OperatorKernels.md
@@ -793,6 +793,8 @@ Do not modify directly.*
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float16)<br/> **T2** = tensor(int8), tensor(uint8)|
|QuantizeWithOrder|*in* input:**F**<br> *in* scale_input:**S**<br> *out* output:**Q**|1+|**F** = tensor(float), tensor(float16)<br/> **Q** = tensor(int8)<br/> **S** = tensor(float)|
|QuickGelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|RemovePadding|*in* input:**T**<br> *in* sequence_token_count:**M**<br> *out* output:**T**<br> *out* token_offset:**M**<br> *out* cumulated_seq_len:**M**<br> *out* max_seq_len:**M**|1+|**T** = tensor(float), tensor(float16)|
|RestorePadding|*in* input:**T**<br> *in* token_offset:**M**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|Rfft|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|SkipLayerNormalization|*in* input:**T**<br> *in* skip:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* mean:**U**<br> *out* inv_std_var:**U**|1+|**T** = tensor(float), tensor(float16)|
|TransposeMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
1 change: 1 addition & 0 deletions onnxruntime/contrib_ops/cuda/bert/attention_impl.cu
@@ -36,6 +36,7 @@ limitations under the License.
#include "contrib_ops/cuda/bert/add_bias_transpose.h"
#include "contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.h"
#include "contrib_ops/cpu/bert/attention_base.h"
#include "contrib_ops/cuda/bert/bert_padding.h"

using namespace onnxruntime::cuda;
using namespace cub;
6 changes: 0 additions & 6 deletions onnxruntime/contrib_ops/cuda/bert/attention_impl.h
@@ -145,12 +145,6 @@ Status LaunchConcatPastToPresent(cudaStream_t stream,
const half* past,
const half* k_v,
half* present);

void LaunchTrtSequenceOffset(int* trt_mha_padding_offset,
const int* mask_index,
const int batch_size,
cudaStream_t stream);

} // namespace cuda
} // namespace contrib
} // namespace onnxruntime
55 changes: 0 additions & 55 deletions onnxruntime/contrib_ops/cuda/bert/attention_padding.cu

This file was deleted.
