+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 40 40
+
+   * - Name
+     - Description
+     - How the sparse matrix is stored
+   * - COO (sparse_coo)
+     - COOrdinate format to store sparse matrices. The matrices are stored as a combination of the non-sparse data vector and the index locations of those elements in the dense matrix.
+     - sparse_matrix = {Index: tensor of coordinate locations, Data: tensor of values corresponding to index locations}
+   * - BSR (sparse_bsr)
+     - Block sparse row format to store sparse matrices. The matrices are stored as data blocks and the index locations of those blocks in the dense matrix. Very similar to COO, except that individual data consists of blocks, not scalars.
+     - sparse_matrix = {Index: tensor of coordinate locations (two dimensional for a matrix), Data: tensor of blocks corresponding to index locations}, where a block is a matrix corresponding to the sparsity pattern.
+   * - CSR (sparse_csr) / CSC (sparse_csc)
+     - Compressed sparse row / column format to store sparse matrices. The sparse matrices are stored as data blocks on columns / rows and the indices of those rows / columns in a dense matrix. This is the most compact format for storing block sparse matrices.
+     - sparse_matrix = {Index: 1D tensor of column indices, IndexPtr: 1D tensor specifying the start and end indices of columns for the rows, starting from row 0, Data: tensor of blocks corresponding to Index locations}
+   * - NVIDIA 2:4 compressed representation
+     - Custom NVIDIA compressed storage format for 2:4 semi-structured sparsity. We store the sparse matrix as a compressed dense matrix (½ the size) containing the non-pruned elements and a bitmask index. When multiplying our sparse matrix by another dense matrix, we use the mask to index into the dense matrix and multiply with our compressed dense matrix.
+     - sparse_matrix = {Bitmask: 2-bit indices of pruned elements, Compressed dense matrix: contains all unpruned elements, half the size of the original dense matrix}
+
+*Table 4.1: Overview of common sparse tensor layouts.*
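+
+As a minimal sketch, the layouts above can be materialized in PyTorch with the standard ``to_sparse_*`` conversion methods on a dense tensor (the example matrix below is arbitrary):
+
+.. code-block:: python
+
+    import torch
+
+    dense = torch.tensor([[0., 1., 0., 0.],
+                          [0., 0., 0., 2.],
+                          [3., 0., 0., 0.],
+                          [0., 0., 4., 0.]])
+
+    # COO: coordinate indices + values
+    coo = dense.to_sparse_coo()
+    print(coo.indices(), coo.values())
+
+    # CSR: compressed row pointers + column indices + values
+    csr = dense.to_sparse_csr()
+    print(csr.crow_indices(), csr.col_indices(), csr.values())
+
+    # BSR: same idea as COO/CSR, but values are stored as 2x2 blocks
+    bsr = dense.to_sparse_bsr((2, 2))
+    print(bsr.values().shape)
+
+Recent PyTorch releases also expose ``torch.sparse.to_sparse_semi_structured`` for converting a suitably pruned tensor into the NVIDIA 2:4 compressed representation.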
+
+While the general idea of pruning is quite simple, there are many details that a user must figure out before they can successfully prune a model.
+
+These can be loosely broken down as follows:
+
+
+* **Pruning Configuration** - What layers should I prune? What sparsity level should I prune to?
+* **Pruning Criteria** - How should I decide which parameters to remove?
+* **Pruning Strategy** - Once I have removed parameters, how can I recover any accuracy degradation?
+* **Sparsity Pattern** - Should I try to use a specific sparsity pattern when I prune my model? Different hardware backends support accelerated inference for different sparsity patterns.
+
+Pruning Configuration
+^^^^^^^^^^^^^^^^^^^^^
+
+Not all layers in a neural network are created equal. Some layers are more sensitive to pruning than others. The user must decide which layers to prune and also the **sparsity level** for each layer, i.e. the percentage of zero-valued elements in that weight tensor. The pruning configuration affects both the accuracy and the speedup of the pruned model.
+
+Determining the best pruning configuration and sparsity level for a given model is an open problem and a general solution does not exist. This is in part because the optimal pruning configuration is dependent on the subsequent pruning criteria and strategy, and there are an infinite number of ways to decide how to prune models and how to recover lost accuracy.
+
+One common method to determine which layers to prune, and to what degree, is to perform a sensitivity analysis: prune each layer in the model at different sparsity levels and measure the resulting accuracy drop (without retraining). This gives the user a sparsity-accuracy curve for each layer, which can then be used as a proxy for determining the best pruning configuration.
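+
+A minimal sketch of such a sensitivity analysis is shown below. It assumes a user-supplied ``evaluate(model)`` function returning validation accuracy (hypothetical here), and uses L1 magnitude pruning from ``torch.nn.utils.prune`` purely as an example criterion:
+
+.. code-block:: python
+
+    import copy
+
+    import torch
+    import torch.nn.utils.prune as prune
+
+    def sensitivity_analysis(model, evaluate, sparsity_levels=(0.25, 0.5, 0.75, 0.9)):
+        """Prune one layer at a time (no retraining) and record the resulting accuracy."""
+        results = {}
+        for name, module in model.named_modules():
+            if not isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
+                continue
+            for level in sparsity_levels:
+                pruned_model = copy.deepcopy(model)
+                layer = dict(pruned_model.named_modules())[name]
+                prune.l1_unstructured(layer, name="weight", amount=level)
+                results[(name, level)] = evaluate(pruned_model)
+        return results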
+
+Pruning Criteria
+^^^^^^^^^^^^^^^^
+
+A user must decide on a criterion for removing parameters from a neural network. Much like determining the best pruning configuration, determining the best pruning criterion is an open research question and depends on the other aforementioned factors.
+
+The most common pruning criterion is weight magnitude. The idea is that low-magnitude weights contribute less to the model output than high-magnitude weights, so if we want to remove parameters, we can remove the weights with the smallest absolute values.
+
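+For a single weight tensor, a local magnitude-based mask can be computed with a short sketch like the following (the 50% sparsity level is just an example):
+
+.. code-block:: python
+
+    import torch
+
+    def magnitude_mask(weight: torch.Tensor, sparsity_level: float) -> torch.Tensor:
+        """Return a 0/1 mask that zeroes out the smallest-magnitude weights."""
+        num_prune = int(weight.numel() * sparsity_level)
+        if num_prune == 0:
+            return torch.ones_like(weight)
+        # threshold = magnitude of the num_prune-th smallest element
+        threshold = weight.abs().flatten().kthvalue(num_prune).values
+        return (weight.abs() > threshold).to(weight.dtype)
+
+    w = torch.randn(8, 8)
+    mask = magnitude_mask(w, sparsity_level=0.5)
+    print(mask.float().mean())  # roughly half of the entries are kept
+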
+However, even with a simple pruning criterion such as weight magnitude, there are additional factors that a user has to consider:
+
+
+* Local vs global scope
+
+ * **Local scope** implies that the sparsity mask is only computed with respect to the layer statistics.
+
+    * Pros: Simple mask computation.
+ * Cons: Potentially sub-optimal accuracy vs sparsity tradeoff.
+
+ * **Global scope** means that the sparsity statistics are not bounded by a single layer, but can span over multiple layers if needed.
+
+    * Pros: No need for per-layer thresholds. Tensor statistics are shared across layers, with normalization applied across layers so that their values are comparable.
+ * Cons: Increased complexity when computing the masks.
+
+* Tensors used for mask calculation
+
+ * **Weights**\ : Just use the weight tensor in order to calculate the mask. This method is the simplest for inference as the weight tensors are constant.
+ * **Gradients**\ : Compute importance based on both weights and gradient norms. Common for pre-training based methods. Currently CTR_mobile_feed uses a gradient-based pruning algorithm.
+  * **Activations**\ : In some research papers, the norm of the activations that interact with the weight of interest is used to compute the importance score.
+
+* In place or out of place mask updates
+
+  * **In-place** updates the sparse tensor by performing W = W ⊙ Mask (element-wise multiplication by the mask). Once the weight tensor is updated, the masked values are zeroed out and cannot be recovered.
+
+    * **Pros**\ : Requires only one copy of the sparse tensor to be stored (+ mask)
+    * **Cons**\ : Once a mask is applied to a weight, it is zeroed out and all past history is lost; these weights cannot regrow.
+
+  * **Out-of-place** updates don't modify the tensor directly, but instead compute W' = W ⊙ Mask and dW' = dW ⊙ Mask (see the sketch after this list).
+
+    * **Pros**\ : The original tensor is preserved (the masked elements are not updated via backprop). Weights can regrow if the mask changes. This is necessary for pruning-aware training (PAT).
+    * **Cons**\ : In addition to the unmasked weights (W), the masked weights (W') are computed and resident in memory for forward/backward computations.
+
+
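+As a concrete illustration of out-of-place masking, the sketch below uses ``torch.nn.utils.parametrize`` to multiply the weight by a fixed 0/1 mask on every forward pass. The original dense weight is preserved, so the mask can later be changed and pruned weights can regrow; the random mask is only an example.
+
+.. code-block:: python
+
+    import torch
+    from torch import nn
+    from torch.nn.utils import parametrize
+
+    class ApplyMask(nn.Module):
+        """Out-of-place masking: the forward pass sees W' = W * mask, W is untouched."""
+        def __init__(self, mask: torch.Tensor):
+            super().__init__()
+            self.register_buffer("mask", mask)
+
+        def forward(self, weight: torch.Tensor) -> torch.Tensor:
+            return weight * self.mask
+
+    linear = nn.Linear(8, 4)
+    mask = (torch.rand_like(linear.weight) > 0.5).float()  # example 0/1 mask
+    parametrize.register_parametrization(linear, "weight", ApplyMask(mask))
+
+    # linear.weight is now the masked view; the dense original is kept in
+    # linear.parametrizations.weight.original, and gradients flow to it
+    # through the mask (masked entries receive zero gradient).
+    out = linear(torch.randn(2, 8))
+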
+.. list-table::
+   :header-rows: 1
+   :widths: 20 40 40
+
+   * - Name
+     - Description
+     - Notes
+   * - Magnitude / Saliency
+     - Remove parameters that have the lowest norm (L1 is commonly used).
+     - Shown to work well with 2:4 semi-structured sparsity. Able to achieve accuracy identical to the original model by repeating the training loop after one-shot magnitude pruning.
+   * - Movement Pruning
+     - These methods aim to use gradient information in order to decide what parameters to remove. The idea is to remove parameters that do not change much during fine-tuning.
+     - Common for pretrained models. See https://arxiv.org/abs/2005.07683
+   * - Low-rank factorization
+     - These methods aim to replace Wx with SQx, where S and Q are matrices with lower rank.
+     - Usually these methods use some sort of layer-wise reconstruction, where instead of training the model to recover lost accuracy, they seek to match layer-wise statistics (find SQx such that L2(SQx, Wx) is minimized).
+   * - Random
+     - Remove parameters randomly.
+     -
+
+*Table 4.2: Description of some common pruning criteria.*
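+
+As a rough sketch in the spirit of movement pruning (not the exact algorithm from the paper), an importance score can be accumulated from the product of each weight and its gradient during fine-tuning, and the lowest-scoring weights pruned afterwards; all names below are illustrative:
+
+.. code-block:: python
+
+    import torch
+
+    def init_scores(model):
+        return {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.dim() > 1}
+
+    def accumulate_scores(model, scores):
+        """Call after each loss.backward(): weights moving away from zero score higher."""
+        with torch.no_grad():
+            for name, param in model.named_parameters():
+                if name in scores and param.grad is not None:
+                    scores[name] -= param * param.grad
+
+    def masks_from_scores(scores, sparsity_level=0.5):
+        """Zero out the fraction of weights with the lowest accumulated scores."""
+        masks = {}
+        for name, score in scores.items():
+            k = max(1, int(score.numel() * sparsity_level))
+            threshold = score.flatten().kthvalue(k).values
+            masks[name] = (score > threshold).to(score.dtype)
+        return masks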
+
+Pruning Strategy
+^^^^^^^^^^^^^^^^
+
+This is a general term that describes the method by which a user tries to recover any accuracy degradation from their pruned model. Pruning a model commonly degrades its accuracy, so users usually retrain the pruned model to remedy this. The pruning strategy also determines when and how often the model is pruned during model training.
+
+The line between a pruning strategy and a pruning criterion is not well defined, especially in the case of pruning-aware training methods, which update the mask during training. We sometimes use the term **pruning algorithm** to refer to the combination of these two items. These two factors, along with the pruning configuration, ultimately control the final accuracy of the pruned model.
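+
+A minimal sketch of a one-shot prune-then-retrain loop is shown below; ``train_one_epoch`` is a placeholder for the user's training loop, and L1 magnitude pruning is used purely as an example criterion:
+
+.. code-block:: python
+
+    import torch
+    import torch.nn.utils.prune as prune
+
+    def one_shot_prune_and_retrain(model, train_one_epoch, sparsity_level=0.5, epochs=3):
+        # 1. Prune once: apply L1 magnitude pruning to every Linear / Conv2d weight.
+        for module in model.modules():
+            if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
+                prune.l1_unstructured(module, name="weight", amount=sparsity_level)
+
+        # 2. Retrain once: the masks stay fixed while the surviving weights are
+        #    fine-tuned to recover the lost accuracy.
+        for _ in range(epochs):
+            train_one_epoch(model)
+
+        # 3. Make the pruning permanent by folding the masks into the weights.
+        for module in model.modules():
+            if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
+                prune.remove(module, "weight")
+        return model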
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 30 50
+
+   * - Pruning Strategy
+     - Description
+     - Notes
+   * - Zero-shot
+     - Prune once, don't retrain the model.
+     - These methods rely on more complicated pruning criteria. This is sometimes referred to as one-shot in the literature, but we will use one-shot to refer to pruning once and retraining once.
+   * - One-shot
+     - Prune once, retrain the model once.
+     - NVIDIA has shown that one-shot 2:4 semi-structured sparsity pruning generalizes well across a range of common vision / NLP models. The retraining strategy is to simply repeat the training process again.
+   * - Iterative
+     - Prune the model, retrain, repeat.
+     - We can iteratively increase the sparsity level, or iteratively prune different layers in the model.
+   * - Pruning Aware Training
+     - Mask is learned during training.
+     - Used by CTR_feed for their current pruning algorithm.
+   * - NAS / Multimask
+     - Multiple masks are used during training. This can be thought of as a form of neural architecture search.
+     - Used by PySpeech (FastNAS).
+   * - Layer-wise reconstruction
+     - Instead of retraining using a loss function, we try to recover as much information as possible from each layer by using a two-model approach similar to knowledge distillation.
+     - See https://arxiv.org/pdf/2204.09656.pdf
+
+*Table 4.3: Description of some common pruning strategies.*
+
+Sparsity Pattern
+^^^^^^^^^^^^^^^^
+
+A sparsity pattern describes how the pruned parameters are arranged within the model / tensor.
+
+Recall that in general it is necessary to use optimized sparse kernels in order to achieve performance gains. Depending on the format and the sparsity level of the weight tensor, sparse matrix multiplication can be faster than its dense counterpart. It can also be slower if a tensor is not sufficiently sparse.
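+
+Whether a sparse kernel actually wins depends heavily on the layout, sparsity level, tensor shapes, and hardware, so it is worth measuring directly. A rough wall-clock sketch (use ``torch.utils.benchmark`` for anything serious) might look like this:
+
+.. code-block:: python
+
+    import time
+
+    import torch
+
+    def benchmark(fn, iters=20):
+        fn()  # warm-up
+        start = time.perf_counter()
+        for _ in range(iters):
+            fn()
+        return (time.perf_counter() - start) / iters
+
+    n, sparsity = 4096, 0.99
+    weight = torch.randn(n, n)
+    weight[torch.rand(n, n) < sparsity] = 0.0  # zero out ~99% of the entries
+    x = torch.randn(n, n)
+
+    dense_time = benchmark(lambda: weight @ x)
+    csr = weight.to_sparse_csr()
+    sparse_time = benchmark(lambda: csr @ x)  # CSR @ dense matmul
+    print(f"dense: {dense_time:.4f}s  sparse CSR: {sparse_time:.4f}s")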
+
+At the most general level, pruning is unstructured: every parameter has its own mask. This gives the most flexibility but generally requires very high sparsity (>98%) to provide performance benefits. To provide accelerated inference at lower sparsity levels, hardware backends have added support for special sparsity patterns.
+
+We seek to prune the model so that its weight tensors exhibit the sparsity pattern supported by our inference backend. If we are able to recover the lost accuracy while maintaining that sparsity pattern, we can run the model on sparse hardware for accelerated inference without an accuracy penalty. We can also run a model pruned to a different sparsity pattern on our target backend, at the expense of some additional accuracy loss.
+
+The specific backend hardware and its corresponding sparsity pattern, together with the pruning configuration, ultimately dictate the performance speedups that we observe. A model pruned with a different pruning criterion will have the same performance characteristics as long as it has the same sparsity pattern and sparsity level. For example, if we decided to remove the highest-magnitude weights instead of the lowest-magnitude weights, we wouldn't expect that to change the performance characteristics of the pruned model.
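+
+For example, a 2:4 semi-structured mask can be derived from weight magnitudes with a sketch like the one below, which keeps the two largest-magnitude entries in every group of four along each row (the helper name is illustrative):
+
+.. code-block:: python
+
+    import torch
+
+    def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
+        """Keep the 2 largest-magnitude elements in every group of 4 along the rows."""
+        assert weight.shape[-1] % 4 == 0
+        groups = weight.abs().reshape(-1, 4)
+        keep = groups.topk(2, dim=1).indices
+        mask = torch.zeros_like(groups)
+        mask.scatter_(1, keep, 1.0)
+        return mask.reshape(weight.shape)
+
+    w = torch.randn(4, 8)
+    mask = two_four_mask(w)
+    print(mask)       # every group of 4 in a row has exactly two 1s
+    print(w * mask)   # the pruned weight, 50% sparse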
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 70
+
+   * - Sparsity Pattern
+     - Mask Visualization (50% sparsity level)
+
+   * - Unstructured Sparsity
+     - Fig 2.3: unstructured sparsity
+
+       ::
+
+           1 0 1 1 0 1 0 1
+           0 0 1 1 1 1 1 0
+           1 0 0 0 1 0 1 0
+           0 1 1 0 0 0 0 1
+
+   * - 2:4 Semi-Structured
+     - Fig 2.4: 2:4 semi-structured sparsity
+
+       ::
+
+           0 1 1 0 1 0 1 0
+           0 0 1 1 1 1 0 0
+           1 0 0 1 0 1 0 1
+           0 1 0 1 1 0 1 0
+
+   * - Block Sparsity
+     - Fig 2.5: 4x4 block-wise structured sparsity
+
+       ::
+
+           0 0 0 0 1 1 1 1
+           0 0 0 0 1 1 1 1
+           0 0 0 0 1 1 1 1
+           0 0 0 0 1 1 1 1
+
+   * - Structured Sparsity
+     - Fig 2.6: row-wise structured sparsity
+
+       ::
+
+           1 1 1 1 1 1 1 1
+           0 0 0 0 0 0 0 0
+           1 1 1 1 1 1 1 1
+           0 0 0 0 0 0 0 0
+
+*Table 4.4: Description of some common sparsity patterns.*
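+
+Unstructured and row-wise structured masks like the ones above can both be produced with ``torch.nn.utils.prune``; the snippet below is a small sketch comparing the two on a single Linear layer:
+
+.. code-block:: python
+
+    import torch
+    import torch.nn.utils.prune as prune
+
+    layer = torch.nn.Linear(8, 4)
+
+    # Unstructured: every weight entry gets its own mask value (50% of entries zeroed).
+    prune.l1_unstructured(layer, name="weight", amount=0.5)
+    print(layer.weight_mask)
+
+    prune.remove(layer, "weight")  # fold the mask in before applying a new one
+
+    # Structured: whole output rows are zeroed (50% of rows, chosen by L2 norm).
+    prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)
+    print(layer.weight_mask)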
+
+For more information on our supported APIs and benchmarks please refer to the `Sparsity README