# Contributing Guide

Thank you for considering contributing to SG-NLP! We believe that SG-NLP's future is closely tied with the community and community contributions will help SG-NLP to grow faster.

## How can I contribute?

- [Add a new model to `sgnlp`](#adding-a-new-model-to-sgnlp)
  - [Pre-requisites](#pre-requisites)
  - [Required Components](#required-components)
    - [Config](#config)
    - [Preprocess](#preprocess)
    - [Tokenizer](#tokenizer-optional)
    - [Modeling](#modeling)
    - [Train](#train)
    - [Eval](#eval)
    - [Utils](#utils)
    - [README](#readme)
    - [Model weights and artefacts](#model-weights-and-artefacts)
- [Submit Bug Fixes](#submitting-bug-fixes)
- [Add Documentation](#adding-documentation)

## Adding a new model to `sgnlp`

### Pre-requisites

1. Fork the `sgnlp` repository.
2. Create a Python virtual environment (version >= 3.8).
3. Install the packages in `requirements_dev.txt` at the root of the repository using `pip install -r requirements_dev.txt`. This will install `black`, `flake8`, and `pytest` which we use for code formatting and testing.

### Required Components

Before you create a pull request to add your model to `sgnlp`, please ensure that you have the following components ready.

- Python scripts / code comprising (we'll go into more detail below):
  - [config.py](#config)
  - [modeling.py](#modeling)
  - [preprocess.py](#preprocess)
  - [train.py](#train)
  - [eval.py](#eval)
  - [utils.py](#utils) (optional)
  - [README](#readme)
  - `requirements.txt` (discouraged)
- Model information (to be included in the README)
  - Original paper / source
  - Datasets and/or how to obtain them
  - Evaluation metrics
  - Model size
  - Training information
- Model weights and artefacts (to be submitted separately)
  - pytorch_model.bin
  - config.json
  - tokenizer_config.json (optional)

To contribute a model, add a folder for the model at `sgnlp/sgnlp/models/<model_name>`. The following components are required within this folder.

| Folder / File | Description |
| :--- | :--- |
| _<model_name>_ | Folder containing the modeling, preprocess, config, train, and eval scripts. |
| _<model_name>/config_ | Folder containing the JSON configuration files used for the train and eval scripts. |
| _<model_name>/config.py_ | Script containing the model config class which inherits from HuggingFace's [`PretrainedConfig`](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig) or its family of derived classes. |
| _<model_name>/eval.py_ | Script containing code to evaluate the model's performance. |
| _<model_name>/modeling.py_ | Script containing the model class which inherits from HuggingFace's [`PreTrainedModel`](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel) class or its family of derived classes. |
| _<model_name>/preprocess.py_ | Script containing code to preprocess input text into the model's required input tensors. |
| _<model_name>/tokenization.py_ (optional) | Script containing the model tokenizer class which inherits from HuggingFace's [`PreTrainedTokenizer`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer) class or its family of derived classes. |
| _<model_name>/train.py_ | Script containing code to train the model. It is recommended to use the [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class from HuggingFace. |
| _<model_name>/README.md_ | README markdown file containing model information such as the model source, architecture, evaluation datasets and metrics, model size, and training information. |

To manage the number of dependencies installed with `sgnlp`, contributors are strongly encouraged to limit their code to the packages listed in `setup.py`. If additional dependencies are required, please introduce a check in the `__init__.py` at `sgnlp/sgnlp/models/<model_name>/__init__.py`. For example, the Latent Structure Refinement (LSR) model for Relation Extraction requires the `networkx` package. The snippet below, from LSR's `__init__.py`, checks that `networkx` is installed when the model is imported. Users will have to install such additional dependencies separately.

```python
from ...utils.requirements import check_requirements

requirements = ["networkx"]
check_requirements(requirements)
```

### Config

Model configs contain model architecture information. Typically, this would include hyperparameters for the different layers within the model as well as the loss function. Model configs should inherit from the `PretrainedConfig` class from the `transformers` package. The following is an example from the Cross Lingual Cross Domain Sentiment Analysis model.

```python
from transformers import PretrainedConfig


class UFDClassifierConfig(PretrainedConfig):
    model_type = "classifier"

    def __init__(self, embed_dim=1024, num_class=2, initrange=0.1, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_class = num_class
        self.initrange = initrange
```
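
As a usage sketch (directory name illustrative), such a config can be instantiated with overrides and round-tripped to JSON via the methods inherited from `PretrainedConfig`:

```python
config = UFDClassifierConfig(num_class=3)
config.save_pretrained("ufd_classifier/")  # writes ufd_classifier/config.json
reloaded = UFDClassifierConfig.from_pretrained("ufd_classifier/")
```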

For models that use or adapt pre-trained configs already available in the `transformers` package, the model config should inherit from the pre-trained config class instead. For example, this model config inherits from `BertConfig` which is a child class of `PretrainedConfig`.

```python
from transformers import BertConfig


class NewModelConfig(BertConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```
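
Because this config subclasses `BertConfig`, it can also be seeded from an existing checkpoint's configuration (the checkpoint name below is illustrative):

```python
config = NewModelConfig.from_pretrained("bert-base-uncased")
```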

### Preprocess

The `preprocess.py` script and its associated preprocessor class are an addition specific to `sgnlp`. When implementing various models, the team found that some models required more complex preprocessing. For example, some NLP models take in multiple different text inputs (e.g., different utterances, multiple tweets, or a single question with multiple candidate answers), each requiring different preprocessing steps. The `preprocess.py` script and its preprocessor class are the team's solution for packaging all of these steps into a single step that is consistent across models.

The preprocessor class is a plain Python class (it inherits from the default `object` class). All preprocessing steps should be executed in the class's `__call__` method, which should return a dictionary containing all the input tensors required by the model. The following code snippet illustrates the `__call__` method from the RECCON Span Extraction model's `RecconSpanExtractionPreprocessor`.

```python
from typing import Dict, List, Tuple, Union

import torch
from transformers import BatchEncoding
from transformers.data.processors.squad import SquadExample, SquadFeatures

# `load_examples` is a helper defined elsewhere in the RECCON model's code.


class RecconSpanExtractionPreprocessor:
    def __call__(
        self, data_batch: Dict[str, List[str]]
    ) -> Tuple[
        BatchEncoding,
        List[Dict[str, Union[int, str]]],
        List[SquadExample],
        List[SquadFeatures],
    ]:
        self._check_values_len(data_batch)
        concatenated_batch, evidences = self._concatenate_batch(data_batch)
        dataset, examples, features = load_examples(
            concatenated_batch, self.tokenizer, evaluate=True, output_examples=True
        )
        input_ids = [torch.unsqueeze(instance[0], 0) for instance in dataset]
        attention_mask = [torch.unsqueeze(instance[1], 0) for instance in dataset]
        token_type_ids = [torch.unsqueeze(instance[2], 0) for instance in dataset]
        output = {
            "input_ids": torch.cat(input_ids, axis=0),
            "attention_mask": torch.cat(attention_mask, axis=0),
            "token_type_ids": torch.cat(token_type_ids, axis=0),
        }
        output = BatchEncoding(output)
        return output, evidences, examples, features
```

In the RECCON Span Extraction model, `output` is a dictionary with the token IDs, attention masks, and token type IDs for the input utterance. `evidences`, `examples`, and `features` are other features required by the RECCON model. The key idea is to consolidate all the necessary preprocessing steps into a single method to reduce the effort needed to start using the models.
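
An illustrative call might look like the sketch below (the dictionary keys are assumptions based on the RECCON task; consult the model's README for the exact input format):

```python
preprocessor = RecconSpanExtractionPreprocessor()
batch = {
    "emotion": ["sadness"],
    "target_utterance": ["Why would she quit?"],
    "evidence_utterance": ["She got a better offer elsewhere."],
    "conversation_history": ["She got a better offer elsewhere. Why would she quit?"],
}
tensors, evidences, examples, features = preprocessor(batch)
```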

### Tokenizer (optional)

The `tokenization.py` script is optional if `preprocess.py` already contains a tokenizer. All tokenizers should inherit from the `PreTrainedTokenizer` or `PreTrainedTokenizerFast` classes from the `transformers` package.

```python
from transformers import PreTrainedTokenizer


class NewModelTokenizer(PreTrainedTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```

For models that use or adapt pre-trained tokenizers already available in the `transformers` package, the tokenizer should inherit from the pre-trained tokenizer class instead. For example, the RECCON Span Extraction tokenizer inherits from `BertTokenizer` which inherits from `PreTrainedTokenizer`.

```python
from transformers import BertTokenizer


class RecconSpanExtractionTokenizer(BertTokenizer):
    """
    Constructs a Reccon Span Extraction tokenizer, derived from the Bert tokenizer.

    Args:
        vocab_file (:obj:`str`):
            Path to the vocabulary file.
        do_lower_case (:obj:`bool`, defaults to :obj:`False`):
            Whether or not to lowercase the input when tokenizing.
    """

    def __init__(self, vocab_file: str, do_lower_case: bool = False, **kwargs) -> None:
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)
```
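
As a usage sketch (checkpoint name illustrative), a tokenizer derived from `BertTokenizer` can load a standard BERT vocabulary through the inherited `from_pretrained` method:

```python
tokenizer = RecconSpanExtractionTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("Why would she quit?", return_tensors="pt")
```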

### Modeling

There are two steps to adding a new model class. The first is to introduce a `NewModelPreTrainedModel` class, which handles weight initialization and provides a simple interface for downloading and loading pretrained models. This class should inherit from the `PreTrainedModel` class from `transformers`.

The key things to define are the `config_class` and `base_model_prefix` class attributes and the `_init_weights` method. The `_init_weights` method dictates how the weights for the different layers are initialized.

```python
from transformers import PreTrainedModel

from .config import LsrConfig


class LsrPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = LsrConfig
    base_model_prefix = "lsr"

    def _init_weights(self, module):
        """Initialize the weights"""
        ...
```
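
The body of `_init_weights` is model-specific. A common pattern, shown here as an illustrative sketch rather than LSR's actual implementation, is to branch on the layer type:

```python
import torch.nn as nn


def _init_weights(self, module):
    """Illustrative only: initialize weights according to the layer type."""
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=0.02)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
```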

Subsequently, the main model class should inherit from this `NewModelPreTrainedModel` class. The main model class contains the code required to execute the model's forward pass.

```python
class RumourDetectionTwitterModel(RumourDetectionTwitterPreTrainedModel):
    def __init__(self, config: RumourDetectionTwitterConfig):
        super().__init__(config)
        self.config = config
        self.wordEncoder = WordEncoder(self.config)
        self.positionEncoderWord = PositionEncoder(config, self.config.max_length)
        self.positionEncoderTime = PositionEncoder(config, self.config.size)
        self.hierarchicalTransformer = HierarchicalTransformer(self.config)
        if config.loss == "cross_entropy":
            self.loss = nn.CrossEntropyLoss()
        self.init_weights()

    def forward(
        self,
        token_ids: torch.Tensor,
        time_delay_ids: torch.Tensor,
        structure_ids: torch.Tensor,
        token_attention_mask=None,
        post_attention_mask=None,
        labels: Optional[torch.Tensor] = None,
    ):
        X = self.wordEncoder(token_ids)
        word_pos = self.prepare_word_pos(token_ids).to(X.device)
        word_pos = self.positionEncoderWord(word_pos)
        time_delay = self.positionEncoderTime(time_delay_ids)
        logits = self.hierarchicalTransformer(
            X,
            word_pos,
            time_delay,
            structure_ids,
            attention_mask_word=token_attention_mask,
            attention_mask_post=post_attention_mask,
        )
        if labels is not None:
            loss = self.loss(logits, labels)
        else:
            loss = None

        return RumourDetectionTwitterModelOutput(loss=loss, logits=logits)
```

There are three key things to note in the above implementation.

1. When initialising the model, it is important to invoke the `init_weights()` method. Note the lack of an underscore at the start of the method name. This is required so that the model weights are initialized using the `_init_weights` method defined in `NewModelPreTrainedModel`.

2. The `forward` method takes in an optional `labels` argument. If this argument is passed to the model, the `forward` method should also return the value of the loss function for that batch of inputs.

3. The `forward` method's output is an object of the `RumourDetectionTwitterModelOutput` dataclass. This dataclass is illustrated in the code snippet below.

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.file_utils import ModelOutput


@dataclass
class RumourDetectionTwitterModelOutput(ModelOutput):
    """
    Base class for outputs of Rumour Detection models

    Args:
        loss (:obj:`torch.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
            Classification loss, typically cross entropy. The loss function used depends on what is specified in
            RumourDetectionTwitterConfig.
        logits (:obj:`torch.Tensor` of shape :obj:`(batch_size, num_classes)`):
            Raw logits for each class. num_classes = 4 by default.
    """

    loss: Optional[torch.Tensor] = None
    logits: torch.Tensor = None
```
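
A sketch of consuming this output (variable names illustrative); `ModelOutput` subclasses support both attribute and dictionary-style access:

```python
outputs = model(token_ids, time_delay_ids, structure_ids, labels=labels)
loss, logits = outputs.loss, outputs.logits  # equivalently outputs["logits"]
if loss is not None:
    loss.backward()
```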

### Train

`train.py` should contain a working implementation of the model training process. A user should be able to train the model from the command line using `python -m train --train_args config/config.json`.
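
As a rough illustration, the entry point might look like the sketch below (the argument name follows the command above; the config keys and training loop are assumptions):

```python
import argparse
import json


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_args", type=str, required=True,
        help="Path to a JSON file of training arguments.",
    )
    args = parser.parse_args()
    with open(args.train_args) as f:
        train_args = json.load(f)  # hyperparameters, data paths, output dir, etc.

    # Build the model from its config and run the training loop here,
    # e.g. with HuggingFace's Trainer as recommended above.
    ...


if __name__ == "__main__":
    main()
```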

### Eval

`eval.py` should contain a working implementation of the model evaluation process. This script should load the trained model (using the information in `config/config.json`) and evaluate it against the evaluation datasets. The evaluation metrics reported should correspond to those reported in the `README.md`.
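
A minimal sketch of the shape such a script might take (the `NewModel` class, config keys, and `load_eval_batches` helper are illustrative assumptions; accuracy stands in for whatever metric the README reports):

```python
import json

import torch


def evaluate(config_path: str = "config/config.json") -> float:
    with open(config_path) as f:
        cfg = json.load(f)

    model = NewModel.from_pretrained(cfg["model_dir"])  # load trained weights
    model.eval()

    correct = total = 0
    with torch.no_grad():
        for batch, labels in load_eval_batches(cfg["eval_data"]):
            logits = model(**batch).logits
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total
```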

### Utils

`utils.py` should contain other functions that are useful for `train.py` or `eval.py` but do not fit directly within any of the other scripts above.

### README

The `README` for the model should provide a concise introduction to the model. The following information is required:

- Citation and link to the original paper or source that introduced the model
- Citation and link to the train, validation and test datasets that the model was trained on. If the model was trained and evaluated on licensed datasets, information should be provided on how the SG-NLP team can obtain access to the evaluation (test) dataset. The train dataset may be omitted.
- Evaluation metrics. Please cite the appropriate paper if the evaluation metric is a published benchmark.
- Model size (in terms of the size of the trained model's weights)
- Training information, such as hyperparameter values, compute time, and the resources used to train the model.

### Model weights and artefacts

Model weights and artefacts comprise:

- Saved model weights. Specifically, the `pytorch_model.bin` file saved using the `save_pretrained` method of the model class implemented in `modeling.py` (see the sketch after this list). For now, only models implemented in PyTorch are accepted; the team is looking into accepting models implemented in TensorFlow as well.
- Model config. The `config.json` generated when using the `save_pretrained` method from the model config class implemented in `config.py`.
- Any artefacts needed by the tokenizer or preprocessor.
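
For reference, a sketch of how these artefacts are typically produced (directory name illustrative):

```python
# After training, persist the weights and config with the inherited
# `save_pretrained` methods.
model.save_pretrained("artefacts/")   # writes pytorch_model.bin and config.json
config.save_pretrained("artefacts/")  # writes config.json on its own
```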

## Submitting Bug Fixes

If you spot a bug, please follow these steps to report it.

1. Check the issues list to see whether the bug has already been reported. If an issue has already been created, please comment on that issue with details on how to replicate the bug.
2. If there is no issue relevant to the bug, please open a GitHub issue with:
   - A clear description of the bug
   - Information about your environment (e.g., OS, package version, Python version)
   - Steps to replicate the bug (e.g., a code snippet that we could run to encounter it)

## Adding Documentation

One easy way to contribute is to add to or refine the documentation / docstrings of the models that are currently available. `sgnlp` uses the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) for docstrings. Once the docstrings have been added or edited, please submit a pull request.
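
For reference, a generic example of a Google-style docstring (not taken from `sgnlp`):

```python
def count_tokens(text: str, delimiter: str = " ") -> int:
    """Counts the number of tokens in a string.

    Args:
        text: The input string to split into tokens.
        delimiter: The string used to split ``text``.

    Returns:
        The number of tokens found in ``text``.
    """
    return len(text.split(delimiter))
```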
