
[WIP] Changes for training entropy model and correcting attention in local models #25

Merged
merged 1 commit into main on Jan 17, 2025

Conversation

@EntilZha (Contributor)

Summary:

  • Refactor local model configs to be separate and clearer
  • Add attention arguments and correct which attention is used in local models
  • Prepare for adding an entropy model training script
  • Fix failing unit tests

Test Plan:

@facebook-github-bot added the CLA Signed label Jan 16, 2025
@EntilZha marked this pull request as ready for review January 17, 2025 01:02
@EntilZha changed the title from "[WIP] Changes for training entropy model and correcting attention in local models" to "Changes for training entropy model and correcting attention in local models" Jan 17, 2025
@artidoro left a comment

A few suggestions for improvement, but the changes look functionally good!

@@ -176,6 +179,10 @@ class TrainArgs(BaseModel):
data: DataloaderArgs = DataloaderArgs()
optim: OptimArgs = OptimArgs()
model: ByteLatentTransformerArgs = ByteLatentTransformerArgs()
# This is only needed for training the entropy model


In our old code we had an architecture parameter which selects either vanilla Transformer or BLT. That seems easier than having all these args. Could we do that instead?

@EntilZha (Contributor Author)

In general I agree, but I'm not quite sure yet how to do this with pydantic. I'll see what I can do in the next PR, which will have the entropy model training config/code. Ideally we would have a parameter to specify the architecture; the tricky bit is having pydantic instantiate the default values for the model based on that.
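
One pydantic option might be a discriminated union, where a single `arch` field picks the args class (and therefore its defaults). A rough sketch, with made-up class and field names rather than the actual bytelatent configs:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


# Illustrative stand-ins for the real arg classes; the field names are made up.
class VanillaTransformerArgs(BaseModel):
    arch: Literal["transformer"] = "transformer"
    dim: int = 512
    n_layers: int = 8


class BLTArgs(BaseModel):
    arch: Literal["blt"] = "blt"
    dim: int = 512
    n_layers: int = 8
    cross_attn_encoder: bool = False


class TrainArgs(BaseModel):
    # Pydantic picks the union member (and its defaults) based on `arch`.
    model: Union[VanillaTransformerArgs, BLTArgs] = Field(
        default_factory=BLTArgs, discriminator="arch"
    )


# Switching architectures is then a single config field:
args = TrainArgs.model_validate({"model": {"arch": "transformer", "dim": 256}})
assert isinstance(args.model, VanillaTransformerArgs)
```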

cross_attn_decoder=False,
cross_attn_k=args.cross_attn_k if args.cross_attn_encoder else None,
cross_attn_init_by_pooling=args.cross_attn_init_by_pooling,
# Defaults

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid copying all the defaults below?

@EntilZha (Contributor Author)

I actually prefer the explicit copy rather than sharing a config in which not all the parameters are significant. There might be a way to copy all the values, though; I'll look into that at least.
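
If we do end up wanting to copy the shared values mechanically, something like this could work (a sketch, assuming both configs are pydantic models and that overlapping field names mean the same thing in both):

```python
# Copy every field of `args` that LocalModelArgs also declares, then override
# the local-model-specific values explicitly.
shared = {
    k: v for k, v in args.model_dump().items() if k in LocalModelArgs.model_fields
}
local_args = LocalModelArgs(**{**shared, "cross_attn_decoder": False})
```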


How about latent_transformer instead of global_transformer? We might want to apply that rename across all files.

@EntilZha (Contributor Author)

Makes sense, I can do that.

from bytelatent.model.utils import create_causal_mask, downsample
from bytelatent.tokenizers.blt_tokenizer import BOE_ID

logger = logging.getLogger()


class LocalModelArgs(BaseModel):
model_config = ConfigDict(extra="forbid")


It seems like this could be simplified by inheriting from the BaseTransformerArgs. There should be very few additional things that the local models need to know about.

@EntilZha (Contributor Author)

Agreed, assuming it's possible to override the default args in BaseTransformerArgs.
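
For reference, overriding a default in a pydantic subclass is just redeclaring the field; a minimal sketch with placeholder fields (not the real BaseTransformerArgs definition):

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict


# Placeholder base class; the real BaseTransformerArgs has different fields.
class BaseTransformerArgs(BaseModel):
    model_config = ConfigDict(extra="forbid")
    dim: int = 512
    n_heads: int = 8
    attn_impl: str = "sdpa"


class LocalModelArgs(BaseTransformerArgs):
    # Redeclaring a field overrides the inherited default...
    attn_impl: str = "xformers"
    # ...and local-model-only fields are simply added alongside.
    cross_attn_k: Optional[int] = None
    cross_attn_init_by_pooling: bool = False
```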

):
if attn_impl is None:


It might be clearer in a single line:
attn_impl = attn_impl or self.attn_impl

@EntilZha (Contributor Author)

I think I actually prefer the explicit style, but it's not a strong preference.

logging.warning(
"SDPA attention being used, which doesn't have specialized attention implementations for block_causal and local_block_causal attention."
)
WARNED_SDPA = True
return "causal"
elif attn_impl == "flex_attention":
return create_block_mask(causal_mask, None, None, seqlen, seqlen)


Add an assert that the bias type is causal here and for sdpa.

@EntilZha (Contributor Author)

I didn't add an assert here, since then our code wouldn't run at all without xformers, and some other issue comments mention needing to run on less capable GPUs. I think I could do an intermediate solution where I provide a way to suppress the error but crash training by default.
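
Something along these lines, where the `check_attn_bias` helper and the `BLT_SUPPRESS_ATTN_ERROR` environment variable are hypothetical names, not existing code:

```python
import logging
import os


def check_attn_bias(attn_impl: str, attn_bias_type: str) -> None:
    # Fail by default when the bias type isn't causal, but allow an explicit
    # opt-out so training can still run on GPUs without xformers/flex_attention.
    if attn_bias_type == "causal":
        return
    if os.environ.get("BLT_SUPPRESS_ATTN_ERROR", "0") == "1":
        logging.warning(
            "attn_bias_type=%s has no specialized implementation for "
            "attn_impl=%s; proceeding because BLT_SUPPRESS_ATTN_ERROR=1.",
            attn_bias_type,
            attn_impl,
        )
    else:
        raise ValueError(
            f"attn_impl={attn_impl} expects a causal bias type here; set "
            "BLT_SUPPRESS_ATTN_ERROR=1 to bypass this check."
        )
```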

@Vectorrent (Contributor)

This may or may not be the correct place to discuss... but I ran into a problem with the entropy model, in the patcher code.

This code expects the entropy model to exist as an on-disk checkpoint, but I was hoping to pass an already-instantiated entropy model to the patcher so it can be trained alongside the latent model. Is there any way we could rewrite the realtime patching to allow a user to pass an arbitrary nn.Module to the Patcher as an alternative to loading from a checkpoint?
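
Roughly the interface I have in mind, as a purely illustrative sketch (this is not the current Patcher API, and `load_entropy_model` is a placeholder for whatever checkpoint-loading helper the repo actually uses):

```python
from typing import Optional

import torch.nn as nn


def load_entropy_model(checkpoint_dir: str) -> nn.Module:
    # Placeholder for the repo's actual checkpoint-loading helper.
    raise NotImplementedError


class Patcher:
    def __init__(self, patcher_args, entropy_model: Optional[nn.Module] = None):
        if entropy_model is not None:
            # Use the caller's already-instantiated model, e.g. one being
            # trained alongside the latent model.
            self.entropy_model = entropy_model
        else:
            # Existing behavior: load the entropy model from an on-disk
            # checkpoint referenced by the patcher args.
            self.entropy_model = load_entropy_model(
                patcher_args.entropy_model_checkpoint_dir
            )
```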

@EntilZha (Contributor Author)

@Vectorrent I think it would be better to open a separate issue, since this will be closed once I merge my PR. If you open a new issue, I'll comment there.

@EntilZha changed the title from "Changes for training entropy model and correcting attention in local models" to "[WIP] Changes for training entropy model and correcting attention in local models" Jan 17, 2025
@EntilZha merged commit 6ffeb66 into main Jan 17, 2025
7 checks passed