Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor unit test workflow; add smoke test workflow #419

Closed
wants to merge 3 commits into from

Conversation

JamesKunstle
Copy link
Contributor

@JamesKunstle JamesKunstle commented Jan 31, 2025

This PR does two things:

  1. refactors the existing actions workflow for unit-test.yaml to call two new reusable workflows for starting and stopping an ec2 runner.
  2. adds a smoke-test.yaml workflow that's very similar to unit-test.yaml but calls a different tox+pytest path and requires a cuda runner.

Test groups are divided into three categories:
1) unit tests
2) smoke tests
3) benchmark tests

They each have a dedicated tox entrypoint.

Adds outer product of [FSDP, DeepSpeed] x [CPU offload, Not] test
matrix.

DEEPSPEED TESTS ARE BROKEN IN THIS COMMIT and are marked xFail- to be
fixed in another, later commit.

Signed-off-by: James Kunstle <[email protected]>
starting/stopping self-hosted ec2 runners is common code that we can
move into separate, reusable workflows.

moved those steps into their own files and refactored `unit-tests.yaml`
to call them instead of inlining.

Signed-off-by: James Kunstle <[email protected]>
@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing dependencies Pull requests that update a dependency file labels Jan 31, 2025
@JamesKunstle JamesKunstle force-pushed the add-feature-testing branch 3 times, most recently from 908fcf2 to f3b17a6 Compare January 31, 2025 01:14
users can dispatch a workflow that runs smoke tests against a selected
branch

Signed-off-by: James Kunstle <[email protected]>
@JamesKunstle JamesKunstle changed the title Add feature testing Refactor unit test workflow; add smoke test workflow Jan 31, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels Jan 31, 2025
@@ -0,0 +1,76 @@
# SPDX-License-Identifier: Apache-2.0
name: "[Reusable] Start EC2 self-hosted runner."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do you need to put reusable in the name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a convention I saw in a blog post, we don't have to adopt that if it seems messy

"""

huggingface_hub.snapshot_download(
token=os.getenv("HF_TOKEN", None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this token needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly related: instructlab/instructlab#3100

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an annoying behavioral thing. snapshot_download seems like it should grab HF_TOKEN from the environment but it wasn't doing that, so I had to pull it in directly this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HF_TOKEN shouldn't be accessible to this test. Before instructlab/instructlab#3100, the CLI required something to be passed even if it wasn't needed. Hopefully this is a case that can just stop passing the token.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, you're totally right. Removed.

@danmcp
Copy link
Member

danmcp commented Feb 1, 2025

One concern I have about this change is it will make the ec2 handling logic different from the rest of the repos. Given the limited expertise in this logic, I think it would be worth it to keep the different instructlab repos consistent. So if we are going to change it here, we should make the same changes in instructlab, sdg, and eval.

@mergify mergify bot added the ci-failure label Feb 1, 2025
@JamesKunstle
Copy link
Contributor Author

JamesKunstle commented Feb 4, 2025

One concern I have about this change is it will make the ec2 handling logic different from the rest of the repos. Given the limited expertise in this logic, I think it would be worth it to keep the different instructlab repos consistent. So if we are going to change it here, we should make the same changes in instructlab, sdg, and eval.

That's a fair point. I was hoping to break this logic out so we can maintain only these common reusable components since the startup / stop EC2 behavior is common to a lot of scripts.

@courtneypacheco would there be any appetite to refactoring the existing workflows across the repos to reuse these reusable jobs? inlining the reusable logic seems like a hassle at this point. I'd gladly help to do this work.

@JamesKunstle JamesKunstle requested a review from danmcp February 4, 2025 06:50
Copy link
Contributor

mergify bot commented Feb 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @JamesKunstle please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 4, 2025
@@ -0,0 +1,76 @@
# SPDX-License-Identifier: Apache-2.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JamesKunstle
Copy link
Contributor Author

@danmcp closing this PR in favor of #424, introducing only the smoke test, omitting the reusable workflow refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration ci-failure dependencies Pull requests that update a dependency file needs-rebase testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants