Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] enhance model_uploader workflow to support MIT-licensed models from huggingface #388

Closed
wants to merge 23 commits into from

Conversation

zhichao-aws
Copy link
Member

Description

Code changes to enhance model upload workflow to support MIT licensed models from huggingface

Issues Resolved

#387

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws
Copy link
Member Author

zhichao-aws commented Apr 26, 2024

The workflow works well in my repo GH actions (I commented out the manual approve part)
They traced model and upload to S3 and create PR in my repo
link:
https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843390731
https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843323304

autocut update PR:
zhichao-aws#7
zhichao-aws#8

@zhichao-aws
Copy link
Member Author

zhichao-aws commented Apr 26, 2024

With this workflow, to upload BGE-small models, we trigger it with these settings, and the last input MIT license url should be : https://github.com/FlagOpen/FlagEmbedding/raw/master/LICENSE
image

@zhichao-aws
Copy link
Member Author

zhichao-aws commented Apr 26, 2024

With this workflow, to upload BGE-small models, we trigger it with these settings, and the last input MIT license url should be : https://github.com/FlagOpen/FlagEmbedding/raw/master/LICENSE

To upload BGE-base, change the model id to BAAI/bge-base-en-v1.5

@zhichao-aws zhichao-aws marked this pull request as ready for review April 26, 2024 09:00
Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws
Copy link
Member Author

One screenshot for the upload workflow at my GH repo
image

@zhichao-aws
Copy link
Member Author

This PR is ready for review.

@@ -49,7 +57,10 @@ on:
options:
- "NO"
- "YES"

MIT_license_url:
description: "(Optional) MIT license url of the huggingface MIT model."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does every model have different MIT licenses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a standard statements for MIT licenses, but the copyright header is customized according to the author info

echo "verified=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 :warning:" >> $GITHUB_OUTPUT
echo "verified_apache=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "verified_mit=:white_check_mark: — It is verified that this model is licensed under MIT, and we have provided copyright statements" >> $GITHUB_OUTPUT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For apache 2.0 licensed models, will this make any issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a regression test using Apache-2 models and it still works well https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843322311/job/24283396611

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For apache 2.0, I'm seeing:

- Model License Apache-2.0
- Model Version: 1.0.0
- Tracing Format: BOTH
- Embedding Dimension: N/A
- Pooling Mode: N/A
- Model Description: N/A
- MIT License Url: N/A

I was thinking may be we can make this bit more dynamic? Like in MIT License Url, MIT could be picked up from the Model License? So that later if we use any other different licenses, it won't show only MIT?


# find the copyright statements from origin MIT license. It looks like: Copyright (c) {year} {authorname}
copyright_statements = re.findall("Copyright.*\n", license_text)[0].strip()
huggingface_url = "https://huggingface.co/" + model_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are giving model source from the UI. Can this model source be dynamic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the model_id and mit_license_url should be provided at the UI chart

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant about huggingface.co--> Currently we have model source field in the UI. Can we leverage that?

@@ -428,12 +449,15 @@ def main(
:type pooling_mode: string
:param model_description: Model description input
:type model_description: string
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure that for any kind non Apahe-2.0 licenses we need this copy right statements? Or this is only for MIT licenses?

Copy link
Member Author

@zhichao-aws zhichao-aws May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we only know we need third party statements for MIT models, and know the rule to generate third-party file applies for BGE models. We still need to confirm the generating rule applies for other MIT licensed models, to confirm we need third-party file and how to generate third-party file for other non Apahe-2.0 licenses

echo "missing MIT license url"
exit 1
fi
if [[ "${{ github.event.inputs.model_source }}" == "" ]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we provide some other source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the third party statements can be wrong. So we need to check whether they are matched before approve the release

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I tried to mean is let's say github.event.inputs.model_source == "XYZ" then this condition will pass, right?

opensearch_py_ml/ml_models/sentencetransformermodel.py Outdated Show resolved Hide resolved
@zhichao-aws
Copy link
Member Author

@dhrubo-os Any clue why the IT fails? It seems not related to the change, is it a flaky test?

.ci/run-repository.sh Outdated Show resolved Hide resolved
@@ -38,6 +38,7 @@
)

LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE"
THIRD_PARTY_FILE_NAME = "THIRD-PARTY"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please check with the open source engineer that if naming the attribution file as THIRD-PARTY looks good to him or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name THIRD_PARTY is provided by the open source engineer

if add_apache_license == True and third_party_copyrights_statements is not None:
assert (
False
), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
Copy link
Collaborator

@dhrubo-os dhrubo-os May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is, we will still distribute the artifacts under apache 2.0 license but for MIT licenses models, we also need to add extra attribution file to deliberately mention about the contributors name. Am I missing anything?

@dhrubo-os
Copy link
Collaborator

@dhrubo-os Any clue why the IT fails? It seems not related to the change, is it a flaky test?

Yeah, I think that was a flaky test.

@zhichao-aws
Copy link
Member Author

close it for the issue of BGE training data. Feel free to re-open it when we need to upload other MIT models

@zhichao-aws zhichao-aws closed this Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants