[FEATURE] enhance model_uploader workflow to support MIT-licensed models from huggingface #388
Signed-off-by: zhichao-aws <[email protected]>
The workflow works well in my repo's GitHub Actions (I commented out the manual approval step). Autocut update PR:
With this workflow, to upload the BGE-small model, we trigger it with these settings. The last input, the MIT license URL, should be: https://github.com/FlagOpen/FlagEmbedding/raw/master/LICENSE
To upload BGE-base, change the model ID to BAAI/bge-base-en-v1.5.
This PR is ready for review.
.github/workflows/model_uploader.yml
@@ -49,7 +57,10 @@ on:
    options:
      - "NO"
      - "YES"

  MIT_license_url:
    description: "(Optional) MIT license url of the huggingface MIT model."
Does every model have different MIT licenses?
The MIT license text itself is standard, but the copyright header is customized with each author's information.
echo "verified=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 :warning:" >> $GITHUB_OUTPUT
echo "verified_apache=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "verified_mit=:white_check_mark: — It is verified that this model is licensed under MIT, and we have provided copyright statements" >> $GITHUB_OUTPUT
For Apache 2.0 licensed models, will this cause any issues?
I ran a regression test with Apache-2.0 models and it still works well: https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843322311/job/24283396611
For Apache 2.0, I'm seeing:
- Model License Apache-2.0
- Model Version: 1.0.0
- Tracing Format: BOTH
- Embedding Dimension: N/A
- Pooling Mode: N/A
- Model Description: N/A
- MIT License Url: N/A
I was thinking maybe we can make this a bit more dynamic? For example, in "MIT License Url", the "MIT" part could be picked up from "Model License", so that if we later support other licenses, it won't show only MIT.
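One way to read this suggestion, as a minimal Python sketch rather than the workflow's actual implementation: derive the URL label from the selected license instead of hard-coding "MIT". The function name `license_summary_lines` and the choice to omit the URL line when no URL is given are assumptions made for illustration only.

```python
from typing import List, Optional


def license_summary_lines(model_license: str, license_url: Optional[str]) -> List[str]:
    """Build summary lines whose URL label follows the selected license.

    Hypothetical helper: the name and output format are illustrative,
    not taken from the model_uploader workflow.
    """
    lines = [f"- Model License: {model_license}"]
    # Only emit a license URL line when a URL was actually provided,
    # and label it with the license name rather than a fixed "MIT".
    if license_url:
        lines.append(f"- {model_license} License Url: {license_url}")
    return lines
```

With this shape, an Apache-2.0 run would show only the license line, and a run for any other license would show a matching label.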
# find the copyright statements from origin MIT license. It looks like: Copyright (c) {year} {authorname}
copyright_statements = re.findall("Copyright.*\n", license_text)[0].strip()
huggingface_url = "https://huggingface.co/" + model_id
We are giving model source from the UI. Can this model source be dynamic?
Yes, the model_id and mit_license_url should be provided in the UI.
I meant the hard-coded huggingface.co. We currently have a model source field in the UI. Can we leverage that?
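A hedged sketch of how the two points in this thread could fit together in Python: look up the base URL from the model source instead of hard-coding it, and extract the copyright line defensively. Note that `re.findall(...)[0]` raises `IndexError` when no line matches. The mapping `MODEL_SOURCE_BASE_URLS`, its key `"huggingface"`, and both function names are hypothetical; the real values of the workflow's model source input are not shown in this PR excerpt.

```python
import re
from typing import Optional

# Hypothetical mapping from the workflow's model source input to a base URL.
# The key used here is an assumption, not the workflow's actual option value.
MODEL_SOURCE_BASE_URLS = {"huggingface": "https://huggingface.co/"}


def extract_copyright(license_text: str) -> Optional[str]:
    """Return the first line starting with 'Copyright', or None if absent."""
    match = re.search(r"^[ \t]*Copyright.*$", license_text, flags=re.MULTILINE)
    return match.group(0).strip() if match else None


def model_url(model_source: str, model_id: str) -> str:
    """Build the model URL from the source instead of a fixed host."""
    try:
        base = MODEL_SOURCE_BASE_URLS[model_source.lower()]
    except KeyError:
        raise ValueError(f"unsupported model source: {model_source!r}")
    return base + model_id
```

Returning `None` for a missing copyright line lets the caller decide whether to fail the run or ask a maintainer to check the license manually.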
@@ -428,12 +449,15 @@ def main(
:type pooling_mode: string
:param model_description: Model description input
:type model_description: string
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact.
Are we sure that we need these copyright statements for every kind of non-Apache-2.0 license? Or is this only for MIT licenses?
So far we only know that third-party statements are needed for MIT models, and that the rule for generating the third-party file applies to the BGE models. We still need to confirm that the same rule applies to other MIT-licensed models, and whether a third-party file is needed (and how to generate it) for other non-Apache-2.0 licenses.
  echo "missing MIT license url"
  exit 1
fi
if [[ "${{ github.event.inputs.model_source }}" == "" ]]
What if we provide some other source?
Then the third-party statements can be wrong, so we need to check that they match before approving the release.
What I meant is: if github.event.inputs.model_source == "XYZ", then this condition will pass, right?
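The concern here is that an emptiness check accepts any non-empty value. As an illustration only, written in Python rather than the workflow's shell, an allowlist check would reject "XYZ" up front. The set `SUPPORTED_MODEL_SOURCES`, its contents, and the function name `check_mit_inputs` are assumptions, not part of the PR.

```python
from typing import Optional

# Hypothetical allowlist; the workflow's real model source options are not
# shown in this thread.
SUPPORTED_MODEL_SOURCES = {"huggingface"}


def check_mit_inputs(
    model_license: str, model_source: str, mit_license_url: str
) -> Optional[str]:
    """Return an error message, or None when the inputs look consistent."""
    if model_license != "MIT":
        # Nothing extra to validate for other licenses in this sketch.
        return None
    if not mit_license_url:
        return "missing MIT license url"
    # An emptiness check alone would accept "XYZ"; membership does not.
    if model_source not in SUPPORTED_MODEL_SOURCES:
        return f"unsupported model source: {model_source!r}"
    return None
```

Even with such a check, the manual comparison of the generated third-party statements before approval, as described in the reply above, would still be needed.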
@dhrubo-os Any clue why the integration test (IT) fails? It doesn't seem related to this change. Is it a flaky test?
@@ -38,6 +38,7 @@
)

LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE"
THIRD_PARTY_FILE_NAME = "THIRD-PARTY"
Could you please check with the open source engineer whether naming the attribution file THIRD-PARTY looks good to him?
The name THIRD-PARTY was provided by the open source engineer.
if add_apache_license == True and third_party_copyrights_statements is not None:
    assert (
        False
    ), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
My understanding is that we will still distribute the artifacts under the Apache 2.0 license, but for MIT-licensed models we also need to add an extra attribution file that explicitly names the contributors. Am I missing anything?
Yeah, I think that was a flaky test.
Closing this because of the BGE training data issue. Feel free to reopen it when we need to upload other MIT models.
Description
Code changes that enhance the model upload workflow to support MIT-licensed models from Hugging Face.
Issues Resolved
#387
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.