Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] enhance model_uploader workflow to support MIT-licensed models from huggingface #388

Closed
wants to merge 23 commits into from
Closed
Show file tree
Hide file tree
Changes from 22 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .ci/run-repository.sh
Original file line number Diff line number Diff line change
Expand Up @@ -68,11 +68,13 @@ elif [[ "$TASK_TYPE" == "doc" ]]; then
elif [[ "$TASK_TYPE" == "trace" ]]; then
# Set up OpenSearch cluster & Run model autotracing (Invoked by model_uploader.yml workflow)
echo -e "\033[34;1mINFO:\033[0m MODEL_ID: ${MODEL_ID}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MODEL_LICENSE: ${MODEL_LICENSE}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MODEL_VERSION: ${MODEL_VERSION}\033[0m"
echo -e "\033[34;1mINFO:\033[0m TRACING_FORMAT: ${TRACING_FORMAT}\033[0m"
echo -e "\033[34;1mINFO:\033[0m EMBEDDING_DIMENSION: ${EMBEDDING_DIMENSION:-N/A}\033[0m"
echo -e "\033[34;1mINFO:\033[0m POOLING_MODE: ${POOLING_MODE:-N/A}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MODEL_DESCRIPTION: ${MODEL_DESCRIPTION:-N/A}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MIT_LICENSE_URL: ${MIT_LICENSE_URL:-N/A}\033[0m"
zhichao-aws marked this conversation as resolved.
Show resolved Hide resolved

docker run \
--network=${network_name} \
Expand All @@ -84,7 +86,8 @@ elif [[ "$TASK_TYPE" == "trace" ]]; then
--env "TEST_TYPE=server" \
--name opensearch-py-ml-trace-runner \
opensearch-project/opensearch-py-ml \
nox -s "trace-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_VERSION} ${TRACING_FORMAT} -ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE} -md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"}
nox -s "trace-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_LICENSE} ${MODEL_VERSION} ${TRACING_FORMAT} -ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE} \
-md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"} -mlu ${MIT_LICENSE_URL}

docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/upload/ ./upload/
docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/trace_output/ ./trace_output/
Expand Down
56 changes: 48 additions & 8 deletions .github/workflows/model_uploader.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,14 @@ on:
required: true
type: string
default: "huggingface"
model_license:
description: "Model license (e.g. Apache-2.0)"
required: true
type: choice
default: "Apache-2.0"
options:
- "Apache-2.0"
- "MIT"
model_id:
description: "Model ID for auto-tracing and uploading (e.g. sentence-transformers/msmarco-distilbert-base-tas-b)"
required: true
Expand Down Expand Up @@ -49,7 +57,10 @@ on:
options:
- "NO"
- "YES"

MIT_license_url:
zhichao-aws marked this conversation as resolved.
Show resolved Hide resolved
description: "(Optional) MIT license url of the huggingface MIT model."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does every model have different MIT licenses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a standard statements for MIT licenses, but the copyright header is customized according to the author info

required: false
type: string

jobs:
# Step 2: Initiate workflow variable
Expand All @@ -73,6 +84,7 @@ jobs:
embedding_dimension=${{ github.event.inputs.embedding_dimension }}
pooling_mode=${{ github.event.inputs.pooling_mode }}
model_description="${{ github.event.inputs.model_description }}"
MIT_license_url="${{ github.event.inputs.MIT_license_url }}"

workflow_info="
============= Workflow Details ==============
Expand All @@ -83,11 +95,13 @@ jobs:

========= Workflow Input Information =========
- Model ID: ${{ github.event.inputs.model_id }}
- Model License ${{ github.event.inputs.model_license }}
- Model Version: ${{ github.event.inputs.model_version }}
- Tracing Format: ${{ github.event.inputs.tracing_format }}
- Embedding Dimension: ${embedding_dimension:-N/A}
- Pooling Mode: ${pooling_mode:-N/A}
- Model Description: ${model_description:-N/A}
- MIT License Url: ${MIT_license_url:-N/A}

======== Workflow Output Information =========
- Embedding Verification: Passed"
Expand All @@ -96,16 +110,35 @@ jobs:
echo "${workflow_info@E}" >> $GITHUB_OUTPUT
echo "EOF" >> $GITHUB_OUTPUT
echo "${workflow_info@E}"
- name: Check MIT license url
id: check_mit_license_url
run: |
if [[ "${{ github.event.inputs.model_license }}" == "MIT" ]]
then
echo "Uploading MIT licensed model"
if [[ "${{ github.event.inputs.MIT_license_url }}" == "" ]]
then
echo "missing MIT license url"
exit 1
fi
if [[ "${{ github.event.inputs.model_source }}" == "" ]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we provide some other source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the third party statements can be wrong. So we need to check whether they are matched before approve the release

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I tried to mean is let's say github.event.inputs.model_source == "XYZ" then this condition will pass, right?

then
echo "only support MIT models from huggingface now"
exit 1
fi
fi
- name: Initiate license_line
id: init_license_line
run: |
echo "verified=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 :warning:" >> $GITHUB_OUTPUT
echo "verified_apache=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT
echo "verified_mit=:white_check_mark: — It is verified that this model is licensed under MIT, and we have provided copyright statements" >> $GITHUB_OUTPUT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For apache 2.0 licensed models, will this make any issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a regression test using Apache-2 models and it still works well https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843322311/job/24283396611

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For apache 2.0, I'm seeing:

- Model License Apache-2.0
- Model Version: 1.0.0
- Tracing Format: BOTH
- Embedding Dimension: N/A
- Pooling Mode: N/A
- Model Description: N/A
- MIT License Url: N/A

I was thinking may be we can make this bit more dynamic? Like in MIT License Url, MIT could be picked up from the Model License? So that later if we use any other different licenses, it won't show only MIT?

echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 or MIT with enough materials :warning:" >> $GITHUB_OUTPUT
outputs:
model_folder: ${{ steps.init_folders.outputs.model_folder }}
sentence_transformer_folder: ${{ steps.init_folders.outputs.sentence_transformer_folder }}
workflow_info: ${{ steps.init_workflow_info.outputs.workflow_info }}
verified_license_line: ${{ steps.init_license_line.outputs.verified }}
verified_apache_license_line: ${{ steps.init_license_line.outputs.verified_apache }}
verified_mit_license_line: ${{ steps.init_license_line.outputs.verified_mit }}
unverified_license_line: ${{ steps.init_license_line.outputs.unverified }}

# Step 3: Check if the model already exists in the model hub
Expand Down Expand Up @@ -175,11 +208,13 @@ jobs:
- name: Export Arguments
run: |
echo "MODEL_ID=${{ github.event.inputs.model_id }}" >> $GITHUB_ENV
echo "MODEL_LICENSE=${{ github.event.inputs.model_license }}" >> $GITHUB_ENV
echo "MODEL_VERSION=${{ github.event.inputs.model_version }}" >> $GITHUB_ENV
echo "TRACING_FORMAT=${{ github.event.inputs.tracing_format }}" >> $GITHUB_ENV
echo "EMBEDDING_DIMENSION=${{ github.event.inputs.embedding_dimension }}" >> $GITHUB_ENV
echo "POOLING_MODE=${{ github.event.inputs.pooling_mode }}" >> $GITHUB_ENV
echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV
echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV
echo "MIT_LICENSE_URL=${{ github.event.inputs.MIT_license_url }}" >> $GITHUB_ENV
- name: Autotracing ${{ matrix.cluster }} secured=${{ matrix.secured }} version=${{matrix.entry.opensearch_version}}
run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} trace"
- name: Limit Model Size to 2GB
Expand All @@ -195,10 +230,15 @@ jobs:
- name: License Verification
id: license_verification
run: |
apache_verified=$(<trace_output/apache_verified.txt)
if [[ $apache_verified == "True" ]]
license_verified=$(<trace_output/license_verified.txt)
if [[ $license_verified == "True" ]]
then
echo "license_line=${{ needs.init-workflow-var.outputs.verified_license_line }}" >> $GITHUB_OUTPUT
if [[ "${{ github.event.inputs.model_license }}" == "Apache-2.0" ]]
then
echo "license_line=${{ needs.init-workflow-var.outputs.verified_apache_license_line }}" >> $GITHUB_OUTPUT
else
echo "license_line=${{ needs.init-workflow-var.outputs.verified_mit_license_line }}" >> $GITHUB_OUTPUT
fi
echo "license_info=Automatically Verified" >> $GITHUB_OUTPUT
else
echo "license_line=${{ needs.init-workflow-var.outputs.unverified_license_line }}" >> $GITHUB_OUTPUT
Expand Down
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Add support for model profiles by @rawwar in ([#358](https://github.com/opensearch-project/opensearch-py-ml/pull/358))
- Support for security default admin credential changes in 2.12.0 in ([#365](https://github.com/opensearch-project/opensearch-py-ml/pull/365))
- adding cross encoder models in the pre-trained traced list ([#378](https://github.com/opensearch-project/opensearch-py-ml/pull/378))
- adding support for upload huggingface MIT licensed models ([388](https://github.com/opensearch-project/opensearch-py-ml/pull/388))


### Changed
Expand Down
41 changes: 40 additions & 1 deletion opensearch_py_ml/ml_models/sentencetransformermodel.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
import subprocess
import time
from pathlib import Path
from typing import List
from typing import List, Optional
from zipfile import ZipFile

import matplotlib.pyplot as plt
Expand All @@ -38,6 +38,7 @@
)

LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE"
THIRD_PARTY_FILE_NAME = "THIRD-PARTY"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please check with the open source engineer that if naming the attribution file as THIRD-PARTY looks good to him or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name THIRD_PARTY is provided by the open source engineer



class SentenceTransformerModel:
Expand Down Expand Up @@ -657,6 +658,22 @@ def _add_apache_license_to_model_zip_file(self, model_zip_file_path: str):
with ZipFile(str(model_zip_file_path), "a") as zipObj:
zipObj.writestr("LICENSE", r.content)

def _add_third_party_copyrights_statements_to_model_zip_file(
self, third_party_copyrights_statements: str, model_zip_file_path: str
):
"""
Add Statements text for non Apache-2.0 licensed third party model. Add it to the model zip file at model_zip_file_path

:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact.
:type third_party_copyrights_statements: string
:param model_zip_file_path: Path to the model zip file
:type model_zip_file_path: string
:return: no return value expected
:rtype: None
"""
with ZipFile(str(model_zip_file_path), "a") as zipObj:
zipObj.writestr(THIRD_PARTY_FILE_NAME, third_party_copyrights_statements)

def zip_model(
self,
model_path: str = None,
Expand Down Expand Up @@ -771,6 +788,7 @@ def save_as_pt(
model_output_path: str = None,
zip_file_name: str = None,
add_apache_license: bool = False,
third_party_copyrights_statements: Optional[str] = None,
) -> str:
"""
Download sentence transformer model directly from huggingface, convert model to torch script format,
Expand Down Expand Up @@ -802,10 +820,16 @@ def save_as_pt(
:param add_apache_license:
Optional, whether to add a Apache-2.0 license file to model zip file
:type add_apache_license: string
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact.
:type third_party_copyrights_statements: string
:return: model zip file path. The file path where the zip file is being saved
:rtype: string
"""

if add_apache_license == True and third_party_copyrights_statements is not None:
assert (
False
), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
Copy link
Collaborator

@dhrubo-os dhrubo-os May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is, we will still distribute the artifacts under apache 2.0 license but for MIT licenses models, we also need to add extra attribution file to deliberately mention about the contributors name. Am I missing anything?

model = SentenceTransformer(model_id)

if model_name is None:
Expand Down Expand Up @@ -870,6 +894,10 @@ def save_as_pt(
)
if add_apache_license:
self._add_apache_license_to_model_zip_file(zip_file_path)
if third_party_copyrights_statements is not None:
self._add_third_party_copyrights_statements_to_model_zip_file(
third_party_copyrights_statements, zip_file_path
)

self.torch_script_zip_file_path = zip_file_path
print("zip file is saved to ", zip_file_path, "\n")
Expand All @@ -883,6 +911,7 @@ def save_as_onnx(
model_output_path: str = None,
zip_file_name: str = None,
add_apache_license: bool = False,
third_party_copyrights_statements: Optional[str] = None,
) -> str:
"""
Download sentence transformer model directly from huggingface, convert model to onnx format,
Expand Down Expand Up @@ -911,10 +940,16 @@ def save_as_onnx(
:param add_apache_license:
Optional, whether to add a Apache-2.0 license file to model zip file
:type add_apache_license: string
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact.
:type third_party_copyrights_statements: string
:return: model zip file path. The file path where the zip file is being saved
:rtype: string
"""

if add_apache_license == True and third_party_copyrights_statements is not None:
assert (
False
), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
model = SentenceTransformer(model_id)

if model_name is None:
Expand Down Expand Up @@ -968,6 +1003,10 @@ def save_as_onnx(
)
if add_apache_license:
self._add_apache_license_to_model_zip_file(zip_file_path)
if third_party_copyrights_statements is not None:
self._add_third_party_copyrights_statements_to_model_zip_file(
third_party_copyrights_statements, zip_file_path
)

self.onnx_zip_file_path = zip_file_path
print("zip file is saved to ", zip_file_path, "\n")
Expand Down
81 changes: 81 additions & 0 deletions tests/ml_models/test_sentencetransformermodel_pytest.py
Original file line number Diff line number Diff line change
Expand Up @@ -221,6 +221,28 @@ def test_missing_files():
assert "Cannot find config.json" in str(exc_info.value)


def test_save_model_but_license_conflicts():
with pytest.raises(AssertionError) as exc_info:
test_model.save_as_pt(
sentences=["today is sunny"],
add_apache_license=True,
third_party_copyrights_statements="test statements",
)
assert (
"When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
in str(exc_info.value)
)

with pytest.raises(AssertionError) as exc_info:
test_model.save_as_onnx(
add_apache_license=True, third_party_copyrights_statements="test statements"
)
assert (
"When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it."
in str(exc_info.value)
)


def test_save_as_pt():
try:
test_model.save_as_pt(sentences=["today is sunny"])
Expand Down Expand Up @@ -658,5 +680,64 @@ def test_zip_model_with_license():
clean_test_folder(TEST_FOLDER)


def test_save_as_pt_with_third_party_copyrights_statements():
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model_format = "TORCH_SCRIPT"
torch_script_zip_file_path = os.path.join(TEST_FOLDER, "all-MiniLM-L6-v2.zip")
torch_script_expected_filenames = {
"all-MiniLM-L6-v2.pt",
"tokenizer.json",
"THIRD-PARTY",
}

clean_test_folder(TEST_FOLDER)
test_model18 = SentenceTransformerModel(
folder_path=TEST_FOLDER,
model_id=model_id,
)
third_party_copyrights_statements = "test statements"

test_model18.save_as_pt(
model_id=model_id,
sentences=["today is sunny"],
third_party_copyrights_statements=third_party_copyrights_statements,
)

compare_model_zip_file(
torch_script_zip_file_path, torch_script_expected_filenames, model_format
)

clean_test_folder(TEST_FOLDER)


def test_save_as_onnx_with_third_party_copyrights_statements():
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model_format = "TORCH_SCRIPT"
torch_script_zip_file_path = os.path.join(TEST_FOLDER, "all-MiniLM-L6-v2.zip")
torch_script_expected_filenames = {
"all-MiniLM-L6-v2.onnx",
"tokenizer.json",
"THIRD-PARTY",
}

clean_test_folder(TEST_FOLDER)
test_model19 = SentenceTransformerModel(
folder_path=TEST_FOLDER,
model_id=model_id,
)
third_party_copyrights_statements = "test statements"

test_model19.save_as_onnx(
model_id=model_id,
third_party_copyrights_statements=third_party_copyrights_statements,
)

compare_model_zip_file(
torch_script_zip_file_path, torch_script_expected_filenames, model_format
)

clean_test_folder(TEST_FOLDER)


clean_test_folder(TEST_FOLDER)
clean_test_folder(TESTDATA_UNZIP_FOLDER)
Loading
Loading