-
Notifications
You must be signed in to change notification settings - Fork 676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17023: Use Modern NLP Models via ONNX and Apache OpenNLP with Solr #1999
base: main
Are you sure you want to change the base?
Conversation
Check it out @jzonthemtn |
configurations.all { | ||
resolutionStrategy { | ||
force 'org.apache.opennlp:opennlp-tools:2.2.0' | ||
force 'org.apache.opennlp:opennlp-dl:2.2.0' | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need for force a resolution? This shouldn't be necessary at all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need apache/lucene#448 to be merged to get us to 2.2, or we ahve to work around Lucene... Is there a better way? This is what Jeff and I hacked in..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try org.apache.opennlp:opennlp*=2.2.0
in versions.props and remove this from here. It should force the upgrade to 2.2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of errors when I took out the resolutionStrategy
:
* What went wrong:
Could not determine the dependencies of task ':solr:modules:analysis-extras:compileJava'.
> Could not resolve all task dependencies for configuration ':solr:modules:analysis-extras:compileClasspath'.
> Could not resolve org.apache.opennlp:opennlp-tools:2.2.0.
Required by:
project :solr:modules:analysis-extras
project :solr:modules:analysis-extras > org.apache.opennlp:opennlp-dl:2.2.0
> Cannot find a version of 'org.apache.opennlp:opennlp-tools' that satisfies the version constraints:
Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.opennlp:opennlp-tools:2.2.0'
Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:core:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:solrj:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.lucene:lucene-analysis-opennlp:9.8.0' (compile) --> 'org.apache.opennlp:opennlp-tools:1.9.1'
also, feel free to push up changes to this branch ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to run ./gradlew --write-locks
to make sure it takes into account versions.props
changes.
versions.props
Outdated
@@ -70,3 +70,4 @@ org.semver4j:semver4j=5.2.1 | |||
org.slf4j:*=2.0.9 | |||
org.xerial.snappy:snappy-java=1.1.10.5 | |||
software.amazon.awssdk:*=2.20.155 | |||
org.apache.opennlp:opennlp-dl=2.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try org.apache.opennlp:opennlp*=2.2.0
Thanks @risdenk , your suggestions worked. One slight wrinkle that I don't know how to deal with is that the |
|
This is after uploading the model to the PackageStore and reference it in there! |
@test "Check lifecycle of sentiment classification" { | ||
|
||
# GPU versions is linux and windows only, not OSX. So swap jars. | ||
rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just always run on cpu? Why need to test if there is a capable GPU or not? Could be on a VM without a GPU too even w/ linux and windows...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the purposes of this, I totally agree.. I think I need some Gradle help to figure this out ;-( I've banged at it with no luck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just CPU is definitely a better choice here. And just a note, ONNX Runtime does not currently support Mac silicon so this will only work on x64 machines.
OpenNLP brings in the onnxruntime-gpu
dependency. I will open a PR in OpenNLP to use onnxruntime
instead for all of the stated reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is a workaround, please let me know...!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release is done. New artifacts should be available soon ;-) (not depending on gpu anymore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!!
rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar | ||
# restore | ||
#curl --insecure -o ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime-1.14.0.jar https://repo1.maven.org/maven2/com/microsoft/onnxruntime/onnxruntime/1.14.0/onnxruntime-1.14.0.jar | ||
cp /Users/epugh/Documents/projects/solr-epugh/onnxruntime-1.14.0.jar ${SOLR_TIP}/modules/analysis-extras/lib/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using your path...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... I was in the back of David Smiley's car in the wilds of New Brunswick Canada while I was working on this and didn't want to burn up my data plan ;-). Likewise I need to figure out a better apporoach with the model since it's 600 mb.... I don't know if I can find a toy one that is super small, or if I add docs to download it from hugging facet etc. Longer term, I doin't even think this bats test gets committed the way it is because of all the external dependencies.. Though.. if I had a super small model that I could check into Git, maybe that owuld be okay? Like we have a binary package for some of the package tests..
@risdenk @jzonthemtn so, any ideas how to force the |
also, i am seeing two errors on looking up licenses...
|
@jzonthemtn did open https://issues.apache.org/jira/browse/OPENNLP-1515, but that will require a release of OpenNLP before we can piggy back on it |
I think the first draft is done, and it "works". There are some caveats and things to be figured out, like the |
…n the OpenNLP project
Hmm, interesting. So we have:
So if the classes here were built as an independent plugin (with minimum Java17) and then deployed (with the relevant dependencies) into a Solr setup running Java17 with the original Solr artefacts (built with Java11) -- I wonder if that would work? Also noting that apache/lucene#579 bumped Lucene to Java17 on |
I guess this should work. |
Feels like what we should be doing is having Solr 10 target Java 17 since Lucene 10 will require it, and then this code goes on Solr 10, but not on Solr 9. This lets us have some more time to experiment with out dealing with the headaches of supporting an official release in the 9.x line (backcompat and the rest)....?? |
I concur. Also a nice motivation for targeting Java 17 i.e. specific example of functionality that it would unlock. And in the meantime "independent plugin" approaches remain a possibility in the community, perhaps even in the https://github.com/apache/solr-sandbox if someone wanted to pursue that (haven't checked how that is built, just kinda "name dropping" |
Lucene 9 requires older version of Java than the minimum required version that OpenNLP requires. That means that this PR is pending a release of Lucene 10, and the adoption of Lucene 10 by Solr. It would be interesting to think about if there was a way for Solr |
Resolved Conflicts: versions.lock versions.props
@@ -49,6 +49,7 @@ org.apache.httpcomponents:httpmime=4.5.14 | |||
org.apache.kerby:*=1.0.1 | |||
org.apache.logging.log4j:*=2.21.0 | |||
org.apache.lucene:*=9.9.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporarily within this pull request (pre-merge) we could change this to a Lucene 10 prerelease based on the "Update Lucene prerelease" steps in https://github.com/apache/solr/blob/main/help/dependencies.txt ...
... might be worth waiting though until Lucene 9.10 is out and https://issues.apache.org/jira/browse/SOLR-17157 has upgraded Solr to use it i.e. then solr/main
will be closer to lucene/main
than it is right now.
Technically I guess Solr |
@cpoerschke #1510 might be helpful here. I have a few wip prs for newer jdks |
This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution! |
This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again. |
this is still valid ;-) |
https://issues.apache.org/jira/browse/SOLR-17023
Description
Code and BATS test that demonstrates downloading a sentiment classification model from Huggingface, converting it to Onnx model, uploading the model to Solr's FileStore, and then configuring a collection and associated pipeline to use the model for enriching content.
Solution
New DocumentCategorizerUpdateProcessorFactory that interacts with the model.
Tests
BATS test. Check this PR out and run:
./gradlew integrationTests --tests test_opennlp.bats
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.