Training data for training a NER model #599
Hi @arjasethan1, great questions! Responding in order:

(1) Yes, Snorkel can definitely be used for this! (In fact, we have a paper on entity tagging which will be posted very soon...) At a high level, the goal of a Snorkel application is to train a classifier (the "end discriminative model") to classify possible or candidate extractions. In the intro tutorial, these candidates are potential mentions of spouse relations, but they could also just be mentions of single entities. For example, if you were just trying to tag mentions of people in the intro tutorial (part 2), instead of pairs of people that are spouses, you would instead just do:

```python
from snorkel.models import candidate_subclass

Person = candidate_subclass('Person', ['person1'])
```

Basically everything else would be the same for the rest of the process, except that your candidate class would now be different, so you would write slightly different types of labeling functions, etc. And yes, we do run CoreNLP's NER tagger during preprocessing, which tags a basic set of entity types (e.g. PERSON, ORG, etc.), but you can use Snorkel to tag more specific / less standard entity types where hand-labeled training data is not readily available, such as in your scenario!

(2) The main differences between Snorkel and DeepDive are (1) that Snorkel uses data programming (https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly) to model noisy training data, adding a new modeling stage for this (the generative model in intro tutorial 4), (2) Snorkel is written in Python and supports a simple Jupyter notebook interface, and (3) Snorkel focuses on independent classification problems rather than more complex factor graphs (at the moment, at least!).

Hope this answers your questions!
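To make the "data programming" idea above concrete, here is a minimal pure-Python sketch of labeling functions voting on candidates. This is not Snorkel's API: the `lf_*` functions and the unweighted `majority_label` combiner are made-up names for illustration (Snorkel's real generative model learns per-LF accuracies instead of taking a flat vote).

```python
# Toy labeling functions over single-token "Person" candidates.
# Each LF votes +1 (is a person), -1 (is not), or 0 (abstain).

def lf_capitalized(token):
    # Weak positive signal: capitalized tokens may be names.
    return 1 if token[:1].isupper() else 0

def lf_in_gazetteer(token, names={"Alice", "Bob"}):
    # Distant supervision: vote +1 for tokens in a known-name list.
    return 1 if token in names else 0

def lf_stopword(token, stops={"the", "and", "of"}):
    # Strong negative signal: common stopwords are not names.
    return -1 if token.lower() in stops else 0

LFS = [lf_capitalized, lf_in_gazetteer, lf_stopword]

def majority_label(token):
    # Unweighted majority vote; Snorkel replaces this step with a
    # learned generative model over the LF outputs.
    total = sum(lf(token) for lf in LFS)
    return 1 if total > 0 else (-1 if total < 0 else 0)
```

The resulting (noisy) labels are what the end discriminative model is trained on.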
Hi @ajratner, thanks to you and your team for building this great tool, making it public, and inspiring a lot of young minds like mine. That pretty much cleared my doubt, and it seems like Snorkel can solve my problem. But I have a few more doubts: Do I still need to run CoreNLP (in this scenario) to extract sentences, parts of speech, NER entities, etc., which might be good features in building the generative model, or can Snorkel extract its own features? And I want to train a NER model for picking two classes (maybe more), "Softwares" and "Vendors". Can I just give two classes in this way?
Thanks!!
Hi @arjasethan1, that's great to hear! We'd love to hear how your project goes (both positive and negative); it's great feedback for us in our research! You will still need to run at least some elements of CoreNLP (e.g. splitting the sentence into words, getting grammatical structure, etc.). We actually have an upcoming PR that will allow more customizability here in terms of what is run, but I would probably leave the defaults to start. And yes, right now Snorkel is geared for binary extraction, so each class would be its own Snorkel classification model. You can still run in the same notebook, or you could have two separate notebooks; either way should be fine!
Hi @arjasethan1 and @ajratner, I am working in the same area. Here is my approach (please correct me or suggest anything I am missing): I used POS tags with proper nouns. Then I wrote a set of labeling functions based on rules and distant supervision.
These features would be very useful:
@thammegowda sounds like a great start! And glad the regex matcher came in handy :) Let's check back on this once the PR for this is in; we can always add more functionality!
Hi @ajratner, thanks for your reply; that makes a lot of sense. I will definitely let you guys know how my work went when it's done. Thanks @thammegowda for your suggestions; I thought of the same approach. Most of the entities which I want to pick are also nouns, so they can be good candidates to start with. Thanks for letting me know about
Hi, I tried
Thanks,
I found it! I misunderstood a bit about the gold labels; it worked when I changed `annotator_name` in
Thanks.
Yes!

```python
from snorkel.annotations import load_gold_labels

L_gold_dev = load_gold_labels(session, annotator_name=os.environ['USER'], split=1)
```
We have a note somewhere to make this clearer in the tutorial, will add soon!
Hi, I finished writing some labeling functions for my use case and it seems to be working well. I want to get this data (in the form of sentences with labels saying which word is a software and which is not, like the data I am using in the Viewer) to train my NER model. How can I do that? I tried querying the database to see if that information is saved in any of the tables, but I am not able to figure it out. Any help would be appreciated. Thanks!!
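One common way to turn labeled mention spans into NER training data, once you have them, is BIO tagging. A minimal sketch, assuming you can get each sentence's tokens and its labeled `(start, end)` token spans out of your pipeline (the `to_bio` helper and the `SOFTWARE` tag name here are made up for illustration, not part of Snorkel's schema):

```python
def to_bio(tokens, spans, tag="SOFTWARE"):
    """Convert half-open (start, end) token-index spans into
    (token, BIO-label) pairs suitable for NER training."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-" + tag       # first token of the mention
        for i in range(start + 1, end):
            labels[i] = "I-" + tag       # continuation tokens
    return list(zip(tokens, labels))

# Example: "Microsoft Office" labeled as a software mention.
pairs = to_bio(["I", "installed", "Microsoft", "Office", "yesterday"],
               [(2, 4)])
```

The resulting pairs can be written out in the two-column CoNLL-style format that most NER trainers (including Stanford CoreNLP's) accept.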
@arjasethan1
So a
Anyway, I'm struggling with overlapping candidates; in the SwellShark paper they use a multinomial model with overlapping spansets. So far I could create those spansets, but I don't know how to feed them to the generative model.
Hi @fsonntag, the current implementation of the generative model in Snorkel doesn't have multinomial support, but that should be added soon (see issue #604). In the meantime, implementing some heuristic as you indicated can work pretty well. Another option is using the simple
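For reference, one such heuristic for overlapping candidates is to greedily keep the longest span in each overlapping group. A pure-Python sketch (the `resolve_overlaps` helper is a made-up name, and "prefer longer spans" is just one possible tie-breaking policy):

```python
def resolve_overlaps(spans):
    """Greedy heuristic: sort candidate spans longest-first and
    keep a span only if it does not overlap any already-kept span.
    Spans are half-open (start, end) token or character offsets."""
    kept = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)
```

Replacing the length key with a per-span score (e.g. the marginal probability from the generative model) gives a highest-confidence variant of the same idea.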
Thanks a lot for the answer, @jason-fries.
@fsonntag the linked multinomial Naive Bayes code is trained over the list of all spanset matrices `X_hat`, which can be of different sizes. If you use
Works without a flaw, thanks for the feedback :)
Hi @fsonntag, thanks for your reply!!
Check out this:
Thank you very much @fsonntag for pointing me to this.
Related to @arjasethan1's first question and the corresponding answer by @ajratner, I'm wondering what the appropriate function is for doing the RNN+LSTM training towards the end for mentions of single entities/events/etc. I'm running a slightly modified version of the intro tutorial. I'm mostly concerned about the context obtained within the
I'm using a single-event candidate:
and in the "Training an End Extraction Model" section, I'm currently using a slightly modified
@arturomp They also have a tagging RNN implemented for that purpose, but I think it's slightly outdated and you also have to modify the code a little to get it running. |
@thammegowda |
@pidugusundeep I think the regex matcher matches one token at a time (I could be wrong), so it won't be able to match two tokens, since
I hope that helps.
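One workaround for matching multi-token names is to run the regex yourself over the joined text of every contiguous token window. A hedged pure-Python sketch (the `match_multi_token` helper is a made-up name, not Snorkel's matcher API):

```python
import re

def match_multi_token(tokens, pattern):
    """Return (start, end) token-index spans whose space-joined
    text fully matches `pattern`, checking every contiguous window."""
    regex = re.compile(pattern)
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            if regex.fullmatch(" ".join(tokens[i:j])):
                spans.append((i, j))
    return spans
```

This is O(n^2) in sentence length, which is usually fine per sentence; note that joining on single spaces assumes the pattern is written against space-separated tokens rather than the original character offsets.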
@thammegowda @arjasethan1 Can you just brief me on how to load gold-labeled data? I have hand-labeled data for a few observations in a file. How am I supposed to use it to perform analysis on the model? Thanks.
@Anirudh-Muthukumar did you try the code snippet I posted in #599 (comment)?
Yes @thammegowda, I tried the same snippet here.

```python
from snorkel.annotations import load_gold_labels
```

This is how my script looks. Output:
Are there any changes to be made?
Hi,
I have two questions about the ways of using Snorkel:
I am trying to train a Stanford CoreNLP NER model to pick all the software names and vendor names from the comments and descriptions of software (raw text). For this I tried to do some manual labeling, which is very time consuming, but I can clearly see the improvement in accuracy as I keep increasing the training data. So I am trying to use Snorkel to produce some training data for me, but it seems (in the tutorials) that Snorkel already uses CoreNLP NER models and generates training data for extracting relations between two entities. Is there a way to use Snorkel to create training data for extracting entities rather than their relations?
I have also used DeepDive to extract relations between entities, and when I skim through the tutorials I am not able to find much difference between DeepDive and Snorkel. Is Snorkel the Python version of DeepDive?