Training data for training a NER model #599
Hi @arjasethan1, great questions! Responding in order:

(1) Yes, Snorkel can definitely be used for this! (In fact, we have a paper on entity tagging which will be posted very soon...) At a high level, the goal of a Snorkel application is to train a classifier (the "end discriminative model") to classify possible or candidate extractions. In the intro tutorial, these candidates are potential mentions of spouse relations, but they could also just be mentions of single entities. For example, if you were just trying to tag mentions of people in the intro tutorial (part 2), instead of pairs of people that are spouses, you would instead just do:

```python
from snorkel.models import candidate_subclass

Person = candidate_subclass('Person', ['person1'])
```

Basically everything else would be the same for the rest of the process, except that your candidate class would now be different, so you would write slightly different types of labeling functions, etc. And yes, we do run CoreNLP's NER tagger during preprocessing, which tags a basic set of entity types (e.g. PERSON, ORG, etc.), but you can use Snorkel to tag more specific / less standard entity types where hand-labeled training data is not readily available, such as in your scenario!

(2) The main differences between Snorkel and DeepDive are (1) that Snorkel uses data programming (https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly) to model noisy training data, adding a new modeling stage for this (the generative model in intro tutorial 4), (2) Snorkel is written in Python and supports a simple Jupyter notebook interface, and (3) Snorkel focuses on independent classification problems rather than more complex factor graphs (at the moment, at least!).

Hope this answers your questions!
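To make the "data programming" idea above concrete, here is a minimal pure-Python sketch of labeling functions voting on candidates. This is not Snorkel's API: the `lf_*` functions and the unweighted `majority_label` combiner are made-up names for illustration (Snorkel's real generative model learns per-LF accuracies instead of taking a flat vote).

```python
# Toy labeling functions over single-token "Person" candidates.
# Each LF votes +1 (is a person), -1 (is not), or 0 (abstain).

def lf_capitalized(token):
    # Weak positive signal: capitalized tokens may be names.
    return 1 if token[:1].isupper() else 0

def lf_in_gazetteer(token, names={"Alice", "Bob"}):
    # Distant supervision: vote +1 for tokens in a known-name list.
    return 1 if token in names else 0

def lf_stopword(token, stops={"the", "and", "of"}):
    # Strong negative signal: common stopwords are not names.
    return -1 if token.lower() in stops else 0

LFS = [lf_capitalized, lf_in_gazetteer, lf_stopword]

def majority_label(token):
    # Unweighted majority vote; Snorkel replaces this step with a
    # learned generative model over the LF outputs.
    total = sum(lf(token) for lf in LFS)
    return 1 if total > 0 else (-1 if total < 0 else 0)
```

The resulting (noisy) labels are what the end discriminative model is trained on.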
Hi @ajratner, thanks to you and your team for building this great tool, making it public, and inspiring a lot of young minds like mine. That pretty much cleared my doubt, and it seems like Snorkel can solve my problem. But I have a few more doubts: Do I still need to run CoreNLP (in this scenario) to extract sentences, parts of speech, NER entities, etc., which might be good features in building the generative model, or can Snorkel extract its own features? And I want to train a NER model for picking two classes (maybe more), "Softwares" and "Vendors". Can I just give two classes in this way?
Thanks!!
Hi @arjasethan1, that's great to hear! We'd love to hear how your project goes (both positive and negative); it's great feedback for us in our research! You will still need to run at least some elements of CoreNLP (e.g. splitting the sentence into words, getting grammatical structure, etc.). We actually have an upcoming PR that will allow more customizability here in terms of what is run, but I would probably leave the defaults to start. And yes, right now Snorkel is geared for binary extraction, so each class would be its own Snorkel classification model. You can still run in the same notebook, or you could have two separate notebooks; either way should be fine!
Hi @arjasethan1 and @ajratner, I am working in the same area. Here is my approach (please correct me or suggest anything I am missing): I used POS tags with proper nouns. Then I wrote a set of labeling functions based on rules and distant supervision.
These features would be very useful:
@thammegowda sounds like a great start! And glad the regex matcher came in handy :) Let's check back on this once the PR for this is in; we can always add more functionality!
Hi @ajratner, thanks for your reply; that makes a lot of sense. I will definitely let you guys know how my work went when it's done. Thanks @thammegowda for your suggestions; I thought of the same approach. Most of the entities which I want to pick are also nouns, so they can be good candidates to start with. Thanks for letting me know about
Hi, I tried
Thanks,
I found it! I misunderstood a bit about the gold labels; it worked when I changed `annotator_name` in
Thanks.
Yes!

```python
from snorkel.annotations import load_gold_labels

L_gold_dev = load_gold_labels(session, annotator_name=os.environ['USER'], split=1)
```
We have a note somewhere to make this clearer in the tutorial, will add soon!
Hi, I finished writing some labeling functions for my use case and it seems to be working well. I want to get this data (in the form of sentences with labels saying which word is a software and which is not, like the data I am using in the Viewer) to train my NER model. How can I do that? I tried querying the database to see if that information is saved in any of the tables, but I am not able to figure it out. Any help would be appreciated. Thanks!!
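One common way to turn labeled mention spans into NER training data, once you have them, is BIO tagging. A minimal sketch, assuming you can get each sentence's tokens and its labeled `(start, end)` token spans out of your pipeline (the `to_bio` helper and the `SOFTWARE` tag name here are made up for illustration, not part of Snorkel's schema):

```python
def to_bio(tokens, spans, tag="SOFTWARE"):
    """Convert half-open (start, end) token-index spans into
    (token, BIO-label) pairs suitable for NER training."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-" + tag       # first token of the mention
        for i in range(start + 1, end):
            labels[i] = "I-" + tag       # continuation tokens
    return list(zip(tokens, labels))

# Example: "Microsoft Office" labeled as a software mention.
pairs = to_bio(["I", "installed", "Microsoft", "Office", "yesterday"],
               [(2, 4)])
```

The resulting pairs can be written out in the two-column CoNLL-style format that most NER trainers (including Stanford CoreNLP's) accept.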
@arjasethan1
So a
Anyway, I'm struggling with overlapping candidates; in the SwellShark paper they use a multinomial model with overlapping spansets. So far I could create those spansets, but I don't know how to feed them to the generative model.
Hi @fsonntag, the current implementation of the generative model in Snorkel doesn't have multinomial support, but that should be added soon (see issue #604). In the meantime, implementing some heuristic as you indicated can work pretty well. Another option is using the simple
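For reference, one such heuristic for overlapping candidates is to greedily keep the longest span in each overlapping group. A pure-Python sketch (the `resolve_overlaps` helper is a made-up name, and "prefer longer spans" is just one possible tie-breaking policy):

```python
def resolve_overlaps(spans):
    """Greedy heuristic: sort candidate spans longest-first and
    keep a span only if it does not overlap any already-kept span.
    Spans are half-open (start, end) token or character offsets."""
    kept = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)
```

Replacing the length key with a per-span score (e.g. the marginal probability from the generative model) gives a highest-confidence variant of the same idea.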
Thanks a lot for the answer, @jason-fries.
@fsonntag the linked multinomial Naive Bayes code is trained over the list of all spanset matrices `X_hat`, which can be of different sizes. If you use
Works without a flaw, thanks for the feedback :)
Hi @fsonntag, thanks for your reply!!
Check out this:
Thank you very much @fsonntag for pointing me to this.
Related to @arjasethan1's first question and the corresponding answer by @ajratner, I'm wondering what the appropriate function is for doing the RNN+LSTM training towards the end for mentions of single entities/events/etc. I'm running a slightly modified version of the intro tutorial. I'm mostly concerned about the context obtained within the
I'm using a single-event candidate:
and in the "Training an End Extraction Model" section, I'm currently using a slightly modified
@arturomp They also have a tagging RNN implemented for that purpose, but I think it's slightly outdated and you also have to modify the code a little to get it running. |
@thammegowda |
@pidugusundeep I think the regex matcher matches one token at a time (I could be wrong), so it won't be able to match two tokens, since
I hope that helps.
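One workaround for matching multi-token names is to run the regex yourself over the joined text of every contiguous token window. A hedged pure-Python sketch (the `match_multi_token` helper is a made-up name, not Snorkel's matcher API):

```python
import re

def match_multi_token(tokens, pattern):
    """Return (start, end) token-index spans whose space-joined
    text fully matches `pattern`, checking every contiguous window."""
    regex = re.compile(pattern)
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            if regex.fullmatch(" ".join(tokens[i:j])):
                spans.append((i, j))
    return spans
```

This is O(n^2) in sentence length, which is usually fine per sentence; note that joining on single spaces assumes the pattern is written against space-separated tokens rather than the original character offsets.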
@thammegowda @arjasethan1 Can you just brief me on how to load gold-labeled data? I have hand-labeled data for a few observations in a file. How am I supposed to use it to perform analysis on the model? Thanks.
@Anirudh-Muthukumar did you try the code snippet I posted in #599 (comment)?
Yes @thammegowda, I tried the same snippet here.

```python
from snorkel.annotations import load_gold_labels
```

This is how my script looks. Output:
Are there any changes to be made?
Hi,
I have two questions about the ways of using Snorkel:
I am trying to train a Stanford CoreNLP NER model to pick all the software names and vendor names from the comments and descriptions of software (raw text). For this I tried to do some manual labeling, which is very time consuming, but I can clearly see the improvement in accuracy as I keep increasing the training data. So I am trying to use Snorkel to produce some training data for me, but it seems (in the tutorials) that Snorkel already uses CoreNLP NER models and generates training data for extracting relations between two entities. Is there a way to use Snorkel to create training data for extracting entities rather than their relations?
I have also used DeepDive to extract relations between entities, and when I skim through the tutorials I am not able to find much difference between DeepDive and Snorkel. Is Snorkel the Python version of DeepDive?