# Training the Neural Network Parser
An evaluation of the REACH information extraction task indicated that much of the runtime is dominated by `CoreNLPProcessor` parsing text. In an effort to improve this runtime, we considered alternatives, namely `FastNLPProcessor` and the neural network `DependencyParser`. As a sanity check and to provide concrete results, the Penn Treebank Wall Street Journal (WSJ) and GENIA corpora were used to test each parser. Both corpora were split into train, dev, and test partitions as follows:
- For GENIA, the division by David McClosky was used; it includes a `future_use` partition that was not used here.
- Our distribution of the WSJ corpus is split into sections labeled `00` to `24`. The standard partitioning was used: `02`-`21` for train, `{01, 22, 24}` for dev, `23` for test, with `00` discarded.
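As a concrete illustration of the split above, the section lists can be assembled programmatically. This is a hedged Python sketch only; the directory layout (`wsj/<section>/*.mrg`) is hypothetical, and the real paths are configured in `reference.conf`:

```python
import glob
import os

# Standard WSJ partitioning: 02-21 train, {01, 22, 24} dev, 23 test, 00 discarded.
TRAIN = [f"{s:02d}" for s in range(2, 22)]
DEV = ["01", "22", "24"]
TEST = ["23"]

def gather(sections, wsj_root="wsj"):
    """Collect the treebank files for the given WSJ sections, in order.

    The wsj_root layout is a hypothetical example; adjust to your distribution.
    """
    files = []
    for sec in sections:
        files.extend(sorted(glob.glob(os.path.join(wsj_root, sec, "*.mrg"))))
    return files
```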
The Dependency Parser required the corpora to use Basic Dependencies, and details on converting Penn Treebank to Basic Dependencies can be found on this wiki page.
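For reference, a Basic Dependencies corpus in CoNLL-X format stores one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with sentences separated by blank lines. A made-up two-token sentence looks like:

```
1	Proteins	protein	NNS	NNS	_	2	nsubj	_	_
2	interact	interact	VBP	VBP	_	0	root	_	_
```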
Processors represents text as `Document` objects, which in turn contain an array of `Sentence` objects. Reading in corpora in Basic Dependency format was a task in itself; the code was heavily adapted from the `DocumentSerializer` class and manifested in the `ConllxReader` class. Additional utilities were included in `CoreNLPUtils` (the addition of `sentenceToCoreMap`, `sentenceToAnnotation`, and `docToAnnotation`), `EvaluateUtils` (for calculating precision, recall, etc.), and `ParserUtils` (for performing the training and model saving itself). `reference.conf` includes the paths to all relevant files, most notably the train, dev, test, and model files for each model. Note that this may still only be in the `nn-parser-training` branch.
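To illustrate what a `ConllxReader`-style loader has to do, here is a minimal Python sketch (not the actual Scala implementation, which builds Processors `Sentence` objects): split the file on blank lines, then pull the FORM, HEAD, and DEPREL columns out of each ten-column token line.

```python
def read_conllx(text):
    """Parse CoNLL-X text into sentences of (form, head, deprel) triples.

    Illustrative sketch only; the real ConllxReader produces Processors
    Document/Sentence objects rather than plain tuples.
    """
    sentences = []
    for block in text.strip().split("\n\n"):
        sentence = []
        for line in block.splitlines():
            cols = line.split("\t")
            # CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
            form, head, deprel = cols[1], int(cols[6]), cols[7]
            sentence.append((form, head, deprel))
        sentences.append(sentence)
    return sentences

# A tiny made-up example sentence in CoNLL-X format.
sample = (
    "1\tProteins\tprotein\tNNS\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tinteract\tinteract\tVBP\tVBP\t_\t0\troot\t_\t_"
)
```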
The evaluated models included `CoreNLPProcessor`, `FastNLPProcessor`, and the `DependencyParser` with various configurations. The neural network configurations required word embeddings; the Word2Vec embeddings for Gigaword and PubMed Open Access were chosen because their content is relatively similar to WSJ and GENIA respectively. Two models were trained as a sanity check for this intuition: (1) training/testing on WSJ with Gigaword embeddings, and (2) training/testing on GENIA with PubMed Open Access embeddings. The intended "best" model, which uses both corpora and both sets of embeddings, was trained with 5 different multiples of the GENIA corpus. Given how much smaller the GENIA corpus is relative to WSJ, the GENIA corpus was concatenated onto the WSJ corpus 1, 2, 3, 4, and 5 times to create 5 different models, each using the combined WSJ+GENIA*k corpus (k being the number of copies of the GENIA corpus) with Gigaword+PubMed embeddings. These embeddings were generated with Word2Vec with dimension 200 and all other settings left at their defaults. (All relevant Gigaword-PubMed files can be found here: `/net/kate/storage/data/nlp/corpora/word2vec/gigaword-pubmed/`. See `reference.conf` for more information.)
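The WSJ+GENIA*k construction amounts to simple concatenation of the training files. A hedged sketch of the idea (the real pipeline works on CoNLL-X files on disk, with paths taken from `reference.conf`, rather than in-memory strings):

```python
def build_combined_corpus(wsj_text, genia_text, k):
    """Concatenate k copies of the GENIA training data onto WSJ.

    Sketch of the WSJ+GENIA*k oversampling scheme. CoNLL-X sentences
    are blank-line separated, so joining with a blank line keeps the
    combined corpus well-formed.
    """
    return "\n\n".join([wsj_text] + [genia_text] * k)
```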
The results of training on these models are as follows:
- Vanilla `CoreNLPProcessor`, tested on:
  - WSJ train file: tp: 716313, fp: 83544, tn: 0, fn: 193883; Precision: 0.8955513298002018, Recall: 0.786987637827457, F1: 0.8377670165778488
  - GENIA train file: tp: 263133, fp: 58407, tn: 0, fn: 96928; Precision: 0.8183523045344281, Recall: 0.7308011698017836, F1: 0.7721027404595945
  - WSJ test file: tp: 40867, fp: 6421, tn: 0, fn: 13401; Precision: 0.8642150228387752, Recall: 0.7530588929018943, F1: 0.8048170467525306
  - GENIA test file: tp: 24495, fp: 6047, tn: 0, fn: 9784; Precision: 0.8020103464082248, Recall: 0.7145774380816243, F1: 0.7557735918915167
- Vanilla `FastNLPProcessor`, tested on:
  - WSJ train file: tp: 834228, fp: 75968, tn: 0, fn: 75968; Precision: 0.9165366580384884, Recall: 0.9165366580384884, F1: 0.9165366580384884
  - GENIA train file: tp: 322992, fp: 37069, tn: 0, fn: 37069; Precision: 0.8970480001999661, Recall: 0.8970480001999661, F1: 0.8970480001999661
  - WSJ test file: tp: 47476, fp: 6792, tn: 0, fn: 6792; Precision: 0.8748433699417705, Recall: 0.8748433699417705, F1: 0.8748433699417705
  - GENIA test file: tp: 29522, fp: 4757, tn: 0, fn: 4757; Precision: 0.8612269902855976, Recall: 0.8612269902855976, F1: 0.8612269902855976
- Train on WSJ with Gigaword embeddings, tested on:
  - WSJ test file: tp: 6585, fp: 47683, tn: 0, fn: 47683; Precision: 0.12134222746369869, Recall: 0.12134222746369869, F1: 0.12134222746369869
  - GENIA test file: tp: 4040, fp: 30239, tn: 0, fn: 30239; Precision: 0.11785641354765308, Recall: 0.11785641354765308, F1: 0.11785641354765308
- Train on GENIA with PMC embeddings, tested on:
  - WSJ test file: tp: 10042, fp: 44226, tn: 0, fn: 44226; Precision: 0.18504459349893124, Recall: 0.18504459349893124, F1: 0.18504459349893124
  - GENIA test file: tp: 6323, fp: 27956, tn: 0, fn: 27956; Precision: 0.18445695615391347, Recall: 0.18445695615391347, F1: 0.18445695615391347
- Train on WSJ+GENIA*1 with Gigaword+PMC embeddings, tested on:
  - WSJ test file: tp: 8410, fp: 45858, tn: 0, fn: 45858; Precision: 0.15497162231886194, Recall: 0.15497162231886194, F1: 0.15497162231886194
  - GENIA test file: tp: 5346, fp: 28933, tn: 0, fn: 28933; Precision: 0.1559555412935033, Recall: 0.1559555412935033, F1: 0.1559555412935033
- Train on WSJ+GENIA*2 with Gigaword+PMC embeddings, tested on:
  - WSJ test file: tp: 9496, fp: 44772, tn: 0, fn: 44772; Precision: 0.17498341564089334, Recall: 0.17498341564089334, F1: 0.17498341564089334
  - GENIA test file: tp: 5978, fp: 28301, tn: 0, fn: 28301; Precision: 0.17439248519501735, Recall: 0.17439248519501735, F1: 0.17439248519501735
- Train on WSJ+GENIA*3 with Gigaword+PMC embeddings, tested on:
  - WSJ test file: tp: 9004, fp: 45264, tn: 0, fn: 45264; Precision: 0.1659172993292548, Recall: 0.1659172993292548, F1: 0.1659172993292548
  - GENIA test file: tp: 5790, fp: 28489, tn: 0, fn: 28489; Precision: 0.16890807783190875, Recall: 0.16890807783190875, F1: 0.16890807783190875
- Train on WSJ+GENIA*4 with Gigaword+PMC embeddings, tested on:
  - WSJ test file: tp: 9448, fp: 44820, tn: 0, fn: 44820; Precision: 0.17409891648853837, Recall: 0.17409891648853837, F1: 0.17409891648853837
  - GENIA test file: tp: 5794, fp: 28485, tn: 0, fn: 28485; Precision: 0.16902476735027278, Recall: 0.16902476735027278, F1: 0.16902476735027278
- Train on WSJ+GENIA*5 with Gigaword+PMC embeddings, tested on:
  - WSJ test file: tp: 9912, fp: 44356, tn: 0, fn: 44356; Precision: 0.18264907496130317, Recall: 0.18264907496130317, F1: 0.18264907496130317
  - GENIA test file: tp: 5992, fp: 28287, tn: 0, fn: 28287; Precision: 0.1748008985092914, Recall: 0.1748008985092914, F1: 0.1748008985092914
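The metrics above follow the usual definitions: P = tp/(tp+fp), R = tp/(tp+fn), F1 = 2PR/(P+R). Note that in rows where fp = fn (every predicted arc is either right or wrong), precision and recall are necessarily equal, so the three values collapse to a single number. A minimal Python sketch of the calculation (the actual utility is the Scala `EvaluateUtils` class), reproducing the `CoreNLPProcessor` WSJ test row:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# CoreNLPProcessor on the WSJ test file (counts from the table above).
p, r, f1 = prf1(tp=40867, fp=6421, fn=13401)
```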
`FastNLPProcessor` has already been noted as having even better results than `CoreNLPProcessor` and has a faster runtime, so this was an improvement that required little change to the existing code in REACH. Strangely, the neural network models have very poor performance. Whether this is an implementation error or a true reflection of their performance is being looked into.