Merge pull request #5 from BramVanroy/stanford
Improved documentation, added tests for spacy-stanfordnlp
Bram Vanroy authored Feb 2, 2020
2 parents 98061d8 + 958eb80 commit e3cd567
Showing 12 changed files with 444 additions and 135 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
Pipfile.lock
TODO.rst
dummy.py

# .idea (JetBrains)
.idea/
18 changes: 14 additions & 4 deletions HISTORY.rst
@@ -2,6 +2,16 @@
History
#######

**************************
1.2.0 (February 2nd, 2020)
**************************
* **BREAKING**: :code:`._.conll` now outputs a dictionary per sentence (:code:`{fieldname: [value1, value2, ...]}`),
  and a list of such dictionaries for a Doc
* Added a :code:`conversion_maps` argument where one can define a mapping to have better control over the model's tagset
(see the advanced example in README.rst)
* Tests for usage with :code:`spacy-stanfordnlp`
* Better documentation, including advanced example

**************************
1.1.0 (January 21st, 2020)
**************************
@@ -15,10 +25,10 @@ Minor documentation changes for PyPI.
**************************
1.0.0 (January 13th, 2020)
**************************
- Complete overhaul. Can now be used as a custom pipeline component in spaCy.
- Spacy2ConllParser is now deprecated.
- The CLI interface does not rely on Spacy2ConllParser anymore but uses the custom pipeline component instead.
- Added :code:`-e|--no_force_counting` to the CLI options. By default, when using :code:`-d|--include_headers`,
* Complete overhaul. Can now be used as a custom pipeline component in spaCy.
* Spacy2ConllParser is now deprecated.
* The CLI interface does not rely on Spacy2ConllParser anymore but uses the custom pipeline component instead.
* Added :code:`-e|--no_force_counting` to the CLI options. By default, when using :code:`-d|--include_headers`,
parsed sentences will be numbered incrementally. This can be disabled so that the sentence numbering depends on how
spaCy segments the sentences.

3 changes: 3 additions & 0 deletions Pipfile
@@ -8,4 +8,7 @@ spacy = "*"
packaging = "*"

[dev-packages]
torch = "*"
spacy-stanfordnlp = "*"
pytest = "*"
pygments = "*"
166 changes: 151 additions & 15 deletions README.rst
@@ -1,25 +1,43 @@
===========================
Parsing to CoNLL with spaCy
===========================
================================================
Parsing to CoNLL with spaCy or spacy-stanfordnlp
================================================
This module allows you to parse a text to `CoNLL-U format`_. You can use it as a command line tool, or embed it in your
own scripts by adding it as a custom component to a spaCy pipeline.
own scripts by adding it as a custom component to a spaCy or spacy-stanfordnlp pipeline.

Note that the module simply takes spaCy output and puts it in a formatted string adhering to the linked CoNLL-U format. It does not yet do an explicit tagset mapping of spaCy to UD tags. The output tags depend on the spaCy model used.
Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked CoNLL-U
format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you
to use this library in combination with `spacy_stanfordnlp`_, which is a spaCy interface using :code:`stanfordnlp` and
its models behind the scenes. Those models use the Universal Dependencies formalism. See the remainder of this README
for more information and usage guidelines.

.. _`CoNLL-U format`: https://universaldependencies.org/format.html
.. _`spacy_stanfordnlp`: https://github.com/explosion/spacy-stanfordnlp

============
Installation
============

Requires `spaCy`_ and an `installed spaCy language model`_. When using the module from the command line, you also need the :code:`packaging` package.
Requires `spaCy`_ and an `installed spaCy language model`_. When using the module from the command line, you also need
the :code:`packaging` package. See section `spaCy`_ for usage.

Because `spaCy's models`_ are not necessarily trained on Universal Dependencies conventions, their output labels are
not UD either. By using :code:`spacy-stanfordnlp`, we get the easy-to-use interface of spaCy as a wrapper around
:code:`stanfordnlp` and its models that *are* trained on UD data. If you want to use the Stanford NLP models, you also
need :code:`spacy-stanfordnlp` and `a corresponding model`_. See the section `spacy-stanfordnlp`_ for usage.

**NOTE**: :code:`spacy-stanfordnlp` is not automatically installed as a dependency for this library, because it might be
too much overhead for those who don't need UD. If you wish to use its functionality, you have to install it manually.
By default, only :code:`spacy` and :code:`packaging` are installed as dependencies.

To install the library, simply use pip.

.. code:: bash

    pip install spacy_conll

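If you also want the Stanford NLP models, a minimal setup sketch might look as follows (this assumes you have already
run :code:`pip install spacy-stanfordnlp` and want the English model; available model names are listed on the page
linked below):

.. code:: python

    import stanfordnlp

    # One-time download of the English stanfordnlp model
    # (prompts for a save location on first use).
    stanfordnlp.download('en')
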
.. _spaCy: https://spacy.io/usage/models#section-quickstart
.. _installed spaCy language model: https://spacy.io/usage/models
.. _`a corresponding model`: https://stanfordnlp.github.io/stanfordnlp/models.html

=====
Usage
@@ -30,8 +48,8 @@ Command line
> python -m spacy_conll -h
usage: [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
[-t] [-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL] [-s]
[-d] [-e] [-j N_PROCESS] [-v]
[-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL_OR_LANG]
[-s] [-t] [-d] [-e] [-j N_PROCESS] [-u] [-v]
Parse an input string or input file to CoNLL-U format.
@@ -45,21 +63,23 @@ Command line
default.
-b INPUT_STR, --input_str INPUT_STR
Input string to parse. (default: None)
-t, --is_tokenized Indicates whether your text has already been tokenized
(space-separated). (default: False)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Path to output file. If not specified, the output will
be printed on standard output. (default: None)
-c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
Encoding of the output file. Default value is system
default.
-m MODEL, --model MODEL
spaCy model to use (must be installed). (default:
en_core_web_sm)
-m MODEL_OR_LANG, --model_or_lang MODEL_OR_LANG
spaCy or stanfordnlp model or language to use (must be
installed). (default: None)
-s, --disable_sbd Disables spaCy automatic sentence boundary detection.
In practice, disabling means that every line will be
parsed as one sentence, regardless of its actual
content. (default: False)
content. Only works when using spaCy. (default: False)
-t, --is_tokenized Indicates whether your text has already been tokenized
(space-separated). When used in conjunction with
spacy-stanfordnlp, it will also be assumed that the
text is sentence split by newline. (default: False)
-d, --include_headers
To include headers before the output of every
sentence. These headers include the sentence text and
@@ -73,11 +93,14 @@ Command line
Number of processes to use in nlp.pipe(). -1 will use
as many cores as available. Requires spaCy v2.2.2.
(default: 1)
-u, --use_stanfordnlp
Use stanfordnlp models rather than spaCy models.
Requires spacy-stanfordnlp. (default: False)
-v, --verbose To print the output to stdout, regardless of
'output_file'. (default: False)
For example, parsing a sentence:
For example, parsing a single-line, multi-sentence string:

.. code:: bash
@@ -102,8 +125,17 @@ For example, parsing a large input file and writing output to output file, using
> python -m spacy_conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
You can also use Stanford NLP's models to retrieve UD tags by passing the :code:`-u` flag. **NOTE**: Stanford's models
offer limited options due to the API of :code:`stanfordnlp`: it is not possible to disable sentence segmentation and
control the tokenisation at the same time. When using the :code:`-u` flag, the only tokenisation option you can enable
is :code:`--is_tokenized`, which behaves differently than it does with spaCy. With spaCy, it simply skips tokenisation
and uses spaces as token separators. With :code:`spacy-stanfordnlp`, it is additionally assumed that the text is
sentence-split by newlines; no further sentence segmentation is done.

In Python
---------
spaCy
+++++

:code:`spacy_conll` is intended to be used as a custom pipeline component in spaCy. Three custom extensions are accessible,
by default named :code:`conll_str`, :code:`conll_str_headers`, and :code:`conll`.
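
The basic example itself sits in the collapsed part of the diff; a minimal sketch of such a pipeline could look like
the snippet below (the model name :code:`en_core_web_sm` and the sample text are assumptions, chosen to match the
output rows shown after it):

.. code:: python

    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load('en_core_web_sm')
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, after='parser')

    # Two sentences; the output below shows the tail of the second one.
    doc = nlp('I like cookies. Do you?')
    print(doc._.conll_str)
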
@@ -145,7 +177,111 @@ The snippet above will return (and print) the following string:
2 you -PRON- PRON PRP PronType=prs 1 nsubj _ _
3 ? ? PUNCT . PunctType=peri 1 punct _ _
An advanced example, showing the more complex options:

* :code:`ext_names`: changes the attribute names to a custom key by using a dictionary. You can change:

* :code:`conll_str`: a string representation of the CoNLL format
* :code:`conll_str_headers`: the same as :code:`conll_str` but with leading lines containing the sentence index and sentence text
* :code:`conll`: a dictionary containing the field names and their values. For a Doc object, this returns a list of
dictionaries where each dictionary is a sentence

* :code:`field_names`: a dictionary containing a mapping of field names so that you can name them as you wish
* :code:`conversion_maps`: a two-level dictionary that looks like :code:`{field_name: {tag_name: replacement}}`.
In other words, you can specify in which field a certain value should be replaced by another.
This is especially useful when you are not satisfied with the tagset of a model and wish
to change some tags to an alternative.

The example below

* changes the custom attribute :code:`conll` to :code:`connl_for_pd`;
* changes the :code:`lemma` field to :code:`word_lemma`;
* converts any :code:`-PRON-` to :code:`PRON`;
* as a bonus: uses the output dictionary to create a pandas DataFrame.

.. code:: python

    import pandas as pd
    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load('en')
    conllformatter = ConllFormatter(nlp,
                                    ext_names={'conll': 'connl_for_pd'},
                                    field_names={'lemma': 'word_lemma'},
                                    conversion_maps={'word_lemma': {'-PRON-': 'PRON'}})
    nlp.add_pipe(conllformatter, after='parser')
    doc = nlp('I like cookies.')
    df = pd.DataFrame.from_dict(doc._.connl_for_pd[0])
    print(df)

The snippet above will output a pandas DataFrame:

.. code:: text

       id     form word_lemma upostag  ...  head  deprel deps misc
    0   1        I       PRON    PRON  ...     2   nsubj    _    _
    1   2     like       like    VERB  ...     0    ROOT    _    _
    2   3  cookies     cookie    NOUN  ...     2    dobj    _    _
    3   4        .          .   PUNCT  ...     2   punct    _    _

    [4 rows x 10 columns]

spacy-stanfordnlp
+++++++++++++++++

Using :code:`spacy_conll` in conjunction with :code:`spacy-stanfordnlp` is similar to using it with :code:`spacy`:
in practice we are still simply adding a custom pipeline component to the existing pipeline, but this time that
pipeline is a Stanford NLP pipeline wrapped in spaCy's API.

.. code:: python

    from spacy_stanfordnlp import StanfordNLPLanguage
    import stanfordnlp
    from spacy_conll import ConllFormatter

    snlp = stanfordnlp.Pipeline(lang='en')
    nlp = StanfordNLPLanguage(snlp)
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, last=True)
    s = 'A cookie is a baked or cooked food that is typically small, flat and sweet.'
    doc = nlp(s)
    print(doc._.conll_str)

Output:

.. code:: text

    1 A a DET DT _ 2 det _ _
    2 cookie cookie NOUN NN Number=sing 8 nsubj _ _
    3 is be AUX VBZ VerbForm=fin|Tense=pres|Number=sing|Person=three 8 cop _ _
    4 a a DET DT _ 8 det _ _
    5 baked bake VERB VBN VerbForm=part|Tense=past|Aspect=perf 8 amod _ _
    6 or or CCONJ CC ConjType=comp 7 cc _ _
    7 cooked cook VERB VBN VerbForm=part|Tense=past|Aspect=perf 5 conj _ _
    8 food food NOUN NN Number=sing 0 root _ _
    9 that that PRON WDT _ 12 nsubj _ _
    10 is be AUX VBZ VerbForm=fin|Tense=pres|Number=sing|Person=three 12 cop _ _
    11 typically typically ADV RB Degree=pos 12 advmod _ _
    12 small small ADJ JJ Degree=pos 8 acl:relcl _ _
    13 , , PUNCT , PunctType=comm 14 punct _ _
    14 flat flat ADJ JJ Degree=pos 12 conj _ _
    15 and and CCONJ CC ConjType=comp 16 cc _ _
    16 sweet sweet ADJ JJ Degree=pos 12 conj _ _
    17 . . PUNCT . PunctType=peri 8 punct _ _

.. _`spaCy's models`: https://spacy.io/models

----

**DEPRECATED:** :code:`Spacy2ConllParser`
+++++++++++++++++++++++++++++++++++++++++

There are two main methods, :code:`parse()` and :code:`parseprint()`. The latter is a convenience method for printing the output of
:code:`parse()` to stdout (default) or a file.
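
The detailed usage of this deprecated interface sits in the collapsed part of the diff. As a rough, hypothetical
sketch (the constructor call and the :code:`input_str` keyword are assumptions for illustration, not the documented
signature):

.. code:: python

    from spacy_conll import Spacy2ConllParser

    # Hypothetical sketch: argument names are assumed for illustration.
    spacyconll = Spacy2ConllParser()

    # parseprint() prints the CoNLL-U output of parse() to stdout by default.
    spacyconll.parseprint(input_str='I like cookies. Do you?')
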
4 changes: 2 additions & 2 deletions setup.py
@@ -6,12 +6,12 @@

setup(
name='spacy_conll',
version='1.1.0',
version='1.2.0',
description='A custom pipeline component for spaCy that can convert any parsed Doc'
' and its sentences into CoNLL-U format. Also provides a command line entry point.',
long_description=long_description,
long_description_content_type='text/x-rst',
keywords='nlp spacy spacy-extension conll conllu tagging',
keywords='nlp spacy spacy-extension conll conllu tagging parsing stanfordnlp spacy_stanfordnlp',
packages=['spacy_conll'],
url='https://github.com/BramVanroy/spacy_conll',
author='Bram Vanroy, Raquel G. Alhama',
