Merge pull request #5 from BramVanroy/stanford
Improved documentation, added tests for spacy-stanfordnlp
Bram Vanroy authored Feb 2, 2020
2 parents 98061d8 + 958eb80 commit e3cd567
Showing 12 changed files with 444 additions and 135 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
Pipfile.lock
TODO.rst
dummy.py

# .idea (JetBrains)
.idea/
18 changes: 14 additions & 4 deletions HISTORY.rst
@@ -2,6 +2,16 @@
History
#######

**************************
1.2.0 (February 2nd, 2020)
**************************
* **BREAKING**: :code:`._.conll` now outputs a dictionary per sentence (:code:`{fieldname: [value1, value2, ...]}`),
  and a list of such dictionaries for a Doc
* Added a :code:`conversion_maps` argument where one can define a mapping to have better control over the model's tagset
(see the advanced example in README.rst)
* Tests for usage with :code:`spacy-stanfordnlp`
* Better documentation, including advanced example

**************************
1.1.0 (January 21st, 2020)
**************************
@@ -15,10 +25,10 @@ Minor documentation changes for PyPI.
**************************
1.0.0 (January 13th, 2020)
**************************
- Complete overhaul. Can now be used as a custom pipeline component in spaCy.
- Spacy2ConllParser is now deprecated.
- The CLI interface does not rely on Spacy2ConllParser anymore but uses the custom pipeline component instead.
- Added :code:`-e|--no_force_counting` to the CLI options. By default, when using :code:`-d|--include_headers`,
* Complete overhaul. Can now be used as a custom pipeline component in spaCy.
* Spacy2ConllParser is now deprecated.
* The CLI interface does not rely on Spacy2ConllParser anymore but uses the custom pipeline component instead.
* Added :code:`-e|--no_force_counting` to the CLI options. By default, when using :code:`-d|--include_headers`,
parsed sentences will be numbered incrementally. This can be disabled so that the sentence numbering depends on how
spaCy segments the sentences.

3 changes: 3 additions & 0 deletions Pipfile
@@ -8,4 +8,7 @@ spacy = "*"
packaging = "*"

[dev-packages]
torch = "*"
spacy-stanfordnlp = "*"
pytest = "*"
pygments = "*"
166 changes: 151 additions & 15 deletions README.rst
@@ -1,25 +1,43 @@
===========================
Parsing to CoNLL with spaCy
===========================
================================================
Parsing to CoNLL with spaCy or spacy-stanfordnlp
================================================
This module allows you to parse a text to `CoNLL-U format`_. You can use it as a command line tool, or embed it in your
own scripts by adding it as a custom component to a spaCy pipeline.
own scripts by adding it as a custom component to a spaCy or spacy-stanfordnlp pipeline.

Note that the module simply takes spaCy output and puts it in a formatted string adhering to the linked CoNLL-U format. It does not yet do an explicit tagset mapping of spaCy to UD tags. The output tags depend on the spaCy model used.
Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked CoNLL-U
format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you
to use this library in combination with `spacy_stanfordnlp`_, which is a spaCy interface using :code:`stanfordnlp` and
its models behind the scenes. Those models use the Universal Dependencies formalism. See the remainder of this README
for more information and usage guidelines.

.. _`CoNLL-U format`: https://universaldependencies.org/format.html
.. _`spacy_stanfordnlp`: https://github.com/explosion/spacy-stanfordnlp

============
Installation
============

Requires `spaCy`_ and an `installed spaCy language model`_. When using the module from the command line, you also need the :code:`packaging` package.
Requires `spaCy`_ and an `installed spaCy language model`_. When using the module from the command line, you also need
the :code:`packaging` package. See section `spaCy`_ for usage.

Because `spaCy's models`_ are not necessarily trained on Universal Dependencies conventions, their output labels are
not UD either. By using :code:`spacy-stanfordnlp`, we get the easy-to-use interface of spaCy as a wrapper around
:code:`stanfordnlp` and its models that *are* trained on UD data. If you want to use the Stanford NLP models, you also
need :code:`spacy-stanfordnlp` and `a corresponding model`_. See the section `spacy-stanfordnlp`_ for usage.

**NOTE**: :code:`spacy-stanfordnlp` is not automatically installed as a dependency for this library, because it might be
too much overhead for those who don't need UD. If you wish to use its functionality, you have to install it manually.
By default, only :code:`spacy` and :code:`packaging` are installed as dependencies.

To install the library, simply use pip.

.. code:: bash

    pip install spacy_conll

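If you also want the Stanford NLP models, a minimal setup sketch might look as follows (this assumes you have already
run :code:`pip install spacy-stanfordnlp` and want the English model; available model names are listed on the page
linked below):

.. code:: python

    import stanfordnlp

    # One-time download of the English stanfordnlp model
    # (prompts for a save location on first use).
    stanfordnlp.download('en')
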
.. _spaCy: https://spacy.io/usage/models#section-quickstart
.. _installed spaCy language model: https://spacy.io/usage/models
.. _`a corresponding model`: https://stanfordnlp.github.io/stanfordnlp/models.html

=====
Usage
@@ -30,8 +48,8 @@ Command line
> python -m spacy_conll -h
usage: [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
[-t] [-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL] [-s]
[-d] [-e] [-j N_PROCESS] [-v]
[-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL_OR_LANG]
[-s] [-t] [-d] [-e] [-j N_PROCESS] [-u] [-v]
Parse an input string or input file to CoNLL-U format.
@@ -45,21 +63,23 @@ Command line
default.
-b INPUT_STR, --input_str INPUT_STR
Input string to parse. (default: None)
-t, --is_tokenized Indicates whether your text has already been tokenized
(space-separated). (default: False)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Path to output file. If not specified, the output will
be printed on standard output. (default: None)
-c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
Encoding of the output file. Default value is system
default.
-m MODEL, --model MODEL
spaCy model to use (must be installed). (default:
en_core_web_sm)
-m MODEL_OR_LANG, --model_or_lang MODEL_OR_LANG
spaCy or stanfordnlp model or language to use (must be
installed). (default: None)
-s, --disable_sbd Disables spaCy automatic sentence boundary detection.
In practice, disabling means that every line will be
parsed as one sentence, regardless of its actual
content. (default: False)
content. Only works when using spaCy. (default: False)
-t, --is_tokenized Indicates whether your text has already been tokenized
(space-separated). When used in conjunction with
spacy-stanfordnlp, it will also be assumed that the
text is sentence split by newline. (default: False)
-d, --include_headers
To include headers before the output of every
sentence. These headers include the sentence text and
@@ -73,11 +93,14 @@ Command line
Number of processes to use in nlp.pipe(). -1 will use
as many cores as available. Requires spaCy v2.2.2.
(default: 1)
-u, --use_stanfordnlp
Use stanfordnlp models rather than spaCy models.
Requires spacy-stanfordnlp. (default: False)
-v, --verbose To print the output to stdout, regardless of
'output_file'. (default: False)
For example, parsing a sentence:
For example, parsing a single-line, multi-sentence string:

.. code:: bash
@@ -102,8 +125,17 @@ For example, parsing a large input file and writing output to output file, using
> python -m spacy_conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
You can also use Stanford NLP's models to retrieve UD tags by passing the :code:`-u` flag. **NOTE**: Stanford's models
offer limited options due to the API of :code:`stanfordnlp`: it is not possible to disable sentence segmentation and
control the tokenisation at the same time. When using the :code:`-u` flag, the only tokenisation option you can enable
is :code:`--is_tokenized`, which behaves differently than it does with spaCy. With spaCy, it simply skips tokenisation
and uses spaces as token separators. With :code:`spacy-stanfordnlp`, it is additionally assumed that the text is
sentence-split by newlines; no further sentence segmentation is done.

In Python
---------
spaCy
+++++

:code:`spacy_conll` is intended to be used as a custom pipeline component in spaCy. Three custom extensions are accessible,
by default named :code:`conll_str`, :code:`conll_str_headers`, and :code:`conll`.
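
The basic example itself sits in the collapsed part of the diff; a minimal sketch of such a pipeline could look like
the snippet below (the model name :code:`en_core_web_sm` and the sample text are assumptions, chosen to match the
output rows shown after it):

.. code:: python

    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load('en_core_web_sm')
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, after='parser')

    # Two sentences; the output below shows the tail of the second one.
    doc = nlp('I like cookies. Do you?')
    print(doc._.conll_str)
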
@@ -145,7 +177,111 @@ The snippet above will return (and print) the following string:
2 you -PRON- PRON PRP PronType=prs 1 nsubj _ _
3 ? ? PUNCT . PunctType=peri 1 punct _ _
An advanced example, showing the more complex options:

* :code:`ext_names`: changes the attribute names to a custom key by using a dictionary. You can change:

* :code:`conll_str`: a string representation of the CoNLL format
* :code:`conll_str_headers`: the same as :code:`conll_str` but with leading lines containing the sentence index and sentence text
* :code:`conll`: a dictionary containing the field names and their values. For a Doc object, this returns a list of
dictionaries where each dictionary is a sentence

* :code:`field_names`: a dictionary containing a mapping of field names so that you can name them as you wish
* :code:`conversion_maps`: a two-level dictionary that looks like :code:`{field_name: {tag_name: replacement}}`.
In other words, you can specify in which field a certain value should be replaced by another.
This is especially useful when you are not satisfied with the tagset of a model and wish
to change some tags to an alternative.

The example below

* changes the custom attribute :code:`conll` to :code:`connl_for_pd`;
* changes the :code:`lemma` field to :code:`word_lemma`;
* converts any :code:`-PRON-` to :code:`PRON`;
* as a bonus: uses the output dictionary to create a pandas DataFrame.

.. code:: python

    import pandas as pd
    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load('en')
    conllformatter = ConllFormatter(nlp,
                                    ext_names={'conll': 'connl_for_pd'},
                                    field_names={'lemma': 'word_lemma'},
                                    conversion_maps={'word_lemma': {'-PRON-': 'PRON'}})
    nlp.add_pipe(conllformatter, after='parser')
    doc = nlp('I like cookies.')
    df = pd.DataFrame.from_dict(doc._.connl_for_pd[0])
    print(df)

The snippet above will output a pandas DataFrame:

.. code:: text

       id     form word_lemma upostag  ...  head  deprel deps misc
    0   1        I       PRON    PRON  ...     2   nsubj    _    _
    1   2     like       like    VERB  ...     0    ROOT    _    _
    2   3  cookies     cookie    NOUN  ...     2    dobj    _    _
    3   4        .          .   PUNCT  ...     2   punct    _    _

    [4 rows x 10 columns]

spacy-stanfordnlp
+++++++++++++++++

Using :code:`spacy_conll` in conjunction with :code:`spacy-stanfordnlp` is similar to using it with :code:`spacy`:
in practice we are still simply adding a custom pipeline component to the existing pipeline, but this time that
pipeline is a Stanford NLP pipeline wrapped in spaCy's API.

.. code:: python

    from spacy_stanfordnlp import StanfordNLPLanguage
    import stanfordnlp
    from spacy_conll import ConllFormatter

    snlp = stanfordnlp.Pipeline(lang='en')
    nlp = StanfordNLPLanguage(snlp)
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, last=True)
    s = 'A cookie is a baked or cooked food that is typically small, flat and sweet.'
    doc = nlp(s)
    print(doc._.conll_str)

Output:

.. code:: text

    1 A a DET DT _ 2 det _ _
    2 cookie cookie NOUN NN Number=sing 8 nsubj _ _
    3 is be AUX VBZ VerbForm=fin|Tense=pres|Number=sing|Person=three 8 cop _ _
    4 a a DET DT _ 8 det _ _
    5 baked bake VERB VBN VerbForm=part|Tense=past|Aspect=perf 8 amod _ _
    6 or or CCONJ CC ConjType=comp 7 cc _ _
    7 cooked cook VERB VBN VerbForm=part|Tense=past|Aspect=perf 5 conj _ _
    8 food food NOUN NN Number=sing 0 root _ _
    9 that that PRON WDT _ 12 nsubj _ _
    10 is be AUX VBZ VerbForm=fin|Tense=pres|Number=sing|Person=three 12 cop _ _
    11 typically typically ADV RB Degree=pos 12 advmod _ _
    12 small small ADJ JJ Degree=pos 8 acl:relcl _ _
    13 , , PUNCT , PunctType=comm 14 punct _ _
    14 flat flat ADJ JJ Degree=pos 12 conj _ _
    15 and and CCONJ CC ConjType=comp 16 cc _ _
    16 sweet sweet ADJ JJ Degree=pos 12 conj _ _
    17 . . PUNCT . PunctType=peri 8 punct _ _

.. _`spaCy's models`: https://spacy.io/models

----

**DEPRECATED:** :code:`Spacy2ConllParser`
+++++++++++++++++++++++++++++++++++++++++

There are two main methods, :code:`parse()` and :code:`parseprint()`. The latter is a convenience method for printing the output of
:code:`parse()` to stdout (default) or a file.
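
The detailed usage of this deprecated interface sits in the collapsed part of the diff. As a rough, hypothetical
sketch (the constructor call and the :code:`input_str` keyword are assumptions for illustration, not the documented
signature):

.. code:: python

    from spacy_conll import Spacy2ConllParser

    # Hypothetical sketch: argument names are assumed for illustration.
    spacyconll = Spacy2ConllParser()

    # parseprint() prints the CoNLL-U output of parse() to stdout by default.
    spacyconll.parseprint(input_str='I like cookies. Do you?')
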
4 changes: 2 additions & 2 deletions setup.py
@@ -6,12 +6,12 @@

setup(
name='spacy_conll',
version='1.1.0',
version='1.2.0',
description='A custom pipeline component for spaCy that can convert any parsed Doc'
' and its sentences into CoNLL-U format. Also provides a command line entry point.',
long_description=long_description,
long_description_content_type='text/x-rst',
keywords='nlp spacy spacy-extension conll conllu tagging',
keywords='nlp spacy spacy-extension conll conllu tagging parsing stanfordnlp spacy_stanfordnlp',
packages=['spacy_conll'],
url='https://github.com/BramVanroy/spacy_conll',
author='Bram Vanroy, Raquel G. Alhama',
