Improvements to the docs #46

Merged: 2 commits, Apr 29, 2024
Changes from 1 commit
DOC: Make tutorials runnable as myst notebooks
ghisvail committed Apr 25, 2024
commit f2d398823431e3dec6da22746aa9c5f6a26a0255
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -15,7 +15,7 @@

extensions = [
"autoapi.extension",
"myst_parser",
"myst_nb",
"numpydoc",
"sphinxcontrib.mermaid",
"sphinx_design",
332 changes: 29 additions & 303 deletions docs/tutorial/context_detection.md

Large diffs are not rendered by default.

199 changes: 28 additions & 171 deletions docs/tutorial/entity_matching.md

Large diffs are not rendered by default.

68 changes: 34 additions & 34 deletions docs/user_guide/first_steps.md
@@ -8,21 +8,21 @@ and context detection operations.

For starters, let's load a text file using the {class}`~medkit.core.text.TextDocument` class:

:::{code}
```{code} python
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../data/text/1.txt"))
:::
```

The full raw text can be accessed through the `text` attribute:

:::{code}
```{code} python
print(doc.text)
:::
```

A `TextDocument` can store {class}`~medkit.core.text.TextAnnotation` objects.
For now, our document is free of annotations.
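For example, assuming the annotation container supports `len()` (it behaves roughly like a list), an empty document can be checked with:

```python
# The container of a freshly loaded document should report zero annotations.
print(len(doc.anns))  # expected output: 0
```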
@@ -36,14 +36,14 @@ documents in sentences.
including a rule-based {class}`~medkit.text.segmentation.SentenceTokenizer` class
that relies on a list of punctuation characters.

:::{code}
```{code} python
from medkit.text.segmentation import SentenceTokenizer
sent_tokenizer = SentenceTokenizer(
output_label="sentence",
punct_chars=[".", "?", "!"],
)
:::
```

Like all operations, `SentenceTokenizer` defines a `run()` method.

@@ -54,14 +54,14 @@ and returns a list of `Segment` objects.
Here, we can pass a special `Segment` containing the full text of the document,
which can be retrieved through the `raw_segment` attribute of `TextDocument`:

:::{code}
```{code} python
sentences = sent_tokenizer.run([doc.raw_segment])
for sentence in sentences:
print(f"uid={sentence.uid}")
print(f"text={sentence.text!r}")
print(f"spans={sentence.spans}, label={sentence.label}\n")
:::
```

Each segment features:
- a `uid` attribute, whose unique value is automatically generated;
@@ -76,10 +76,10 @@ Each segment features:
If you take a look at the 13th and 14th detected sentences,
you will notice something strange:

:::{code}
```{code} python
print(repr(sentences[12].text))
print(repr(sentences[13].text))
:::
```

This is actually one sentence that was split into two segments,
because the sentence tokenizer incorrectly considers the dot in the decimal weight value
@@ -92,31 +92,31 @@ For this, we can use the {class}`~medkit.text.preprocessing.RegexpReplacer` clas
a regexp-based "search-and-replace" operation.
Like other `medkit` operations, it can be configured with a set of user-defined rules:

:::{code}
```{code} python
from medkit.text.preprocessing import RegexpReplacer
rule = (r"(?<=\d)\.(?=\d)", ",") # => (pattern to replace, new text)
regexp_replacer = RegexpReplacer(output_label="clean_text", rules=[rule])
:::
```

The `run()` method of the normalizer takes a list of `Segment` objects
and returns a list of new `Segment` objects, one for each input `Segment`.
In our case we only want to preprocess the full raw text segment,
and we will only receive one preprocessed segment,
so we can call it with:

:::{code}
```{code} python
clean_segment = regexp_replacer.run([doc.raw_segment])[0]
print(clean_segment.text)
:::
```

We may now use our previously defined sentence tokenizer again,
but this time on the preprocessed text:

:::{code}
```{code} python
sentences = sent_tokenizer.run([clean_segment])
print(sentences[12].text)
:::
```

Problem fixed!

@@ -126,7 +126,7 @@ The `medkit` library also comes with operations to perform NER (named entity rec
for instance with {class}`~medkit.text.ner.regexp_matcher.RegexpMatcher`.
Let's instantiate one with a few simple rules:

:::{code}
```{code} python
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
regexp_rules = [
@@ -138,7 +138,7 @@ regexp_rules = [
RegexpMatcherRule(regexp=r"\bnasonex?\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
:::
```

As you can see, you can also define some rules that ignore case distinctions
by setting the `case_sensitive` parameter to `False`.
@@ -162,13 +162,13 @@ representing the entities that were matched (`Entity` is a subclass of `Segment`
As input, it expects a list of `Segment` objects.
Let's give it the sentences returned by the sentence tokenizer:

:::{code}
```{code} python
entities = regexp_matcher.run(sentences)
for entity in entities:
print(f"uid={entity.uid}")
print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}\n")
:::
```

Just like sentences, each entity features `uid`, `text`, `spans` and `label` attributes
(in this case, determined by the rule that was used to match it).
@@ -192,7 +192,7 @@ accessible through their {class}`~medkit.core.AttributeContainer`).
Let's instantiate a `NegationDetector` with a couple of simplistic handcrafted rules
and run it on our sentences:

:::{code}
```{code} python
from medkit.text.context import NegationDetector, NegationDetectorRule
neg_rules = [
@@ -202,7 +202,7 @@ neg_rules = [
]
neg_detector = NegationDetector(output_label="is_negated", rules=neg_rules)
neg_detector.run(sentences)
:::
```

:::{note}
Similarly to `RegexpMatcher`, `NegationDetector` also comes with a set of default rules
@@ -213,12 +213,12 @@ located in the `medkit.text.context` module.

And now, let's look at which sentences have been detected as negated:

:::{code}
```{code} python
for sentence in sentences:
neg_attr = sentence.attrs.get(label="is_negated")[0]
if neg_attr.value:
print(sentence.text)
:::
```

Our simple negation detector does not work too badly,
but sometimes part of a sentence is tagged as negated whilst the rest is not,
@@ -235,7 +235,7 @@ which are stored in file `default_syntagma_definition.yml`
located in the `medkit.text.segmentation` module.
:::

:::{code}
```{code} python
from medkit.text.segmentation import SyntagmaTokenizer
synt_tokenizer = SyntagmaTokenizer(
@@ -249,7 +249,7 @@ for syntagma in syntagmas:
neg_attr = syntagma.attrs.get(label="is_negated")[0]
if neg_attr.value:
print(syntagma.text)
:::
```

We now have some information about negation attached to syntagmas,
but the end goal is really to know, for each entity,
@@ -268,14 +268,14 @@ Let's again use a `RegexpMatcher` to find some entities,
but this time from syntagmas rather than from sentences,
and using `attrs_to_copy` to copy negation attributes:

:::{code}
```{code} python
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])
entities = regexp_matcher.run(syntagmas)
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(f"text='{entity.text}', label={entity.label}, is_negated={neg_attr.value}")
:::
```

We now have a negation `Attribute` for each entity!

@@ -293,21 +293,21 @@ an instance of {class}`~medkit.core.text.TextAnnotationContainer`)
that behaves roughly like a list but also offers additional filtering methods.
Annotations can be added by calling its `add()` method:

:::{code}
```{code} python
for entity in entities:
doc.anns.add(entity)
:::
```
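The additional filtering methods mentioned above can be used to retrieve annotations by label; a small hedged example, assuming `get()` accepts a `label` keyword as it does for attributes:

```python
# Retrieve only the annotations labelled "problem" from the document.
problem_entities = doc.anns.get(label="problem")
for entity in problem_entities:
    print(entity.label, entity.text)
```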

The document and its corresponding entities can be exported to supported formats
such as brat (see {class}`~medkit.io.brat.BratOutputConverter`)
or Doccano (see {class}`~medkit.io.doccano.DoccanoOutputConverter`),
or serialized to JSON (see {mod}`~medkit.io.medkit_json`):

:::{code}
```{code} python
from medkit.io import medkit_json
medkit_json.save_text_document(doc, "doc_1.json")
:::
```
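A brat export could look like the following sketch; the exact `BratOutputConverter.save()` signature is an assumption here, so check the API reference before relying on it:

```python
from medkit.io.brat import BratOutputConverter

# Hedged sketch: write the document and its entities as brat .txt/.ann files.
brat_converter = BratOutputConverter()
brat_converter.save([doc], dir_path="brat_output")
```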

## Visualizing entities with displacy

@@ -316,13 +316,13 @@ a visualization tool part of the [spaCy](https://spacy.io/) NLP library.
`medkit` provides helper functions to facilitate the use of `displacy`
in the {mod}`~medkit.text.spacy.displacy_utils` module:

:::{code}
```{code} python
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent")
:::
```

## Wrapping it up

19 changes: 11 additions & 8 deletions docs/user_guide/module.md
@@ -57,14 +57,17 @@ segment.

```python
class MyTokenizer(SegmentationOperation):
...

def _tokenize(self, segment: Segment) -> Segment:
"""Custom method for segment tokenization."""
...

def run(self, segments: List[Segment]) -> List[Segment]:
# Here is your code for the tokenizer:
# * process each input
return [
token
for segment in segments
for token in self._mytokenmethod(segment)
for token in self._tokenize(segment)
]
```

## 3. Make your operation non-destructive (for text)
@@ -85,7 +88,7 @@ segments.
```python
class MyTokenizer(SegmentationOperation):
...
def _mytokenmethod(self, segment):
def _tokenize(self, segment):
# process the segment (e.g., cut the segment)
size = len(segment)
cut_index = size // 2
@@ -140,7 +143,7 @@ Here is our example which store information about:
```python
class MyTokenizer(SegmentationOperation):
...
def _mytokenmethod(self, segment):
def _tokenize(self, segment):
...

# save the provenance data for this operation
@@ -166,7 +169,7 @@ To illustrate what we have seen in a more concrete manner, here is a fictional
"days of the week" matcher that takes text segments as input a return entities
for week days:

:::{code}
```python
import re
from medkit.core import Operation
from medkit.core.text import Entity, span_utils
@@ -222,7 +225,7 @@ class DayMatcher(Operation):
)

return entities
:::
```

Note that since this is an entity matcher, adding support for `attrs_to_copy`
would be nice (cf. [Context detection](../tutorial/context_detection.md)), as sketched below.
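A minimal sketch of such support, assuming attributes expose a `copy()` method as used by the built-in matchers, would copy the requested attributes from each input segment onto the matched entity:

```python
# Hypothetical addition to DayMatcher.run(), after creating `entity` from `segment`:
# copy over the attributes whose labels were requested via an `attrs_to_copy` parameter.
for label in self.attrs_to_copy:
    for attr in segment.attrs.get(label=label):
        entity.attrs.add(attr.copy())
```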
58 changes: 29 additions & 29 deletions docs/user_guide/pipeline.md
@@ -8,7 +8,7 @@ and how to create pipelines to enrich documents.
Let's reuse the preprocessing, segmentation, context detection and entity recognition operations
from the [First steps](./first_steps.md) tutorial:

:::{code}
```{code} python
from medkit.text.preprocessing import RegexpReplacer
from medkit.text.segmentation import SentenceTokenizer, SyntagmaTokenizer
from medkit.text.context import NegationDetector, NegationDetectorRule
@@ -30,14 +30,14 @@ syntagma_tokenizer = SyntagmaTokenizer(
)
# context detection
neg_rules = [
negation_rules = [
NegationDetectorRule(regexp=r"\bpas\s*d[' e]\b"),
NegationDetectorRule(regexp=r"\bsans\b", exclusion_regexps=[r"\bsans\s*doute\b"]),
NegationDetectorRule(regexp=r"\bne\s*semble\s*pas"),
]
negation_detector = NegationDetector(
output_label="is_negated",
rules=neg_rules,
rules=negation_rules,
)
# entity recognition
@@ -50,13 +50,13 @@ regexp_rules = [
RegexpMatcherRule(regexp=r"\bnasonex?\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])
:::
```

Each of these operations features a `run()` method, which can be called sequentially.
Data needs to be routed manually between the inputs and outputs of each operation,
using a document's raw text segment as the initial input:

:::{code}
```{code} python
from pathlib import Path
from medkit.core.text import TextDocument
@@ -74,7 +74,7 @@ syntagmas = syntagma_tokenizer.run(sentences)
# but rather appends attributes to the segments it received.
negation_detector.run(syntagmas)
entities = regexp_matcher.run(syntagmas)
:::
```

This way of coding is useful for interactive exploration of `medkit`.
In the next section, we will introduce a different way using `Pipeline` objects.
@@ -105,7 +105,7 @@ But we also need to "connect" the operations together,
i.e. to indicate which output of an operation should be fed as input to another operation.
This is the purpose of the {class}`~medkit.core.PipelineStep` objects:

:::{code}
```{code} python
from medkit.core import PipelineStep
steps = [
@@ -115,13 +115,13 @@ steps = [
PipelineStep(negation_detector, input_keys=["syntagmas"], output_keys=[]), # no output
PipelineStep(regexp_matcher, input_keys=["syntagmas"], output_keys=["entities"]),
]
:::
```

Each `PipelineStep` associates an operation with input and output _keys_.
Pipeline steps with matching input and output keys will be connected to each other.
The resulting pipeline can be represented like this:

:::{mermaid}
```{mermaid}
---
align: center
---
@@ -143,11 +143,11 @@ graph TD
F --> G
classDef io fill:#fff4dd,stroke:#edb:
:::
```

Pipeline steps can then be used to instantiate a {class}`~medkit.core.Pipeline` object:

:::{code}
```{code} python
from medkit.core import Pipeline
pipeline = Pipeline(
@@ -162,7 +162,7 @@ pipeline = Pipeline(
# (and therefore that it should be the output of the regexp matcher)
output_keys=["entities"]
)
:::
```

The resulting pipeline is functionally equivalent to a single operation
processing full text segments as input and returning entities with negation attributes as output.
@@ -171,13 +171,13 @@ but more complex pipelines with multiple inputs and outputs are supported.

Like any other operation, the pipeline can be evaluated using its `run` method:

:::{code}
```{code} python
entities = pipeline.run([doc.raw_segment])
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(f"text='{entity.text}', label={entity.label}, is_negated={neg_attr.value}")
:::
```

## Nested pipelines

@@ -188,7 +188,7 @@ which can be used, tested and exercised in isolation.
In our example, we can use this feature to group our regexp replacer,
sentence tokenizer, syntagma tokenizer and negation detector into a context sub-pipeline:

:::{code}
```{code} python
# Context pipeline that receives full text segments
# and returns preprocessed syntagmas segments with negation attributes.
context_pipeline = Pipeline(
@@ -197,20 +197,20 @@ context_pipeline = Pipeline(
name="context",
steps=[
PipelineStep(regexp_replacer, input_keys=["full_text"], output_keys=["clean_text"]),
PipelineStep(sent_tokenizer, input_keys=["clean_text"], output_keys=["sentences"]),
PipelineStep(synt_tokenizer, input_keys=["sentences"], output_keys=["syntagmas"]),
PipelineStep(neg_detector, input_keys=["syntagmas"], output_keys=[]),
PipelineStep(sentence_tokenizer, input_keys=["clean_text"], output_keys=["sentences"]),
PipelineStep(syntagma_tokenizer, input_keys=["sentences"], output_keys=["syntagmas"]),
PipelineStep(negation_detector, input_keys=["syntagmas"], output_keys=[]),
],
input_keys=["full_text"],
output_keys=["syntagmas"],
)
:::
```

Likewise, we can introduce a NER sub-pipeline
composed of a UMLS-based matching operation (see also [Entity Matching](../tutorial/entity_matching.md))
grouped with the previously defined regexp matcher:

:::{code}
```{code} python
from medkit.text.ner import UMLSMatcher
umls_matcher = UMLSMatcher(
@@ -231,15 +231,15 @@ ner_pipeline = Pipeline(
input_keys=["syntagmas"],
output_keys=["entities"],
)
:::
```

Since both pipeline steps feature the same output key (_entities_),
the pipeline will return a list containing the entities matched by
both the regexp matcher and the UMLS matcher.

The NER and context sub-pipelines can now be sequenced with:

:::{code}
```{code} python
pipeline = Pipeline(
steps=[
PipelineStep(context_pipeline, input_keys=["full_text"], output_keys=["syntagmas"]),
@@ -248,7 +248,7 @@ pipeline = Pipeline(
input_keys=["full_text"],
output_keys=["entities"],
)
:::
```

which can be represented like this:

@@ -287,14 +287,14 @@ graph TD

Let's run the pipeline and verify entities with negation attributes:

:::{code}
```{code} python
entities = pipeline.run([doc.raw_segment])
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(entity.label, ":", entity.text)
print("negation:", neg_attr.value, end="\n\n")
:::
```

```text
problem : allergies
@@ -393,28 +393,28 @@ To scale the processing of such pipeline to a collection of documents,
one needs to iterate over each document manually to obtain its entities
rather than processing all the documents at once:

:::{code}
```{code} python
docs = TextDocument.from_dir(Path("../data/text"))
for doc in docs:
entities = pipeline.run([doc.raw_segment])
for entity in entities:
doc.anns.add(entity)
:::
```

To handle this common use case, `medkit` provides a {class}`~medkit.core.DocPipeline` class,
which wraps a `Pipeline` instance and runs it on a list of documents.

Here is an example of its usage:

:::{code}
```{code} python
from medkit.core import DocPipeline
docs = TextDocument.from_dir(Path("../data/text"))
doc_pipeline = DocPipeline(pipeline=pipeline)
doc_pipeline.run(docs)
:::
```
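After the document pipeline has run, each document holds its own annotations; a quick check, again assuming the annotation container supports `len()`:

```python
# Each document should now carry the entities produced by the pipeline.
for doc in docs:
    print(doc.uid, len(doc.anns))
```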

## Summary

70 changes: 35 additions & 35 deletions docs/user_guide/provenance.md
@@ -25,7 +25,7 @@ and take a look at provenance for a single annotation, generated by a single ope
We are going to create a very simple `TextDocument` containing just one sentence,
and run a `RegexpMatcher` to match a single `Entity`:

:::{code}
```{code} python
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
@@ -34,32 +34,32 @@ doc = TextDocument(text=text)
regexp_rule = RegexpMatcherRule(regexp=r"\basthme\b", label="problem")
regexp_matcher = RegexpMatcher(rules=[regexp_rule])
:::
```

Before calling the `run()` method of our regexp matcher,
we will activate provenance tracing for the generated entities.
This is done by assigning it a {class}`~medkit.core.ProvTracer` object.
The `ProvTracer` is in charge of gathering provenance information across all operations.

:::{code}
```{code} python
from medkit.core import ProvTracer
prov_tracer = ProvTracer()
regexp_matcher.set_prov_tracer(prov_tracer)
:::
```

Now that provenance is enabled, the regexp matcher can be applied to the input document:

:::{code}
```{code} python
entities = regexp_matcher.run([doc.raw_segment])
for entity in entities:
print(f"text={entity.text!r}, label={entity.label}")
:::
```

Let's retrieve and inspect provenance information concerning the matched entity:

:::{code}
```{code} python
def print_prov(prov):
# data item
print(f"data_item={prov.data_item.text!r}")
@@ -74,7 +74,7 @@ def print_prov(prov):
entity = entities[0]
prov = prov_tracer.get_prov(entity.uid)
print_prov(prov)
:::
```

The `get_prov()` method of `ProvTracer` returns a simple {class}`~medkit.core.Prov` object
containing all the provenance information related to a specific object.
@@ -89,17 +89,17 @@ It features the following attributes:
Here there is only one source, the raw text segment,
because the entity was found in this particular segment by the regexp matcher.
But it is possible to have more than one data item in the sources;
- `derived_data_items` contains the objects that were derived from the data item by further operations.
In this simple example, there are none.

If we are interested in all the provenance information gathered by the `ProvTracer` instance,
rather than the provenance of a specific item,
then we can call the `get_provs()` method:

:::{code}
```{code} python
for prov in prov_tracer.get_provs():
print_prov(prov)
:::
```

Here, we have another `Prov` object with partial provenance information about the raw text segment:
we know how it was used (the entity was derived from it) but we don't know how it was created.
@@ -118,7 +118,7 @@ It also provides command-line executable named `dot` to generate images from suc
You will need to install `graphviz` on your system to be able to run the following code.
:::

:::{code}
```{code} python
from pathlib import Path
from IPython.display import Image
from medkit.tools import save_prov_to_dot
@@ -144,15 +144,15 @@ dot_file = output_dir / "prov.dot"
save_prov_to_dot(prov_tracer, dot_file)
display_dot(dot_file)
:::
```
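The definition of the `display_dot()` helper falls outside this hunk; a minimal sketch, assuming the graphviz `dot` executable is available on the PATH, could be:

```python
import subprocess
from IPython.display import Image

def display_dot(dot_file):
    # Render the .dot file to a PNG with the graphviz CLI and display it inline.
    png_file = dot_file.with_suffix(".png")
    subprocess.run(["dot", "-Tpng", str(dot_file), "-o", str(png_file)], check=True)
    return Image(filename=str(png_file))
```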

## Provenance composition

Let's move on to a slightly more complex example.
Before using the `RegexpMatcher`, we will split our document into sentences with a `SentenceTokenizer`.
We will also compose the `SentenceTokenizer` and our `RegexpMatcher` operations in a `Pipeline`.

:::{code}
```{code} python
from medkit.text.segmentation import SentenceTokenizer
from medkit.core.pipeline import PipelineStep, Pipeline
@@ -166,7 +166,7 @@ steps = [
PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(steps=steps, input_keys=["full_text"], output_keys=["entities"])
:::
```

Since a pipeline is itself an operation, it also features a `set_prov_tracer()` method,
and calling it will automatically enable provenance tracing for all the operations in the pipeline.
@@ -175,23 +175,23 @@ and calling it will automatically enable provenance tracing for all the operatio
Provenance tracers can only accumulate provenance information, not modify or delete it.
:::

:::{code}
```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
print(f"text={entity.text!r}, label={entity.label}")
:::
```

As expected, the result is identical to the first example: we have matched one entity.
However, its provenance is structured differently:

:::{code}
```{code} python
for prov in prov_tracer.get_provs():
print_prov(prov)
:::
```

Compared to the simpler case, the operation that created the entity is the `Pipeline`, instead of the `RegexpMatcher`.
It might sound a little surprising, but it does make sense: the pipeline is a processing operation itself,
@@ -202,12 +202,12 @@ If we are interested in the details about what happened inside the `Pipeline`,
the information is still available through a sub-provenance tracer
that can be retrieved with `get_sub_prov_tracer()`:

:::{code}
```{code} python
pipeline_prov_tracer = prov_tracer.get_sub_prov_tracer(pipeline.uid)
for prov in pipeline_prov_tracer.get_provs():
print_prov(prov)
:::
```

Although `get_provs()` does not return the `Prov` objects in the order the annotations were created,
we can see the details of what happened in the pipeline.
@@ -220,17 +220,17 @@ The `save_prov_to_dot()` helper is able to leverage this structure.
By default, it expands and displays all sub-provenance information recursively,
but it has an optional `max_sub_prov_depth` parameter that limits the depth of sub-provenance shown:

:::{code}
```{code} python
# show only outer-most provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
:::
```

:::{code}
```{code} python
# expand next level of sub-provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
:::
```

Just as pipelines can contain sub-pipelines recursively,
the provenance tracer can contain sub-provenance tracers for the corresponding sub-pipelines.
@@ -247,7 +247,7 @@ To demonstrate a bit more the potential of provenance tracing in `medkit`,
let's build a more complicated pipeline involving a sub-pipeline
and an operation that creates attributes:

:::{code}
```{code} python
from medkit.text.context import NegationDetector, NegationDetectorRule
# segmentation
@@ -284,30 +284,30 @@ pipeline = Pipeline(
input_keys=["full_text"],
output_keys=["entities"],
)
:::
```

Since there are 2 pipelines, we pass an optional `name` parameter to each of them;
it is used in the operation description and helps us distinguish between them.

Running the main pipeline returns 2 entities with negation attributes:

:::{code}
```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
is_negated = entity.attrs.get(label="is_negated")[0].value
print(f"text={entity.text!r}, label={entity.label}, is_negated={is_negated}")
:::
```

At the outermost level, provenance tells us that the main pipeline created 2 entities and 2 attributes.
Intermediary data and operations (`SentenceTokenizer`, `NegationDetector`, `RegexpMatcher`) are hidden.

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
:::
```

You can see dotted arrows showing which attribute relates to which annotation.
While this is not strictly speaking provenance information,
@@ -318,10 +318,10 @@ are copied to new annotations (cf `attrs_to_copy` as explained in the

Expanding one more level of provenance gives us the following graph:

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
:::
```

Now, we can see the details of the operations and data items handled in our main pipeline.
A sub-pipeline created sentence segments and negation attributes,
@@ -330,10 +330,10 @@ The negation attributes were attached to both the sentences and the entities der

To have more details about the processing inside the context sub-pipeline, we have to go one level deeper:

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=2)
display_dot(dot_file)
:::
```

## Wrapping it up

14 changes: 10 additions & 4 deletions pyproject.toml
@@ -164,9 +164,10 @@ all = [
webrtc-voice-detector]""",
]
docs = [
"myst-parser",
"myst-nb",
"numpydoc",
"sphinx>=7,<8",
"pandas",
"sphinx",
"sphinx-autoapi",
"sphinx-autobuild",
"sphinx-book-theme",
@@ -207,15 +208,20 @@ cov = [

[tool.hatch.envs.docs]
dependencies = [
"myst-parser",
"myst-nb",
"numpydoc",
"sphinx>=7,<8",
"pandas",
"sphinx",
"sphinx-autoapi",
"sphinx-autobuild",
"sphinx-book-theme",
"sphinx-design",
"sphinxcontrib-mermaid",
]
features = [
"spacy",
]
python = "3.12"

[tool.hatch.envs.docs.scripts]
clean = "rm -rf docs/_build"