Improvements to the docs #46

Merged: 2 commits, Apr 29, 2024
Changes from 1 commit
DOC: Make tutorials runnable as myst notebooks
ghisvail committed Apr 25, 2024
commit f2d398823431e3dec6da22746aa9c5f6a26a0255
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -15,7 +15,7 @@

extensions = [
"autoapi.extension",
"myst_parser",
"myst_nb",
"numpydoc",
"sphinxcontrib.mermaid",
"sphinx_design",
332 changes: 29 additions & 303 deletions docs/tutorial/context_detection.md

Large diffs are not rendered by default.

199 changes: 28 additions & 171 deletions docs/tutorial/entity_matching.md

Large diffs are not rendered by default.

68 changes: 34 additions & 34 deletions docs/user_guide/first_steps.md
@@ -8,21 +8,21 @@ and context detection operations.

For starters, let's load a text file using the {class}`~medkit.core.text.TextDocument` class:

:::{code}
```{code} python
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../data/text/1.txt"))
:::
```

The full raw text can be accessed through the `text` attribute:

:::{code}
```{code} python
print(doc.text)
:::
```

A `TextDocument` can store {class}`~medkit.core.text.TextAnnotation` objects.
For now, our document is free of annotations.
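For example, assuming the annotation container supports `len()` (it behaves roughly like a list), an empty document can be checked with:

```python
# The container of a freshly loaded document should report zero annotations.
print(len(doc.anns))  # expected output: 0
```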
@@ -36,14 +36,14 @@ documents in sentences.
including a rule-based {class}`~medkit.text.segmentation.SentenceTokenizer` class
that relies on a list of punctuation characters.

:::{code}
```{code} python
from medkit.text.segmentation import SentenceTokenizer
sent_tokenizer = SentenceTokenizer(
output_label="sentence",
punct_chars=[".", "?", "!"],
)
:::
```

Like all operations, `SentenceTokenizer` defines a `run()` method.

@@ -54,14 +54,14 @@ and returns a list of `Segment` objects.
Here, we can pass a special `Segment` containing the full text of the document,
which can be retrieved through the `raw_segment` attribute of `TextDocument`:

:::{code}
```{code} python
sentences = sent_tokenizer.run([doc.raw_segment])
for sentence in sentences:
print(f"uid={sentence.uid}")
print(f"text={sentence.text!r}")
print(f"spans={sentence.spans}, label={sentence.label}\n")
:::
```

Each segment features:
- a `uid` attribute, whose unique value is automatically generated;
@@ -76,10 +76,10 @@ Each segment features:
If you take a look at the 13th and 14th detected sentences,
you will notice something strange:

:::{code}
```{code} python
print(repr(sentences[12].text))
print(repr(sentences[13].text))
:::
```

This is actually one sentence that was split into two segments,
because the sentence tokenizer incorrectly considers the dot in the decimal weight value
@@ -92,31 +92,31 @@ For this, we can use the {class}`~medkit.text.preprocessing.RegexpReplacer` clas
a regexp-based "search-and-replace" operation.
Like other `medkit` operations, it can be configured with a set of user-defined rules:

:::{code}
```{code} python
from medkit.text.preprocessing import RegexpReplacer
rule = (r"(?<=\d)\.(?=\d)", ",") # => (pattern to replace, new text)
regexp_replacer = RegexpReplacer(output_label="clean_text", rules=[rule])
:::
```

The `run()` method of the normalizer takes a list of `Segment` objects
and returns a list of new `Segment` objects, one for each input `Segment`.
In our case we only want to preprocess the full raw text segment,
and we will only receive one preprocessed segment,
so we can call it with:

:::{code}
```{code} python
clean_segment = regexp_replacer.run([doc.raw_segment])[0]
print(clean_segment.text)
:::
```

We may now use our previously defined sentence tokenizer again,
but this time on the preprocessed text:

:::{code}
```{code} python
sentences = sent_tokenizer.run([clean_segment])
print(sentences[12].text)
:::
```

Problem fixed!

@@ -126,7 +126,7 @@ The `medkit` library also comes with operations to perform NER (named entity rec
for instance with {class}`~medkit.text.ner.regexp_matcher.RegexpMatcher`.
Let's instantiate one with a few simple rules:

:::{code}
```{code} python
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
regexp_rules = [
@@ -138,7 +138,7 @@ regexp_rules = [
RegexpMatcherRule(regexp=r"\bnasonex?\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
:::
```

As you can see, you can also define some rules that ignore case distinctions
by setting the `case_sensitive` parameter to `False`.
@@ -162,13 +162,13 @@ representing the entities that were matched (`Entity` is a subclass of `Segment`
As input, it expects a list of `Segment` objects.
Let's give it the sentences returned by the sentence tokenizer:

:::{code}
```{code} python
entities = regexp_matcher.run(sentences)
for entity in entities:
print(f"uid={entity.uid}")
print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}\n")
:::
```

Just like sentences, each entity features `uid`, `text`, `spans` and `label` attributes
(in this case, determined by the rule that was used to match it).
@@ -192,7 +192,7 @@ accessible through their {class}`~medkit.core.AttributeContainer`).
Let's instantiate a `NegationDetector` with a couple of simplistic handcrafted rules
and run it on our sentences:

:::{code}
```{code} python
from medkit.text.context import NegationDetector, NegationDetectorRule
neg_rules = [
@@ -202,7 +202,7 @@ neg_rules = [
]
neg_detector = NegationDetector(output_label="is_negated", rules=neg_rules)
neg_detector.run(sentences)
:::
```

:::{note}
Similarly to `RegexpMatcher`, `NegationDetector` also comes with a set of default rules
@@ -213,12 +213,12 @@ located in the `medkit.text.context` module.

And now, let's look at which sentences have been detected as negated:

:::{code}
```{code} python
for sentence in sentences:
neg_attr = sentence.attrs.get(label="is_negated")[0]
if neg_attr.value:
print(sentence.text)
:::
```

Our simple negation detector does not work too badly,
but sometimes part of a sentence is tagged as negated whilst the rest is not,
@@ -235,7 +235,7 @@ which are stored in file `default_syntagma_definition.yml`
located in the `medkit.text.segmentation` module.
:::

:::{code}
```{code} python
from medkit.text.segmentation import SyntagmaTokenizer
synt_tokenizer = SyntagmaTokenizer(
@@ -249,7 +249,7 @@ for syntagma in syntagmas:
neg_attr = syntagma.attrs.get(label="is_negated")[0]
if neg_attr.value:
print(syntagma.text)
:::
```

We now have some information about negation attached to syntagmas,
but the end goal is really to know, for each entity,
@@ -268,14 +268,14 @@ Let's again use a `RegexpMatcher` to find some entities,
but this time from syntagmas rather than from sentences,
and using `attrs_to_copy` to copy negation attributes:

:::{code}
```{code} python
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])
entities = regexp_matcher.run(syntagmas)
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(f"text='{entity.text}', label={entity.label}, is_negated={neg_attr.value}")
:::
```

We now have a negation `Attribute` for each entity!

@@ -293,21 +293,21 @@ an instance of {class}`~medkit.core.text.TextAnnotationContainer`)
that behaves roughly like a list but also offers additional filtering methods.
Annotations can be added by calling its `add()` method:

:::{code}
```{code} python
for entity in entities:
doc.anns.add(entity)
:::
```
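The additional filtering methods mentioned above can be used to retrieve annotations by label; a small hedged example, assuming `get()` accepts a `label` keyword as it does for attributes:

```python
# Retrieve only the annotations labelled "problem" from the document.
problem_entities = doc.anns.get(label="problem")
for entity in problem_entities:
    print(entity.label, entity.text)
```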

The document and its corresponding entities can be exported to supported formats
such as brat (see {class}`~medkit.io.brat.BratOutputConverter`)
or Doccano (see {class}`~medkit.io.doccano.DoccanoOutputConverter`),
or serialized to JSON (see {mod}`~medkit.io.medkit_json`):

:::{code}
```{code} python
from medkit.io import medkit_json
medkit_json.save_text_document(doc, "doc_1.json")
:::
```
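A brat export could look like the following sketch; the exact `BratOutputConverter.save()` signature is an assumption here, so check the API reference before relying on it:

```python
from medkit.io.brat import BratOutputConverter

# Hedged sketch: write the document and its entities as brat .txt/.ann files.
brat_converter = BratOutputConverter()
brat_converter.save([doc], dir_path="brat_output")
```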

## Visualizing entities with displacy

@@ -316,13 +316,13 @@ a visualization tool part of the [spaCy](https://spacy.io/) NLP library.
`medkit` provides helper functions to facilitate the use of `displacy`
in the {mod}`~medkit.text.spacy.displacy_utils` module:

:::{code}
```{code} python
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent")
:::
```

## Wrapping it up

19 changes: 11 additions & 8 deletions docs/user_guide/module.md
@@ -57,14 +57,17 @@ segment.

```python
class MyTokenizer(SegmentationOperation):
...

def _tokenize(self, segment: Segment) -> Segment:
"""Custom method for segment tokenization."""
...

def run(self, segments: List[Segment]) -> List[Segment]:
# Here is your code for the tokenizer:
# * process each input
return [
token
for segment in segments
for token in self._mytokenmethod(segment)
for token in self._tokenize(segment)
]
```

## 3. Make your operation non-destructive (for text)
@@ -85,7 +88,7 @@ segments.
```python
class MyTokenizer(SegmentationOperation):
...
def _mytokenmethod(self, segment):
def _tokenize(self, segment):
# process the segment (e.g., cut the segment)
size = len(segment)
cut_index = size // 2
@@ -140,7 +143,7 @@ Here is our example which store information about:
```python
class MyTokenizer(SegmentationOperation):
...
def _mytokenmethod(self, segment):
def _tokenize(self, segment):
...

# save the provenance data for this operation
@@ -166,7 +169,7 @@ To illustrate what we have seen in a more concrete manner, here is a fictional
"days of the week" matcher that takes text segments as input a return entities
for week days:

:::{code}
```python
import re
from medkit.core import Operation
from medkit.core.text import Entity, span_utils
@@ -222,7 +225,7 @@ class DayMatcher(Operation):
)

return entities
:::
```

Note that since this is an entity matcher, adding support for `attrs_to_copy`
would be nice (cf. [Context detection](../tutorial/context_detection.md)), as sketched below.
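A minimal sketch of such support, assuming attributes expose a `copy()` method as used by the built-in matchers, would copy the requested attributes from each input segment onto the matched entity:

```python
# Hypothetical addition to DayMatcher.run(), after creating `entity` from `segment`:
# copy over the attributes whose labels were requested via an `attrs_to_copy` parameter.
for label in self.attrs_to_copy:
    for attr in segment.attrs.get(label=label):
        entity.attrs.add(attr.copy())
```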
58 changes: 29 additions & 29 deletions docs/user_guide/pipeline.md
@@ -8,7 +8,7 @@ and how to create pipelines to enrich documents.
Let's reuse the preprocessing, segmentation, context detection and entity recognition operations
from the [First steps](./first_steps.md) tutorial:

:::{code}
```{code} python
from medkit.text.preprocessing import RegexpReplacer
from medkit.text.segmentation import SentenceTokenizer, SyntagmaTokenizer
from medkit.text.context import NegationDetector, NegationDetectorRule
@@ -30,14 +30,14 @@ syntagma_tokenizer = SyntagmaTokenizer(
)
# context detection
neg_rules = [
negation_rules = [
NegationDetectorRule(regexp=r"\bpas\s*d[' e]\b"),
NegationDetectorRule(regexp=r"\bsans\b", exclusion_regexps=[r"\bsans\s*doute\b"]),
NegationDetectorRule(regexp=r"\bne\s*semble\s*pas"),
]
negation_detector = NegationDetector(
output_label="is_negated",
rules=neg_rules,
rules=negation_rules,
)
# entity recognition
@@ -50,13 +50,13 @@ regexp_rules = [
RegexpMatcherRule(regexp=r"\bnasonex?\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])
:::
```

Each of these operations features a `run()` method, which can be called sequentially.
Data needs to be routed manually between the inputs and outputs of each operation,
using a document's raw text segment as the initial input:

:::{code}
```{code} python
from pathlib import Path
from medkit.core.text import TextDocument
@@ -74,7 +74,7 @@ syntagmas = syntagma_tokenizer.run(sentences)
# but rather appends attributes to the segments it received.
negation_detector.run(syntagmas)
entities = regexp_matcher.run(syntagmas)
:::
```

This way of coding is useful for interactive exploration of `medkit`.
In the next section, we will introduce a different way using `Pipeline` objects.
@@ -105,7 +105,7 @@ But we also need to "connect" the operations together,
i.e. to indicate which output of an operation should be fed as input to another operation.
This is the purpose of the {class}`~medkit.core.PipelineStep` objects:

:::{code}
```{code} python
from medkit.core import PipelineStep
steps = [
@@ -115,13 +115,13 @@ steps = [
PipelineStep(negation_detector, input_keys=["syntagmas"], output_keys=[]), # no output
PipelineStep(regexp_matcher, input_keys=["syntagmas"], output_keys=["entities"]),
]
:::
```

Each `PipelineStep` associates an operation with input and output _keys_.
Pipeline steps with matching input and output keys will be connected to each other.
The resulting pipeline can be represented like this:

:::{mermaid}
```{mermaid}
---
align: center
---
@@ -143,11 +143,11 @@ graph TD
F --> G
classDef io fill:#fff4dd,stroke:#edb:
:::
```

Pipeline steps can then be used to instantiate a {class}`~medkit.core.Pipeline` object:

:::{code}
```{code} python
from medkit.core import Pipeline
pipeline = Pipeline(
@@ -162,7 +162,7 @@ pipeline = Pipeline(
# (and therefore that it should be the output of the regexp matcher)
output_keys=["entities"]
)
:::
```

The resulting pipeline is functionally equivalent to a single operation
processing full text segments as input and returning entities with negation attributes as output.
@@ -171,13 +171,13 @@ but more complex pipelines with multiple inputs and outputs are supported.

Like any other operation, the pipeline can be evaluated using its `run` method:

:::{code}
```{code} python
entities = pipeline.run([doc.raw_segment])
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(f"text='{entity.text}', label={entity.label}, is_negated={neg_attr.value}")
:::
```

## Nested pipelines

@@ -188,7 +188,7 @@ which can be used, tested and exercised in isolation.
In our example, we can use this feature to group our regexp replacer,
sentence tokenizer, syntagma tokenizer and negation detector into a context sub-pipeline:

:::{code}
```{code} python
# Context pipeline that receives full text segments
# and returns preprocessed syntagmas segments with negation attributes.
context_pipeline = Pipeline(
@@ -197,20 +197,20 @@ context_pipeline = Pipeline(
name="context",
steps=[
PipelineStep(regexp_replacer, input_keys=["full_text"], output_keys=["clean_text"]),
PipelineStep(sent_tokenizer, input_keys=["clean_text"], output_keys=["sentences"]),
PipelineStep(synt_tokenizer, input_keys=["sentences"], output_keys=["syntagmas"]),
PipelineStep(neg_detector, input_keys=["syntagmas"], output_keys=[]),
PipelineStep(sentence_tokenizer, input_keys=["clean_text"], output_keys=["sentences"]),
PipelineStep(syntagma_tokenizer, input_keys=["sentences"], output_keys=["syntagmas"]),
PipelineStep(negation_detector, input_keys=["syntagmas"], output_keys=[]),
],
input_keys=["full_text"],
output_keys=["syntagmas"],
)
:::
```

Likewise, we can introduce a NER sub-pipeline
composed of a UMLS-based matching operation (see also [Entity Matching](../tutorial/entity_matching.md))
grouped with the previously defined regexp matcher:

:::{code}
```{code} python
from medkit.text.ner import UMLSMatcher
umls_matcher = UMLSMatcher(
@@ -231,15 +231,15 @@ ner_pipeline = Pipeline(
input_keys=["syntagmas"],
output_keys=["entities"],
)
:::
```

Since both pipeline steps feature the same output key (_entities_),
the pipeline will return a list containing the entities matched by
both the regexp matcher and the UMLS matcher.

The NER and context sub-pipelines can now be sequenced with:

:::{code}
```{code} python
pipeline = Pipeline(
steps=[
PipelineStep(context_pipeline, input_keys=["full_text"], output_keys=["syntagmas"]),
@@ -248,7 +248,7 @@ pipeline = Pipeline(
input_keys=["full_text"],
output_keys=["entities"],
)
:::
```

which can be represented like this:

@@ -287,14 +287,14 @@ graph TD

Let's run the pipeline and verify entities with negation attributes:

:::{code}
```{code} python
entities = pipeline.run([doc.raw_segment])
for entity in entities:
neg_attr = entity.attrs.get(label="is_negated")[0]
print(entity.label, ":", entity.text)
print("negation:", neg_attr.value, end="\n\n")
:::
```

```text
problem : allergies
@@ -393,28 +393,28 @@ To scale the processing of such pipeline to a collection of documents,
one needs to iterate over each document manually to obtain its entities
rather than processing all the documents at once:

:::{code}
```{code} python
docs = TextDocument.from_dir(Path("../data/text"))
for doc in docs:
entities = pipeline.run([doc.raw_segment])
for entity in entities:
doc.anns.add(entity)
:::
```

To handle this common use case, `medkit` provides a {class}`~medkit.core.DocPipeline` class,
which wraps a `Pipeline` instance and runs it on a list of documents.

Here is an example of its usage:

:::{code}
```{code} python
from medkit.core import DocPipeline
docs = TextDocument.from_dir(Path("../data/text"))
doc_pipeline = DocPipeline(pipeline=pipeline)
doc_pipeline.run(docs)
:::
```
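After the document pipeline has run, each document holds its own annotations; a quick check, again assuming the annotation container supports `len()`:

```python
# Each document should now carry the entities produced by the pipeline.
for doc in docs:
    print(doc.uid, len(doc.anns))
```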

## Summary

70 changes: 35 additions & 35 deletions docs/user_guide/provenance.md
@@ -25,7 +25,7 @@ and take a look at provenance for a single annotation, generated by a single ope
We are going to create a very simple `TextDocument` containing just one sentence,
and run a `RegexpMatcher` to match a single `Entity`:

:::{code}
```{code} python
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
@@ -34,32 +34,32 @@ doc = TextDocument(text=text)
regexp_rule = RegexpMatcherRule(regexp=r"\basthme\b", label="problem")
regexp_matcher = RegexpMatcher(rules=[regexp_rule])
:::
```

Before calling the `run()` method of our regexp matcher,
we will activate provenance tracing for the generated entities.
This is done by assigning it a {class}`~medkit.core.ProvTracer` object.
The `ProvTracer` is in charge of gathering provenance information across all operations.

:::{code}
```{code} python
from medkit.core import ProvTracer
prov_tracer = ProvTracer()
regexp_matcher.set_prov_tracer(prov_tracer)
:::
```

Now that provenance is enabled, the regexp matcher can be applied to the input document:

:::{code}
```{code} python
entities = regexp_matcher.run([doc.raw_segment])
for entity in entities:
print(f"text={entity.text!r}, label={entity.label}")
:::
```

Let's retrieve and inspect provenance information concerning the matched entity:

:::{code}
```{code} python
def print_prov(prov):
# data item
print(f"data_item={prov.data_item.text!r}")
@@ -74,7 +74,7 @@ def print_prov(prov):
entity = entities[0]
prov = prov_tracer.get_prov(entity.uid)
print_prov(prov)
:::
```

The `get_prov()` method of `ProvTracer` returns a simple {class}`~medkit.core.Prov` object
containing all the provenance information related to a specific object.
@@ -89,17 +89,17 @@ It features the following attributes:
Here there is only one source, the raw text segment,
because the entity was found in this particular segment by the regexp matcher.
But it is possible to have more than one data item in the sources;
- `derived_data_items` contains the objects that were derived from the data item by further operations.
In this simple example, there are none.

If we are interested in all the provenance information gathered by the `ProvTracer` instance,
rather than the provenance of a specific item,
then we can call the `get_provs()` method:

:::{code}
```{code} python
for prov in prov_tracer.get_provs():
print_prov(prov)
:::
```

Here, we have another `Prov` object with partial provenance information about the raw text segment:
we know how it was used (the entity was derived from it) but we don't know how it was created.
@@ -118,7 +118,7 @@ It also provides command-line executable named `dot` to generate images from suc
You will need to install `graphviz` on your system to be able to run the following code.
:::

:::{code}
```{code} python
from pathlib import Path
from IPython.display import Image
from medkit.tools import save_prov_to_dot
@@ -144,15 +144,15 @@ dot_file = output_dir / "prov.dot"
save_prov_to_dot(prov_tracer, dot_file)
display_dot(dot_file)
:::
```
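The definition of the `display_dot()` helper falls outside this hunk; a minimal sketch, assuming the graphviz `dot` executable is available on the PATH, could be:

```python
import subprocess
from IPython.display import Image

def display_dot(dot_file):
    # Render the .dot file to a PNG with the graphviz CLI and display it inline.
    png_file = dot_file.with_suffix(".png")
    subprocess.run(["dot", "-Tpng", str(dot_file), "-o", str(png_file)], check=True)
    return Image(filename=str(png_file))
```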

## Provenance composition

Let's move on to a slightly more complex example.
Before using the `RegexpMatcher`, we will split our document into sentences with a `SentenceTokenizer`.
We will also compose the `SentenceTokenizer` and our `RegexpMatcher` operations in a `Pipeline`.

:::{code}
```{code} python
from medkit.text.segmentation import SentenceTokenizer
from medkit.core.pipeline import PipelineStep, Pipeline
@@ -166,7 +166,7 @@ steps = [
PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(steps=steps, input_keys=["full_text"], output_keys=["entities"])
:::
```

Since a pipeline is itself an operation, it also features a `set_prov_tracer()` method,
and calling it will automatically enable provenance tracing for all the operations in the pipeline.
@@ -175,23 +175,23 @@ and calling it will automatically enable provenance tracing for all the operatio
Provenance tracers can only accumulate provenance information, not modify or delete it.
:::

:::{code}
```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
print(f"text={entity.text!r}, label={entity.label}")
:::
```

As expected, the result is identical to the first example: we have matched one entity.
However, its provenance is structured differently:

:::{code}
```{code} python
for prov in prov_tracer.get_provs():
print_prov(prov)
:::
```

Compared to the simpler case, the operation that created the entity is the `Pipeline`, instead of the `RegexpMatcher`.
It might sound a little surprising, but it does make sense: the pipeline is a processing operation itself,
@@ -202,12 +202,12 @@ If we are interested in the details about what happened inside the `Pipeline`,
the information is still available through a sub-provenance tracer
that can be retrieved with `get_sub_prov_tracer()`:

:::{code}
```{code} python
pipeline_prov_tracer = prov_tracer.get_sub_prov_tracer(pipeline.uid)
for prov in pipeline_prov_tracer.get_provs():
print_prov(prov)
:::
```

Although `get_provs()` does not return the `Prov` objects in the order the annotations were created,
we can see the details of what happened in the pipeline.
@@ -220,17 +220,17 @@ The `save_prov_to_dot()` helper is able to leverage this structure.
By default, it expands and displays all sub-provenance information recursively,
but it has an optional `max_sub_prov_depth` parameter that limits the depth of sub-provenance shown:

:::{code}
```{code} python
# show only outer-most provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
:::
```

:::{code}
```{code} python
# expand next level of sub-provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
:::
```

Just as pipelines can contain sub-pipelines recursively,
the provenance tracer can contain sub-provenance tracers for the corresponding sub-pipelines.
@@ -247,7 +247,7 @@ To demonstrate a bit more the potential of provenance tracing in `medkit`,
let's build a more complicated pipeline involving a sub-pipeline
and an operation that creates attributes:

:::{code}
```{code} python
from medkit.text.context import NegationDetector, NegationDetectorRule
# segmentation
@@ -284,30 +284,30 @@ pipeline = Pipeline(
input_keys=["full_text"],
output_keys=["entities"],
)
:::
```

Since there are 2 pipelines, we pass an optional `name` parameter to each of them;
it is used in the operation description and helps us distinguish between them.

Running the main pipeline returns 2 entities with negation attributes:

:::{code}
```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
is_negated = entity.attrs.get(label="is_negated")[0].value
print(f"text={entity.text!r}, label={entity.label}, is_negated={is_negated}")
:::
```

At the outermost level, provenance tells us that the main pipeline created 2 entities and 2 attributes.
Intermediary data and operations (`SentenceTokenizer`, `NegationDetector`, `RegexpMatcher`) are hidden.

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
:::
```

You can see dotted arrows showing which attribute relates to which annotation.
While this is not strictly speaking provenance information,
@@ -318,10 +318,10 @@ are copied to new annotations (cf `attrs_to_copy` as explained in the

Expanding one more level of provenance gives us the following graph:

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
:::
```

Now, we can see the details of the operations and data items handled in our main pipeline.
A sub-pipeline created sentence segments and negation attributes,
@@ -330,10 +330,10 @@ The negation attributes were attached to both the sentences and the entities der

To have more details about the processing inside the context sub-pipeline, we have to go one level deeper:

:::{code}
```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=2)
display_dot(dot_file)
:::
```

## Wrapping it up

14 changes: 10 additions & 4 deletions pyproject.toml
@@ -164,9 +164,10 @@ all = [
webrtc-voice-detector]""",
]
docs = [
"myst-parser",
"myst-nb",
"numpydoc",
"sphinx>=7,<8",
"pandas",
"sphinx",
"sphinx-autoapi",
"sphinx-autobuild",
"sphinx-book-theme",
@@ -207,15 +208,20 @@ cov = [

[tool.hatch.envs.docs]
dependencies = [
"myst-parser",
"myst-nb",
"numpydoc",
"sphinx>=7,<8",
"pandas",
"sphinx",
"sphinx-autoapi",
"sphinx-autobuild",
"sphinx-book-theme",
"sphinx-design",
"sphinxcontrib-mermaid",
]
features = [
"spacy",
]
python = "3.12"

[tool.hatch.envs.docs.scripts]
clean = "rm -rf docs/_build"