
OPT20014: Create Structure Annotations for Chonjuk Data #30

Open
4 of 6 tasks
tenzin3 opened this issue Sep 5, 2024 · 7 comments
tenzin3 commented Sep 5, 2024

Description

We are planning to implement the DTS (Distributed Text Services) specification to develop a text API. One of its key endpoints is the Navigation endpoint, which lets users retrieve specific passages by navigating through a document's structure. To ensure proper functionality and testing of this feature, we are preparing structured annotations for the Chonjuk dataset.

Requirement

  • The source text must be the Chonjuk data.
  • Each structural annotation must include the necessary metadata, such as an ID, a title, etc.
  • Higher-level structural annotations must be able to reference (call) lower-level structural annotations.

Chonjuk Data and Annotation Illustration

[Image: illustration of the Chonjuk data and its annotation levels]

Expected Output

An OPF/Pecha for Chonjuk with three levels of structural annotations.

Implementation Steps

  • Text segmenter
  • Annotate in STAM
  • Save annotation
  • Metadata annotation

Annotation on annotation:

  • Load the higher-level annotation
  • Annotate on the higher-level annotation

tenzin3 commented Sep 9, 2024

Chonjuk data chosen.


tenzin3 commented Sep 9, 2024

The Pecha Parser is designed with the following key principles in mind:

  • High Abstraction: The parser provides a high level of abstraction to simplify its use.
  • Custom Pipeline Flexibility: Users can create custom pipelines to suit their specific needs.

The parser operates with the following logic:

  1. Input Text: Accepts the text to be processed.
  2. Segmenter: Segments the text using one of the following methods:
    • Space Segmenter
    • New Line Segmenter
    • Regex Segmenter
  3. Annotation Name: Assigns a name to the segmented text. The annotation name must be selected from a predefined list of enums.
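The three segmenter variants can be sketched in plain Python; the function names are illustrative, not the parser's actual API:

```python
import re

def space_segmenter(text: str) -> list:
    # Split on runs of whitespace.
    return text.split()

def newline_segmenter(text: str) -> list:
    # Split on new lines, dropping blank lines.
    return [line for line in text.splitlines() if line.strip()]

def regex_segmenter(text: str, pattern: str) -> list:
    # Split wherever the given pattern matches, dropping empty segments.
    return [seg for seg in re.split(pattern, text) if seg]

print(newline_segmenter("line one\n\nline two\n"))  # ['line one', 'line two']
```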


tenzin3 commented Sep 12, 2024

Reading annotations and their annotation data in STAM

from stam import AnnotationStore

# Load an existing annotation store from its STAM JSON file.
stam_obj = AnnotationStore(file="annotation_store_path_str.json")

# Collect every annotation in the store.
anns = list(stam_obj.annotations())

ann_data = []
for ann in anns:
    curr_data = {}
    curr_data["content"] = str(ann)  # the text covered by this annotation
    # An annotation is iterable over its AnnotationData entries.
    for data in ann:
        curr_data[data.key().id()] = str(data.value())
    ann_data.append(curr_data)

print(ann_data)


tenzin3 commented Sep 12, 2024

Issues with the new AnnotationSubStore

  1. Location of annotation data:
    When creating annotations in the AnnotationStore based on those contained in the AnnotationSubStore using an annotation selector, the annotation data is being stored in the AnnotationSubStore instead of the AnnotationStore.

  2. AnnotationStore annotations function:
    The annotations function in AnnotationStore works correctly, but it also retrieves annotations from the AnnotationSubStore.

  3. @include path:
    Because of our defined data folder structure, the base file and the annotation store files must live in separate folders. Instead of saving the AnnotationStore directly with set_filename and save, we convert the store to a JSON string with to_json_string and rewrite the base-file path to a relative form such as ../../base/7906.txt. This lets us keep the files in different folders and improves usability. However, when we call to_json_string on an AnnotationStore that depends on an AnnotationSubStore, the @include paths we had rewritten to relative form in the AnnotationSubStore are automatically converted back to absolute paths.

  4. Unable to load AnnotationStore with multi-layer AnnotationSubStore:
    When attempting to load an AnnotationStore with multiple layers of AnnotationSubStore, it fails with the error stam.PyStamError: [StamError] DeserializationError: Deserialization failed: Expected string or array for @include in AnnotationStore.
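The path-rewriting workaround from point 3 can be sketched with the standard library only. The resources/@include layout is assumed from the STAM JSON format, and the paths are illustrative:

```python
import json

def rewrite_include_path(store_json: str, old_path: str, new_path: str) -> str:
    """Return the store JSON with one @include path rewritten (e.g. to a relative form)."""
    obj = json.loads(store_json)
    for resource in obj.get("resources", []):
        if resource.get("@include") == old_path:
            resource["@include"] = new_path
    return json.dumps(obj, ensure_ascii=False, indent=2)

# e.g. rewrite_include_path(store.to_json_string(),
#                           "/abs/path/base/7906.txt", "../../base/7906.txt")
```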


tenzin3 commented Sep 12, 2024

I have created an issue regarding STAM here.

@tenzin3 tenzin3 changed the title OPT20005: Create Structure Annotations for Chonjuk Data OPT20014: Create Structure Annotations for Chonjuk Data Sep 17, 2024

tenzin3 commented Sep 17, 2024

Framework Design

1. Text Processing:

  • Split the text into atomic units.
  • An atomic unit is defined as a string split by a new line.

2. Condition Check:

  • Verify whether the atomic units contain specific annotations.
  • A single regex sometimes cannot extract all annotations.

3. Modular Design:

  • Each function should perform only one task to ensure high reusability.

4. Custom Pipeline:

  • Users should be able to create their own custom processing pipelines.

[Image: framework design diagram]
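The four principles above can be sketched as a tiny pipeline of single-purpose functions. This is plain Python for illustration; the step names and the chapter pattern are made up, not the framework's real API:

```python
import re
from typing import Callable, List

Step = Callable[[List[str]], List[str]]

def split_atomic_units(units: List[str]) -> List[str]:
    # Text processing: an atomic unit is a string split by a new line.
    return [u for unit in units for u in unit.splitlines() if u.strip()]

def keep_chapter_units(units: List[str]) -> List[str]:
    # Condition check: keep only units matching one annotation pattern.
    # (A single regex may miss annotations, so each check is its own step.)
    return [u for u in units if re.search(r"^Chapter", u)]

def run_pipeline(text: str, steps: List[Step]) -> List[str]:
    # Custom pipeline: users compose their own ordered list of steps.
    units = [text]
    for step in steps:
        units = step(units)
    return units

result = run_pipeline("Chapter 1\nsome verse\nChapter 2",
                      [split_atomic_units, keep_chapter_units])
print(result)  # ['Chapter 1', 'Chapter 2']
```

Because each step has the same signature, steps stay reusable and pipelines are just ordered lists.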


tenzin3 commented Sep 18, 2024

For metadata

ann_store: id = Pecha ID
ann_data_set: id = Meta_Data

For Translation

ann_store: id = Pecha ID
ann_data_set: id = Translation 

ann_data:
    key: Translation_Segment
    value: Tibetan_Segment

    key: Translation_Segment
    value: English_Segment

For Root and Commentary

ann_store: id = Pecha ID
ann_data_set: id = Root_Commentary

ann_data:
    key: Associated_Alignment    
    value: Root_Segment

    key: Associated_Alignment
    value: Commentary_Segment

For OPF

ann_store: id = Pecha ID
ann_data_set: id = Structure_Annotation

ann_data:
    key: Structure_Type
    value: Chapter

    key: Structure_Type
    value: Tsawa

    key: Structure_Type
    value: Meaning_Segment
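For illustration, the OPF structure-annotation scheme above expressed as plain data (the Pecha ID is hypothetical):

```python
structure_annotation = {
    "ann_store": {"id": "I123456"},  # hypothetical Pecha ID
    "ann_data_set": {"id": "Structure_Annotation"},
    "ann_data": [
        {"key": "Structure_Type", "value": "Chapter"},
        {"key": "Structure_Type", "value": "Tsawa"},
        {"key": "Structure_Type", "value": "Meaning_Segment"},
    ],
}

# All three structural levels share one key and differ only in value.
levels = [d["value"] for d in structure_annotation["ann_data"]]
```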
