
OPT20014: Create Structure Annotations for Chonjuk Data #30

Open
4 of 6 tasks
tenzin3 opened this issue Sep 5, 2024 · 7 comments
tenzin3 commented Sep 5, 2024

Description

We are planning to implement the DTS (Distributed Text Services) specification to develop a text API. One of its key endpoints is the Navigation endpoint, which lets users retrieve specific passages by navigating through a document's structure. To ensure proper functionality and testing of this feature, we are preparing structured annotations for the Chonjuk dataset.

Requirement

  • The source text must be the Chonjuk data.
  • Each structural annotation must include the necessary metadata, such as an ID, a title, etc.
  • Higher-level structural annotations must be able to reference (call) lower-level structural annotations.

Chonjuk Data and Annotation Illustration

[Image: illustration of the Chonjuk data and its annotation levels]

Expected Output

An OPF/Pecha for Chonjuk with three levels of structural annotations.

Implementation Steps

  • Text segmenter
  • Annotate in STAM
  • Save annotation
  • Metadata annotation

Annotation on annotation:

  • Load the higher-level annotation
  • Annotate on the higher-level annotation

tenzin3 commented Sep 9, 2024

Chonjuk data chosen.


tenzin3 commented Sep 9, 2024

The Pecha Parser is designed with the following key principles in mind:

  • High Abstraction: The parser provides a high level of abstraction to simplify its use.
  • Custom Pipeline Flexibility: Users can create custom pipelines to suit their specific needs.

The parser operates with the following logic:

  1. Input Text: Accepts the text to be processed.
  2. Segmenter: Segments the text using one of the following methods:
    • Space Segmenter
    • New Line Segmenter
    • Regex Segmenter
  3. Annotation Name: Assigns a name to the segmented text. The annotation name must be selected from a predefined list of enums.
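The three segmenter variants can be sketched in plain Python; the function names are illustrative, not the parser's actual API:

```python
import re

def space_segmenter(text: str) -> list:
    # Split on runs of whitespace.
    return text.split()

def newline_segmenter(text: str) -> list:
    # Split on new lines, dropping blank lines.
    return [line for line in text.splitlines() if line.strip()]

def regex_segmenter(text: str, pattern: str) -> list:
    # Split wherever the given pattern matches, dropping empty segments.
    return [seg for seg in re.split(pattern, text) if seg]

print(newline_segmenter("line one\n\nline two\n"))  # ['line one', 'line two']
```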


tenzin3 commented Sep 12, 2024

Reading annotations and their annotation data in STAM

from stam import AnnotationStore

# Load an existing annotation store from its STAM JSON file.
stam_obj = AnnotationStore(file="annotation_store_path_str.json")

# Collect every annotation in the store.
anns = list(stam_obj.annotations())

ann_data = []
for ann in anns:
    curr_data = {}
    curr_data["content"] = str(ann)  # the text covered by this annotation
    # An annotation is iterable over its AnnotationData entries.
    for data in ann:
        curr_data[data.key().id()] = str(data.value())
    ann_data.append(curr_data)

print(ann_data)


tenzin3 commented Sep 12, 2024

Issues with the new AnnotationSubStore

  1. Location of annotation data:
    When creating annotations in the AnnotationStore based on those contained in the AnnotationSubStore using an annotation selector, the annotation data is being stored in the AnnotationSubStore instead of the AnnotationStore.

  2. AnnotationStore annotations function:
    The annotations function in AnnotationStore works correctly, but it also retrieves annotations from the AnnotationSubStore.

  3. @include path:
    Because of our defined data folder structure, the base file and the annotation store files must live in separate folders. Instead of saving the AnnotationStore directly with set_filename and save, we convert the store to a JSON string with to_json_string and rewrite the base-file path to a relative form such as ../../base/7906.txt. This lets us keep the files in different folders and improves usability. However, when we call to_json_string on an AnnotationStore that depends on an AnnotationSubStore, the @include paths we had rewritten to relative form in the AnnotationSubStore are automatically converted back to absolute paths.

  4. Unable to load AnnotationStore with multi-layer AnnotationSubStore:
    When attempting to load an AnnotationStore with multiple layers of AnnotationSubStore, it fails with the error stam.PyStamError: [StamError] DeserializationError: Deserialization failed: Expected string or array for @include in AnnotationStore.
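The path-rewriting workaround from point 3 can be sketched with the standard library only. The resources/@include layout is assumed from the STAM JSON format, and the paths are illustrative:

```python
import json

def rewrite_include_path(store_json: str, old_path: str, new_path: str) -> str:
    """Return the store JSON with one @include path rewritten (e.g. to a relative form)."""
    obj = json.loads(store_json)
    for resource in obj.get("resources", []):
        if resource.get("@include") == old_path:
            resource["@include"] = new_path
    return json.dumps(obj, ensure_ascii=False, indent=2)

# e.g. rewrite_include_path(store.to_json_string(),
#                           "/abs/path/base/7906.txt", "../../base/7906.txt")
```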


tenzin3 commented Sep 12, 2024

I have created an issue regarding STAM here.

@tenzin3 tenzin3 changed the title OPT20005: Create Structure Annotations for Chonjuk Data OPT20014: Create Structure Annotations for Chonjuk Data Sep 17, 2024

tenzin3 commented Sep 17, 2024

Framework Design

1. Text Processing:

  • Split the text into atomic units.
  • An atomic unit is defined as a string split by a new line.

2. Condition Check:

  • Verify whether the atomic units contain specific annotations.
  • A single regex sometimes cannot extract all annotations.

3. Modular Design:

  • Each function should perform only one task to ensure high reusability.

4. Custom Pipeline:

  • Users should be able to create their own custom processing pipelines.

[Image: framework design diagram]
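The four principles above can be sketched as a tiny pipeline of single-purpose functions. This is plain Python for illustration; the step names and the chapter pattern are made up, not the framework's real API:

```python
import re
from typing import Callable, List

Step = Callable[[List[str]], List[str]]

def split_atomic_units(units: List[str]) -> List[str]:
    # Text processing: an atomic unit is a string split by a new line.
    return [u for unit in units for u in unit.splitlines() if u.strip()]

def keep_chapter_units(units: List[str]) -> List[str]:
    # Condition check: keep only units matching one annotation pattern.
    # (A single regex may miss annotations, so each check is its own step.)
    return [u for u in units if re.search(r"^Chapter", u)]

def run_pipeline(text: str, steps: List[Step]) -> List[str]:
    # Custom pipeline: users compose their own ordered list of steps.
    units = [text]
    for step in steps:
        units = step(units)
    return units

result = run_pipeline("Chapter 1\nsome verse\nChapter 2",
                      [split_atomic_units, keep_chapter_units])
print(result)  # ['Chapter 1', 'Chapter 2']
```

Because each step has the same signature, steps stay reusable and pipelines are just ordered lists.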


tenzin3 commented Sep 18, 2024

For metadata

ann_store: id = Pecha ID
ann_data_set: id = Meta_Data

For Translation

ann_store: id = Pecha ID
ann_data_set: id = Translation 

ann_data:
    key: Translation_Segment
    value: Tibetan_Segment

    key: Translation_Segment
    value: English_Segment

For Root and Commentary

ann_store: id = Pecha ID
ann_data_set: id = Root_Commentary

ann_data:
    key: Associated_Alignment    
    value: Root_Segment

    key: Associated_Alignment
    value: Commentary_Segment

For OPF

ann_store: id = Pecha ID
ann_data_set: id = Structure_Annotation

ann_data:
    key: Structure_Type
    value: Chapter

    key: Structure_Type
    value: Tsawa

    key: Structure_Type
    value: Meaning_Segment
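For illustration, the OPF structure-annotation scheme above expressed as plain data (the Pecha ID is hypothetical):

```python
structure_annotation = {
    "ann_store": {"id": "I123456"},  # hypothetical Pecha ID
    "ann_data_set": {"id": "Structure_Annotation"},
    "ann_data": [
        {"key": "Structure_Type", "value": "Chapter"},
        {"key": "Structure_Type", "value": "Tsawa"},
        {"key": "Structure_Type", "value": "Meaning_Segment"},
    ],
}

# All three structural levels share one key and differ only in value.
levels = [d["value"] for d in structure_annotation["ann_data"]]
```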
