OPT20031: Planning for Toolkit Pipeline Implementation #88

Open
tenzin3 opened this issue Dec 3, 2024 · 0 comments
tenzin3 commented Dec 3, 2024

Description

The data team, the toolkit developers, and pecha.org are all working on this pipeline together, so there is a lot of parallel work and frequent updates. To streamline this, we are manually preparing all the necessary inputs (supplied by the data team) and the output JSON files (for pecha.org). Having these files prepared by hand keeps every party clear on, and aligned with, what the pipeline must consume and produce.

Important Note

  • This card is strictly for planning and data preparation.
  • The actual implementation steps will not begin unless approved by drupchen and ngawangtrinley.

Diagram

Image

Implementation Diagram

Image

Transfer mechanism for Translation and Commentary Diagram:
Image
Image

Dummy Data

available here

Input (all Google Docs)

  1. Pecha display segments and their aligned translations

    • Tibetan root text, segmented (Google Doc)
    • English root text, segmented in alignment with the Tibetan root text (Google Doc)
    • Chinese root text, segmented in alignment with the Tibetan root text (Google Doc)
      Note: A single language may have more than one translation.
  2. Every commentary, together with its corresponding root text

    • Tibetan commentary text aligned to its root text
    • English commentary text aligned to its root text
    • Chinese commentary text aligned to its root text

Note: The root text used here is specific to its commentary, and a single language may have more than one commentary.

Output (all JSON files)

  1. Translation alignment

    • JSON file for the Tibetan root text with its aligned English text: Test-Root-Text.json
    • JSON file for the aligned Chinese translation of the root text: Chinese-Test.json
      Note:
      Each segment must start with its root segment ID mapping, if one exists.
      The root segment ID is used to map the segment to its commentary.
  2. Commentary alignment

    • JSON file for the commentary text

Note:
Commentary associated with a root segment must start with that root segment's ID.
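
The root segment ID rule above can be illustrated with a small Python sketch. This issue does not specify how the ID is written at the start of a segment, so the `<12>` style prefix, the regular expression, and the function names `split_root_id` and `group_by_root_id` are all assumptions made for illustration, not the format used in Test-Root-Text.json.

```python
import re
from collections import defaultdict

# ASSUMPTION: the root segment ID is written as "<N>" at the very start of a
# segment. The real marker used in the JSON files is not given in this issue.
ROOT_ID_PREFIX = re.compile(r"^<(\d+)>\s*")


def split_root_id(segment):
    """Return (root_segment_id or None, segment text without the prefix)."""
    match = ROOT_ID_PREFIX.match(segment)
    if match is None:
        # The mapping is optional ("if one exists"), so keep the text as-is.
        return None, segment
    return int(match.group(1)), segment[match.end():]


def group_by_root_id(segments):
    """Group segment texts under the root segment ID they start with."""
    grouped = defaultdict(list)
    for segment in segments:
        root_id, text = split_root_id(segment)
        grouped[root_id].append(text)
    return dict(grouped)
```

Grouping commentary segments this way would make it straightforward to look up every commentary passage attached to a given root segment; segments without a mapping end up under the key `None`.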

View on Pecha.org

  1. Tibetan Root Text and Its Alignment: https://staging.pecha.org/texts/Test

Tasks at hand

Parsers:

  • Convert root text Google Docs to OPF
  • Convert commentary text Google Docs to OPF

Serializer:

  • Serialize the root text OPF and its translation OPFs into their own JSON
  • Serialize commentary OPFs into a separate JSON

Annotation Transfer Scripts:

  • Base Update Script: Updates the root text (aligned with the commentary) to match the root text (aligned with the translation).
  • Layer Update Script: Adjusts the segment and span values based on segmentation changes made from the root text (commentary-aligned) to the root text (translation-aligned).
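
As a rough illustration of what the Layer Update Script has to do, the sketch below remaps character spans from one base text to another using Python's standard `difflib.SequenceMatcher`. This is only a minimal sketch under stated assumptions: the issue does not say which diffing method the toolkit will use, spans are assumed to be `(start, end)` character offsets with an exclusive end, and the names `build_offset_map` and `transfer_span` are hypothetical.

```python
from difflib import SequenceMatcher


def build_offset_map(old_base, new_base):
    """Map every character offset in old_base to an offset in new_base."""
    mapping = [0] * (len(old_base) + 1)
    matcher = SequenceMatcher(None, old_base, new_base, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                # Unchanged text keeps its relative position.
                mapping[i] = j1 + (i - i1)
            else:
                # Replaced or deleted characters collapse onto the start of
                # the changed region, so spans touching them are approximate.
                mapping[i] = j1
    mapping[len(old_base)] = len(new_base)
    return mapping


def transfer_span(span, mapping):
    """Move an (inclusive start, exclusive end) span onto the new base."""
    start, end = span
    new_start = mapping[start]
    if end <= start:
        return new_start, new_start
    # Map the last covered character, then step one past it, so that text
    # inserted right after the span is not swallowed by it.
    new_end = mapping[end - 1] + 1
    return new_start, max(new_start, new_end)
```

For example, with `old_base = "one two three"` and `new_base = "one NEW two three"`, the span `(4, 7)` covering "two" would move to `(8, 11)`. A real script would also need a policy for spans whose text was edited or removed, which this sketch only approximates.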
@OpenPecha OpenPecha deleted a comment from tenzin3 Dec 6, 2024