Hindi TN Support for Cardinal, Decimal, Fraction, Date, Time, Money and Measure #241

Merged: 25 commits merged into NVIDIA:main on Nov 18, 2024

@ngachchi ngachchi commented Oct 30, 2024

What does this PR do?

This PR introduces Hindi support for a wide range of numerical and temporal formats, including:

  • Cardinal numbers: Natural numbers (e.g., एक, दो, तीन)
  • Decimal numbers: Numbers with decimal points (e.g., दो दशमलव पांच)
  • Fractions: Rational numbers expressed as ratios (e.g., एक बटा दो)
  • Dates: Various date formats (e.g., आज, कल, १४ नवंबर २०२४)
  • Time: Time formats (e.g., दो बजकर पांच मिनट)
  • Money: Monetary amounts (e.g., दस रुपये पचास पैसे)
  • Measure: Units of measurement (e.g., दस किलोमीटर)
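
To make the idea of text normalization (written form to spoken form) concrete, here is a toy sketch in plain Python. It is not how this PR implements the grammars: the checklist below refers to `pynini_export.py` and Sparrowhawk, which suggests the real grammars are pynini-based. The table `CARDINALS` and the function `verbalize_cardinals` are hypothetical names introduced only for illustration, and the table covers only number words quoted in the list above.

```python
# Toy illustration only; hypothetical names, not the grammars added by this PR.
import re

# Lookup table limited to number words quoted in the PR description.
CARDINALS = {
    "1": "एक",
    "2": "दो",
    "3": "तीन",
    "5": "पांच",
    "10": "दस",
    "50": "पचास",
}

# Map Devanagari digits to ASCII digits so that "१०" and "10" are treated alike.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def verbalize_cardinals(text: str) -> str:
    """Replace each digit sequence found in the toy table with its spoken form."""

    def replace(match: re.Match) -> str:
        digits = match.group(0).translate(DEVANAGARI_TO_ASCII)
        # Numbers outside the toy table are left untouched.
        return CARDINALS.get(digits, match.group(0))

    return re.sub(r"[0-9०-९]+", replace, text)


print(verbalize_cardinals("10 किलोमीटर"))
print(verbalize_cardinals("१० रुपये ५० पैसे"))
```

A dictionary lookup like this cannot compose arbitrary numbers or handle context, which is one reason production normalizers use composable grammars instead.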

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unit tests finish successfully before sending the PR?
    1. pytest or (if your machine does not have a GPU) pytest --cpu from the root folder (provided you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: have you added test cases for both pytest and Sparrowhawk here?
  • Have you added __init__.py to every folder and subfolder, including the data folder that holds the .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header, Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved., to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py, your header's second line should be Copyright 2015 and onwards Google, Inc. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature, please update the NeMo documentation (which lives in a different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@ngachchi ngachchi changed the title from "Hindi ITN Support for Cardinal, Decimal, Fraction, Date, Time, Money and Measure" to "Hindi TN Support for Cardinal, Decimal, Fraction, Date, Time, Money and Measure" on Oct 30, 2024

@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Signed-off-by: Namrata Gachchi <[email protected]>
@ngachchi ngachchi left a comment

The remaining files from the whitelist data class will be removed so that only a single one is left.

@ngachchi ngachchi requested a review from mgrafu October 30, 2024 13:38
zoobereq and others added 3 commits November 13, 2024 11:34
Signed-off-by: Simon Zuberek <[email protected]>
Signed-off-by: Simon Zuberek <[email protected]>
Collaborator

Is there any reason why we have English vocab as part of the Hindi TN grammar? I believe the best approach for now would be monolingual.

@mgrafu mgrafu merged commit c8a937a into NVIDIA:main Nov 18, 2024
6 checks passed
3 participants