NepaliKit is a Python library for natural language processing tasks in the Nepali language.
You can install NepaliKit using pip:

```bash
pip install nepalikit
```
Alternatively, you can clone the repository and install it manually:
```bash
git clone https://github.com/prabhashj07/nepalikit.git
cd nepalikit
pip install .
```
NepaliKit provides the following features:
- Tokenization: Tokenize Nepali text using the SentencePiece tokenizer.
- Preprocessing: Clean and preprocess Nepali text data, including removing HTML tags, special characters, and other cleaning tasks.
- Stopword Management: Load and remove stopwords from Nepali text.
- Sentence Operations: Segment Nepali text into sentences based on punctuation marks.
- SentencePiece Model Training: Train custom SentencePiece models for Nepali text data (see the training sketch below).
- Utility Functions: Various utility functions for text processing and manipulation.
- Integration with PyTorch: Utilities for integrating with PyTorch for machine learning tasks.
Tokenize Nepali text at the sentence, word, or character level with the `Tokenizer` class:

```python
from nepalikit.tokenization import Tokenizer

text = "नमस्ते, के छ खबर? यो एउटा वाक्य हो।"
tokenizer = Tokenizer()

# Sentence tokenization
sentences = tokenizer.tokenize(text, level='sentence')
print(sentences)

# Word tokenization
words = tokenizer.tokenize(text, level='word')
print(words)

# Character tokenization
characters = tokenizer.tokenize(text, level='character')
print(characters)
```
For subword tokenization, use the `SentencePieceTokenizer` class:

```python
from nepalikit.tokenization import SentencePieceTokenizer

text = "नमस्ते, के छ खबर?"
tokenizer = SentencePieceTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

# Detokenization
original_text = tokenizer.detokenize(tokens)
print(original_text)
```
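The features list above also mentions training custom SentencePiece models. NepaliKit's own training helper is not documented in this README, so here is a rough sketch using the underlying sentencepiece package directly; the file names, vocabulary size, and model type are placeholder choices, not NepaliKit defaults:

```python
import sentencepiece as spm

# Train a model on a plain-text Nepali corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input='nepali_corpus.txt',   # hypothetical corpus file
    model_prefix='nepali_sp',    # writes nepali_sp.model and nepali_sp.vocab
    vocab_size=8000,
    character_coverage=1.0,      # keep the full Devanagari character set
    model_type='unigram',
)

# Load the trained model and tokenize a sample sentence
sp = spm.SentencePieceProcessor(model_file='nepali_sp.model')
print(sp.encode('नमस्ते, के छ खबर?', out_type=str))
```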
Clean and preprocess text with the `TextProcessor` class:

```python
from nepalikit.preprocessing import TextProcessor

text = "<p>नमस्ते, के छ खबर?</p>"
processor = TextProcessor()
clean_text = processor.remove_html_tags(text)
clean_text = processor.remove_special_characters(clean_text)
print(clean_text)
```
Manage stopword lists with the `manage_stopwords` module:

```python
from nepalikit.manage_stopwords import load_stopwords, remove_stopword

# Load stopwords from a directory of stopword files
stopwords = load_stopwords('/path/to/stopword/directory')

# Remove a stopword entry ('कुनै_स्टापवर्ड' is a placeholder meaning "some_stopword")
remove_stopword('कुनै_स्टापवर्ड')
```
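A loaded stopword list can then be used to filter tokens. A minimal sketch, assuming `load_stopwords` returns an iterable of stopword strings:

```python
# Assumes `stopwords` is the iterable returned by load_stopwords above
stopword_set = set(stopwords)
tokens = ['यो', 'एउटा', 'वाक्य', 'हो']
filtered = [t for t in tokens if t not in stopword_set]
print(filtered)
```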
The `TextProcessor` class provides various methods for text preprocessing:

- `remove_html_tags(text)`: Removes HTML tags from the text.
- `remove_special_characters(text)`: Removes special characters, keeping only Devanagari characters and spaces.
- `remove_extra_whitespace(text)`: Removes extra whitespace from the text.
- `remove_stopwords(text)`: Removes stopwords from the text.
- `normalize_text(text)`: Converts the text to lowercase.
- `preprocess_text(text)`: Applies all preprocessing steps to the text.
- `get_word_frequency(tokens)`: Returns the frequency of words in a list of tokens.
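A minimal sketch combining the methods listed above; the exact output depends on the preprocessing steps applied by `preprocess_text`:

```python
from nepalikit.preprocessing import TextProcessor
from nepalikit.tokenization import Tokenizer

processor = TextProcessor()
tokenizer = Tokenizer()

text = "<p>नमस्ते, के छ खबर? नमस्ते!</p>"

# Apply the full preprocessing pipeline in one call
clean_text = processor.preprocess_text(text)

# Word frequencies over the cleaned, word-tokenized text
tokens = tokenizer.tokenize(clean_text, level='word')
print(processor.get_word_frequency(tokens))
```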
The `urls_emails` class provides methods to remove or replace URLs and email addresses in the text:

- `replace_urls_emails(text)`: Replaces URLs and email addresses with specified replacements.
- `remove_urls_emails(text)`: Removes URLs and email addresses from the text.
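The import path for `urls_emails` is not shown in this README. As a library-independent sketch of the same idea, a regex-based version might look like this (patterns are simplified for illustration):

```python
import re

# Simplified, illustrative patterns; real-world URL/email matching is more involved
URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')

def replace_urls_emails(text, url_token='<URL>', email_token='<EMAIL>'):
    """Replace URLs and email addresses with placeholder tokens."""
    text = URL_RE.sub(url_token, text)
    return EMAIL_RE.sub(email_token, text)

def remove_urls_emails(text):
    """Remove URLs and email addresses entirely."""
    return replace_urls_emails(text, url_token='', email_token='')

print(replace_urls_emails("सम्पर्क: [email protected] वा https://example.com हेर्नुहोस्।"))
```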
The `sentence_operation` folder contains various modules for sentence-level operations:

- `extract_sentences.py`: Extracts sentences from text.
- `load_abbreviation.py`: Loads abbreviations for text processing.
- `normalize_text.py`: Normalizes text.
- `segment_sentences.py`: Segments text into sentences.
- `sentence_stats.py`: Provides statistics about sentences.
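The public API of these modules is not spelled out in this README. As a library-independent sketch of the underlying idea, segmentation on the Devanagari danda ('।') and other terminal punctuation can be done with a regular expression:

```python
import re

def segment_sentences(text):
    """Split Nepali text into sentences on '।', '?' and '!' (simplified illustration)."""
    parts = re.split(r'(?<=[।?!])\s*', text.strip())
    return [p for p in parts if p]

print(segment_sentences("नमस्ते! के छ खबर? यो एउटा वाक्य हो।"))
# ['नमस्ते!', 'के छ खबर?', 'यो एउटा वाक्य हो।']
```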
The `Tokenizer` class provides the following methods:

- `sentence_tokenize(text)`: Tokenizes input text into sentences based on the '।' character.
- `word_tokenize(sentence, new_punctuation=None)`: Tokenizes an input sentence into words, handling specified punctuation.
- `character_tokenize(word)`: Tokenizes an input word into characters.
- `tokenize(text, level='word', new_punctuation=None)`: General tokenization method for sentence, word, or character level.
- `sentence_detokenize(sentences)`: Detokenizes a list of sentences back into the original text.
- `word_detokenize(words)`: Detokenizes a list of words back into the original sentence.
- `character_detokenize(characters)`: Detokenizes a list of characters back into the original word.
- `detokenize(tokens, level='word')`: General detokenization method for sentence, word, or character level.
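A word-level round trip using the `tokenize`/`detokenize` pair documented above:

```python
from nepalikit.tokenization import Tokenizer

tokenizer = Tokenizer()
text = "यो एउटा वाक्य हो।"

# Tokenize to words, then reassemble the sentence
words = tokenizer.tokenize(text, level='word')
restored = tokenizer.detokenize(words, level='word')
print(restored)  # should reproduce the original sentence
```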
The `SentencePieceTokenizer` class provides the following methods:

- `tokenize(text)`: Tokenizes text using the SentencePiece model.
- `detokenize(tokens)`: Detokenizes text using the SentencePiece model.
The `NepaliTextProcessor` class in `utils.py` offers additional text processing capabilities:

- `merge_text(tokens)`: Merges a list of tokens into a single string.
- `split_text(text)`: Splits a text string into a list of tokens.
- `count_words(text)`: Counts the number of words in a text string.
- `count_words_in_paragraph(paragraph)`: Counts the total number of words in a paragraph.
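A short sketch of these helpers; the import path is inferred from the file name above, and the expected values in the comments are illustrative:

```python
from nepalikit.utils import NepaliTextProcessor  # import path assumed from utils.py

processor = NepaliTextProcessor()

tokens = ['यो', 'एउटा', 'वाक्य', 'हो']
text = processor.merge_text(tokens)   # single string: 'यो एउटा वाक्य हो'
print(processor.split_text(text))     # back to a token list
print(processor.count_words(text))    # 4
```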
This project is licensed under the MIT License.
- Prabhash Kumar Jha
- Email: [email protected]