This project analyzes Hindi conversations. Text is first normalized and tokenized with the IndicNLP library; a custom tokenizer then cleans the text, removes stop words, and handles language-specific nuances.
Text Preprocessing:
- Normalize and tokenize Hindi text with IndicNLP.
- Clean the text and remove stop words using a custom tokenizer.
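The cleaning and stop-word step can be sketched roughly as follows. This is a minimal illustration, not the project's actual tokenizer: the IndicNLP normalization and tokenization calls are left out so the sketch runs with the standard library alone, with `unicodedata.normalize` and a regular expression standing in for them. The names `HINDI_STOP_WORDS`, `DEVANAGARI_TOKEN`, and `preprocess` are hypothetical, and the stop-word list is a tiny sample.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption); a real project would load a fuller one.
HINDI_STOP_WORDS = {"है", "हैं", "का", "की", "के", "और", "में", "से", "को", "यह", "था"}

# Devanagari letters, signs, and digits, excluding the danda marks (U+0964, U+0965)
# so that sentence-ending punctuation is not glued to the last word.
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def preprocess(text, stop_words=HINDI_STOP_WORDS):
    """Return cleaned Hindi tokens with stop words removed."""
    # NFC maps canonically equivalent code point sequences to one form
    # (a stand-in for the IndicNLP normalizer, not a replacement for it).
    normalized = unicodedata.normalize("NFC", text)
    # Keeping only Devanagari runs drops Latin text, digits, and punctuation.
    tokens = DEVANAGARI_TOKEN.findall(normalized)
    return [tok for tok in tokens if tok not in stop_words]
```

Calling `preprocess("आज मौसम बहुत अच्छा है।")` keeps the content words and drops both the stop word and the trailing danda.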
Feature Extraction:
- Apply TF-IDF vectorization with bigrams to extract key terms and phrases.
- Use the weighted unigrams and bigrams as a lightweight representation of each dialogue's content and local word order.
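To make the bigram weighting concrete, here is a small pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes the input is already tokenized, and it uses a smoothed IDF chosen for illustration. The source does not say which implementation the project uses; in practice a library vectorizer such as scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` would be a common choice. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return unigrams plus n-grams up to n_max, each joined by a space."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs, n_max=2):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    term_lists = [ngrams(doc, n_max) for doc in docs]
    n_docs = len(term_lists)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        weights.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def top_terms(weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]
```

Terms that occur in every document, such as a shared greeting, receive the lowest IDF, so phrases specific to one dialogue rise to the top of `top_terms`.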
Sentiment Analysis:
- Use a sentiment-labeled Hindi word list (a lexicon) to compute sentiment scores.
- Analyze emotional tones for individual speakers and the overall conversation.
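A lexicon-based scorer for the two levels described above (per speaker and whole conversation) might look like the sketch below. The lexicon entries, the +1/-1 labels, and the mean-polarity scoring rule are all assumptions for illustration; the project's actual word list and scoring scheme may differ.

```python
from collections import defaultdict

# Hypothetical lexicon: +1 marks a positive word, -1 a negative word.
SENTIMENT_LEXICON = {
    "अच्छा": 1, "खुश": 1, "सुंदर": 1,
    "बुरा": -1, "दुखी": -1, "गंदा": -1,
}


def score_tokens(tokens, lexicon=SENTIMENT_LEXICON):
    """Mean polarity of the tokens found in the lexicon; 0.0 if none match."""
    hits = [lexicon[tok] for tok in tokens if tok in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def conversation_sentiment(turns, lexicon=SENTIMENT_LEXICON):
    """turns: list of (speaker, tokens). Returns (per_speaker_scores, overall_score)."""
    by_speaker = defaultdict(list)
    all_tokens = []
    for speaker, tokens in turns:
        by_speaker[speaker].extend(tokens)
        all_tokens.extend(tokens)
    per_speaker = {s: score_tokens(toks, lexicon) for s, toks in by_speaker.items()}
    return per_speaker, score_tokens(all_tokens, lexicon)
```

Scores fall in the range -1.0 to 1.0. Words missing from the lexicon are ignored rather than counted as neutral, so a speaker who uses a single labeled word gets that word's polarity as their score.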
Conversation Insights:
- Summarize key themes and interaction dynamics using extracted terms and sentiment analysis.
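One possible shape for such a summary is sketched below. It assumes the key terms and per-speaker sentiment scores have already been computed, and it treats turn counts and each speaker's share of tokens as a simple proxy for interaction dynamics. That proxy, the function name `summarize_conversation`, and the output layout are illustrative assumptions rather than the project's actual design.

```python
from collections import Counter


def summarize_conversation(turns, key_terms, speaker_sentiment):
    """Combine precomputed terms and sentiment with basic turn statistics.

    turns: list of (speaker, tokens)
    key_terms: iterable of top-weighted terms
    speaker_sentiment: {speaker: score}
    """
    turn_counts = Counter(speaker for speaker, _ in turns)
    token_counts = Counter()
    for speaker, tokens in turns:
        token_counts[speaker] += len(tokens)
    total_tokens = sum(token_counts.values()) or 1  # guard against empty input
    return {
        "key_terms": list(key_terms),
        "speakers": {
            speaker: {
                "turns": turn_counts[speaker],
                "token_share": token_counts[speaker] / total_tokens,
                # Speakers without a score default to neutral.
                "sentiment": speaker_sentiment.get(speaker, 0.0),
            }
            for speaker in turn_counts
        },
    }
```

The result is a plain dictionary, so it can be printed, serialized to JSON, or passed on to a reporting step.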
This pipeline provides a structured approach to analyzing Hindi conversations, making it useful for linguistic research, sentiment analysis, and dialogue summarization. 🚀