MeCab [1] is a part-of-speech and morphological analyzer for Japanese. Development is maintained by Kyoto University's Department of Information Science in collaboration with the NTT Basic Sciences Research Department [2]. It was developed primarily by Taku Kudo [3], who is also responsible for the (active) GitHub repository.
It appears to be mainly a faster version of ChaSen. Some sources (none that are quotable!) say the project started life as a fork of ChaSen, or at least under a name containing "chasen".
MeCab is an open-source command-line application with bindings for popular languages such as Perl and Python. It is written in C/C++.
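Since MeCab is primarily a command-line tool, a common way to use it from a script is to read its text output. The sketch below assumes MeCab's default output layout (one `surface<TAB>comma-separated features` line per token, with an `EOS` line closing each sentence). The function name `parse_mecab_output` is hypothetical, and the sample text is illustrative rather than captured from a real run.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]


def parse_mecab_output(text: str) -> List[List[Token]]:
    """Split MeCab-style output into sentences of (surface, features) pairs."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if line == "EOS":  # each sentence is closed by an EOS line
            sentences.append(current)
            current = []
        elif line:
            # Surface form and feature string are separated by a tab.
            surface, _, feature_str = line.partition("\t")
            current.append((surface, feature_str.split(",")))
    return sentences


# Illustrative sample in an IPAdic-like feature layout (assumed, not a real run).
sample = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)
tokens = parse_mecab_output(sample)[0]
print([(surface, feats[0]) for surface, feats in tokens])
```

The first feature field is the top-level part of speech in this layout; other dictionaries may order their feature fields differently.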
-
https://github.com/taku910/mecab / http://taku910.github.io/mecab/ (old: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html)
-
Applying Conditional Random Fields to Japanese Morphological Analysis [PDF] [PPT (slide)] Notes from slides:
-
"Standard testbed corpus used for Japanese morphological analysis" [??]
-
Chinese segmentation and new word detection using conditional random fields [pdf]
-
| | MeCab | ChaSen | JUMAN | KAKASI |
|---|---|---|---|---|
| Analysis model | bi-gram Markov model | Variable-length Markov model | bi-gram Markov model | Longest match |
| Cost estimation | Learned from a corpus | Learned from a corpus | Manual | No concept of cost |
| Learning model | CRF (discriminative model) | HMM (generative model) | | |
| Dictionary lookup algorithm | Double array | Double array | Patricia tree | Hash? |
| Solution search algorithm | Viterbi | Viterbi | Viterbi | Deterministic? |
| Implementation of the connection table | Two-dimensional table | Automaton | Two-dimensional table? | No connection table? |
| Part-of-speech hierarchy | Unlimited multi-level | Unlimited multi-level | Fixed two-level | No concept of part of speech? |
| Unknown-word processing | Character types (behavior definition can be changed) | Character types (not modifiable) | Character types (not modifiable) | |
| Constrained analysis | Possible | Possible as of 2.4.0 | Impossible | Impossible |
| N-best solutions | Possible | Impossible | Impossible | Impossible |
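To make the MeCab column above more concrete, here is a minimal sketch of a Viterbi search over a word lattice, where the path cost is the sum of word costs and pairwise connection costs looked up in a two-dimensional table. Everything in it is an assumption for illustration: the toy `LEXICON`, the context IDs, the cost values, and the function name `viterbi_segment`. It also uses a single context ID per word and ignores unknown-word handling, so it is not MeCab's actual implementation.

```python
from typing import Dict, List, Optional, Tuple

BOS_EOS = 0  # context ID shared by the sentence start and end markers

# Toy dictionary: surface -> [(context_id, word_cost)]. All values are invented.
LEXICON: Dict[str, List[Tuple[int, int]]] = {
    "東": [(1, 50)],
    "東京": [(1, 20)],
    "京都": [(1, 30)],
    "都": [(2, 25)],
}

# CONNECTION[prev_id][next_id]: cost of placing next_id right after prev_id.
CONNECTION = [
    [0, 5, 40],   # after BOS
    [5, 30, 2],   # after context 1 (noun-like)
    [3, 20, 50],  # after context 2 (suffix-like)
]

Back = Optional[Tuple[int, int, str]]


def viterbi_segment(text: str) -> Tuple[List[str], int]:
    """Return the minimum-cost segmentation of text and its total cost."""
    n = len(text)
    # best[pos][ctx] = (cost, backpointer) of the cheapest path ending at pos.
    best: List[Dict[int, Tuple[int, Back]]] = [dict() for _ in range(n + 1)]
    best[0][BOS_EOS] = (0, None)
    for start in range(n):
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for ctx, word_cost in LEXICON.get(surface, []):
                for prev_ctx, (prev_cost, _) in best[start].items():
                    cost = prev_cost + CONNECTION[prev_ctx][ctx] + word_cost
                    if ctx not in best[end] or cost < best[end][ctx][0]:
                        best[end][ctx] = (cost, (start, prev_ctx, surface))
    if not best[n]:
        raise ValueError("no path covers the whole input")
    # Close the path with the connection to EOS and pick the cheapest ending.
    total, ctx = min((c + CONNECTION[i][BOS_EOS], i) for i, (c, _) in best[n].items())
    words: List[str] = []
    pos = n
    while pos > 0:  # follow backpointers from the end to the start
        _, back = best[pos][ctx]
        pos, ctx, surface = back
        words.append(surface)
    return words[::-1], total


print(viterbi_segment("東京都"))  # (['東京', '都'], 55)
```

Because each cost depends only on a word and its immediate predecessor, the search can keep one best path per position and context ID, which is what the bi-gram assumption buys.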
KyTea (pronounced "cutie") is a morphological analyzer developed at Kyoto University in 2009. It uses a pointwise, classifier-based approach (SVM or logistic regression), which allows training on partially annotated data. The main developer is Graham Neubig [website].
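The pointwise idea can be illustrated as follows: each character boundary is classified independently from nearby characters, so a sentence with only some boundaries annotated still yields usable training examples. This is a simplified sketch using character n-grams only; the function names, the window size, and the feature naming are assumptions, and KyTea's actual feature set and classifier are described in the papers listed below.

```python
from typing import List, Optional, Tuple


def boundary_features(text: str, i: int, window: int = 2) -> List[str]:
    """Character 1- and 2-gram features around the boundary before text[i]."""
    left = max(0, i - window)
    right = min(len(text), i + window)
    feats: List[str] = []
    for n in (1, 2):
        for start in range(left, right - n + 1):
            # The offset is relative to the boundary, so the same character
            # yields different features on the left and on the right.
            feats.append(f"{n}gram[{start - i}]={text[start:start + n]}")
    return feats


def training_examples(
    text: str, labels: List[Optional[bool]]
) -> List[Tuple[List[str], bool]]:
    """labels[i - 1] describes the boundary before text[i].

    True means a word boundary, False means no boundary, and None means the
    annotator left it unlabeled (partial annotation).
    """
    examples: List[Tuple[List[str], bool]] = []
    for i in range(1, len(text)):
        label = labels[i - 1]
        if label is None:  # unlabeled boundaries simply produce no example
            continue
        examples.append((boundary_features(text, i), label))
    return examples


# Only two of the three boundaries in this string are annotated.
examples = training_examples("東京都に", [False, None, True])
for feats, label in examples:
    print(label, feats)
```

A sequence model such as a CRF or HMM needs a full label sequence per sentence, whereas here every labeled boundary is a standalone example that could be fed to any binary classifier.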
-
Graham Neubig, Yosuke Nakata, Shinsuke Mori, Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon, USA. June 2011
-
Graham Neubig, Shinsuke Mori. Word-based Partial Annotation for Efficient Corpus Construction, The seventh international conference on Language Resources and Evaluation (LREC 2010). Malta. May 2010.
-
Mori, Shinsuke, et al. "Pointwise Prediction and Sequence-based Reranking for Adaptable Part-of-Speech Tagging." (Pacific Association for Computational Linguistics 2015). [pdf]
-
Claims to beat both ChaSen (referred to only as HMM) and MeCab (referred to both by name and as the CRF approach).
-
Uses partial annotated and fully annotated corpora
-
The annotated corpus is BCCWJ, with the Yahoo section used for testing and the rest for training.
-
Uses only 21 "coarse" tags (universal POS tagset?)
- This will give it an edge over vanilla MeCab, which uses over 50.
-
First paper (or place) where I've seen KyTea mentioned; it is a morphological analyzer newer than MeCab.
-
- The main distribution uses BCCWJ and UniDic, but can be extended with several others listed on the website: http://www.phontron.com/kytea/train.html
Kagome appears to be a Go version of Kuromoji.
Kuromoji is a Java program maintained by Atilika, licensed under the Apache License and compatible with MeCab dictionaries. It is optimized for search. It is a bit unclear which algorithm it uses, but it appears to be an FST.
It is used in a sizeable amount of research, though I can't find any papers detailing its approach or comparing its performance with other models.
MeCab can be compiled with one of three free (as in beer) dictionaries. These dictionaries are already compiled and consist of one or more .csv files with word and morpheme definitions, as well as seven to nine .def files containing a large numerical matrix (matrix.def), rewrite rules, or other seemingly grammatical rules.
A host of other MeCab-formatted dictionaries is available online, and most modern Japanese morphological analyzers (JMA) are compatible with MeCab dictionaries.
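As a rough illustration of what those files contain, the sketch below reads a lexicon .csv row and a matrix.def file. It assumes the commonly documented layout: each lexicon row is `surface,left_id,right_id,cost,features...`, and matrix.def starts with a line giving the two context-ID counts followed by `id id cost` triples. The function names and all sample values are invented for the example.

```python
import csv
import io
from typing import Dict, List, Tuple


def parse_lexicon(csv_text: str) -> List[dict]:
    """Read lexicon rows: surface, left ID, right ID, cost, then feature fields."""
    entries = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        entries.append({
            "surface": row[0],
            "left_id": int(row[1]),
            "right_id": int(row[2]),
            "cost": int(row[3]),
            "features": row[4:],  # part of speech, readings, and so on
        })
    return entries


def parse_matrix(def_text: str) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], int]]:
    """Read the connection matrix: a size line, then one cost per ID pair."""
    lines = def_text.strip().splitlines()
    size_a, size_b = (int(x) for x in lines[0].split())
    costs: Dict[Tuple[int, int], int] = {}
    for line in lines[1:]:
        first_id, second_id, cost = line.split()
        costs[(int(first_id), int(second_id))] = int(cost)
    return (size_a, size_b), costs


# Invented sample data in the assumed layout (IDs and costs are not real).
sample_csv = "すもも,1285,1285,7546,名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
sample_matrix = "2 2\n0 0 -10\n0 1 25\n1 0 5\n1 1 100\n"

entries = parse_lexicon(sample_csv)
sizes, costs = parse_matrix(sample_matrix)
print(entries[0]["surface"], entries[0]["cost"], sizes, costs[(0, 1)])
```

The per-word cost and the pairwise matrix cost are the two quantities a Viterbi-style search adds up along a candidate segmentation.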
Resources related to tagsets
-
List of postags generated by Juman and Chasen
-
VE - Ruby library that can re-segment Japanese into more "normal" segmentations
-
Python parser for EDICT files (unmaintained)
-
IPA dictionary, based on the IPA corpus (Recommended) Download
-
Authors: Taku Kudo, Masayuki Asahara, Yuji Matsumoto
-
Maintainer: The Information-technology Promotion Agency of Japan (IPA)
-
Papers:
- ipadic version 2.7.0 User's Manual [pdf] (for ChaSen)
IPAdic is the recommended dictionary for MeCab. While the IPA is credited as the source of this dictionary, its website shows no connection to NLP in any form. According to the manual linked above, it also appears that the original MeCab used a different (and equally elusive) source, the RWCP, as well as the Kyoto Corpus (see Juman, below). The Kyoto Corpus uses Mainichi Shinbun text, which is available for roughly 100,000-200,000 yen.
-
Juman dictionary, based on the Kyoto corpus. Download
-
Author: Taku Kudo
-
**Maintainer:** http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?Kyoto%20University%20Text%20Corpus
-
Papers:
- Building a Japanese parsed corpus while improving the parsing system [pdf]
The annotations from the Kyoto Corpus are freely available from the above link. The actual text is not, and must be purchased from Mainichi Shinbun for roughly 100,000-200,000 yen.
-
Unidic dictionary, based on the BCCWJ corpus.
-
Download unidic-MLJ_14.zip (via NINJAL, seemingly official)
-
**Author:** The UniDic Consortium
-
http://www2.ninjal.ac.jp/lrc/index.php?UniDic (seems to be more organizational)
-
Maintainer: The National Institute for Japanese Language (NINJAL)
- http://pj.ninjal.ac.jp/corpus_center/ Center for Corpus Development, NINJAL. Appears to be the home of all their corpus resources, including UniDic, BCCWJ, and several other corpora.
-
Papers:
-
NINJAL publications: [link]
-
KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese, 2007 [PDF]
-
Design of a balanced corpus of contemporary written Japanese, 2007 [pdf]
-
UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese, 2012. [pdf]
-
Corpus-based Japanese morphological analysis (UniDic; doctoral thesis), 2003 [pdf]
-
A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation (UniDic), 2008. [pdf]
-
Balanced Corpus of contemporary written Japanese (the BCCWJ paper?), 2008. [pdf]
Links
The BCCWJ corpus can be purchased under a yearly license costing between 50,000 and 400,000 yen.
It is fully annotated and contains a mix of newspaper, journal, and internet articles.
The corpus can be queried manually, free of charge, via either the Shonagon or the Chunagon (registration required) service. There does not appear to be an API.
-
Shonagon: http://www.kotonoha.gr.jp/shonagon/
-
Chunagon: https://chunagon.ninjal.ac.jp/
-
EDR Electronic Dictionary Technical Guide (1993), Japan Electronic Dictionary Research Institute, Ltd. [1995 version; ACM link] [pdf]
-
Via Mori et al. 2015 [pdf]: "... many fully annotated corpora, in which the sentences are divided into words completely and all the words are annotated with POSs, are available. Almost all annotated corpora produced through corpus annotation research [EDR], [17] fall in this category."
-
Large Japanese/English word-sense dictionary combined with Japanese and English corpora.
-
Tsukuba Web Corpus (online only)
-
Neologd - Neologism Dictionary for ipadic
-
An extension to IPAdic that includes many neologisms (new words) extracted from "many language resources on the Web". The exact method is a bit unclear.
-
Has monthly updates.
-
Yasuharu Den, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, and Hanae Koiso. 2007. The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese). Japanese Linguistics, 22:101–123 (via A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation)
-
Santos, Cicero D., and Bianca Zadrozny. "Learning character-level representations for part-of-speech tagging." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014. [pdf]
-
Accurate Word Segmentation and POS Tagging for Japanese Microblogs: Corpus Annotation and Joint Modeling with Lexical Normalization, 2014 [pdf]
-
POS tagging for Twitter
-
Builds a Twitter corpus of 1,000 posts (1,831 sentences), all manually segmented and annotated with POS.
-