Software

MeCab

MeCab [1] is a part-of-speech and morphological analyzer for Japanese. Development is maintained by Kyoto University's Department of Information Science in collaboration with the NTT Basic Sciences Research Department [2]. It was developed primarily by Taku Kudo [3], who is also responsible for the (active) GitHub repository.

It appears to be mainly a faster version of ChaSen. Some sources (none that are quotable!) say the project started its life as a fork of ChaSen, or at least under a name containing "chasen".

MeCab is an open-source command-line application with bindings for popular languages such as Perl and Python. It is written in C/C++.
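Since MeCab is a command-line tool, one simple way to use it from Python without any binding is to call the executable and parse its text output. The sketch below assumes a `mecab` binary on the PATH with a default dictionary configured, and assumes the default output format of one `surface<TAB>comma-separated features` line per token followed by an `EOS` line. The `SAMPLE` string is hand-written in IPAdic style for illustration, not captured from a real run, and the function names are my own.

```python
import subprocess


def parse_mecab_output(output):
    """Parse MeCab-style output into one list of (surface, features) per sentence."""
    sentences, current = [], []
    for line in output.splitlines():
        if line == "EOS":  # MeCab closes each analysed sentence with an EOS line
            sentences.append(current)
            current = []
        elif line:
            surface, _, feature_str = line.partition("\t")
            current.append((surface, feature_str.split(",")))
    return sentences


def run_mecab(text, mecab_cmd="mecab"):
    """Run the mecab CLI on `text` (assumes `mecab` is installed and on PATH)."""
    result = subprocess.run(
        [mecab_cmd],
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return parse_mecab_output(result.stdout)


# Hand-written, IPAdic-style sample (illustrative only).
SAMPLE = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "鳴く\t動詞,自立,*,*,五段・カ行イ音便,基本形,鳴く,ナク,ナク\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(SAMPLE)[0]:
    print(surface, features[0])
```

With MeCab installed, `run_mecab("猫が鳴く")` would return the same structure for real output; the exact feature columns depend on the dictionary in use.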

links:

https://github.com/taku910/mecab / http://taku910.github.io/mecab/ (old: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html)
http://nlp.ist.i.kyoto-u.ac.jp/kuntt
http://chasen.org/~taku

Related papers:

Applying Conditional Random Fields to Japanese Morphological Analysis [PDF] [PPT (slide)] Notes from slides:
- "Standard testbed corpus used for Japanese morphological analysis" [??]
- Chinese segmentation and new word detection using conditional random fields [pdf]

Software comparison

| | MeCab | ChaSen | JUMAN | KAKASI |
|---|---|---|---|---|
| Analysis model | Bi-gram Markov model | Variable-length Markov model | Bi-gram Markov model | Longest match |
| Cost estimation | Learned from a corpus | Learned from a corpus | Manual | No concept of cost |
| Learning model | CRF (discriminative model) | HMM (generative model) | | |
| Dictionary lookup algorithm | Double array | Double array | Patricia tree | Hash? |
| Solution search algorithm | Viterbi | Viterbi | Viterbi | Deterministic? |
| Implementation of the connection table | Two-dimensional table | Automaton | Two-dimensional table? | No connection table? |
| Part-of-speech hierarchy | Unlimited multi-level part of speech | Unlimited multi-level part of speech | Fixed two-level | No concept of part of speech? |
| Unknown word processing | Character types (behavior definition can be changed) | Character types (not modifiable) | Character types (not modifiable) | |
| Constrained analysis | Possible | Possible from 2.4.0 | Impossible | Impossible |
| N-best solutions | Possible | Impossible | Impossible | Impossible |
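The MeCab column above (bi-gram Markov model, Viterbi search, two-dimensional connection table) can be made concrete with a toy lattice search. The sketch below is a simplification, not MeCab's implementation: each word has a single context id instead of separate left and right ids, unknown words are not handled, and the lexicon, ids, and all costs are made up for illustration.

```python
BOS_EOS, NOUN, VERB, PARTICLE = 0, 1, 2, 3

# Hypothetical lexicon: surface -> list of (context id, word cost).
LEXICON = {
    "くるま": [(NOUN, 5)],
    "くる": [(VERB, 4)],
    "で": [(PARTICLE, 2)],
    "まで": [(PARTICLE, 3)],
    "まつ": [(VERB, 4)],
}

# Hypothetical two-dimensional connection table: CONN[previous id][next id].
CONN = [
    [0, 1, 1, 5],  # after BOS
    [2, 3, 3, 0],  # after NOUN
    [0, 2, 3, 1],  # after VERB
    [4, 1, 1, 4],  # after PARTICLE
]


def viterbi_segment(text, lexicon, conn):
    """Return (words, total cost) of the cheapest segmentation of `text`."""
    n = len(text)
    # best[pos][ctx] = (cost of the cheapest path ending at pos with id ctx, backpointer)
    best = [dict() for _ in range(n + 1)]
    best[0][BOS_EOS] = (0, None)
    for start in range(n):
        if not best[start]:
            continue  # no path reaches this position
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for ctx, word_cost in lexicon.get(surface, []):
                for prev_ctx, (prev_cost, _) in best[start].items():
                    cost = prev_cost + conn[prev_ctx][ctx] + word_cost
                    if ctx not in best[end] or cost < best[end][ctx][0]:
                        best[end][ctx] = (cost, (start, prev_ctx, surface))
    if not best[n]:
        raise ValueError("no segmentation found with this lexicon")
    # Add the connection cost to EOS and pick the cheapest final state.
    last_ctx = min(best[n], key=lambda c: best[n][c][0] + conn[c][BOS_EOS])
    total = best[n][last_ctx][0] + conn[last_ctx][BOS_EOS]
    # Follow backpointers from the end of the sentence to the start.
    words, pos, ctx = [], n, last_ctx
    while pos > 0:
        _, (start, prev_ctx, surface) = best[pos][ctx]
        words.append(surface)
        pos, ctx = start, prev_ctx
    words.reverse()
    return words, total


print(viterbi_segment("くるまでまつ", LEXICON, CONN))
```

With these made-up costs, the two candidate paths (くるま/で/まつ and くる/まで/まつ) both reach position 4 with a particle, and only the cheaper one is kept. That pruning is what lets Viterbi avoid enumerating every segmentation.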

KyTea

KyTea (pronounced "cutie") is a morphological analyser developed at Kyoto University in 2009. It uses a pointwise, classifier-based approach (SVM or logistic regression), which allows training on partially annotated data. The main developer is Graham Neubig [website].
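To illustrate why a pointwise approach can train on partially annotated data, the sketch below treats every character boundary as an independent classification example built only from nearby characters, and skips boundaries that nobody labelled. This is not KyTea's code: the function names and the feature templates are hypothetical, and the real feature set is richer.

```python
def boundary_features(text, i, window=2):
    """Features for the boundary between text[i-1] and text[i] (1 <= i < len(text)).

    Only local characters are used, so each boundary can be classified on its own.
    """
    feats = []
    # Character unigrams at offsets -window .. window-1 relative to the boundary.
    for off in range(-window, window):
        j = i + off
        ch = text[j] if 0 <= j < len(text) else "<pad>"
        feats.append(f"c[{off}]={ch}")
    # Character bigram straddling the boundary.
    feats.append(f"bi={text[i - 1]}{text[i]}")
    return feats


def training_examples(text, labels):
    """Build (features, label) pairs from a possibly partially annotated sentence.

    labels[k] describes the boundary between text[k] and text[k+1]:
    1 = word boundary, 0 = no boundary, None = not annotated.
    """
    examples = []
    for k, label in enumerate(labels):
        if label is None:
            continue  # partial annotation: unlabelled boundaries are simply skipped
        examples.append((boundary_features(text, k + 1), label))
    return examples


# Only the boundaries around "に" are annotated in this made-up example.
for feats, label in training_examples("東京都に住む", [None, None, 1, 1, None]):
    print(label, feats)
```

The resulting examples could be fed to any binary classifier, such as an SVM or logistic regression. A sequence model, by contrast, would need a label for every position in the sentence.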

links:

http://www.phontron.com/kytea/
https://github.com/neubig/kytea

Related papers:

Graham Neubig, Yosuke Nakata, Shinsuke Mori, Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon, USA. June 2011
Graham Neubig, Shinsuke Mori. Word-based Partial Annotation for Efficient Corpus Construction, The seventh international conference on Language Resources and Evaluation (LREC 2010). Malta. May 2010.
Mori, Shinsuke, et al. "Pointwise Prediction and Sequence-based Reranking for Adaptable Part-of-Speech Tagging." (Pacific Association for Computational Linguistics, 2015). [pdf]
- Claims to beat both ChaSen (referred to only as HMM) and MeCab (referred to by name and as the CRF approach)
- Uses both partially annotated and fully annotated corpora
- The annotated corpus is BCCWJ, with the Yahoo section as test data and the rest as training data
- Uses only 21 "coarse" tags (universal POS tagset?)
  - This will give it an edge over vanilla MeCab, which uses over 50
- First paper (or place) where I've seen KyTea mentioned, a morphological analyser newer than MeCab.

Dictionaries / models

The main distribution uses BCCWJ and UniDic, but it can be extended with several other models listed on the website: http://www.phontron.com/kytea/train.html

Kuromoji

Kagome appears to be a Go version of Kuromoji.

Kuromoji is a Java program maintained by Atilika, licensed under the Apache License, and compatible with MeCab dictionaries. It is optimized for search. It is a bit unclear which algorithm it uses, but it appears to be an FST.

It is used in a sizeable amount of research, though I can't find any papers detailing its approach, nor comparing its performance with other models.

links:

http://atilika.com/en/
http://atilika.com/en/products/kuromoji.html
https://github.com/atilika/kuromoji

Dictionaries

MeCab can be compiled using one of three free (as in beer) dictionaries. These dictionaries are already compiled and consist of one or more .csv files with word and morpheme definitions, plus 7 to 9 .def files containing a large numerical matrix (matrix.def), rewrite rules, or other seemingly grammatical rules.

A host of other MeCab-formatted dictionaries are available online, and most modern Japanese morphological analysers (JMA) are compatible with MeCab dictionaries.
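The .csv and matrix.def files mentioned above can be read with a few lines of Python. The sketch below rests on my understanding of the layout, which should be checked against the dictionary in hand: each .csv row is `surface,left id,right id,cost,features...`, and matrix.def starts with a line giving the two table sizes followed by `previous-word id, next-word id, cost` triples. The sample contents, ids, and costs are invented for illustration.

```python
import csv
import io


def parse_lexicon_csv(text):
    """Parse MeCab-style lexicon rows into dictionaries (assumed column layout)."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        surface, left_id, right_id, cost, *features = row
        entries.append({
            "surface": surface,
            "left_id": int(left_id),
            "right_id": int(right_id),
            "cost": int(cost),
            "features": features,
        })
    return entries


def parse_matrix_def(text):
    """Parse a matrix.def-style connection table (assumed layout).

    Returns ((size of first axis, size of second axis), {(prev_id, next_id): cost}).
    """
    lines = text.strip().splitlines()
    sizes = tuple(int(x) for x in lines[0].split())
    costs = {}
    for line in lines[1:]:
        prev_id, next_id, cost = (int(x) for x in line.split())
        costs[(prev_id, next_id)] = cost
    return sizes, costs


# Invented sample data in the assumed formats.
SAMPLE_CSV = (
    "猫,10,10,500,名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が,20,20,300,助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
)
SAMPLE_MATRIX = "2 2\n0 0 0\n0 1 10\n1 0 20\n1 1 -5\n"

print(parse_lexicon_csv(SAMPLE_CSV)[0])
print(parse_matrix_def(SAMPLE_MATRIX))
```

Real dictionaries are often encoded in EUC-JP rather than UTF-8, so the files may need to be decoded with the right encoding before being passed to these functions.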

Tagsets and other related resources

Resources related to tagsets

List of postags generated by Juman and Chasen
- http://www.unixuser.org/~euske/doc/postag/
VE - Ruby library that can re-segment Japanese into more "normal" segmentations
- https://github.com/Kimtaro/ve/blob/master/lib/providers/mecab_ipadic.rb#L118
Python parser for EDICT files (unmaintained)
- http://repo.or.cz/w/jbparse.git/blame/8e42831ca5f721c0320b27d7d83cb553d6e9c68f:/jbparse/edict.py

IPA dictionary (mecab-ipadic-2.7.0-20070801.tar.gz)

IPA dictionary, based on the IPA corpus (Recommended) Download
Authors:
- Taku Kudo <taku@chasen.org>
- chasen@is.aist-nara.ac.jp
- Masayuki Asahara <masayu-a@is.aist-nara.ac.jp>
- Yuji Matsumoto <matsu@is.aist-nara.ac.jp>

About IPA Dictionary

Maintainer: The Information-technology Promotion Agency of Japan (IPA)
Papers:
- ipadic version 2.7.0 User's Manual [pdf] (for chuman)

IPAdic is the recommended dictionary for MeCab. Although the IPA is credited as the source of this dictionary, its website shows no connection to NLP in any form. According to the manual linked above, the original MeCab also appears to have used a different (and equally elusive) source, the RWCP, as well as the Kyoto Corpus (see Juman, below). The Kyoto Corpus uses Mainichi Shinbun articles as its text, which is available for roughly 100,000 to 200,000 yen.

Juman dictionary (mecab-jumandic-7.0-20130310.tar.gz)

Juman dictionary, based on the Kyoto corpus. Download
Author: Taku Kudo <taku@chasen.org>

About the Kyoto Corpus

Maintainer: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?Kyoto%20University%20Text%20Corpus
Papers:
- Building a Japanese parsed corpus while improving the parsing system [pdf]

The annotations from the Kyoto Corpus are freely available from the above link. The actual text is not, and must be purchased from Mainichi Shinbun for roughly 100,000 to 200,000 yen.

Unidic dictionary (unidic-mecab-2.1.2_src.zip or unidic-MLJ_14.zip)

Unidic dictionary, based on the BCCWJ corpus.
- Download unidic-mecab-2.1.2_src.zip (via y-ken)
- Download unidic-MLJ_14.zip (via ninjal (seemingly official))
Author: The UniDic Consortium
- https://osdn.jp/projects/unidic/
- http://www2.ninjal.ac.jp/lrc/index.php?UniDic (seems to be more organizational)

About the Balanced Corpus of Contemporary Written Japanese (BCCWJ)

Maintainer: The National Institute for Japanese Language (NINJAL)
- http://pj.ninjal.ac.jp/corpus_center/ Center for Corpus Development, NINJAL. Appears to be the home of all of NINJAL's corpus resources, including UniDic, BCCWJ, and several other corpora.
Papers:
- NINJAL publications: [link]
- KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese, 2007 [PDF]
- Design of a balanced corpus of contemporary written Japanese, 2007 [pdf]
- UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese, 2012. [pdf]
- Corpus-based Japanese morphological analysis (unidic; doctor's thesis), 2003 [pdf]
- A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation (UniDic), 2008. [pdf]
- Balanced Corpus of contemporary written Japanese (the BCCWJ paper?), 2008. [pdf]
Links
- http://pj.ninjal.ac.jp/corpus_center/bccwj/en/

The BCCWJ corpus is available for purchase under a yearly license costing between 50,000 and 400,000 yen.

It is fully annotated and contains a mix of newspaper, journal and internet articles.

The corpus can be queried manually, free of charge, via either the Shonagon or the Chunagon (registration required) service. There does not appear to be an API.

Shonagon: http://www.kotonoha.gr.jp/shonagon/
Chunagon: https://chunagon.ninjal.ac.jp/

Other corpora

EDR Electronic Dictionary Technical Guide (1993), Japan Electronic Dictionary Research Institute, Ltd. [1995 version; ACM link] [pdf]
- Via Mori et al. 2015 [pdf]: "... many fully annotated corpora, in which the sentences are divided into words completely and all the words are annotated with POSs, are available. Almost all annotated corpora produced through corpus annotation research [EDR], [17] fall in this category."
- Large Japanese/English word-sense dictionary combined with Japanese and English corpora.
Tsukuba Web Corpus (online only)
- http://nlt.tsukuba.lagoinst.info/search/
NEologd - Neologism dictionary for ipadic
- An extension to IPAdic that includes many neologisms (new words) extracted from "many language resources on the Web". The exact extraction method is unclear.
- Has monthly updates.
- https://github.com/neologd/mecab-ipadic-neologd

Other papers / approaches:

Yasuharu Den, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, and Hanae Koiso. 2007. The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese). Japanese Linguistics, 22:101–123 (via A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation)
Santos, Cicero D., and Bianca Zadrozny. "Learning character-level representations for part-of-speech tagging." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014. [pdf]
Accurate Word Segmentation and POS Tagging for Japanese Microblogs: Corpus Annotation and Joint Modeling with Lexical Normalization, 2014 [pdf]
- POS tagging of Twitter posts
- Builds a Twitter corpus of 1000 posts (1831 sentences), all manually segmented and annotated with POS.
