word2vec4kor

tensorboard_log word2vec visualization data

  • File format: TensorBoard embedding (projector)
  • Window size: 1
  • Total unique words: 10,000
  • Tokenization: whitespace
  • Embedding dimension: 300
  • Training: skip-gram + negative sampling + subsampling

mkdir -p ~/workspace
cd ~/workspace

git clone https://github.com/bage79/word2vec4kor
tensorboard --logdir="$HOME/workspace/word2vec4kor/tensorboard_log"
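
For readers who want to view their own vectors the same way, here is a minimal sketch of one layout the TensorBoard projector can load: a tab-separated vectors file, a one-label-per-line metadata file, and a `projector_config.pbtxt` that points at both. The function name `write_projector_files`, the file names, and the tensor name `word2vec` are assumptions for illustration; the files shipped in `tensorboard_log` may be organized differently (for example, as a TensorFlow checkpoint).

```python
import os


def write_projector_files(words, vectors, out_dir):
    """Write vectors.tsv, metadata.tsv and projector_config.pbtxt into out_dir.

    Hypothetical helper: file names and the tensor name are illustrative only.
    """
    os.makedirs(out_dir, exist_ok=True)

    # One embedding per line, components separated by tabs.
    with open(os.path.join(out_dir, "vectors.tsv"), "w", encoding="utf-8") as f:
        for vec in vectors:
            f.write("\t".join(f"{x:.6f}" for x in vec) + "\n")

    # One label per line, in the same order as the vectors (no header row
    # when there is only a single metadata column).
    with open(os.path.join(out_dir, "metadata.tsv"), "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")

    # Paths are relative to the log directory passed to --logdir.
    config = (
        "embeddings {\n"
        '  tensor_name: "word2vec"\n'
        '  tensor_path: "vectors.tsv"\n'
        '  metadata_path: "metadata.tsv"\n'
        "}\n"
    )
    with open(os.path.join(out_dir, "projector_config.pbtxt"), "w", encoding="utf-8") as f:
        f.write(config)
```

Pointing `tensorboard --logdir` at the output directory should then show the labels in the Projector tab, assuming the installed TensorBoard version reads TSV tensors through `tensor_path`.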

demo

ko.wikipedia.org.sentences raw corpus

  • Source: https://ko.wikipedia.org
  • Total sentences: approximately 3,115,431

wget https://gitlab.com/bage79/nlp4kor-ko.wikipedia.org/raw/master/data/ko.wikipedia.org.sentences.gz
gzip -d ko.wikipedia.org.sentences.gz
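
As a rough illustration of how the corpus could be turned into the 10,000-word vocabulary described above, the sketch below counts whitespace-separated tokens and keeps the most frequent ones. It assumes the file is UTF-8 text with one sentence per line, which this README does not confirm; `build_vocab` is a hypothetical name, not a function from this repository.

```python
import gzip
from collections import Counter


def build_vocab(path, max_words=10_000):
    """Count whitespace-separated tokens and keep the `max_words` most frequent.

    Assumes UTF-8 text, one sentence per line; reads .gz files directly.
    """
    counts = Counter()
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            counts.update(line.split())  # whitespace tokenization
    return [word for word, _ in counts.most_common(max_words)]


# Example call (the file must already be downloaded):
# vocab = build_vocab("ko.wikipedia.org.sentences")
```

Because the function also opens `.gz` files, the `gzip -d` step is optional when only the vocabulary is needed.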

Tips

Download the Korean Wikipedia dump file

  • https://dumps.wikimedia.org/kowiki/20180220/

Parse the dump file (MediaWiki format) into a text file

  • https://pypi.python.org/pypi/mediawiki-parser/
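
Before any wiki markup can be parsed, the page bodies have to be pulled out of the XML dump. The sketch below does only that step with the Python standard library and does not use the mediawiki-parser package linked above. It assumes a bz2-compressed MediaWiki XML export whose page bodies sit in `<text>` elements; the function names are hypothetical, and the yielded strings are still raw wikitext that needs a separate markup-stripping pass.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_page_texts(source):
    """Yield the raw wikitext of each <text> element in a MediaWiki XML export.

    `source` is a file name or a binary file object.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        # Tags carry an XML namespace, so compare only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            yield elem.text or ""
        elif local_name == "page":
            elem.clear()  # drop the children of pages already processed


def iter_dump_texts(path):
    """Stream page texts from a bz2-compressed dump without unpacking it to disk."""
    with bz2.open(path, "rb") as f:
        yield from iter_page_texts(f)
```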

Open-source word2vec implementation

  • https://github.com/theeluwin/pytorch-sgns
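
To make the training setup concrete, here is a small pure-Python sketch of the three data-preparation ideas behind skip-gram with negative sampling and subsampling: context pairs within a window, a keep probability for frequent words, and a negative-word sampler. It is not taken from pytorch-sgns or from this repository. The threshold `t=1e-5` and the exponent `0.75` are the values commonly cited from the original word2vec paper and are assumptions here, as are all function names.

```python
import math
import random
from collections import Counter


def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs within `window` positions of each token."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def keep_probability(count, total, t=1e-5):
    """Subsampling: keep a word with probability min(1, sqrt(t / f)),
    where f is its relative frequency (t is an assumed threshold)."""
    freq = count / total
    return min(1.0, math.sqrt(t / freq))


def negative_sampler(counts, power=0.75, seed=0):
    """Return a function that draws k negatives from the unigram
    distribution raised to `power` (0.75 is an assumed exponent)."""
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)
    return lambda k: rng.choices(words, weights=weights, k=k)


tokens = "나는 학교에 간다".split()  # whitespace tokenization
pairs = list(skipgram_pairs(tokens, window=1))
counts = Counter(tokens)
sample = negative_sampler(counts)
```

With a window size of 1, each word is paired only with its immediate neighbours, so the three-token sentence above produces four pairs. A real trainer would feed these pairs and the sampled negatives into the embedding model.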