Antoniak and Mimno, 2018
- Remove words appearing fewer than 20 times
- Number of documents
- Unique Words
- Vocabulary density (unique words / total number of words?)
- Words per document (Average)
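A minimal sketch of how these statistics could be computed (not the paper's code), assuming the corpus is already a list of token lists; all names are illustrative and the `min_count=20` cut-off mirrors the preprocessing above:

```python
from collections import Counter

def corpus_stats(documents, min_count=20):
    """Corpus statistics for a list of token lists (illustrative names)."""
    counts = Counter(tok for doc in documents for tok in doc)
    kept = {w for w, c in counts.items() if c >= min_count}   # drop words seen < 20 times
    docs = [[tok for tok in doc if tok in kept] for doc in documents]
    n_docs = len(docs)
    n_tokens = sum(len(d) for d in docs)
    n_types = len({tok for d in docs for tok in d})
    return {
        "documents": n_docs,
        "unique_words": n_types,
        "vocabulary_density": n_types / n_tokens,   # unique words / total words
        "words_per_document": n_tokens / n_docs,    # average
    }
```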
- One which reads the documents in the same order every time
- Evaluates randomness of algorithm (Random initialization, etc.)
- One which shuffles the documents
- Evaluates impact of document order in the learning
- One which removes documents from the set and replaces them (sampling with replacement) to keep the corpus the same size (the three strategies are sketched after this list)
- Evaluates variability due to the presence of specific sequences
- Keep 20% of the corpus, preferably the beginning
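A sketch of the three corpus orderings described above (fixed, shuffled, bootstrap); the function names and seed handling are assumptions, not the paper's code:

```python
import random

def fixed_order(documents):
    """Same order every run: isolates the algorithm's own randomness (initialisation, etc.)."""
    return list(documents)

def shuffled_order(documents, seed=0):
    """Random document order: measures the impact of document order on learning."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs

def bootstrap_sample(documents, seed=0):
    """Sample documents with replacement, keeping the corpus size constant:
    measures variability due to the presence/absence of specific documents."""
    rng = random.Random(seed)
    return [rng.choice(documents) for _ in range(len(documents))]
```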
2 Settings:
- 1 document = 1 sentence
- 1 document = multiple sentences
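The two settings could be produced as below, assuming raw texts and NLTK tokenisers (an assumption; any sentence splitter works):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

def sentence_documents(texts):
    """Setting 1: one document = one sentence."""
    return [word_tokenize(s) for t in texts for s in sent_tokenize(t)]

def full_documents(texts):
    """Setting 2: one document = multiple sentences (the whole text)."""
    return [word_tokenize(t) for t in texts]
```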
- LSA: term-document matrix with TF-IDF
- L2 Normalization
- Sublinear TF Scaling
- Dimensionality reduction via randomized solver
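A hedged sketch of this LSA pipeline with scikit-learn, assuming `raw_documents` is a list of strings; `min_df=20` mirrors the frequency cut-off above, while the 100 dimensions are an assumption not taken from the notes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# TF-IDF term-document matrix with sublinear TF scaling and L2 normalisation.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2", min_df=20)
X = vectorizer.fit_transform(raw_documents)            # documents x terms

# Dimensionality reduction via a randomized truncated SVD.
svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=0)
doc_vectors = svd.fit_transform(X)                     # documents x 100
word_vectors = svd.components_.T                       # terms x 100
```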
- SGNS: skip-gram with negative sampling (Mikolov et al., 2013; word2vec)
- Gensim
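A sketch of SGNS training with Gensim (>= 4.0 API, where the dimensionality parameter is called `vector_size`); the dimensionality and seed are assumptions, `min_count=20` mirrors the frequency cut-off:

```python
from gensim.models import Word2Vec

sgns = Word2Vec(
    sentences=tokenized_documents,  # list of token lists (one per document)
    vector_size=100,                # assumption: dimensionality not given in the notes
    window=5,
    sg=1,                           # 1 = skip-gram (0 would be CBOW)
    negative=5,                     # negative sampling
    min_count=20,                   # mirrors the < 20 occurrences cut-off
    seed=0,
)
word_vectors = sgns.wv              # KeyedVectors: word_vectors["bonus"], .most_similar(...)
```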
- GloVe: nothing specific, library defaults
- PPMI: hyperwords library
  - cds = 0.75 (context distribution smoothing)
  - eig = 0.5 (eigenvalue weighting exponent)
  - sub = 10^-5 (subsampling threshold)
  - win = 5 (context window size)
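These are hyperwords settings; the sketch below is not the hyperwords code, just a minimal NumPy illustration of what cds (context distribution smoothing) and eig (singular-value weighting) do to a co-occurrence matrix:

```python
import numpy as np

def ppmi(C, cds=0.75):
    """Positive PMI from a word x context co-occurrence count matrix C.

    cds: the context distribution is smoothed by raising context counts
    to this power, which dampens very frequent context words.
    """
    total = C.sum()
    p_w = C.sum(axis=1) / total                  # P(w)
    ctx = C.sum(axis=0) ** cds                   # smoothed context counts
    p_c = ctx / ctx.sum()                        # P_cds(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (p_w[:, None] * p_c[None, :]))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def svd_embeddings(M, dim=100, eig=0.5):
    """Word vectors from the PPMI matrix: W = U_dim * S_dim**eig."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * (S[:dim] ** eig)
```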
- Train 50 of each
- Select 20 relevant query words using a 200-topic LDA topic model
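A sketch with Gensim's LdaModel; the exact rule for picking the 20 words is not in the notes, so the selection below (most probable words accumulated across topics) is only a placeholder:

```python
from collections import Counter
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(tokenized_documents)
bow = [dictionary.doc2bow(doc) for doc in tokenized_documents]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=200, random_state=0)

# Collect high-probability topic words, then keep 20 of them as query words.
candidates = Counter()
for topic_id in range(lda.num_topics):
    for word, prob in lda.show_topic(topic_id, topn=10):
        candidates[word] += prob
query_words = [w for w, _ in candidates.most_common(20)]
```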
- Compute cosine similarity of each word to other words, calculate mean and standard deviation across each set of 50 models
e.g.

| word 1 | word 2 | LSA 1 | LSA 2 | LSA 3 | ... | LSA 50 |
|---|---|---|---|---|---|---|
| lascivus | bonus | 0.1 | 0.1 | 0.1 | ... | 0.1 |

LSA average for lascivus-bonus = 0.1, standard deviation = 0
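A sketch of that aggregation, assuming `models` is a list of 50 objects that map a word to its vector (e.g. Gensim KeyedVectors); the names are illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_stats(models, word_a, word_b):
    """Mean and standard deviation of cos(word_a, word_b) across the models."""
    sims = np.array([cosine(m[word_a], m[word_b]) for m in models])
    return sims.mean(), sims.std()

# e.g. pair_stats(lsa_models, "lascivus", "bonus") -> (0.1, 0.0) as in the table above
```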
- Select the 20 closest words to each test word and compute the Jaccard similarity between these lists across models
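A sketch of that list comparison, assuming Gensim-style `most_similar`; averaging over all pairs of the 50 models is an assumption about how the runs are compared:

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbour_stability(models, word, k=20):
    """Average Jaccard similarity of the top-k neighbour lists of `word`
    over every pair of models."""
    tops = [{w for w, _ in m.most_similar(word, topn=k)} for m in models]
    scores = [jaccard(t1, t2) for t1, t2 in combinations(tops, 2)]
    return sum(scores) / len(scores)
```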
- Embeddings are not an objective view of a corpus nor of a language
- "The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer docu-ments are more susceptible to variation in the cosine similarities between embeddings."
- Find the parameter settings where the standard deviation is lowest; in short, tune the hyperparameters.