Antoniak and Mimno, 2018
- Remove words appearing fewer than 20 times
- Number of documents
- Unique Words
- Vocabulary density (unique words / total number of words?)
- Words per document (Average)
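A minimal sketch of how these statistics could be computed (not the paper's code), assuming the corpus is already a list of token lists; all names are illustrative and the `min_count=20` cut-off mirrors the preprocessing above:

```python
from collections import Counter

def corpus_stats(documents, min_count=20):
    """Corpus statistics for a list of token lists (illustrative names)."""
    counts = Counter(tok for doc in documents for tok in doc)
    kept = {w for w, c in counts.items() if c >= min_count}   # drop words seen < 20 times
    docs = [[tok for tok in doc if tok in kept] for doc in documents]
    n_docs = len(docs)
    n_tokens = sum(len(d) for d in docs)
    n_types = len({tok for d in docs for tok in d})
    return {
        "documents": n_docs,
        "unique_words": n_types,
        "vocabulary_density": n_types / n_tokens,   # unique words / total words
        "words_per_document": n_tokens / n_docs,    # average
    }
```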
- One which reads the documents in the same order every time
- Evaluates randomness of algorithm (Random initialization, etc.)
- One which shuffles the documents
- Evaluates impact of document order in the learning
- One which removes documents from the set and replaces them (sampling with replacement) to keep the corpus the same size (the three strategies are sketched after this list)
- Evaluates variability due to the presence of specific sequences
- Keep 20% of the corpus, preferably the beginning
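A sketch of the three corpus orderings described above (fixed, shuffled, bootstrap); the function names and seed handling are assumptions, not the paper's code:

```python
import random

def fixed_order(documents):
    """Same order every run: isolates the algorithm's own randomness (initialisation, etc.)."""
    return list(documents)

def shuffled_order(documents, seed=0):
    """Random document order: measures the impact of document order on learning."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs

def bootstrap_sample(documents, seed=0):
    """Sample documents with replacement, keeping the corpus size constant:
    measures variability due to the presence/absence of specific documents."""
    rng = random.Random(seed)
    return [rng.choice(documents) for _ in range(len(documents))]
```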
2 Settings:
- 1 document = 1 sentence
- 1 document = multiple sentences
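The two settings could be produced as below, assuming raw texts and NLTK tokenisers (an assumption; any sentence splitter works):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

def sentence_documents(texts):
    """Setting 1: one document = one sentence."""
    return [word_tokenize(s) for t in texts for s in sent_tokenize(t)]

def full_documents(texts):
    """Setting 2: one document = multiple sentences (the whole text)."""
    return [word_tokenize(t) for t in texts]
```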
- LSA: term-document matrix with TF-IDF
- L2 Normalization
- Sublinear TF Scaling
- Dimensionality reduction via randomized solver
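A hedged sketch of this LSA pipeline with scikit-learn, assuming `raw_documents` is a list of strings; `min_df=20` mirrors the frequency cut-off above, while the 100 dimensions are an assumption not taken from the notes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# TF-IDF term-document matrix with sublinear TF scaling and L2 normalisation.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2", min_df=20)
X = vectorizer.fit_transform(raw_documents)            # documents x terms

# Dimensionality reduction via a randomized truncated SVD.
svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=0)
doc_vectors = svd.fit_transform(X)                     # documents x 100
word_vectors = svd.components_.T                       # terms x 100
```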
- SGNS: skip-gram with negative sampling (Mikolov et al., 2013; word2vec)
- Gensim
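A sketch of SGNS training with Gensim (>= 4.0 API, where the dimensionality parameter is called `vector_size`); the dimensionality and seed are assumptions, `min_count=20` mirrors the frequency cut-off:

```python
from gensim.models import Word2Vec

sgns = Word2Vec(
    sentences=tokenized_documents,  # list of token lists (one per document)
    vector_size=100,                # assumption: dimensionality not given in the notes
    window=5,
    sg=1,                           # 1 = skip-gram (0 would be CBOW)
    negative=5,                     # negative sampling
    min_count=20,                   # mirrors the < 20 occurrences cut-off
    seed=0,
)
word_vectors = sgns.wv              # KeyedVectors: word_vectors["bonus"], .most_similar(...)
```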
- GloVe: nothing specific, library defaults
- PPMI: hyperwords library
  - cds = 0.75 (context distribution smoothing)
  - eig = 0.5 (eigenvalue weighting exponent)
  - sub = 10^-5 (subsampling threshold)
  - win = 5 (context window size)
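These are hyperwords settings; the sketch below is not the hyperwords code, just a minimal NumPy illustration of what cds (context distribution smoothing) and eig (singular-value weighting) do to a co-occurrence matrix:

```python
import numpy as np

def ppmi(C, cds=0.75):
    """Positive PMI from a word x context co-occurrence count matrix C.

    cds: the context distribution is smoothed by raising context counts
    to this power, which dampens very frequent context words.
    """
    total = C.sum()
    p_w = C.sum(axis=1) / total                  # P(w)
    ctx = C.sum(axis=0) ** cds                   # smoothed context counts
    p_c = ctx / ctx.sum()                        # P_cds(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (p_w[:, None] * p_c[None, :]))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def svd_embeddings(M, dim=100, eig=0.5):
    """Word vectors from the PPMI matrix: W = U_dim * S_dim**eig."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * (S[:dim] ** eig)
```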
- Train 50 of each
- Select 20 relevant query words using a 200-topic LDA topic model
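A sketch with Gensim's LdaModel; the exact rule for picking the 20 words is not in the notes, so the selection below (most probable words accumulated across topics) is only a placeholder:

```python
from collections import Counter
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(tokenized_documents)
bow = [dictionary.doc2bow(doc) for doc in tokenized_documents]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=200, random_state=0)

# Collect high-probability topic words, then keep 20 of them as query words.
candidates = Counter()
for topic_id in range(lda.num_topics):
    for word, prob in lda.show_topic(topic_id, topn=10):
        candidates[word] += prob
query_words = [w for w, _ in candidates.most_common(20)]
```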
- Compute cosine similarity of each word to other words, calculate mean and standard deviation across each set of 50 models
e.g.

| word 1 | word 2 | LSA 1 | LSA 2 | LSA 3 | ... | LSA 50 |
|---|---|---|---|---|---|---|
| lascivus | bonus | 0.1 | 0.1 | 0.1 | ... | 0.1 |

LSA average for lascivus-bonus = 0.1, standard deviation = 0
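A sketch of that aggregation, assuming `models` is a list of 50 objects that map a word to its vector (e.g. Gensim KeyedVectors); the names are illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_stats(models, word_a, word_b):
    """Mean and standard deviation of cos(word_a, word_b) across the models."""
    sims = np.array([cosine(m[word_a], m[word_b]) for m in models])
    return sims.mean(), sims.std()

# e.g. pair_stats(lsa_models, "lascivus", "bonus") -> (0.1, 0.0) as in the table above
```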
- Select the 20 closest words to each test word and compute the Jaccard similarity between these lists across models
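A sketch of that list comparison, assuming Gensim-style `most_similar`; averaging over all pairs of the 50 models is an assumption about how the runs are compared:

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbour_stability(models, word, k=20):
    """Average Jaccard similarity of the top-k neighbour lists of `word`
    over every pair of models."""
    tops = [{w for w, _ in m.most_similar(word, topn=k)} for m in models]
    scores = [jaccard(t1, t2) for t1, t2 in combinations(tops, 2)]
    return sum(scores) / len(scores)
```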
- Embeddings are not an objective view of a corpus nor of a language
- "The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer docu-ments are more susceptible to variation in the cosine similarities between embeddings."
- Find the parameter settings where the standard deviation is lowest; in short, tune the hyperparameters.