# Migrating from Gensim 3.x to 4
Gensim 4.0 is compatible with older releases (3.8.3 and prior) for the most part. Your existing stored models and code will continue to work in 4.0, except:
- Gensim 4.0+ is Python 3 only. See the Gensim & Compatibility policy page for supported Python 3 versions.
The `*2Vec`-related classes (`Word2Vec`, `FastText`, & `Doc2Vec`) have undergone significant internal refactoring for clarity, consistency, efficiency & maintainability:
- The `size` constructor parameter is now `vector_size`:

  ```python
  model = Word2Vec(size=100, …)  # 🚫
  model = FastText(size=100, …)  # 🚫
  model = Doc2Vec(size=100, …)  # 🚫

  model = Word2Vec(vector_size=100, …)  # 👍
  model = FastText(vector_size=100, …)  # 👍
  model = Doc2Vec(vector_size=100, …)  # 👍
  ```
- The `iter` constructor parameter is now consistently `epochs` everywhere:

  ```python
  model = Word2Vec(iter=5, …)  # 🚫
  model = FastText(iter=5, …)  # 🚫
  model = Doc2Vec(iter=5, …)  # 🚫

  model = Word2Vec(epochs=5, …)  # 👍
  model = FastText(epochs=5, …)  # 👍
  model = Doc2Vec(epochs=5, …)  # 👍
  ```
- The `index2word` and `index2entity` attributes are now `index_to_key`:

  ```python
  random_word = random.choice(model.wv.index2word)  # 🚫
  random_word = random.choice(model.wv.index_to_key)  # 👍
  ```
- The `vocab` dict became `key_to_index` for looking up a key's integer index, or `get_vecattr()` and `set_vecattr()` for other per-key attributes:

  ```python
  rock_idx = model.wv.vocab["rock"].index  # 🚫
  rock_cnt = model.wv.vocab["rock"].count  # 🚫
  vocab_len = len(model.wv.vocab)  # 🚫

  rock_idx = model.wv.key_to_index["rock"]  # 👍
  rock_cnt = model.wv.get_vecattr("rock", "count")  # 👍
  vocab_len = len(model.wv)  # 👍
  ```
- No more `init_sims()`. L2-normalized vectors are now computed dynamically, on request. The full numpy array of "normalized vectors" is no longer stored in memory:

  ```python
  all_normed_vectors = model.wv.get_normed_vectors()  # still works, but now creates a new array on each call!
  normed_vector = model.wv.vectors_norm[model.wv.vocab["rock"].index]  # 🚫
  normed_vector = model.wv.get_vector("rock", norm=True)  # 👍
  ```
- No more `vocabulary` and `trainables` attributes; properties previously there have been moved back to the model:

  ```python
  out_weights = model.trainables.syn1neg  # 🚫
  min_count = model.vocabulary.min_count  # 🚫

  out_weights = model.syn1neg  # 👍
  min_count = model.min_count  # 👍
  ```
- The `Doc2Vec.docvecs` attribute is now `Doc2Vec.dv` …and it's a standard `KeyedVectors` object, so it has all the standard attributes and methods of `KeyedVectors`:

  ```python
  random_doc_id = np.random.randint(doc2vec_model.docvecs.count)  # 🚫
  document_vector = doc2vec_model.docvecs["some_document_tag"]  # 🚫

  random_doc_id = np.random.randint(len(doc2vec_model.dv))  # 👍
  document_vector = doc2vec_model.dv["some_document_tag"]  # 👍
  ```
- To check whether a word is fully "OOV" (out of vocabulary) for `FastText`:

  ```python
  "night" in model.wv.vocab  # 🚫
  "night" in model.wv.key_to_index  # 👍
  ```

  Of course, even OOV words have vectors in `FastText` (assembled from the vectors of their character n-grams), so the following is not a good way to test for the presence of a word:

  ```python
  "no_such_word" in model.wv  # 🚫 always returns True for FastText!
  model.wv["no_such_word"]  # returns a vector even for OOV words
  ```
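The reason every string gets a vector: FastText composes a word's vector from the vectors of its character n-grams. Here is a toy sketch of that idea — illustrative only; the function names below are made up, and gensim's real `FastText` hashes n-grams into a fixed-size bucket matrix rather than keeping a dict:

```python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of a word, wrapped in boundary markers, FastText-style."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

rng = np.random.default_rng(0)
ngram_vecs = {}  # toy stand-in for the trained n-gram matrix

def assemble_vector(word, dim=4):
    """Average the word's n-gram vectors -- so even an unseen word yields *some* vector."""
    grams = char_ngrams(word)
    for g in grams:
        ngram_vecs.setdefault(g, rng.standard_normal(dim))
    return np.mean([ngram_vecs[g] for g in grams], axis=0)

print(char_ngrams("night")[:3])  # ['<ni', 'nig', 'igh']
print(assemble_vector("no_such_word").shape)  # (4,)
```

This is why `"no_such_word" in model.wv` can't signal absence: a vector can always be assembled, so membership of the *word itself* must be checked against `key_to_index`.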
The following notes are for advanced users who were using or extending the Gensim internals more deeply, perhaps relying on protected/private attributes.
- A key change is the creation of a unified `KeyedVectors` class for working with sets of vectors, reused for both word-vectors and doc-vectors, both when these are a subcomponent of the full algorithm models (for training) and when they are separate vector-sets (for lighter-weight re-use). Thus, this unified class shares the same (& often improved) convenience methods & implementations.
- One notable internal change: performing the usual similarity operations no longer requires creating a second full cache of unit-normalized vectors via the `init_sims()` method, stored in the `.vectors_norm` property. That used to mean a noticeable delay on first use, much higher memory use, and extra complications when attempting to deploy/share vectors among multiple processes.
- A number of errors and inefficiencies in the `FastText` implementation have been corrected. Model size, both in memory and when saved to disk, will be much smaller, and using `FastText` as if it were `Word2Vec`, by disabling character n-grams (with `max_n=0`), should be as fast & performant as vanilla `Word2Vec`.
- When supplying a Python iterable corpus to instance-initialization, `build_vocab()`, or `train()`, the parameter name is now `corpus_iterable`, to reflect the central expectation (that it is an iterable) and for correspondence with the `corpus_file` alternative. The prior model-specific names for this parameter, like `sentences` or `documents`, were overly specific, and sometimes led users to the mistaken belief that such input must be precisely natural-language sentences.
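Note that whatever you pass as `corpus_iterable` should be restartable, since training typically makes multiple passes over it (vocabulary building plus one pass per epoch); a plain generator is exhausted after the first pass. A minimal sketch — the class name `TokenizedCorpus` is made up for illustration:

```python
class TokenizedCorpus:
    """Any object whose __iter__ yields lists of string tokens can serve as
    corpus_iterable. Defining __iter__ (rather than passing a one-shot
    generator) gives a fresh iterator on every pass."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.lower().split()

corpus = TokenizedCorpus(["Human machine interface", "Survey of user opinion"])
first_pass = list(corpus)
second_pass = list(corpus)  # a one-shot generator would yield nothing here
print(first_pass[0])  # ['human', 'machine', 'interface']
```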
If you're unsure or getting unexpected results, let us know at the Gensim mailing list.
- The `Phraser` class is now called `FrozenPhrases` …to be more explicit in its intent, and easier to tell apart from its chunkier parent `Phrases`:

  ```python
  phrases = Phrases(corpus)
  phraser = Phraser(phrases)  # 🚫

  phrases = Phrases(corpus)
  frozen_phrases = phrases.freeze()  # 👍
  ```
Note that phrases (collocation detection, multi-word expressions) have been pretty much rewritten from scratch for Gensim 4.0, and are now more efficient and flexible overall.
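For intuition, the classic collocation score behind this kind of phrase detection (from the original word2vec work) rewards bigrams that co-occur far more often than their parts' individual frequencies would predict. A hedged sketch — this reflects my understanding of the default scoring formula, so check the `Phrases` documentation for the exact behavior and parameters:

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """score = (count(a,b) - min_count) / (count(a) * count(b)) * vocab_size
    Bigrams scoring above a chosen threshold get merged into a single token."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# "new" seen 100x, "york" seen 60x, "new york" together 50x, in a 1000-key vocab:
strong = phrase_score(100, 60, 50, vocab_size=1000, min_count=5)
weak = phrase_score(100, 60, 6, vocab_size=1000, min_count=5)
print(strong)  # 7.5 -- frequent co-occurrence scores high
print(weak)    # a rarely co-occurring pair scores much lower
```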
- Removed `gensim.summarization`. Despite its general-sounding name, the module will not satisfy the majority of use cases in production and is likely to waste people's time. See this GitHub ticket for more motivation behind its removal.
- Removed "TFIDF pivoted normalization": a rarely used contributed module, of poor quality in both code and documentation.
- Renamed the overly broad `similarities.index` to the more appropriate `similarities.annoy`.
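For context on what `similarities.annoy` speeds up: an exact most-similar query is a brute-force scan — unit-normalize the vectors, dot them with a normalized query, and sort — which Annoy approximates to avoid the full scan over large vocabularies. A minimal numpy sketch of the exact baseline (not gensim's implementation):

```python
import numpy as np

def most_similar(vectors, query, topn=3):
    """Exact cosine-similarity search: normalize rows, dot with the normalized query."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = normed @ q
    best = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in best]

rng = np.random.default_rng(42)
vecs = rng.standard_normal((100, 8))
hits = most_similar(vecs, vecs[7])
print(hits[0][0])  # row 7 matches itself first, with cosine similarity 1.0
```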