# Migrating from Gensim 3.x to 4
Gensim 4.0 is compatible with older releases (3.8.3 and prior) for the most part. Your existing stored models and code will continue to work in 4.0, except:
- Gensim 4.0+ is Python 3 only. See the Gensim & Compatibility policy page for supported Python 3 versions.
The `*2Vec`-related classes (`Word2Vec`, `FastText`, & `Doc2Vec`) have undergone significant internal refactoring for clarity, consistency, efficiency & maintainability:
- The `size` constructor parameter is now `vector_size`:

  ```python
  model = Word2Vec(size=100, …)  # 🚫
  model = FastText(size=100, …)  # 🚫
  model = Doc2Vec(size=100, …)  # 🚫

  model = Word2Vec(vector_size=100, …)  # 👍
  model = FastText(vector_size=100, …)  # 👍
  model = Doc2Vec(vector_size=100, …)  # 👍
  ```
- The `iter` constructor parameter is now consistently `epochs` everywhere:

  ```python
  model = Word2Vec(iter=5, …)  # 🚫
  model = FastText(iter=5, …)  # 🚫
  model = Doc2Vec(iter=5, …)  # 🚫

  model = Word2Vec(epochs=5, …)  # 👍
  model = FastText(epochs=5, …)  # 👍
  model = Doc2Vec(epochs=5, …)  # 👍
  ```
- The `index2word` and `index2entity` attributes are now `index_to_key`:

  ```python
  random_word = random.choice(model.wv.index2word)  # 🚫
  random_word = random.choice(model.wv.index_to_key)  # 👍
  ```
- The `vocab` dict became `key_to_index` for looking up a key's integer index, or `get_vecattr()` and `set_vecattr()` for other per-key attributes:

  ```python
  rock_idx = model.wv.vocab["rock"].index  # 🚫
  rock_cnt = model.wv.vocab["rock"].count  # 🚫
  vocab_len = len(model.wv.vocab)  # 🚫

  rock_idx = model.wv.key_to_index["rock"]  # 👍
  rock_cnt = model.wv.get_vecattr("rock", "count")  # 👍
  vocab_len = len(model.wv)  # 👍
  ```
- No more `init_sims()`. L2-normalized vectors are now computed dynamically, on request. The full numpy array of "normalized vectors" is no longer stored in memory:

  ```python
  all_normed_vectors = model.wv.get_normed_vectors()  # still works, but now creates a new array on each call!
  normed_vector = model.wv.vectors_norm[model.wv.vocab["rock"].index]  # 🚫
  normed_vector = model.wv.get_vector("rock", norm=True)  # 👍
  ```
- No more `vocabulary` and `trainables` attributes; properties previously there have been moved back to the model:

  ```python
  out_weights = model.trainables.syn1neg  # 🚫
  min_count = model.vocabulary.min_count  # 🚫

  out_weights = model.syn1neg  # 👍
  min_count = model.min_count  # 👍
  ```
- The `Doc2Vec.docvecs` attribute is now `Doc2Vec.dv` …and it's a standard `KeyedVectors` object, so it has all the standard attributes and methods of `KeyedVectors`:

  ```python
  random_doc_id = np.random.randint(doc2vec_model.docvecs.count)  # 🚫
  document_vector = doc2vec_model.docvecs["some_document_tag"]  # 🚫

  random_doc_id = np.random.randint(len(doc2vec_model.dv))  # 👍
  document_vector = doc2vec_model.dv["some_document_tag"]  # 👍
  ```
- To check whether a word is fully "OOV" (out of vocabulary) for `FastText`:

  ```python
  "night" in model.wv.vocab  # 🚫
  "night" in model.wv.key_to_index  # 👍
  ```

  Of course, even OOV words have vectors in `FastText` (assembled from the vectors of their character n-grams), so the following is not a good way to test for the presence of a word:

  ```python
  "no_such_word" in model.wv  # 🚫 always returns True for FastText!
  model.wv["no_such_word"]  # returns a vector even for OOV words
  ```
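The reason every string gets a vector: FastText composes a word's vector from the vectors of its character n-grams. Here is a toy sketch of that idea — illustrative only; the function names below are made up, and gensim's real `FastText` hashes n-grams into a fixed-size bucket matrix rather than keeping a dict:

```python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of a word, wrapped in boundary markers, FastText-style."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

rng = np.random.default_rng(0)
ngram_vecs = {}  # toy stand-in for the trained n-gram matrix

def assemble_vector(word, dim=4):
    """Average the word's n-gram vectors -- so even an unseen word yields *some* vector."""
    grams = char_ngrams(word)
    for g in grams:
        ngram_vecs.setdefault(g, rng.standard_normal(dim))
    return np.mean([ngram_vecs[g] for g in grams], axis=0)

print(char_ngrams("night")[:3])  # ['<ni', 'nig', 'igh']
print(assemble_vector("no_such_word").shape)  # (4,)
```

This is why `"no_such_word" in model.wv` can't signal absence: a vector can always be assembled, so membership of the *word itself* must be checked against `key_to_index`.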
The following notes are for advanced users who were using or extending the Gensim internals more deeply, perhaps relying on protected/private attributes.
- A key change is the creation of a unified `KeyedVectors` class for working with sets of vectors, reused for both word-vectors and doc-vectors, both when these are a subcomponent of the full algorithm models (for training) and when they are separate vector-sets (for lighter-weight re-use). Thus, this unified class shares the same (& often improved) convenience methods & implementations.
- One notable internal change: performing the usual similarity operations no longer requires creating a second full cache of unit-normalized vectors via the `init_sims()` method, stored in the `.vectors_norm` property. That used to mean a noticeable delay on first use, much higher memory use, and extra complications when attempting to deploy/share vectors among multiple processes.
- A number of errors and inefficiencies in the `FastText` implementation have been corrected. Model size, both in memory and when saved to disk, will be much smaller, and using `FastText` as if it were `Word2Vec`, by disabling character n-grams (with `max_n=0`), should be as fast & performant as vanilla `Word2Vec`.
- When supplying a Python iterable corpus to instance-initialization, `build_vocab()`, or `train()`, the parameter name is now `corpus_iterable`, to reflect the central expectation (that it is an iterable) and for correspondence with the `corpus_file` alternative. The prior model-specific names for this parameter, like `sentences` or `documents`, were overly specific, and sometimes led users to the mistaken belief that such input must be precisely natural-language sentences.
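Note that whatever you pass as `corpus_iterable` should be restartable, since training typically makes multiple passes over it (vocabulary building plus one pass per epoch); a plain generator is exhausted after the first pass. A minimal sketch — the class name `TokenizedCorpus` is made up for illustration:

```python
class TokenizedCorpus:
    """Any object whose __iter__ yields lists of string tokens can serve as
    corpus_iterable. Defining __iter__ (rather than passing a one-shot
    generator) gives a fresh iterator on every pass."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.lower().split()

corpus = TokenizedCorpus(["Human machine interface", "Survey of user opinion"])
first_pass = list(corpus)
second_pass = list(corpus)  # a one-shot generator would yield nothing here
print(first_pass[0])  # ['human', 'machine', 'interface']
```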
If you're unsure or getting unexpected results, let us know at the Gensim mailing list.
- The `Phraser` class is now called `FrozenPhrases` …to be more explicit in its intent, and easier to tell apart from its chunkier parent `Phrases`:

  ```python
  phrases = Phrases(corpus)
  phraser = Phraser(phrases)  # 🚫

  phrases = Phrases(corpus)
  frozen_phrases = phrases.freeze()  # 👍
  ```
Note that phrases (collocation detection, multi-word expressions) have been pretty much rewritten from scratch for Gensim 4.0, and are now more efficient and flexible overall.
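For intuition, the classic collocation score behind this kind of phrase detection (from the original word2vec work) rewards bigrams that co-occur far more often than their parts' individual frequencies would predict. A hedged sketch — this reflects my understanding of the default scoring formula, so check the `Phrases` documentation for the exact behavior and parameters:

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """score = (count(a,b) - min_count) / (count(a) * count(b)) * vocab_size
    Bigrams scoring above a chosen threshold get merged into a single token."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# "new" seen 100x, "york" seen 60x, "new york" together 50x, in a 1000-key vocab:
strong = phrase_score(100, 60, 50, vocab_size=1000, min_count=5)
weak = phrase_score(100, 60, 6, vocab_size=1000, min_count=5)
print(strong)  # 7.5 -- frequent co-occurrence scores high
print(weak)    # a rarely co-occurring pair scores much lower
```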
- Removed `gensim.summarization`. Despite its general-sounding name, the module will not satisfy the majority of use cases in production and is likely to waste people's time. See this GitHub ticket for more motivation behind its removal.
- Removed "TFIDF pivoted normalization": a rarely used contributed module, of poor quality in both code and documentation.
- Renamed the overly broad `similarities.index` to the more appropriate `similarities.annoy`.
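For context on what `similarities.annoy` speeds up: an exact most-similar query is a brute-force scan — unit-normalize the vectors, dot them with a normalized query, and sort — which Annoy approximates to avoid the full scan over large vocabularies. A minimal numpy sketch of the exact baseline (not gensim's implementation):

```python
import numpy as np

def most_similar(vectors, query, topn=3):
    """Exact cosine-similarity search: normalize rows, dot with the normalized query."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = normed @ q
    best = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in best]

rng = np.random.default_rng(42)
vecs = rng.standard_normal((100, 8))
hits = most_similar(vecs, vecs[7])
print(hits[0][0])  # row 7 matches itself first, with cosine similarity 1.0
```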