Gensim Cheat Sheet

Install Gensim with pip:
$ pip install gensim

About

- Open-source Python library.
- Used for unsupervised topic modeling.
- Designed to extract semantic topics from documents.

Corpora and Vector Spaces

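From string to vector -

>>> from gensim import corpora
>>> docs = ["Gensim is an open-source library", "Used in unsupervised topic modelling"]
>>> texts = [doc.lower().split() for doc in docs]  # tokenize each document
>>> dictionary = corpora.Dictionary(texts)  # map each token to an integer id
>>> dictionary.save('/tmp/deerwester.dict')
>>> print(dictionary)
>>> print(dictionary.token2id)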
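
To make the mapping concrete, the following dependency-free sketch mimics what `corpora.Dictionary` and its `doc2bow` method produce: a token-to-id mapping and, per document, a sorted list of `(token_id, count)` pairs. The helper names `build_token2id` and `to_bow` are hypothetical, and the id assignment order is an assumption of this sketch, not a guarantee about gensim's internals.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token (illustrative only)."""
    token2id = {}
    for tokens in texts:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


texts = [
    "gensim is an open-source library".split(),
    "used in unsupervised topic modelling".split(),
]
token2id = build_token2id(texts)
bow_corpus = [to_bow(tokens, token2id) for tokens in texts]
print(token2id)
print(bow_corpus)
```

With gensim installed, `dictionary.doc2bow(tokens)` plays the role of `to_bow` here.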

Corpus Format -

Matrix Market format:
>>> corpus = [[(1, 0.5)], []]  # the second document is empty
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

Other formats include Joachims’ SVMlight format, Blei’s LDA-C format, and the GibbsLDA++ format.
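
For orientation, the sketch below renders the two-document corpus from the Matrix Market example above as coordinate-format text, using only the standard library. It assumes the usual layout (a `%%MatrixMarket` header, a `rows columns nonzeros` line, then 1-based `doc_id term_id value` triples). The helper `to_matrix_market` is hypothetical; in practice `corpora.MmCorpus.serialize` does the writing.

```python
def to_matrix_market(corpus):
    """Render a sparse corpus as Matrix Market coordinate text (illustrative)."""
    num_docs = len(corpus)
    # The number of terms is inferred from the largest term id seen.
    num_terms = 1 + max((tid for doc in corpus for tid, _ in doc), default=-1)
    entries = [
        (doc_no + 1, term_id + 1, value)  # Matrix Market indices are 1-based
        for doc_no, doc in enumerate(corpus)
        for term_id, value in doc
    ]
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {len(entries)}",
    ]
    lines += [f"{d} {t} {v}" for d, t, v in entries]
    return "\n".join(lines)


corpus = [[(1, 0.5)], []]  # the second document is empty
mm_text = to_matrix_market(corpus)
print(mm_text)
```

Note that the empty document contributes no entry line; it is only reflected in the document count.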

API References

matutils - Math helper functions.
- class gensim.matutils.Dense2Corpus(dense, documents_columns=True)
- class gensim.matutils.MmWriter(fname)
- class gensim.matutils.Scipy2Corpus(vecs)
- class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True)
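
As a rough illustration of what the dense/sparse helpers do, this plain-Python sketch converts a dense term-document matrix (nested lists standing in for a NumPy array) into gensim-style sparse documents and back. It assumes `documents_columns=True` means each column is one document. The function names `dense_to_corpus` and `corpus_to_dense` are hypothetical stand-ins, not the `matutils` API.

```python
def dense_to_corpus(dense):
    """Treat each column of `dense` as a document; keep non-zero entries."""
    num_terms = len(dense)
    num_docs = len(dense[0]) if dense else 0
    return [
        [(term, dense[term][doc]) for term in range(num_terms) if dense[term][doc] != 0]
        for doc in range(num_docs)
    ]


def corpus_to_dense(corpus, num_terms):
    """Inverse operation: rebuild a num_terms x num_docs matrix."""
    dense = [[0.0] * len(corpus) for _ in range(num_terms)]
    for doc_no, doc in enumerate(corpus):
        for term_id, value in doc:
            dense[term_id][doc_no] = value
    return dense


dense = [
    [1.0, 0.0],  # term 0
    [0.0, 2.0],  # term 1
    [3.0, 0.0],  # term 2
]
sparse_docs = dense_to_corpus(dense)
print(sparse_docs)
```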

API References

Models -
- models.ldamodel
- models.lsimodel
- models.tfidfmodel
- models.hdpmodel
- models.word2vec
- models.doc2vec
- models.fasttext
 

Features

1. Scalability
2. Robustness
3. Platform Agnostic
4. Open-source
5. Community Support

Topics and Transformations

# initialize a model
from gensim import models
tfidf = models.TfidfModel(corpus)

# use the model to transform vectors
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])
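
To show what such a transformation computes, here is a dependency-free sketch of one common TF-IDF weighting: raw term frequency times log2(N / document frequency), followed by L2 normalisation. This is believed to match the defaults of `TfidfModel`, but treat that as an assumption. The function `tfidf_transform` and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_transform(bow, doc_freq, num_docs):
    """Weight a bag-of-words vector by tf * log2(N / df), then L2-normalise."""
    weighted = [
        (term_id, tf * math.log2(num_docs / doc_freq[term_id]))
        for term_id, tf in bow
        if term_id in doc_freq
    ]
    # Terms that occur in every document get weight 0 and are dropped.
    weighted = [(t, w) for t, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else []


corpus = [
    [(0, 1), (1, 1)],
    [(1, 2), (2, 1)],
    [(1, 1), (3, 1)],
]
# Document frequency: in how many documents each term id appears.
doc_freq = Counter(term_id for doc in corpus for term_id, _ in doc)

doc_bow = [(0, 1), (1, 1)]
weights = tfidf_transform(doc_bow, doc_freq, len(corpus))
print(weights)
```

In this toy corpus term 1 appears in every document, so it carries no discriminating weight and only term 0 survives in the output.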

API References

utils - various general utility functions.
- class gensim.utils.ClippedCorpus(corpus, max_docs=None)
- class gensim.utils.FakeDict(num_terms)
- class gensim.utils.InputQueue(q, corpus, chunksize, maxsize, as_numpy)
- class gensim.utils.RepeatCorpus(corpus, reps)
- class gensim.utils.SaveLoad
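
The two corpus wrappers in this list are easiest to understand by their effect on iteration. The sketch below reproduces that behaviour with `itertools`. The function names `clipped` and `repeated` are hypothetical stand-ins, and the real classes wrap the corpus in an object instead of returning an iterator.

```python
from itertools import cycle, islice


def clipped(corpus, max_docs=None):
    """Yield at most `max_docs` documents (all of them if max_docs is None)."""
    return islice(corpus, max_docs)


def repeated(corpus, reps):
    """Yield exactly `reps` documents, cycling through the corpus as needed."""
    return islice(cycle(corpus), reps)


corpus = [[(0, 1)], [(1, 2)], [(2, 3)]]
first_two = list(clipped(corpus, max_docs=2))
five_docs = list(repeated(corpus, reps=5))
print(first_two)
print(five_docs)
```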

API References

Corpora -
- corpora.bleicorpus - corpus in Blei's LDA-C format
- corpora.dictionary - construct word <-> id mappings
- corpora.lowcorpus - corpus in list-of-words format
- corpora.mmcorpus - corpus in Matrix Market format
- corpora.svmlightcorpus - corpus in SVMlight format
- corpora.wikicorpus - corpus from a Wikipedia dump
- corpora.textcorpus - building corpora with dictionaries
 

Core concepts

1. Document: any text.
>>> doc = "Gensim is an open-source Python library."
2. Corpus: a collection of documents.
>>> corpus = ["Gensim is an open-source library", "Used in unsupervised topic modelling"]
3. Vector: a mathematically convenient representation of a document.
>>> pprint.pprint(dictionary.token2id)  # token -> vector dimension (id)
4. Model: an algorithm that transforms vectors from one representation to another.
>>> tfidf = models.TfidfModel(BoW_corpus)

API References

interfaces - core gensim interfaces, realized as abstract base classes.

- class gensim.interfaces.CorpusABC
>>> for doc in corpus:
...     pass  # do something with the doc...

>>> for attr_id, attr_value in doc:
...     pass  # do something with the attribute

- class gensim.interfaces.SimilarityABC(corpus)
>>> from gensim.similarities import MatrixSimilarity
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> index = MatrixSimilarity(common_corpus)
>>> similarities = index.get_similarities(common_corpus[1])

- class gensim.interfaces.TransformationABC
>>> from gensim.models import LsiModel
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> bow_vector = model[common_corpus[0]]  # transform a single document
>>> bow_corpus = model[common_corpus]  # transform the whole corpus
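
The corpus interface boils down to an iterable that yields one sparse document at a time, so a corpus can be streamed from disk instead of held in memory. The class below is a duck-typed sketch of that idea using only the standard library. `StreamingCorpus`, the in-memory text source, and the vocabulary are hypothetical, and a real implementation would typically subclass `gensim.interfaces.CorpusABC`.

```python
import io
from collections import Counter


class StreamingCorpus:
    """Yield one bag-of-words vector per line of a text source."""

    def __init__(self, open_source, token2id):
        self.open_source = open_source  # callable returning a fresh file-like object
        self.token2id = token2id

    def __iter__(self):
        with self.open_source() as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[tok]
                    for tok in line.lower().split()
                    if tok in self.token2id
                )
                yield sorted(counts.items())


raw_text = "topic modelling with gensim\ngensim streams gensim corpora\n"
token2id = {"gensim": 0, "topic": 1, "modelling": 2, "corpora": 3}
corpus = StreamingCorpus(lambda: io.StringIO(raw_text), token2id)

for doc in corpus:                   # each pass re-opens the source
    for attr_id, attr_value in doc:  # (token id, count) pairs
        print(attr_id, attr_value)
```

Because `__iter__` re-opens the source on every call, the corpus can be iterated more than once, which multi-pass models rely on.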
   
 
