is450 Cheat Sheet

preprocessing pipeline

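import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumed setup: stop_list and stemmer must exist before corpus2docs is called.
stop_list = set(stopwords.words('english'))
stemmer = PorterStemmer()

def corpus2docs(corpus):
    # corpus is an NLTK corpus reader (anything with fileids() and raw()).
    # 1. tokenize each file
    docs1 = [nltk.word_tokenize(corpus.raw(fid)) for fid in corpus.fileids()]
    # 2. lowercase
    docs2 = [[w.lower() for w in doc] for doc in docs1]
    # 3. keep purely alphabetic tokens
    docs3 = [[w for w in doc if re.search(r'^[a-z]+$', w)] for doc in docs2]
    # 4. remove stop words
    docs4 = [[w for w in doc if w not in stop_list] for doc in docs3]
    # 5. stem
    docs5 = [[stemmer.stem(w) for w in doc] for doc in docs4]
    return docs5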

import gensim

def docs2vecs(docs, dictionary):
    # docs is a list of documents returned by corpus2docs.
    # dictionary is a gensim.corpora.Dictionary object.
    vecs1 = [dictionary.doc2bow(doc) for doc in docs]
    tfidf = gensim.models.TfidfModel(vecs1)
    vecs2 = [tfidf[vec] for vec in vecs1]
    return vecs2
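
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting. It assumes gensim's default scheme (raw term count times log2(N / document frequency), then L2 normalisation, with zero-weight terms dropped); that default is an assumption here, and tfidf_vectors is a hypothetical helper, not part of gensim.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: term count * log2(N / df).
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        # Terms that occur in every document get weight 0 and are dropped.
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

example_vecs = tfidf_vectors([["court", "police"], ["court", "goal"]])
```

Here "court" occurs in both documents, so it carries no weight, and each vector keeps only its distinguishing term.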

POS tags

N (noun)	dog, cat, chair
V (verb)	read, write, get
ADJ (adjective)	pretty, smart, blue
ADV (adverb)	gently, carefully, extremely
P (preposition)	in, on, by, with, about
PRO (pronoun)	I, me, mine, it, they...
CON (conjunction)	and, or, but, while, because
INT (interjection)	ooh, wow, yeah
DET (determiner)	all, his, the
AUX (auxiliary verb)	have (done), might (do)
PAR (particle)	(look) up, (get) on
NUM (numeral)	one, two, three

Context-free grammar

Grammar = {
    Objects (levels of the parse tree, bottom up): [
        words/tokens: terminals,
        one level up: POS tags,
        above that: syntactic (phrase) tags,
        top: sentence
    ];
    Rules (X -> Y): [
        X: node name,    # e.g. "VP" (verb phrase)
        Y: sequence of objects that make up X    # e.g. V NP
    ]
}
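
The structure above can be sketched as a tiny top-down parser. The rules and lexicon below are a made-up toy grammar for illustration only, and parse_sentence is a hypothetical helper; this simple approach does not handle left-recursive rules.

```python
# Toy grammar: each node name X maps to the sequences Y that can make it up.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
}
# Terminals (words) and the POS tag sitting directly above each one.
LEXICON = {"the": "DET", "dog": "N", "cat": "N", "chased": "V"}

def parse(symbol, tokens, i):
    """Yield (tree, next_index) for every way symbol can start at tokens[i]."""
    if symbol in RULES:
        for rhs in RULES[symbol]:
            for children, j in parse_seq(rhs, tokens, i):
                yield (symbol, children), j
    elif i < len(tokens) and LEXICON.get(tokens[i]) == symbol:
        # POS tag directly above a terminal.
        yield (symbol, tokens[i]), i + 1

def parse_seq(symbols, tokens, i):
    """Yield (list_of_trees, next_index) for a sequence of symbols."""
    if not symbols:
        yield [], i
        return
    for tree, j in parse(symbols[0], tokens, i):
        for rest, k in parse_seq(symbols[1:], tokens, j):
            yield [tree] + rest, k

def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    tokens = sentence.split()
    for tree, j in parse("S", tokens, 0):
        if j == len(tokens):
            return tree
    return None

tree = parse_sentence("the dog chased the cat")
```

A sentence the grammar cannot derive, such as "dog the chased", returns None.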

Morphemes

stems, affixes (prefix/suffix). Useful for POS tagging and text normalization
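
As a rough illustration of splitting a word into prefix, stem, and suffix, here is a naive sketch. The affix lists are hypothetical and far from complete; a real stemmer (such as the one used in the preprocessing pipeline) applies many more rules.

```python
def split_affixes(word, prefixes=("un", "re"), suffixes=("ing", "ed", "s")):
    """Naively split word into (prefix, stem, suffix) using fixed affix lists."""
    # Only strip an affix if at least three characters would remain.
    prefix = next(
        (p for p in prefixes if word.startswith(p) and len(word) > len(p) + 2), ""
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in suffixes if rest.endswith(s) and len(rest) > len(s) + 2), ""
    )
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

parts = split_affixes("unlocked")
```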

Semantics

synonyms	diff words, same meaning
polyseme	same word, diff meaning
hypernym/hyponym	category >>> specific
meronym/holonym	part >>> whole

LDA

gibbs sampling	1. random word-to-topic assignment
	2. re-assign each word to a topic, one by one, assuming all other assignments are correct
	3. repeat step 2 until the assignments stabilise
hyperparameters	high $\alpha$ --> each document features a mixture of most topics
	high $\eta$ --> each topic features a mixture of most words
evaluation	coherence (PMI), human evaluation
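
The Gibbs sampling steps can be sketched in pure Python as a collapsed sampler. This is a minimal illustration under assumed defaults (alpha, eta, iteration count, and the function name lda_gibbs are all hypothetical choices), not how gensim trains its LDA model.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Return (assignments, doc_topic_counts) for tokenized docs."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per doc
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics                        # words per topic
    assignments = []

    # Step 1: random word-to-topic assignment.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: re-assign each word, holding all other assignments fixed.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # P(topic | everything else), up to a constant factor.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic

toy_docs = [["police", "court", "lawyer"], ["goal", "match", "goal"]]
toy_assignments, toy_doc_topic = lda_gibbs(toy_docs, num_topics=2)
```

In the sampling weight, alpha smooths the per-document topic counts and eta smooths the per-topic word counts, which is why larger values push documents and topics toward broader mixtures.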

Sentiment-Topic Model (Plate Notation)

Cluster Purity
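
A minimal sketch, assuming the usual definition: the purity of a cluster is the share of its members that belong to its most common true class, $\frac{1}{n_j} \max_i n_{ij}$. The function name and labels are illustrative.

```python
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Fraction of a cluster's members that carry its majority label."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

purity = cluster_purity(["Crime", "Crime", "Sport", "Crime"])
```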

Overall purity
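
A minimal sketch, assuming the usual definition: overall purity weights each cluster by its size, which reduces to the sum of majority-class counts divided by the total number of items, $\frac{1}{N} \sum_j \max_i n_{ij}$. The function name and labels are illustrative.

```python
from collections import Counter

def overall_purity(clusters):
    """clusters is a list of clusters, each a list of true class labels."""
    total = sum(len(cluster) for cluster in clusters)
    # Count the majority-class members of every cluster.
    majority = sum(max(Counter(cluster).values()) for cluster in clusters)
    return majority / total

overall = overall_purity([
    ["Crime", "Crime", "Sport"],
    ["Sport", "Sport", "Sport", "Crime", "Crime"],
])
```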

Cluster Entropy
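
A minimal sketch, assuming the usual definition: the entropy of a cluster is $-\sum_i p_{ij} \log_2 p_{ij}$, where $p_{ij}$ is the share of the cluster's members in class $i$. Lower is better, and a pure cluster scores 0. The base-2 logarithm and the function name are assumptions.

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Entropy (in bits) of the true-label distribution inside one cluster."""
    n = len(labels_in_cluster)
    return sum(
        -(count / n) * math.log2(count / n)
        for count in Counter(labels_in_cluster).values()
    )

entropy = cluster_entropy(["Crime", "Crime", "Sport", "Sport"])
```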

Pointwise Mutual Information
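
A minimal sketch, assuming the usual definition $\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$, with probabilities estimated from document-level co-occurrence. The estimation method, the base-2 logarithm, and the function name are assumptions.

```python
import math

def pmi(docs, x, y):
    """PMI of words x and y, estimated from how often they share a document."""
    n = len(docs)
    p_x = sum(x in doc for doc in docs) / n
    p_y = sum(y in doc for doc in docs) / n
    p_xy = sum(x in doc and y in doc for doc in docs) / n
    if p_xy == 0:
        # The words never co-occur (or one never occurs at all).
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

pmi_docs = [
    ["police", "court"],
    ["police", "court"],
    ["goal", "match"],
    ["goal", "referee"],
]
score = pmi(pmi_docs, "police", "court")
```

A positive score means the two words co-occur more often than chance would predict, which is what topic coherence measures reward.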

Discourse Markers

causal	because
consequence	as a result
conditional	if
temporal	when
additive	and
elaboration	for example, in other words (exemplification, re-wording)
contrastive/concessive	but

Preparation for NLTK classifier

# doc_tuple = (doc_representation, label), for example:

({'police':1, 'lawyer':1, 'court':1}, 'Crime')

#train_set = [doc_tuple1, doc_tuple2, ...]
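
A minimal sketch of building such a training set from tokenized, labelled documents. The helper doc_to_features and the sample data are hypothetical; the word-presence features follow the example tuple above.

```python
def doc_to_features(tokens):
    """Represent a document as a {word: 1} presence dictionary."""
    return {word: 1 for word in tokens}

# Hypothetical labelled data: (tokens, label) pairs.
labelled_docs = [
    (["police", "lawyer", "court"], "Crime"),
    (["goal", "match", "referee"], "Sport"),
]

train_set = [(doc_to_features(tokens), label) for tokens, label in labelled_docs]
```

A list in this shape can be passed to nltk.NaiveBayesClassifier.train(train_set).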


is450 Cheat Sheet by cheatingcvrlo
