Cheatography
https://cheatography.com
A quick reference guide for basic (and more advanced) natural language processing tasks in Python, using mostly nltk (the Natural Language Toolkit package), including POS tagging, lemmatizing, sentence parsing and text classification.
Handling Text
text="some text" | Assign string
list(text) | Split text into character tokens
set(text) | Unique tokens
len(text) | Number of characters
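The rows above can be run as one small sketch (the sample string is arbitrary):

```python
# Character-level handling of a string, pure Python.
text = "banana"

chars = list(text)    # split into character tokens
unique = set(text)    # unique tokens
n = len(text)         # number of characters

print(chars)   # ['b', 'a', 'n', 'a', 'n', 'a']
print(unique)  # {'a', 'b', 'n'} -- set order is not guaranteed
print(n)       # 6
```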
Accessing corpora and lexical resources
from nltk.corpus import brown | import CorpusReader object
brown.words() | Returns pretokenised document as list of words
brown.fileids() | Lists docs in Brown corpus
brown.categories() | Lists categories in Brown corpus
Tokenization
text.split(" ") | Split by space
nltk.word_tokenize(text) | nltk in-built word tokenizer
nltk.sent_tokenize(doc) | nltk in-built sentence tokenizer
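A quick contrast between the two word-splitting approaches, using only the standard library (the regex is a rough stand-in; `nltk.word_tokenize` handles contractions, abbreviations and more besides):

```python
import re

text = "Dr. Smith isn't here. Call back later!"

# Whitespace split keeps punctuation attached to words.
print(text.split(" "))

# Rough word tokenizer: runs of word characters, or single
# punctuation marks, as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Note how `later!` stays one token under `split(" ")` but the regex version separates the `!`.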
Lemmatization & Stemming
input="List listed lists listing listings" | Different suffixes
words=input.lower().split(' ') | Normalize (lowercase) words
porter=nltk.PorterStemmer() | Initialise Stemmer
[porter.stem(t) for t in words] | Create list of stems
WNL=nltk.WordNetLemmatizer() | Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words] | Use the lemmatizer
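To show what stemming does to the example words without needing nltk, here is a toy suffix-stripping stemmer; the real Porter algorithm applies ordered rewrite rules with extra conditions, so its output can differ:

```python
# Toy stemmer: strip common suffixes, longest first, but only when
# enough of the word remains. Not the Porter algorithm -- just the idea.
def toy_stem(word):
    for suffix in ("ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = "list listed lists listing listings".split()
print([toy_stem(w) for w in words])
# ['list', 'list', 'list', 'list', 'list']
```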
Part of Speech (POS) Tagging
nltk.help.upenn_tagset('MD') | Lookup definition for a POS tag
nltk.pos_tag(words) | nltk in-built POS tagger
<use an alternative tagger to illustrate ambiguity>
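A toy lookup tagger (hypothetical tag counts, pure Python) makes the ambiguity point concrete: a context-free tagger must pick one tag per word, so an ambiguous word like "permit" always gets its most frequent tag, right or wrong. `nltk.pos_tag` avoids this by using a trained model that looks at context.

```python
# Hand-made frequency table (illustrative numbers only).
tag_counts = {
    "permit": {"NN": 3, "VB": 5},   # ambiguous: noun or verb
    "the": {"DT": 10},
    "a": {"DT": 8},
}

def lookup_tag(word, default="NN"):
    """Most-frequent-tag lookup; unknown words default to NN."""
    counts = tag_counts.get(word.lower())
    if not counts:
        return default
    return max(counts, key=counts.get)

print([(w, lookup_tag(w)) for w in "The permit expired".split()])
# [('The', 'DT'), ('permit', 'VB'), ('expired', 'NN')]
```

Here "permit" is really a noun, but the lookup tagger labels it VB because that tag is more frequent overall.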
Sentence Parsing
g=nltk.data.load('grammar.cfg') | Load a grammar from a file
g=nltk.CFG.fromstring("""...""") | Manually define grammar
parser=nltk.ChartParser(g) | Create a parser out of the grammar
trees=parser.parse_all(text) | Parse all trees for a tokenised sentence
for tree in trees: print(tree) | Print the parse trees
from nltk.corpus import treebank | Import the Penn Treebank corpus sample
treebank.parsed_sents('wsj_0001.mrg') | Treebank parsed sentences
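To see what a chart-style parser does under the hood, here is a minimal CYK recogniser for a toy grammar in Chomsky normal form (the grammar and lexicon are made up for illustration; `nltk.ChartParser` handles arbitrary CFGs and also builds the trees):

```python
# Binary rules, keyed by right-hand side: A -> B C
grammar = {
    ("NP", "VP"): "S",
    ("DT", "NN"): "NP",
    ("VB", "NP"): "VP",
}
lexicon = {"the": "DT", "dog": "NN", "cat": "NN", "chased": "VB"}

def cyk_recognise(tokens):
    n = len(tokens)
    # table[i][j]: non-terminals spanning tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0].add(lexicon[tok])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for k in range(1, span):          # split point
                for b in table[i][k - 1]:
                    for c in table[i + k][span - k - 1]:
                        if (b, c) in grammar:
                            table[i][span - 1].add(grammar[(b, c)])
    return "S" in table[0][n - 1]

print(cyk_recognise("the dog chased the cat".split()))  # True
```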
Text Classification
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer | Import vectorizers
vect=CountVectorizer().fit(X_train) | Fit bag of words model to data
vect.get_feature_names_out() | Get features (the learned vocabulary)
vect.transform(X_train) | Convert to doc-term matrix
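A bare-bones sketch of what `CountVectorizer` does, in pure Python: `fit` learns a vocabulary, `transform` builds the document-term count matrix (sklearn additionally lowercases, tokenises with a regex, and returns a sparse matrix):

```python
def fit(docs):
    """Learn a sorted vocabulary mapping word -> column index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def transform(docs, vocab):
    """Build a dense doc-term count matrix; unseen words are ignored."""
    matrix = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.lower().split():
            if w in vocab:
                row[vocab[w]] += 1
        matrix.append(row)
    return matrix

docs = ["the cat sat", "the cat and the dog"]
vocab = fit(docs)
print(sorted(vocab))           # ['and', 'cat', 'dog', 'sat', 'the']
print(transform(docs, vocab))  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```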
Entity Recognition (Chunking/Chinking)
g="NP: {<DT>?<JJ>*<NN>}" | Regex chunk grammar
cp=nltk.RegexpParser(g) | Parse grammar
ch=cp.parse(pos_sent) | Parse tagged sent. using grammar
print(ch) | Show chunks
nltk.chunk.tree2conlltags(ch) | Show chunks as IOB tags
cp.evaluate(test_sents) | Evaluate against test doc
sents=nltk.corpus.treebank.tagged_sents() | Load POS-tagged sentences
print(nltk.ne_chunk(sents[0])) | Print chunk tree with named entities
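The chunk grammar `NP: {<DT>?<JJ>*<NN>}` is essentially a regex over POS tags. A toy version of that idea, using only the standard library (the tagged sentence is made up; `nltk.RegexpParser` returns a proper Tree and supports chinking too):

```python
import re

pos_sent = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Encode each POS tag as <TAG>, then match <DT>?<JJ>*<NN> as an
# ordinary regex over that string.
tag_str = "".join(f"<{tag}>" for _, tag in pos_sent)
pattern = r"(<DT>)?(<JJ>)*(<NN>)"

chunks = []
for m in re.finditer(pattern, tag_str):
    if not m.group(0):
        continue  # all parts are optional, so skip empty matches
    start = tag_str[: m.start()].count("<")   # token index of match start
    length = m.group(0).count("<")            # number of tokens matched
    chunks.append(pos_sent[start : start + length])

print(chunks)
# [[('the', 'DT'), ('little', 'JJ'), ('dog', 'NN')],
#  [('the', 'DT'), ('cat', 'NN')]]
```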
RegEx with Pandas & Named Groups
df=pd.DataFrame(time_sents, columns=['text']) | Create DataFrame from list of sentences
df['text'].str.split().str.len() | Number of tokens in each row
df['text'].str.contains('word') | Boolean: rows containing 'word'
df['text'].str.count(r'\d') | Count digits in each row
df['text'].str.findall(r'\d') | Find all digits in each row
df['text'].str.replace(r'\w+day\b', '???', regex=True) | Replace weekday names with '???'
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True) | Abbreviate weekday names to three letters
df['text'].str.extract(r'(\d?\d):(\d\d)') | Extract hour and minute into columns
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))') | Extract all times, with groups as columns
df['text'].str.extractall(r'(?P<digits>\d)') | Extract all digits into a named column
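What `.str.extractall()` with named groups does per row can be reproduced with the standard `re` module alone (the sample sentences are made up): each named group becomes a "column" of the result.

```python
import re

time_sents = ["Meet at 9:45 am", "Lunch 12:30 pm, coffee 3:15 pm"]

# Named groups: hour, minute, am/pm period.
pattern = re.compile(r"(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)")

rows = []
for i, text in enumerate(time_sents):
    for m in pattern.finditer(text):        # all matches, like extractall
        rows.append((i, m.group("hour"), m.group("minute"), m.group("period")))

print(rows)
# [(0, '9', '45', 'am'), (1, '12', '30', 'pm'), (1, '3', '15', 'pm')]
```

In pandas, the row index and match number form a MultiIndex and the group names become the column labels.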
Created By
https://tutify.com.au