
NLTK Language Processing Python Cheat Sheet

Cheat Sheet for Natural Language Processing using NLTK

Python Import

import nltk
nltk.download()
This step will bring up a window in which you can download ‘All Corpora’
from nltk.book import *
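Putting it together, a minimal first session might look like this (a sketch; it assumes the book collection has been downloaded):

import nltk
# nltk.download('book')    # one-off: fetch the corpora used by nltk.book
from nltk.book import *    # loads text1 through text9
print(text1)               # <Text: Moby Dick by Herman Melville 1851>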

BASICS

 

Tokens

text1[0:100]
- first 100 tokens
text2[5]
- sixth token (indexing starts at 0)

Concordance

> text3.concordance('begat')
- basic keyword-in-context
> text1.concordance('sea', lines=100)
- show other than the default 25 lines; pass a large number to see every match
> text1.concordance('sea', 10, lines=100)
- the second argument is width: the total character width of each context line (here 10)

common_contexts

text1.common_contexts(['sea','ocean'])
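A quick sketch tying these basics together (output will vary with corpus versions):

from nltk.book import *

print(len(text1))                        # number of tokens
print(text1[0:10])                       # first ten tokens
text1.concordance('sea', lines=5)        # five keyword-in-context lines
text1.common_contexts(['sea','ocean'])   # contexts shared by the two words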

COUNTING

Count a String
len('this is a string of text') – number of characters
Count a list of tokens
len(text1) – number of tokens
Make and Count a list of unique tokens
len(set(text1)) – notice that set() returns a set of unique tokens
Count Occurrences
text1.count('heaven') – how many times does a word occur?
Frequency
 
fd = nltk.FreqDist(text1) – creates a new data object that contains information about word frequency
 
fd['the'] – how many occurrences of the word 'the'
 
fd.keys() – show the keys in the data object
 
fd.values() – show the values in the data object
 
fd.items() – show everything
 
list(fd.keys())[0:50] – just show a portion of the info (in Python 3, wrap keys() in list() before slicing)
Frequency Plots
fd.plot(50, cumulative=False) – generate a chart of the 50 most frequent words
Other FreqDist functions
fd.hapaxes() – list the words that occur only once
 
fd.freq('the') – frequency of 'the' as a proportion of all tokens
Get word lengths
lengths = [len(w) for w in text1]
And do FreqDist
fd = nltk.FreqDist(lengths)
FreqDist as Table
fd.tabulate()
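A short worked example tying the counting functions together (a sketch; exact numbers depend on your corpus version):

import nltk
from nltk.book import *

fd = nltk.FreqDist(text1)
print(fd['whale'])             # raw count of 'whale'
print(fd.freq('whale'))        # count as a proportion of all tokens
print(fd.most_common(10))      # ten most frequent (token, count) pairs
fd.plot(50, cumulative=False)  # chart of the 50 most frequent words

# FreqDist accepts any sequence, e.g. word lengths:
lengths = nltk.FreqDist(len(w) for w in text1)
lengths.tabulate()             # print the distribution as a table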

PARTS OF SPEECH CODES

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
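NLTK can print the official definition and examples for any of these tags (a sketch; assumes the 'tagsets' package has been downloaded):

import nltk
# nltk.download('tagsets')      # one-off: tag documentation
nltk.help.upenn_tagset('NN')    # definition and examples for NN
nltk.help.upenn_tagset('VB.*')  # a regex works too: all verb tags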
 

NORMALIZING

De-punctuate
[w for w in text1 if w.isalpha()] – not so much getting rid of punctuation as keeping only alphabetic tokens
De-uppercaseify (?)
[w.lower() for w in text1] – make each word in the tokenized list lowercase
 
[w.lower() for w in text1 if w.isalpha()] – all in one go
Sort
sorted(text1) – careful with this! It returns the entire token list, sorted.
Unique Words
set(text1) – set is oddly named, but very powerful. Leaves you with a collection containing just one of each token.
Exclude Stopwords
Make your own list of words to be excluded:
 
stopwords = ['the','it','she','he']
 
mynewtext = [w for w in text1 if w not in stopwords]
 
Or you can also use predefined stopword lists from NLTK:
 
from nltk.corpus import stopwords
 
stopwords = stopwords.words('english')
 
mynewtext = [w for w in text1 if w not in stopwords]
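These steps combine naturally into a single pipeline. A sketch (assumes the stopwords corpus has been downloaded):

import nltk
from nltk.book import *
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))   # a set makes the lookup fast
normalized = [w.lower() for w in text1
              if w.isalpha() and w.lower() not in stops]
print(len(set(normalized)))               # unique normalized words

Converting the stopword list to a set is worthwhile on long texts, since membership tests against a list are much slower.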

SEARCHING

Dispersion Plot
text4.dispersion_plot(['American','Liberty','Government'])
Find words that end with...
[w for w in text4 if w.endswith('ness')]
Find words that start with...
[w for w in text4 if w.startswith('ness')]
Find words that contain...
[w for w in text4 if 'ee' in w]
Combine them together
[w for w in text4 if 'ee' in w and w.endswith('ing')]
Regular Expressions
'Regular expressions' is a syntax for describing sequences of characters, usually used to construct search queries. The Python 're' module must first be imported:
Import
 
>>> import re
>>> [w for w in text1 if re.search('^ab', w)]
– words that start with 'ab'. Regular expressions are too big a topic to cover here. Google it!
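A few more patterns in the same spirit (a sketch):

import re
from nltk.book import *

[w for w in text1 if re.search('^ab', w)]       # starts with 'ab'
[w for w in text1 if re.search('ing$', w)]      # ends with 'ing'
[w for w in text1 if re.search('^..e.e..$', w)] # 7 letters, 'e' 3rd and 5th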

CHUNKING

 
Collocations
text4.collocations() – multi-word expressions that commonly co-occur. Notice that this is not necessarily related to the frequency of the individual words.
 
text4.collocations(num=100) – alter the number of phrases returned
 
Bigrams, trigrams, and n-grams are useful for comparing texts, particularly for plagiarism detection and collation.
Bi-grams
nltk.bigrams(text4) – returns every pair of consecutive words (a generator in NLTK 3; wrap it in list() to see the results)
Tri-grams
nltk.trigrams(text4) – returns every sequence of three consecutive words
n-grams
nltk.ngrams(text4, 5) – the general form: every sequence of five consecutive words
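Because bigrams and trigrams come back as generators, a common pattern is to feed them straight into a FreqDist (a sketch):

import nltk
from nltk.book import *

bigram_fd = nltk.FreqDist(nltk.bigrams(text4))
print(bigram_fd.most_common(10))   # ten most frequent word pairs

fivegrams = list(nltk.ngrams(text4, 5))
print(fivegrams[:3])               # first three 5-grams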

TAGGING

Part-of-speech tagging
mytext = nltk.word_tokenize("This is my sentence")
 
nltk.pos_tag(mytext)
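The tagger returns a list of (token, tag) tuples using the codes above. A sketch (assumes the tokenizer and tagger models have been downloaded):

import nltk
# nltk.download('punkt')                       # tokenizer models
# nltk.download('averaged_perceptron_tagger')  # tagger models

mytext = nltk.word_tokenize("This is my sentence")
print(nltk.pos_tag(mytext))
# [('This', 'DT'), ('is', 'VBZ'), ('my', 'PRP$'), ('sentence', 'NN')]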

Working with your own texts:

Open a file for reading
file = open('myfile.txt') – make sure you are in the correct directory before starting Python
Read the file
t = file.read()
Tokenize the file
tokens = nltk.word_tokenize(t)
Convert to NLTK text object
text = nltk.Text(tokens)
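End to end, reading your own file might look like this (myfile.txt is a placeholder path):

import nltk

with open('myfile.txt') as f:    # 'with' closes the file automatically
    t = f.read()

tokens = nltk.word_tokenize(t)   # requires the 'punkt' tokenizer models
text = nltk.Text(tokens)
text.concordance('example')      # the Text object supports concordance etc.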

QUITTING PYTHON

Quit
quit()
                       
 
