
NLTK Language Processing Python Cheat Sheet

Cheat Sheet for Natural Language Processing using NLTK

Python Import

import nltk
nltk.download()
This step will bring up a window in which you can download ‘All Corpora’
from nltk.book import *
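To skip the download window entirely, a minimal non-interactive start might look like this (a sketch; 'book' is the NLTK collection id covering the data used by nltk.book):

import nltk
nltk.download('book')    # fetch just the corpora the examples below use
from nltk.book import *  # loads text1 ... text9 and sent1 ... sent9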

BASICS

Tokens

text1[0:100]
– first 100 tokens (a slice of length 100)
text2[5]
– sixth token (indexing starts at 0)

Concordance

text3.concordance('begat')
– basic keyword-in-context
text1.concordance('sea', lines=100)
– show other than the default 25 lines
text1.concordance('sea', width=10, lines=100)
– also change the left and right context width to 10 characters
(In Python 3, lines must be a number; there is no lines=all. To show every match, pass a lines value at least as large as the number of occurrences.)
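To genuinely show every match, one approach is to count the occurrences first and pass that count as lines (a sketch using the count() method covered below):

n = text1.count('sea')             # number of occurrences of 'sea'
text1.concordance('sea', lines=n)  # display all of them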

common_contexts

text1.common_contexts(['sea','ocean'])
– shows the contexts shared by the two words

COUNTING

Count a String
len('this is a string of text') – number of characters
Count a list of tokens
len(text1) – number of tokens
Make and Count a list of unique tokens
len(set(text1)) – note that set() returns a set of unique tokens, not a list
Count Occurrences
text1.count('heaven') – how many times does a word occur?
Frequency
fd = nltk.FreqDist(text1) – creates a new data object that contains information about word frequency
fd['the'] – how many occurrences of the word 'the'
fd.keys() – show the keys in the data object
fd.values() – show the values in the data object
fd.items() – show everything
list(fd.keys())[0:50] – just show a portion of the info (in Python 3, keys() must be wrapped in list() before slicing)
Frequency Plots
fd.plot(50, cumulative=False) – generate a chart of the 50 most frequent words
Other FreqDist functions
fd.hapaxes() – words that occur only once
fd.freq('the') – relative frequency of 'the'
Get word lengths
lengths = [len(w) for w in text1]
And do FreqDist
fd = nltk.FreqDist(lengths)
FreqDist as Table
fd.tabulate()
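Putting the FreqDist pieces together, a short sketch (assumes the book texts are loaded; fd.plot() needs matplotlib installed):

import nltk
from nltk.book import text1

fd = nltk.FreqDist(text1)
print(fd.most_common(10))     # ten most frequent tokens with their counts
print(fd['whale'])            # count for a single word
fd.plot(50, cumulative=True)  # cumulative version of the frequency plot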

PARTS OF SPEECH CODES

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
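You can look these codes up from inside Python; NLTK ships documentation for the tagset (a sketch; may require a one-time nltk.download('tagsets')):

import nltk
# nltk.download('tagsets')      # one-time download of the tag documentation
nltk.help.upenn_tagset('JJ')    # prints the definition and examples for JJ
nltk.help.upenn_tagset('NN.*')  # regular expression: all the noun tags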
 

NORMALIZING

De-punctuate
[w for w in text1 if w.isalpha()] – not so much getting rid of punctuation as keeping only alphabetic tokens
Lowercase
[w.lower() for w in text] – make each word in the tokenized list lowercase
[w.lower() for w in text if w.isalpha()] – all in one go
Sort
sorted(text1) – careful with this!
Unique Words
set(text1) – set is oddly named, but very powerful. Leaves you with a set containing exactly one of each token.
Exclude Stopwords
Make your own list of words to be excluded:
stopwords = ['the','it','she','he']
mynewtext = [w for w in text1 if w not in stopwords]
Or use one of the predefined stopword lists from NLTK:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
mynewtext = [w for w in text1 if w not in stopwords]
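A full normalization pass combining these steps might look like the following sketch (requires the stopwords corpus, fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))  # set membership tests are fast
normalized = [w.lower() for w in text1 if w.isalpha() and w.lower() not in stop]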

SEARCHING

Dispersion Plot
text4.dispersion_plot(['American','Liberty','Government'])
Find words that end with...
[w for w in text4 if w.endswith('ness')]
Find words that start with...
[w for w in text4 if w.startswith('ness')]
Find words that contain...
[w for w in text4 if 'ee' in w]
Combine them together
[w for w in text4 if 'ee' in w and w.endswith('ing')]
Regular Expressions
'Regular expressions' is a syntax for describing sequences of characters, usually used to construct search queries. The Python 're' module must first be imported:
Import
>>> import re
>>> [w for w in text1 if re.search('^ab', w)]
– 'Regular expressions' is too big of a topic to cover here. Google it!
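A few more illustrative patterns in the same style (standard Python regular expressions, nothing NLTK-specific):

import re

[w for w in set(text1) if re.search('^ab', w)]       # starts with 'ab'
[w for w in set(text1) if re.search('ness$', w)]     # ends with 'ness'
[w for w in set(text1) if re.search('^..ee..$', w)]  # six letters with 'ee' in the middle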

CHUNKING
Collocations
text4.collocations() – multi-word expressions that commonly co-occur. Note that this is not necessarily related to the frequency of the individual words.
text4.collocations(num=100) – alter the number of phrases returned
Bigrams, trigrams, and n-grams are useful for comparing texts, particularly for plagiarism detection and collation.
Bi-grams
nltk.bigrams(text4) – returns every pair of consecutive tokens
Tri-grams
nltk.trigrams(text4) – returns every triple of consecutive tokens
n-grams
nltk.ngrams(text4, 5)
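In NLTK 3 these functions return generators, so wrap them in list() to inspect them, or feed them straight into FreqDist (a sketch):

import nltk
from nltk.book import text4

bigrams = list(nltk.bigrams(text4))  # materialize the generator
print(bigrams[:5])                   # first five adjacent pairs
fd = nltk.FreqDist(nltk.ngrams(text4, 3))
print(fd.most_common(10))            # ten most frequent trigrams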

TAGGING

Part-of-speech tagging
mytext = nltk.word_tokenize("This is my sentence")
nltk.pos_tag(mytext)
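A sketch of a complete tagging session; word_tokenize and pos_tag depend on downloadable models ('punkt' and 'averaged_perceptron_tagger' in recent NLTK versions):

import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

mytext = nltk.word_tokenize("This is my sentence")
print(nltk.pos_tag(mytext))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('my', 'PRP$'), ('sentence', 'NN')]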

Working with your own texts:

Open a file for reading
file = open('myfile.txt') – make sure you are in the correct directory before starting Python
Read the file
t = file.read()
Tokenize the file
tokens = nltk.word_tokenize(t)
Convert to NLTK text object
text = nltk.Text(tokens)
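End to end, with the file closed properly (a sketch; 'myfile.txt' is a placeholder path):

import nltk

with open('myfile.txt', encoding='utf-8') as f:  # closes the file automatically
    t = f.read()
tokens = nltk.word_tokenize(t)
text = nltk.Text(tokens)
text.concordance('sea')  # the Text object supports everything shown above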

QUITTING PYTHON

Quit
quit()