
Reg Ex Cheat Sheet

Explore Regex With Python

Regex

* indicates that the preceding character can occur 0 or more times.
meo*w
matches mew, meow, meooow, and meoooooooooooow
? indicates that the preceding character can occur 0 or 1 times, making it optional.
humou?r
matches humour and humor
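A minimal sketch of the two quantifiers above using Python's standard re module. The word lists are made up for illustration; re.fullmatch is used so that a word counts only when the whole string fits the pattern.

```python
import re

# fullmatch() succeeds only if the entire string fits the pattern.
star_candidates = ["mew", "meow", "meooow", "mow"]
optional_candidates = ["humour", "humor", "humouur"]

star_hits = [w for w in star_candidates if re.fullmatch(r"meo*w", w)]
optional_hits = [w for w in optional_candidates if re.fullmatch(r"humou?r", w)]

print(star_hits)      # ['mew', 'meow', 'meooow']
print(optional_hits)  # ['humour', 'humor']
```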
. is a wildcard that matches any single character (letter, number, symbol, or whitespace) in a piece of text; by default it does not match a newline.
.........
matches any 9-character text
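As a rough illustration of the wildcard, the sketch below checks a few sample strings (chosen here only for the example) against nine dots. It also shows that, in Python, the dot skips newlines unless the re.DOTALL flag is passed.

```python
import re

nine_dots = re.compile(r".........")  # equivalent to r".{9}"

nine_chars = bool(nine_dots.fullmatch("regex 101"))      # 9 characters, space included
five_chars = bool(nine_dots.fullmatch("regex"))          # too short
with_newline = bool(nine_dots.fullmatch("regex\n101"))   # newline is not matched by default
with_dotall = bool(re.fullmatch(r".........", "regex\n101", flags=re.DOTALL))

print(nine_chars, five_chars, with_newline, with_dotall)  # True False False True
```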
[] is a character set that matches any one of the characters included within the brackets
con[sc]en[sc]us
consensus, concensus, consencus, and concencus
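The character-set pattern can be tried on running text with re.findall, which returns every non-overlapping match. The sentence below is invented for the example.

```python
import re

text = "We reached consensus, though some wrote concensus or even consencus."

# Each [sc] accepts exactly one character: either s or c.
spellings = re.findall(r"con[sc]en[sc]us", text)

print(spellings)  # ['consensus', 'concensus', 'consencus']
```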
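{n} is a fixed quantifier: the preceding character must occur exactly n times
roa{3}r
matches roaaar
{n,m} is a range quantifier: the preceding character must occur between n and m times, inclusive
roa{3,6}r
matches roaaar, roaaaar, roaaaaar, or roaaaaaar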
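To see where the fixed and range quantifiers start and stop matching, this sketch builds roars with 1 to 7 letters a and records which counts each pattern accepts. The helper name roar is hypothetical.

```python
import re


def roar(count):
    """Build 'ro' + count copies of 'a' + 'r' (illustrative helper)."""
    return "ro" + "a" * count + "r"


exact = [n for n in range(1, 8) if re.fullmatch(r"roa{3}r", roar(n))]
ranged = [n for n in range(1, 8) if re.fullmatch(r"roa{3,6}r", roar(n))]

print(exact)   # [3]
print(ranged)  # [3, 4, 5, 6]
```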
| (alternation) allows for the matching of either of two subexpressions.
baboons|gorillas
will match the text baboons as well as the text gorillas.
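A short sketch of alternation on an invented sentence. Because | applies to everything on either side of it, parentheses are needed when only part of the pattern should vary; the second pattern uses a non-capturing group for that.

```python
import re

text = "I saw baboons and gorillas but no lemurs."

# Whole-pattern alternation: either word in full.
either = re.findall(r"baboons|gorillas", text)

# Grouped alternation: the choice covers only the stem, the trailing s is shared.
grouped = re.findall(r"(?:baboon|gorilla)s", text)

print(either)   # ['baboons', 'gorillas']
print(grouped)  # ['baboons', 'gorillas']
```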
Anchors (hat ^ and dollar sign $) are used in regular expressions to match text at the start and end of a string, respectively.
^Monkeys: my mortal enemy$
will completely match the text Monkeys: my mortal enemy but not match Spider Monkeys: my mortal enemy or Monkeys: my mortal enemy in the wild
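The three strings from the anchor example can be checked directly. This is a small sketch with re.search; the unanchored pattern is added only for contrast.

```python
import re

anchored = re.compile(r"^Monkeys: my mortal enemy$")
unanchored = re.compile(r"Monkeys: my mortal enemy")

candidates = [
    "Monkeys: my mortal enemy",
    "Spider Monkeys: my mortal enemy",
    "Monkeys: my mortal enemy in the wild",
]

anchored_results = [bool(anchored.search(s)) for s in candidates]
unanchored_results = [bool(unanchored.search(s)) for s in candidates]

print(anchored_results)    # [True, False, False]
print(unanchored_results)  # [True, True, True]
```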
[letter-letter] or [n-n]
a range of characters that can be matched
[A-Z] : match any uppercase letter
[a-z] : match any lowercase letter
[0-9] : match any digit
[A-Za-z] : match any uppercase or lowercase letter
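A quick sketch of ranges applied to a made-up string. The last pattern adds the + quantifier so that consecutive letters are returned as one run rather than one character at a time.

```python
import re

text = "Room B12, floor 3"

uppercase = re.findall(r"[A-Z]", text)
digits = re.findall(r"[0-9]", text)
letter_runs = re.findall(r"[A-Za-z]+", text)

print(uppercase)    # ['R', 'B']
print(digits)       # ['1', '2', '3']
print(letter_runs)  # ['Room', 'B', 'floor']
```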
Shorthand character classes simplify writing regular expressions
\w represents the regex range [A-Za-z0-9_], \d represents [0-9],
\W represents [^A-Za-z0-9_], matching any character not included by \w, \D represents [^0-9], matching any character not included by \d
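The shorthand classes and their negations can be compared side by side on one sample string (invented for this sketch).

```python
import re

text = "user_42 paid $3.50!"

words = re.findall(r"\w+", text)           # runs of letters, digits, underscore
numbers = re.findall(r"\d+", text)         # runs of digits
non_word = re.findall(r"\W", text)         # single characters outside \w
non_digit_runs = re.findall(r"\D+", text)  # runs of characters outside \d

print(words)           # ['user_42', 'paid', '3', '50']
print(numbers)         # ['42', '3', '50']
print(non_word)        # [' ', ' ', '$', '.', '!']
print(non_digit_runs)  # ['user_', ' paid $', '.', '!']
```

In Python 3, \w and \d applied to str patterns also accept non-ASCII letters and digits; passing flags=re.ASCII restricts them to the ASCII ranges listed above.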
Negated character set
[^cdh]are
will match mare (the negated set matches the m) but not care, dare, or hare.
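A small check of the negated set against a hand-picked word list. Any first character other than c, d, or h is accepted, so bare passes as well as mare.

```python
import re

words = ["mare", "care", "dare", "hare", "bare"]

# [^cdh] matches exactly one character that is NOT c, d, or h.
accepted = [w for w in words if re.fullmatch(r"[^cdh]are", w)]

print(accepted)  # ['mare', 'bare']
```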
+ indicates that the preceding character can occur 1 or more times
meo+w
will match meow, meooow, and meoooooooooooow, but not match mew
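The only difference between + and * is whether zero occurrences are allowed. The sketch below runs both over the same invented text to make that visible.

```python
import re

text = "mew meow meooow"

one_or_more = re.findall(r"meo+w", text)   # at least one o required
zero_or_more = re.findall(r"meo*w", text)  # o may be absent

print(one_or_more)   # ['meow', 'meooow']
print(zero_or_more)  # ['mew', 'meow', 'meooow']
```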
 

Text Preprocessing

Noise removal
import re
result = re.sub(r'[.?!,:;"]', '', text)
Removes punctuation
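A runnable version of the noise-removal step, assuming a sample sentence in place of the undefined text variable. Inside a character set these punctuation marks need no backslashes.

```python
import re

# Sample input, invented for the example.
text = 'Wait... "really"? Yes: five, six; seven!'

# Delete every character listed in the set; everything else is kept.
result = re.sub(r'[.?!,:;"]', '', text)

print(result)  # Wait really Yes five six seven
```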
Tokenization is the text preprocessing task of breaking up text into smaller components of text (known as tokens).
from nltk.tokenize import word_tokenize
text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# ["This", "is", "a", "text", "to", "tokenize"]
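For a feel of what a tokenizer does without installing NLTK or its data, here is a simplified regex-based stand-in. It is not the algorithm behind word_tokenize, and the function name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A rough stand-in for illustration only; real tokenizers handle
    contractions, abbreviations, and many other cases.
    """
    # \w+ grabs a run of word characters; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("This is a text to tokenize.")
print(tokens)  # ['This', 'is', 'a', 'text', 'to', 'tokenize', '.']
```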
In natural language processing, normalization encompasses many text preprocessing tasks, including stemming, lemmatization, upper or lowercasing, and stopword removal.
Stemming: in natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes).
from nltk.stem import PorterStemmer
tokenized = ["So", "many", "squids", "are", "jumping"]
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
# ['so', 'mani', 'squid', 'are', 'jump'] (recent NLTK versions also lowercase each token by default)
Lemmatization: in natural language processing, lemmatization is the text preprocessing normalization task concerned with bringing words down to their root forms.
from nltk.stem import WordNetLemmatizer
tokenized = ["So", "many", "squids", "are", "jumping"]
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
# ['So', 'many', 'squid', 'are', 'jumping'] with the default noun part of speech
# passing pos='v' for the verbs, e.g. lemmatizer.lemmatize('are', pos='v'), gives 'be' and 'jump'
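The difference between the two normalization styles can be sketched without NLTK: stemming as blunt suffix stripping, lemmatization as a lookup of known root forms. Both helpers and the tiny tables below are hypothetical toys, not the Porter algorithm or WordNet.

```python
# Suffixes are tried in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ies", "s")

# A hand-written lookup table standing in for a real lexical database.
LEMMA_TABLE = {
    "are": "be",
    "jumping": "jump",
    "squids": "squid",
    "ponies": "pony",
}


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the known root form, or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get(word.lower(), word)


tokens = ["squids", "are", "jumping", "ponies"]
stems = [crude_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(stems)   # ['squid', 'are', 'jump', 'pon']
print(lemmas)  # ['squid', 'be', 'jump', 'pony']
```

The stem pon is not a word, which is typical of stemming; the lookup returns real words but only for entries it knows about.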
Stopword removal is the process of removing words from a string that don't provide any information about the tone of a statement.
from nltk.corpus import stopwords
# define set of English stopwords
stop_words = set(stopwords.words('english'))
# remove stopwords from tokens in dataset
statement_no_stop = [word for word in word_tokens if word not in stop_words]
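A self-contained sketch of the same filtering idea, assuming a tiny hand-written stopword set in place of the NLTK corpus. Since NLTK's English stopword list is lowercase, the comparison below lowercases each token so that capitalized words such as The are also removed.

```python
# Illustrative stopword set; the real NLTK list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stop_words]


word_tokens = ["The", "squids", "are", "jumping", "in", "the", "sea"]
statement_no_stop = remove_stopwords(word_tokens)

print(statement_no_stop)  # ['squids', 'jumping', 'sea']
```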
Parser: nltk.chunk.RegexpParser
Uses a set of regular expression patterns to specify the behavior of the parser
{<DT|JJ>} # chunk determiners and adjectives
Token = Smaller component of text
Stemming = Remove prefixes and suffixes
Lemmatization = Bring words down to their root forms
Stopword removal = Remove words that carry little information
 

Lists and Strings

z = 'Natural Language Processing'
z.replace(' ', '\n')
'Natural\nLanguage\nProcessing'
 
list(z)
Split text into character tokens
 
set(z)
Unique character tokens (a set is unordered)
x = ['Natural', 'Language', 'Toolkit']
x.insert(0, 'Python')
x is now ['Python', 'Natural', 'Language', 'Toolkit']
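The string and list operations in this section can be run together as below. One detail worth seeing in code: list.insert changes the list in place and returns None, and an alphabetical order only appears if the list is sorted afterwards (sorted is added here for contrast).

```python
z = 'Natural Language Processing'

replaced = z.replace(' ', '\n')  # every space becomes a newline
chars = list(z)                  # one token per character, order kept
unique_chars = set(z)            # duplicates dropped, order not kept

x = ['Natural', 'Language', 'Toolkit']
returned = x.insert(0, 'Python')  # modifies x in place, returns None
alphabetical = sorted(x)          # new list, x itself is unchanged

print(repr(replaced))     # 'Natural\nLanguage\nProcessing'
print(chars[:7])          # ['N', 'a', 't', 'u', 'r', 'a', 'l']
print(len(unique_chars))  # 16
print(returned)           # None
print(x)                  # ['Python', 'Natural', 'Language', 'Toolkit']
print(alphabetical)       # ['Language', 'Natural', 'Python', 'Toolkit']
```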
               
 

          More Cheat Sheets by datamansam