spaCy Cheat Sheet (DRAFT)

Cheat sheet for spaCy

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Init

from spacy.lang.en import English
nlp = English()

Basic

doc = nlp("SOME TEXTS")
span = doc[i:j]
token = doc[i]
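
A minimal sketch of the three objects (example text is illustrative):

doc = nlp("spaCy is written in Python")
token = doc[0]        # Token: "spaCy"
span = doc[3:5]       # Span: "in Python"
print(token.text, span.text)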

Pre-trained Model

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(MY_TEXT)

Named entities

doc.ents
.text
.label_
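
Example with a pre-trained model (the exact entities depend on the model):

doc = nlp("Apple was founded in California")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, California GPE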

spacy.tokens

Doc
Doc(nlp.vocab, words=words, spaces=spaces)
Span
Span(doc, i, j, label="PERSON")
i, j: start and end token indices (end is exclusive)
words: a list of word strings
spaces: a list of booleans (does each word have a trailing space?)
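
Sketch of constructing a Doc and Span manually (assumes an existing nlp object for its vocab):

from spacy.tokens import Doc, Span

words = ["Hello", "Mary", "!"]
spaces = [True, False, False]            # is each word followed by a space?
doc = Doc(nlp.vocab, words=words, spaces=spaces)
span = Span(doc, 1, 2, label="PERSON")   # covers token 1 up to (not including) 2
doc.ents = [span]                        # optionally register it as an entity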

Matcher

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matches = matcher(doc)
[(match_id, start, end), ...]

Add pattern to matcher

pattern = [ {KEY: VALUE}, ... ]   # one dict per token
matcher.add("PATTERN_NAME", None, pattern)
Keys are token attributes, e.g.:
1. text-based: "TEXT", "LOWER"
2. linguistic labels: "POS", "TAG", "ENT_TYPE"
The special key "OP" adds regex-like operators ("!", "?", "+", "*").
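
Sketch using the v2-style add() shown above; the pattern and name are illustrative:

pattern = [{"LOWER": "golden"}, {"LOWER": "retriever"}]
matcher.add("DOG_BREED", None, pattern)

doc = nlp("I saw a Golden Retriever")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # "Golden Retriever"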

Phrase matching

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)

for match_id, start, end in matcher(doc):
    span = doc[start:end]
 

Similarity

Word vector
token.vector
Doc
doc1.similarity(doc2)
Span
span1.similarity(span2)
Token
token1.similarity(token2)
Doc by Token
doc.similarity(token)
Returns a similarity score between 0 and 1.
Not reliable with small (sm) models, which ship without word vectors.
Uses cosine similarity by default.
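
Sketch, assuming a medium model with word vectors (en_core_web_md) is installed:

import spacy

nlp = spacy.load("en_core_web_md")   # sm models have no word vectors
doc1 = nlp("I like pizza")
doc2 = nlp("I love pasta")
print(doc1.similarity(doc2))         # score between 0 and 1
print(doc1[2].similarity(doc2[2]))   # token vs. token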

Pipeline

nlp.pipe_names

nlp.pipeline

Add pipeline component

def fn(doc):
    # function body
    return doc

nlp.add_pipe(fn, first=True)   # keyword args: first, last, before, after
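
A concrete sketch (component name and behaviour are illustrative; v2 API, where the function itself is passed in):

def length_component(doc):
    # runs on every doc that passes through the pipeline
    print("Doc length:", len(doc))
    return doc

nlp.add_pipe(length_component, last=True)
doc = nlp("This is a sentence.")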

Set custom attributes

register globally (once)
Doc.set_extension("ATTR", default=None)
add metadata
doc._.ATTR = "VALUE"
extensions can be registered on Doc, Token and Span
access via the
._
property
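
Sketch (the attribute name "title" is illustrative):

from spacy.tokens import Doc

Doc.set_extension("title", default=None)   # register once, globally
doc = nlp("Some text")
doc._.title = "My document"                # attach metadata
print(doc._.title)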

Extension attribute types

attribute
Token.set_extension("ATTR", default=False)
property
Span.set_extension("PROP", getter=fn)
method
Doc.set_extension("METHOD", method=fn)
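
Sketch of all three kinds (names and logic are illustrative):

from spacy.tokens import Doc, Span, Token

Token.set_extension("is_color", default=False)   # attribute: plain value
Span.set_extension("has_color",                  # property: computed on access
                   getter=lambda span: any(t._.is_color for t in span))
Doc.set_extension("count_token",                 # method: takes extra arguments
                  method=lambda doc, text: sum(t.text == text for t in doc))

doc = nlp("the sky is blue")
doc[3]._.is_color = True
print(doc[0:4]._.has_color)        # True
print(doc._.count_token("blue"))   # 1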
 

Processing texts efficiently

nlp.pipe(DATA)
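
nlp.pipe streams texts in batches, which is much faster than calling nlp on each text in a loop; sketch:

texts = ["First text", "Second text", "Third text"]

# slow: one call per text
docs = [nlp(text) for text in texts]

# fast: batched stream
docs = list(nlp.pipe(texts))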

Passing in context

data = [ ("SOME TEXTS", {"KEY": "VAL"}),  (...), ]

# Method 1: keep context separate from the doc
for doc, ctx in nlp.pipe(data, as_tuples=True):
    print(doc.text, ctx["KEY"])

# Method 2: attach context as a custom attribute
Doc.set_extension("KEY", default=None)
for doc, ctx in nlp.pipe(data, as_tuples=True):
    doc._.KEY = ctx["KEY"]

Using tokenizer only

# Method 1: tokenize only with make_doc
doc = nlp.make_doc("SOME TEXTS")

# Method 2: temporarily disable specific components
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp(text)