
Data Mining Cheat Sheet (DRAFT)

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Naive Bayes and LogReg

P(A|C) = P(C|A)P(A) / P(C)
predicts T/F, "S"-shaped curve, output ranges from 0 to 1
posterior = (likelihood x prior) / normalizing constant
log(odds) = log(p/(1-p))
pros: easy/fast, handles categorical data, needs only the independence assumption
z = estimated intercept / std error
cons: if a value is not in the training set -> 0% probability (use Laplace estimation, add 1); poor probability estimator; independent-predictor assumption is often unrealistic
y = log(F)B1 + log(T/F)B2
LR: p = e^log(odds) / (1 + e^log(odds))
likelihood = product of p over the T cases x product of (1-p) over the F cases
log(L) = sum over T cases of log(p_i) + sum over F cases of log(1-p_i)
R² = (SS(mean) - SS(fit)) / SS(mean)
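A minimal sketch of the LogReg pieces above (log-odds, the sigmoid p = e^log(odds)/(1+e^log(odds)), and the log-likelihood over T and F cases); the toy data and the coefficients w0, w1 are illustrative assumptions, not fitted values.

```python
import numpy as np

# Toy binary data (assumed for illustration): x = feature, y = 1 (T) / 0 (F)
x = np.array([0.5, 1.5, 2.0, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   1,   1])

def sigmoid(z):
    # p = e^log(odds) / (1 + e^log(odds)) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# Assumed coefficients w0 (intercept) and w1 (slope)
w0, w1 = -5.0, 2.0
log_odds = w0 + w1 * x          # log(p / (1 - p))
p = sigmoid(log_odds)           # predicted P(T) for each point

# likelihood = product of p over T cases x product of (1 - p) over F cases
log_likelihood = np.sum(np.log(p[y == 1])) + np.sum(np.log(1 - p[y == 0]))
print("log-likelihood:", log_likelihood)
```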

ANNs

neuron = a unit that holds a number from 0 to 1
boolean: T=1, F=0
ŷ = 1 if w₁x₁ + w₂x₂ + ... + wₙxₙ - t (bias factor) > 0, -1 if < 0
ŷ = sign(w₁x₁ + w₂x₂ + ... + wₙxₙ - t) = sign(w·x)
λ=learning rate
xij=val of jth attribute of training example xi
for weight update: wⱼ(k+1) = weight param associated w/ the jth input link after the kth iteration
wⱼ(k+1) = wⱼ(k) + λ(yᵢ - ŷᵢ(k))xᵢⱼ
error = y - ŷ
if error = +2, inc w of +ve inputs
if error = -2, inc w of -ve inputs
Error E = ΣEk | k∊outputs
Ek = 1/2 (tk - ok)²
output oᵢ = 1/(1 + e^(-netᵢ))
netᵢ = Σⱼ wᵢⱼ·oⱼ
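A rough perceptron-update sketch following the rule wⱼ(k+1) = wⱼ(k) + λ(y - ŷ)xⱼ above; the toy points, learning rate, and epoch count are assumptions.

```python
import numpy as np

# Toy linearly separable data (assumed); labels are +1 / -1 as in the sign() rule
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

lam = 0.1                        # λ = learning rate
w = np.zeros(X.shape[1])         # weights w1..wn
t = 0.0                          # bias factor

for epoch in range(10):
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) - t > 0 else -1   # ŷ = sign(w·x - t)
        error = yi - y_hat                           # error = y - ŷ (0, +2, or -2)
        # w_j(k+1) = w_j(k) + λ (y_i - ŷ_i) x_ij
        w = w + lam * error * xi
        t = t - lam * error                          # bias moves opposite to the weights

print("weights:", w, "bias:", t)
```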

Inductive Bias, No Free Lunch

IB: anything influencing hypothesis choice other than the training set
two parts: which hypotheses the language can express, and the method of choosing among them
NFL: for any 2 algorithms A & B, there exists a dataset for which A outperforms B
assuming uniform P(x,y), the # of datasets for which A>B = the # for which B>A

SVM

frontier that best segregates 2 classes by margins
polynomial kernel: k(x,y) = (x·y + 1)^d
RBF kernel: k(x,y) = e^(-γ||x-y||²)
tune by k-fold cross-validation (e.g. k=5)
adv: works in high-dimensional spaces, even when # of dimensions > # of samples
diff kernel functions for diff decisions
k1+k2 = even more complex
dis: if # of features > # of samples, risk of overfitting; cross-validation is expensive
min(||w||²) for the linear (hard-margin) case
ξᵢ: how far ptᵢ is from the correct side of the margin
w·xᵢ + b >= 1 - ξᵢ if yᵢ = 1
w·xᵢ + b <= -1 + ξᵢ if yᵢ = -1
soft margin: min(||w||² + C Σᵢ ξᵢ)
dual: max(Σᵢ λᵢ - 1/2 Σᵢ Σⱼ λᵢλⱼ yᵢyⱼ (xᵢ·xⱼ))
dist btw the parallel planes = 2/||w||
||w|| = sqrt(w₁² + w₂² + ...)
generalization error <= ρ̄(1 - s²)/s²
ρ̄ = avg correlation, s = strength
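A hedged example of an RBF-kernel SVM tuned and evaluated by 5-fold cross-validation with scikit-learn; the synthetic dataset and the γ, C values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic 2-class data (assumed for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# RBF kernel: k(x, y) = exp(-γ ||x - y||²)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0)

# Tune / evaluate by k-fold cross-validation (k = 5)
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```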
 

Errors

P(+) = 1/(1 + e^(-(w₀ + w₁x₁ + ...)))
error = misclassification
For new cases, predict:
1 if (w₀ + w₁x₁ + ...) >= 0 (i.e. P(+) >= 0.5), 0 otherwise
if the weight on x is positive, P(+) increases as x increases
error = (FP + FN) / All
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
+ve predicted value = TP/(TP+FP)
-ve predicted value = TN/(TN+FN)
true error: error on the true underlying distribution (unmeasurable)
apparent error: error on the examples used to train the model (underestimates the true error)
generalization: ability to predict unseen cases
Occam's Razor: entities should not be multiplied beyond necessity
overfitting: memorizing the training set
test error: error on examples held out of training
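A tiny sketch of the error/sensitivity/specificity definitions above; the TP/FP/TN/FN counts are made up.

```python
# Assumed confusion-matrix counts for illustration
TP, FP, TN, FN = 40, 10, 45, 5

error       = (FP + FN) / (TP + FP + TN + FN)   # misclassification rate
sensitivity = TP / (TP + FN)                    # true positive rate
specificity = TN / (TN + FP)                    # true negative rate
ppv         = TP / (TP + FP)                    # positive predicted value
npv         = TN / (TN + FN)                    # negative predicted value

print(error, sensitivity, specificity, ppv, npv)
```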

KNN

select k ≈ sqrt(n); if that is even, choose an odd k
Rᵢ = {x : d(x, xᵢ) < d(x, xⱼ), for all j != i}
euclidean distance = sqrt((x-x₁)² + (y-y₁)²)
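A minimal KNN sketch using Euclidean distance and k ≈ sqrt(n) rounded to an odd value; the training points and labels are assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance = sqrt of the sum of squared coordinate differences
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                           # indices of the k nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]     # majority vote

# Toy data (assumed); n = 9, so k ≈ sqrt(9) = 3 (odd, so keep it)
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6],
                    [1, 6], [2, 7], [1, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

print(knn_predict(X_train, y_train, np.array([6.5, 6.5]), k=3))
```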

Model Eval

Holdout
train on 2/3, test on 1/3; part of the data serves as a validation set (high variance on the estimate)
Leave-one-out
train on N-1, test on 1 (good estimate)
K-folds Cross Val
divide the set into k parts, hold out each part in turn (as in LOO), repeat N times, compute the mean and std dev for each
Bootstrapping
randomly draw N points with replacement to form S₁; train on S₁, test on S - S₁
Compare 2 methods: H0: meanLR = meanNB, H1: meanLR < meanNB
t = (meanNB - meanLR)/S (S: pooled variance); reject H0 if t > t_α
OR H1: meanLR != meanNB
2-tailed t: use t_(α/2); t = (mean difference x sqrt(n)) / S
Better: stratify each fold to contain same % of positives and negatives
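A hedged sketch of the two-method comparison, written here as a paired t-test on per-fold accuracies (a common variant of the pooled-variance formula above); the fold accuracies, k=5, and α=0.05 are made up.

```python
import numpy as np
from scipy import stats

# Assumed per-fold accuracies from the same 5-fold split, for illustration
acc_LR = np.array([0.80, 0.82, 0.78, 0.81, 0.79])
acc_NB = np.array([0.84, 0.85, 0.80, 0.83, 0.82])

# Paired differences per fold
d = acc_NB - acc_LR
n = len(d)

# t = mean difference / (std error of the differences) = mean(d) * sqrt(n) / S
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))

alpha = 0.05
t_crit_one = stats.t.ppf(1 - alpha, df=n - 1)        # H1: meanNB > meanLR
t_crit_two = stats.t.ppf(1 - alpha / 2, df=n - 1)    # H1: meanNB != meanLR (2-tailed)

print("t =", t_stat)
print("reject H0 (one-tailed):", t_stat > t_crit_one)
print("reject H0 (two-tailed):", abs(t_stat) > t_crit_two)
```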
 

Decision Trees

asks a question: classifies based on T/F
root, internal (arrows to and from), external (arrows to only) (leaves)
break into categories
T/F and Y/N for each
P(Y|T), P(N|T), P(Y|F), P(N|F)
GI₂ = 1 - [(Y|F / (Y|F + N|F))² + (N|F / (Y|F + N|F))²]
GI₁ = 1 - [(Y|T / (Y|T + N|T))² + (N|T / (Y|T + N|T))²]
GIall = (T/(T+F) x GI₁) + (F/(T+F) x GI₂)
entropy: H(S) = -P(y)log₂(P(y)) - P(n)log₂(P(n))
find H(Strue) and H(Sfalse); info gain = H(S) - w₁H(Strue) - w₂H(Sfalse)
w1 = T instances/all
w2 = F instances/all
choose the split with the largest info gain / smallest GI
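A small sketch of the Gini / entropy / info-gain arithmetic above; the [Yes, No] counts on each branch are hypothetical.

```python
import numpy as np

def gini(counts):
    # GI = 1 - sum of squared class proportions
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # H(S) = -sum p * log2(p)  (0 * log 0 treated as 0)
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed split counts: [Yes, No] on the True branch and on the False branch
true_branch, false_branch = [8, 2], [3, 7]
n_T, n_F = sum(true_branch), sum(false_branch)
w1, w2 = n_T / (n_T + n_F), n_F / (n_T + n_F)

gi_all = w1 * gini(true_branch) + w2 * gini(false_branch)
parent = [true_branch[0] + false_branch[0], true_branch[1] + false_branch[1]]
info_gain = entropy(parent) - w1 * entropy(true_branch) - w2 * entropy(false_branch)

print("weighted Gini:", gi_all, "info gain:", info_gain)
```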

ROC and Lift Curves

ROC: sensitivity vs. (1 - specificity); the higher the curve, the better;
a flatter line is worse
sens: TP rate,
1-spec: FP rate
Lift curves: find the % of the total responses that fall in each bucket
find the % of the +ve responses (out of all +ves) that fall in each bucket
y = +ve % / % of total
x = % of total
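A minimal sketch of the ROC quantities above (TP rate vs. FP rate across thresholds) using scikit-learn's roc_curve; the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Assumed true labels and predicted scores for illustration
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.2, 0.7, 0.6, 0.4, 0.9, 0.25])

# ROC: x = FP rate (1 - specificity), y = TP rate (sensitivity)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```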

k-means clustering

user chooses k, initializes k centers; loop: assign each pt to the nearest center, move each centroid to the mean of its assigned pts
place initial centers in dense regions or at random; minimizes the total squared distance
returns a local solution
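A bare-bones k-means loop matching the description above (assign points to the nearest center, then move each centroid); the blob data, k=2, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D points (assumed): two loose blobs
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]   # initialize k centers at random points

for _ in range(20):
    # assign each point to its nearest center (squared Euclidean distance)
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
    # move each center to the centroid of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("final centers:\n", centers)
```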

Ensembles

Bagging: bootstrap aggregating
Boosting: change the weights on pts and build a series of classifiers; start with w=1
incorrectly classified pts are reweighted by a number inversely proportional to the training error
w inc if misclassified, dec otherwise
classifiers are combined by weighting by their accuracy on the training set
Arcing (Adaptive resample & combine):
like boosting, but w is changed by an update formula
e.g. Arc-x4: w(x) = 1 + e(x)⁴
e(x) = # of times x has been misclassified so far
depends on: strength (performance of the individual classifiers), diversity (uncorrelated errors)
bagging reduces error mainly by reducing variance
boosting can reduce bias & var | bagging is > the base classifier
boosting can do better, but may overfit noisy data
Random forests: for each tree, choose a bootstrap sample of pts; for each node, pick a random feature subset, split on the one with the best IG; recurse until done
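A hedged random-forest example with scikit-learn (bootstrap sample per tree, random feature subset per split); the synthetic data and hyperparameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed for illustration)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Each tree: bootstrap sample of points; each node: random subset of features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```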
 

feature selection

removing irrelevant info for a better, faster model
drop missing values or encode them
drop: if all values are the same
if two features are highly correlated, drop one of them
drop features with low correlation with the target |
in trees, drop features with the least info gain
forward, backward, stepwise selection: find the best model with a single feature f1, then keep going until the validation error stops dropping (see the sketch at the end of this section)
beam or heuristic search
for computational efficiency and interpretability |
genetic algorithms
1) filters: all of the above + other correlation measures
2) wrappers: build a classifier with a subset + eval on validation data; but there are 2^d possible subsets
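A sketch of forward selection with a validation set, as referenced above: greedily add the feature that most improves validation accuracy and stop when it no longer helps. The data and the LogisticRegression base model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a few informative features (assumed)
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

selected, best_acc = [], 0.0
remaining = list(range(X.shape[1]))

while remaining:
    # try adding each remaining feature; keep the one that helps validation accuracy most
    scores = []
    for f in remaining:
        cols = selected + [f]
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
        scores.append(clf.score(X_val[:, cols], y_val))
    best_f = remaining[int(np.argmax(scores))]
    if max(scores) <= best_acc:          # stop when validation error stops dropping
        break
    best_acc = max(scores)
    selected.append(best_f)
    remaining.remove(best_f)

print("selected features:", selected, "validation accuracy:", best_acc)
```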

Bias and Var

PCA: dimensionality reduction
linear combos of the OG features
maximize variance: keep the smallest # of components until 90% of the variance is explained
μ=E(y|x)=T(uk)
ŷ=f(x,Ө)
error: MSE = E[(ŷ - μ)²]
var: E[(ŷ - E(ŷ))²]
bias²: (E(ŷ) - μ)²; MSE = bias² + var + noise
ŷ = best estimate of y given x and fixed params Ө
KNN, ANN, DT: low bias, high var
var: how much my estimate varies across datasets | bias: systematic prediction error, inability to fit
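A short PCA sketch keeping the smallest number of components that explain 90% of the variance (scikit-learn's PCA accepts a fractional n_components for this); the correlated toy data is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated features (assumed): random data mixed through a random linear map
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

# Keep the smallest number of components explaining >= 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```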

EM Expectation Maximization clust.

hard clustering: each pt belongs to only one cluster
soft clustering: a pt can belong to more than one cluster by %
EM: automatically discover all params for the k "sources" → but we may not know which source each pt came from
if we know μ, σ, we can find the likelihood of each pt
mixture models: probabilistic way of soft clustering
each cluster is Gaussian or multinomial
Gaussian likelihood: P(xᵢ|b) = (1/sqrt(2πσᵦ²)) · exp(-(xᵢ - μᵦ)²/(2σᵦ²))
aᵢ = 1 - bᵢ = P(a|xᵢ)
Bayesian posterior: bᵢ = P(b|xᵢ) = P(xᵢ|b)P(b) / (P(xᵢ|b)P(b) + P(xᵢ|a)P(a))
σᵦ² = (b₁(x₁ - μᵦ)² + ...) / (b₁ + b₂ + ...)
μᵦ = (b₁x₁ + b₂x₂ + ...) / (b₁ + b₂ + ...)
EM: place the params randomly; for each pt compute P(b|xᵢ): does it look like it came from b?
work to adjust (μₐ, σₐ²) and (μᵦ, σᵦ²) to fit the pts assigned to each
Iterate until convergence
P(a) = 1- P(b)
Could also estimate the priors: P(b) = (b₁ + b₂ + ...)/n
“What proportion of the data is each distri­bution descri­bing”
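A compact 1-D, two-Gaussian EM sketch implementing the posterior and update formulas above; the sample data and initial guesses are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two sources (assumed for illustration)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

def gauss(x, mu, var):
    # 1/sqrt(2πσ²) · exp(-(x - μ)² / (2σ²))
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# initial guesses for the two sources a and b
mu_a, mu_b, var_a, var_b, p_b = -1.0, 1.0, 1.0, 1.0, 0.5

for _ in range(50):
    # E step: Bayesian posterior b_i = P(b | x_i), a_i = 1 - b_i
    num_b = gauss(x, mu_b, var_b) * p_b
    num_a = gauss(x, mu_a, var_a) * (1 - p_b)
    b = num_b / (num_a + num_b)
    a = 1 - b
    # M step: weighted means / variances and prior P(b) = (b1 + b2 + ...)/n
    mu_b, mu_a = np.sum(b * x) / b.sum(), np.sum(a * x) / a.sum()
    var_b = np.sum(b * (x - mu_b) ** 2) / b.sum()
    var_a = np.sum(a * (x - mu_a) ** 2) / a.sum()
    p_b = b.mean()

print("mu_a, mu_b:", mu_a, mu_b, " P(b):", p_b)
```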