Naive Bayes and LogReg
P(A|C) = (P(C|A)P(A))/P(C) |
predicts T/F, "S" shaped, from 0-1 |
posterior = (likelihood x prior)/normalizing constant |
log(odds) = log(p/(1-p)) |
pros: easy/fast; simple because of the independence assumption; handles categorical inputs |
z = estimated intercept/std error |
cons: a value not seen in training -> 0% probability (fix with Laplace estimation, add 1); a bad probability estimator; the independent-predictor assumption is unlikely to hold |
y = log(odds F)·B1 + log(odds ratio T/F)·B2 |
LR: p = e^log(odds) / (1 + e^log(odds)) |
likelihood = product of p over the true examples × product of (1 − p) over the false examples |
log(L) = Σ log(pᵢ) over the true examples + Σ log(1 − pᵢ) over the false examples
R2=(SS(mean) - SS(fit))/SS(mean)
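A minimal sketch of the add-1 (Laplace) fix and the log(odds)→p conversion above; the counts passed in are made up for illustration:

import math

# Laplace (add-1) estimate: avoids a 0% probability when a value never
# appears with a class in the training set.
def laplace_prob(count_value_in_class, count_class, n_values):
    return (count_value_in_class + 1) / (count_class + n_values)

# Logistic regression: p = e^log(odds) / (1 + e^log(odds)).
def prob_from_log_odds(log_odds):
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(laplace_prob(0, 10, 3))    # unseen value -> small but nonzero
print(prob_from_log_odds(0.0))   # log(odds) = 0 -> p = 0.5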
ANNs
neuron = a thing that holds a number from 0 to 1 |
boolean: T=1, F=0 |
ŷ = 1 if w₁x₁ + w₂x₂ + ... + wₙxₙ − t (bias factor) > 0, −1 if < 0 |
ŷ = sign(w₁x₁ + w₂x₂ + ... + wₙxₙ − t) = sign(w·x) | λ = learning rate | xᵢⱼ = value of the jth attribute of training example xᵢ |
weight update: wⱼ(k+1) = weight param associated w/ the jth input link after the kth iteration |
wⱼ(k+1) = wⱼ(k) + λ(yᵢ − ŷᵢ(k))xᵢⱼ |
error = y - ŷ |
if error = +2, inc w of +ve-valued inputs | if error = −2, inc w of −ve-valued inputs |
Error E = ΣEk | k∊outputs |
Eₖ = ½(tₖ − oₖ)² |
output oᵢ = 1/(1 + e^(−netᵢ)) |
netᵢ = Σⱼ wᵢⱼ·oⱼ |
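A toy perceptron loop for the sign rule and weight update above (λ, the fixed bias t, and the AND-style data are illustrative choices, not from the notes):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])                 # AND with T=+1, F=-1
w, t, lam = np.zeros(2), 0.5, 0.1

for _ in range(20):                           # a few passes over the data
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) - t > 0 else -1    # sign(w·x - t)
        err = yi - y_hat                      # error in {-2, 0, +2}
        w += lam * err * xi                   # w_j += lam*(y - y_hat)*x_j

print(w, [1 if np.dot(w, xi) - t > 0 else -1 for xi in X])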
Inductive Bias, No Free Lunch
IB: anything influencing hypothesis choice other than training set |
e.g., which part of the hypothesis language is accessible, and the method of choosing among hypotheses |
NFL: for any 2 algorithms A&B, there exists a dataset for which A outperforms B |
assuming uniform P(x,y)→#of datasets for which A>B = # B>A |
SVM
frontier that best segregates 2 classes by margins |
polynomial kernel: k(x,y) = (x·y + 1)^d |
RBF kernel: k(x,y) = e^(−γ||x−y||²) |
tune by k-fold cross-val (k=5) |
adv: effective in high-dimensional spaces, even when # of dimensions > # of samples |
different kernel functions for different decision functions |
k1+k2 = even more complex |
dis: if # features >> # samples, risk of overfitting; probability estimates need (expensive) cross-validation |
linearly separable case: min ||w||² | ξᵢ: slack, how far ptᵢ is on the wrong side of its margin |
w·xᵢ + b ≥ 1 − ξᵢ if yᵢ = 1; w·xᵢ + b ≤ −1 + ξᵢ if yᵢ = −1 |
soft margin: min(||w||² + C·Σᵢ₌₁ⁿ ξᵢ) |
dual: max(Σλᵢ − ½ ΣᵢΣⱼ λᵢλⱼyᵢyⱼ(xᵢ·xⱼ)) |
dist btw parallel planes (margin width) = 2/||w|| |
||w|| = sqrt(w₁² + w₂² + ...) |
generalization error ≤ p̄(1 − s²)/s² |
p(bar) = avg correlation, s=strength |
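The two kernels above as plain functions (the d and γ values are just example hyperparameters you would tune by k-fold CV):

import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d                 # k(x,y) = (x·y + 1)^d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # k(x,y) = e^(-γ||x-y||²)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_kernel(x, y), rbf_kernel(x, y))
print(poly_kernel(x, y) + rbf_kernel(x, y))        # k1 + k2 is itself a valid kernel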
Errors
P(+) = 1/(1 + e^(−(w₀ + w₁x₁ + ...))) |
error= misclassification |
For new cases, predict: |
1 if w₀ + w₁x₁ + ... ≥ 0 (i.e., P(+) ≥ 0.5), else 0 |
if the linear term w₀ + w₁x₁ + ... increases with x, P(+) increases |
error = (FP+FN)/All |
sensitivity = TP/(TP+FN) |
specificity = TN/(TN+FP) |
+ve predicted val = TP/(TP+FP) |
-ve predicted val = TN/(TN+FN) |
true error: error on true underlying distribution (unmeasurable) |
apparent error: error on example used to train model (underestimates TE) |
generalization: ability to predict unseen cases |
Occam's Razor: entities should not be multiplied beyond necessity |
Overfitting: memorizing training set |
test error: error on ex. held out of training |
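The confusion-matrix metrics above in one helper (the counts are invented):

def confusion_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "error": (fp + fn) / total,       # misclassification rate
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
        "ppv": tp / (tp + fp),            # +ve predicted value
        "npv": tn / (tn + fn),            # -ve predicted value
    }

print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))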
KNN
select k ≈ sqrt(n); if that value is even, choose an odd k |
Rᵢ = {x : d(x,xᵢ) < d(x,xⱼ), ∀ j ≠ i} |
euclidean distance = sqrt((x − x₁)² + (y − y₁)²) |
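A bare-bones KNN classifier using Euclidean distance and the k ≈ sqrt(n) rule of thumb above (the six labelled points are toy data):

import math
from collections import Counter

def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(X, y, query, k=None):
    if k is None:                        # k ~ sqrt(n), forced odd
        k = max(1, int(math.sqrt(len(X))))
        if k % 2 == 0:
            k += 1
    nearest = sorted(zip(X, y), key=lambda p: euclid(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (2, 2)), knn_predict(X, y, (6, 5)))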
Model Eval
Holdout |
train on 2/3, test on 1/3 (the held-out part acts as a validation set); high variance on the estimate |
Leave-one-out |
train on N-1, test on 1 (good estimate) |
K-folds Cross Val |
divide the set into k parts; hold each part out in turn (LOO over parts) and train on the rest; optionally repeat N times; compute mean and std dev of the estimates |
Bootstrapping |
randomly draw N points with replacement to form S₁, train on S₁, test on S − S₁ |
Compare 2 methods: H0: meanLR = meanNB, H1: meanLR < meanNB |
t = (meanNB − meanLR)/S (S: pooled std error), reject H0 if t > t_alpha |
OR H1: meanLR ≠ meanNB |
2-tailed test: use t_alpha/2, t = (mean difference × sqrt(n))/S |
Better: stratify each fold to contain same % of positives and negatives
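A sketch of the paired comparison above: per-fold error rates for the two methods (numbers are made up), the t statistic, then compare to t_alpha (one-sided) or t_alpha/2 (two-sided):

import math

err_lr = [0.12, 0.15, 0.11, 0.14, 0.13]   # hypothetical per-fold errors
err_nb = [0.16, 0.18, 0.15, 0.17, 0.16]

diffs = [b - a for a, b in zip(err_lr, err_nb)]
n = len(diffs)
mean_d = sum(diffs) / n
s_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t = mean_d * math.sqrt(n) / s_d           # t = (mean diff) * sqrt(n) / S

# H1: meanLR < meanNB  -> reject H0 if t > t_alpha (df = n-1)
# H1: meanLR != meanNB -> reject H0 if |t| > t_alpha/2
print(mean_d, t)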
Decision Trees
asks a question: classifies based on T/F |
root, internal(arrows to and from), external(arrows to)(leaves) |
break into categories |
T/F and Y/N for each |
P(Y|T), P(N|T), P(Y|F), P(N|F) |
GI₂ = 1 − [(Y|F/(Y|F + N|F))² + (N|F/(Y|F + N|F))²] |
GI₁ = 1 − [(Y|T/(Y|T + N|T))² + (N|T/(Y|T + N|T))²] |
GI_all = (T/(T+F))·GI₁ + (F/(T+F))·GI₂ |
entropy: H(S) = −P(y)log₂P(y) − P(n)log₂P(n) |
find H(S_true) and H(S_false); info gain = H(S) − w₁H(S_true) − w₂H(S_false) |
w1 = T instances/all |
w2 = F instances/all |
choose the split with the largest info gain / smallest GI (worked sketch below)
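A worked sketch of the Gini / entropy split evaluation above, using hypothetical yes/no counts for the True and False branches:

import math

def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def entropy(yes, no):
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c)

yt, nt, yf, nf = 8, 2, 3, 7                  # made-up branch counts
w1, w2 = (yt + nt) / 20, (yf + nf) / 20      # T instances/all, F instances/all

gi_all = w1 * gini(yt, nt) + w2 * gini(yf, nf)
gain = entropy(yt + yf, nt + nf) - w1 * entropy(yt, nt) - w2 * entropy(yf, nf)
print(gi_all, gain)                          # want smallest GI / largest gain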
ROC and Lift Curves
ROC: sensitivity vs. (1 − specificity); the higher the curve the better, the flatter (closer to the diagonal) the worse |
sens: TP rate, 1-spec: FP rate |
Lift curves: for each slice of cases, find its % of the total and its % of all +ve responses |
y = (% of +ve responses captured) / (% of total cases) = lift |
x = % of total cases |
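ROC points from scored examples: sweep a threshold and record TPR (sensitivity) vs FPR (1 − specificity); the scores and labels below are toy values:

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]     # 1 = positive

for thr in sorted(set(scores), reverse=True):
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    print(thr, "TPR", tp / (tp + fn), "FPR", fp / (fp + tn))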
k-means clustering
user choose k, initialize k centers, loop: assign pts nearest those centers, move centroid of assigned pts |
centers start in dense regions or at random; the loop minimizes total (distance)² to the assigned centers (sketch below) |
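The k-means loop above in a few lines (k, the synthetic blobs, and the fixed iteration count are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
k = 2
centers = X[rng.choice(len(X), k, replace=False)]    # init at random points

for _ in range(10):                                  # loop: assign, then move
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)                       # nearest center
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

sse = ((X - centers[labels]) ** 2).sum()             # total (distance)² objective
print(centers, sse)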
Ensembles
Bagging: bootstrap aggregating |
Boosting: changing weights on pts and building series of classifiers, start w=1 |
incorrect pts weighed by # that is inversely proportional to training error |
w inc if misclassified, dec otherwise; classifiers are combined, weighted by their training-set accuracy |
Arcing(Adaptive resample&combine): like boosting but change w by update method |
e.g. Arc-x4: w(x) = 1 + e(x)⁴, e(x) = # of times x has been misclassified so far |
depends on: strength(perf of individuals), diversity (uncorrelated errors) |
bagging reduces error by reducing variance; boosting can reduce both bias & variance | bagging ≥ base classifier |
boosting is often better but can overfit noisy data |
Random forests: for each tree, bootstrap-sample the points; at each node, pick a random feature subset and split on the feature with best IG; recurse until done |
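A tiny illustration of the Arc-x4 re-weighting above, w(x) = 1 + e(x)⁴ (the misclassification counts are invented):

e = [0, 1, 3, 0, 2]                 # times each point has been misclassified so far
raw = [1 + ei ** 4 for ei in e]     # Arc-x4: w(x) = 1 + e(x)^4
total = sum(raw)
weights = [w / total for w in raw]
print(weights)                      # hard points get resampled far more often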
feature selection
removing irrelevant info for a better, faster model |
drop missing values or encode them |
drop: if all values are the same |
if two features are highly correlated, drop one of them |
if low correlation with target | for trees: features with least info gain |
forward, backward, stepwise selection: best model with single feature f1, then keep going until validation error stops dropping (see sketch below) |
beam or heuristic search |
for computation interpretability | genetic algorithms |
1) filters: all of the above + other correlation measures |
2) wrappers: build a classifier with a subset + evaluate on validation data, but 2^d possible subsets |
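A schematic forward-selection wrapper matching the description above: greedily add the feature that most lowers validation error and stop when it stops dropping. train_and_validate is a hypothetical callback you would supply:

def forward_selection(all_features, train_and_validate):
    # train_and_validate(subset) -> validation error (supplied by the caller)
    selected, best_err = [], float("inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        errs = {f: train_and_validate(selected + [f]) for f in candidates}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:          # validation error stopped dropping
            break
        selected, best_err = selected + [f_best], errs[f_best]
    return selected, best_err

# toy stand-in: error drops only while "f1" and "f3" are present
fake_err = lambda s: 1.0 - 0.3 * ("f1" in s) - 0.2 * ("f3" in s)
print(forward_selection(["f1", "f2", "f3"], fake_err))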
Bias and Var
PCA : dimensionality reduction |
linear combo of OG features |
maximize variance: keep the smallest # of components that explains 90% of the variance (sketch below) |
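A minimal PCA sketch for the notes above: eigenvectors of the covariance matrix give the linear combos of original features; keep the smallest number explaining 90% of the variance (the data is synthetic):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                         # center
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(vals)[::-1]                  # sort by explained variance
vals, vecs = vals[order], vecs[:, order]

explained = np.cumsum(vals) / vals.sum()
k = int(np.searchsorted(explained, 0.90)) + 1   # smallest # of PCs for 90%
Z = Xc @ vecs[:, :k]                            # reduced representation
print(k, Z.shape)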
μ = E(y|x), the true mean; ŷ = f(x,Ө) |
error: MSE = E[(ŷ − μ)²] | var = E[(ŷ − E[ŷ])²] | bias² = (E[ŷ] − μ)² | MSE = var + bias² + noise |
^best estimate of y given x and fixed params Ө |
KNN,ANN,DT: low bias, high var |
var: how much the estimate varies across datasets | bias: systematic prediction error, inability to fit the underlying function
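A small simulation of the decomposition above: fit a deliberately simple (biased) model on many resampled datasets and measure bias² and variance of ŷ at one query point (the setup is entirely synthetic):

import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(x)                  # mu = E(y|x)
x0 = 1.0                                      # query point

preds = []
for _ in range(500):                          # many training sets
    X = rng.uniform(0, 3, 30)
    y = true_f(X) + rng.normal(0, 0.3, 30)    # noisy targets
    w = np.polyfit(X, y, 1)                   # deliberately simple, so bias shows up
    preds.append(np.polyval(w, x0))

preds = np.array(preds)
bias2 = (preds.mean() - true_f(x0)) ** 2      # (E[y_hat] - mu)^2
var = preds.var()                             # E[(y_hat - E[y_hat])^2]
print(bias2, var)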
EM Expectation Maximization clust.
hard clustering: each pt only belongs to one cluster |
soft clustering: can belong to more than one cluster by % |
EM: automatically discover all params for k "sources" → we may not know each pt's source, but if we know μ,σ we can find how likely each pt is to come from each source |
mixture models: probabilistic way of soft clustering; each cluster is a Gaussian or multinomial |
P(xᵢ|b) = 1/sqrt(2πσᵦ²) · exp(−(xᵢ − μᵦ)²/(2σᵦ²)) | aᵢ = 1 − bᵢ = P(a|xᵢ) |
Bayesian posterior: bᵢ = P(b|xᵢ) = (P(xᵢ|b)P(b)) / (P(xᵢ|b)P(b) + P(xᵢ|a)P(a)) |
σᵦ² = (b₁(x₁ − μᵦ)² + ...) / (b₁ + b₂ + ...) |
μᵦ = (b₁x₁+b₂x₂+..) / (b₁+b₂+...) |
EM: place the k Gaussians randomly; for each pt compute P(b|xᵢ): does it look like it came from b? |
Working to adjust (μₐ, σₐ2) and (μᵦ, σᵦ2) to fit points assigned |
Iterate until convergence P(a) = 1- P(b) |
Could also estimate priors: P(b) = (b₁+b₂+...)/n |
“What proportion of the data is each distribution describing”
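The 1-D two-Gaussian EM loop above, with the posterior bᵢ and the μ/σ² updates written out (data and initial guesses are arbitrary):

import math, random

random.seed(0)
x = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(5, 1) for _ in range(50)]

def gauss(xi, mu, var):
    return math.exp(-(xi - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu_a, mu_b, var_a, var_b, p_b = -1.0, 6.0, 1.0, 1.0, 0.5   # rough starting guesses

for _ in range(30):
    # E-step: posterior b_i = P(b|x_i) for each point; a_i = 1 - b_i
    b = [gauss(xi, mu_b, var_b) * p_b /
         (gauss(xi, mu_b, var_b) * p_b + gauss(xi, mu_a, var_a) * (1 - p_b))
         for xi in x]
    a = [1 - bi for bi in b]
    # M-step: weighted means, variances, and the prior P(b)
    mu_b = sum(bi * xi for bi, xi in zip(b, x)) / sum(b)
    mu_a = sum(ai * xi for ai, xi in zip(a, x)) / sum(a)
    var_b = sum(bi * (xi - mu_b) ** 2 for bi, xi in zip(b, x)) / sum(b)
    var_a = sum(ai * (xi - mu_a) ** 2 for ai, xi in zip(a, x)) / sum(a)
    p_b = sum(b) / len(x)

print(mu_a, mu_b, p_b)                         # P(a) = 1 - P(b)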