
ISDS 415 Data Mining Ch. 7 - 10.

Chapter 7

How to select K in kNN (1Q)
How to select k: use validation data; choose whichever k gives the lowest validation error.
K = 1: classify by the class of the single nearest record.
Binary Classification with Even K's
DO NOT USE even numbers for k, since they can lead to a tie. XLMiner will pick the lowest probability and can choose an even k, but that doesn't mean one should be chosen.
 
K >1: classify by the majority decision rule based on the nearest k records
Low K values:
Capture local structure but may also capture noise. You can't rely on a single neighbor.
High K values:
Provide more smoothing but may lose local detail. K can be as large as the training sample
Choose the k that gives the lowest validation error rate (see the sketch below).
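A minimal sketch of this selection rule, assuming scikit-learn rather than XLMiner; the candidate k values and the synthetic data are illustrative, not from the course:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X_train, y_train, X_valid, y_valid, candidate_ks=(1, 3, 5, 7, 9)):
    # Return the (odd) k with the lowest validation error rate
    best_k, best_err = None, np.inf
    for k in candidate_ks:                          # odd values avoid ties in binary problems
        knn = KNeighborsClassifier(n_neighbors=k)   # Euclidean distance is the default metric
        knn.fit(X_train, y_train)
        err = 1 - knn.score(X_valid, y_valid)       # validation error rate
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# Tiny synthetic example (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(pick_k(X[:80], y[:80], X[80:], y[80:]))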

Euclidean Distance (1Q)

Sometimes predictors need to be standardized to equalize scales before computing distances. Standardized = normalized; standardized values typically fall roughly within (-3, 3).
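A minimal sketch of standardizing two predictors and then computing a Euclidean distance; the age/income records are made-up numbers:

import numpy as np

X = np.array([[25, 40000.0],      # toy records: [age, income]
              [32, 52000.0],
              [47, 61000.0]])

# Standardize each column: (value - mean) / std, putting both predictors on the same scale
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two standardized records
dist = np.sqrt(((Z[0] - Z[1]) ** 2).sum())
print(dist)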
# of possible partitions in Recursive Partitioning (2Q)
Continuous: (n - 1) * p
Categorical: 2^(m - 1) - 1 for m levels: 2 levels -> 1 partition, 3 levels -> 3 partitions, 4 levels -> 7 partitions, 5 levels -> 15 partitions (verified in the sketch below)
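A quick check of the categorical partition counts listed above, assuming the standard 2^(m-1) - 1 formula for a predictor with m levels:

def n_categorical_partitions(m_levels):
    # Number of distinct binary partitions of a categorical predictor with m levels
    return 2 ** (m_levels - 1) - 1

print([n_categorical_partitions(m) for m in (2, 3, 4, 5)])   # [1, 3, 7, 15]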

Cut Off Value in Classification

- Cutoff = 0.5 by default: the estimated probability is the proportion of 1's among the k nearest neighbors, so the majority decision rule corresponds to a 0.5 cutoff for classifying records.
You can adjust the cutoff value to improve accuracy.
Y = 1 (if p > cutoff)
Y = 0 (if p < cutoff)

Cut Off Example Question

Example: Suppose cutoff = 0.9, k = 7, and we observed 5 C1 and 2 C0. Is Y = 1 or 0?
- Probability(Y = 1) = 5/7 = 0.71 -> 0.71 < 0.9 -> Y = 0
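A minimal sketch of the cutoff rule using the numbers from the example above:

def classify_by_cutoff(n_class1, k, cutoff):
    # Classify as 1 when the estimated P(Y = 1) exceeds the cutoff
    p_hat = n_class1 / k
    return 1 if p_hat > cutoff else 0

print(classify_by_cutoff(n_class1=5, k=7, cutoff=0.9))   # 0, since 5/7 = 0.71 < 0.9
print(classify_by_cutoff(n_class1=5, k=7, cutoff=0.5))   # 1 under the default 0.5 cutoff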

Regression Tree

- Used with continuous outcome variables. Many splits are attempted; choose the one that minimizes impurity
- Prediction is computed as the average of the numerical target variable in the rectangle
- Impurity is measured by the sum of squared deviations from the leaf mean
- Performance is measured by RMSE (see the sketch below)
A regression tree is used for prediction. Compared to a classification tree, we only have to replace the impurity measure with the sum of squared deviations; everything else stays the same.
Split by irrelevant variables = Bad impurity score
Only split with relevant variables
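A minimal sketch of growing a regression tree and measuring RMSE with scikit-learn; the data, max_depth, and train/validation split are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one continuous predictor and a continuous outcome (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)
X_train, X_valid, y_train, y_valid = X[:150], X[150:], y[:150], y[150:]

# The default impurity is squared error: the sum of squared deviations from the leaf mean
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)

# Each leaf predicts the average of its training outcomes; performance is measured by RMSE
pred = tree.predict(X_valid)
rmse = np.sqrt(np.mean((y_valid - pred) ** 2))
print(round(rmse, 3))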

General Info

- Makes no assumptions about the data
- A record gets classified as whatever the predominant class is among nearby records
- The way to find the k nearest neighbors in kNN is through the Euclidean distance
Rescaling: only kNN needs rescaling, because scale determines how much each variable contributes to the distance. No need for logistic regression, since rescaling does not change the p-values or RMSE.
No need for CART either, since rescaling doesn't change the order of values within a variable.
XLMiner can only handle up to k = 10
 

Chapter 9

Properties of CART (Classification And Regression Tree) (3Q)
- Model Free
- Automatic variable selection
- Needs a large sample size (because it's model free)
- Only gives horizontal or vertical splits
- Both methods of CART (classification and regression trees) are model free
Best pruned tree: the natural end of the partitioning process is 100% purity in each leaf, which fits noise in the data, so the full process naturally overfits. Since the minimum error tree is still slightly overfitted, people partition a bit less and prune back from the minimum error tree.
Minimum error tree: the tree with the lowest validation error.
Full tree: the largest tree; training error equals zero; overfitted.
Note: The full tree can be the same as the minimum error tree, BUT the best pruned tree should usually be smaller than the other trees.

Impurity Score

- Metric to determine the homogeneity of the resulting subgroups of observations
- For both measures, the lower the better
- Neither measure has an advantage over the other
Gini Index: ranges from 0 to 0.50 for binary classes
Entropy Measure: ranges from 0 to log2(2) = 1 if binary, or from 0 to log2(m), where m is the total # of classes of Y
Overall Impurity Measure: weighted average of the impurity of the individual rectangles, with weights equal to the proportion of cases in each rectangle
Choose the split that reduces impurity the most (split points become nodes on the tree); see the sketch below
Check notes for the weighted average impurity calculation
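A minimal sketch of the Gini index, entropy, and the weighted overall impurity of a candidate split; the class counts are made-up numbers:

import numpy as np

def gini(counts):
    # Gini index of one rectangle given its class counts; 0 = pure, 0.50 = worst for binary
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Entropy in bits; 0 = pure, log2(m) = worst for m classes
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def overall_impurity(left_counts, right_counts, measure=gini):
    # Weighted average of the two rectangles' impurities; weights = proportion of cases in each
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * measure(left_counts) + (n_right / n) * measure(right_counts)

# Candidate split: left rectangle has 8 of class 1 and 2 of class 0; right has 3 and 7
print(overall_impurity([8, 2], [3, 7]))             # Gini-based overall impurity
print(overall_impurity([8, 2], [3, 7], entropy))    # entropy-based overall impurity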

Dimensional Predictors Q's

Continuous Partitions
(n - 1) x p, where p is the number of predictors (also works for more than 2-dimensional predictors)
Categorical Partitions
e.g., a categorical variable with levels a, b, c, d: (3 levels -> 3 partitions), (4 levels -> 7 partitions)
XLMiner only supports binary categorical variables

When to Stop Partitioning

- Plot the error rate as a function of the number of splits for training vs. validation data -> the gap between the two curves indicates overfitting

We can continue partitioning the tree until a FULL tree is obtained in the end. A full tree is usually overfitted, so we have to impose an EARLY STOP ...
- The training error rate approaches 0 as you partition further, but you must stop early before letting it touch 0. Early stop (minimum error tree or best pruned tree), OR stop based on chi-square tests (see the sketch below):
- If the improvement from the additional split is statistically significant -> continue. If not, STOP.
Largest to smallest: Full tree > Minimum error tree > Best pruned tree (the best pruned tree is usually smaller than the minimum error tree, since it is chosen within one standard error of it). Keep in mind: the full tree CAN BE THE SAME as your minimum error tree.
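A minimal sketch of the chi-square stopping idea, assuming scipy's chi2_contingency and a 0.05 significance level; the child-node counts are illustrative, and the exact test used by the course/software may differ:

from scipy.stats import chi2_contingency

# Rows = the two child rectangles of a candidate split, columns = class counts (toy numbers)
table = [[40, 10],    # left child: 40 of class 1, 10 of class 0
         [15, 35]]    # right child: 15 of class 1, 35 of class 0

chi2, p_value, dof, expected = chi2_contingency(table)

# If the improvement from the split is statistically significant, continue; otherwise stop
if p_value < 0.05:
    print("split is significant -> continue partitioning")
else:
    print("split is not significant -> STOP")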

Error Rate as you continue Splitting

Training error decreases as you split more
Validation error decreases and then increases with the tree size
 

Chapter 10

Assumptions For Logistic Regression (1Q)
- Generalized linearity
Logistic Regression Equation (2Q)
NOT model free -> based on the following equations:
log odds = beta0 + beta1X1 + ... + betaqXq
log(p / (1 - p)) = beta0 + beta1X1 + ... + betaqXq
P = 1 / (1 + exp(-(beta0 + beta1X1 + ... + betaqXq)))
Ex. the probability could look like this: P = 1 / (1 + exp(0.6535 - 0.4535X)) -> where you can sub in x (see the sketch below)
Direct interpretation of beta1: per unit increase of X1, the log odds will increase by beta1. Since that is not very clear on its own, state it as: the log odds increase by beta1 (equivalently, the odds are multiplied by exp(beta1); see The Odds below).
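A minimal sketch that plugs the example above into the equivalent forms, reading beta0 = -0.6535 and beta1 = 0.4535 off that probability formula; the x value is arbitrary:

import math

beta0, beta1 = -0.6535, 0.4535          # so -(beta0 + beta1*x) = 0.6535 - 0.4535*x, matching the example

def log_odds(x):
    return beta0 + beta1 * x            # log(p / (1 - p))

def probability(x):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

x = 2.0                                 # sub in any x value
print(log_odds(x))                      # the logit
print(math.exp(log_odds(x)))            # the odds = p / (1 - p)
print(probability(x))                   # P(Y = 1)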
Types of Regression:
- Logistic Regression (binary outcome)
- Multiple Linear Regression (continuous outcome)
- Multinomial Logistic Regression (categorical outcome with 3 or more levels)
- Ordinal Logistic Regression (categorical outcome with 3 or more ordinal levels)
- Poisson Regression (count outcome)
- Negative Binomial Regression (count outcome)
All of these regression models can be called generalized linear models (GLM)!
- The 3 equations above are equivalent to each other.
- The MLR equation is never exactly true if Y is binary, and thus this model cannot be used.
- Since MLR needs a continuous outcome, change Y into P (the probability), which is continuous; this also eliminates the error term, since the randomness is already built into the probability.

The Odds

The odds ratio is the exponential form of beta
- Beta is the coefficient in your regression model
The Odds: p / (1 - p) -> p is the probability
The Odds Ratio: e^(beta1); it equals 1 when beta1 = 0 (no effect)
Logit: log(odds). It takes values from -infinity to +infinity. It is the dependent variable in the logistic regression equation.
Probability: P = odds / (1 + odds)
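A minimal sketch of converting among probability, odds, and logit; the starting probability of 0.7 is arbitrary:

import math

p = 0.7                        # an arbitrary probability
odds = p / (1 - p)             # the odds
logit = math.log(odds)         # log(odds); ranges from -infinity to +infinity

p_back = odds / (1 + odds)     # probability recovered from the odds
print(odds, logit, p_back)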

Comparing 2 Models

First criterion: pick the model with the lowest validation error.
Second criterion: when the validation errors are comparable, pick the one with fewer variables.
E.g., suppose models 1 and 2 have validation errors of 26.2% and 26.3%. Their model sizes are
10 and ..., respectively. Which model is better?
- Initially go by the lowest validation error, but when the errors are too similar (26.2% vs. 26.3% is comparable), go by the LOWEST model size.

Performance Evaluation

(1) Partition the data into training and validation sets
Training set: used to grow the tree
Validation set: used to assess classification performance
(2) More than 2 classes (m > 2)
Same structure, except that the terminal nodes take one of the m class labels

Example Problem

Check whether the variable is binary or continuous (see the sketch below).
If binary: use the odds ratio DIRECTLY and say "____ times".
If continuous: odds ratio - 1, then convert to %.
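A minimal sketch of these two interpretations; the coefficient values are made up for illustration:

import math

beta_binary = 0.81             # hypothetical coefficient of a binary predictor
beta_continuous = 0.045        # hypothetical coefficient of a continuous predictor

# Binary predictor: report the odds ratio directly as "___ times the odds"
or_binary = math.exp(beta_binary)
print(f"odds are {or_binary:.2f} times as large when the binary predictor = 1")

# Continuous predictor: (odds ratio - 1), converted to a percent change per unit increase
or_continuous = math.exp(beta_continuous)
print(f"odds increase by {(or_continuous - 1) * 100:.1f}% per unit increase in the predictor")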
 
