
ISDS 415 Data Mining Ch. 7 - 10.

Chapter 7

How to select K in kNN (1Q)
By validation data: choose whichever k gives the lowest validation error.
Binary Classification With Even K's
Do NOT use even values of k, since they can lead to a tie. XLMiner will pick the lowest probability and can choose an even number, but that doesn't mean it should be chosen.
 
K = 1: classify by the single nearest record
K > 1: classify by the majority decision rule based on the nearest k records
Low K values:
Capture local structure but may also capture noise. You can't rely on one neighbor
High K values:
Provide more smoothing but may lose local detail. K can be as large as the training sample
Choose the k that gives you the lowest validation error rate
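A minimal sketch of choosing k by validation error, assuming plain Python, Euclidean distance, and binary labels. The toy data, the function names (knn_predict, validation_error, choose_k), and the candidate list of odd k values are hypothetical and are not taken from the course or from XLMiner.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training records."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def validation_error(train_X, train_y, valid_X, valid_y, k):
    """Share of validation records that are misclassified for a given k."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(valid_X, valid_y))
    return wrong / len(valid_y)

def choose_k(train_X, train_y, valid_X, valid_y, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the lowest validation error (first one on ties)."""
    errors = {k: validation_error(train_X, train_y, valid_X, valid_y, k)
              for k in candidates}
    best_k = min(errors, key=errors.get)
    return best_k, errors

# Hypothetical toy data: two clusters plus one noisy record labeled 1 inside the 0 cluster.
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.05),
           (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
train_y = [0, 0, 0, 1, 1, 1, 1]
valid_X = [(1.1, 1.1), (0.9, 0.9), (3.1, 3.0), (2.9, 2.9)]
valid_y = [0, 0, 1, 1]

best_k, errors = choose_k(train_X, train_y, valid_X, valid_y)
print(best_k, errors)
```

On this toy data k = 1 is misled by the single noisy record, k = 3 and k = 5 both reach zero validation error, and k = 7 (the whole training sample) always predicts the majority class.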

Euclidean Distance (1Q)

Sometimes predictors need to be standardized to equalize scales before computing distances. Standardized = normalized (z-scores, mostly between -3 and 3)
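A small sketch of why standardizing matters for Euclidean distance, assuming z-score standardization with the sample standard deviation. The income and age values and the helper names (standardize, euclidean) are hypothetical.

```python
import math
import statistics

def standardize(column):
    """Convert a column to z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

def euclidean(a, b):
    """Euclidean distance between two records with the same predictors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical predictors on very different scales.
income = [30000.0, 60000.0, 90000.0]
age = [25.0, 35.0, 45.0]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

raw_distance = euclidean(raw[0], raw[1])           # dominated by income
scaled_distance = euclidean(scaled[0], scaled[1])  # both predictors contribute equally
print(raw_distance, scaled_distance)
```

Before standardizing, the distance is almost entirely the income difference; after standardizing, income and age contribute the same amount.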
# of possible partitions in Recursive Partitioning (2Q)
Continuous: (n-1)*p
Categorical: 1ID - _P, 2ID - _P, 3ID - _P, 4ID - _P, 5ID - 15P

Cut Off Value in Classification

- Cutoff = 0.5 by default. The estimated probability is the proportion of 1's among the k nearest neighbors, so the majority decision rule corresponds to a cutoff of 0.5 for classifying records
You can adjust the cutoff value to improve accuracy
Y = 1 (if p > cutoff)
Y = 0 (if p < cutoff)

Cut Off Example Question

Example: Suppose cutoff = 0.9 and k = 7, and we observe 5 C1 and 2 C0 among the neighbors. Is Y = 1 or 0?
- Probability(Y=1) = 5/7 = 0.71 -> 0.71 < 0.9 -> Y = 0
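The example above can be sketched in a few lines of Python. The function name classify_with_cutoff is hypothetical, and assigning Y = 0 when p equals the cutoff exactly is an assumption, since the notes only cover p > cutoff and p < cutoff.

```python
def classify_with_cutoff(neighbor_labels, cutoff=0.5):
    """Return (predicted class, p), where p is the share of 1's among the k neighbors."""
    p = sum(neighbor_labels) / len(neighbor_labels)
    # Assumption: a probability exactly equal to the cutoff is classified as 0.
    label = 1 if p > cutoff else 0
    return label, p

# k = 7 neighbors: 5 in class 1 (C1) and 2 in class 0 (C0), as in the example.
neighbors = [1, 1, 1, 1, 1, 0, 0]

strict_label, p = classify_with_cutoff(neighbors, cutoff=0.9)
default_label, _ = classify_with_cutoff(neighbors)  # majority rule, cutoff 0.5
print(round(p, 2), strict_label, default_label)
```

With the default cutoff of 0.5 the same neighbors would give Y = 1, which is the majority decision rule.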

Advantages and Disadvantages

Advantages:
- Simple and intuitive
- No assumptions required about the data, so there is no model to misspecify
- Effective with large training data
Disadvantages:
- Curse of dimensionality: the required sample size grows quickly when there are many predictors
- Rough sample size: # of predictors x 1000 x 100, so 50 predictors need about 5 million observations

General Info

- Makes no assumptions about the data
- A record gets classified as whatever the predominant class is among nearby records
- The k nearest neighbors in kNN are found using the Euclidean distance
Rescaling: only kNN needs rescaling, because the scale of each variable determines how much it contributes to the distance. No need for logistic regression, since rescaling does not change the p-value or RMSE
No need for CART, since rescaling doesn't change the order of values in a variable
XLMiner can only handle up to K = 10
 

Chapter 9

Properties of CART (3Q)
- Model free
- Automatic variable selection
- Needs a large sample size (because it is model free)
- Only gives horizontal or vertical splits
- Training error gets smaller and smaller with the tree size
- Validation error decreases and then increases with the tree size
- Both CART methods (classification trees and regression trees) are model free
Trees
Best pruned tree: the smallest tree whose validation error is within one standard error of the minimum error (minimum error plus standard error); usually smaller than the minimum error tree. Overfitting arises naturally because the natural end of the process is 100% purity in each leaf, which ends up fitting noise in the data. The minimum error tree can still be slightly overfitted, so people partition a bit less than the minimum error tree to compensate
Minimum error tree: the tree with the lowest validation error.
Full tree: the largest tree; training error equals zero; overfitted
Note: the full tree can be the same as the minimum error tree, BUT the best pruned tree should usually be smaller than the other trees
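A hedged sketch of picking the minimum error tree and the best pruned tree from a pruning sequence. The candidate trees, the validation set size, and the function name pick_trees are hypothetical; estimating the standard error as sqrt(e(1-e)/n) for the minimum error rate e is an assumption about how the standard error is computed.

```python
import math

def pick_trees(candidates, n_valid):
    """candidates: list of (number_of_nodes, validation_error_rate) tuples."""
    # Minimum error tree: lowest validation error (smaller tree wins a tie).
    min_tree = min(candidates, key=lambda t: (t[1], t[0]))
    min_err = min_tree[1]
    # Assumption: binomial standard error of the minimum validation error rate.
    std_err = math.sqrt(min_err * (1 - min_err) / n_valid)
    # Best pruned tree: smallest tree with error within one standard error of the minimum.
    within = [t for t in candidates if t[1] <= min_err + std_err]
    best_pruned = min(within, key=lambda t: t[0])
    return min_tree, best_pruned

# Hypothetical pruning sequence: validation error falls, then rises with tree size.
candidates = [(1, 0.30), (3, 0.18), (5, 0.12), (9, 0.10), (15, 0.11), (25, 0.14)]
min_tree, best_pruned = pick_trees(candidates, n_valid=200)
print(min_tree, best_pruned)
```

Here the 9-node tree has the lowest validation error, while the 5-node tree is the smallest one within one standard error of it.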

Recursive Partitioning

(1) Enumerate all possible partitions and select the one with the lowest impurity score
Impurity score: Gini or entropy measure
(2) Each partition after the first step is a partition of a subset of the same dataset -> repeat, choosing the lowest impurity score each time
- Candidate split points are the midpoints between consecutive sorted values of a predictor: the two lowest values (14.0 & 14.8) -> split at 14.4
- Drop the lowest value and repeat with the 2nd and 3rd lowest values (14.8 & 16.0) -> split at 15.4
(3) Continue partitioning until ALL regions contain only class 1 or only class 0
- But you must impose an early stop to prevent overfitting: you can split so much that training error drops to 0 while validation error becomes very HIGH
The algorithm decides where to partition
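A short sketch of how candidate split points for one continuous predictor can be listed as midpoints between consecutive sorted values. The function name candidate_splits is hypothetical; the values 14.0, 14.8, and 16.0 are the ones used in the notes.

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values of one predictor."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

# Values from the notes, unsorted and with one duplicate to show it is ignored.
splits = candidate_splits([16.0, 14.0, 14.8, 14.8])
print([round(s, 1) for s in splits])
```

Each midpoint would then be scored with an impurity measure, and the split with the lowest score is kept.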

Impurity Score

- Metric for the homogeneity of the resulting subgroups of observations
- For both measures, the lower the better
- Neither measure has an advantage over the other
Gini index: ranges from 0 to 0.50 when Y is binary
Entropy measure: ranges from 0 to log2(2) = 1 when Y is binary, OR from 0 to log2(m), where m is the total # of classes of Y
Overall impurity measure: weighted average of the impurity of the individual rectangles, the weights being the proportion of cases in each rectangle
Choose the split that reduces impurity the most (split points become nodes on the tree)
Check notes for that distance = to weighted average ratio
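A minimal sketch of the two impurity measures and the weighted overall impurity of a split, using the standard definitions Gini = 1 - sum(p_k^2) and entropy = -sum(p_k * log2(p_k)). The function names and the example labels are hypothetical.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 0 for a pure group, 0.5 for a 50/50 binary group."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: 0 for a pure group, log2(m) for m equally frequent classes."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(groups, measure):
    """Average impurity of the rectangles, weighted by their share of cases."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)

# Hypothetical split of 8 records into two rectangles.
parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]

parent_gini = gini(parent)
split_gini = weighted_impurity([left, right], gini)
print(parent_gini, split_gini)
```

The split lowers the Gini impurity from about 0.47 to about 0.19, so it would be preferred over leaving the parent rectangle unsplit.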

Dimensional Predictors Q's

With 21 observations and 2 continuous predictors (2 dimensions), how many partitions can we have?
# of partitions per predictor = # of observations - 1 -> 20 P; over both predictors, (n-1) x p = 20 x 2 = 40 P
Continuous Partitions
(n-1) x p -> for p-dimensional predictors (p continuous predictors)
Categorical Partitions
Levels such as a, b, c, d are split into two groups: (3 levels, 3P), (4 levels, 7P)
XLMiner only supports binary categorical variables
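A quick sketch of the partition counts. The continuous count uses the (n-1) x p formula from the notes; the closed form 2^(k-1) - 1 for a categorical predictor with k levels is an assumption that reproduces the counts given in the notes (3 levels -> 3, 4 levels -> 7, 5 levels -> 15). The function names are hypothetical.

```python
def continuous_partitions(n_obs, n_predictors):
    """(n - 1) candidate split points per continuous predictor, assuming distinct values."""
    return (n_obs - 1) * n_predictors

def categorical_partitions(n_levels):
    """Ways to split k levels into two non-empty groups: 2^(k-1) - 1 (assumed closed form)."""
    return 2 ** (n_levels - 1) - 1

per_predictor = continuous_partitions(21, 1)
both_predictors = continuous_partitions(21, 2)
by_levels = {k: categorical_partitions(k) for k in (3, 4, 5)}
print(per_predictor, both_predictors, by_levels)
```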

When to Stop Partitioning

- Error rate as a function of the number of splits for training vs validation data -> indicates overfitting

We can continue partitioning the tree so a FULL tree will be obtained in the end. A full tree is usually overfitted so we have to impose an EARLY STOP ...

- Training error rate approaches 0 as you partition further, so you must stop early, before letting it touch 0
Early stop (minimum error tree or best pruned tree)

OR

Stop based on chi-square tests (not commonly used for CART, which uses the min error tree or best pruned tree):
- If the improvement from the additional split is statistically significant -> continue. If not, STOP.
Largest to smallest: Full tree > Min error tree > Best pruned tree (the best pruned tree is usually smaller than the min error tree). Keep in mind: the full tree CAN BE THE SAME as your min error tree

Regression Tree

- Used with continuous outcome variables. Many splits are attempted; choose the one that minimizes impurity
- Prediction is computed as the average of the numerical target variable in the rectangle
- Impurity is measured by the sum of squared deviations from the leaf mean
- Performance is measured by RMSE

Regression Tree is used for prediction. Compared to a classification tree, we only have to ...
replace the impurity measure with the sum of squared deviations; everything else stays the same.
Splitting by irrelevant variables = bad impurity score
Only split with relevant variables
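A minimal sketch of one regression tree split on a single predictor, assuming candidate cuts at midpoints between consecutive values and impurity measured as the sum of squared deviations from each leaf mean. The data and the function names are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean of the group."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(x, y):
    """Return (cut, impurity, left_prediction, right_prediction) for the best single split."""
    distinct = sorted(set(x))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        impurity = sum_sq_dev(left) + sum_sq_dev(right)
        if best is None or impurity < best[1]:
            # Each leaf predicts the average outcome of its records.
            best = (cut, impurity, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical data with an obvious jump in the outcome between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
cut, impurity, left_pred, right_pred = best_regression_split(x, y)
print(cut, impurity, left_pred, right_pred)
```

The chosen cut separates the low outcomes from the high ones, and each side is predicted by its own average.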

Error Rate as you continue Splitting

Performance Evaluation

(1) Partition the data into training and validation sets
Training set: used to grow the tree
Validation set: used to assess classification performance
(2) More than 2 classes (m > 2)
Same structure, except that the terminal nodes take one of the m class labels
 

Chapter 10

Assumptions For Logistic Regression (1Q)
- Generalized linearity
Logistic Regression Equation (2Q)
NOT model free -> based on the following equations
log odds = beta0 + beta1*X1 + ... + betaq*Xq
log(p/(1-p)) = beta0 + beta1*X1 + ... + betaq*Xq
p = 1/(1 + exp(-(beta0 + beta1*X1 + ... + betaq*Xq)))
The direct interpretation of beta1 is that per unit increase of X1, the log odds increase by beta1 -> that is not clear, so instead you must say
The odds are multiplied by exp(beta1) per unit increase of X1, holding the other predictors fixed
- The 3 equations are equivalent to each other
- All regression models can be called generalized linear models
- The MLR equation is never true if Y is binary, and thus this model cannot be used
Since MLR needs a continuous outcome, change Y into p (the probability that Y = 1); this also eliminates the error term, since the randomness is already captured by p
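A small numeric check that the logistic regression equations agree with each other: computing p from the log odds and then taking log(p/(1-p)) returns the same linear combination. The coefficient and predictor values and the function names are hypothetical.

```python
import math

def log_odds(beta0, betas, xs):
    """Linear part: beta0 + beta1*X1 + ... + betaq*Xq."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

def probability(beta0, betas, xs):
    """p = 1 / (1 + exp(-(beta0 + beta1*X1 + ... + betaq*Xq)))."""
    return 1 / (1 + math.exp(-log_odds(beta0, betas, xs)))

# Hypothetical coefficients and one record with two predictors.
beta0, betas = -1.0, [0.5, 0.25]
xs = [2.0, 4.0]

z = log_odds(beta0, betas, xs)        # -1 + 0.5*2 + 0.25*4
p = probability(beta0, betas, xs)
recovered = math.log(p / (1 - p))     # should match z up to rounding
print(z, round(p, 3), recovered)
```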

The Odds

The odds ratio is the exponential form of beta: odds ratio = exp(beta)
- Beta is the coefficient in your regression model
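A brief sketch showing that the odds ratio for a one-unit increase in a predictor equals exp(beta), assuming a logistic model with a single predictor. The coefficient values and the function names are hypothetical.

```python
import math

def prob(beta0, beta1, x):
    """Logistic probability for a single predictor."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

# Hypothetical coefficients.
beta0, beta1 = -2.0, 0.7

# Ratio of the odds at x = 3 to the odds at x = 2 (a one-unit increase).
ratio = odds(prob(beta0, beta1, 3.0)) / odds(prob(beta0, beta1, 2.0))
print(ratio, math.exp(beta1))
```

The same ratio results for any starting value of x, which is why exp(beta) is reported as the odds ratio.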

Comparing 2 Models

First criterion: pick the model with the lowest validation error
Second criterion: when the validation errors are comparable, pick the one with fewer variables
E.g., suppose models 1 and 2 have validation errors of 26.2% and 26.3%. Their model sizes are
10 and, respec­tively. Which model is better?
- Initially go by the lowest validation error, but when the errors are too similar (26.2% and 26.3% -> comparable), go by the LOWEST model size
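A hedged sketch of the two criteria as a selection rule. The validation errors are the ones from the example, but the model names, the model sizes, and the tolerance that defines "comparable" are all hypothetical, since the notes do not give a second model size or a threshold.

```python
def pick_model(models, tolerance=0.005):
    """models: list of (name, validation_error, n_variables) tuples."""
    lowest = min(m[1] for m in models)
    # Hypothetical rule: errors within `tolerance` of the lowest count as comparable.
    comparable = [m for m in models if m[1] - lowest <= tolerance]
    # Among comparable models, prefer fewer variables, then lower error.
    return min(comparable, key=lambda m: (m[2], m[1]))

# Errors from the example; names and sizes are made up for illustration.
models = [("A", 0.262, 10), ("B", 0.263, 6)]

chosen = pick_model(models)
strict = pick_model(models, tolerance=0.0)  # only the first criterion applies
print(chosen[0], strict[0])
```

With a tolerance of zero the rule falls back to the first criterion alone and picks the model with the lowest validation error.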
 
