Chapter 7
How to Select K in kNN (1Q)
- Choose K using the validation data: whichever K gives the lowest validation error (see the sketch below).
- K = 1: classify each record by its single nearest neighbour.
- K > 1: classify by the majority decision rule based on the K nearest records.
- Low K values: capture local structure but may also capture noise; you can't rely on one neighbour.
- High K values: provide more smoothing but may lose local detail; K can be as large as the training sample.
- Binary classification with even K: DO NOT USE even numbers, since they can lead to ties. XLMiner will pick the lowest probability and can choose an even number, but that does not mean an even K should be chosen.
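A minimal sketch of choosing K by validation error, assuming scikit-learn and made-up two-predictor data (names and values are illustrative, not from the course files):

```python
# Sketch: pick the K whose validation error is lowest (assumes scikit-learn is installed).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # two illustrative predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # illustrative binary outcome

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

errors = {}
for k in range(1, 11):                     # XLMiner's limit of K = 10; prefer odd K for binary problems
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_valid, y_valid)   # validation error rate

best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```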
Euclidean Distance (1Q)
- Predictors sometimes need to be standardized to equalize scales before computing distances. Standardized = normalized; z-scores typically fall roughly within (-3, 3). See the sketch below.
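A small sketch of the distance calculation with made-up records: the predictors are standardized to z-scores before the Euclidean distance is computed, so one large-scale variable cannot dominate.

```python
# Euclidean distance before and after standardization; numbers are illustrative only.
import numpy as np

X = np.array([[25.0, 50000.0],    # e.g., age and income on very different scales
              [40.0, 72000.0],
              [31.0, 61000.0]])

z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize each column (z-scores)
d_raw = np.sqrt(((X[0] - X[1]) ** 2).sum())       # distance dominated by the large-scale column
d_std = np.sqrt(((z[0] - z[1]) ** 2).sum())       # distance with equalized scales
print(d_raw, d_std)
```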
# of Possible Partitions in Recursive Partitioning (2Q)
- Continuous: (n - 1) * p, where n is the number of records and p is the number of predictors.
- Categorical: 2^(m-1) - 1 partitions for a predictor with m levels (2 levels -> 1, 3 levels -> 3, 4 levels -> 7, 5 levels -> 15).
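A quick check of those counts (a sketch, not course code): a continuous predictor allows n - 1 split points per predictor, and a categorical predictor with m levels allows 2^(m-1) - 1 binary partitions.

```python
# Number of possible binary partitions for a categorical predictor with m levels.
def categorical_partitions(m: int) -> int:
    return 2 ** (m - 1) - 1

# Number of possible splits for continuous predictors: (n - 1) per predictor, times p predictors.
def continuous_splits(n: int, p: int) -> int:
    return (n - 1) * p

print([categorical_partitions(m) for m in (2, 3, 4, 5)])  # -> [1, 3, 7, 15]
print(continuous_splits(n=100, p=4))                      # -> 396
```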
Cut Off Value in Classification
- Cutoff = 0.5 by default: the predicted probability is the proportion of class-1 observations among the k nearest neighbors, so a 0.5 cutoff is equivalent to the majority decision rule for classifying records.
- You can adjust the cutoff value to improve accuracy.
- Y = 1 if p > cutoff
- Y = 0 if p < cutoff
Cut Off Example Question
Example: suppose cutoff = 0.9, k = 7, and we observe 5 class-1 and 2 class-0 neighbors. Is Y = 1 or 0? P(Y = 1) = 5/7 ≈ 0.71; since 0.71 < 0.9, classify Y = 0.
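A tiny sketch of the cutoff rule above (the counts match the example; everything else is illustrative):

```python
# Classify using a cutoff on the kNN class-1 proportion.
def classify(n_class1: int, k: int, cutoff: float = 0.5) -> int:
    p = n_class1 / k                      # estimated P(Y = 1) among the k neighbors
    return 1 if p > cutoff else 0

print(classify(n_class1=5, k=7, cutoff=0.5))   # majority rule -> 1
print(classify(n_class1=5, k=7, cutoff=0.9))   # 5/7 = 0.71 < 0.9 -> 0
```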
Regression Tree
- Used with continuous outcome variables. Many splits are attempted; choose the one that minimizes impurity.
- The prediction is the average of the numerical target variable in the rectangle (leaf).
- Impurity is measured by the sum of squared deviations from the leaf mean.
- Performance is measured by RMSE.
- A regression tree is used for prediction. Compared to a classification tree, we only have to replace the impurity measure with the sum of squared deviations; everything else stays the same.
- Splitting on irrelevant variables gives a bad impurity score, so only split on relevant variables.
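A minimal regression-tree sketch using scikit-learn's DecisionTreeRegressor (an assumption; the course itself uses XLMiner). Predictions are leaf means and performance is reported as RMSE, as described above.

```python
# Regression tree: leaf prediction = mean of the target in that rectangle; performance = RMSE.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))                   # illustrative predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)   # illustrative continuous outcome

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

tree = DecisionTreeRegressor(max_depth=4).fit(X_tr, y_tr)     # splits minimize squared deviations
rmse = np.sqrt(np.mean((tree.predict(X_va) - y_va) ** 2))     # validation RMSE
print(round(rmse, 3))
```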
General Info
- Makes no assumptions about the data
- A record gets classified as whatever the predominant class is among its nearby records.
- The k nearest neighbors in kNN are found using Euclidean distance.
- Rescaling: only kNN needs rescaling, because each variable's scale determines how much it contributes to the distance. Logistic regression does not need it, since rescaling does not change the p-values or RMSE; CART does not need it, since rescaling does not change the order of values within a variable.
- XLMiner can only handle up to K = 10.
Chapter 9
Properties of CART (Classification And Regression Trees) (3Q)
- Model free
- Automatic variable selection
- Needs a large sample size (because it is model free)
- Only gives horizontal or vertical splits
- Both methods of CART (classification trees and regression trees) are model free
Best pruned tree: the natural end of the splitting process is 100% purity in each leaf, which ends up fitting noise in the data (overfitting). Because the minimum error tree is still slightly overfitted, we prune a bit further, using the minimum error tree as the reference.
Minimum error tree: the tree with the lowest validation error.
Full tree: the largest tree; its training error equals zero, so it is overfitted.
Note: the full tree can be the same as the minimum error tree, BUT the best pruned tree should usually be smaller than the other trees.
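The tree names above are XLMiner terminology; scikit-learn does not report a "best pruned tree" directly, but cost-complexity pruning (a different pruning technique, used here only as a stand-in) gives a comparable full-vs-pruned comparison. A hedged sketch on synthetic data:

```python
# Full tree vs. a pruned tree chosen by validation error (cost-complexity pruning,
# used as a stand-in for XLMiner's minimum-error / best-pruned trees).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # full tree: training error ~ 0
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

best = min(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
    key=lambda t: 1 - t.score(X_va, y_va),                      # lowest validation error
)
print(full.get_n_leaves(), best.get_n_leaves())                 # pruned tree is usually smaller
```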
Impurity Score
- Metric to determine the homogeneity of the resulting subgroups of observations
- For both measures, the lower the better.
- Neither measure has an advantage over the other.
- Gini index: ranges from 0 to 0.5 for a binary outcome.
- Entropy measure: ranges from 0 to log2(2) = 1 if binary, or from 0 to log2(m) in general, where m is the total number of classes of Y.
- Overall impurity measure: weighted average of the impurity of the individual rectangles, with weights equal to the proportion of cases in each rectangle.
- Choose the split that reduces impurity the most (split points become nodes of the tree).
Check notes for that distance = to weighted average ratio
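A small sketch of the two impurity measures and the weighted-average (overall) impurity for one candidate split; the class counts are made up for illustration.

```python
# Gini and entropy impurity for one node, plus the weighted average across rectangles.
import numpy as np

def gini(counts):
    p = np.asarray(counts) / sum(counts)
    return 1 - np.sum(p ** 2)                 # 0 (pure) up to 0.5 for a binary outcome

def entropy(counts):
    p = np.asarray(counts) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # 0 (pure) up to log2(m) for m classes

# Illustrative split: left rectangle has 8 class-1 and 2 class-0 records, right has 3 and 7.
left, right = [8, 2], [3, 7]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right

overall_gini = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(round(gini(left), 3), round(gini(right), 3), round(overall_gini, 3))
```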
Dimensional Predictors Q's
Continuous partitions: (n - 1) x p, for p-dimensional predictors (works with more than 2 predictor dimensions).
Categorical partitions: splitting levels such as {a, b, c, d} into two groups gives 2^(m-1) - 1 partitions (3 levels -> 3 partitions, 4 levels -> 7 partitions).
XLMiner only supports binary categorical variables.
When to Stop Partitioning
- Plot the error rate as a function of the number of splits for training vs. validation data; the divergence between the two indicates overfitting.
- We can keep partitioning until a FULL tree is obtained. A full tree is usually overfitted, so we have to impose an EARLY STOP.
- The training error rate approaches 0 as you partition further, but you must stop early before it reaches 0. Early stop (minimum error tree or best pruned tree), OR stop based on chi-square tests:
- If the improvement from the additional split is statistically significant, continue; if not, STOP.
- Largest to smallest: full tree > minimum error tree > best pruned tree (chosen within one standard error of the minimum error tree, so usually smaller). Keep in mind: the full tree CAN BE THE SAME as the minimum error tree.
Error Rate as you continue Splitting
Training error decreases as you split more
Validation error decreases and then increases with the tree size
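A sketch of that overfitting pattern on synthetic data (scikit-learn assumed): training error keeps falling as the tree is allowed more leaves, while validation error falls and then rises.

```python
# Training vs. validation error as the tree is allowed more leaves.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(0, 1.5, 500) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

for leaves in (2, 4, 8, 16, 32, 64, 128):
    t = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    print(leaves, round(1 - t.score(X_tr, y_tr), 3), round(1 - t.score(X_va, y_va), 3))
```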
|
|
Chapter 10
Assumptions for Logistic Regression (1Q)
- Generalized linearity
Logistic Regression Equations (2Q)
- NOT model free; it is based on the following equations:
- log odds = beta0 + beta1*X1 + ... + betaq*Xq
- log(p / (1 - p)) = beta0 + beta1*X1 + ... + betaq*Xq
- p = 1 / (1 + exp(-(beta0 + beta1*X1 + ... + betaq*Xq)))
- Example: a fitted probability could look like p = 1 / (1 + exp(0.6535 - 0.4535*X)), where you substitute in a value of X.
- Direct interpretation of beta1: per unit increase in X1, the log odds increase by beta1. "Log odds" is not intuitive on its own, so state it as: the log odds increase by beta1 (equivalently, the odds are multiplied by exp(beta1)).
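A sketch of the equivalent forms, plugging in the illustrative coefficients from the example above (beta0 = -0.6535 and beta1 = 0.4535 reproduce p = 1/(1 + exp(0.6535 - 0.4535*X))):

```python
# Log-odds -> odds -> probability, using the example coefficients beta0 = -0.6535, beta1 = 0.4535.
import numpy as np

beta0, beta1 = -0.6535, 0.4535

def probability(x):
    log_odds = beta0 + beta1 * x          # logit: beta0 + beta1*X1
    odds = np.exp(log_odds)               # odds = p / (1 - p)
    return odds / (1 + odds)              # p = odds / (1 + odds) = 1 / (1 + exp(-(beta0 + beta1*x)))

for x in (0, 1, 2, 5):
    print(x, round(probability(x), 3))
```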
Types of Regression:
- Logistic regression (binary outcome)
- Multiple linear regression (continuous outcome)
- Multinomial logistic regression (categorical outcome with 3 or more levels)
- Ordinal logistic regression (categorical outcome with 3 or more ordinal levels)
- Poisson regression (count outcome)
- Negative binomial regression (count outcome)
All of these regression models can be called generalized linear models (GLMs)!
- The three logistic regression equations above are equivalent to each other.
- The MLR equation is never exactly true when Y is binary, so multiple linear regression cannot be used for a binary Y.
- Instead, change Y into P (the probability that Y = 1), which is continuous; this also eliminates the error term, since the randomness is already built into the probability.
The Odds
- The odds ratio is the exponential form of beta: odds ratio = exp(beta1).
- Beta is the coefficient in your regression model.
- The odds: p / (1 - p), where p is the probability.
- e^(beta1) = 1 (i.e., beta1 = 0) means X1 has no effect on the odds.
- Logit: log(odds). It takes values from negative infinity to positive infinity and is the dependent variable.
- Probability: p = odds / (1 + odds).
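A quick sketch of the conversions on this card (probability ↔ odds ↔ logit) plus the odds ratio exp(beta1); the numbers are illustrative only.

```python
# Probability <-> odds <-> logit, and the odds ratio exp(beta1).
import math

p = 0.8
odds = p / (1 - p)                 # 4.0
logit = math.log(odds)             # log odds; can be anywhere in (-inf, +inf)
p_back = odds / (1 + odds)         # recovers 0.8

beta1 = 0.4535                     # illustrative coefficient
odds_ratio = math.exp(beta1)       # per unit increase in X1, the odds are multiplied by this
print(round(odds, 2), round(logit, 3), round(p_back, 2), round(odds_ratio, 3))
```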
Comparing 2 Models
First criterion: pick the model with the lowest validation error.
Second criterion: when the validation errors are comparable, pick the model with fewer variables.
E.g., suppose models 1 and 2 have validation errors of 26.2% and 26.3%, and their model sizes are 10 and ___, respectively. Which model is better?
- Start from the lowest validation error, but when the errors are too similar (26.2% vs. 26.3% is comparable), go by the LOWEST model size.
Performance Evaluation
(1) Partition the data into training and validation sets.
- Training set: used to grow the tree.
- Validation set: used to assess classification performance.
(2) More than 2 classes (m > 2):
- Same structure, except that the terminal nodes take one of the m class labels.
Example Problem
Check whether the predictor is binary or continuous.
- If binary: use the odds ratio DIRECTLY and state it as "____ times" the odds.
- If continuous: compute (odds ratio - 1), then convert it to a percentage change in the odds per unit increase.
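A hedged worked example of that interpretation rule; both coefficients below are invented for illustration, not from the course.

```python
# Interpreting odds ratios; both coefficients are hypothetical.
import math

beta_binary = 0.9            # hypothetical coefficient on a binary predictor
beta_continuous = 0.05       # hypothetical coefficient on a continuous predictor

or_binary = math.exp(beta_binary)          # ~2.46 -> "the odds are about 2.46 times as large"
or_continuous = math.exp(beta_continuous)  # ~1.051
pct = (or_continuous - 1) * 100            # -> "odds increase about 5.1% per unit increase"
print(round(or_binary, 2), round(pct, 1))
```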