Data Mining Cheat Sheet

Data Mining Steps

1. Data Cleaning	Removal of noise and inconsistent records
2. Data Integration	Combing multiple sources
3. Data Selection	Only data relevant for the task are retrieved from the database
4. Data Transformation	Converting data into a form more appropriate for mining
5. Data Mining	Application of intelligent methods to extract data patterns
6. Model Evaluation	Identification of truly interesting patterns representing knowledge
7. Knowledge Presentation	Visualization or other knowledge presentation techniques

Data mining could also be called Knowledge Discovery in Databases (see kdnuggets.com)

Types of Attributes

Nomial	e.g., ID numbers, eye color, zip codes
Ordinal	e.g., rankings, grades, height
Interval	e.g., calendar dates, temperatures
Ratio	e.g., length, time, counts

Distance Measures

Manhattan = City Block

Jaccard coefficient, Hamming, Cosine are a similarity / dissimilarity measures

Measures of Node Impurity

Model Evaluation

Kappa = (observed agreement - chance agreement) / (1- chance agreement)

Kappa = (Dreal – Drandom) / (Dperfect – Drandom), where D indicates the sum of values in diagonal of the confusion matrix

K-Nearest Neighbor

* Compute distance between two points

* Determine the class from nearest neighbor list
    * Take the majority vote of class labels 
      among the k-nearest neighbors

    * Weigh the vote according to distance
        * weight factor, w = 1 / d^2

Rule-based Classification

Classify records by using a collection of
“if…then…” rules

Rule: (Condition) --> y

where:
* Condition is a conjunction of attributes
* y is the class label

LHS: rule antecedent or condition
RHS: rule consequent

Examples of classification rules:
(Blood Type=Warm) ^ (Lay Eggs=Yes) --> Birds
(Taxable Income < 50K) ^ (Refund=Yes) --> Evade=No

Sequential covering is a rule-based classifier.

Rule Evaluation

Bayesian Classification

p(a,b) is the probability that both a and b happen.

p(a|b) is the probability that a happens, knowing that b has already happened.

Terms

Association Analysis	Min-Apriori, LIFT, Simpson's Paradox, Anti-monotone property
Ensemble Methods	Staking, Random Forest
Decision Trees	C4.5, Pessimistic estimate, Occam's Razor, Hunt's Algorithm
Model Evaluation	Cross-validation, Bootstrap, Leave-one out (C-V), Misclassification error rate, Repeated holdout, Stratification
Bayes	Probabilistic classifier
Data Visualization	Chernoff faces, Data cube, Percentile plots, Parallel coordinates
Nonlinear Dimensionality Reduction	Principal components, ISOMAP, Multidimensional scaling

Ensemble Techniques

Manipulate training data: bagging and boosting ensemble of “experts”, each specializing on different portions of the instance space

Manipulate output values: error-correcting output coding (ensemble of “experts”, each predicting 1 bit of the {multibit} full class label)

Methods: BAGGing, Boosting, AdaBoost

Rules Analysis

Apriori Algorithm

Let k=1

Generate frequent itemsets of length 1

Repeat until no new frequent itemsets are identified

    Generate length (k+1) candidate itemsets from 
    length k frequent itemsets

    Prune candidate itemsets containing subsets 
    of length k that are infrequent

    Count the support of each candidate by 
    scanning the DB

    Eliminate candidates that are infrequent, 
    leaving only those that are frequent

K-means Clustering

Select K points as the initial centroids

repeat
    Form K Clusters by assigning all points to the closest centroid

    Recompute the centroid of each cluster

until the centroids don't change

Closeness is measured by distance (e.g., Euclidean), similarity (e.g., Cosine), correlation.

Centroid is typically the mean of the points in the cluster

Hierarchical Clustering

Single-Link or MIN
Similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters

Determined by one pair of points, i.e., by one link in the proximity graph.

Complete or MAX
Similarity of two clusters is based on the two least similar (most distant, maximum) points in the different clusters

Determined by all pairs of points in the two clusters

Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Agglomerative clustering starts with points as individual clusters and merges closest clusters until only one cluster left.

Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster only has one point.

Dendrogram Example

Dataset: {7, 10, 20, 28, 35}

Density-Based Clustering

current_cluster_label <-- 1

for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label +1
        Label the current core point with the cluster label
    end if
    for all points in the Eps-neighborhood, except i-th the point itself do
        if the point does not have a cluster label then
            Label the point with cluster label
        end if
    end for
end for

DBSCAN is a popular algorithm

Density = number of points within a specified radius (Eps)

A point is a core point if it has more than a specified number of points (MinPts) within Eps

These are points that are at the interior of a cluster

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

A noise point is any point that is not a core point or a border point

Other Clustering Methods

Fuzzy is a partitional clustering method. Fuzzy clustering (also referred to as soft clustering) is a form of clustering in that each data point can belong to more than one cluster.

Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon

Model-based methods: Expectation-Maximization

Regression Analysis

* Linear Regression
| Least squares
* Subset selection
* Stepwise selection
* Regularized regression
| Ridge
| Lasso
| Elastic Net

Anomaly Detection

Anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprise)

Types of Anomaly
Point: An individual data instance is anomalous w.r.t. the data
Contextual: An individual data instance is anomalous within a context
Collective: A collection of related data instances is anomalous

Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
| Parametric Techniques
| Non-parametric Techniques
* Distance (e.g., nearest-neighbor, density, clustering)

Local outlier factor (LOF) is a density-based distance approach

Mahalanobis Distance is a clustering-based distance approach

Created By

HockeyPlay21

Metadata

Languages: English

Published: 30th April, 2017
Rated: 5 out of 5 stars based on 5 ratings

Favourited By

Comments

No comments yet. Add yours below!

Add a Comment

Related Cheat Sheets

CS 2510 Cheat Sheet

A-Level Computing Key - Terms & Concepts Cheat Sheet

Python - Introduction Cheat Sheet

Data Mining Cheat Sheet by HockeyPlay21

Data Mining Steps

Types of Attributes

Distance Measures

Measures of Node Impurity

Model Evaluation

K-Nearest Neighbor

Rule-based Classification

Rule Evaluation

Bayesian Classification

Terms

Ensemble Techniques

Rules Analysis

Apriori Algorithm

K-means Clustering

Hierarchical Clustering

Dendrogram Example

Density-Based Clustering

Other Clustering Methods

Regression Analysis

Anomaly Detection

Created By

Metadata

Favourited By

Comments

Add a Comment

Related Cheat Sheets

Latest Cheat Sheet

Random Cheat Sheet

About Cheatography

Behind the Scenes

Recent Cheat Sheet Activity

Please Disable Your Ad Blocker

Data Mining Cheat Sheet by HockeyPlay21

Data Mining Steps

Types of Attributes

Distance Measures

Measures of Node Impurity

Model Evaluation

K-Nearest Neighbor

Rule-based Classi­fic­ation

Rule Evaluation

Bayesian Classi­fic­ation

Terms

Ensemble Techniques

Rules Analysis

Apriori Algorithm

K-means Clustering

Hierar­chical Clustering

Dendrogram Example

Densit­y-Based Clustering

Other Clustering Methods

Regression Analysis

Anomaly Detection

Created By

Metadata

Favourited By

Comments

Add a Comment

Related Cheat Sheets

Latest Cheat Sheet

Random Cheat Sheet

About Cheatography

Behind the Scenes

Recent Cheat Sheet Activity

Please Disable Your Ad Blocker

Rule-based Classification

Bayesian Classification

Hierarchical Clustering

Density-Based Clustering