Translate_Stats_ML Cheat Sheet by

Translating Between Statistics and Machine Learning


Statistics
Machine learning
Notes
data point, record, row of data
example, instance
Both domains also use "observation," which can refer to a single measurement or an entire vector of attributes depending on context.
response variable, dependent variable
label, output
Both domains also use "target." Since practically all variables depend on other variables, the term "dependent variable" is potentially misleading.
variable, covariate, predictor, independent variable
feature, side information, input
The term "independent variable" exists for historical reasons but is usually misleading: such a variable typically depends on other variables in the model.
supervised learners, machines
Both estimate output(s) in terms of input(s).
Both translate data into quantitative claims, becoming more accurate as the supply of relevant data increases.
hypothesis ≠ classifier
In both statistics and ML, a hypothesis is a scientific statement to be scrutinized, such as "The true value of this parameter is zero."

In ML (but not in statistics), a hypothesis can also refer to the prediction rule that is output by a classifier algorithm.
bias ≠ regression intercept
Statistics distinguishes between
(a) bias as a form of estimation error and
(b) the default prediction of a linear model in the special case where all inputs are 0.
ML sometimes uses "bias" to refer to both of these concepts, although the best ML researchers certainly understand the difference.
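The two meanings can be told apart with a small simulation. This is only a sketch: the population variance, sample size, and toy line are made-up values, and `fit_line` is a hypothetical helper. Part (a) measures the estimation bias of the divide-by-n variance estimator; part (b) recovers the intercept of a fitted line, which is what ML often calls the "bias term."

```python
import random
import statistics

random.seed(0)

# (a) Bias as estimation error: the divide-by-n variance estimator
# underestimates the true variance on average.
TRUE_VAR = 4.0          # assumed population variance (standard deviation 2)
n, trials = 5, 20000    # assumed sample size and number of repetitions
avg_pvar = 0.0
avg_svar = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    avg_pvar += statistics.pvariance(sample) / trials  # divides by n
    avg_svar += statistics.variance(sample) / trials   # divides by n - 1

# Theory says this should be close to -TRUE_VAR / n.
estimator_bias = avg_pvar - TRUE_VAR


# (b) "Bias" as intercept: the prediction when every input is 0.
def fit_line(xs, ys):
    """Ordinary least squares for one input; returns (slope, intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x
slope, intercept = fit_line(xs, ys)

print(f"estimation bias of divide-by-n variance: {estimator_bias:.3f}")
print(f"intercept of fitted line: {intercept:.3f}")
```

The first number describes an estimator's systematic error; the second is a model parameter. Nothing about a nonzero intercept implies that the fit is biased in sense (a).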
Maximize the likelihood to estimate model parameters
If your target distribution is discrete (such as in logistic regression), minimize the cross-entropy to derive the best parameters.

If your target distribution is continuous, fine, just maximize the likelihood.
For discrete distributions, maximizing the likelihood is equivalent to minimizing the cross-entropy between the observed labels and the model's predicted probabilities.
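The equivalence is easy to check numerically. In the sketch below, the labels and predicted probabilities are made-up toy values. The quantity usually minimized in ML, the mean cross-entropy, is exactly the negative log-likelihood divided by the number of examples, so both criteria rank candidate models identically.

```python
import math

# Hypothetical toy data: binary labels and a model's predicted P(y = 1).
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.7, 0.6, 0.4]


def log_likelihood(labels, probs):
    """Bernoulli log-likelihood of the labels under the predictions."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for y, p in zip(labels, probs))


def mean_cross_entropy(labels, probs):
    """Average cross-entropy between labels and predictions."""
    total = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                 for y, p in zip(labels, probs))
    return total / len(labels)


ll = log_likelihood(labels, probs)
ce = mean_cross_entropy(labels, probs)

# A less confident model: lower likelihood, higher cross-entropy.
worse_probs = [0.6, 0.4, 0.6, 0.5, 0.5]
ll_worse = log_likelihood(labels, worse_probs)
ce_worse = mean_cross_entropy(labels, worse_probs)

print(ce, -ll / len(labels))  # the two values agree
```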
Apply Occam's razor, or encode missing prior information with suitably uninformative priors
The principle of maximum entropy is conceptual and does not refer to maximizing a concrete objective function. The principle is that models should be conservative in the sense that they be no more confident in the predictions than is thoroughly justified by the data. In practice this works out as deriving an estimation procedure in terms of a bare-minimum set of criteria as exemplified here or here.
logistic/multinomial regression
maximum entropy, MaxEnt
They are equivalent except in special multinomial settings like ordinal logistic regression. Note that maximum entropy here refers to the principle of maximum entropy, not the form of the objective function. Indeed, in MaxEnt, you minimize rather than maximize the cross-entropy expression.
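One way to see the connection is through moment matching. A MaxEnt model is constrained so that its expected feature values equal the empirical ones, and the maximum-likelihood logistic regression fit satisfies the same conditions. The sketch below fits a one-feature logistic regression by plain gradient ascent on made-up, non-separable data; the step size and iteration count are assumptions. It then compares the two moments.

```python
import math

# Hypothetical, non-separable toy data: one feature x and a binary label y.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Maximize the log-likelihood by gradient ascent.
# The gradient is sum((y - p) * x) for w and sum(y - p) for b.
w, b = 0.0, 0.0
for _ in range(20000):
    residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
    w += 0.1 * sum(r * x for r, x in zip(residuals, xs))
    b += 0.1 * sum(residuals)

probs = [sigmoid(w * x + b) for x in xs]

# MaxEnt-style constraints: model expectations equal empirical totals.
empirical_moment = sum(x * y for x, y in zip(xs, ys))
model_moment = sum(x * p for x, p in zip(xs, probs))
empirical_count = sum(ys)
model_count = sum(probs)

print(empirical_moment, model_moment)
print(empirical_count, model_count)
```

At the optimum the gradient is zero, which is exactly the statement that the fitted model reproduces the observed feature totals.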
X causes Y if surgical (or randomized controlled) manipulations in X are correlated with changes in Y
X causes Y if it doesn't obviously not cause Y. For example, X causes Y if X precedes Y in time (or is at least contemporaneous)
The stats definition is more aligned with common-sense intuition than the ML one proposed here. In fairness, not all ML practitioners are so abusive of causation terminology, and some of the blame belongs with even earlier abuses such as Granger causality.
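A simulation shows why the manipulation-based definition matters. In this sketch, every variable and coefficient is a made-up assumption: a hidden variable Z drives both X and Y, while X has no effect on Y at all. Observed X still correlates with Y, but the correlation vanishes once X is assigned at random.

```python
import math
import random

random.seed(1)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


n = 20000
z = [random.gauss(0, 1) for _ in range(n)]   # hidden common cause
y = [zi + random.gauss(0, 1) for zi in z]    # Y depends on Z only, never on X

# Observational setting: X is also driven by Z.
x_observed = [zi + random.gauss(0, 1) for zi in z]

# Randomized setting: X is assigned independently of everything else.
# Y is unchanged because, by construction, X does not affect it.
x_randomized = [random.gauss(0, 1) for _ in range(n)]

observational_corr = pearson(x_observed, y)
randomized_corr = pearson(x_randomized, y)

print(f"observational correlation: {observational_corr:.3f}")
print(f"randomized correlation:    {randomized_corr:.3f}")
```

Under these assumptions the observational correlation is about 0.5 in theory, even though intervening on X would change nothing about Y.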
structural equations model
Bayesian network
These are nearly equivalent mathematically, although interpretations differ by use case, as discussed.
sequential experimental design
active learning, reinforcement learning, hyperparameter optimization
Although these four subfields are very different from each other in terms of their standard use cases, they all address problems of optimization via a sequence of queries/experiments.
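The shared pattern is a loop in which each new query is chosen using the results of earlier ones. As one illustrative instance (not a method the cheat sheet prescribes), here is a minimal epsilon-greedy two-armed bandit. The success rates, exploration rate, and number of rounds are all assumed values.

```python
import random

random.seed(0)

TRUE_RATES = [0.3, 0.7]   # hypothetical success rates, unknown to the learner
EPSILON = 0.1             # assumed fraction of rounds spent exploring

pulls = [0, 0]            # how many times each option was tried
estimates = [0.0, 0.0]    # running mean reward per option

for _ in range(2000):
    if random.random() < EPSILON:
        arm = random.randrange(2)                       # explore at random
    else:
        arm = 0 if estimates[0] >= estimates[1] else 1  # exploit best so far
    reward = 1.0 if random.random() < TRUE_RATES[arm] else 0.0
    pulls[arm] += 1
    # Incremental update of the running mean for the chosen option.
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

print("pulls per option:", pulls)
print("estimated rates:", [round(e, 2) for e in estimates])
```

The same skeleton applies when the "arm" is a treatment in a sequential trial, an unlabeled example to send for labeling, or a hyperparameter setting to evaluate next. Only the rule for choosing the next query changes.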

