Statistics I Cheat Sheet (DRAFT) by

This is a draft cheat sheet. It is a work in progress and is not finished yet.

exploratory data analysis

types of variables:
Categorical variables: nominal (no order, e.g. eye color) or ordinal (ordered, e.g. level of education)
Numerical variables: discrete or continuous
numerical summaries
quantile Q(p): value such that a proportion p of the data is smaller than Q(p) and a proportion 1-p is larger
first quartile Q1: p=0.25, median Q2: p=0.5, third quartile Q3: p=0.75
IQR is the interquartile range = Q3-Q1; it contains the middle 50% of the data
The rank of Q(p) in the sorted data is p(n-1)+1; if it is not an integer, interpolate between the two neighbouring values, weighted by the fractional part of the rank
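A minimal sketch of the rank rule above, in Python. The function name `quantile` and the data values are made up for illustration; ranks are assumed to start at 1 in the sorted data.

```python
import math

def quantile(data, p):
    """Q(p) using the rank p(n-1)+1 with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    rank = p * (n - 1) + 1      # 1-based rank in the sorted data
    lower = math.floor(rank)    # rank of the value just below
    weight = rank - lower       # fractional part = interpolation weight
    if lower >= n:              # p = 1: the rank is the last value
        return xs[-1]
    return xs[lower - 1] + weight * (xs[lower] - xs[lower - 1])

data = [7, 1, 3, 10, 5, 12]     # hypothetical values
q1, q2, q3 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For these six values the rank of Q1 is 0.25*5+1 = 2.25, so Q1 sits a quarter of the way between the 2nd and 3rd sorted values.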
measures of center
MODE: most frequent value
MEDIAN: Q(0.50)
MEAN: average, total/n
if the distribution is unimodal and symmetric: mean = median; if it is right-skewed: mode < median < mean
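A small illustration of the three measures of center with Python's standard `statistics` module. The data are invented and deliberately right-skewed (one large value), so the ordering mode < median < mean should show up.

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 14]     # made-up, right-skewed values

mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # Q(0.50)
mean = sum(data) / len(data)      # total / n

print(mode, median, mean)
```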
variance and sd
barplots (frequency or relative frequency (rf); bars in any order; specific categories, e.g. faculties),
contingency tables (2 or more categorical variables),
mosaic plot (graphical translation of a contingency table; if the tiles are aligned, the variables are independent),
frequency table (numerical variable; frequency f, rf = proportion, cumulative f, cumulative rf, density = rf / class width; classes are ordered),
histograms (graphical translation of a frequency table; bar area proportional to the class frequency, bar height = density; numerical variables; order needed; a class can be an interval, not a single precise value as in a barplot),
BOXPLOT (box spans the IQR with the median marked; lower and upper bounds LB, UB at 1.5*IQR beyond the box),
QQ-plot (compares two distributions, theoretical and empirical; if the points lie on the 45° line, same distribution)
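A sketch of the frequency table behind a histogram, assuming half-open classes [lo, hi) with the last class closed on the right. The function name, the class edges and the data are hypothetical.

```python
def frequency_table(data, edges):
    """One row per class: f, rf, cumulative f, cumulative rf, density."""
    n = len(data)
    rows = []
    cum_f = 0
    for lo, hi in zip(edges, edges[1:]):
        last = hi == edges[-1]  # the last class also includes its upper edge
        f = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        cum_f += f
        rf = f / n
        rows.append({
            "class": (lo, hi),
            "f": f,
            "rf": rf,
            "cum_f": cum_f,
            "cum_rf": cum_f / n,
            "density": rf / (hi - lo),  # rf divided by the class width
        })
    return rows

data = [1, 2, 2, 3, 5, 7, 8, 9, 10, 10]   # made-up values
table = frequency_table(data, edges=[0, 5, 10])
for row in table:
    print(row)
```

Because density = rf / width, the bar areas (density times width) add up to 1, which is why classes of unequal width stay comparable in a histogram.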

Statistical inference

Simpson's paradox
heterogeneous sources: divide the data into more homogeneous subgroups, e.g. by major, because pooling them can bias the overall proportion. Controlling for the confounding factor: men chose the programs that are easiest to enter, whereas women chose the ones that are more difficult to enter.
the solution is to use a weighted average of the admission rates
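An illustration with invented admission counts (not real data): two programs, one easy and one hard to enter. The weighted average below uses the same weights for both groups, namely each program's share of all applicants; that choice of weights is an assumption of this sketch.

```python
# program -> group -> (applied, admitted); all numbers are made up
applications = {
    "easy": {"men": (800, 480), "women": (100, 70)},
    "hard": {"men": (200, 40), "women": (900, 270)},
}

def overall_rate(group):
    """Pooled admission rate, ignoring the program."""
    applied = sum(prog[group][0] for prog in applications.values())
    admitted = sum(prog[group][1] for prog in applications.values())
    return admitted / applied

def weighted_rate(group):
    """Average of per-program rates, weighted by program size."""
    total = sum(sum(a for a, _ in prog.values()) for prog in applications.values())
    rate = 0.0
    for prog in applications.values():
        weight = sum(a for a, _ in prog.values()) / total
        applied, admitted = prog[group]
        rate += weight * admitted / applied
    return rate

for group in ("men", "women"):
    print(group, overall_rate(group), weighted_rate(group))
```

In this made-up example women have the higher admission rate inside each program, yet the lower pooled rate, because most of them applied to the hard program; the weighted average restores the within-program ordering.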
sampling the population
population: what we want to analyse; we want to find the population's parameters, which are true and fixed values but usually unknown
sample: what we have; a part of the population chosen at random. Quantities computed from it are random variables. It should be as large as possible to limit sampling error; a sample carries incomplete information. If the population is finite, sampling without replacement can affect the results
point estimation
estimators: an estimator is a quantity calculated from the sample. It tries to estimate the true parameter of the population. It is a random variable, whereas the parameter is fixed but unknown; the parameter is located only with a certain level of certainty: confidence intervals
to estimate a parameter and its uncertainty, e.g. μ: the larger the sample, the more precise the estimate, because the variance of the estimator decreases as n grows; for large n its distribution is concentrated around the true value
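A simulation sketch of that last point, assuming a normal population with μ = 10 and σ = 2 (arbitrary values chosen for the example). It draws many samples of size n and measures how much the sample mean varies from sample to sample.

```python
import random
import statistics

def variance_of_sample_mean(n, repeats=1000, seed=0):
    """Empirical variance of the sample mean over many samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(10.0, 2.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.variance(means)

v10 = variance_of_sample_mean(10)
v100 = variance_of_sample_mean(100)
print(v10, v100)   # theory: sigma^2 / n, i.e. about 0.4 and 0.04
```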
central limit thm
when we sum n independent random variables from the same distribution, sum/n is a new variable that approximately follows a normal distribution when n is large; special case: a proportion (binomial)
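A quick check of this by simulation, assuming an exponential population with mean 1 and standard deviation 1 (a clearly non-normal choice made for this sketch). If the standardized means are close to normal, about 95% of them should fall within ±1.96.

```python
import math
import random

def standardized_means(n, repeats=2000, seed=1):
    """(mean - mu) / (sigma / sqrt(n)) for samples from Exponential(1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(repeats):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((mean - 1.0) / (1.0 / math.sqrt(n)))
    return out

z = standardized_means(n=50)
inside = sum(1 for v in z if abs(v) < 1.96) / len(z)
print(inside)   # expected to be close to 0.95
```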
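estimating variance: s^2 and s̃^2
if X follows a normal distribution, the sum of squared deviations from the sample mean, divided by σ^2, follows a chi-squared (χ²) distribution with n-1 degrees of freedom, the same n-1 that appears in variance estimation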
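A simulation sketch of the chi-squared statement, assuming normal data with σ = 2 and samples of size n = 6 (both arbitrary). A χ² variable with k degrees of freedom has mean k and variance 2k, so here the simulated values should average about 5 with a variance of about 10.

```python
import random
import statistics

def scaled_sum_of_squares(n, sigma=2.0, repeats=4000, seed=2):
    """sum((x - xbar)^2) / sigma^2 for many normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        values.append(sum((x - xbar) ** 2 for x in xs) / sigma ** 2)
    return values

vals = scaled_sum_of_squares(n=6)
print(statistics.fmean(vals), statistics.variance(vals))
```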
confidence intervals
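from the central limit theorem: C is the critical value such that, with probability (1-α), the interval built around the estimator contains the true parameter; the smaller α, the wider the interval. The coverage is not exactly 95/100 but close to it, since the normal distribution is only an approximation. If the data follow a normal distribution, use the Student distribution instead, which modifies the CI to be more precise,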
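A minimal sketch of a CI for a mean, assuming the usual form estimate ± C × s/√n with C taken from the standard normal distribution. The function name and the sample values are made up; for small normal samples the Student quantile (for example `scipy.stats.t.ppf`) would replace the normal one.

```python
import math
import statistics

def mean_confidence_interval(sample, alpha=0.05):
    """Normal-approximation CI for the mean at level 1 - alpha."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)                        # divides by n - 1
    c = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = c * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 5.2, 4.7, 5.5]  # made-up data
ci95 = mean_confidence_interval(sample, alpha=0.05)
ci99 = mean_confidence_interval(sample, alpha=0.01)
print(ci95, ci99)
```

The 99% interval comes out wider than the 95% one: a smaller α means a larger C.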
for proportions:
p̂ (the sample proportion) estimates the mean
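The notes do not give the interval formula here, so the following is only a sketch under an assumption: the common normal-approximation interval p̂ ± C × √(p̂(1-p̂)/n). The function name and the counts are hypothetical.

```python
import math
import statistics

def proportion_confidence_interval(successes, n, alpha=0.05):
    """Normal-approximation CI for a proportion (assumed form, see lead-in)."""
    p_hat = successes / n                               # sample proportion
    c = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    half_width = c * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_confidence_interval(40, 100)     # 40 successes out of 100
print(low, high)
```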
for median
for variance
for the difference of means
when 0 is not in the interval: significant difference
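A sketch of the "is 0 in the interval?" check for a difference of means. The interval formula is not given in the notes; this assumes two independent samples, the standard error √(s1²/n1 + s2²/n2) and a normal critical value. The names and data are invented.

```python
import math
import statistics

def diff_means_confidence_interval(a, b, alpha=0.05):
    """Normal-approximation CI for mean(a) - mean(b), independent samples."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    c = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return diff - c * se, diff + c * se

group_a = [12.1, 11.8, 12.6, 12.9, 12.3, 12.7, 12.0, 12.5]  # made-up data
group_b = [10.2, 10.9, 10.4, 11.1, 10.6, 10.8, 10.3, 10.7]
low, high = diff_means_confidence_interval(group_a, group_b)
significant = not (low <= 0 <= high)   # 0 outside the interval
print(low, high, significant)
```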
theory of estimation
depends on the situation; we can evaluate the quality of an estimator; a good one has no bias,
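Bias can be seen by simulation. This sketch compares two variance estimators on normal data with true variance 4 (an arbitrary choice): dividing by n (`statistics.pvariance`) and dividing by n-1 (`statistics.variance`). Averaged over many samples, an unbiased estimator should land on the true value.

```python
import random
import statistics

def average_variance_estimates(n, sigma=2.0, repeats=5000, seed=3):
    """Average of both variance estimators over many samples of size n."""
    rng = random.Random(seed)
    div_n, div_n_minus_1 = [], []
    for _ in range(repeats):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        div_n.append(statistics.pvariance(xs))          # divides by n
        div_n_minus_1.append(statistics.variance(xs))   # divides by n - 1
    return statistics.fmean(div_n), statistics.fmean(div_n_minus_1)

biased, unbiased = average_variance_estimates(n=5)
print(biased, unbiased)   # true variance is sigma**2 = 4
```

With n = 5 the divide-by-n version averages about (n-1)/n × 4 = 3.2, so it underestimates the true variance, while the n-1 version averages about 4.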