exploratory data analysis
types of variables: 
Categorical: nominal (no order, ex. eye color) or ordinal (ordered, ex. level of education) / Numerical: discrete and continuous variables
numerical summaries 
quantile: Q(p) is the value such that a proportion p of the data is smaller and 1-p is bigger. First quartile Q1: p=0.25, median Q2: p=0.5, third quartile Q3: p=0.75. IQR is the interquartile range = Q3-Q1, it contains the middle 50% of the data. Formula for the rank: p(n-1)+1; if not an integer, interpolate between the two surrounding values with weights
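The rank rule above can be sketched in Python (hypothetical data; the function name `quantile` is mine):

```python
# Quantile via the rank rule from the notes: rank = p*(n-1) + 1 (1-based);
# when the rank is not an integer, interpolate between the two neighbours.
def quantile(data, p):
    xs = sorted(data)
    r = p * (len(xs) - 1)            # same rule, written 0-based
    lo, frac = int(r), r - int(r)
    if frac == 0:
        return float(xs[lo])
    return xs[lo] * (1 - frac) + xs[lo + 1] * frac

data = [1, 3, 5, 7, 9, 11, 13, 15]
q1, q3 = quantile(data, 0.25), quantile(data, 0.75)   # 4.5 and 11.5
iqr = q3 - q1    # 7.0, spans the middle 50% of the data
```

This matches the default linear interpolation used by most software (e.g. `numpy.quantile`).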
measures of center 
MODE: most frequent value / MEDIAN: Q(0.50) / MEAN: average, total/n. If the distribution is unimodal and symmetric, mean = median; if right-skewed, mode < median < mean
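The ordering mode < median < mean for a right-skewed sample can be checked directly (hypothetical numbers):

```python
from statistics import fmean, median, mode

# A right-skewed sample: long tail of large values on the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]
m_mode = mode(right_skewed)       # 2, the most frequent value
m_median = median(right_skewed)   # 3.0, i.e. Q(0.50)
m_mean = fmean(right_skewed)      # 5.1, pulled up by the long right tail
# mode < median < mean, as stated for right-skewed distributions
```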
variance and sd 
Graphics 
- pies
- barplots (frequency or relative frequency, any order, specific categories, ex. faculties)
- contingency tables (2 or more categorical variables)
- mosaic plot (translation of a contingency table; if the cells are aligned, the variables are independent)
- frequency table (numerical variable: f, rf = proportion, cumulative f, cumulative rf, densities = rf/amplitude; order matters)
- histograms (translation of a frequency table; area proportional to class frequency, bar height = density; numerical variables, order needed; classes are intervals, not precise values as in a barplot)
- BOXPLOT (box = IQR, whiskers at 1.5*IQR; shows median, lower bound, upper bound)
- QQ-plot (compares two distributions, theoretical vs empirical; if the points lie on the 45° line, same distribution)
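The frequency-table columns listed above (f, rf, cumulative f, density = rf/amplitude) can be built like this, with hypothetical data and class bounds:

```python
# Frequency table for a numerical variable with classes of unequal width;
# the density column is the bar height in the matching histogram.
data = [1, 2, 2, 3, 5, 6, 7, 8, 12, 18]
classes = [(0, 4), (4, 8), (8, 20)]        # [lower, upper) intervals
n = len(data)
rows, cum_f = [], 0
for lo, hi in classes:
    f = sum(lo <= x < hi for x in data)    # class frequency
    cum_f += f                             # cumulative frequency
    rf = f / n                             # relative frequency (proportion)
    density = rf / (hi - lo)               # rf / amplitude (class width)
    rows.append((lo, hi, f, rf, cum_f, density))
```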


Statistical inference
Simpson's paradox
heterogeneous sources: divide into more homogeneous subgroups, ex. by major, because mixing them can bias the proportions. Controlling for the confounding factor: men applied to the program that is easiest to enter whereas women applied to the more difficult one; the solution is to use a weighted average of the admission rates
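A small numerical illustration of the paradox and of the weighted-average fix (all counts and weights are made up):

```python
# Hypothetical admission numbers (applicants, admitted) per program:
men   = {"easy": (80, 48), "hard": (20, 4)}    # per-program rates 0.60, 0.20
women = {"easy": (20, 14), "hard": (80, 24)}   # per-program rates 0.70, 0.30

def overall_rate(groups):
    applicants = sum(a for a, _ in groups.values())
    admitted = sum(x for _, x in groups.values())
    return admitted / applicants

# Women have the higher rate in EACH program, yet the lower rate overall,
# because most women applied to the hard program (the confounder):
raw_men, raw_women = overall_rate(men), overall_rate(women)   # 0.52 vs 0.38

# Fix: weighted average of the per-program rates with COMMON weights.
weights = {"easy": 0.5, "hard": 0.5}           # hypothetical common weights
adj_men = sum(w * men[k][1] / men[k][0] for k, w in weights.items())
adj_women = sum(w * women[k][1] / women[k][0] for k, w in weights.items())
# adj_men = 0.40, adj_women = 0.50: the adjusted comparison no longer reverses
```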
sampling the population 
population: what we want to analyse; we want to find the population's parameters, which are true and fixed values but usually unknown. sample: what we have, a piece of the population chosen randomly; statistics computed from it are random variables. The sample should be as large as possible to limit bias; a sample carries incomplete information; for a finite population, sampling without replacement can affect the results
point estimation 
estimators: an estimator is a quantity calculated from the sample; it tries to estimate the true parameter of the population. The estimator is a random variable whereas the parameter is fixed but unknown; its uncertainty is quantified with confidence intervals
Estimator 
to estimate a parameter and its uncertainty, ex. μ: the larger the sample, the more precise the estimate, because the variance of the estimator decreases with n; for large n the distribution is concentrated around the true value
central limit thm 
when we sum random variables from the same distribution, sum/n is a new random variable that approximately follows a normal distribution when n is large; special case for a proportion (binomial)
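A quick simulation of the CLT (hypothetical sizes; uniform draws are deliberately non-normal):

```python
import random
from statistics import fmean, stdev

random.seed(0)
n, reps = 200, 2000
# Each entry is the mean of n Uniform(0,1) draws; by the CLT these means are
# approximately Normal(0.5, sqrt(1/(12*n))) even though each draw is not normal.
means = [fmean(random.random() for _ in range(n)) for _ in range(reps)]
# The sampling distribution of the mean concentrates around the true value 0.5.
```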
estimating the variance: s² and σ̃²
if X follows a normal distribution, (n-1)s²/σ² follows a chi-squared (χ²) distribution with n-1 degrees of freedom; estimating the standard deviation is similar to the variance estimation
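The two variance estimators behind the notation above can be sketched as follows (hypothetical data; `s2` divides by n-1, `s2_tilde` by n):

```python
# Two classical variance estimators: s^2 divides by n-1 (unbiased),
# sigma-tilde^2 divides by n (biased downward but with smaller variance).
def s2(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def s2_tilde(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, sum of sq. deviations 32
# s2(xs) = 32/7 ~ 4.571, s2_tilde(xs) = 32/8 = 4.0
```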
confidence intervals 
from the central limit theorem: c is the value such that with probability (1-α) the interval contains the true parameter; smaller α, bigger interval. The coverage is not exactly 95 out of 100 but around that value. If the data follow a normal distribution, use the Student distribution instead, so the modified CI is more precise
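A minimal sketch of the large-sample CI for a mean, using the normal quantile for c (the notes say to swap in a Student-t quantile for small normal samples; data and function name are hypothetical):

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

# (1 - alpha) CI for the mean: x_bar +/- c * s / sqrt(n),
# with c the normal quantile (large-sample approximation).
def ci_mean(xs, level=0.95):
    n = len(xs)
    c = NormalDist().inv_cdf(0.5 + level / 2)   # ~1.96 for 95%
    half = c * stdev(xs) / sqrt(n)
    m = fmean(xs)
    return (m - half, m + half)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # sample mean 5.0
lo, hi = ci_mean(xs)
```

Note how a smaller α (higher level) widens the interval, as stated above.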
confidence intervals exist:
- for proportions: p̂, the sample proportion, estimates the mean
- for the median
- for the variance
- for the difference of means: when 0 is not in the interval, the difference is significant
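The "0 not in the interval" check can be sketched for the difference of two means (hypothetical samples; large-sample normal quantile, per the CLT section):

```python
from math import sqrt
from statistics import NormalDist, fmean, variance

# Large-sample (1 - alpha) CI for a difference of means:
# (x_bar - y_bar) +/- c * sqrt(s_x^2/n_x + s_y^2/n_y)
def ci_diff_means(xs, ys, level=0.95):
    c = NormalDist().inv_cdf(0.5 + level / 2)
    d = fmean(xs) - fmean(ys)
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (d - c * se, d + c * se)

a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [4.2, 4.0, 4.4, 4.1, 4.3, 3.9, 4.2, 4.1]
lo, hi = ci_diff_means(a, b)
# 0 is not in the interval -> significant difference at the 5% level
```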
theory of estimation
the choice of estimator depends on the situation; we can evaluate the quality of an estimator, a good one has no bias
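Bias can be seen by simulation: averaging an estimator over many samples shows whether it hits the true parameter. A sketch using the two variance estimators (divide by n vs n-1) on Uniform(0,1) data, whose true variance is 1/12:

```python
import random

random.seed(1)

def var_n(xs):        # divides by n: biased downward
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_n1(xs):       # divides by n-1: unbiased
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# True variance of Uniform(0,1) is 1/12 ~ 0.0833.
n, reps = 5, 20000
samples = [[random.random() for _ in range(n)] for _ in range(reps)]
avg_biased = sum(var_n(s) for s in samples) / reps     # ~ (4/5) * 1/12
avg_unbiased = sum(var_n1(s) for s in samples) / reps  # ~ 1/12
```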
