
Statistical Thinking Python Cheat Sheet (DRAFT)

This is a draft cheat sheet. It is a work in progress and is not finished yet.

EDA

import seaborn as sns
Seaborn is used here to set the plotting style
sns.set()
Set default Seaborn style
The "square root rule" is a commonly used rule of thumb for choosing the number of bins: choose the number of bins to be the square root of the number of samples.
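As a minimal sketch of the square root rule, the snippet below bins a made-up sample with NumPy. The data set and the variable names (`data`, `n_bins`) are assumptions for illustration only; the same `n_bins` value could be passed to `plt.hist(data, bins=n_bins)`.

```python
import numpy as np

# Hypothetical sample: 400 draws from a standard normal distribution.
np.random.seed(42)
data = np.random.normal(0, 1, size=400)

# Square root rule: number of bins = sqrt(number of samples), as an integer.
n_bins = int(np.sqrt(len(data)))

# np.histogram counts the data in n_bins equal-width bins.
counts, bin_edges = np.histogram(data, bins=n_bins)
print(n_bins)  # 20, since sqrt(400) = 20
```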
Bee swarm plot
Draw a categorical scatterplot with non-overlapping points.
sns.swarmplot(x='colname1', y='colname2', data=df)
colname1 is the categorical column; colname2 holds the numeric values.
ECDF
Empirical cumulative distribution function. It is one of the important plots for understanding the data.
plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
Keeps data off plot edges
np.arange(3,7)
array([3, 4, 5, 6])
numpy.arange([start, ]stop, [step, ]dtype=None)
Return evenly spaced values within a given interval.
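The x and y values passed to `plt.plot` for an ECDF can be built with `np.sort` and `np.arange`. The helper name `ecdf` below is an assumption for illustration, not something defined in this sheet.

```python
import numpy as np

def ecdf(data):
    """Return x and y arrays for the ECDF of a one-dimensional data set."""
    n = len(data)
    # x: the data sorted from smallest to largest.
    x = np.sort(data)
    # y: fraction of data points less than or equal to each x (1/n, 2/n, ..., 1).
    y = np.arange(1, n + 1) / n
    return x, y

x, y = ecdf(np.array([3.0, 1.0, 2.0, 4.0]))
print(x)  # [1. 2. 3. 4.]
print(y)  # [0.25 0.5  0.75 1.  ]
```

The resulting arrays can then be drawn with plt.plot(x, y, marker='.', linestyle='none').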

numpy

np.percentile(arrayname, [2.5, 25])
Compute the 2.5th and 25th percentiles of the array arrayname
sns.boxplot(x='colname1', y='colname2', data=df)
Draw a box plot of colname2 for each category in colname1
np.var(arrayname)
Compute the variance of the NumPy array arrayname
np.std(arrayname)
Compute the standard deviation of the NumPy array arrayname
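A small worked example of these summary statistics, using a made-up array (`arrayname` here is just illustrative data). Note that np.var and np.std divide by n by default, which gives the population variance and standard deviation.

```python
import numpy as np

# Hypothetical data with mean 5.
arrayname = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# 25th, 50th, and 75th percentiles (linear interpolation by default).
percentiles = np.percentile(arrayname, [25, 50, 75])
print(percentiles)  # [4.  4.5 5.5]

# Variance and standard deviation, both normalized by n (ddof=0).
variance = np.var(arrayname)
std_dev = np.std(arrayname)
print(variance, std_dev)  # 4.0 2.0
```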
np.cov(x, y)
Returns a 2D array where entries [0,1] and [1,0] are the covariances. Entry [0,0] is the variance of the data in x, and entry [1,1] is the variance of the data in y. This 2D output array is called the covariance matrix, since it organizes the variances and the covariance.
np.corrcoef(x, y)
The Pearson correlation coefficient, also called the Pearson r, is often easier to interpret than the covariance. It is computed using the np.corrcoef() function. Like np.cov(), it takes two arrays as arguments and returns a 2D array. Entries [0,0] and [1,1] are necessarily equal to 1 (can you think about why?), and the value we are after is entry [0,1].
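The sketch below shows how to read both matrices, using made-up data where y is an exact linear function of x, so the Pearson r is 1. The diagonal of the correlation matrix is 1 because each variable is perfectly correlated with itself. One detail worth knowing: np.cov divides by n - 1 by default, whereas np.var divides by n.

```python
import numpy as np

# Hypothetical data: y is exactly 2 * x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

# Covariance matrix: variances on the diagonal, covariance off the diagonal.
covariance_matrix = np.cov(x, y)
cov_xy = covariance_matrix[0, 1]
print(covariance_matrix[0, 0], covariance_matrix[1, 1], cov_xy)  # 2.5 10.0 5.0

# Correlation matrix: the Pearson r is entry [0, 1].
pearson_r = np.corrcoef(x, y)[0, 1]
```

Here pearson_r equals 1 up to floating-point rounding.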

hypotheses

permutation sampling
Permutation sampling is a great way to simulate the hypothesis that two variables have identical probability distributions
np.concatenate((data1, data2))
Concatenate the data sets
np.random.permutation(data)
Permute the concatenated array
The p-value is generally a measure of:
the probability of observing a test statistic at least as extreme as the one you observed, assuming the hypothesis you are testing is true.
a permutation replicate
is a single value of a statistic computed from a permutation sample.
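Putting these pieces together, a permutation test might look like the sketch below. The helper names (`permutation_sample`, `diff_of_means`), the simulated data, and the choice of a two-sided p-value on the difference of means are all assumptions for illustration.

```python
import numpy as np

def permutation_sample(data1, data2):
    """Generate one permutation sample from two data sets."""
    # Concatenate the data sets, then scramble the combined array.
    data = np.concatenate((data1, data2))
    permuted = np.random.permutation(data)
    # Split back into two arrays with the original lengths.
    return permuted[:len(data1)], permuted[len(data1):]

def diff_of_means(a, b):
    """Test statistic: difference of the two sample means."""
    return np.mean(a) - np.mean(b)

# Hypothetical data sets.
np.random.seed(42)
data1 = np.random.normal(10.0, 1.0, size=50)
data2 = np.random.normal(10.5, 1.0, size=50)
observed = diff_of_means(data1, data2)

# Each permutation replicate is the statistic computed on one permutation sample.
n_reps = 2000
perm_replicates = np.empty(n_reps)
for i in range(n_reps):
    perm1, perm2 = permutation_sample(data1, data2)
    perm_replicates[i] = diff_of_means(perm1, perm2)

# Two-sided p-value: fraction of replicates at least as extreme as the observed value.
p_value = np.mean(np.abs(perm_replicates) >= np.abs(observed))
```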
 

probabilistic logic

Statistical inference involves taking your data to probabilistic conclusions about what you would expect if you took even more data, and you can make decisions based on these conclusions.
np.random.random()
Returns a random number drawn uniformly from the half-open interval [0, 1)
np.random.seed(42)
Seed the random number generator
random_numbers = np.empty(100000)
Allocate an uninitialized array, random_numbers, of 100,000 entries. Its contents are arbitrary until you assign them.
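The three calls above are typically combined as in this sketch: seed the generator, allocate the array, then fill it one draw at a time. Treating values below 0.5 as "heads" is an assumption added here to show one use of the result.

```python
import numpy as np

# Seed the generator so the run is reproducible.
np.random.seed(42)

# Allocate the array; its contents are meaningless until assigned.
random_numbers = np.empty(100000)

# Fill the array with uniform random numbers from [0, 1).
for i in range(len(random_numbers)):
    random_numbers[i] = np.random.random()

# Fraction of draws below 0.5, simulating a fair coin flip.
fraction_heads = np.mean(random_numbers < 0.5)
```

In practice, np.random.random(size=100000) produces the same kind of array without the explicit loop.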
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)
Draw 10,000 samples from a binomial distribution with n=100 trials and success probability p=0.05, storing them in n_defaults
np.random.poisson(10, size=10000)
Draw 10,000 samples from a Poisson distribution with a mean of 10
np.random.normal(20, 1, size=100000)
Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1
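One quick sanity check on such samples is to compare their means with the theoretical values: n * p = 5 for the binomial, 10 for the Poisson, and 20 for the Normal. The variable names other than `n_defaults` are assumptions for illustration.

```python
import numpy as np

np.random.seed(42)

# Samples from the three distributions listed above.
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)
samples_poisson = np.random.poisson(10, size=10000)
samples_normal = np.random.normal(20, 1, size=100000)

# Sample means should land close to the theoretical means 5, 10, and 20.
print(np.mean(n_defaults), np.mean(samples_poisson), np.mean(samples_normal))

# The sample standard deviation of the Normal draws should be close to 1.
print(np.std(samples_normal))
```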
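plt.hist(array, bins=100, density=True, histtype='step')
density=True normalizes the histogram so that its total area is 1 (the older normed=True keyword is no longer accepted by recent Matplotlib releases). histtype='step' draws the histogram as an unfilled outline; it does not smooth the data.
plt.ylim(a, b)
Limit the y-axis to the range from a to b
np.random.exponential(mean, size=size)
Draw size samples from an exponential distribution whose mean is the first (scale) argument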
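slope, intercept = np.polyfit(x, y, 1)
Find the slope and intercept of the least-squares line through the points (x, y). The third argument is the degree of the fitted polynomial; degree 1 returns exactly two coefficients, the slope and the intercept.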
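A minimal sketch of a straight-line fit on made-up noisy data, assuming a true slope of 2 and intercept of 1. np.polyfit returns coefficients from the highest degree down, which is why the slope comes first.

```python
import numpy as np

# Hypothetical data: a line with slope 2 and intercept 1, plus small noise.
np.random.seed(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.normal(0, 0.1, size=len(x))

# Degree 1 fits a straight line; coefficients come back highest degree first.
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line at the two ends of the x range, e.g. for plotting.
x_line = np.array([x.min(), x.max()])
y_line = slope * x_line + intercept
```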
np.linspace(a, b, c)
Get c evenly spaced points from a to b, with both endpoints included
np.empty_like(variable)
Returns a new, uninitialized array with the same shape and type as the given array "variable"
Bootstrapping
The use of resampled data to perform statistical inference
If we have a data set with n repeated measurements, a bootstrap sample is an array of length n that was drawn from the original data with replacement.
np.random.choice(array, size=n)
Generate a bootstrap sample of size n from array (np.random.choice samples with replacement by default)
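A bootstrap replicate is a statistic computed on one bootstrap sample. The sketch below assumes a helper named `bootstrap_replicate_1d` and made-up data; neither comes from this sheet.

```python
import numpy as np

def bootstrap_replicate_1d(data, func):
    """Compute one bootstrap replicate of func on 1-D data."""
    # Resample with replacement, keeping the original sample size.
    bs_sample = np.random.choice(data, size=len(data))
    return func(bs_sample)

# Hypothetical measurements.
np.random.seed(42)
data = np.random.normal(20, 1, size=100)

# One bootstrap replicate of the mean.
bs_mean = bootstrap_replicate_1d(data, np.mean)
```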
Confidence interval of a statistic
If we repeated measurements over and over again, p% of the observed values of the statistic would lie within the p% confidence interval.
A confidence interval gives bounds on the range of parameter values you might expect to get if you repeated your measurements. For named distributions, you can compute them analytically or look them up, but one of the many beautiful properties of the bootstrap method is that you can just take percentiles of your bootstrap replicates to get your confidence interval. Conveniently, you can use the np.percentile() function.
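For instance, a 95% confidence interval for the mean can be sketched by drawing many bootstrap replicates and taking their 2.5th and 97.5th percentiles. The helper name `draw_bs_reps`, the data, and the number of replicates are assumptions for illustration.

```python
import numpy as np

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of func from 1-D data."""
    bs_replicates = np.empty(size)
    for i in range(size):
        # Each replicate is func applied to one resample drawn with replacement.
        bs_sample = np.random.choice(data, size=len(data))
        bs_replicates[i] = func(bs_sample)
    return bs_replicates

# Hypothetical measurements.
np.random.seed(42)
data = np.random.normal(20, 1, size=100)

# Bootstrap replicates of the mean, then the 95% confidence interval.
bs_replicates = draw_bs_reps(data, np.mean, size=2000)
conf_int = np.percentile(bs_replicates, [2.5, 97.5])
```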
pairs bootstrap
Involves resampling pairs of data: each (x, y) pair is drawn as a unit, so every resampled x keeps its own y.
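One common use of the pairs bootstrap is to estimate the uncertainty of a linear regression. The sketch below resamples indices so that x and y stay paired, then refits with np.polyfit. The helper name `draw_bs_pairs_linreg` and the simulated data are assumptions for illustration.

```python
import numpy as np

def draw_bs_pairs_linreg(x, y, size=1):
    """Pairs bootstrap for the slope and intercept of a linear fit."""
    inds = np.arange(len(x))
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)
    for i in range(size):
        # Resample indices with replacement so each x keeps its own y.
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)
    return bs_slope_reps, bs_intercept_reps

# Hypothetical data: a line with slope 2 and intercept 1, plus noise.
np.random.seed(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.normal(0, 1, size=len(x))

# 95% confidence interval for the slope from the bootstrap replicates.
slope_reps, intercept_reps = draw_bs_pairs_linreg(x, y, size=1000)
slope_conf_int = np.percentile(slope_reps, [2.5, 97.5])
```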