Kaggle Data Science Cheat Sheet (DRAFT)

For Kaggle data science basics

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Creating, Reading, Writing

df = pd.DataFrame({"col0": [val0, val1], "col1": [val0, val1]}, index=[0, 1])
Create a dataframe
series = pd.Series(["val0", "val1", "val2"], index=[0, 1, 2], name="name")
Create a series (index needs exactly one label per val)
read = pd.read_csv("../folder/folder/file.csv", index_col=0)
Read a csv
save.to_csv("file.csv")
Save an existing dataframe as a csv
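
The entries above can be tried end to end without touching the disk. This is a minimal sketch with made-up sample data; it assumes to_csv() is called without a path (so it returns the CSV text) and reads that text back through io.StringIO instead of a real file.

```python
import io

import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({"col0": [1, 2], "col1": ["a", "b"]}, index=[0, 1])
series = pd.Series(["val0", "val1", "val2"], index=[0, 1, 2], name="name")

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv()

# index_col=0 makes read_csv use the first CSV column as the index
# instead of generating a fresh 0..n-1 index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Leaving out index_col=0 here would bring the saved index back as an extra unnamed column.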

Indexing, Selecting, Assigning

table.head()
Show first 5 rows of a dataframe
table["col"]
Select the col from table
table.col.iloc[0]
Select 1st value of a col from table
table.iloc[0]
Select 1st row of data from table
table.col.iloc[:10]
Select 1st 10 values from col in table (position-based select; end is excluded)
table.col.loc[:10]
Select values from col up to and including label 10 (label-based select; 11 values w/ a default 0, 1, 2, ... index)
table.loc[indices, cols]
Select certain rows from certain cols
table[table.col == 'val']
Select rows where col has a certain val (conditional select)
table.col.isin(['val1', 'val2'])
Boolean mask of rows where col has one of certain vals; wrap in table[...] to select them (conditional select)
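
The difference between iloc and loc slicing, and the two conditional selects, are easiest to see on a tiny table. The sketch below uses a hypothetical 4-row table with a default integer index; the column names are placeholders.

```python
import pandas as pd

# Hypothetical table with the default-style integer index 0..3.
table = pd.DataFrame(
    {"col": ["a", "b", "a", "c"], "num": [10, 20, 30, 40]},
    index=[0, 1, 2, 3],
)

# iloc slices by position and excludes the end: positions 0 and 1.
first_two_by_position = table.col.iloc[:2]

# loc slices by label and includes the end: labels 0, 1 and 2.
up_to_label_two = table.col.loc[:2]

# A boolean mask inside table[...] keeps the rows where the mask is True.
rows_equal_a = table[table.col == "a"]
rows_in_set = table[table.col.isin(["a", "c"])]

# loc with a list of row labels and a list of column names.
subset = table.loc[[0, 3], ["num"]]
```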

Summary Functions & Maps

table.col.describe()
Get high-lvl summary of given col's attributes
table.col.mean()
Get mean of a col with numerical vals
table.col.unique()
Get each unique val of a col w/ no dupes
table.col.value_counts()
Get frequency of each val in col
table.col.map(lambda p: p - s)
Map each point val (p) of a Series through a transformation (p - s, with s defined beforehand) -> returns new Series
table.apply(func, axis='columns')
Transform entire df by calling a custom function (func, taking a row) on each row
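
As a rough illustration of map versus apply, the sketch below centers a numeric column around its mean both ways. The table, the column names and the helper center_points are all hypothetical.

```python
import pandas as pd

# Hypothetical numeric table.
table = pd.DataFrame({"points": [80, 90, 100], "price": [10.0, 20.0, 30.0]})

mean_points = table.points.mean()

# map() transforms one Series value by value and returns a new Series.
centered = table.points.map(lambda p: p - mean_points)


def center_points(row):
    """Return a copy of the row with its points value centered on the mean."""
    row = row.copy()
    row["points"] = row["points"] - mean_points
    return row


# apply() with axis="columns" calls the function once per row
# and assembles the returned rows into a new dataframe.
centered_table = table.apply(center_points, axis="columns")
```

Neither call modifies table in place; both return new objects.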
 

Grouping & Sorting

table.groupby('col').col.count()
Group rows w/ same vals in the given col -> count frequency of each val (like value_counts(), but ordered by group label)
table.groupby('col').size()
Same as above
table.groupby('col').apply(lambda df: df.title.iloc[0])
Get the name (title) of the 1st row in each group (assumes a title col)
table.col.idxmax()
Get index of max val in col
table.groupby(['col0']).col1.agg([f1, f2, f3])
agg() runs diff. funcs. simultaneously on each group -> one output col per func
table.groupby(['col0', 'col1']).col2.agg([len])
Multi-index output has tiered structure. Requires 2 levels of labels to retrieve a val
df.reset_index()
Multi-index method used to convert back to a regular index
df.sort_values(by='col')
Sort rows of data by vals in col (ascending)
df.sort_values(by='col', ascending=False)
Sort rows of data by vals in col (descending)
df.sort_values(by=['col0', 'col1'])
Sort rows by more than 1 col at a time
df.sort_index()
Sort rows by index (default order; ascending)
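
The grouping, multi-index and sorting entries can be chained as in the sketch below. The data is made up, and the aggregations are passed as string names ("min", "max", "count") rather than function objects, which is an assumption made here to keep the output column names predictable.

```python
import pandas as pd

# Hypothetical table with two grouping columns and one value column.
table = pd.DataFrame({
    "col0": ["x", "x", "y", "y"],
    "col1": ["a", "b", "a", "a"],
    "col2": [4, 1, 3, 2],
})

# Several aggregations at once: one output column per function.
stats = table.groupby("col0").col2.agg(["min", "max"])

# Grouping by two columns yields a two-level (multi-index) result.
counts = table.groupby(["col0", "col1"]).col2.agg(["count"])

# Both levels of labels are needed to retrieve a single value.
n_y_a = counts.loc[("y", "a"), "count"]

# reset_index() turns the index levels back into ordinary columns.
flat = counts.reset_index()

# Sort the flattened result by the count column, largest first.
ordered = flat.sort_values(by="count", ascending=False)
```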

Data Types & Missing Values

table.col.dtype
Get data type of a col
table.dtypes
Get data types of each col in table
table.col.astype('datatype')
Convert col to datatype if allowed (e.g., int64 -> float64)
table.index.dtype
Get data type of the index (number indices are int64)
table[pd.isnull(table.col)]
Select rows where col is NaN
table.col.fillna("filler")
Replace all NaN vals in a col with a sentinel val ("Unknown", "Undisclosed", "Invalid") or non-null val
table.col.replace("init_val", "new_val")
Replace, in col, every occurrence of init_val with new_val
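
A short sketch of the missing-value entries on a hypothetical table with one NaN; the sentinel "Unknown" and the replacement values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing entry in col.
table = pd.DataFrame({"col": ["a", np.nan, "b"], "num": [1, 2, 3]})

# Rows where col is NaN.
missing_rows = table[pd.isnull(table.col)]

# Fill NaN with a sentinel value; returns a new Series.
filled = table.col.fillna("Unknown")

# Replace every occurrence of one value with another.
replaced = filled.replace("a", "z")

# Convert an integer column to float64.
as_float = table.num.astype("float64")
```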

Renaming & Combining

table.rename(columns={'init': 'new'})
Rename cols (init -> new)
table.rename(index={0: 'firstEntry', 1: 'secondEntry'})
Rename index or col vals by specifying an index or columns param
table.rename_axis("name", axis='rows').rename_axis("name1", axis='columns')
Name the row index &/or col index
pd.concat([el0, el1, el2])
Smush together the list of elements (dfs or Series) along an axis
left.join(right, lsuffix='strL', rsuffix='strR')
Combine diff df objects that have an index in common; the suffixes tell apart cols that share a name. left and right are dfs defined beforehand
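
To see concat, join and the rename helpers together, here is a minimal sketch on two hypothetical dataframes that share an index and a column name. The suffixes "_L" and "_R" and the axis name "key" are illustrative choices, and axis="index" is used here as the documented spelling of the row axis.

```python
import pandas as pd

# Two hypothetical dataframes with the same index and an overlapping column name.
left = pd.DataFrame({"val": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"val": [3, 4]}, index=["a", "b"])

# concat takes a list of objects and stacks them along the rows by default.
stacked = pd.concat([left, right])

# join aligns on the shared index; the suffixes keep the two val columns apart.
joined = left.join(right, lsuffix="_L", rsuffix="_R")

# Rename one column, then give the row index a name.
renamed = joined.rename(columns={"val_L": "left_val"}).rename_axis("key", axis="index")
```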