Data_Analysis Cheat Sheet

Academic Integrity

Collaboration - must write own solutions. Group work - acknowledge contributions. Unacknowledged →collusion or plagiarism

Collusion - working together, identical/similar solutions

Plagiarism - using code, ideas, words, data without acknowledgement

Avoid plagiarism - could be unintentional, cite sources - URL, date accessed, code "adapted from"

Open Source - duplicate license in code

Contract Cheating - using third party

External Resources - used as learning support (stackoverflow). Solution repos (former students' work) disallowed

Python

Paradigms - OOP, functional, structured, procedural

Jupyter - Julia, Python, R (.ipynb)

Statement - instruction executed by the interpreter, e.g. print(), assignment

Expression - values/variables/operators, represents single result value, can be used on the right side of assignment statements

print() - displays the value of an expression rather than returning it, strings are shown without "quotes"

Interpreted (dynamic) language - code runs as the interpreter reads it, with no separate compile step

Dynamically Typed - type not declared, determined at runtime, not compile time

Strongly Typed - checks to ensure type safety, no implicit conversion between unrelated types ("1" + 1 raises TypeError)
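A minimal sketch of the two typing properties above. The variable names are made up for illustration.

```python
# Dynamically typed: a name has no declared type and can be rebound.
x = 5
first_type = type(x).__name__      # "int"
x = "five"
second_type = type(x).__name__     # "str"

# Strongly typed: unrelated types are not converted implicitly.
try:
    result = "1" + 1
except TypeError:
    result = "1" + str(1)          # an explicit conversion is required

print(first_type, second_type, result)
```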

Python2

If statement

if <expr>:
   statements
elif <expr>:
   statements
else:
   statements

Function args - keyword (unordered/usually optional, omit for default) or positional

Function definition:

def <name> (params):
   statements

Compound statement - header & body

Param - variables declared in functions, Args - values passed to functions

None type - represents no value. A function returns None with:

return None, a bare return, or no return statement (e.g. a body of only pass)
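The notes on arguments and None can be sketched as follows. The functions scale and log_only are hypothetical, invented only for this example.

```python
def scale(value, factor=2):
    """value is positional; factor has a default, so it may be omitted."""
    return value * factor


def log_only(message):
    """No return statement, so a call evaluates to None."""
    pass


doubled = scale(4)                    # factor omitted, default 2 is used
tripled = scale(4, factor=3)          # keyword argument
swapped = scale(factor=3, value=4)    # keyword arguments may come in any order
nothing = log_only("hi")

print(doubled, tripled, swapped, nothing)
```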

Traceback/Stack Trace - prints program, line, functions when an error occurs

Type Conversion

float() int() str() eval(<string>)

Python Strings

Sequence of Unicode characters (Python 3), no separate char type - immutable, ordered

Concat - +, Repeat - *

F-String

f"String {<py code/var>}"

String Format -

"String {one}, {two}".format(one=, two=)

Indexing - 0 based, []

Slicing -

[start:end:step], [inclusive, exclusive]

Omitting start/end

[:]

For loop

for c in string:
   statements

In operator - membership returns T/F

Methods -

var.methodName(args)
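A short sketch that exercises the string operations listed above on one sample word. The word and variable names are arbitrary.

```python
word = "analysis"

first = word[0]            # indexing is 0-based
middle = word[2:5]         # start inclusive, end exclusive
every_other = word[::2]    # step of 2
whole = word[:]            # omitted start/end gives the whole string

greeting = "data " + word                    # concatenation
echo = "ab" * 3                              # repetition
label = f"{word} has {len(word)} chars"      # f-string
templ = "{one}, {two}".format(one="x", two="y")

vowels = [c for c in word if c in "aeiou"]   # loop over characters + membership
upper = word.upper()                         # method call
```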

Git

Git - Distributed VCS, clients mirror full repo, fault tolerant vs Centralised - single server

Git converts directory to VFS - versioned filesystem, provides operations

Snapshots, not deltas - changed files are stored, unchanged files use pointers

Only needs local files, can push with network

Git directory, metadata + object db for project

Secure Shell Protocol(SSH) - public/private keys for secure exchange over network

Encrypt with public key, decrypt with private

Data Science

Answer questions/solve problems with data

Hypothesis then collect data, or find dataset, explore, generate hypothesis

Data wrangling/munging - cleaning data, uses pipeline, new data = rerun or extend

EDA - no hypothesis or model, use graphs and summary stats

Can reveal unclean data, more processing

Analysis - model for explanatory research (causal relationships)

Matplotlib

Object-oriented - figures and axes

Figure - canvas, set dimensions, background colour, place objects on it and save

Axes - frontend for plots, placed onto figures

Init -

plt.figure()

figsize

for size

Add plot

ax1 = fig.add_subplot()

use

fig, axes = plt.subplots(2,2)

for 2x2 grid

ax1.clear()

clear plot

axes[0,0].hist()

Plots to top left, [0,1] = top right, [1,0] = bottom left

.plot(x,y, linestyle=, color=, marker=)

Plot range

ax.set_xlim(x1, x2)

Tick locations

ax.set_xticks([])

Tick labels

ax.set_xticklabels([])

Plot labels

.set_x/ylabel() .set_title() .legend()

Saving

fig.savefig("file.png", dpi=)
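The calls above can be combined into one small figure. This is a sketch under assumptions: the data, labels and the Agg backend are chosen for illustration, and the figure is saved to an in-memory buffer instead of a file name such as "file.png".

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))   # 2x2 grid of Axes
axes[0, 0].hist(y)                                # top left
ax = axes[0, 1]                                   # top right
ax.plot(x, y, linestyle="--", color="green", marker="o", label="squares")
ax.set_xlim(0, 3)
ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(["zero", "one", "two", "three"])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Squares")
ax.legend()

buffer = io.BytesIO()              # stand-in for a path on disk
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```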

Seaborn

Countplot - for categorical, y=freq, x=categories, type of barplot

Scatterplot - 2 cont. variables, shows relationship

Linear reg. - scatter + fitted line to model the relationship

Barplot - x=categories, y=continuous e.g. means of each category, swap x and y to make horizontal

Can apply hue to count, scatter, bar, box - splits categories using another column

Boxplot - for cont. shows the spread

Can use

.set(title=,xlabel=,ylabel=)

UNIX

Written in Assembly, then rewritten in C

UNIX Filesystem - hierarchical, tree structure with nodes - root = /

Nodes - has metadata, at least a name. Leaf Nodes - no children

Path - sequence of nodes to id other nodes in the tree

Absolute Path - sequence of nodes from the root - resolves to a location

Relative Path - starts navigation at current location - intermediate node
Label - traverse to child, .. to parent, . stay
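Path resolution can be tried without touching the filesystem. The sketch below uses Python's posixpath module, and the directory /home/sam/project is a hypothetical current location.

```python
import posixpath

current = "/home/sam/project"            # hypothetical current directory

# A relative path is resolved against the current location.
child = posixpath.normpath(posixpath.join(current, "data"))      # label: go to child
parent = posixpath.normpath(posixpath.join(current, ".."))       # ..: go to parent
same = posixpath.normpath(posixpath.join(current, "."))          # .: stay
sibling = posixpath.normpath(posixpath.join(current, "../notes/./todo.txt"))

is_abs = posixpath.isabs(current)        # absolute paths start at the root /
```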

Non-leaf nodes = directories, Leaf nodes = files or empty directories

Subdirectory - directory inside directory

Files - stores information - name, contents, location, privileges

UNIX Shell - program allows users to interact with UNIX system

Terminal - physical hardware, input + output, dumb terminal/thin client relies on host computer

Terminal emulator - program, text-only on GUI

Terminal needs a SHELL to run commands

SHELL - text only, access OS, executes commands, interacts with files, scripting

CLI - style of interface, text-only, runs shell

Check running shell

echo $0

Data Structures

List - ordered, mutable, dynamic array, heterogeneous, head→tail, uses []

Tuple - ordered, immutable, static array, uses ()

Dict - key-value pairs with hashable keys, associative array, mutable, keeps insertion order (Python 3.7+) {}

Set - unordered, mutable, unique objects {}

Sort list -

myList.sort()

modifies list,

sorted(myList)

creates new sorted list

List for loop (same as string)

Concat lists with +

Dict - uses hash function, converts key to an index to retrieve value - O(1) on average for insertion, access, deletion

Retrieve value by key

myDict["key"]

Keys must be unique, else they overwrite values

Remove key-value pair

del myDict["key"]

Dict for loop -

for k,v in myDict.items():

Can sort list of k,v pairs using

sorted(myDict.items())

Sets - add values -

mySet.add(newVal)

Unique values using

set(obj)

Intersection -

set1.intersection(set2)
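A compact sketch of the list, dict, set and tuple operations above, with made-up sample values.

```python
scores = [3, 1, 2]
ordered_copy = sorted(scores)     # new list, scores is unchanged here
scores.sort()                     # sorts the list in place
combined = scores + [4, 5]        # list concatenation

ages = {"bo": 30, "al": 25}
ages["bo"] = 31                   # same key, so the value is overwritten
ages["cy"] = 40
del ages["cy"]                    # remove a key-value pair
pairs = sorted(ages.items())      # list of (key, value) tuples sorted by key

seen = set([1, 2, 2, 3])          # duplicates collapse
seen.add(4)
common = seen.intersection({3, 4, 5})

point = (1, 2)                    # tuple: cannot be modified after creation
```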

Files

Permissions - UID (user id), GID(id for group of users)

Access Rights - Class (who can access) and type (type of access)

Class - u (user) = owner, g (group) = user in group, o (other) = anyone else with access

Type - r (read), w (write), x (execute)

chmod

change mode - absolute or symbolic mode

Symbolic - class(u, g, o, a), type(r, w, x), op (+, -, =)

chmod g+r *.txt

chmod u=rw,go= file

Absolute - 3 digits (user, group, other), 4=r, 2=w, 1=x

chmod 730 a.txt

u=rwx, g=wx

chmod 641 a.txt

u=rw, g=r, o=x
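The arithmetic behind absolute mode (each digit is a sum of 4, 2 and 1) can be checked with a small helper. describe_mode is a hypothetical function written for this sketch, not part of any library.

```python
def describe_mode(octal_mode):
    """Turn a 3-digit absolute mode such as "641" into symbolic form."""
    parts = []
    for cls, digit in zip("ugo", octal_mode):
        value = int(digit)
        rights = ""
        if value & 4:
            rights += "r"
        if value & 2:
            rights += "w"
        if value & 1:
            rights += "x"
        parts.append(f"{cls}={rights}")
    return ", ".join(parts)


print(describe_mode("730"))   # u=rwx, g=wx, o=
print(describe_mode("641"))   # u=rw, g=r, o=x
```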

cp <src1> <src2> <target>

cp -r <src> <targ>

directory + its contents

cp -r <src>/. <targ>

contents only (a trailing / alone does this on BSD/macOS cp, not on GNU cp)

mv <src1> <src2> <targ>

moves/renames

rm <path1> <path2>

by rel/abs path

rmdir <dir>

removes empty directory

rm -r <dir>

non-empty directory

Git Anatomy

Tree - snapshot along the timeline in VFS

Commit - snapshot in timeline, has a tree & hash(SHA1) code (id)

Repo - commits stored in .git, each commit points to a tree

Staging area - index, stores info for next commit - chosen changes

Working directory - UNIX directory, files in repo, checked out version

Snapshot - records a tree, all files in project at a point in time

Branch - alt dev path - ptr to latest in its timeline, default = main branch

Parents - initial commit has none, merge commits have 2, most commits have 1

Head - commit we are on, can be ptr to any commit, normally main branch

Git Cycle

3 Stages - modified(changes since checkout), staged(modified and added), committed(in git directory)

Status - untracked, modified, unmodified, staged

Working directory→staging→commit to repo

Repo → working directory (checkout)

Tracked - last snapshot or staged

Untracked - not in last snapshot and not staged

New file (untracked) →staged→committed

Existing file→modified→staged→committed

Unmod→removed(repo)→untracked

Un/Staged at the same time - modify, stage changes, modify again

Git History/Undo

git restore <file>

discards changes in WD

git restore --staged <file>

unstage, modifications stay, previously tracked are now modified, else untracked

git rm <file>

removes from staging and WD

git rm --cached <file>

removes from staging (index) only, file stays in WD as untracked

git diff

- WD vs staged

--staged

staged vs last commit

git log

- latest at top, commits = hashcode

--pretty=format:""

%h(hash), %an(author) %ar(time) %s(message)

--graph

git show

metadata, edits, file content of latest or

<commit>

for specific, or HEAD~n

git checkout HEAD~n <file>

copies to WD, all changes lost

Numpy

n-dimensions, homogeneous, ordered

Init - list, tuples, np methods

arange -

[inclusive, exclusive]

linspace -

[inclusive, inclusive]

Vectorized Operations - faster, implemented in C, contiguous memory, parallel processing

Attributes - dtype, ndim, shape, size

Iteration - nested loop or single loop

d.flat

Broadcasting - shapes are compatible if each dim is equal or 1, e.g. (4,3) works with (), (3,), (1,3), (4,1)

Operations (mean, max..) by axis. Down the rows ↓ (one result per column):

d.mean(axis=0)

Across the columns → (one result per row):

axis=1

Indexing - single el

[row, col]

, entire row

[1,]

, rows

[0:2]

, rows select col

[0:2,1]

, rows cols

[0:2,1:3]

Shallow - view, slice, reshape, ravel

Deep - copy, resize, flatten
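The NumPy notes above can be sketched on one small array. The array contents are arbitrary, and the name d follows the notation used in this section.

```python
import numpy as np

d = np.arange(12).reshape(4, 3)      # 0..11 (end exclusive), shape (4, 3)
pts = np.linspace(0, 1, 5)           # 5 points, both ends included

col_means = d.mean(axis=0)           # down the rows: one value per column
row_means = d.mean(axis=1)           # across the columns: one value per row

shifted = d - np.array([0, 1, 2])    # (4, 3) with (3,) broadcasts over rows
scaled = d * np.ones((4, 1))         # (4, 3) with (4, 1) broadcasts over columns

block = d[0:2, 1:3]                  # rows 0-1, columns 1-2
view = d[0:2]                        # slicing gives a view that shares memory
deep = d[0:2].copy()                 # copy() owns its own data
view[0, 0] = 99                      # also changes d[0, 0], but not deep
```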

Pandas Series

1D, homogeneous, indexed (default 0 → n-1)

Like fixed length, ordered dicts, maps index to values

Can initialize with dicts - with custom index, autofills with NaN

Can use

in

by index

'a' in d

d.index

→ indexes,

d.array

→ array

Custom index -

pd.Series([1,2], index=['b','a'])

Reindex - to change order/drop indexes

d.reindex(['a','b'])

Indexing - single

d['a']

multi

d[['a','b']]

Vectorized operations on values

Index alignment with operations on 2 series with overlapping indexes

d1 + d2

Update indexes with

d.index = [new indexes]

, same size only

Forward fill

d.reindex([i], method="ffill")
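A sketch of the Series behaviour described above: label lookup, index alignment and forward fill. All values and labels are invented for the example.

```python
import pandas as pd

d1 = pd.Series([1, 2], index=["b", "a"])
d2 = pd.Series({"a": 10, "c": 30})

has_a = "a" in d1                    # membership checks the index
single = d1["a"]                     # one label
subset = d1[["a", "b"]]              # list of labels gives a Series

total = d1 + d2                      # aligned by label, unmatched labels give NaN
reordered = d1.reindex(["a", "b", "z"])   # "z" is new, so its value is NaN

steps = pd.Series(["low", "high"], index=[0, 3])
filled = steps.reindex(range(5), method="ffill")   # carry the last value forward
```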

SHELL cmds

pwd

- print working directory

ls <path>

cd <path>

mkdir <path>

mkdir -p foo/bar/final

manual

man ls

man man

ls -a

all files (including hidden ones)

Wildcards

ls main*

- selectively show files,

main.c main.o main

ls <path1> <path2>

lists files of 2 paths

ls -l

long format - perms, owner/owner group, size (bytes), modify date

ls -R

current and all subdirectories

Dataframes Groupby

SAC (split, apply, combine)

Split - by key, done on an axis

Apply - function to each group

Combine results into new object

Keys can be list/array of values (same len as axis) or single value (col)

df["data1"].groupby(df["key1"]).sum()

returns a series, key1 as the index, sum of data1 as value

df.data1.groupby([df["key1"], 
df["key2"]]).mean()

returns series with multi index

Unstack multi index with

.unstack()

first index = row, second = column

Group entire df and apply function to all columns

df.groupby("key1").mean()

Error if a column is non-numeric (e.g. strings), fn can't be applied - select numeric columns first or pass numeric_only=True

Use

groupby

and

.size()

to count values in column (categorical)

df.groupby("key")[col].op

returns series

df.groupby("key")[[col]].op

returns df
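The split-apply-combine calls above, run on a tiny frame. Column names follow the key1/key2/data1 notation of this section, while the values and the extra data2 column are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": ["one", "two", "one", "two"],
    "data1": [1, 2, 3, 4],
    "data2": [10, 20, 30, 40],
})

sums = df["data1"].groupby(df["key1"]).sum()     # Series indexed by key1
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()   # multi index
table = means.unstack()                          # key1 as rows, key2 as columns

as_series = df.groupby("key1")["data1"].sum()            # single brackets: Series
as_frame = df.groupby("key1")[["data1", "data2"]].mean() # double brackets: DataFrame
counts = df.groupby("key1").size()                       # rows per group
```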

Dataframes np functions

np.add(df, 10)

adds 10 to all values

df.add(10)

is the same

np.abs(df)

np.sqrt(df)

df.apply(fn, axis=)

for custom functions, axis=0 by default

np.sum()

df.sum()

np.mean()

df.mean()

Dataframes

2D arrays (higher with hierarchical indexing)

Custom rows and columns indices, can have missing data, column = pd.Series

Construct with dict of equal-length lists/arrays

data = {"col1" : ["val1", "val2"],
        "col2" : ["val1", "val2"]}

Reorder and add columns

pd.DataFrame(data, columns=["new"])

Retrieve col as series

df["col1"]

Update all col values with assignment

df["col"] = 2

or can use series

Update with series - index matching with df

Can use boolean op to create new col

df["western"] = df.province == "Alberta"

Delete column

del df["western"]

df.reindex([])

to rearrange rows/cols, add new ones

df.drop()

- default rows,

axis=1

for cols

df.iloc[row,col]

[inc, ex]

[[1,2],[0,2]]

for specific indexes

df.loc["row","col"]

[inc, inc]

grades >= 50

returns df with T/F values

Boolean indexing

grades[(grades >= 50) & (grades < 70)]

& = and, | = or, wrap each condition in ()
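A sketch of the DataFrame selection and column operations above. The provinces, grades and row labels are invented sample data.

```python
import pandas as pd

data = {"province": ["Alberta", "Ontario", "Quebec"],
        "grade": [45, 65, 80]}
df = pd.DataFrame(data, index=["r0", "r1", "r2"])

df["western"] = df.province == "Alberta"          # new boolean column
first_two = df.iloc[0:2, 0:2]                     # positions, end exclusive
labelled = df.loc["r0":"r1", "province":"grade"]  # labels, end inclusive
picked = df.iloc[[1, 2], [0, 2]]                  # specific positions

mask = df["grade"] >= 50                          # Series of True/False
mid_band = df[(df["grade"] >= 50) & (df["grade"] < 70)]   # boolean indexing

without_flag = df.drop("western", axis=1)         # returns a new frame
del df["western"]                                 # removes the column in place
```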

CSV

with

statement assigns csv file to var, uses it, then closes to release resource

open()

opens file, and assigns to var

mode="w"

for writing,

mode="r"

for reading (default)

newline=""

leaves line-ending handling to the csv module

csv.writer(file)

and

csv.reader(file)

writer.writerow([])

for row in reader:

x1, x2.. = row

df = pd.read_csv("file", names=[])

names=columns
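A round trip with the csv module, as a minimal sketch. The file name grades.csv and its contents are hypothetical, and a temporary directory is used so nothing is left in the working directory.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "grades.csv")   # hypothetical file

# "w" opens for writing; newline="" leaves line endings to the csv module.
with open(path, mode="w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "grade"])
    writer.writerow(["ana", 72])

rows = []
with open(path, mode="r", newline="") as f:   # file is closed when the block ends
    reader = csv.reader(f)
    for row in reader:
        name, grade = row                      # every field is read back as a string
        rows.append((name, grade))
```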

Data_Analysis Cheat Sheet (DRAFT) by Glowie
