
PySpark: All You Need Cheat Sheet (DRAFT)

For Python users who want to step into the big data world. I'll compare pandas and PySpark functions.

This is a draft cheat sheet. It is a work in progress and is not finished yet.

ETL - 1 to 1 relation


ETL - map/reduce

udf
 
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(series: pd.Series) -> pd.Series:
    # Simply add one using a pandas Series.
    return series + 1

df.select(pandas_plus_one(df.a)).show()

ETL - N to 1

groupby
 
df.groupby('color').avg()

ETL - streaming

 
 

I/O

Local_CSV
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)
inferSchema=True: guess each column's data type from the CSV data
Local_Json
Cloud_s3
To SQL table
df.createOrReplaceTempView("tableA")

Efficiency

repartition
shuffle
collect

Spark - all kinds of handlers

SparkContext
The old-timer: the original entry point, from before Spark 2.0
SparkSession
The newcomer: the only entry point you need to know in recent Spark versions
SparkConf
spark.sql
spark.sql("SELECT * FROM p LEFT JOIN e ON p.name = e.name")
Returns a DataFrame, similar to df.query() in pandas
RDD

EDA - get the information for debugging/coding

printSchema
DataFrame.printSchema()
columns: List[str]
DataFrame.columns
show()
df.show(1)
head()
An action: it forces the computation to run
take
df.take(1)