Cheatography
https://cheatography.com
For Python users who want to step into the big data world.
This sheet compares pandas and PySpark functions.
This is a draft cheat sheet. It is a work in progress and is not finished yet.
ETL - map/reduce
udf |
|
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(series: pd.Series) -> pd.Series:
    # Add one to every element, operating on a whole pandas Series at a time.
    return series + 1

df.select(pandas_plus_one(df.a)).show()
|
ETL - N to 1
groupby |
|
df.groupby('color').avg() |
|
|
I/O
Local_CSV |
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True) |
inferSchema=True: infer each column's data type from the CSV contents |
Local_Json |
Cloud_s3 |
To SQL table (temporary view) |
df.createOrReplaceTempView("tableA") |
Efficiency
repartition |
shuffle |
collect |
Spark entry points and handlers
SparkContext |
The older entry point, used before Spark 2.0 (RDD-oriented) |
SparkSession |
The newer entry point (Spark 2.0+); the only one you need to know for recent Spark versions |
SparkConf |
|
spark.sql |
spark.sql("SELECT * FROM p left join e on p.name = e.name") |
pandas counterpart: df.query() -> DataFrame |
RDD |
EDA - Get information for debugging/coding
printSchema |
DataFrame.printSchema() |
columns:List[str] |
DataFrame.columns |
show() |
df.show(1) |
head() |
An action: forces the pending (lazy) computation to run |
take |
df.take(1) |
|