Cheatography
https://cheatography.com
This PySpark cheat sheet is for anyone learning and practicing PySpark, and it is most useful for newcomers.
Reading data from a file
df = spark.read.csv("file.csv", header=True)
# Parquet files store their own schema, so no header option is needed
df = spark.read.parquet("file.parquet")
|
Casting a column to a different data type
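from pyspark.sql.functions import col
# withColumn returns a new DataFrame, so assign the result
df = df.withColumn("col1", col("col1").cast("double"))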
|
Displaying the schema of a DataFrame
df.printSchema()
|
Get distinct count of columns
df.select("col").distinct().count()
|
Filtering rows based on a condition
# Keep only the records whose age is greater than 24
df.filter(df["age"]>24).show()
|
Renaming columns of DataFrame
#Syntax
df.withColumnRenamed("old_name","new_name")
#Example
df = df.withColumnRenamed('CallNumber', 'PhoneNumber')
|
Inspect Data
# Return the first row (df.head(n) returns the first n rows)
df.head()
# Return first row
df.first()
# Return the first 2 rows as a list of Row objects
df.take(2)
# Print the schema of df
df.printSchema()
# Print the (logical and physical) plans
df.explain()
# Get all column names from the DataFrame
df.columns
|
Get count
# Get the row count
rows = empDF.count()
# Get the column count
cols = len(empDF.columns)
|
Selecting specific columns of a Dataframe
df.select("col1","col2").show()
# Select All columns
df.select("*").show()
|
Full content of the columns without truncation
df.show(truncate=False)
|
Handling missing or null values
# Fill null values with 0 (a numeric fill value only affects numeric columns)
df.fillna(0)
# Fill specific columns with specified values
df.fillna({"col1": 0, "col2": "missing"})
|
Joining two dataframes
#syntax:
joined_df = df1.join(df2, on="key_column", how="inner")
#example
# show() returns None, so keep the join result and display it separately
joined_df = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner")
joined_df.show(truncate=False)
|
Adding a new column to a DataFrame
from pyspark.sql.functions import col
df = df.withColumn("new_col", col("col1") + col("col2"))
|
Dropping columns from a Dataframe
Grouping data by a column and aggregating with a function
#syntax
df.groupBy("col1").agg({"col2":"mean"})
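#example
# Import the Spark functions under an alias to avoid shadowing Python's built-in sum and max
from pyspark.sql import functions as F
df.groupBy("department").agg(
    F.sum("salary").alias("sum_salary"),
    F.avg("salary").alias("avg_salary"),
    F.sum("bonus").alias("sum_bonus"),
    F.max("bonus").alias("max_bonus"),
).show(truncate=False)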
|
Stopping the SparkSession
|