
PySpark Fingertip Commands Cheat Sheet

This PySpark cheat sheet is for anyone who wants to learn and practice PySpark, and is most useful for beginners.

Reading data from a file

# Assumes an active SparkSession named spark
df = spark.read.csv("file.csv", header=True)
df = spark.read.parquet("file.parquet")

Casting a column to a different data type

from pyspark.sql.functions import col
df.withColumn("col1", col("col1").cast("double"))

Displaying the schema of a DataFrame

df.printSchema()

Get distinct count of a column

df.select("col").distinct().count()

Filtering rows based on a condition

# Filter on age: keep only the records whose value is > 24
df.filter(df["age"] > 24).show()


Renaming columns of DataFrame


df = df.withColumnRenamed('CallNumber', 'PhoneNumber')

Inspect Data

# Return first n rows
df.head(n)
# Return first row
df.first()
# Return the first n rows
df.take(n)
# Print the schema of df
df.printSchema()
# Print the (logical and physical) plans
df.explain(True)
# Get all column names from DataFrame
df.columns

Get count

# Get row count
rows = empDF.count()

# Get column count
cols = len(empDF.columns)

Selecting specific columns of a DataFrame

df.select("col1","col2").show()

# Select all columns
df.select("*").show()

Full content of the columns without truncation

df.show(truncate=False)

Handling missing or null values

# Fill all null values with 0
df.fillna(0)

# Fill specific columns with specified values
df.fillna({"col1": 0, "col2": "missing"})

Joining two DataFrames

joined_df = df1.join(df2, on="key_column", how="inner")

# Join on an explicit column expression
joined_df = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner")

Adding a new column to a DataFrame

df.withColumn("new_col", col("col1") + col("col2"))

Dropping columns from a DataFrame

df.drop("col1", "col2")

Grouping data by a column and aggregating with a function


from pyspark.sql.functions import sum, avg, max  # note: shadows Python's built-in sum and max

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"),
         avg("salary").alias("avg_salary"),
         sum("bonus").alias("sum_bonus"),
         max("bonus").alias("max_bonus")
     )

Stopping the SparkSession

spark.stop()
