
PySpark Fingertip Commands Cheat Sheet

This PySpark cheat sheet is designed as a quick reference for learning and practice, and is most useful for beginners.

Reading data from a file

df = spark.read.csv("file.csv", header=True)
df = spark.read.parquet("file.parquet")  # Parquet stores its own schema; no header option

Casting a column to a different data type

from pyspark.sql.functions import col

df = df.withColumn("col1", col("col1").cast("double"))

Displaying the schema of a Dataframe

df.printSchema()

Get distinct count of columns

df.select("col").distinct().count()

Filtering rows based on a condition

# Filter on age, keeping only records where the value is > 24
df.filter(df["age"] > 24).show()

Renaming columns of DataFrame

# Syntax
df.withColumnRenamed("old_name", "new_name")

# Example
df = df.withColumnRenamed('CallNumber', 'PhoneNumber')

Inspect Data

# Return the first row (head(n) returns the first n rows)
df.head()
# Return the first row as a Row object
df.first()
# Return the first 2 rows as a list of Rows
df.take(2)
# Print the schema of df
df.printSchema()
# Print the (logical and physical) plans
df.explain()
# Get all column names from the DataFrame
df.columns

Get count

# Get row count
rows = empDF.count()

# Get column count
cols = len(empDF.columns)
 

Selecting specific columns of a Dataframe

df.select("col1", "col2").show()

# Select all columns
df.select("*").show()

Full content of the columns without truncation

df.show(truncate=False)

Handling missing or null values

# Fill all null values with 0
df.fillna(0)

# Fill specific columns with specified values (keys are column-name strings)
df.fillna({"col1": 0, "col2": "missing"})

Joining two dataframes

# Syntax
joined_df = df1.join(df2, on="key_column", how="inner")

# Example (show() returns None, so don't assign its result)
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner") \
    .show(truncate=False)

Adding a new column to a DataFrame

df.withColumn("new_col", col("col1") + col("col2"))

Dropping columns from a Dataframe

df.drop("col1")

Grouping data by a column and aggregating with a function

# Syntax
df.groupBy("col1").agg({"col2": "mean"})

# Example (sum, avg, max come from pyspark.sql.functions)
from pyspark.sql.functions import sum, avg, max

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"),
         avg("salary").alias("avg_salary"),
         sum("bonus").alias("sum_bonus"),
         max("bonus").alias("max_bonus")) \
    .show(truncate=False)

Stopping the SparkSession

spark.stop()
       
 

