Reading data from a file

df = spark.read.csv("file.csv", header=True)
df = spark.read.parquet("file.parquet")

Casting a column to a different data type

from pyspark.sql.functions import col
df = df.withColumn("col1", col("col1").cast("double"))

Displaying the schema of a DataFrame

df.printSchema()

Get distinct count of a column

df.select("col").distinct().count()

Filtering rows based on a condition

# Keep only the records whose age is greater than 24
df.filter(df["age"] > 24).show()

Renaming columns of DataFrame


df = df.withColumnRenamed('CallNumber', 'PhoneNumber')

Inspect Data

# Return first n rows
df.head(5)
# Return first row
df.first()
# Return the first n rows
df.take(5)
# Print the schema of df
df.printSchema()
# Print the (logical and physical) plans
df.explain(True)
# Get all column names from DataFrame
df.columns

Get count

# Get row count
rows = empDF.count()

# Get column count
cols = len(empDF.columns)

Selecting specific columns of a DataFrame

df.select("col1", "col2").show()

# Select all columns
df.select("*").show()

Full content of the columns without truncation

df.show(truncate=False)

Handling missing or null values

# Fill all null values in numeric columns with 0
df.fillna(0)

# Fill specific columns with specified values
df.fillna({"col1": 0, "col2": "missing"})

Joining two dataframes

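# Join on a column that has the same name in both DataFrames
joined_df = df1.join(df2, on="key_column", how="inner")

# Join on an explicit condition when the key columns are named differently
joined_df = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner")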

Adding a new column to a DataFrame

df.withColumn("new_col", col("col1") + col("col2"))

Dropping columns from a DataFrame

df = df.drop("col1", "col2")

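Grouping data by a column and aggregating with functions

# These Spark functions shadow Python's built-in sum and max in this scope
from pyspark.sql.functions import sum, avg, max

df.groupBy("department").agg(
    sum("salary").alias("sum_salary"),
    avg("salary").alias("avg_salary"),
    sum("bonus").alias("sum_bonus"),
    max("bonus").alias("max_bonus"),
)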

Stopping the SparkSession

spark.stop()


