
Comparing Core PySpark and Pandas Code Cheat Sheet by datamansam

Do you already know Python and work with Pandas? Do you work with big data? Then PySpark should be your friend! PySpark is the Python API for Spark, a general-purpose distributed data-processing engine. Because it distributes computation across a cluster, it can analyse large amounts of data in a short time.

Importing Dataset

#SPARK titanic_sp = spark.table("titanic_train")
#PANDAS titanic_pd = titanic_sp.select("*").toPandas()
#FromPdBackToSPARK pysparkDF2 = spark.createDataFrame(pandasDF)

View data in DataFrame

titanic_sp.show()
display(titanic_pd)
 

Display DataFrame schema

titanic_sp.printSchema()
titanic_pd.info()
Both show column names and data types; Pandas' info() also reports non-null counts and memory usage.

Renaming a column in a DataFrame

column_renamed = titanic_sp.withColumnRenamed("Name", "PassengerName").columns
titanic_pd.rename(columns={'Name': 'PassengerName'}).columns
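For illustration, a toy Pandas frame (made-up rows) shows that `rename` returns a new DataFrame rather than modifying the original:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
titanic_pd = pd.DataFrame({"Name": ["Braund", "Cumings"], "Survived": [0, 1]})

# rename() is not in-place by default; the original keeps its columns
renamed = titanic_pd.rename(columns={"Name": "PassengerName"})
print(list(renamed.columns))     # ['PassengerName', 'Survived']
print(list(titanic_pd.columns))  # ['Name', 'Survived']
```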

View number of columns and rows

#SPARK print((titanic_sp.count(), len(titanic_sp.columns)))
#PANDAS titanic_pd.shape

Dropping Columns

#PANDAS flight_data = flight_data.drop(columns_to_drop, axis=1)
#SPARK flight_data = flight_data.drop(*columns_to_drop)

Unique values of a column

titanic_sp.select('Survived').distinct().show()
titanic_pd['Survived'].unique()
 

View column names

titanic_sp.columns
titanic_pd.columns

Display column datatypes

titanic_sp.dtypes
titanic_pd.dtypes

Convert types

flight_data = flight_data.withColumn('dt_departure', f.to_timestamp(flight_data.departure_datetime, 'yyyy-MM-dd HHmm'))
flight_data['dt_departure'] = pd.to_datetime(flight_data['departure_datetime'], format='%Y-%m-%d %H%M')
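A runnable Pandas sketch of the conversion above, with invented `departure_datetime` strings in the same 'yyyy-MM-dd HHmm' style:

```python
import pandas as pd

# Made-up flight data for illustration
flight_data = pd.DataFrame(
    {"departure_datetime": ["2023-01-05 0930", "2023-01-05 1745"]}
)

# Parse strings like '2023-01-05 0930' into proper timestamps
flight_data["dt_departure"] = pd.to_datetime(
    flight_data["departure_datetime"], format="%Y-%m-%d %H%M"
)
print(flight_data["dt_departure"].dt.hour.tolist())  # [9, 17]
```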

Summary Stats

df.describe()
df.describe().show()

Aggregation

df.groupBy("Company").agg({'Sales': 'sum'}).show()
pysparkDF.groupBy("gender") \
    .agg(mean("age"), mean("salary"), max("salary")) \
    .show()
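For comparison, the Pandas side of the first aggregation above might look like this (toy data, hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Sales":   [100, 200, 50],
})

# Sum sales per company, mirroring groupBy("Company").agg({'Sales': 'sum'})
totals = df.groupby("Company").agg({"Sales": "sum"})
print(totals["Sales"].to_dict())  # {'A': 300, 'B': 50}
```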
 

Filter comparisons

df[df['species'].isin(['Chinstrap', 'Gentoo'])].show(5)
df[df['species'].isin(['Chinstrap', 'Gentoo'])].head()
df[df['species'].rlike('G.')].show(5)
df[df['species'].str.match('G.')].head()
df[df['flipper'].between(225, 229)].show(5)
df[df['flipper'].between(225, 229)].head()
df[df['mass'].isNull()].show(5)
df[df['mass'].isnull()].head()
df[(df['mass'] < 3400) & (df['sex'] == 'Male')].head()
df[(df['mass'] < 3400) & (df['sex'] == 'Male')].show(5)
df[~df['flipper'].between(225, 229)].show(5)
df[~df['flipper'].between(225, 229)].head()
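A self-contained Pandas sketch of a few of the comparisons above, on a made-up penguins frame (column names assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "flipper": [190, 227, 200],
    "mass":    [3400.0, None, 3200.0],
})

# Membership test: rows whose species is in the list
by_species = df[df["species"].isin(["Chinstrap", "Gentoo"])]

# Range test, and its negation with ~
in_range  = df[df["flipper"].between(225, 229)]
out_range = df[~df["flipper"].between(225, 229)]

# Missing values
missing_mass = df[df["mass"].isnull()]

print(sorted(by_species["species"]))                  # ['Chinstrap', 'Gentoo']
print(len(in_range), len(out_range), len(missing_mass))  # 1 2 1
```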

Conditional Transformations

flight_data['departure_delay_status'] = np.where(flight_data['departure_delay'] > 90, 'Heavy', 'Moderate')
flight_data = flight_data.withColumn('departure_delay_status', f.when(flight_data.departure_delay > 90, 'Heavy').otherwise('Moderate'))
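The Pandas/NumPy branch above, as a runnable sketch with invented delay values:

```python
import numpy as np
import pandas as pd

# Hypothetical delays in minutes
flight_data = pd.DataFrame({"departure_delay": [120, 30, 95]})

# Label delays over 90 minutes 'Heavy', everything else 'Moderate'
flight_data["departure_delay_status"] = np.where(
    flight_data["departure_delay"] > 90, "Heavy", "Moderate"
)
print(flight_data["departure_delay_status"].tolist())
# ['Heavy', 'Moderate', 'Heavy']
```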
           
 

