Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

Conversion of pandas dataframe to pyspark dataframe with an older version of pandas

A pandas dataframe can be converted to a pyspark dataframe directly with spark.createDataFrame(df) if you have pandas v0.19.2 or newer. If you are using an older version of pandas, you have to do a bit more work for the conversion, as follows.

First, load the packages and initiate a spark session.

from pyspark.sql import SparkSession, DataFrame, Row
import pandas as pd

spark = SparkSession.builder \
        .master("local") \
        .appName("Pandas to pyspark DF") \
        .getOrCreate()

Here is an example pandas dataframe to be converted.

df = pd.DataFrame({'index': list(range(7)),
                   'alphabet': list('pyspark')})
df.head(7)
   index alphabet
0      0        p
1      1        y
2      2        s
3      3        p
4      4        a
5      5        r
6      6        k

To convert it to a pyspark dataframe, build a list of Row objects and pass it to createDataFrame:

df_pyspark = spark.createDataFrame([
    Row(index=row['index'], alphabet=row['alphabet'])
    for _, row in df.iterrows()
])
df_pyspark.show()
+--------+-----+
|alphabet|index|
+--------+-----+
|       p|    0|
|       y|    1|
|       s|    2|
|       p|    3|
|       a|    4|
|       r|    5|
|       k|    6|
+--------+-----+