Conversion of pandas dataframe to pyspark dataframe with an older version of pandas

30 Oct 2019

A pandas dataframe can be converted to a pyspark dataframe easily in newer versions of pandas (after v0.19.2). If you are using an older version of pandas, you have to do a bit more work for the conversion, as follows.
First, load the packages and start a Spark session.
from pyspark.sql import SparkSession, DataFrame, Row
import pandas as pd
spark = SparkSession.builder \
    .master("local") \
    .appName("Pandas to pyspark DF") \
    .getOrCreate()
Here is an example of pandas dataframe to be converted.
df = pd.DataFrame({'index': [i for i in range(7)],
                   'alphabet': [i for i in 'pyspark']})
df.head(7)
|   | index | alphabet |
|---|-------|----------|
| 0 | 0     | p        |
| 1 | 1     | y        |
| 2 | 2     | s        |
| 3 | 3     | p        |
| 4 | 4     | a        |
| 5 | 5     | r        |
| 6 | 6     | k        |
To convert it to a pyspark dataframe, one has to create a list of `Row` objects and pass it into `createDataFrame`:
# Build one Row per record; iterrows() yields (label, Series) pairs
df_pyspark = spark.createDataFrame([
    Row(index=row['index'], alphabet=row['alphabet'])
    for _, row in df.iterrows()
])
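`iterrows()` yields `(label, Series)` pairs, which is what the list comprehension unpacks; a minimal standalone check:

```python
import pandas as pd

df = pd.DataFrame({'index': list(range(3)), 'alphabet': list('pys')})
# iterrows() yields (row_label, row_as_Series) tuples
first = next(df.iterrows())
print(first[0])              # row label: 0
print(first[1]['alphabet'])  # value in that row: 'p'
```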
df_pyspark.show()
+--------+-----+
|alphabet|index|
+--------+-----+
| p| 0|
| y| 1|
| s| 2|
| p| 3|
| a| 4|
| r| 5|
| k| 6|
+--------+-----+