Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

Split a vector/list in a pyspark DataFrame into columns

Split an array column

To split a column with arrays of strings, e.g. a DataFrame that looks like,

|   strCol|
|[A, B, C]|

into separate columns, the following code works without a UDF.

import pyspark.sql.functions as F

# df is the DataFrame shown above
df2 = df.select([F.col("strCol")[i] for i in range(3)])
df2.show()


|strCol[0]|strCol[1]|strCol[2]|
|        A|        B|        C|

Split a vector column

To split a column with doubles stored in DenseVector format, e.g. a DataFrame that looks like,

|       intCol|
|[2.0,3.0,4.0]|

one has to construct a UDF that first converts the DenseVector to an array (Python list):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

def split_array_to_list(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

# df is the DataFrame shown above
df3 = df.select(split_array_to_list(F.col("intCol")).alias("split_int"))\
    .select([F.col("split_int")[i] for i in range(3)])
df3.show()


|split_int[0]|split_int[1]|split_int[2]|
|         2.0|         3.0|         4.0|