Ch.7 - Aggregations
20 Nov 2024
Aggregation Functions
count
- Note: when counting over the whole DataFrame (count("*")), all rows are counted, nulls included
- But when counting a specific column, nulls are discarded (sketch below)
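A minimal sketch of the difference, assuming a DataFrame df whose Description column contains some nulls:
from pyspark.sql import functions as F
df.select(
    F.count("*").alias("all_rows"),           # counts every row, including rows where Description is null
    F.count("Description").alias("non_null")  # counts only non-null Description values
).show()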
countDistinct
approx_count_distinct
pyspark.sql.functions.approx_count_distinct(col, rsd=None)
# rsd – maximum relative standard deviation allowed (default = 0.05). For rsd < 0.01, it is more efficient to use countDistinct()
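Usage sketch (a looser rsd trades accuracy for speed; StockCode is an assumed column name):
df.select(F.approx_count_distinct("StockCode", rsd=0.1)).show()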
first and last
df.select(F.first(col), F.last(col))
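A concrete sketch, assuming a StockCode column; both functions also take an optional ignorenulls flag:
df.select(F.first("StockCode", ignorenulls=True), F.last("StockCode", ignorenulls=True)).show()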
min and max
sum
sumDistinct
avg
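min, max, sum, sumDistinct, and avg can all be computed in one pass; a sketch on the Quantity column (sum_distinct is the newer name for sumDistinct):
df.select(
    F.min("Quantity"), F.max("Quantity"),
    F.sum("Quantity"), F.sum_distinct("Quantity"),
    F.avg("Quantity")
).show()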
variance and standard deviation
from pyspark.sql.functions import var_pop, stddev_pop, var_samp, stddev_samp
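The _samp variants use the sample formulas (n − 1 denominator), the _pop variants the population formulas; for example:
df.select(F.var_pop("Quantity"), F.var_samp("Quantity"), F.stddev_pop("Quantity"), F.stddev_samp("Quantity")).show()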
skewness and kurtosis
- Skewness measures the asymmetry of the values in the data around the mean
- Kurtosis measures how heavy the tails of the data are
from pyspark.sql.functions import skewness, kurtosis
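For example:
df.select(F.skewness("Quantity"), F.kurtosis("Quantity")).show()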
Covariance and Correlation
from pyspark.sql.functions import corr, covar_samp, covar_pop
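Both take two columns; a sketch pairing Quantity with an assumed UnitPrice column:
df.select(
    F.corr("Quantity", "UnitPrice"),
    F.covar_samp("Quantity", "UnitPrice"),
    F.covar_pop("Quantity", "UnitPrice")
).show()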
Aggregating to complex types
df.agg(F.collect_set(col), F.collect_list(col)) # returns array columns
or in SQL
SELECT collect_set(col), collect_list(col) FROM df_table
Grouping
df.groupBy("InvoiceNo").agg(F.count("Quantity".alias("quan")), F.expr("count(Quantity)")) # returns the same thing
Grouping with Maps
df.groupBy("InvoiceNo").agg(F.expr("avg(Quantity)"), F.expr("stddev_pop(Quantity)"))
or in SQL
SELECT avg(Quantity), stddev_pop(Quantity) FROM df_table
GROUP BY InvoiceNo
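PySpark's agg also accepts a dict mapping column name to aggregation function name, which is the closest Python equivalent of the map form (UnitPrice is an assumed column):
df.groupBy("InvoiceNo").agg({"Quantity": "avg", "UnitPrice": "max"}).show()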
Window functions
(come back later…)