Ch.7 - Aggregations
10 Jan 2025Aggregation Functions
count
- Note when operating on whole DataFrame, all rows are counted
- But when operating on a column, nulls are discarded
countDistinct
approx_count_distinct
pyspark.sql.functions.approx_count_distinct(col, rsd=None)
# rsd – maximum relative standard deviation allowed (default = 0.05). For rsd < 0.01, it is more efficient to use countDistinct()
first and last
df.select(F.first(col), F.last(col))
min and max
sum
sumDistinct
avg
variance and standard deviation
from pyspark.sql.functions import var_pop, standdev_pop, var_samp, stddev_samp
skewness and kurtosis
- Skewness measures the asymmetry of the value in my data around the mean
- Kurtosis measures the tail of the data
from pyspark.sql.functions import skewness, kurtosis
Covariance and Correlation
from pyspark.sql.functions import corr, cov_samp, cov_pop
Aggregating to complex types
df.agg(F.collect_set(col), F.collect_list(col)) # returns array columns
or in SQL
SELECT collect_set(col), collect_list(col) FROM df_table
Grouping
df.groupBy("InvoiceNo").agg(F.count("Quantity".alias("quan")), F.expr("count(Quantity)")) # returns the same thing
Grouping with Maps
df.groupBy("InvoiceNo").agg(F.expr("avg(Quantity)"), F.expr("stddev_pop(Quantity)"))
or in SQL
SELECT avg(Quantity), stddev_pop(Quantity) FROM df
GROUP BY InvoiceNo
Window functions
(come back later…)