I build machine learning models that process huge amounts of data daily on a Hadoop cluster with PySpark. Here are some PySpark problems I ran into and solved:
Implementing QuantileTransformer in Spark - mapping any kind of distribution to a normal distribution
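The core idea behind a quantile transform can be sketched in plain Python, independent of Spark: rank each value, turn the rank into a percentile, and push it through the inverse normal CDF. This is a minimal sketch of the math only; a Spark implementation would get the ranks from something like `percent_rank()` over a window or from `approxQuantile`, which is what the write-up covers.

```python
from statistics import NormalDist

def quantile_transform(values):
    """Map values onto a standard normal distribution via their ranks.

    Plain-Python sketch of the idea; in Spark the ranks would come from
    percent_rank() over a window or from approxQuantile instead.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # standard normal: mean 0, sigma 1
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Shift ranks into the open interval (0, 1) so inv_cdf stays finite.
        p = (rank + 0.5) / n
        out[i] = normal.inv_cdf(p)
    return out

transformed = quantile_transform([3.0, 1.0, 100.0, 2.0])
```

Because the transform only uses ranks, the output is identical whether the input is uniform, log-normal, or heavy-tailed, which is exactly why it can map "any kind" of distribution to a normal one.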
Aggregating vectors with Spark's Summarizer is too slow - how to get around it?
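One common workaround (my assumption here, the actual fix lives in the post) is to bypass the MLlib Vector type and reduce plain arrays elementwise on the RDD side. The combine step is trivially small; in plain Python it looks like this:

```python
from functools import reduce

def add_vectors(a, b):
    """Elementwise sum of two equal-length vectors (plain lists here;
    on a Spark executor these would typically be numpy arrays)."""
    return [x + y for x, y in zip(a, b)]

# Stand-in for one partition of vector rows; in PySpark this combiner
# would be passed to rdd.reduce or rdd.treeReduce instead of
# functools.reduce.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
total = reduce(add_vectors, vectors)  # [9.0, 12.0]
```

`treeReduce` is worth considering over a plain `reduce` for wide clusters, since it combines partial sums in stages instead of funnelling everything through the driver at once.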
Databricks Certified Associate Developer for Apache Spark 3.0
Converting a pandas DataFrame to a PySpark DataFrame with an older version of pandas
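I'm not reproducing the exact incompatibility the post addresses, but a version-agnostic fallback is to hand Spark plain Python records instead of the pandas object, so `createDataFrame` never touches pandas internals at all. The pandas-side half of that (the only part that doesn't need a running SparkSession) looks like:

```python
import pandas as pd

pdf = pd.DataFrame({"user": ["a", "b"], "score": [1.5, 2.5]})

# Convert to a list of plain dicts; any pandas version supports this,
# and spark.createDataFrame(records) accepts the result directly
# (the SparkSession call itself is omitted in this sketch).
records = pdf.to_dict(orient="records")
```

The trade-off is that `to_dict` materializes everything on the driver, so this sketch only makes sense for DataFrames that comfortably fit in driver memory.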