I build machine learning models that process huge amounts of data daily on a Hadoop cluster with PySpark. Here are some PySpark problems I ran into and solved:
Implementing QuantileTransformer in Spark - mapping any kind of distribution to a normal distribution
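The core idea behind a quantile transform can be sketched in plain Python, independent of Spark: rank each value, turn the rank into a percentile, and push it through the inverse normal CDF. This is a minimal sketch of the math only; a Spark implementation would get the ranks from something like `percent_rank()` over a window or from `approxQuantile`, which is what the write-up covers.

```python
from statistics import NormalDist

def quantile_transform(values):
    """Map values onto a standard normal distribution via their ranks.

    Plain-Python sketch of the idea; in Spark the ranks would come from
    percent_rank() over a window or from approxQuantile instead.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # standard normal: mean 0, sigma 1
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Shift ranks into the open interval (0, 1) so inv_cdf stays finite.
        p = (rank + 0.5) / n
        out[i] = normal.inv_cdf(p)
    return out

transformed = quantile_transform([3.0, 1.0, 100.0, 2.0])
```

Because the transform only uses ranks, the output is identical whether the input is uniform, log-normal, or heavy-tailed, which is exactly why it can map "any kind" of distribution to a normal one.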
Aggregating vectors with Spark's Summarizer is too slow - how to get around it?
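One common workaround (my assumption here, the actual fix lives in the post) is to bypass the MLlib Vector type and reduce plain arrays elementwise on the RDD side. The combine step is trivially small; in plain Python it looks like this:

```python
from functools import reduce

def add_vectors(a, b):
    """Elementwise sum of two equal-length vectors (plain lists here;
    on a Spark executor these would typically be numpy arrays)."""
    return [x + y for x, y in zip(a, b)]

# Stand-in for one partition of vector rows; in PySpark this combiner
# would be passed to rdd.reduce or rdd.treeReduce instead of
# functools.reduce.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
total = reduce(add_vectors, vectors)  # [9.0, 12.0]
```

`treeReduce` is worth considering over a plain `reduce` for wide clusters, since it combines partial sums in stages instead of funnelling everything through the driver at once.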
Databricks Certified Associate Developer for Apache Spark 3.0
Converting a pandas DataFrame to a PySpark DataFrame with an older version of pandas
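I'm not reproducing the exact incompatibility the post addresses, but a version-agnostic fallback is to hand Spark plain Python records instead of the pandas object, so `createDataFrame` never touches pandas internals at all. The pandas-side half of that (the only part that doesn't need a running SparkSession) looks like:

```python
import pandas as pd

pdf = pd.DataFrame({"user": ["a", "b"], "score": [1.5, 2.5]})

# Convert to a list of plain dicts; any pandas version supports this,
# and spark.createDataFrame(records) accepts the result directly
# (the SparkSession call itself is omitted in this sketch).
records = pdf.to_dict(orient="records")
```

The trade-off is that `to_dict` materializes everything on the driver, so this sketch only makes sense for DataFrames that comfortably fit in driver memory.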