I love to share what I've learnt with others. Check out my blog posts and notes about my academic research, as well as technical solutions on software engineering and data science challenges.
Opinions expressed in this blog are solely my own.

Deployment onto Conda forge

05 Sep 2023

The deployment of python package on linux is not working (again). I am exploring solutions to automate deployment. Here are things I’ve found.

Documentation of conda forge: https://conda-forge.org/docs/
An example of feedstock of repo using F2PY: https://github.com/conda-forge/fastscapelib-f2py-feedstock
Conda-forge repo of fastscapelib-f2py: https://anaconda.org/conda-forge/fastscapelib-f2py/

[Updated on 2023/12/11] After some research, it seems that scikit-build would be a continuously maintained solution: https://scikit-build.readthedocs.io/

to be continued.

IMPORTANT Bug fix release hn2016_falwa v0.7.2

04 Sep 2023

We published an important bugfix release hn2016_falwa v0.7.2, which requires recompilation of fortran modules.

Two weeks ago, we discovered that there is a mistake in the derivation of expression of nonlinear zonal advective flux term, which leads to an underestimation of the nonlinear zonal advective flux component.

We will submit corrigendum for Neal et al. (2022, GRL) and Nakamura and Huang (2018, Science) to update the numerical results. The correct derivation of the flux expression can be found in the corrected supplementary materials of NH18 (to be submitted soon). There is no change in conclusions in any of the articles.

Please refer to Issue #83 for the numerical details and preliminary updated figures in NHN22 and NH18:

Thank you for your attention and let us know if we can help.

Generative AI with LLMs

29 Jun 2023

Here are course notes I am taking from the DeepLearning.ai course on Coursera: Generative AI with Large Language Models.

Deploy package on pip and conda channel

18 Jun 2023

To build pip distribution that contains the source files only (without compiled modules):

python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*

To compile the package to .whl on Mac (Example: SciPy pip repository):

python setup.py bdist_wheel
python3 -m twine upload dist/*

This deploy the wheel file dist/hn2016_falwa-0.7.0-cp310-cp310-macosx_10_10_x86_64.whl to pip channel.

However, when repeating the Mac procedures above on Linux, I got the error:

Binary wheel 'hn2016_falwa-0.7.0-cp311-cp311-linux_x86_64.whl' has an unsupported platform tag 'linux_x86_64'.

I googled and found this StackOverflow thread: Binary wheel can’t be uploaded on pypi using twine.

ManyLinux repo: https://github.com/pypa/manylinux

Good tutorial:

http://www.martin-rdz.de/index.php/2022/02/05/manually-building-manylinux-python-wheels/

Create fresh start environment:

$ conda create --name test_new

But using conda on mac to compile wheel would have this issue:

ld: unsupported tapi file type '!tapi-tbd' in YAML file

Related thread: https://developer.apple.com/forums/thread/715385

Create virtual environment (not via conda)

python3 -m venv /Users/claresyhuang/testbed_venv
source /Users/claresyhuang/testbed_venv/bin/activate

MDTF meeting 2023/6/5

05 Jun 2023

The google slides used in my presentation in the meeting of NOAA Model Diagnostics Task Force can be found here.

JetBrains Open-source license

17 May 2023

Thrilled that my open-source climate data analysis package gets sponsored by JetBrains Licenses for Open Source Development. 🥳

I’m really glad I started this project in 2016 when I was still in graduate school, with the hope that the climate data diagnostic proposed in my thesis can be applied by other more easily. Even though I have been working in industry after finishing my PhD, by maintaining this package, I’ve established valuable connections with many academic researchers. 😊 I’m grateful that JetBrains support open-source community and encourage the culture of sharing.

JetBrains Licenses for Open Source Development: https://www.jetbrains.com/community/opensource/

Compile cython modules

04 May 2023

It took me a while to figure out how to import cython modules. Here is a configuration that works:

├── hn2016_falwa
│   ├── cython_modules
│   |   ├── pyx_modules
|   |   |   ├── __init__.py
|   |   |   ├── dirinv_cython.pyx
|   |   ├──setup_cython.py
|   |   ├──check_import.py
...

In dirinv_cython.pyx I have:

def x_sq_minus_x(x):
    return x**2 - x

In setup_cython.py I have:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='dirinv_cython',
      package_dir={'pyx_modules': ''},
      ext_modules=cythonize("pyx_modules/dirinv_cython.pyx"))

First, I compile the cython modules by executing in the directory cython_modules:

python setup_cython.py build_ext --inplace

This produces dirinv_cython.c in the directory pyx_modules/.

Put this in __init__.py:

import dirinv_cython

Then I run the script to test the imports check_import.py:

from pyx_modules import dirinv_cython

ans = dirinv_cython.x_sq_minus_x(19)
print(f"ans = {ans}")

Executing check_import.py gives

ans = 342

Implementing QuantileTransformer in Spark - mapping any kinds of distribution to normal distribution

23 Mar 2023

This blog post is motivated by the Scikit-learn documentation of QuantileTransformer and a StackExchange discussion thread about it.

There are 2 parts in this post. Part I reviews the idea of Quantile Transformer. Part II shows the implementation of Quantile Transformer in Spark using Pandas UDF.

Part I: Quantile Transformer transforms data of arbitrary distribution to normal (or uniform) distribution

Problem Statement: I have some individuals (id) with 3 attributes of different distributions. I want to combine them linearly and also want to make sure the outcome follows a normal distribution.

In python, I create a toy dataset with column id, and 3 columns corresponding to random variables following different distributions:

import numpy as np
import pandas as pd
import scipy
import math
import matplotlib.pyplot as plt

num_of_items = 10000  # the size of my population

df = pd.DataFrame({
    'id': [str(i) for i in np.arange(num_of_items)],
    'uniform': np.random.rand(num_of_items),
    'power_law': np.random.power(3, num_of_items),
    'exponential': np.random.exponential(1, num_of_items)})

The first 5 rows of df looks like

 id   uniform  power_law  exponential
0.543253   0.690897     0.523969
0.161339   0.802748     0.808497
0.487836   0.818129     1.409843
0.594641   0.808148     2.233953
0.513764   0.783795     1.841159

Here is a visualization of their distributions:

Initial distribution

To transform all the columns to normal distribution, first, get the rank (or quantile, if rank is too expensive) for each id in each column:

df_rank = df.rank(
    axis=0, method='average', numeric_only=True, na_option='keep', ascending=True, pct=True)

The first 5 rows of df_rank looks like:

   uniform  power_law  exponential
 0.5351     0.3388       0.4099
 0.1544     0.5243       0.5543
 0.4814     0.5546       0.7586
 0.5882     0.5351       0.8950
 0.5053     0.4885       0.8427

Let’s say we want to map these values to a normal distribution with mean=0.5 and standard deviation=0.15. To look up the corresponding value in the CDF of normal distribution, we can use scipy.stats.norm.ppf:

df_transformed = df_rank.applymap(lambda col: scipy.stats.norm.ppf(col, loc=0.5, scale=0.15))

The first 5 rows of df_transformed looks like:

    uniform  power_law  exponential
0.513214   0.437639     0.465830
0.347339   0.509142     0.520480
0.493004   0.520594     0.605271
0.533438   0.513214     0.688035
0.501993   0.495675     0.650843

Let’s see how the distribution of values in df_transformed look like:

Transformed distribution

Perfect! Their distributions look identical now! 😝

If I do a uniform linear combination of them per id,

df_transformed['linear_combination'] = df_transformed.apply(
    lambda row: np.mean([row['uniform'], row['power_law'], row['exponential']]), axis=1)

I would get results of the same distribution. On the right, I show the results from linear combination of the original values for comparison:

Linear combination

Another combination strategy would be to get the max value among the 3 columns. The transformed variable follows similar distribution, dispute the mean shifts to larger values.

Max combination

Part II: Implementation of Quantile Transformer in Spark

Given the introduction of Pandas UDF in Spark, the implementation is relatively simple. If ranking is too expensive, you can consider using approximate quantile instead.

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(PandasUDFType.SCALAR)
def to_normal_distribution(data: pd.Series):
    pdf = data.rank(pct=True)\
        .applymap(lambda col: scipy.stats.norm.ppf(col, loc=0.5, scale=0.15))
    return pdf

(Note: Later I realized that the newest Spark version has pyspark.pandas.DataFrame.rank, see Spark documentation. That’s not available at my work station yet.)

You can append the transformed value to the original dataframe:

spark_df = spark_session.createDataFrame(df)
spark_df = spark_df.withColumn(
    'transformed_data', 
    to_normal_distribution(spark_df.uniform))

Aggregation of vectors using Spark Summarizer is too slow. How to get around it?

21 Mar 2023

I have a dataframe df with columns id (integers) and document (string):

root
 |-- id: integer
 |-- document: string

Each id is associated with multiple documents. Each document would be transformed into a vector representation (embedding of dimension 100) using Spark NLP. Eventually, I want to get the average of vectors associated with each id.

When testing with small amount of data, i.e. 10k id with each associated with ~100 document, pyspark.ml.stat.Summarizer does the job quickly:

df.groupby('id').agg(Summarizer.mean(F.col('embedding')).alias('mean_vector'))

However, the real situation is that I have to deal with Big Data that consists of 100M distinct id and 200M distinct document. Each id can be associated with at most 30k document. The time taken to (1) attach embedding using Spark NLP and (2) aggregate the vectors per id took me 10 hours, which is too slow!

Eventually, I figured out a way to do the same thing while having the computing time shortened to ~2 hours.

Thanks to my colleague who spot out the bottleneck - step (1) is indeed not slow. It was step (2) that takes most of the time when there is a huge number of id to work with. In this scenario, the aggregation of values in 100 separate columns is actually faster than the aggregation of 100-dimension vectors.

Here is what I do to optimize the procedures:

1. Obtain vector representation as array for distinct `document`

One can specify in sparknlp.base.EmbeddingFinisher whether you want to output a vector or an array. To make the split easier, I set the output as array:

from sparknlp.base import EmbeddingFinisher
...

finisher = EmbedingFinisher() \
    .setOutputAsVector(False) \
    ...

2. Split the 100-d vector into 100 columns

Now I have document_df of schema

root
 |-- document: string
 |-- embedding: array
 |    |-- element: float

I split the vectors into 100 columns (v1, v2, … v100) by:

import pyspark.sql.functions as F

split_to_cols = [F.col('embedding')[i].alias(f'v{i}') for i in range(1, 101)]
document_df = document_df.select([F.col('document')] + split_to_cols)

3. Join the resultant dataframe with the original and do aggregation per column

I then join document_df to df and obtain the average of embedding columns per id:

avg_expr = [F.mean(F.col(f'v{i}')).alias(f'v{i}') for i in range(1, 101)]
df = df.join(document_df, on='document').groupby('id').agg(*avg_expr)

4. (If needed) Use VectorAssembler to concatenate the columns into vectors

One can turn the values in the 100 columns to a vector per id if needed:

from pyspark.ml import VectorAssembler

assembler = VectorAssembler(
    inputCols=[f'v{i}' for i in range(1, 101)],
    outputCol='final_vector')
df = assembler.transform(df) \
    .select('id', 'final_vector')

That’s how I speed up the aggregation of Vector output from Spark NLP. Hope that helps. 😉

Reduce the number of files on HDFS

21 Jan 2023

Sometimes, hive tables are saved not in an optimal way that creates lots of small files and reduce the performance of the cluster. This blog post is about pyspark functions to reduce the number of files (and can shrink storage size when the data is indeed sparse).

To check the number of files and their size, I used this HDFS command:

$ hadoop cluster_name fs -count /hive/warehouse/myschema.db/

There will be 4 columns printed out. I’m showing one of the examples among the table I shrank today:

Directory count	File count	Content size	Table name
3	854	104950877	original_table_w_many_small_files

When I check the file sizes of the 854 files using hadoop cluster_name fs -du -h /hive/warehouse/myschema.db/original_table_w_many_small_files/*/*/*, I find that all of them are indeed of small size:

9 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00845-29c4c506-e991-4d1d-be67-43e0a9976179.c000
7 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00846-29c4c506-e991-4d1d-be67-43e0a9976179.c000
4 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00847-29c4c506-e991-4d1d-be67-43e0a9976179.c000
6 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00848-29c4c506-e991-4d1d-be67-43e0a9976179.c000
8 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00849-29c4c506-e991-4d1d-be67-43e0a9976179.c000
5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00850-29c4c506-e991-4d1d-be67-43e0a9976179.c000
5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00851-29c4c506-e991-4d1d-be67-43e0a9976179.c000
9 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00852-29c4c506-e991-4d1d-be67-43e0a9976179.c000
5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00853-29c4c506-e991-4d1d-be67-43e0a9976179.c000

To combine the files, the following was what I do with a pyspark (jupyter) notebook.

Create table using parquet format

First of all, I check the schema of the table using SHOW CREATE TABLE myschema.original_table_w_many_small_files:

CREATE TABLE myschema.original_table_w_many_small_files (
  `some_id` BIGINT,
  `string_info_a` STRING,
    ...
  `some_part_id` INT,
  `timestamp` BIGINT)
USING orc
PARTITIONED BY (some_part_id, timestamp)
TBLPROPERTIES (
  ... )

I create a table called myschema.new_table_with_fewer_files in parquet format instead:

CREATE TABLE myschema.new_table_with_fewer_files (
  `some_id` BIGINT,
  `string_info_a` STRING,
    ...
  `some_part_id` INT,
  `timestamp` BIGINT)
USING PARQUET
PARTITIONED BY (some_part_id, timestamp)

Then, I retrieved the original table and repartitioned it, such that all data in each partition is combined to a single file (numPartitions=1) since my cluster can handle files of size up to 100 M. You shall adjust numPartitions based on the resources you have.

df = spark.table('myschema.original_table_w_many_small_files')
df.repartition(numPartitions=1).write.insertInto('myschema.new_table_with_fewer_files')

Here is a comparison between the storage of the old and new table:

Table name	Directory count	File count	Content size
original_table_w_many_small_files	3	854	104950877
new_table_with_fewer_files	7	4	14866996

When checking the table size using human readable format hadoop fs -du -h, I find that the table has been shrunk from 100.1 M to 39.6 M.

Local wave activity budget correction

22 Dec 2022

It came to our attention that one of the terms in the LWA budget as computed in the currently published code on my GitHub may require correction. This concerns the meridional convergence of eddy momentum flux, which is currently evaluated in the displacement coordinate (Φ’), where it should really be evaluated in the real latitude (Φ).

The discrepancy does not affect the zonal mean budget of wave activity, but it may cause spurious residual/sources/sinks locally if not corrected.

Noboru and I are currently assessing the error that this causes in our previous results and we will let you know as soon as we find out.

In the meantime, if you want to correct your diagnostic by yourself, the remedy is relatively simple (adding a correction term) – please see the write-up from Noboru on GitHub. If you have specific questions or concerns, we’ll be happy to help.

Our contact method:

Noboru Nakamura: nnn@uchicago.edu
Clare Huang: csyhuang@uchicago.edu

Discussion on Diffusion models beat GANs on image synthesis

22 Oct 2022

Today I led the Machine Learning Journal Club discussion on the paper:

Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794.

This time instead of making slides for overview, I put the focus of discussion on the trick, i.e. Classifier Guidance (section 4 of the paper), which makes the whole thing work well.

hn2016_falwa release 0.6.1 + remarks on GitHub CLI and repo management

28 Aug 2022

Great thanks to Christopher Polster who submitted a pull request to include Xarray interface to the QGField class of the local wave activity package. 😄 Please find here the release note for version 0.6.1.

I learned something new cleaning up the GitHub repo and I’m gonna write about it.

1. Untrack changes in the repo

There are actually two places which you can define the files to be ignored by git:

.gitignore: I used to have this locally. Christopher suggested I include that in the GitHub repo for everyone’s use, and I think that’s a better idea.
.git/info/exclude: This is indeed the right place to specify personal files to be excluded (not shared on the repo).

2. Skip unit test when optional packages are not installed

In test_xarrayinterface.py, I modified the xarray import statement:

try:
    import xarray as xr
except ImportError:
    pytest.skip("Optional package Xarray is not installed.", allow_module_level=True)

Given that xarray is an optional package, even if it is not installed, unit test for this package shall still run through.

3. Move fortran modules into the package directory

The .f90 files that contain the f2py modules were located in hn2016_falwa/ before. Now it is moved to hn2016_falwa/f90_modules/. The modifications done are:

(1) In setpy.py, the extension is changed to something like:

ext4 = Extension(name='interpolate_fields_direct_inv',
                 sources=['hn2016_falwa/f90_modules/interpolate_fields_dirinv.f90'],
                 f2py_options=['--quiet'])

The compiled modules of suffix .so will be located in hn2016_falwa/ with this change.

(2) Add in hn2016_falwa/__init__.py:

from .interpolate_fields import interpolate_fields
...
from .compute_lwa_and_barotropic_fluxes import compute_lwa_and_barotropic_fluxes

4. Fix the display of documentation on readthedocs.org

To debug, go to https://readthedocs.org/projects/hn2016-falwa/ and look at the Builds. Even if it passes, the document may not be compiled properly. Go to View raw and check out all warnings/errors to see if you have a clue.

The fix I have done are:

In docs/source/conf.py:
- Fix the appended sys path
- Add modules/packages that causes the compilation to fail/raise warning to autodoc_mock_imports in
Add docs/requirements.txt and specify all external packages imported (i.e. numpy, scipy, xarray) in the package

5. Compare commits using GitHub’s interface

To display the difference between two commits using GitHub’s web interface, put in the following URL:

https://github.com/csyhuang/hn2016_falwa/compare/OLD-COMMIT-HASH..NEW-COMMIT-HASH

6. To pull a branch which other forked your repo to local

Discussion on Adapters as a mean of parameter-efficient trasnfer learning for NLP

23 Jul 2022

I led the Machine Learning Journal Club discussion on the paper:

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., … & Gelly, S. (2019, May). Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (pp. 2790-2799). PMLR.

Here are the slides I made.

Discussion on Difference-based Contrastive Learning for Sentence Embedding (DiffCSE)

21 May 2022

I led the Machine Learning Journal Club discussion on the paper:

Chuang, Y. S., Dangovski, R., Luo, H., Zhang, Y., Chang, S., Soljačić, M., … & Glass, J. (2022). DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv preprint arXiv:2204.10298.

Here are the slides I made.

Paper published on GRL!

18 Apr 2022

Our newest research paper:

Neal, Emily, Huang C. S. Y., Nakamura N. The 2021 Pacific Northwest heat wave and associated blocking: meteorology and the role of an upstream diabatic source of wave activity. Geophysical Research Letters (2021)

is available online!

Check out the latest release of my python package that includes the scripts and new reference state inversion algorithm to reproduce the analyses.😄

Package release hn2016_falwa v0.6.0

18 Mar 2022

Here comes the package release hn2016_falwa v0.6.0. This version contains the updated algorithm for reference state inversion - now the reference state is solved with absolute vorticity field at 5N (defined by user) as boundary condition. The analysis code to reproduce results in Neal et al. “The 2021 Pacific Northwest heat wave and associated blocking: Meteorology and the role of an upstream cyclone as a diabatic source of wave activity” (submitted to GRL) can be found in the directory scripts/nhn_grl2022/.

Please refer to the release notes for details.

We are in a hurry to submit our manuscript revision, so this package version is not available on PyPI yet.😅 I will upload it to PyPI asap.

Discussion on Translation-Counterfactual Word Replacement (TCWR)

12 Feb 2022

I led the Machine Learning Journal Club discussion on the paper:

Liu, Q., Kusner, M., & Blunsom, P. (2021, June). Counterfactual data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 187-197). Here are the slides I made.

Discussion on Supervised Contrastive Learning

04 Dec 2021

I led the Machine Learning Journal Club discussion on the paper:

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., … & Krishnan, D. (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Here are the slides I made.

Package release hn2016_falwa v0.5.0

15 Aug 2021

Here comes the package release hn2016_falwa v0.5.0. Please refer to the release notes for details. Great thanks to Christopher Polster who submitted a pull request to optimize the fortran modules, which makes the flux calculations a lot faster. 😄

The package is available on PyPI as well.

The CI procedures of the package has been migrated from Travis CI to GitHub workflow. I spent a while on fixing related issues, so I wrote a blog post about the procedures. Hopefully this can save others time if they want to do the same thing.

Older Newer

Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

Part I: Quantile Transformer transforms data of arbitrary distribution to normal (or uniform) distribution

Part II: Implementation of Quantile Transformer in Spark

1. Obtain vector representation as array for distinct document

2. Split the 100-d vector into 100 columns

3. Join the resultant dataframe with the original and do aggregation per column

4. (If needed) Use VectorAssembler to concatenate the columns into vectors

Create table using parquet format

1. Untrack changes in the repo

2. Skip unit test when optional packages are not installed

3. Move fortran modules into the package directory

4. Fix the display of documentation on readthedocs.org

5. Compare commits using GitHub’s interface

6. To pull a branch which other forked your repo to local

1. Obtain vector representation as array for distinct `document`