Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

I love to share what I've learnt with others. Check out my blog posts and notes about my academic research, as well as technical solutions to software engineering and data science challenges.
Opinions expressed in this blog are solely my own.


JetBrains Open-source license

Thrilled that my open-source climate data analysis package is sponsored through JetBrains' Licenses for Open Source Development program. 🥳

(Image: JetBrains open-source license certificate)

I'm really glad I started this project in 2016 when I was still in graduate school, with the hope that the climate data diagnostic proposed in my thesis could be applied by others more easily. Even though I have been working in industry since finishing my PhD, maintaining this package has helped me establish valuable connections with many academic researchers. 😊 I'm grateful that JetBrains supports the open-source community and encourages the culture of sharing.

JetBrains Licenses for Open Source Development: https://www.jetbrains.com/community/opensource/

Compile Cython modules

It took me a while to figure out how to compile and import Cython modules. Here is a configuration that works:

├── hn2016_falwa
│   ├── cython_modules
│   │   ├── pyx_modules
│   │   │   ├── __init__.py
│   │   │   ├── dirinv_cython.pyx
│   │   ├── setup_cython.py
│   │   ├── check_import.py
...

In dirinv_cython.pyx I have:

def x_sq_minus_x(x):
    return x**2 - x

In setup_cython.py I have:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='dirinv_cython',
      package_dir={'pyx_modules': ''},
      ext_modules=cythonize("pyx_modules/dirinv_cython.pyx"))

First, I compile the cython modules by executing in the directory cython_modules:

python setup_cython.py build_ext --inplace

This produces dirinv_cython.c in the directory pyx_modules/.

Put this in __init__.py:

import dirinv_cython

Then I run check_import.py to test the imports:

from pyx_modules import dirinv_cython

ans = dirinv_cython.x_sq_minus_x(19)
print(f"ans = {ans}")

Executing check_import.py gives

ans = 342

Implementing QuantileTransformer in Spark - mapping any kind of distribution to a normal distribution

This blog post is motivated by the Scikit-learn documentation of QuantileTransformer and a StackExchange discussion thread about it.

There are two parts in this post. Part I reviews the idea of the Quantile Transformer. Part II shows an implementation of the Quantile Transformer in Spark using a Pandas UDF.

Part I: The Quantile Transformer maps data of arbitrary distribution to a normal (or uniform) distribution

Problem statement: I have some individuals (id), each with 3 attributes following different distributions. I want to combine the attributes linearly and make sure the outcome follows a normal distribution.

In python, I create a toy dataset with column id, and 3 columns corresponding to random variables following different distributions:

import numpy as np
import pandas as pd
import scipy
import math
import matplotlib.pyplot as plt

num_of_items = 10000  # the size of my population

df = pd.DataFrame({
    'id': [str(i) for i in np.arange(num_of_items)],
    'uniform': np.random.rand(num_of_items),
    'power_law': np.random.power(3, num_of_items),
    'exponential': np.random.exponential(1, num_of_items)})

The first 5 rows of df look like:

 id   uniform  power_law  exponential
  0  0.543253   0.690897     0.523969
  1  0.161339   0.802748     0.808497
  2  0.487836   0.818129     1.409843
  3  0.594641   0.808148     2.233953
  4  0.513764   0.783795     1.841159

Here is a visualization of their distributions:

(Figure: Initial distributions)

To transform all the columns to a normal distribution, first get the percentile rank (or quantile, if exact ranking is too expensive) for each id in each column:

df_rank = df.rank(
    axis=0, method='average', numeric_only=True, na_option='keep', ascending=True, pct=True)

The first 5 rows of df_rank look like:

   uniform  power_law  exponential
0   0.5351     0.3388       0.4099
1   0.1544     0.5243       0.5543
2   0.4814     0.5546       0.7586
3   0.5882     0.5351       0.8950
4   0.5053     0.4885       0.8427

Let's say we want to map these values to a normal distribution with mean=0.5 and standard deviation=0.15. To look up the value corresponding to each percentile rank (i.e. the inverse CDF of the normal distribution), we can use scipy.stats.norm.ppf:

df_transformed = df_rank.applymap(lambda col: scipy.stats.norm.ppf(col, loc=0.5, scale=0.15))

The first 5 rows of df_transformed look like:

    uniform  power_law  exponential
0  0.513214   0.437639     0.465830
1  0.347339   0.509142     0.520480
2  0.493004   0.520594     0.605271
3  0.533438   0.513214     0.688035
4  0.501993   0.495675     0.650843

Let's see what the distributions of the values in df_transformed look like:

(Figure: Transformed distributions)

Perfect! Their distributions look identical now! 😝

If I take an equally weighted linear combination of them per id,

df_transformed['linear_combination'] = df_transformed.apply(
    lambda row: np.mean([row['uniform'], row['power_law'], row['exponential']]), axis=1)

I get a result with the same distribution. On the right, I show the linear combination of the original (untransformed) values for comparison:

(Figure: Linear combination)

Another combination strategy is to take the max value among the 3 columns. The combined variable follows a similar distribution, though the mean shifts to larger values (see the sketch after the figure).

(Figure: Max combination)
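For reference, here is a minimal pandas sketch of this max combination, using the same df_transformed as above:

# take the largest of the three transformed attributes per id
df_transformed['max_combination'] = df_transformed[
    ['uniform', 'power_law', 'exponential']].max(axis=1)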

Part II: Implementation of Quantile Transformer in Spark

Given the introduction of Pandas UDFs in Spark, the implementation is relatively simple. If exact ranking is too expensive, you can consider using approximate quantiles instead.

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def to_normal_distribution(data: pd.Series) -> pd.Series:
    # rank(pct=True) gives the percentile rank of each value
    # (note: a scalar Pandas UDF receives the column in batches, so ranks are computed per batch);
    # norm.ppf then maps each rank onto N(mean=0.5, std=0.15)
    return data.rank(pct=True).apply(
        lambda q: scipy.stats.norm.ppf(q, loc=0.5, scale=0.15))

(Note: I later realized that the newest Spark version has pyspark.pandas.DataFrame.rank; see the Spark documentation. That's not available on my workstation yet.)

You can append the transformed value to the original dataframe:

spark_df = spark_session.createDataFrame(df)
spark_df = spark_df.withColumn(
    'transformed_data', 
    to_normal_distribution(spark_df.uniform))
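To mirror Part I end to end, here is a sketch (assuming the same spark_session and toy df as above) that applies the UDF to all three columns and then takes an equally weighted linear combination; the _t column names are just illustrative:

import pyspark.sql.functions as F

for col_name in ['uniform', 'power_law', 'exponential']:
    spark_df = spark_df.withColumn(
        f'{col_name}_t',  # transformed copy of the column
        to_normal_distribution(spark_df[col_name]))

# equally weighted linear combination of the transformed columns
spark_df = spark_df.withColumn(
    'linear_combination',
    (F.col('uniform_t') + F.col('power_law_t') + F.col('exponential_t')) / 3)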

Aggregation of vectors using Spark Summarizer is too slow. How to get around it?

I have a dataframe df with columns id (integers) and document (string):

root
 |-- id: integer
 |-- document: string

Each id is associated with multiple documents. Each document is transformed into a vector representation (an embedding of dimension 100) using Spark NLP. Eventually, I want the average of the vectors associated with each id.

When testing with a small amount of data, i.e. 10k ids each associated with ~100 documents, pyspark.ml.stat.Summarizer does the job quickly:

import pyspark.sql.functions as F
from pyspark.ml.stat import Summarizer

df.groupby('id').agg(Summarizer.mean(F.col('embedding')).alias('mean_vector'))

However, the real situation is that I have to deal with big data: 100M distinct ids and 200M distinct documents, with each id associated with up to 30k documents. Steps (1) attaching embeddings using Spark NLP and (2) aggregating the vectors per id together took 10 hours, which is too slow!

Eventually, I figured out a way to do the same thing with the computing time shortened to ~2 hours.

Thanks to my colleague who spotted the bottleneck: step (1) is actually not slow. It was step (2) that took most of the time when there is a huge number of ids to work with. In this scenario, aggregating values in 100 separate columns is faster than aggregating 100-dimensional vectors.

Here is what I did to optimize the procedure:

1. Obtain the vector representation as an array for each distinct document

One can specify in sparknlp.base.EmbeddingsFinisher whether to output vectors or arrays. To make the split easier, I set the output to arrays:

from sparknlp.base import EmbeddingsFinisher
...

finisher = EmbeddingsFinisher() \
    .setOutputAsVector(False) \
    ...

2. Split the 100-d vector into 100 columns

Now I have document_df of schema

root
 |-- document: string
 |-- embedding: array
 |    |-- element: float

I split the vectors into 100 columns (v1, v2, … v100) by:

import pyspark.sql.functions as F

# array elements are 0-indexed; name the resulting columns v1 ... v100
split_to_cols = [F.col('embedding')[i].alias(f'v{i + 1}') for i in range(100)]
document_df = document_df.select([F.col('document')] + split_to_cols)

3. Join the resultant dataframe with the original and do aggregation per column

I then join document_df to df and obtain the average of embedding columns per id:

avg_expr = [F.mean(F.col(f'v{i}')).alias(f'v{i}') for i in range(1, 101)]
df = df.join(document_df, on='document').groupby('id').agg(*avg_expr)

4. (If needed) Use VectorAssembler to concatenate the columns into vectors

One can turn the values in the 100 columns into a vector per id if needed:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=[f'v{i}' for i in range(1, 101)],
    outputCol='final_vector')
df = assembler.transform(df) \
    .select('id', 'final_vector')

That's how I sped up the aggregation of the vector output from Spark NLP. Hope that helps. 😉

Reduce the number of files on HDFS

Sometimes, Hive tables are saved in a suboptimal way that creates lots of small files and reduces the performance of the cluster. This blog post is about PySpark functions to reduce the number of files (which can also shrink the storage size when the data is sparse).

To check the number of files and their size, I used this HDFS command:

$ hadoop cluster_name fs -count /hive/warehouse/myschema.db/

Four columns will be printed out. Here is one example among the tables I shrank today:

Directory count   File count   Content size   Table name
3                 854          104950877      original_table_w_many_small_files

When I check the file sizes of the 854 files using hadoop cluster_name fs -du -h /hive/warehouse/myschema.db/original_table_w_many_small_files/*/*/*, I find that all of them are indeed small:

99.9 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00845-29c4c506-e991-4d1d-be67-43e0a9976179.c000
102.7 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00846-29c4c506-e991-4d1d-be67-43e0a9976179.c000
104.4 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00847-29c4c506-e991-4d1d-be67-43e0a9976179.c000
100.6 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00848-29c4c506-e991-4d1d-be67-43e0a9976179.c000
98.8 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00849-29c4c506-e991-4d1d-be67-43e0a9976179.c000
108.5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00850-29c4c506-e991-4d1d-be67-43e0a9976179.c000
106.5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00851-29c4c506-e991-4d1d-be67-43e0a9976179.c000
101.9 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00852-29c4c506-e991-4d1d-be67-43e0a9976179.c000
65.5 K  /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00853-29c4c506-e991-4d1d-be67-43e0a9976179.c000

To combine the files, here is what I did in a PySpark (Jupyter) notebook.

Create table using parquet format

First of all, I check the schema of the table using SHOW CREATE TABLE myschema.original_table_w_many_small_files:

CREATE TABLE myschema.original_table_w_many_small_files (
  `some_id` BIGINT,
  `string_info_a` STRING,
    ...
  `some_part_id` INT,
  `timestamp` BIGINT)
USING orc
PARTITIONED BY (some_part_id, timestamp)
TBLPROPERTIES (
  ... )

I create a table called myschema.new_table_with_fewer_files in parquet format instead:

CREATE TABLE myschema.new_table_with_fewer_files (
  `some_id` BIGINT,
  `string_info_a` STRING,
    ...
  `some_part_id` INT,
  `timestamp` BIGINT)
USING PARQUET
PARTITIONED BY (some_part_id, timestamp)

Then, I retrieved the original table and repartitioned it so that all data in each partition is combined into a single file (numPartitions=1), since my cluster can handle files of size up to 100 MB. You should adjust numPartitions based on the resources you have.

df = spark.table('myschema.original_table_w_many_small_files')
df.repartition(numPartitions=1).write.insertInto('myschema.new_table_with_fewer_files')

Here is a comparison between the storage of the old and new table:

Table name                          Directory count   File count   Content size
original_table_w_many_small_files   3                 854          104950877
new_table_with_fewer_files          7                 4            14866996

When checking the table size in human-readable format with hadoop fs -du -h, I find that the table has shrunk from 100.1 M to 39.6 M.

Local wave activity budget correction

It came to our attention that one of the terms in the LWA budget, as computed in the currently published code on my GitHub, may require correction. This concerns the meridional convergence of the eddy momentum flux, which is currently evaluated at the displacement coordinate (Φ'), whereas it should really be evaluated at the real latitude (Φ).

The discrepancy does not affect the zonal mean budget of wave activity, but it may cause spurious residuals/sources/sinks locally if not corrected.

Noboru and I are currently assessing the error that this causes in our previous results and we will let you know as soon as we find out.

In the meantime, if you want to correct your diagnostic by yourself, the remedy is relatively simple (adding a correction term) – please see the write-up from Noboru on GitHub. If you have specific questions or concerns, we’ll be happy to help.

Our contact method:

  • Noboru Nakamura: nnn@uchicago.edu
  • Clare Huang: csyhuang@uchicago.edu

Discussion on Diffusion models beat GANs on image synthesis

Today I led the Machine Learning Journal Club discussion on the paper:

Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794.

This time, instead of making overview slides, I focused the discussion on the trick, i.e. classifier guidance (Section 4 of the paper), that makes the whole thing work well.

hn2016_falwa release 0.6.1 + remarks on GitHub CLI and repo management

Great thanks to Christopher Polster, who submitted a pull request to add an Xarray interface to the QGField class of the local wave activity package. 😄 Please find here the release notes for version 0.6.1.

I learned something new cleaning up the GitHub repo and I’m gonna write about it.

1. Untrack changes in the repo

There are actually two places where you can define the files to be ignored by git:

  • .gitignore: I used to have this locally. Christopher suggested I include that in the GitHub repo for everyone’s use, and I think that’s a better idea.
  • .git/info/exclude: This is indeed the right place to specify personal files to be excluded (not shared on the repo).

2. Skip unit tests when optional packages are not installed

In test_xarrayinterface.py, I modified the xarray import statement:

import pytest

try:
    import xarray as xr
except ImportError:
    pytest.skip("Optional package Xarray is not installed.", allow_module_level=True)

Since xarray is an optional package, the rest of the unit tests should still run through even when it is not installed; the snippet above skips the xarray tests instead of failing.

3. Move Fortran modules into the package directory

The .f90 files that contain the f2py modules used to live in hn2016_falwa/. They have now been moved to hn2016_falwa/f90_modules/. The modifications are:

(1) In setup.py, the extension is changed to something like:

ext4 = Extension(name='interpolate_fields_direct_inv',
                 sources=['hn2016_falwa/f90_modules/interpolate_fields_dirinv.f90'],
                 f2py_options=['--quiet'])

The compiled modules of suffix .so will be located in hn2016_falwa/ with this change.

(2) Add in hn2016_falwa/__init__.py:

from .interpolate_fields import interpolate_fields
...
from .compute_lwa_and_barotropic_fluxes import compute_lwa_and_barotropic_fluxes

4. Fix the display of documentation on readthedocs.org

To debug, go to https://readthedocs.org/projects/hn2016-falwa/ and look at the Builds. Even if a build passes, the documentation may not be compiled properly. Go to 'View raw' and check all the warnings/errors to see if you have a clue.

The fixes I made are:

  • In docs/source/conf.py:
    • Fix the appended sys path
    • Add the modules/packages that cause the compilation to fail or raise warnings to autodoc_mock_imports (see the example after this list)
  • Add docs/requirements.txt and specify all the external packages imported by the package (i.e. numpy, scipy, xarray)
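For illustration, autodoc_mock_imports in docs/source/conf.py is a plain list of module names; the entries below are placeholders, and you would list whichever imports break your build:

# docs/source/conf.py
autodoc_mock_imports = [
    "interpolate_fields",                  # placeholder: a compiled f2py module
    "compute_lwa_and_barotropic_fluxes",   # placeholder: a compiled f2py module
]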

5. Compare commits using GitHub’s interface

To display the difference between two commits using GitHub’s web interface, put in the following URL:

https://github.com/csyhuang/hn2016_falwa/compare/OLD-COMMIT-HASH..NEW-COMMIT-HASH

6. Pull a branch from someone else's fork of your repo to local
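A minimal sketch with standard git commands, where CONTRIBUTOR and BRANCH are placeholders for the fork owner and the branch name:

# add the contributor's fork as a remote and fetch its branches
git remote add CONTRIBUTOR https://github.com/CONTRIBUTOR/hn2016_falwa.git
git fetch CONTRIBUTOR

# check out the branch locally, tracking the fork's branch
git checkout -b BRANCH CONTRIBUTOR/BRANCH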

Discussion on Adapters as a means of parameter-efficient transfer learning for NLP

I led the Machine Learning Journal Club discussion on the paper:

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., … & Gelly, S. (2019, May). Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (pp. 2790-2799). PMLR.

Here are the slides I made.

Discussion on Difference-based Contrastive Learning for Sentence Embedding (DiffCSE)

I led the Machine Learning Journal Club discussion on the paper:

Chuang, Y. S., Dangovski, R., Luo, H., Zhang, Y., Chang, S., Soljačić, M., … & Glass, J. (2022). DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv preprint arXiv:2204.10298.

Here are the slides I made.

Paper published in GRL!

Our newest research paper:

Neal, Emily, Huang C. S. Y., Nakamura N. The 2021 Pacific Northwest heat wave and associated blocking: meteorology and the role of an upstream diabatic source of wave activity. Geophysical Research Letters (2021)

is available online!

Check out the latest release of my python package that includes the scripts and new reference state inversion algorithm to reproduce the analyses.😄

Package release hn2016_falwa v0.6.0

Here comes the package release hn2016_falwa v0.6.0. This version contains the updated algorithm for reference state inversion: the reference state is now solved with the absolute vorticity field at 5N (defined by the user) as the boundary condition. The analysis code to reproduce the results in Neal et al., “The 2021 Pacific Northwest heat wave and associated blocking: Meteorology and the role of an upstream cyclone as a diabatic source of wave activity” (submitted to GRL), can be found in the directory scripts/nhn_grl2022/.

Please refer to the release notes for details.

We are in a hurry to submit our manuscript revision, so this package version is not available on PyPI yet.😅 I will upload it to PyPI asap.

Discussion on Translation-Counterfactual Word Replacement (TCWR)

I led the Machine Learning Journal Club discussion on the paper:

Liu, Q., Kusner, M., & Blunsom, P. (2021, June). Counterfactual data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 187-197).

Here are the slides I made.

Discussion on Supervised Contrastive Learning

I led the Machine Learning Journal Club discussion on the paper:

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., … & Krishnan, D. (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362.

Here are the slides I made.

Package release hn2016_falwa v0.5.0

Here comes the package release hn2016_falwa v0.5.0. Please refer to the release notes for details. Great thanks to Christopher Polster, who submitted a pull request to optimize the Fortran modules, which makes the flux calculations a lot faster. 😄

The package is available on PyPI as well.

The CI procedures of the package have been migrated from Travis CI to a GitHub workflow. I spent a while fixing related issues, so I wrote a blog post about the procedure. Hopefully this can save others time if they want to do the same thing.

Migrating CI from Travis CI to GitHub Workflow

A few days ago, I realized that the coverage test for my python package, which was run on travis-ci.org, had stopped running for two years. After migrating to travis-ci.com, the service is no longer free once 10000 free credits are spent; one has to pay $69/mo 💸 for unlimited service.

Note that I already used 2170 credits in a single day while testing various things and preparing for the upcoming package release 😱 (my fault?), so I don't think the free plan is sustainable, but $69/mo is definitely too expensive if I only have to maintain a single package.

I found a free solution on the GitHub Marketplace: Codecov. With this GitHub Action, one can run the deployment test and code coverage diagnostics automatically when pushing commits. I spent a bit of time on trial and error, so I think it is worth writing down what I did to save time for others.

Define the workflow

The GitHub Workflow is defined in the file .github/workflows/workflow.yml. Here is what I did for my package:

name: CodeCov
on: [push, pull_request]
jobs:
  run:
    runs-on: ubuntu-latest
    env:
      OS: ubuntu-latest
      PYTHON: '3.7'
    steps:
    - uses: actions/checkout@v2
      with:
        fetch-depth: '2'

    - name: Setup Python
      uses: actions/setup-python@master
      with:
        python-version: 3.7
    - name: Generate Report
      run: |
        pip install coverage
        pip install -U pytest
        pip install --upgrade pip setuptools wheel
        sudo apt-get install gfortran
        pip install numpy
        pip install scipy
        coverage run setup.py pytest
    - name: Upload Coverage to Codecov
      run: |
        bash <(curl -s https://codecov.io/bash)

The procedures under "Generate Report" are copied from my Travis job definition .travis.yml:

language: python
python:
  - "3.7"

before_install:
  - python --version
  - pip install -U pip
  - pip install -U pytest
  - pip install coverage

install:
  - pip install --upgrade pip setuptools wheel
  - sudo apt-get install python-numpy python-scipy gfortran
  - pip install numpy
  - pip install scipy

# command to run tests
script:
  - coverage run setup.py pytest

after_success:
  - bash <(curl -s https://codecov.io/bash)

In order to upload the coverage report to Codecov, I used the command bash <(curl -s https://codecov.io/bash) after failing to send the report using the line uses: codecov/codecov-action@v2 from the Codecov template.

Specify files to include/exclude in the coverage test

The specifications are made in the file .coveragerc:

[run]
source=hn2016_falwa

[report]
show_missing = True
omit =
    hn2016_falwa/legacy/*

Since the modules in the folder legacy/ are no longer maintained, I exclude them from the coverage test.

Implement coverage test locally

If you have the python package coverage installed, you can test the coverage run command, e.g.

$ coverage run setup.py pytest

which will generate a coverage report .coverage. To visualize the report, use the command

$ coverage report

to render .coverage in a readable manner. As an example, the coverage report of my package looks like:

Name                               Stmts   Miss  Cover   Missing
----------------------------------------------------------------
hn2016_falwa/__init__.py               0      0   100%
hn2016_falwa/barotropic_field.py      39      4    90%   72, 79, 86, 131
hn2016_falwa/basis.py                 46      2    96%   113, 124
hn2016_falwa/constant.py               6      0   100%
hn2016_falwa/oopinterface.py         270     28    90%   151, 222, ...
hn2016_falwa/utilities.py             61     48    21%   53-87, 146-181, 237-250
hn2016_falwa/wrapper.py              154    154     0%   1-579
----------------------------------------------------------------
TOTAL                                576    236    59%

Include Build badge and Coverage badge in README of your repo

As an example, the image path to the build status of my package is:

https://github.com/csyhuang/hn2016_falwa/actions/workflows/workflow.yml/badge.svg

(Badge: Build Status) I can check my build status here.

The image path to my CodeCov results is:

https://codecov.io/gh/csyhuang/hn2016_falwa/branch/master/graph/badge.svg

(Badge: codecov.io) The report is visualized on Codecov like this.

Discussion on the application of Contrastive Learning to train Sentence Embedding

In the previous discussion I led, I introduced the idea of learning visual representations in an unsupervised manner. This week, I am introducing the following paper, which applies deep contrastive learning to train sentence embeddings:

Giorgi, J. M., Nitski, O., Bader, G. D., & Wang, B. (2020). Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

The presentation slides I used for discussion can be found here.

Discussion on Contrastive Learning

I led the Machine Learning Journal Club discussion on the two papers:

You can find here the slides I made, which provide an introduction to contrastive learning. Below are the main points from the slides:

  • Contrastive learning is a self-supervised method to learn a representation of objects by minimizing the distance between objects of the same class and maximizing the distance between objects of different classes
  • Contrastive learning benefits from data augmentation and from increasing the number of model parameters
  • Under-clustering occurs when there are not enough negative samples to learn from; over-clustering occurs when the model overfits (memorizes the data)
  • To solve the inefficiency issue, the median (rank-k) triplet loss is used instead of the total loss

I finished the GANs Specialization on Coursera!

I finished the 3-course Generative Adversarial Networks (GANs) Specialization on Coursera! It was super fun! Here are some slides I presented in a deep learning journal club with my peers; they survey the GANs covered in this Specialization.

Click here to view my certificate. 😬

Added a directory for some of my music arrangements

🎹 I have been uploading improvised piano covers to my YouTube channel. Recently, I received some comments asking for the sheet music of my arrangements, and it seems like a good idea to keep a record of my work. Therefore, I started making sheet music for my arrangements with MuseScore to share with others.

🎼 Check out the page Music Arrangement for the collection of arrangements with sheet music I've made. For the rest of my recordings, go to my YouTube Channel Clare's Studio.
