Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

I love to share what I've learnt with others. Check out my blog posts and notes about my academic research, as well as technical solutions to software engineering and data science challenges.
Opinions expressed in this blog are solely my own.


Clean up cached file from local test

I found a useful Python package, pyclean, for cleaning up cached files created by a local (suite of unit) test:

https://pypi.org/project/pyclean/

After installing, clean up after a test by executing

pyclean --verbose .

This cleans up the cached files and lists all the files deleted.

Capturing error with more details in Python

The Python standard library traceback can provide more details about an (unexpected) error than catching it with except Exception as ex: and then examining ex.

Let’s make a function that raises an error for demonstration:

import traceback


def do_something_wrong():
    cc = int("come on!")
    return cc

try:  # first catch
    do_something_wrong()
except Exception as ex:
    print(f"The Exception is here:\n{ex}")

try:  # second catch
    do_something_wrong()
except:
    print(f"Use traceback.format_exc() instead:\n{traceback.format_exc()}")

The first catch would only display

The Exception is here:
invalid literal for int() with base 10: 'come on!'

while the second catch includes not only the error but also where it occurred:

Use traceback.format_exc() instead:
Traceback (most recent call last):
  File "/Users/csyhuang/JetBrains/PyCharm2024.1/scratches/scratch.py", line 16, in <module>
    do_something_wrong()
  File "/Users/csyhuang/JetBrains/PyCharm2024.1/scratches/scratch.py", line 5, in do_something_wrong
    cc = int("come on!")

In a large software project, the second example would be way more helpful than the first.
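As a related sketch (not from the original snippet), the same full traceback can be sent to a logger in a larger application; logging.exception appends the traceback automatically:

import logging

logger = logging.getLogger(__name__)

try:
    do_something_wrong()
except Exception:
    # Logs the message at ERROR level together with the full traceback,
    # similar to traceback.format_exc() but routed through the logging setup
    logger.exception("do_something_wrong failed")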

Problem solver for GitHub Workflow

I wanted to create a GitHub workflow that queries a web API and commits the results as text files to the repo. Here are several problems I solved during development.

Passing keys and tokens via secrets to the web API

Several tokens and secrets are needed to query the web API. I stored them as GitHub secrets and accessed them in the workflow file via:

jobs:
  job-name:
    environment: query_env
    runs-on: ubuntu-latest
    name: ...
    steps:
      ...
      - name: Query API
        id: query_api
        env:
          oauthtoken: ${{ secrets.OAUTH_TOKEN }}
          oauthtokensecret: ${{ secrets.OAUTH_TOKEN_SECRET }}
        run: python run_script.py $oauthtoken $oauthtokensecret
        ...
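For completeness, here is a minimal sketch of how run_script.py might pick up those credentials (the argument handling below is hypothetical, not taken from the actual script):

# run_script.py (hypothetical sketch)
import os
import sys
from pathlib import Path


def main() -> None:
    # The workflow passes the secrets as positional arguments; they are also
    # available as the environment variables set in the step above
    oauth_token = sys.argv[1] if len(sys.argv) > 1 else os.environ["oauthtoken"]
    oauth_token_secret = sys.argv[2] if len(sys.argv) > 2 else os.environ["oauthtokensecret"]

    # Query the web API with the credentials (omitted here) and write
    # the results as text files under data_dir/
    Path("data_dir").mkdir(exist_ok=True)


if __name__ == "__main__":
    main()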

After running run_script.py, several .txt files are produced in the directory data_dir/ inside the repository, which I want to push to the GitHub repository. I tried committing and pushing the files with actions/checkout@v4, but it did not work:

      ...
      - name: add files to git # Below is a version that does not work
        uses: actions/checkout@v4
        with:
           token: ${{ secrets.REPO_TOKEN }}
      - name: do the actual push
        run: |
          git add data_dir/*.txt
          git commit -m "add files"
          git push

Running this, I received an error: nothing to commit, working tree clean. Error: Process completed with exit code 1.

The version that eventually worked looks like this:

      - name: Commit files
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
           token: ${{ secrets.REPO_TOKEN }}

Note that this action commits all the files produced to the repository, including some unwanted cached files. Therefore, I included a step before it to clean up those files:

      - name: Remove temp files
        run: |
          [ -d "package_dir/__pycache__" ] && rm -r package_dir/__pycache__

falwa release v2.0.0

A new release (v2.0.0) of the python package falwa has been published to cope with the deprecation of numpy.distutils in python 3.12. It involves some changes in the installation procedure, which you can find in the README section “Package Installation”.

Great thanks to Christopher Polster for figuring out a timely and clean solution for migration to python 3.12. 👏 For details and references related to this migration, users can refer to Christopher’s Pull request.

Socket Timeout error in BigDL Orca

To train deep learning models written in PyTorch on Big Data in a distributed manner, we use BigDL-Orca at work. 🛠️

Compared to the Keras interface of BigDL, PyTorch (Orca) supports more customization of the various components of a deep learning model. For example, with the bigdl-dllib keras API you are constrained to the operations available in the Autograd module when customizing loss functions, while in PyTorch (Orca) you can do whatever you like by creating a customized subclass of torch.nn.modules.loss._Loss. 😁
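For illustration, a minimal sketch of such a customized loss (the weighting scheme below is made up for demonstration, not something from our project):

import torch
from torch.nn.modules.loss import _Loss


class WeightedMSELoss(_Loss):
    """A hypothetical custom loss: mean squared error scaled by a fixed weight."""

    def __init__(self, weight: float = 1.0):
        super().__init__()
        self.weight = weight

    def forward(self, prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Arbitrary tensor operations are allowed here, which is the flexibility
        # that the bigdl-dllib keras Autograd module does not offer
        return self.weight * torch.mean((prediction - target) ** 2)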

One drawback of Orca, though, is the cryptic error logging: what happened within the Java class (i.e. what caused the error) is not logged at all. I got stuck on an error during model training, but all I got from the Spark log was socket timeout. There can be many causes, but the one I encountered was related to the size of train_data.

Great thanks to my colleague Kevin Mueller who figured out the cause 🙏 - when the partitions contain different numbers of batches in Orca, some barriers can never be reached, which results in this error.

To get around this, I dropped some rows to make sure the total size of train_data is a multiple of batch size:

train_data = train_data.limit(train_data.count() - train_data.count() % batch_size)

The training process worked afterwards. 😁

falwa release v1.3.0

A new release (v1.3.0) of the python package falwa, with some improvements in the numerical scheme and enhanced functionality, has been published:

https://github.com/csyhuang/hn2016_falwa/releases/tag/v1.3.0

If you find an error, or have any questions, please submit an issue ticket to let us know.

Thank you for your attention.

Running pytest coverage check locally

I wrote a blog post in 2021 about how to integrate the pytest coverage check into a GitHub workflow.

To run coverage locally, execute coverage run --source=falwa -m pytest tests/ && coverage report -m, which yields the report below (this one is from the PR for falwa release 1.3):

Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
falwa/__init__.py                 11      0   100%
falwa/barotropic_field.py         41      4    90%   79, 86, 93, 138
falwa/basis.py                    66      8    88%   57-62, 175, 186
falwa/constant.py                  6      0   100%
falwa/data_storage.py            146      3    98%   52, 59, 107
falwa/legacy/__init__.py           0      0   100%
falwa/legacy/beta_version.py     240    240     0%   1-471
falwa/netcdf_utils.py              9      9     0%   6-30
falwa/oopinterface.py            400     32    92%   297, 320, 336, 355, 366, 393, 550, 559, 721, 731, 734, 799, 818, 860, 870, 880, 890, 907, 918, 929, 993, 1006-1008, 1019, 1031, 1179-1180, 1470-1471, 1550-1565
falwa/plot_utils.py              125    125     0%   6-343
falwa/preprocessing.py             7      7     0%   6-30
falwa/stat_utils.py               11     11     0%   6-26
falwa/utilities.py                61     48    21%   58-92, 151-186, 242-255
falwa/wrapper.py                 146    146     0%   6-570
falwa/xarrayinterface.py         266     37    86%   107-109, 317, 322-323, 478, 616-648, 683-704
------------------------------------------------------------
TOTAL                           1535    670    56%

I guess it’s time to work on increasing coverage again. 🙂 (Too much work recently, though.)

CS topics not covered in class

Our team lead shared with us some useful learning materials on advanced CS topics not covered in class: The Missing Semester of Your CS Education from MIT. I’ll spend some time reading this.

Discussion on KAN (Kolmogorov-Arnold Networks)

I led the Machine Learning Journal Club discussion on the paper:

Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., … & Tegmark, M. (2024). KAN: Kolmogorov-Arnold Networks. arXiv preprint arXiv:2404.19756.

Here are the slides I made.

In short, I believe it can only be practically useful if the scalability problem is solved. 👀 Let’s see how the development of this technique goes.

Important announcement about the GitHub repo hn2016_falwa

Below is the email I sent to the users of GitHub repo hn2016_falwa:

I am writing to inform you of two recent releases of the GitHub repo: v1.0.0 (a major release) and v1.1.0 (a bug fix). You can refer to the release notes for the details. There are two important changes in v1.0.0:

  1. The python package is renamed from hn2016_falwa to falwa since this package implements finite-amplitude wave activity and flux calculations beyond those published in Huang and Nakamura (2016). The GitHub repo URL remains the same: https://github.com/csyhuang/hn2016_falwa/ . The package can be installed via pip as well: https://pypi.org/project/falwa/

  2. It happens that the bug fix release v0.7.2 contains a bug that over-corrects the nonlinear zonal advective flux term. v1.0.0 fixes this bug. Thanks to Christopher Polster for spotting the error. The fix requires re-compilation of the fortran modules.

The rest of the details can be found in the release notes:

Please let us know on issue page if you have any questions: https://github.com/csyhuang/hn2016_falwa/issues

GitHub Actions for Package Deployment

Bookmarking some useful links:

Deployment onto Conda forge

The deployment of the python package on Linux is not working (again). I am exploring solutions to automate the deployment. Here are some things I’ve found.

[Updated on 2023/12/11] After some research, it seems that scikit-build would be a continuously maintained solution: https://scikit-build.readthedocs.io/

to be continued.

IMPORTANT Bug fix release hn2016_falwa v0.7.2

We published an important bugfix release hn2016_falwa v0.7.2, which requires recompilation of fortran modules.

Two weeks ago, we discovered a mistake in the derivation of the expression for the nonlinear zonal advective flux term, which leads to an underestimation of that flux component.

We will submit corrigenda for Neal et al. (2022, GRL) and Nakamura and Huang (2018, Science) to update the numerical results. The correct derivation of the flux expression can be found in the corrected supplementary materials of NH18 (to be submitted soon). There is no change in the conclusions of any of the articles.

Please refer to Issue #83 for the numerical details and preliminary updated figures in NHN22 and NH18:

Thank you for your attention and let us know if we can help.

Generative AI with LLMs

Here are course notes I am taking from the DeepLearning.ai course on Coursera: Generative AI with Large Language Models.

Deploy package on pip and conda channel

To build a pip distribution that contains the source files only (without compiled modules):

python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*

To compile the package to .whl on Mac (Example: SciPy pip repository):

python setup.py bdist_wheel
python3 -m twine upload dist/*

This deploys the wheel file dist/hn2016_falwa-0.7.0-cp310-cp310-macosx_10_10_x86_64.whl to the pip channel.

However, when repeating the Mac procedures above on Linux, I got the error:

Binary wheel 'hn2016_falwa-0.7.0-cp311-cp311-linux_x86_64.whl' has an unsupported platform tag 'linux_x86_64'.

I googled and found this StackOverflow thread: Binary wheel can’t be uploaded on pypi using twine.

ManyLinux repo: https://github.com/pypa/manylinux

Good tutorial:

http://www.martin-rdz.de/index.php/2022/02/05/manually-building-manylinux-python-wheels/


Create a fresh environment:

$ conda create --name test_new

But using conda on Mac to compile the wheel gives this issue:

ld: unsupported tapi file type '!tapi-tbd' in YAML file

Create a virtual environment (not via conda):

python3 -m venv /Users/claresyhuang/testbed_venv
source /Users/claresyhuang/testbed_venv/bin/activate

MDTF meeting 2023/6/5

The google slides used in my presentation in the meeting of NOAA Model Diagnostics Task Force can be found here.

JetBrains Open-source license

Thrilled that my open-source climate data analysis package is sponsored by JetBrains’ Licenses for Open Source Development program. 🥳

JetBrainsCert

I’m really glad I started this project in 2016 when I was still in graduate school, with the hope that the climate data diagnostic proposed in my thesis could be applied by others more easily. Even though I have been working in industry since finishing my PhD, by maintaining this package I’ve established valuable connections with many academic researchers. 😊 I’m grateful that JetBrains supports the open-source community and encourages the culture of sharing.

JetBrains Licenses for Open Source Development: https://www.jetbrains.com/community/opensource/

Compile cython modules

It took me a while to figure out how to compile and import Cython modules. Here is a configuration that works:

├── hn2016_falwa
│   ├── cython_modules
│   │   ├── pyx_modules
│   │   │   ├── __init__.py
│   │   │   ├── dirinv_cython.pyx
│   │   ├── setup_cython.py
│   │   ├── check_import.py
...

In dirinv_cython.pyx I have:

def x_sq_minus_x(x):
    return x**2 - x

In setup_cython.py I have:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='dirinv_cython',
      package_dir={'pyx_modules': ''},
      ext_modules=cythonize("pyx_modules/dirinv_cython.pyx"))

First, I compile the cython modules by executing the following in the directory cython_modules:

python setup_cython.py build_ext --inplace

This produces dirinv_cython.c in the directory pyx_modules/.

Put this in __init__.py:

import dirinv_cython

Then I run the script check_import.py to test the imports:

from pyx_modules import dirinv_cython

ans = dirinv_cython.x_sq_minus_x(19)
print(f"ans = {ans}")

Executing check_import.py gives

ans = 342
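Side note: distutils is deprecated and removed in Python 3.12 (the same issue that prompted the falwa v2.0.0 migration), so a setuptools-based setup_cython.py would look like the sketch below (assuming the same directory layout; I have not re-tested it here):

from setuptools import setup
from Cython.Build import cythonize

# cythonize works the same way with setuptools as with distutils
setup(name='dirinv_cython',
      ext_modules=cythonize("pyx_modules/dirinv_cython.pyx"))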

Implementing QuantileTransformer in Spark - mapping any kind of distribution to a normal distribution

This blog post is motivated by the Scikit-learn documentation of QuantileTransformer and a StackExchange discussion thread about it.

There are 2 parts in this post. Part I reviews the idea of Quantile Transformer. Part II shows the implementation of Quantile Transformer in Spark using Pandas UDF.

Part I: The Quantile Transformer transforms data of arbitrary distribution to a normal (or uniform) distribution

Problem Statement: I have some individuals (id) with 3 attributes of different distributions. I want to combine them linearly and also want to make sure the outcome follows a normal distribution.

In python, I create a toy dataset with column id, and 3 columns corresponding to random variables following different distributions:

import numpy as np
import pandas as pd
import scipy
import math
import matplotlib.pyplot as plt

num_of_items = 10000  # the size of my population

df = pd.DataFrame({
    'id': [str(i) for i in np.arange(num_of_items)],
    'uniform': np.random.rand(num_of_items),
    'power_law': np.random.power(3, num_of_items),
    'exponential': np.random.exponential(1, num_of_items)})

The first 5 rows of df look like:

 id   uniform  power_law  exponential
  0  0.543253   0.690897     0.523969
  1  0.161339   0.802748     0.808497
  2  0.487836   0.818129     1.409843
  3  0.594641   0.808148     2.233953
  4  0.513764   0.783795     1.841159

Here is a visualization of their distributions:

Initial distribution

To transform all the columns to a normal distribution, first get the rank (or quantile, if ranking is too expensive) for each id in each column:

df_rank = df.rank(
    axis=0, method='average', numeric_only=True, na_option='keep', ascending=True, pct=True)

The first 5 rows of df_rank look like:

   uniform  power_law  exponential
0   0.5351     0.3388       0.4099
1   0.1544     0.5243       0.5543
2   0.4814     0.5546       0.7586
3   0.5882     0.5351       0.8950
4   0.5053     0.4885       0.8427

Let’s say we want to map these values to a normal distribution with mean=0.5 and standard deviation=0.15. To look up the value whose CDF equals each quantile (i.e. the inverse CDF of the normal distribution), we can use scipy.stats.norm.ppf:

df_transformed = df_rank.applymap(lambda col: scipy.stats.norm.ppf(col, loc=0.5, scale=0.15))

The first 5 rows of df_transformed look like:

    uniform  power_law  exponential
0  0.513214   0.437639     0.465830
1  0.347339   0.509142     0.520480
2  0.493004   0.520594     0.605271
3  0.533438   0.513214     0.688035
4  0.501993   0.495675     0.650843

Let’s see what the distributions of the values in df_transformed look like:

Transformed distribution

Perfect! Their distributions look identical now! 😝

If I do a uniform linear combination of them per id,

df_transformed['linear_combination'] = df_transformed.apply(
    lambda row: np.mean([row['uniform'], row['power_law'], row['exponential']]), axis=1)

I would get results with the same distribution. On the right, I show the results from a linear combination of the original values for comparison:

Linear combination

Another combination strategy would be to take the max value among the 3 columns. The transformed variable follows a similar distribution, though the mean shifts to larger values.

Max combination
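For reference, a minimal sketch of that max combination (the column name max_combination is mine):

# Take the row-wise maximum of the three transformed columns
df_transformed['max_combination'] = df_transformed[
    ['uniform', 'power_law', 'exponential']].max(axis=1)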

Part II: Implementation of Quantile Transformer in Spark

Given the introduction of Pandas UDF in Spark, the implementation is relatively simple. If ranking is too expensive, you can consider using an approximate quantile instead (a sketch is given at the end of this post).

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def to_normal_distribution(data: pd.Series) -> pd.Series:
    # Rank each value as a percentile, then look up the inverse CDF of the
    # target normal distribution (mean=0.5, std=0.15)
    return data.rank(pct=True).apply(
        lambda q: scipy.stats.norm.ppf(q, loc=0.5, scale=0.15))

(Note: Later I realized that the newest Spark version has pyspark.pandas.DataFrame.rank, see the Spark documentation. That’s not available on my workstation yet.)

You can append the transformed value to the original dataframe:

spark_df = spark_session.createDataFrame(df)
spark_df = spark_df.withColumn(
    'transformed_data', 
    to_normal_distribution(spark_df.uniform))
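As for the approximate-quantile option mentioned above, DataFrame.approxQuantile can return percentile boundaries of a column without an exact sort; a rough sketch using the toy uniform column from Part I:

# Approximate 1st-99th percentile boundaries of the `uniform` column,
# with relative error 0.001; these boundaries could be used to bucketize
# the data instead of computing an exact rank for every row
probabilities = [i / 100 for i in range(1, 100)]
quantile_boundaries = spark_df.approxQuantile('uniform', probabilities, 0.001)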

Aggregation of vectors using Spark Summarizer is too slow. How to get around it?

I have a dataframe df with columns id (integers) and document (string):

root
 |-- id: integer
 |-- document: string

Each id is associated with multiple documents. Each document is transformed into a vector representation (an embedding of dimension 100) using Spark NLP. Eventually, I want to get the average of the vectors associated with each id.

When testing with a small amount of data, i.e. 10k ids each associated with ~100 documents, pyspark.ml.stat.Summarizer does the job quickly:

from pyspark.ml.stat import Summarizer
import pyspark.sql.functions as F

df.groupby('id').agg(Summarizer.mean(F.col('embedding')).alias('mean_vector'))

However, the real situation is that I have to deal with Big Data consisting of 100M distinct ids and 200M distinct documents. Each id can be associated with up to 30k documents. The time taken to (1) attach the embeddings using Spark NLP and (2) aggregate the vectors per id was 10 hours, which is too slow!

Eventually, I figured out a way to do the same thing with the computing time shortened to ~2 hours.

Thanks to my colleague who spotted the bottleneck - step (1) is actually not slow. It was step (2) that took most of the time when there is a huge number of ids to work with. In this scenario, aggregating values in 100 separate columns is actually faster than aggregating 100-dimensional vectors.

Here is what I did to optimize the procedure:

1. Obtain vector representation as array for distinct document

One can specify in sparknlp.base.EmbeddingFinisher whether the output is a vector or an array. To make the split easier, I set the output to array:

from sparknlp.base import EmbeddingFinisher
...

finisher = EmbeddingFinisher() \
    .setOutputAsVector(False) \
    ...

2. Split the 100-d vector into 100 columns

Now I have document_df of schema

root
 |-- document: string
 |-- embedding: array
 |    |-- element: float

I split the vectors into 100 columns (v1, v2, … v100) by:

import pyspark.sql.functions as F

# Array indexing in Spark is 0-based: element i-1 of the embedding becomes column vi
split_to_cols = [F.col('embedding')[i - 1].alias(f'v{i}') for i in range(1, 101)]
document_df = document_df.select([F.col('document')] + split_to_cols)

3. Join the resultant dataframe with the original and do aggregation per column

I then join document_df to df and obtain the average of embedding columns per id:

avg_expr = [F.mean(F.col(f'v{i}')).alias(f'v{i}') for i in range(1, 101)]
df = df.join(document_df, on='document').groupby('id').agg(*avg_expr)

4. (If needed) Use VectorAssembler to concatenate the columns into vectors

One can turn the values in the 100 columns to a vector per id if needed:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=[f'v{i}' for i in range(1, 101)],
    outputCol='final_vector')
df = assembler.transform(df) \
    .select('id', 'final_vector')

That’s how I sped up the aggregation of the vector output from Spark NLP. Hope that helps. 😉
