Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

I love to share what I've learnt with others. Check out my blog posts and notes about my academic research, as well as technical solutions on software engineering and data science challenges.

Discussion on Difference-based Contrastive Learning for Sentence Embedding (DiffCSE)

I led the Machine Learning Journal Club discussion on the paper:

Chuang, Y. S., Dangovski, R., Luo, H., Zhang, Y., Chang, S., Soljačić, M., … & Glass, J. (2022). DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv preprint arXiv:2204.10298.

Here are the slides I made.

Paper published!

Our newest research paper:

Neal, Emily, Huang C. S. Y., Nakamura N. The 2021 Pacific Northwest heat wave and associated blocking: meteorology and the role of an upstream diabatic source of wave activity. Geophysical Research Letters (2021)

is available online today!

Package release hn2016_falwa v0.6.0

Here comes the package release hn2016_falwa v0.6.0. This version contains the updated algorithm for reference state inversion - now the reference state is solved with absolute vorticity field at 5N (defined by user) as boundary condition. The analysis code to reproduce results in Neal et al. “The 2021 Pacific Northwest heat wave and associated blocking: Meteorology and the role of an upstream cyclone as a diabatic source of wave activity” (submitted to GRL) can be found in the directory scripts/nhn_grl2022/.

Please refer to the release notes for details.

We are in a hurry to submit our manuscript revision, so this package version is not available on PyPI yet.😅 I will upload it to PyPI asap.

Discussion on Translation-Counterfactual Word Replacement (TCWR)

I led the Machine Learning Journal Club discussion on the paper:

Liu, Q., Kusner, M., & Blunsom, P. (2021, June). Counterfactual data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 187-197). Here are the slides I made.

Discussion on Supervised Contrastive Learning

I led the Machine Learning Journal Club discussion on the paper:

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., … & Krishnan, D. (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Here are the slides I made.

Package release hn2016_falwa v0.5.0

Here comes the package release hn2016_falwa v0.5.0. Please refer to the release notes for details. Great thanks to Christopher Polster who submitted a pull request to optimize the fortran modules, which makes the flux calculations a lot faster. 😄🙏🏻

The package is available on PyPI as well.

The CI procedures of the package has been migrated from Travis CI to GitHub workflow. I spent a while on fixing related issues, so I wrote a blog post about the procedures. Hopefully this can save others time if they want to do the same thing.

Migrating CI from Travis CI to GitHub Workflow

Few days ago, I realized that the coverage test for my python package, which was run on, has stopped running for two years. After migrating the to, the service is no longer free after spending 10000 credits. One has to pay for $69/mo 💸 for unlimited service.

Note that I already used 2170 credits when testing various things and preparing for upcoming package release in a day 😱(my fault?), so I don’t think a free plan is sustainable, but $69/mo is definitely too expensive if I only have to maintain a single package.

I found a free solution on GitHub marketplace: CodeCov. With this GitHub Action, one can run deployment test and code coverage diagnostic automatically when pushing commits. I spent a bit of time on trial and error, so I think it is worth the time to write down what I did to save time for others.

Define the workflow

The GitHub Workflow is defined in the file .github/workflows/workflow.yml. Here is what I did for my package:

name: CodeCov
on: [push, pull_request]
    runs-on: ubuntu-latest
      OS: ubuntu-latest
      PYTHON: '3.7'
    - uses: actions/checkout@v2
        fetch-depth: ‘2’

    - name: Setup Python
      uses: actions/setup-python@master
        python-version: 3.7
    - name: Generate Report
      run: |
        pip install coverage
        pip install -U pytest
        pip install --upgrade pip setuptools wheel
        sudo apt-get install gfortran
        pip install numpy
        pip install scipy
        coverage run pytest
    - name: Upload Coverage to Codecov
      run: |
        bash <(curl -s

The procedures under "Generate Report" are copied from my Travis job definition .travis.yml:

language: python
  - "3.7"

  - python --version
  - pip install -U pip
  - pip install -U pytest
  - pip install coverage

  - pip install --upgrade pip setuptools wheel
  - sudo apt-get install python-numpy python-scipy gfortran
  - pip install numpy
  - pip install scipy

# command to run tests
  - coverage run pytest

  - bash <(curl -s

In order to upload the coverage report to CodeCov, I used the command bash <(curl -s after failing to send report using the line uses: codecov/codecov-action@v2 from CodeCov template.

Specify files to include/exclude in the coverage test

The specifications are made in the file .coveragerc:


show_missing = True
omit =

Since the modules in the folder legacy/ is no longer maintained, I exclude that from the coverage test.

Implement coverage test locally

If you have the python package coverage installed, you can test the coverage run command, e.g.

$ coverage run pytest

which will generate a coverage report .coverage. To visualize the report, use the command

$ coverage report

to render .coverage in readable manner. As an example, the coverage report of my package looks like:

Name                               Stmts   Miss  Cover   Missing
hn2016_falwa/               0      0   100%
hn2016_falwa/      39      4    90%   72, 79, 86, 131
hn2016_falwa/                 46      2    96%   113, 124
hn2016_falwa/               6      0   100%
hn2016_falwa/         270     28    90%   151, 222, ...
hn2016_falwa/             61     48    21%   53-87, 146-181, 237-250
hn2016_falwa/              154    154     0%   1-579
TOTAL                                576    236    59%

Include Build badge and Coverage badge in README of your repo

As an example, the image path to the build status of my package is:

Build Status I can check my build status here.

The image path to my CodeCov results is: The report is visualized on CodeCov like this.

Discussion on the application of Contrastive Learning to train Sentence Embedding

In the previous discussion I led, I introduced the idea of learning visual representation in unsupervised manner. This week, I am introducing the following paper that applies deep contrastive learning to train sentence embedding:

Giorgi, J. M., Nitski, O., Bader, G. D., & Wang, B. (2020). Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

The presentation slides I used for discussion can be found here.

Discussion on Contrastive Learning

I led the Machine Learning Journal Club discussion on the two papers:

You can find here the slides I made which provides an introduction to Contrastive Learning. Below are the main points from the slides:

  • Contrastive learning is a self-supervised method to learn a representation of objects by maximizing/minimizing distance between the same/different class(es)
  • Contrastive learning benefits from data augmentation and increase in model parameters
  • Under-clustering occurs when there is not enough negative samples to learn from; over-clustering occurs when the model overfits (memorize the data)
  • To solve inefficiency issue, median(rank-k) triplet loss is used instead of the total loss

I finished the GANs Specialization on Coursera!

I finished the 3-course Generative Adversarial Networks (GANs) Specialization on Coursera! It was super fun! Here are some slides I presented in a deep learning journal club with my peers, which is a survey of GANs covered in this Specialization.

Click here to view my certificate. 😬

Added a directory for some of my music arrangements

🎹 I have been uploading improvised piano cover on my YouTube channel. Recently, I received some comments asking for the music sheet of my arrangement, and it seems to be a good idea to keep a record of my work. Therefore, I started making music sheets of my arrangement with MuseScore to share with others.

🎼 Check out the page Music Arrangement for the collection of arrangement with music sheets I’ve made. For the rest of my recordings, go to my YouTube Channel Clare’s Studio.

Databricks Certified Associate Developer for Apache Spark 3.0

I passed the Databricks Certified Associate Developer exam for Apache Spark 3.0 (python). Here is my certificate!

I registered for the exam when joining the Spark Summit this June, hoping to set a goal to push myself to dive deeper into spark architecture and performance tuning.

[Experience sharing] On top of coding with pyspark at work (which helps me with most of the syntax questions in the exam), my exam preparation mainly involves reading the two books, Spark: The Definitive Guide and Learning Spark 2.0.

It was my very first time taking an online proctored exam at home, and there were two things I wish I could have known before the exam:

  1. The spark documentation (PDF file) provided by Sentinel (i.e., the exam software) is not searchable. One has to scroll through the page to find what s/he needs.
  2. The proctor would check your workspace configuration during the exam (i.e., not at the beginning). The exam would pause during the check, so you don’t have to worry about losing time.

I believe there will be more tests conducted with an online proctor given the evolution of the pandemic. Perhaps we will get used to the online test workspace setup at home very soon.

Discussion on Deep Compression

I was leading a discussion on the paper Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. It introduces the methods to reduce the storage requirement of neural networks such that it can be stored on smaller devices.

The slides I made for the discussion can be found here.

Notes on ModelOps and MLOps talks

I wrote a short article on Medium, ODSC West 2020: Notes on ModelOps and MLOps talks, about what I learned from two talks in ODSC West 2020.

Running a single test case in the unittest suite

If your unit tests are written using the unittest package, to run a single test case in the TestSuite, the command line syntax is

python test -m tests.test_module.TestClass.test_case

Stack Overflow reference: Does unittest allow single case/suite testing through “ test”?

If your unit tests are written using pytest, the command used would be

pytest tests/

New pip release and changes in its way to resolve dependency conflicts

When I was trying to set up a conda environment to run a package…

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

numpydoc 1.1.0 requires sphinx>=1.6.5, but you'll have sphinx 1.5.3 which is incompatible.

After googling, I found this post on StackOverflow explaining the issue - it is related to the changes in pip’s release 20.2.

To implement the check, install packages with the command

pip install --use-feature=2020-resolver package

Submitting pull request from forked repo to main repo

It’s HacktoberFest again! 👻 Here are some useful commands to merge your forked repo with upstream changes from the original repo (Also see GitHub Docs).

After forking, specify remote upstream repository:

git remote add upstream

You can then check the upstream locations via the command git remote -v.

Before submitting a pull request, you have to make sure your branch contains all the commits from upstream. You can do so by:

git fetch upstream
git checkout master
git merge upstream/master

🤓 Have fun coding! ⌨️

Split a vector/list in a pyspark DataFrame into columns

Split an array column

To split a column with arrays of strings, e.g. a DataFrame that looks like,

|   strCol|
|[A, B, C]|

into separate columns, the following code without the use of UDF works.

import pyspark.sql.functions as F

df2 =[F.col("strCol")[i] for i in range(3)])


|        A|        B|        C|

Split a vector column

To split a column with doubles stored in DenseVector format, e.g. a DataFrame that looks like,

|       intCol|

one have to construct a UDF that does the convertion of DenseVector to array(python list) first:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

def split_array_to_list(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

df3 ="intCol")).alias("split_int"))\
    .select([F.col("split_int")[i] for i in range(3)])


|         2.0|         3.0|         4.0|

Ranking hierarchical labels with SQL

To give indices to hierarchical labels, I can use DENSE_RANK() or RANK() depending on the situation. For example, if I have a DataFrame that looks like this:

|Fridge|    Fruits|
|     A|     apple|
|     B|    orange|
|     C|     apple|
|     D|     pears|
|     C|Watermelon|

The following SQL code

    DENSE_RANK() OVER (ORDER BY Fridge, Fruits) AS Loc_id,
    DENSE_RANK() OVER (ORDER BY Fridge) AS Fridge_id_dense,
    RANK() OVER (ORDER BY Fridge) AS Fridge_id,
    DENSE_RANK() OVER (ORDER BY Fruits) AS Fruit_id_dense,
    RANK() OVER (ORDER BY Fruits) AS Fruit_id
FROM fridge_list

would yield the following table

|Fridge|    Fruits|Loc_id|Fridge_id_dense|Fridge_id|Fruit_id_dense|Fruit_id|
|     A|     apple|     1|              1|        1|             2|       2|
|     B|    orange|     2|              2|        2|             3|       4|
|     C|Watermelon|     3|              3|        3|             1|       1|
|     C|     apple|     4|              3|        3|             2|       2|
|     D|     pears|     5|              4|        5|             4|       5|

Custom Transformer that can be fitted into Pipeline

How to construct a custom Transformer that can be fitted into a Pipeline object? I learned from a colleague today how to do that.

Below is an example that includes all key components:

from pyspark import keyword_only
from import Transformer
from import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

class CustomTransformer(Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable):
  input_col = Param(Params._dummy(), "input_col", "input column name.", typeConverter=TypeConverters.toString)
  output_col = Param(Params._dummy(), "output_col", "output column name.", typeConverter=TypeConverters.toString)
  def __init__(self, input_col: str = "input", output_col: str = "output"):
    super(CustomTransformer, self).__init__()
    self._setDefault(input_col=None, output_col=None)
    kwargs = self._input_kwargs
  def set_params(self, input_col: str = "input", output_col: str = "output"):
    kwargs = self._input_kwargs
  def get_input_col(self):
    return self.getOrDefault(self.input_col)
  def get_output_col(self):
    return self.getOrDefault(self.output_col)
  def _transform(self, df: DataFrame):
    input_col = self.get_input_col()
    output_col = self.get_output_col()
    # The custom action: concatenate the integer form of the doubles from the Vector
    transform_udf = F.udf(lambda x: '/'.join([str(int(y)) for y in x]), StringType())
    return df.withColumn(output_col, transform_udf(input_col))

Let’s test it with a dataframe

df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),), (Vectors.dense([0.4, 0.9, 7.0]),)], ["numbers"])

and a Pipeline like this:

from import ElementwiseProduct
from import Vectors
from import Pipeline

elementwise_product = ElementwiseProduct(scalingVec=Vectors.dense([2.0, 3.0, 5.0]), inputCol="numbers", outputCol="product")
custom_transformer = CustomTransformer(input_col="product", output_col="results")
pipeline = Pipeline(stages=[elementwise_product, custom_transformer])
model =
results = model.transform(df)

I manage to get the expected results:

|      numbers|       product|results|
|[2.0,1.0,3.0]|[4.0,3.0,15.0]| 4/3/15|
|[0.4,0.9,7.0]|[0.8,2.7,35.0]| 0/2/35|

my widget for counting (since Dec24, 2016)