Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

I love to share what I've learnt with others. Check out my blog posts and notes about my academic research, as well as technical solutions on software engineering and data science challenges.


Package release hn2016_falwa v0.5.0

Here comes the package release hn2016_falwa v0.5.0. Please refer to the release notes for details. Great thanks to Christopher Polster who submitted a pull request to optimize the fortran modules, which makes the flux calculations a lot faster. 😄🙏🏻

The package is available on PyPI as well.

The CI procedures of the package has been migrated from Travis CI to GitHub workflow. I spent a while on fixing related issues, so I wrote a blog post about the procedures. Hopefully this can save others time if they want to do the same thing.

Migrating CI from Travis CI to GitHub Workflow

Few days ago, I realized that the coverage test for my python package, which was run on travis-ci.org, has stopped running for two years. After migrating the to travis-ci.com, the service is no longer free after spending 10000 credits. One has to pay for $69/mo 💸 for unlimited service.

Note that I already used 2170 credits when testing various things and preparing for upcoming package release in a day 😱(my fault?), so I don’t think a free plan is sustainable, but $69/mo is definitely too expensive if I only have to maintain a single package.

I found a free solution on GitHub marketplace: CodeCov. With this GitHub Action, one can run deployment test and code coverage diagnostic automatically when pushing commits. I spent a bit of time on trial and error, so I think it is worth the time to write down what I did to save time for others.

Define the workflow

The GitHub Workflow is defined in the file .github/workflows/workflow.yml. Here is what I did for my package:

name: CodeCov
on: [push, pull_request]
jobs:
  run:
    runs-on: ubuntu-latest
    env:
      OS: ubuntu-latest
      PYTHON: '3.7'
    steps:
    - uses: actions/checkout@v2
      with:
        fetch-depth: ‘2’

    - name: Setup Python
      uses: actions/setup-python@master
      with:
        python-version: 3.7
    - name: Generate Report
      run: |
        pip install coverage
        pip install -U pytest
        pip install --upgrade pip setuptools wheel
        sudo apt-get install gfortran
        pip install numpy
        pip install scipy
        coverage run setup.py pytest
    - name: Upload Coverage to Codecov
      run: |
        bash <(curl -s https://codecov.io/bash)

The procedures under "Generate Report" are copied from my Travis job definition .travis.yml:

language: python
python:
  - "3.7"

before_install:
  - python --version
  - pip install -U pip
  - pip install -U pytest
  - pip install coverage

install:
  - pip install --upgrade pip setuptools wheel
  - sudo apt-get install python-numpy python-scipy gfortran
  - pip install numpy
  - pip install scipy

# command to run tests
script:
  - coverage run setup.py pytest

after_success:
  - bash <(curl -s https://codecov.io/bash)

In order to upload the coverage report to CodeCov, I used the command bash <(curl -s https://codecov.io/bash) after failing to send report using the line uses: codecov/codecov-action@v2 from CodeCov template.

Specify files to include/exclude in the coverage test

The specifications are made in the file .coveragerc:

[run]
source=hn2016_falwa

[report]
show_missing = True
omit =
    hn2016_falwa/legacy/*

Since the modules in the folder legacy/ is no longer maintained, I exclude that from the coverage test.

Implement coverage test locally

If you have the python package coverage installed, you can test the coverage run command, e.g.

$ coverage run setup.py pytest

which will generate a coverage report .coverage. To visualize the report, use the command

$ coverage report

to render .coverage in readable manner. As an example, the coverage report of my package looks like:

Name                               Stmts   Miss  Cover   Missing
----------------------------------------------------------------
hn2016_falwa/__init__.py               0      0   100%
hn2016_falwa/barotropic_field.py      39      4    90%   72, 79, 86, 131
hn2016_falwa/basis.py                 46      2    96%   113, 124
hn2016_falwa/constant.py               6      0   100%
hn2016_falwa/oopinterface.py         270     28    90%   151, 222, ...
hn2016_falwa/utilities.py             61     48    21%   53-87, 146-181, 237-250
hn2016_falwa/wrapper.py              154    154     0%   1-579
----------------------------------------------------------------
TOTAL                                576    236    59%

Include Build badge and Coverage badge in README of your repo

As an example, the image path to the build status of my package is:

https://github.com/csyhuang/hn2016_falwa/actions/workflows/workflow.yml/badge.svg

Build Status I can check my build status here.

The image path to my CodeCov results is:

https://codecov.io/gh/csyhuang/hn2016_falwa/branch/master/graph/badge.svg

codecov.io The report is visualized on CodeCov like this.

Discussion on the application of Contrastive Learning to train Sentence Embedding

In the previous discussion I led, I introduced the idea of learning visual representation in unsupervised manner. This week, I am introducing the following paper that applies deep contrastive learning to train sentence embedding:

Giorgi, J. M., Nitski, O., Bader, G. D., & Wang, B. (2020). Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

The presentation slides I used for discussion can be found here.

Discussion on Contrastive Learning

I am leading the Machine Learning Journal Club discussion on the two papers:

You can find here the slides I made which provides an introduction to Contrastive Learning. Below are the main points from the slides:

  • Contrastive learning is a self-supervised method to learn a representation of objects by maximizing/minimizing distance between the same/different class(es)
  • Contrastive learning benefits from data augmentation and increase in model parameters
  • Under-clustering occurs when there is not enough negative samples to learn from; over-clustering occurs when the model overfits (memorize the data)
  • To solve inefficiency issue, median(rank-k) triplet loss is used instead of the total loss

I finished the GANs Specialization on Coursera!

I finished the 3-course Generative Adversarial Networks (GANs) Specialization on Coursera! It was super fun! Here are some slides I presented in a deep learning journal club with my peers, which is a survey of GANs covered in this Specialization.

Click here to view my certificate. 😬

Added a directory for some of my music arrangements

🎹 I have been uploading improvised piano cover on my YouTube channel. Recently, I received some comments asking for the music sheet of my arrangement, and it seems to be a good idea to keep a record of my work. Therefore, I started making music sheets of my arrangement with MuseScore to share with others.

🎼 Check out the page Music Arrangement for the collection of arrangement with music sheets I’ve made. For the rest of my recordings, go to my YouTube Channel Clare’s Studio.

Databricks Certified Associate Developer for Apache Spark 3.0

I passed the Databricks Certified Associate Developer exam for Apache Spark 3.0 (python). Here is my certificate!

I registered for the exam when joining the Spark Summit this June, hoping to set a goal to push myself to dive deeper into spark architecture and performance tuning.

[Experience sharing] On top of coding with pyspark at work (which helps me with most of the syntax questions in the exam), my exam preparation mainly involves reading the two books, Spark: The Definitive Guide and Learning Spark 2.0.

It was my very first time taking an online proctored exam at home, and there were two things I wish I could have known before the exam:

  1. The spark documentation (PDF file) provided by Sentinel (i.e., the exam software) is not searchable. One has to scroll through the page to find what s/he needs.
  2. The proctor would check your workspace configuration during the exam (i.e., not at the beginning). The exam would pause during the check, so you don’t have to worry about losing time.

I believe there will be more tests conducted with an online proctor given the evolution of the pandemic. Perhaps we will get used to the online test workspace setup at home very soon.

Discussion on Deep Compression

I was leading a discussion on the paper Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. It introduces the methods to reduce the storage requirement of neural networks such that it can be stored on smaller devices.

The slides I made for the discussion can be found here.

Notes on ModelOps and MLOps talks

I wrote a short article on Medium, ODSC West 2020: Notes on ModelOps and MLOps talks, about what I learned from two talks in ODSC West 2020.

Running a single test case in the unittest suite

If your unit tests are written using the unittest package, to run a single test case in the TestSuite, the command line syntax is

python setup.py test -m tests.test_module.TestClass.test_case

Stack Overflow reference: Does unittest allow single case/suite testing through “setup.py test”?

If your unit tests are written using pytest, the command used would be

pytest tests/test_module.py::test_case

New pip release and changes in its way to resolve dependency conflicts

When I was trying to set up a conda environment to run a package…

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

numpydoc 1.1.0 requires sphinx>=1.6.5, but you'll have sphinx 1.5.3 which is incompatible.

After googling, I found this post on StackOverflow explaining the issue - it is related to the changes in pip’s release 20.2.

To implement the check, install packages with the command

pip install --use-feature=2020-resolver package

Submitting pull request from forked repo to main repo

It’s HacktoberFest again! 👻 Here are some useful commands to merge your forked repo with upstream changes from the original repo (Also see GitHub Docs).

After forking, specify remote upstream repository:

git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git

You can then check the upstream locations via the command git remote -v.

Before submitting a pull request, you have to make sure your branch contains all the commits from upstream. You can do so by:

git fetch upstream
git checkout master
git merge upstream/master

🤓 Have fun coding! ⌨️

Split a vector/list in a pyspark DataFrame into columns

Split an array column

To split a column with arrays of strings, e.g. a DataFrame that looks like,

+---------+
|   strCol|
+---------+
|[A, B, C]|
+---------+

into separate columns, the following code without the use of UDF works.

import pyspark.sql.functions as F

df2 = df.select([F.col("strCol")[i] for i in range(3)])
df2.show()

Output:

+---------+---------+---------+
|strCol[0]|strCol[1]|strCol[2]|
+---------+---------+---------+
|        A|        B|        C|
+---------+---------+---------+

Split a vector column

To split a column with doubles stored in DenseVector format, e.g. a DataFrame that looks like,

+-------------+
|       intCol|
+-------------+
|[2.0,3.0,4.0]|
+-------------+

one have to construct a UDF that does the convertion of DenseVector to array(python list) first:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

def split_array_to_list(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

df3 = df.select(split_array_to_list(F.col("intCol")).alias("split_int"))\
    .select([F.col("split_int")[i] for i in range(3)])
df3.show()

Output:

+------------+------------+------------+
|split_int[0]|split_int[1]|split_int[2]|
+------------+------------+------------+
|         2.0|         3.0|         4.0|
+------------+------------+------------+

Ranking hierarchical labels with SQL

To give indices to hierarchical labels, I can use DENSE_RANK() or RANK() depending on the situation. For example, if I have a DataFrame that looks like this:

+------+----------+
|Fridge|    Fruits|
+------+----------+
|     A|     apple|
|     B|    orange|
|     C|     apple|
|     D|     pears|
|     C|Watermelon|
+------+----------+

The following SQL code

SELECT
    Fridge,
    Fruits,
    DENSE_RANK() OVER (ORDER BY Fridge, Fruits) AS Loc_id,
    DENSE_RANK() OVER (ORDER BY Fridge) AS Fridge_id_dense,
    RANK() OVER (ORDER BY Fridge) AS Fridge_id,
    DENSE_RANK() OVER (ORDER BY Fruits) AS Fruit_id_dense,
    RANK() OVER (ORDER BY Fruits) AS Fruit_id
FROM fridge_list

would yield the following table

+------+----------+------+---------------+---------+--------------+--------+
|Fridge|    Fruits|Loc_id|Fridge_id_dense|Fridge_id|Fruit_id_dense|Fruit_id|
+------+----------+------+---------------+---------+--------------+--------+
|     A|     apple|     1|              1|        1|             2|       2|
|     B|    orange|     2|              2|        2|             3|       4|
|     C|Watermelon|     3|              3|        3|             1|       1|
|     C|     apple|     4|              3|        3|             2|       2|
|     D|     pears|     5|              4|        5|             4|       5|
+------+----------+------+---------------+---------+--------------+--------+

Custom Transformer that can be fitted into Pipeline

How to construct a custom Transformer that can be fitted into a Pipeline object? I learned from a colleague today how to do that.

Below is an example that includes all key components:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

class CustomTransformer(Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable):
  input_col = Param(Params._dummy(), "input_col", "input column name.", typeConverter=TypeConverters.toString)
  output_col = Param(Params._dummy(), "output_col", "output column name.", typeConverter=TypeConverters.toString)
  
  @keyword_only
  def __init__(self, input_col: str = "input", output_col: str = "output"):
    super(CustomTransformer, self).__init__()
    self._setDefault(input_col=None, output_col=None)
    kwargs = self._input_kwargs
    self.set_params(**kwargs)
    
  @keyword_only
  def set_params(self, input_col: str = "input", output_col: str = "output"):
    kwargs = self._input_kwargs
    self._set(**kwargs)
    
  def get_input_col(self):
    return self.getOrDefault(self.input_col)
  
  def get_output_col(self):
    return self.getOrDefault(self.output_col)
  
  def _transform(self, df: DataFrame):
    input_col = self.get_input_col()
    output_col = self.get_output_col()
    # The custom action: concatenate the integer form of the doubles from the Vector
    transform_udf = F.udf(lambda x: '/'.join([str(int(y)) for y in x]), StringType())
    return df.withColumn(output_col, transform_udf(input_col))
  

Let’s test it with a dataframe

df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),), (Vectors.dense([0.4, 0.9, 7.0]),)], ["numbers"])

and a Pipeline like this:

from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline

elementwise_product = ElementwiseProduct(scalingVec=Vectors.dense([2.0, 3.0, 5.0]), inputCol="numbers", outputCol="product")
custom_transformer = CustomTransformer(input_col="product", output_col="results")
pipeline = Pipeline(stages=[elementwise_product, custom_transformer])
model = pipeline.fit(df)
results = model.transform(df)
results.show()

I manage to get the expected results:

+-------------+--------------+-------+
|      numbers|       product|results|
+-------------+--------------+-------+
|[2.0,1.0,3.0]|[4.0,3.0,15.0]| 4/3/15|
|[0.4,0.9,7.0]|[0.8,2.7,35.0]| 0/2/35|
+-------------+--------------+-------+

Minor release of my python package + release procedures

[hn2016_falwa Release 0.4.1] A minor release of my python package hn2016_falwa is published. Thanks Christopher Polster for submitting a pull request that fixes the interface of BarotropicField. Moreover, I added procedures to process masked array in QGField such that it can be conveniently used to process ERA5 data which is stored as masked array in netCDF files.

As a memo to myself - procedures for a release (which I often forget and have to google 😅):

  • Update version number in setup.py, readme.md and documentation pages.
  • Add a (light-weighted) tag to the commit: git tag <tagname>.
  • Not only push the commits but also the tag by git push origin <tagname>.
  • Update on Aug 15, 2021: To push the commit and corresponding tag simultaneously, use
    git push --atomic origin <branch name> <tag>
    

If I have time, I would update the version on PYPI too:

  • Create the dist/ directory and the installation files: python3 setup.py sdist bdist_wheel
  • Upload the package onto TestPyPI to test deployment: python3 -m twine upload --repository testpypi dist/*
  • Deploy the package onto PyPI for real: python3 -m twine upload dist/*

Bulk download of ERA5 data from CDSAPI

I wrote a sample script to download ERA5 reanalysis data via CDSAPI month by month. Here is the GitHub repo with instructions how to use it.

Reading Notes on Spark - The Definitive Guide

I am reading the book Spark: The Definitive Guide by Bill Chambers, Matei Zaharia. Here are my reading notes:

READ MORE button via jekyll

Found a workable solution adding “read more” button to jekyll posts from Jonny Langefeld’s blog post. 😄 Thanks for the solution!

Here we go. 😉

More efficient way to do outer join with large dataframes

Today I learned from a colleague the way of doing outer join of large dataframes more efficiently: instead of doing the outer join, you can first union the key column, and then implement left join twice. I have done an experiment myself on the cluster with two dataframes (df1 and df2) - each dataframe has ~10k rows, and there is only ~10% of overlap(i.e. an inner-join would result in ~1k rows).

The usual way of doing outer join would be like:

df3 = df1.join(df2, how='outer', on='id').drop_duplicates()

Here is an equivalent way(I call it union-left here) that takes less time to compute:

df3 = df1.select('id').union(df2.select('id'))
df3 = df3.join(x1_df, how='left', on='id')
df3 = df3.join(x2_df, how='left', on='id').drop_duplicates()

The distribution of IDs in the two dataframes:

  df1 only overlap df2 only
# of ids 8625 914 8623

Here is the distribution of computing times for inner, left, outer, right and union-left(that gives same results as outer) joins(I repeated each join 20 times):

For these sizes of dataframes, the union-left join is on average ~20% faster than the equivalent outer join.

my widget for counting (since Dec24, 2016)