I love to share what I've learnt with others. Check out my blog posts and notes about my
academic research, as well as technical solutions on software engineering and data
science challenges. Opinions expressed in this blog are solely my own.
Thrilled that my open-source climate data analysis package gets sponsored by JetBrains Licenses for Open Source Development. 🥳
I’m really glad I started this project in 2016 when I was still in graduate school, with the hope that the climate data diagnostic proposed in my thesis could be applied by others more easily. Even though I have been working in industry since finishing my PhD, by maintaining this package, I’ve established valuable connections with many academic researchers. 😊 I’m grateful that JetBrains supports the open-source community and encourages the culture of sharing.
There are 2 parts in this post. Part I reviews the idea of Quantile Transformer. Part II shows the implementation of Quantile Transformer in Spark using Pandas UDF.
Part I: Quantile Transformer transforms data of arbitrary distribution to normal (or uniform) distribution
Problem Statement: I have some individuals (id) with 3 attributes of different distributions. I want to combine them linearly and also want to make sure the outcome follows a normal distribution.
In python, I create a toy dataset with column id, and 3 columns corresponding to random variables following different distributions:
    import numpy as np
    import pandas as pd
    import scipy
    import math
    import matplotlib.pyplot as plt

    num_of_items = 10000  # the size of my population
    df = pd.DataFrame({
        'id': [str(i) for i in np.arange(num_of_items)],
        'uniform': np.random.rand(num_of_items),
        'power_law': np.random.power(3, num_of_items),
        'exponential': np.random.exponential(1, num_of_items)})
Let’s say we want to map these values to a normal distribution with mean=0.5 and standard deviation=0.15. To look up the corresponding value in the CDF of normal distribution, we can use scipy.stats.norm.ppf:
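As a minimal sketch of this mapping (not necessarily the exact code behind the figures in this post), one could rank each column, turn the ranks into percentiles strictly between 0 and 1, and pass them to scipy.stats.norm.ppf. The helper name quantile_transform_to_normal and the (rank - 0.5) / n percentile convention are my assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm


def quantile_transform_to_normal(values: pd.Series, mean: float = 0.5,
                                 std: float = 0.15) -> pd.Series:
    """Map values of arbitrary distribution onto N(mean, std**2) via their ranks."""
    # Average ranks run from 1 to n; shifting by 0.5 keeps the percentiles
    # strictly inside (0, 1), so norm.ppf never returns +/- infinity.
    percentiles = (values.rank(method="average") - 0.5) / len(values)
    return pd.Series(norm.ppf(percentiles, loc=mean, scale=std), index=values.index)


# Toy data with the same three distributions as above (seeded for reproducibility).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "uniform": rng.random(10000),
    "power_law": rng.power(3, 10000),
    "exponential": rng.exponential(1, 10000),
})

# Apply the transform column by column.
transformed = toy.apply(quantile_transform_to_normal)
```

Because only the ranks are used, all three transformed columns end up with practically the same mean (0.5) and standard deviation (about 0.15), whatever the original distribution was.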
I would get results of the same distribution. On the right, I show the results from linear combination of the original values for comparison:
Another combination strategy would be to take the max value among the 3 columns. The transformed variable follows a similar distribution, although the mean shifts to larger values.
Part II: Implementation of Quantile Transformer in Spark
Given the introduction of Pandas UDF in Spark, the implementation is relatively simple. If ranking is too expensive, you can consider using approximate quantile instead.
(Note: Later I realized that the newest Spark version has pyspark.pandas.DataFrame.rank, see Spark documentation. That’s not available at my work station yet.)
You can append the transformed value to the original dataframe:
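A rough sketch of how this could look, under stated assumptions: the percentile is computed with a Spark window function (percent_rank), and a Pandas UDF then maps percentiles to normal quantiles. The function names percentile_to_normal and append_normal_column, the temporary column name and the clipping threshold are hypothetical choices of mine, not taken from the original implementation:

```python
import pandas as pd
from scipy.stats import norm


def percentile_to_normal(pct: pd.Series, mean: float = 0.5,
                         std: float = 0.15) -> pd.Series:
    """Pure-pandas core: percentiles in [0, 1] -> quantiles of N(mean, std**2)."""
    # Clip so that percentiles of exactly 0 or 1 do not map to +/- infinity.
    clipped = pct.clip(lower=1e-6, upper=1 - 1e-6)
    return pd.Series(norm.ppf(clipped, loc=mean, scale=std), index=pct.index)


def append_normal_column(sdf, col: str, out_col: str):
    """Append a quantile-transformed copy of `col` to the Spark DataFrame `sdf`."""
    # Imported lazily so that the pandas part above can be used without Spark.
    from pyspark.sql import Window
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_normal(pct: pd.Series) -> pd.Series:
        return percentile_to_normal(pct)

    # A window without partitionBy ranks the whole column in one partition;
    # this is the expensive ranking step mentioned above.
    ranked = sdf.withColumn("_pct", F.percent_rank().over(Window.orderBy(col)))
    return ranked.withColumn(out_col, to_normal(F.col("_pct"))).drop("_pct")
```

Calling append_normal_column(sdf, "power_law", "power_law_normal") would return the original columns plus the transformed one. Keeping the pandas logic in a plain function makes it easy to check locally before running it on a cluster.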
I have a dataframe df with columns id (integers) and document (string):
root
|-- id: integer
|-- document: string
Each id is associated with multiple documents. Each document would be transformed into a vector representation (embedding of dimension 100) using Spark NLP. Eventually, I want to get the average of vectors associated with each id.
When testing with a small amount of data, i.e. 10k id, each associated with ~100 document, pyspark.ml.stat.Summarizer does the job quickly:
However, in the real situation I have to deal with Big Data that consists of 100M distinct id and 200M distinct document. Each id can be associated with at most 30k document. Taken together, (1) attaching embeddings using Spark NLP and (2) aggregating the vectors per id took 10 hours, which is too slow!
Eventually, I figured out a way to do the same thing while having the computing time shortened to ~2 hours.
Thanks to my colleague who spotted the bottleneck: step (1) is actually not slow. It was step (2) that took most of the time when there is a huge number of id to work with. In this scenario, aggregating values in 100 separate columns is actually faster than aggregating 100-dimensional vectors.
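To make the column-wise idea concrete, here is a minimal PySpark sketch. It assumes a dataframe that already has an array-typed column (called embedding here, a hypothetical name) next to id; the helper names and the emb_ prefix are likewise my own choices and not the original code:

```python
def dim_column_names(dim: int, prefix: str = "emb") -> list:
    """Names of the scalar columns that hold one embedding dimension each."""
    return [f"{prefix}_{i}" for i in range(dim)]


def average_embedding_per_id(df, dim: int = 100, id_col: str = "id",
                             emb_col: str = "embedding"):
    """Average array-typed embeddings per id by aggregating one column per dimension."""
    # Imported lazily so the helper above can be used without a Spark installation.
    from pyspark.sql import functions as F

    names = dim_column_names(dim)
    # Split the array into `dim` scalar columns.
    split = df.select(
        id_col, *[F.col(emb_col)[i].alias(name) for i, name in enumerate(names)]
    )
    # Plain column averages per id, instead of a vector aggregation.
    averaged = split.groupBy(id_col).agg(*[F.avg(name).alias(name) for name in names])
    # Reassemble the per-dimension averages into a single array column.
    return averaged.select(id_col, F.array(*names).alias("avg_embedding"))
```

The design choice is simply to let Spark's built-in avg work on scalar columns and to rebuild the array only once at the end.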
Here is what I do to optimize the procedures:
1. Obtain vector representation as array for distinct document
One can specify in sparknlp.base.EmbeddingFinisher whether you want to output a vector or an array. To make the split easier, I set the output as array:
Sometimes, Hive tables are saved in a suboptimal way that creates lots of small files and reduces the performance of the cluster. This blog post is about PySpark functions that reduce the number of files (and can shrink the storage size when the data is indeed sparse).
To check the number of files and their size, I used this HDFS command:
There will be 4 columns printed out. As an example, here is the output for one of the tables I shrank today:
Directory count   File count   Content size   Table name
3                 854          104950877      original_table_w_many_small_files
When I check the file sizes of the 854 files using hadoop cluster_name fs -du -h /hive/warehouse/myschema.db/original_table_w_many_small_files/*/*/*, I find that all of them are indeed of small size:
99.9 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00845-29c4c506-e991-4d1d-be67-43e0a9976179.c000
102.7 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00846-29c4c506-e991-4d1d-be67-43e0a9976179.c000
104.4 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00847-29c4c506-e991-4d1d-be67-43e0a9976179.c000
100.6 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00848-29c4c506-e991-4d1d-be67-43e0a9976179.c000
98.8 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00849-29c4c506-e991-4d1d-be67-43e0a9976179.c000
108.5 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00850-29c4c506-e991-4d1d-be67-43e0a9976179.c000
106.5 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00851-29c4c506-e991-4d1d-be67-43e0a9976179.c000
101.9 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00852-29c4c506-e991-4d1d-be67-43e0a9976179.c000
65.5 K /hive/warehouse/myschema.db/original_table_w_many_small_files/some_part_id=30545/timestamp=1595397053/part-00853-29c4c506-e991-4d1d-be67-43e0a9976179.c000
To combine the files, here is what I did in a PySpark (Jupyter) notebook.
Create table using parquet format
First of all, I check the schema of the table using SHOW CREATE TABLE myschema.original_table_w_many_small_files:
Then, I retrieved the original table and repartitioned it such that all data in each partition is combined into a single file (numPartitions=1), since my cluster can handle files of size up to 100 M. You should adjust numPartitions based on the resources you have.
It came to our attention that one of the terms in the LWA budget, as computed in the currently published code on my GitHub, may require correction. This concerns the meridional convergence of eddy momentum flux, which is currently evaluated in the displacement coordinate (Φ’), whereas it should really be evaluated at the real latitude (Φ).
The discrepancy does not affect the zonal mean budget of wave activity, but it may cause spurious residual/sources/sinks locally if not corrected.
Noboru and I are currently assessing the error that this causes in our previous results and we will let you know as soon as we find out.
In the meantime, if you want to correct your diagnostic by yourself, the remedy is relatively simple (adding a correction term) – please see the write-up from Noboru on GitHub. If you have specific questions or concerns, we’ll be happy to help.
This time instead of making slides for overview, I put the focus of discussion on the trick, i.e. Classifier Guidance (section 4 of the paper), which makes the whole thing work well.
I learned something new cleaning up the GitHub repo and I’m gonna write about it.
1. Untrack changes in the repo
There are actually two places where you can define the files to be ignored by git:
.gitignore: I used to have this locally. Christopher suggested I include that in the GitHub repo for everyone’s use, and I think that’s a better idea.
.git/info/exclude: This is indeed the right place to specify personal files to be excluded (not shared on the repo).
2. Skip unit test when optional packages are not installed
In test_xarrayinterface.py, I modified the xarray import statement:
    try:
        import xarray as xr
    except ImportError:
        pytest.skip("Optional package Xarray is not installed.", allow_module_level=True)
Given that xarray is an optional package, the unit tests for the rest of the package should still run through even if it is not installed.
3. Move fortran modules into the package directory
The .f90 files that contain the f2py modules were located in hn2016_falwa/ before. They have now been moved to hn2016_falwa/f90_modules/. The modifications made are:
(1) In setup.py, the extension is changed to something like:
4. Fix the display of documentation on readthedocs.org
To debug, go to https://readthedocs.org/projects/hn2016-falwa/ and look at the Builds. Even if it passes, the document may not be compiled properly. Go to View raw and check out all warnings/errors to see if you have a clue.
The fixes I made are:
In docs/source/conf.py:
Fix the appended sys path
Add modules/packages that cause the compilation to fail or raise warnings to autodoc_mock_imports
Add docs/requirements.txt and specify all external packages imported (i.e. numpy, scipy, xarray) in the package
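For reference, the conf.py part of these fixes could look roughly like the excerpt below. This is a sketch only: the relative depth of the appended path assumes that conf.py sits two levels below the repository root, and the mocked module name is a hypothetical placeholder for whichever modules fail to import on the build server.

```python
# docs/source/conf.py (excerpt)
import os
import sys

# Make the package importable when Sphinx runs from docs/source/.
# Assumption: the package directory lives two levels up, at the repository root.
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

extensions = ["sphinx.ext.autodoc"]

# Hypothetical name: list here the compiled or heavy modules that cannot be
# imported on the build server, so that autodoc replaces them with mocks.
autodoc_mock_imports = ["my_compiled_module"]
```

With the mock in place, autodoc can still import the Python modules that depend on the compiled extension and render their docstrings.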
5. Compare commits using GitHub’s interface
To display the difference between two commits using GitHub’s web interface, put in the following URL:
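As a small illustration of the general pattern (not necessarily the exact URL used in this post), GitHub's compare view takes two commit references separated by two dots for a direct diff, or three dots for a comparison against the common ancestor. The helper name and the owner, repository and commit values below are placeholders:

```python
def github_compare_url(owner: str, repo: str, base: str, head: str,
                       three_dot: bool = False) -> str:
    """Build the URL of GitHub's compare view for two commits, branches or tags."""
    # ".." compares the two references directly;
    # "..." compares head against the common ancestor of base and head.
    separator = "..." if three_dot else ".."
    return f"https://github.com/{owner}/{repo}/compare/{base}{separator}{head}"


# Placeholder owner, repository and commit hashes.
url = github_compare_url("some-owner", "some-repo", "c3a414e", "faf7c6f")
print(url)
```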
I led the Machine Learning Journal Club discussion on the paper:
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., … & Gelly, S. (2019, May). Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (pp. 2790-2799). PMLR.
Here comes the package release hn2016_falwa v0.6.0. This version contains the updated algorithm for reference state inversion - now the reference state is solved with absolute vorticity field at 5N (defined by user) as boundary condition. The analysis code to reproduce results in Neal et al. “The 2021 Pacific Northwest heat wave and associated blocking: Meteorology and the role of an upstream cyclone as a diabatic source of wave activity” (submitted to GRL) can be found in the directory scripts/nhn_grl2022/.
I led the Machine Learning Journal Club discussion on the paper:
Liu, Q., Kusner, M., & Blunsom, P. (2021, June). Counterfactual data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 187-197). Here are the slides I made.
I led the Machine Learning Journal Club discussion on the paper:
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., … & Krishnan, D. (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Here are the slides I made.
The CI procedures of the package have been migrated from Travis CI to GitHub workflow. I spent a while fixing related issues, so I wrote a blog post about the procedures. Hopefully this can save others time if they want to do the same thing.
A few days ago, I realized that the coverage test for my python package, which was run on travis-ci.org, had not been running for two years. After migrating to travis-ci.com, the service is no longer free once 10000 credits are spent. One has to pay $69/mo 💸 for unlimited service.
Note that I already used 2170 credits when testing various things and preparing for upcoming package release in a day 😱(my fault?), so I don’t think a free plan is sustainable, but $69/mo is definitely too expensive if I only have to maintain a single package.
I found a free solution on GitHub marketplace: CodeCov. With this GitHub Action, one can run deployment test and code coverage diagnostic automatically when pushing commits. I spent a bit of time on trial and error, so I think it is worth the time to write down what I did to save time for others.
Define the workflow
The GitHub Workflow is defined in the file .github/workflows/workflow.yml. Here is what I did for my package:
In order to upload the coverage report to CodeCov, I used the command bash <(curl -s https://codecov.io/bash) after failing to send the report using the line uses: codecov/codecov-action@v2 from the CodeCov template.
Specify files to include/exclude in the coverage test
The specifications are made in the file .coveragerc:
In the previous discussion I led, I introduced the idea of learning visual representations in an unsupervised manner. This week, I am introducing the following paper that applies deep contrastive learning to train sentence embeddings:
You can find here the slides I made, which provide an introduction to Contrastive Learning. Below are the main points from the slides:
Contrastive learning is a self-supervised method to learn a representation of objects by minimizing the distance between samples of the same class and maximizing the distance between samples of different classes
Contrastive learning benefits from data augmentation and an increase in model parameters
Under-clustering occurs when there are not enough negative samples to learn from; over-clustering occurs when the model overfits (memorizes the data)
To address the inefficiency issue, the median (rank-k) triplet loss is used instead of the total loss
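To make the distance idea in the first point concrete, here is a minimal NumPy sketch of the standard triplet loss. Note that this is the plain formulation with Euclidean distance, not the median (rank-k) variant mentioned in the slides, and the margin value is an arbitrary choice for illustration:

```python
import numpy as np


def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray,
                 margin: float = 1.0) -> np.ndarray:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    # Euclidean distance from the anchor to a sample of the same class ...
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    # ... and to a sample of a different class.
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    # The loss is zero once the negative is farther away than the positive by `margin`.
    return np.maximum(d_pos - d_neg + margin, 0.0)


anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])  # distance 1 from the anchor
far = np.array([[3.0, 4.0]])   # distance 5 from the anchor

well_separated = triplet_loss(anchor, positive=near, negative=far)
badly_separated = triplet_loss(anchor, positive=far, negative=near)
```

In the first call the positive is already much closer than the negative, so the loss is zero; in the second the roles are swapped and the loss is positive, which pushes the representation to pull the positive in and push the negative away.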
🎹 I have been uploading improvised piano covers on my YouTube channel. Recently, I received some comments asking for the sheet music of my arrangements, and it seems to be a good idea to keep a record of my work. Therefore, I started making sheet music of my arrangements with MuseScore to share with others.
🎼 Check out the page Music Arrangement for the collection of arrangement with music sheets I’ve made. For the rest of my recordings, go to my YouTube Channel Clare’s Studio.