Two manuscripts submitted for peer review
30 Oct 2024
I am glad to share that we have recently submitted two manuscripts to academic journals for peer review. ✌️ Please check the Publications page for details. 😁
I love to share what I've learnt with others. Check out my blog posts and notes about my
academic research, as well as technical solutions on software engineering and data
science challenges.
Opinions expressed in this blog are solely my own.
This is just a post to locate resources I need. 🤔
As I am participating in the MDTF project, I need some sample climate data to test my POD. NOAA has a Google Cloud repository that stores CMIP6 data. To download the data, I need gsutil installed on my Linux machine.
I created a minimal conda environment with:
conda create -n gcloud_cli python=3.11
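I found a slide deck online about NOAA’s public datasets that shows how to read the data using Xarray:
import xarray as xr

# Assumes fs (a file system object with an open() method), bucket_name and
# lets_get (a list of file names) have already been defined.
datasets = []
for file in lets_get:
    data_path = 'gs://' + bucket_name + '/' + file
    ds3 = xr.open_dataset(fs.open(data_path), engine='h5netcdf')
    datasets.append(ds3['TSkin'])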
I downloaded two datasets
$ gsutil -m cp -r "gs://cmip6/CMIP6/CMIP/CAMS/CAMS-CSM1-0/historical/r1i1p1f2/Amon/ta" "gs://cmip6/CMIP6/CMIP/CAMS/CAMS-CSM1-0/historical/r1i1p1f2/Amon/ua" "gs://cmip6/CMIP6/CMIP/CAMS/CAMS-CSM1-0/historical/r1i1p1f2/Amon/va" .
$ gsutil -m cp -r "gs://cmip6/CMIP6/C4MIP/E3SM-Project/E3SM-1-1/hist-bgc/r1i1p1f1/Amon/ta" "gs://cmip6/CMIP6/C4MIP/E3SM-Project/E3SM-1-1/hist-bgc/r1i1p1f1/Amon/ua" "gs://cmip6/CMIP6/C4MIP/E3SM-Project/E3SM-1-1/hist-bgc/r1i1p1f1/Amon/va" .
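The two commands above follow the same path pattern, so the URLs can be generated instead of typed by hand. Below is a minimal sketch; the function names cmip6_gcs_url and gsutil_copy_command are my own (hypothetical), and the directory layout is inferred only from the two commands above.

```python
def cmip6_gcs_url(activity, institution, source, experiment, member, table, variable):
    """Build a gs:// URL following the directory layout seen in the commands above."""
    parts = ["cmip6", "CMIP6", activity, institution, source,
             experiment, member, table, variable]
    return "gs://" + "/".join(parts)


def gsutil_copy_command(urls, destination="."):
    """Assemble one gsutil command that copies all the given URLs recursively."""
    quoted = " ".join(f'"{url}"' for url in urls)
    return f"gsutil -m cp -r {quoted} {destination}"


# Reproduce the first command: three variables from the same model run.
urls = [
    cmip6_gcs_url("CMIP", "CAMS", "CAMS-CSM1-0", "historical", "r1i1p1f2", "Amon", var)
    for var in ("ta", "ua", "va")
]
print(gsutil_copy_command(urls))
```

This only builds the command string; it does not check that the paths exist in the bucket.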
I will give it a try and report my findings here.
Additional information:
This year’s Nobel Prize in Physics was awarded to the inventors of the Artificial Neural Network (ANN). As someone who has worked on both Physics and Machine Learning, I wonder what this implies: the power of ANN lies in making predictions without understanding the underlying mechanisms, while physics is precisely about making predictions by finding out the underlying mechanisms. Is that a shift in regime? 🤔
Regardless, the rise of ANN has created tonnes of job opportunities for physicists, which is an invaluable contribution to the physics community (as faculty and staff scientist positions are very limited). As an international PhD grad in the US, I’m glad that USCIS can no longer complain about Physics degrees being irrelevant to machine learning for H-1B Visa applications! 🥳🎉
I led the Machine Learning Journal Club discussion on the paper:
Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
The slides used can be found here.
I found a useful Python package called pyclean for cleaning up cached files created by a local run of a unit test suite:
https://pypi.org/project/pyclean/
After installing it, to clean up after a test, execute
pyclean --verbose .
This cleans up the files and lists all the files deleted.
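To get a feel for what such a cleanup targets, here is a rough standard-library sketch that only lists the __pycache__ directories under a folder. It is not how pyclean is implemented, it deletes nothing, and the function name find_bytecode_caches is hypothetical.

```python
from pathlib import Path


def find_bytecode_caches(root):
    """Return the __pycache__ directories under root, sorted for stable output."""
    return sorted(p for p in Path(root).rglob("__pycache__") if p.is_dir())


# List the cache directories below the current directory (nothing is removed).
for cache_dir in find_bytecode_caches("."):
    print(cache_dir)
```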
The Python standard library module traceback can provide more details about an (unexpected) error than catching it with except Exception as ex: and then examining ex.
Let’s make a function that raises an error for demonstration:
import traceback


def do_something_wrong():
    cc = int("come on!")  # raises ValueError
    return cc


try:  # first catch
    do_something_wrong()
except Exception as ex:
    print(f"The Exception is here:\n{ex}")


try:  # second catch
    do_something_wrong()
except Exception:
    print(f"Use traceback.format_exc() instead:\n{traceback.format_exc()}")
The first catch would only display
The Exception is here:
invalid literal for int() with base 10: 'come on!'
while the second catch includes not only the error but also where it occurs:
Use traceback.format_exc() instead:
Traceback (most recent call last):
  File "/Users/csyhuang/JetBrains/PyCharm2024.1/scratches/scratch.py", line 16, in <module>
    do_something_wrong()
  File "/Users/csyhuang/JetBrains/PyCharm2024.1/scratches/scratch.py", line 5, in do_something_wrong
    cc = int("come on!")
ValueError: invalid literal for int() with base 10: 'come on!'
In a large software project, the second example would be way more helpful than the first.
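In a larger code base, one possible way to reuse this is a small helper that returns the traceback text so it can be logged or stored. This is only a sketch, and the helper name run_and_capture is hypothetical:

```python
import traceback


def run_and_capture(func):
    """Call func(); return (result, None) on success or (None, traceback text) on failure."""
    try:
        return func(), None
    except Exception:
        # format_exc() returns the full traceback of the exception being handled.
        return None, traceback.format_exc()


def parse_number():
    return int("come on!")  # raises ValueError


result, tb_text = run_and_capture(parse_number)
print(tb_text)
```

The returned text names the failing function (parse_number here) as well as the exception type, which is the part that print(ex) alone leaves out.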
I wanted to create a GitHub workflow that queries a web API and saves the results as text files in the repo. Here are several problems I solved during development.
Several tokens and secrets are necessary to query the web API. I stored them as GitHub secrets and access them in the workflow file via:
jobs:
  job-name:
    environment: query_env
    runs-on: ubuntu-latest
    name: ...
    steps:
      ...
      - name: Query API
        id: query_api
        env:
          oauthtoken: ${{ secrets.OAUTH_TOKEN }}
          oauthtokensecret: ${{ secrets.OAUTH_TOKEN_SECRET }}
        run: python run_script.py $oauthtoken $oauthtokensecret
      ...
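On the Python side, the script has to pick these two values up. The sketch below shows one possible way to do that and is not the actual content of run_script.py. The function name read_credentials is hypothetical; it accepts the two positional arguments used in the workflow above and otherwise falls back to the environment variables oauthtoken and oauthtokensecret set in the env: block.

```python
def read_credentials(argv, environ):
    """Return (token, token_secret) from positional arguments, else from environment variables."""
    if len(argv) >= 3:
        return argv[1], argv[2]
    try:
        return environ["oauthtoken"], environ["oauthtokensecret"]
    except KeyError as missing:
        raise SystemExit(f"Missing credential: {missing}") from None


# Demo with made-up values; in the workflow one would pass sys.argv and os.environ.
token, token_secret = read_credentials(["run_script.py", "demo-token", "demo-secret"], {})
# Only report the lengths so that the secrets themselves never end up in the log.
print(f"Received credentials of length {len(token)} and {len(token_secret)}")
```

Reading from the environment keeps the secrets out of the command line, which may be preferable if process arguments are visible to other tools.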
After running run_script.py, several .txt files are produced in the directory data_dir/ inside the repository, and I want to push them to the GitHub repository. I tried committing and pushing the files with actions/checkout@v4, but it did not work:
      ...
      - name: add files to git  # Below is a version that does not work
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.REPO_TOKEN }}
      - name: do the actual push
        run: |
          git add data_dir/*.txt
          git commit -m "add files"
          git push
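Running this, I received the error: nothing to commit, working tree clean. Error: Process completed with exit code 1. A likely cause is that actions/checkout cleans the working tree by default, so running it after the script removes the newly generated, untracked files.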
The version that works eventually looks like this:
      - name: Commit files
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          token: ${{ secrets.REPO_TOKEN }}
Note that this commits all files produced to the repository, including some unwanted cached files. Therefore, I added a step before it to clean up those files:
      - name: Remove temp files
        run: |
          # rm -rf does not fail when the directory is absent, so the step cannot exit with code 1
          rm -rf package_dir/__pycache__
A new release (v2.0.0) of the Python package falwa has been published to cope with the deprecation of numpy.distutils in Python 3.12. It involves some changes in installation procedures, which you can find in the README section “Package Installation”.
Many thanks to Christopher Polster for figuring out a timely and clean solution for the migration to Python 3.12. 👏 For details and references related to this migration, users can refer to Christopher’s pull request.
To train deep learning models written in PyTorch with Big Data in a distributed manner, we use BigDL-Orca at work. 🛠️
Compared to the Keras interface of BigDL, PyTorch (Orca) supports customization of various components for Deep Learning. For example, with the bigdl-dllib keras API, you can only use the operations available in the Autograd module to customize loss functions, while in PyTorch (Orca) you can do whatever you like by creating a customized subclass of torch.nn.modules.loss._Loss. 😁
One drawback of Orca, though, is the mysterious error logging: what happens within the Java class (i.e. what causes the error) is not logged at all. I got stuck on an error during model training, but all I got from the Spark log was socket timeout. There can be many possible causes, but the one I encountered was related to the size of train_data.
Many thanks to my colleague Kevin Mueller, who figured out the cause 🙏: when the partitions contain different numbers of batches in Orca, some barriers can never be reached, which results in this error.
To get around this, I dropped some rows to make sure the total size of train_data is a multiple of the batch size:
n_rows = train_data.count()  # count once, since count() triggers a Spark job
train_data = train_data.limit(n_rows - n_rows % batch_size)
The training process worked afterwards. 😁
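The arithmetic behind this workaround can be checked without Spark. A minimal sketch, with a hypothetical helper name rows_to_keep and made-up numbers for illustration:

```python
def rows_to_keep(n_rows, batch_size):
    """Largest multiple of batch_size that does not exceed n_rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return n_rows - n_rows % batch_size


# Illustrative numbers only: 1003 rows with a batch size of 64.
n_rows, batch_size = 1003, 64
kept = rows_to_keep(n_rows, batch_size)
print(kept, n_rows - kept)  # rows kept, rows dropped
```

With these numbers, 960 rows (15 full batches) are kept and 43 rows are dropped. At most batch_size - 1 rows are ever lost this way.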
A new release (v1.3.0) of the Python package falwa, with some improvements in the numerical scheme and enhanced functionality, has been made:
https://github.com/csyhuang/hn2016_falwa/releases/tag/v1.3.0
If you find an error, or have any questions, please submit an issue ticket to let us know.
Thank you for your attention.
I wrote a blog post in 2021 about how to integrate a pytest coverage check into a GitHub workflow.
To run coverage locally, execute coverage run --source=falwa -m pytest tests/ && coverage report -m, which yields the report (this is from the PR for falwa release 1.3):
Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
falwa/__init__.py                 11      0   100%
falwa/barotropic_field.py         41      4    90%   79, 86, 93, 138
falwa/basis.py                    66      8    88%   57-62, 175, 186
falwa/constant.py                  6      0   100%
falwa/data_storage.py            146      3    98%   52, 59, 107
falwa/legacy/__init__.py           0      0   100%
falwa/legacy/beta_version.py     240    240     0%   1-471
falwa/netcdf_utils.py              9      9     0%   6-30
falwa/oopinterface.py            400     32    92%   297, 320, 336, 355, 366, 393, 550, 559, 721, 731, 734, 799, 818, 860, 870, 880, 890, 907, 918, 929, 993, 1006-1008, 1019, 1031, 1179-1180, 1470-1471, 1550-1565
falwa/plot_utils.py              125    125     0%   6-343
falwa/preprocessing.py             7      7     0%   6-30
falwa/stat_utils.py               11     11     0%   6-26
falwa/utilities.py                61     48    21%   58-92, 151-186, 242-255
falwa/wrapper.py                 146    146     0%   6-570
falwa/xarrayinterface.py         266     37    86%   107-109, 317, 322-323, 478, 616-648, 683-704
------------------------------------------------------------
TOTAL                           1535    670    56%
I guess it’s time to work on increasing coverage again. 🙂 (Too much work recently, though.)
Our team lead shared with us some useful learning materials on advanced CS topics not covered in class: The Missing Semester of Your CS Education from MIT. I’ll spend some time reading this.
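To decide where to start, the Cover column can be recomputed from Stmts and Miss and filtered by a threshold. Here is a small sketch using three rows copied from the report above; the function name low_coverage_modules and the 50% threshold are my own assumptions.

```python
def low_coverage_modules(report_rows, threshold=50.0):
    """Return (name, percent) pairs for modules whose coverage is below threshold."""
    flagged = []
    for name, statements, missed in report_rows:
        if statements == 0:
            continue  # nothing to cover, reported as 100%
        percent = 100.0 * (statements - missed) / statements
        if percent < threshold:
            flagged.append((name, round(percent)))
    return flagged


# (name, Stmts, Miss) taken from three rows of the report above.
rows = [
    ("falwa/basis.py", 66, 8),
    ("falwa/utilities.py", 61, 48),
    ("falwa/wrapper.py", 146, 146),
]
print(low_coverage_modules(rows))
```

Of these three, falwa/utilities.py (21%) and falwa/wrapper.py (0%) fall below the threshold, matching the Cover column.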
I led the Machine Learning Journal Club discussion on the paper:
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., … & Tegmark, M. (2024). KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756.
Here are the slides I made.
In short, I believe KAN can only be practically useful if its scalability problem is solved. 👀 Let’s see how the development of this technique goes.
Below is the email I sent to the users of GitHub repo hn2016_falwa:
I am writing to inform you of two recent releases of the GitHub repo: v1.0.0 (major release) and v1.1.0 (a bug fix). You can refer to the release notes for the details. There are two important changes in v1.0.0:
The Python package is renamed from hn2016_falwa to falwa, since this package implements finite-amplitude wave activity and flux calculations beyond those published in Huang and Nakamura (2016). The GitHub repo URL remains the same: https://github.com/csyhuang/hn2016_falwa/ . The package can be installed via pip as well: https://pypi.org/project/falwa/
The bug fix release v0.7.2 turned out to contain a bug that over-corrects the nonlinear zonal advective flux term. v1.0.0 fixes this bug. Thanks to Christopher Polster for spotting the error. The fix requires recompilation of the Fortran modules.
The rest of the details can be found in the release notes:
Please let us know on issue page if you have any questions: https://github.com/csyhuang/hn2016_falwa/issues
Bookmarking some useful links:
Deployment of the Python package on Linux is not working (again). I am exploring solutions to automate deployment. Here are the things I’ve found.
fastscapelib-f2py: https://anaconda.org/conda-forge/fastscapelib-f2py/
[Updated on 2023/12/11] After some research, it seems that scikit-build would be a continuously maintained solution: https://scikit-build.readthedocs.io/
To be continued.
We published an important bugfix release, hn2016_falwa v0.7.2, which requires recompilation of the Fortran modules.
Two weeks ago, we discovered a mistake in the derivation of the expression for the nonlinear zonal advective flux term, which leads to an underestimation of the nonlinear zonal advective flux component.
We will submit corrigenda for Neal et al. (2022, GRL) and Nakamura and Huang (2018, Science) to update the numerical results. The correct derivation of the flux expression can be found in the corrected supplementary materials of NH18 (to be submitted soon). There is no change in the conclusions of any of the articles.
Please refer to Issue #83 for the numerical details and preliminary updated figures in NHN22 and NH18:
Thank you for your attention and let us know if we can help.
Here are course notes I am taking from the DeepLearning.ai course on Coursera: Generative AI with Large Language Models.
To build a pip distribution that contains the source files only (without compiled modules):
python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*
To compile the package to a .whl on Mac (Example: SciPy pip repository):
python setup.py bdist_wheel
python3 -m twine upload dist/*
This deploys the wheel file dist/hn2016_falwa-0.7.0-cp310-cp310-macosx_10_10_x86_64.whl to the pip channel.
However, when repeating the Mac procedures above on Linux, I got the error:
Binary wheel 'hn2016_falwa-0.7.0-cp311-cp311-linux_x86_64.whl' has an unsupported platform tag 'linux_x86_64'.
I googled and found this StackOverflow thread: Binary wheel can’t be uploaded on pypi using twine.
ManyLinux repo: https://github.com/pypa/manylinux
Good tutorial:
http://www.martin-rdz.de/index.php/2022/02/05/manually-building-manylinux-python-wheels/
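The platform tag named in the error is the last dash-separated field of the wheel filename, and PyPI does not accept a plain linux_* tag there, which is why the manylinux tooling is needed. The sketch below checks a filename before uploading; the function names are hypothetical, and the manylinux filename in the usage note is made up for illustration.

```python
def platform_tag(wheel_filename):
    """Extract the platform tag, the last dash-separated field of a wheel filename."""
    stem = wheel_filename[:-4] if wheel_filename.endswith(".whl") else wheel_filename
    return stem.split("-")[-1]


def is_generic_linux(wheel_filename):
    """True for a plain linux_* tag, as opposed to manylinux* or musllinux*."""
    return platform_tag(wheel_filename).startswith("linux_")


linux_wheel = "hn2016_falwa-0.7.0-cp311-cp311-linux_x86_64.whl"
mac_wheel = "hn2016_falwa-0.7.0-cp310-cp310-macosx_10_10_x86_64.whl"
print(platform_tag(linux_wheel), is_generic_linux(linux_wheel))
print(platform_tag(mac_wheel), is_generic_linux(mac_wheel))
```

A repaired wheel would carry a tag such as manylinux_2_17_x86_64 instead, for which is_generic_linux returns False.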
Create a fresh environment:
$ conda create --name test_new
But using conda on Mac to compile the wheel leads to this issue:
ld: unsupported tapi file type '!tapi-tbd' in YAML file
Create a virtual environment (not via conda):
python3 -m venv /Users/claresyhuang/testbed_venv
source /Users/claresyhuang/testbed_venv/bin/activate
The google slides used in my presentation in the meeting of NOAA Model Diagnostics Task Force can be found here.