I love to share what I've learnt with others. Check out my blog posts and notes about my
academic research, as well as technical solutions to software engineering and data
science challenges. Opinions expressed in this blog are solely my own.
(In practical cases, self._mapping is a huge object containing the dictionary and other attributes that are derived from methods in the AnimalsToNumbers class.) If I want to transform a dataframe df that looks like this:
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
~/anaconda3/envs/pyspark_demo/lib/python3.5/site-packages/pyspark/serializers.py in dumps(self, obj)
589 try:
--> 590 return cloudpickle.dumps(obj, 2)
591 except pickle.PickleError:
...
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o25.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
The issue is that, because self._mapping appears in the function addition, applying addition_udf to the pyspark dataframe requires the object self (i.e. the AnimalsToNumbers instance) to be serialized, but it can’t be.
A (surprisingly simple) workaround is to create a reference to the dictionary (self._mapping) inside the method, so that the UDF captures only the dictionary and not the object:
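Here is a minimal sketch of the pattern, assuming a class like the AnimalsToNumbers one from above with an addition method that maps an animal column to integers (the mapping contents and column names below are placeholders, not the original code):

from pyspark.sql import functions as F
from pyspark.sql import types as T

class AnimalsToNumbers(object):
    def __init__(self, spark):
        self._spark = spark
        self._mapping = {'elephant': 0, 'giraffe': 1}  # placeholder contents

    def addition(self, df):
        # Bind the dictionary to a local name so that the UDF's closure
        # captures only the dictionary, not self (which cannot be pickled).
        mapping = self._mapping
        addition_udf = F.udf(lambda x: mapping.get(x), T.IntegerType())
        return df.withColumn('animal_id', addition_udf(F.col('animal')))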
The colloquium I gave on Monday was an overview of the finite-amplitude local Rossby wave activity theory and its application to study blocking. We learned from this framework that atmospheric blocking can be modelled as a traffic jam problem. I also mentioned the follow-up work by Paradise et al. (2019, JAS) that discusses the implication of this notion.
Below are solutions I curated online to solve problems related to Git when collaborating with others and working on several branches together. This post will be updated from time to time.
Copy a file from one branch to another
To copy a file to the current branch from another branch (ref)
git checkout another_branch the_file_you_want.txt
Merge changes from the master branch into yours
To merge changes from another branch into yours, you can use merge or rebase, depending on the commit order you prefer. Bitbucket has a nice tutorial discussing the difference between the two. Usually I’d love to have the changes from another branch pulled in as a single commit with git merge:
git checkout master # the branch with changes
git pull # pull the remote changes to local master branch
git checkout mybranch # go back to mybranch
git merge master # incorporate the changes into mybranch
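If you would rather keep a linear history, the rebase route mentioned above replaces the merge step (a sketch, assuming the same branch names):
git checkout mybranch
git rebase master   # replay mybranch's commits on top of the updated master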
I wanted to load the libsvm files provided in tensorflow/ranking into a PySpark dataframe, but couldn’t find an existing module for that. Here is a version I wrote to do the job. (Disclaimer: not the most elegant solution, but it works.)
First of all, load the required PySpark utilities.
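The original import cell is not reproduced here; a minimal set that the function below assumes would be something like this (the spark_session name and the app name are my own choices):

from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import SparseVector

spark_session = SparkSession.builder.appName('libsvm_reader').getOrCreate()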
def read_libsvm(filepath, query_id=True):
    '''
    A utility function that takes in a libsvm file and turns it into a pyspark dataframe.

    Args:
        filepath (str): The file path to the data file.
        query_id (bool): whether 'qid' is present in the file.

    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    train_outcome = [int(x[0]) for x in raw_data]

    if query_id:
        train_qid = [int(x[1].lstrip('qid:')) for x in raw_data]

    # Parse the 'index:value' pairs of each row into a dictionary.
    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[(1 + int(query_id)):]]))

    # Build one Row per sample; note this assumes query_id=True,
    # since the qid field is always included.
    max_idx = max([max(x.keys()) for x in index_value_dict])
    rows = [Row(qid=train_qid[i],
                label=train_outcome[i],
                feat_vector=SparseVector(max_idx + 1, index_value_dict[i]))
            for i in range(len(index_value_dict))]
    df = spark_session.createDataFrame(rows)

    return df
Let’s see what the train and test sets from the tf-ranking package look like:
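A quick way to inspect them is simply to load each file with the function above (the file paths below are placeholders; point them at wherever the tensorflow/ranking example data lives on your machine):

train_df = read_libsvm('train.txt', query_id=True)
test_df = read_libsvm('test.txt', query_id=True)
train_df.printSchema()
train_df.show(5)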
Unit tests are worth the time it takes to write them, to make sure your package works as you expect.
I have also found packages that use their unit tests as sample scripts for users to
refer to (e.g. AllenNLP).
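As a minimal illustration of the idea, here is a pytest-style test; the normalize helper is a hypothetical function standing in for whatever your package exposes:

# test_normalize.py
def normalize(text):
    """Hypothetical helper: strip surrounding whitespace and lowercase."""
    return text.strip().lower()

def test_normalize_strips_and_lowercases():
    assert normalize('  Hello World ') == 'hello world'

Running pytest in the same directory will discover and run any test_*.py files automatically.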
After opening an account on pythonanywhere,
go to the Web tab and select Add a new web app.
When prompted to select a Python Web framework, choose Flask.
Choose your python version. Here, I am choosing Python 3.6 (Flask 0.12).
Enter a path for the Python file that will hold your Dash app. I entered:
/home/username/mysite/dashing_demo_app.py
Put the script of your Dash app in dashing_demo_app.py. You can use the script in the sample file
dashing_demo_app.py
provided on the GitHub repo of pythonanywhere’s staff.
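In case it helps, a minimal Dash app of the kind that file contains looks roughly like this (the layout below is just a placeholder; it uses the dash_html_components module that Dash releases of that era shipped with):

import dash
import dash_html_components as html

app = dash.Dash(__name__)
# The WSGI configuration below imports `app` and serves `app.server`,
# which is the underlying Flask instance.
app.layout = html.Div([html.H1('Hello from PythonAnywhere!')])

if __name__ == '__main__':
    app.run_server(debug=True)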
Next, I have to set up the virtual environment that the app runs in. I am using the
requirements3.6.txt
provided in the above GitHub repo.
Go to the Files tab to create requirements3.6.txt in your home directory. Then,
go to the Consoles tab to start a new bash session.
Create a virtual environment dashappenv with the following command in the home directory:
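PythonAnywhere’s bash consoles come with virtualenvwrapper, so the command is something along these lines (adjust the Python path and requirements file to your setup):

mkvirtualenv dashappenv --python=/usr/bin/python3.6
pip install -r requirements3.6.txt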
Then, go to the Web tab and enter under Virtualenv the path of your virtual environment:
/home/username/.virtualenvs/dashappenv
Lastly, modify your WSGI file. Instead of
from dashing_demo_app import app as application
provided, enter
from dashing_demo_app import app
application = app.server
to import your app.
It’s all done. Go to the Web tab to reload your app. You can then click the URL of your webapp and see it running. :)
Here is the sample webapp I built based on the example in
Dash tutorial.
Here is the
press release
from UChicago about the publication.
For interested researchers, the sample script to reproduce the results can be found in the directory
nh2018_science
of my python package’s GitHub repo hn2016_falwa.
You can download ERA-Interim reanalysis data with download_example.py to run the local wave
activity and flux analysis in the jupyter notebook demo demo_script_for_nh2018.ipynb.
Have fun and feel free to email me (csyhuang at uchicago.edu) if you are interested
in using the code and/or have questions about it.
In Step 5, you have to include the .jar files in the directory CoreNLP/lib and
CoreNLP/liblocal in your CLASSPATH. To do this, first, I install coreutils:
brew install coreutils
such that I can use the utility realpath there. Then, I include the following in my ~/.bashrc:
for file in `find /Users/clare.huang/CoreNLP/lib/ -name "*.jar"`;
do export CLASSPATH="$CLASSPATH:`realpath $file`";
done
for file in `find /Users/clare.huang/CoreNLP/liblocal/ -name "*.jar"`;
do export CLASSPATH="$CLASSPATH:`realpath $file`";
done
(I guess there are better ways to combine the commands above. Let me know if there are.)
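One way to combine them is to pass both directories to a single find call, for example:

for file in `find /Users/clare.huang/CoreNLP/lib /Users/clare.huang/CoreNLP/liblocal -name "*.jar"`;
do export CLASSPATH="$CLASSPATH:`realpath $file`";
done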
To run CoreNLP, I have to download the latest version of it and place it in the directory
CoreNLP/. The latest version is available on
their official website. Unzip
it, and add all the .jar files there to the $CLASSPATH.
Afterwards, you should be able to run CoreNLP with the commands provided
in the blogpost of Khalid Alnajjar
(under Running Stanford CoreNLP Server). If the server starts without problems,
you should see its interface in your browser at http://localhost:9000/.
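For reference, the server start command documented by Stanford is along these lines, run from the unzipped CoreNLP directory (memory and timeout settings here are the usual defaults):

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000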
With brew cask installed on Mac (see homebrew-cask instructions),
different versions of java can be installed via the following commands (I want to install java9 here, for example):
brew tap caskroom/versions
brew cask install java9
After installing, the symlink /usr/bin/java still points to the old native Java. You can check
where it points with the command ls -la /usr/bin/java. It probably points to the old native
java path:
/System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java
However, homebrew installed java into the directory
/Library/Java/JavaVirtualMachines/jdkx.x.x_xxx.jdk/Contents/Home.
To easily switch between different java environments, you can use jEnv. The installing
instructions can be found on jEnv’s official page.
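A typical workflow after installing jEnv looks like this; the exact JDK folder name and the version label reported by jenv versions depend on your installation:

jenv add /Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home
jenv versions        # list the Java versions jEnv knows about
jenv global 9.0      # switch the default java to the newly added JDK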
The ECMWF API Python Client is now available on pypi and anaconda.
The Climate Corporation has distributed the ECMWF API Python Client on
pypi. Now it can be installed via:
pip install ecmwf-api-client
Anaconda users on OS X/linux system can install the package via:
conda install -c bioconda ecmwfapi
To use the sample script, you need an API key stored in the file .ecmwfapirc in your home directory. You can retrieve your key by logging in at https://api.ecmwf.int/v1/key/.
Create a file named “.ecmwfapirc” in your home directory and put in the content shown on that page:
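The content is a small JSON snippet of the following form (the key and email values below are placeholders):

{
    "url"   : "https://api.ecmwf.int/v1",
    "key"   : "your-api-key",
    "email" : "your.email@example.com"
}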
After doing that, in the directory with the sample script example.py, you can test the package by running it:
python example.py
If the package has been set up properly, you should see it successfully retrieve a .grib file.
There are sample scripts
available on the ECMWF website (look under “Same request NetCDF format”). Below is an example of a python
script I wrote that retrieves zonal wind, meridional wind and temperature data at all pressure levels
during the period 2017-07-01 to 2017-07-31 in 6-hour intervals:
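The script itself is not reproduced here, but a sketch of such a request looks like the following; the parameter codes (130/131/132 for T/U/V), the 37-level pressure list and the target filename follow the usual ERA-Interim MARS conventions and may need adjusting:

from ecmwfapi import ECMWFDataServer

server = ECMWFDataServer()
server.retrieve({
    "class": "ei",
    "dataset": "interim",
    "stream": "oper",
    "type": "an",
    "expver": "1",
    "date": "2017-07-01/to/2017-07-31",
    "time": "00:00:00/06:00:00/12:00:00/18:00:00",
    "step": "0",
    "levtype": "pl",
    "levelist": "1/2/3/5/7/10/20/30/50/70/100/125/150/175/200/225/250/"
                "300/350/400/450/500/550/600/650/700/750/775/800/825/"
                "850/875/900/925/950/975/1000",
    "param": "130.128/131.128/132.128",   # temperature, u-wind, v-wind
    "grid": "0.75/0.75",
    "format": "netcdf",
    "target": "2017-07_uvt.nc",
})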
The publication page
has been updated with 3 submitted manuscripts.
Updates on Feb 9, 2018: The manuscript “Role of Finite-Amplitude Rossby Waves and Nonconservative
Processes in Downward Migration of Extratropical Flow Anomalies” has been accepted by Journal of Atmospheric Sciences.
The subroutine
wrapper.qgpv_eqlat_lwa_ncforce for computing effective diffusivity, which quantifies the
damping on wave transiences by irreversible mixing in the stratosphere during a
stratospheric sudden warming event, can be found in my python package.
I am interested in going through the exercises from Princeton University’s
Algorithms course. I found that someone
wrote a handy bash script to set up the environment on Mac OS/Linux: