GSoC Blog 2. Week 2,3,4

Utkarsh Maheshwari
Jul 8, 2021

I shared the prototype of the regression plot function in Blog 1. Kindly check it out to get the high-level design. In this blog, I’ll share the details of how to use the function and how it is implemented.

Week 2

After going through several articles and other resources, I proposed a high-level design of the function.

idata: az.InferenceData object. Optional if y is a Sequence.
y: str, DataArray or ndarray.
If str, variable name from observed_data.
x: str, tuple of str, DataArray or array-like. Optional.
If str or tuple, variable name from constant_data. If ndarray, it can be 1D, or 2D for multiple plots. If None, the coords name of y is used (y should be a DataArray).
y_model: str or Sequence. Optional.
If str, variable name from posterior. Its dimensions should be the same as y plus added chains and draws.
y_ppc: str. Optional.
If str, variable name from posterior_predictive. Its dimensions should be the same as y plus added chains and draws.
and some other low-level arguments…

It is recommended that the user provide an InferenceData object with all the other variables stored as data inside it. However, array-like input works fine too.
Users can also provide multiple x and y when there are several independent and dependent variables. The output would then be (number of x) * (number of y) subplots, showing the relationship between each x and each y.
y_model is assumed to be present in the posterior group because it has the same kind of dimensions as the other posterior variables. It is the responsibility of the user to compute y_model according to the actual modelling equation; otherwise, the plots will fail to show the relationship accurately.
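To make the expected shapes concrete, here is a minimal NumPy sketch of how y_model could be computed for a simple linear model. The model (intercept plus slope times x), the variable names and the sample sizes are all assumptions made for illustration; in practice these arrays would come from the sampler and be stored in the InferenceData groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: 50 points (these would live in constant_data / observed_data).
n_obs = 50
x = np.linspace(0, 10, n_obs)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n_obs)

# Hypothetical posterior samples with shape (chain, draw), faked here with
# random numbers instead of running a real sampler.
n_chains, n_draws = 4, 250
intercept = rng.normal(1.5, 0.2, size=(n_chains, n_draws))
slope = rng.normal(0.8, 0.05, size=(n_chains, n_draws))

# y_model follows the (assumed) modelling equation. Broadcasting adds the
# observation dimension after chain and draw: (chain, draw, obs).
y_model = intercept[..., None] + slope[..., None] * x

print(y.shape, y_model.shape)
```

The point is only the shape: y_model has the dimensions of y with chain and draw added in front.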

The high-level API function is the most challenging one, as there are multiple cases that need to be kept in mind. For example, y could be multidimensional. In this case, the user should provide an added argument, plot_dim, the name of the dimension that contains the x/y values. I used xarray_var_iter, a selection utility function of ArviZ, to iterate over the xarray Dataset given the dimensions and variable names. It made the code simpler (though it is still a bit complex).
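The idea of iterating over everything except plot_dim can be sketched with plain NumPy. The helper below, iter_over_plot_dim, is a hypothetical stand-in written for this post; it is not the ArviZ utility and it ignores coordinates and variable names, but it shows why one 1D slice per subplot falls out naturally.

```python
import numpy as np

def iter_over_plot_dim(values, dims, plot_dim):
    """Yield (selection, 1D slice) pairs, keeping only `plot_dim` un-indexed."""
    axis = dims.index(plot_dim)
    # Move the plotting dimension last so every other axis can be looped over.
    moved = np.moveaxis(values, axis, -1)
    other_dims = [d for d in dims if d != plot_dim]
    for index in np.ndindex(*moved.shape[:-1]):
        selection = dict(zip(other_dims, index))
        yield selection, moved[index]

# A 2D y with dimensions (group, time); "time" holds the values to plot.
y = np.arange(6).reshape(2, 3)
slices = list(iter_over_plot_dim(y, dims=["group", "time"], plot_dim="time"))

for selection, values in slices:
    print(selection, values)
```

Each yielded slice would correspond to one subplot.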

There are multiple ways to visualize uncertainty.
1. HDI: Plotting the highest density interval.
2. Random samples: Plotting random draws from posterior predictive samples.
3. HDI Bands: Extended version of HDI but where the opacity changes as a function of the HDI.

The first and second are implemented in this project in order to provide the basic functionality.
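As a rough illustration of these two options, the sketch below computes an HDI per observation and picks a few random draws from fake posterior predictive samples. The helper hdi_1d, the 0.94 probability and the data are assumptions for this example only; it is a simplified version of the usual "narrowest interval" computation, not the code used in the project.

```python
import numpy as np

def hdi_1d(samples, hdi_prob=0.94):
    """Narrowest interval holding roughly `hdi_prob` of the 1D samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    n_included = int(np.floor(hdi_prob * n))
    # Width of every candidate interval spanning n_included steps.
    widths = sorted_samples[n_included:] - sorted_samples[: n - n_included]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_included]

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
# Fake posterior predictive samples with shape (chain, draw, obs).
y_ppc = 1.5 + 0.8 * x + rng.normal(size=(4, 250, x.size))
flat = y_ppc.reshape(-1, x.size)  # combine chain and draw

# Option 1: HDI, one (lower, upper) pair per observation.
bounds = np.array([hdi_1d(flat[:, i]) for i in range(x.size)])

# Option 2: random samples, a handful of draws to plot individually.
num_samples = 50
chosen = rng.choice(flat.shape[0], size=num_samples, replace=False)
sample_lines = flat[chosen]
```

Here bounds could be drawn as a filled band and sample_lines as faint points or lines.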

Week 2 was the most challenging part. Thankfully, with the help of Ravin and Ari, I was able to complete it. I am hoping there are no major bugs; if there are any, they should surface when the tests are added. So, let’s quickly move on to adding the backend functions for matplotlib and bokeh in week 3.

Week 3

This part was more graphical, less computational and more fun. Some of the characteristics of the graph were new to me, for example zorder, alpha (transparency), jitter and HDI. These properties are useful for improving the visualization and thus the user experience.

Zorder: This property determines how close the points or plot are to the observer. The higher the value of zorder, the closer the plot or points are to the viewer, so they are drawn on top of elements with a lower zorder.

Source : https://matplotlib.org/3.1.1/gallery/misc/zorder_demo.html

alpha: A value between 0 and 1. Less alpha → more transparency (0 is fully transparent, 1 is fully opaque).

Jitter: We can use jitter to add a little random noise to the data in order to see the cloud of points more clearly. Adding jitter in x transforms the visualization as follows.
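A minimal sketch of jitter, assuming uniform noise of a small fixed width. The helper add_jitter and its scale parameter are hypothetical names for this example, not the project's implementation.

```python
import numpy as np

def add_jitter(values, scale=0.1, seed=None):
    """Return a copy of `values` shifted by uniform noise in [-scale, scale)."""
    rng = np.random.default_rng(seed)
    return values + rng.uniform(-scale, scale, size=np.shape(values))

# 60 points that share only three distinct x values, so they overlap heavily.
x = np.repeat([1.0, 2.0, 3.0], 20)
x_jittered = add_jitter(x, scale=0.1, seed=0)

print(len(np.unique(x)), len(np.unique(x_jittered)))
```

Plotting y against x_jittered instead of x spreads the overlapping points sideways while keeping each one close to its true x value.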

Backend functions just plot whatever is passed from the user API function. The user API has already taken care of the dimensions of x and y, so if a dimension mismatch occurs, it will most likely be resolved in the user API function.

The backend function comprises 4 parts.

Plotting Observed values.
Plotting Posterior Predictive values: “Samples” or “HDI” based on input args
Plotting y_model values: “Samples” or “HDI” based on input args
Plotting means of y_model.
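The four parts can be sketched with matplotlib as four layers on one axes. This is only an illustration with fake data: the colours, zorder and alpha values are arbitrary choices, and the "samples" style is used for both the posterior predictive and y_model layers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(size=x.size)

# Fake stand-ins with shape (sample, obs); real values would come from the
# posterior and posterior_predictive groups.
slope = rng.normal(0.8, 0.05, size=(40, 1))
y_model = 1.5 + slope * x
y_ppc = y_model + rng.normal(size=y_model.shape)

fig, ax = plt.subplots()
# 1. Observed values, drawn above the sample layers.
observed = ax.plot(x, y, "o", color="C3", zorder=10, label="Observed")[0]
# 2. Posterior predictive values as faint points ("samples" style).
ax.plot(x, y_ppc.T, ".", color="C1", alpha=0.3, zorder=1)
# 3. y_model values as thin, mostly transparent lines.
ax.plot(x, y_model.T, color="k", alpha=0.2, linewidth=0.5, zorder=2)
# 4. Mean of y_model on top of everything else.
mean_line = ax.plot(
    x, y_model.mean(axis=0), color="C2", linewidth=2, zorder=11, label="Mean"
)[0]
ax.legend()
```

Calling fig.savefig with a file name would write the figure to disk. A Bokeh version would follow the same four steps with Bokeh's own glyph calls.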

As of now, ArviZ supports 2 backends, Matplotlib and Bokeh. Unlike Matplotlib, Bokeh is an interactive visualization library. The basic structure of both backend files is the same.

I am using the plot_hdi function of ArviZ to plot the HDI of y_model and y_ppc. Thanks to the developers!

Week 4

Week 4 is all about tests for both backends and fixing the small bugs found while testing. ArviZ has a separate package for testing the functions. It relies heavily on pytest’s parametrized tests, which make testing and maintenance much easier. From “Why would people need pytest?” to “How would it be possible without pytest!”, I have learnt a lot!
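For readers new to it, here is a small sketch of what a parametrized pytest test looks like. The function normalize_kind is a hypothetical helper invented for this example, not part of ArviZ; the point is that stacking pytest.mark.parametrize decorators runs one test body for every combination of inputs.

```python
import pytest

VALID_KINDS = ("samples", "hdi")

def normalize_kind(kind):
    """Hypothetical helper: lower-case the plot kind and validate it."""
    kind = kind.lower()
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}, got {kind!r}")
    return kind

# Two stacked decorators give 2 x 2 = 4 generated test cases.
@pytest.mark.parametrize("backend", ["matplotlib", "bokeh"])
@pytest.mark.parametrize("kind", ["samples", "HDI"])
def test_normalize_kind_accepts_valid(kind, backend):
    # `backend` is unused here; a real test would pass it on to the plot call.
    assert normalize_kind(kind) in VALID_KINDS

def test_normalize_kind_rejects_unknown():
    with pytest.raises(ValueError):
        normalize_kind("bands")
```

Running pytest on a file containing this code would collect five tests in total.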

There are several other packages like pylint, black and pydocstyle which are always complaining about the code (Just kidding! They are very helpful tools that keep our code neat and make it easier to maintain, among several other hidden benefits!).

Week 4 was also full of bug hunting and solving, interactions and suggestions. I made small changes to the code based on the mentors’ reviews to fix the bugs. To know more details, you can have a look at the PR.

Feel free to experiment with it and comment in the PR. I would love to know suggestions and improvements, to make it as bug-free as possible. See you in the next blog!
