How to go from a user need to implementation?
It’s been a wonderful experience so far with GSoC. Happily, my first major PR has been merged, which means the changes I made were accepted and are now part of the main repository. It’s a great feeling to have made a major contribution to a big project like ArviZ!
I started in January 2021 with minor bug fixes, which improved my understanding of the library, the data structures it uses and the purpose of its different directories. With regression plots merged, it’s time to move on to time series plots, to cater for the needs of Bayesian time series models. Do you know the life cycle of a plot project in ArviZ?
1. Get familiar with the type of plots that you are going to implement.
This is the basic step. You must know the basics of, for example, time series models before implementing diagnostics for them, including what the time series components are. I briefly discussed them in my previous blog. See how other people design their models and how they visualize uncertainty. I referred to many blogs and other resources to get an idea of what needs to be supported.
2. Propose a high-level design, defining the major input parameters.
ArviZ is a package for exploratory analysis of Bayesian models. One of the major directories of the ArviZ project is arviz/plots. It contains all the code for generating plots from just a couple of necessary arguments. After knowing “what to do”, the next question is “how to do it”, that is, how to add automated plots. In my case: how do I add the time series plotting code to the project, and which inputs should it take? The easiest way forward is to read the code and understand the structure of the existing plot functions. I know the codebase of a big project can look intimidating at first. Here’s where nicely written documentation comes into play.
I looked at other plot functions and tried to understand their basic structure. Then I picked a few functions and read the code line by line. Not every line was clear to me at the start, but I learned that when the user calls a plot function, the call goes to the user API function. The user API function applies transformations to extract the useful information from the input and organizes it before sending it to the backend plot functions. ArviZ supports two backends, Matplotlib and Bokeh. The well-organized data is sent to the plot function of the backend chosen by the user. Backend functions receive only the necessary information and just handle its visualization. No mathematical operations are performed there.
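To make the split between the user API and the backends concrete, here is a minimal sketch of the pattern. All names in it (plot_ts_sketch, _matplotlib_backend, _bokeh_backend) are hypothetical and are not ArviZ’s real functions. The stand-in backends only return a description of what they would draw, so the example runs without any plotting library installed.

```python
# Hypothetical sketch of the "user API -> backend" pattern described above.
# None of these names are ArviZ's real functions.


def _matplotlib_backend(x, y_mean, y_lower, y_upper):
    """Stand-in backend: a real one would draw with Matplotlib."""
    return {"backend": "matplotlib", "n_points": len(x), "y_mean": y_mean}


def _bokeh_backend(x, y_mean, y_lower, y_upper):
    """Stand-in backend: a real one would draw with Bokeh."""
    return {"backend": "bokeh", "n_points": len(x), "y_mean": y_mean}


_BACKENDS = {"matplotlib": _matplotlib_backend, "bokeh": _bokeh_backend}


def plot_ts_sketch(x, draws, backend="matplotlib"):
    """User API: validate input, compute summaries, hand off to a backend."""
    if backend not in _BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if any(len(draw) != len(x) for draw in draws):
        raise ValueError("every draw must have one value per x position")

    # All computation happens here, before the backend is called.
    columns = list(zip(*draws))  # one tuple of draws per time step
    y_mean = [sum(col) / len(col) for col in columns]
    y_lower = [min(col) for col in columns]
    y_upper = [max(col) for col in columns]

    # The backend receives only ready-to-plot values.
    return _BACKENDS[backend](x, y_mean, y_lower, y_upper)


result = plot_ts_sketch([0, 1, 2], [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]], backend="bokeh")
print(result["backend"], result["n_points"], result["y_mean"])
```

Keeping the computation in one place means both backends stay thin, and a new backend only needs to know how to draw already-summarized values.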
I proposed a user API design specifying the input parameters and the mathematical computations to be performed on them. A lot of discussion on the design followed, and I came up with a better approach. That is all about the design phase.
3. Design a notebook showing the inputs and sample output.
The next step is to turn the design decided in the previous step into code. For this, you need to implement a basic time series model and write a function to visualize it. The notebook must show the output visualizations. There is no need to consider all the input cases at this step; you just need to explain the design in the form of code. My mentor and I discussed the output visualizations, and some comments and edits were made. All set, it is time to open a pull request.
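As a rough idea of what the first cells of such a notebook could contain, the sketch below simulates toy “posterior predictive” draws for a time series (trend plus seasonality plus noise) and summarizes them per time step. Everything here is an assumption made for illustration: the toy model, the names simulate_draw and summarize, and the equal-tailed 94% interval, which is a simple stand-in rather than an HDI. The plotting call itself is left out so the example needs only the standard library.

```python
import math
import random
import statistics

random.seed(0)  # reproducible toy data

n_steps, n_draws = 24, 200
t = list(range(n_steps))


def simulate_draw():
    """One hypothetical draw: linear trend + yearly seasonality + noise."""
    return [
        0.5 * ti + 2.0 * math.sin(2 * math.pi * ti / 12) + random.gauss(0, 1)
        for ti in t
    ]


draws = [simulate_draw() for _ in range(n_draws)]


def summarize(draws, lower_q=0.03, upper_q=0.97):
    """Per-time-step mean and an equal-tailed interval across draws."""
    means, lowers, uppers = [], [], []
    for column in zip(*draws):  # all draws at one time step
        ordered = sorted(column)
        last = len(ordered) - 1
        means.append(statistics.fmean(ordered))
        lowers.append(ordered[round(lower_q * last)])
        uppers.append(ordered[round(upper_q * last)])
    return means, lowers, uppers


mean, lower, upper = summarize(draws)

# In a notebook, these three series would be passed to the plotting function.
for ti in t[:3]:
    print(f"t={ti}: mean={mean[ti]:.2f}, band=({lower[ti]:.2f}, {upper[ti]:.2f})")
```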
4. Open a pull request and get feedback. Update the code according to reviews and suggestions.
This part involves a lot of commits and code reviews, and you learn a lot of software engineering along the way. I learnt various coding practices, documentation practices, visualization arguments and utility tools written within the project. I learned about the possibilities of xarray in terms of computation. Whenever I got stuck, I referred to other plot functions or asked my mentors for guidance. I also learnt about various Python tools like pylint, pydocstyle, black, pytest and many more.
You should have a basic knowledge of git commands and an understanding of branching operations to open a pull request. You can watch a short tutorial on YouTube and learn further by doing. Once you are ready with the code in your local repository, you need to push the changes to your personal branch and open a pull request against the main repository.
I experienced a lot of CI/CD test failures because I forgot to run all the checks before pushing my changes. It is always good practice to check your work locally and fix any errors that occur. You can find the details in the contributing guidelines section of the ArviZ documentation.
5. Add tests to test the basic functionality.
It’s good that you have implemented the function, but it’s incomplete without tests. Tests help you validate your work. They also help mentors and other developers review the code, and reviews ultimately help you make your function more robust. Ideally, you should add enough tests to cover all of your code’s functionality, down to each line of code.
Pytest lets you write parameterized tests, which reduce code length and complexity by removing duplicate code. It simplifies your work and makes checks easy to run. You just have to let pytest try different possible inputs and check that no error is encountered. You also have to test erroneous inputs and make the relevant assertions.
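Here is a small sketch of what parameterized tests can look like. The helper interval_width is hypothetical and exists only to give the tests something to check; pytest.mark.parametrize and pytest.raises are the real pytest features. The first test runs the same assertion over several valid inputs, and the second checks that erroneous inputs raise the expected error.

```python
import pytest


def interval_width(lower, upper):
    """Hypothetical helper under test: width of a band at each time step."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    return [u - l for l, u in zip(lower, upper)]


@pytest.mark.parametrize(
    "lower, upper, expected",
    [
        ([0.0], [1.0], [1.0]),
        ([1.0, 2.0], [3.0, 5.0], [2.0, 3.0]),
        ([], [], []),  # empty input is valid and gives an empty result
    ],
)
def test_interval_width_valid(lower, upper, expected):
    assert interval_width(lower, upper) == expected


@pytest.mark.parametrize("lower, upper", [([0.0], []), ([0.0, 1.0], [2.0])])
def test_interval_width_rejects_mismatched_lengths(lower, upper):
    # Erroneous input must fail loudly rather than return a wrong result.
    with pytest.raises(ValueError):
        interval_width(lower, upper)
```

Saved in a file whose name starts with test_, these run with the pytest command, and each parameter set is reported as its own test case.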
6. Add lots of documentation and examples to demonstrate the results.
The main focus of ArviZ is to ease the data analysis part of the Bayesian modelling workflow. A feature is incomplete unless it is documented nicely, with examples demonstrating the different cases the function supports. The user API docstring is also part of the documentation. A clean and easy-to-read user API docstring provides a seamless user experience.
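As an illustration of what a clean user-facing docstring might look like, here is a sketch written in the numpydoc section layout (Parameters, Returns, Examples). The function count_outside_band is hypothetical and only serves to carry the docstring.

```python
def count_outside_band(observed, lower, upper):
    """Count observations that fall outside an uncertainty band.

    Parameters
    ----------
    observed : list of float
        Observed values, one per time step.
    lower, upper : list of float
        Lower and upper bounds of the band, same length as ``observed``.

    Returns
    -------
    int
        Number of time steps where the observation lies outside the band.

    Examples
    --------
    >>> count_outside_band([1.0, 5.0], [0.0, 0.0], [2.0, 2.0])
    1
    """
    return sum(
        not (lo <= obs <= up) for obs, lo, up in zip(observed, lower, upper)
    )
```

The Examples section doubles as a quick check: a reader can paste it into a Python session and compare the output.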