Cal Poly's Dr. Matt Haberland Contributes to Open-Source Scientific Solutions
Python, a programming language, is heavily used by the biomedical research community. Researchers depend on Python tools, and Python tools depend on SciPy. SciPy is an open-source Python library that helps solve complex mathematical and scientific problems - the kind of problems frequently seen in biomedical research. SciPy (short for Scientific Python) provides algorithms for optimization, differential equations, integration, statistics, and other mathematical problems.
SciPy has recently played a key role in pandemic response, being used for problems such as:
- COVID-19 vaccine optimization: determining how to distribute vaccines for maximum societal effectiveness.
- Modeling and forecasting the early evolution of COVID-19 in Brazil.
- Determining how to most effectively execute pandemic-related lockdowns.
SciPy has also been used in exciting astronomy/physics applications, notably as a tool in the first-ever imaging of a black hole by the Event Horizon Telescope, seen below:
First image of black hole Sagittarius A*, Image courtesy of NASA
SciPy is used to process image data from the James Webb Space Telescope, which has been capturing images of the universe in stunning detail. Without data manipulation capabilities such as the ones SciPy provides, the images sent to Earth would be almost useless for science.
The edge of a nearby, young, star-forming region called NGC 3324 in the Carina Nebula, taken by the James Webb Space Telescope. Courtesy of NASA, ESA, CSA, and STScI
The better SciPy gets, the more capable the library is of helping tackle complex scientific problems. Cal Poly's Professor Matt Haberland is working to address some of SciPy's shortcomings in partnership with the Chan Zuckerberg Initiative (CZI), which recently awarded Dr. Haberland a grant of $325,000 for his proposed work through its Essential Open Source Software for Science (EOSS) program.
As stated on CZI's EOSS website, the program supports software maintenance, growth, development, and community engagement for critical open-source tools. Open-source software is crucial to modern scientific research, advancing biology and medicine while providing reproducibility and transparency. Yet even the most widely used research software often lacks dedicated funding.
Dr. Haberland's goals for his work are to:
- Make SciPy's biomedical research tools faster, more accurate, and more reliable. This will enable biomedical research tools to develop more quickly, because developers can increase their reliance on SciPy to perform lower-level tasks.
- Add more features to SciPy so that biomedical researchers can use it for most (if not all) of their analysis pipelines, and in some cases, enable work that was previously impossible.
- Improve SciPy's user interface and eliminate unexpected behaviors, allowing researchers to spend less time debugging code and more time on their valuable work.
- Enable other biomedical researchers to learn about and use the improvements via outreach and documentation.
There are many tools for biomedical research that can accomplish the same things as SciPy, but SciPy has some important benefits that make it valuable:
- Cross-platform compatibility. Many other software tools are restricted to certain operating systems and are limited by graphical user interfaces. SciPy is cross-platform and interfaces well with other types of code.
- It is open-source and free. SciPy is accessible for anyone to use, whereas many other statistical analysis tools are proprietary. CZI believes improving access to open-source software is key to scientific breakthroughs.
- SciPy is more self-contained than other software solutions, allowing users to rely less on multiple programs to accomplish their needs.
A recent critical SciPy use case involved determining which groups of people should be vaccinated first upon release of the first COVID-19 vaccines in order to most effectively reduce the pandemic's impacts. Researchers used SciPy's optimization capabilities to answer questions like, "If the first vaccine is 50% (or 60%, or 90%, etc.) effective, what percentage of the population must receive it in order to not overwhelm hospital ICUs?"
Vaccine optimization output, representing different scenarios of vaccine effectiveness (VE) and percentage of the population receiving the vaccine, for the purpose of not overwhelming ICUs. (Matrajt et al., "Vaccine optimization for COVID-19: who to vaccinate first?", 15 Dec 2020, https://www.medrxiv.org/content/10.1101/2020.08.14.20175257v3.full.pdf)
Results from SciPy-enabled analyses like this helped researchers recommend how to allocate vaccines as they were cleared and released, based on qualities like a particular vaccine's effectiveness against infection.
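To give a feel for how such a question can be posed to SciPy, here is a minimal sketch. It is not the model used by Matrajt et al.; it assumes a deliberately simplified textbook relationship in which the effective reproduction number is R0 * (1 - VE * coverage), and it asks for the smallest coverage that brings that number down to 1. The value of R0 and the function names are illustrative assumptions, not figures from the study.

```python
from scipy.optimize import brentq

R0 = 3.0  # assumed basic reproduction number, for illustration only


def effective_r(coverage, efficacy, r0=R0):
    """Toy model: reproduction number after vaccinating a fraction of people."""
    return r0 * (1.0 - efficacy * coverage)


def minimum_coverage(efficacy, r0=R0):
    """Smallest coverage in [0, 1] with effective_r <= 1, or None if unreachable."""
    def gap(coverage):
        return effective_r(coverage, efficacy, r0) - 1.0

    if gap(1.0) > 0:
        # Even vaccinating everyone is not enough in this toy model.
        return None
    # gap(0) > 0 and gap(1) <= 0, so a root is bracketed in [0, 1].
    return brentq(gap, 0.0, 1.0)


for ve in (0.5, 0.6, 0.9):
    print(ve, minimum_coverage(ve))
```

In this toy setting a less effective vaccine may never reach the target no matter how many people receive it, which is the kind of trade-off the real, far more detailed analyses explore.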
“SciPy aims to give everyone easy access to reliable and efficient implementations of essential scientific computing algorithms, regardless of their institutional affiliation or financial resources. Satisfying these basic needs of engineers and researchers frees their time to focus on solving important problems,” says Dr. Haberland.
While these attributes are valuable to all researchers, they also make SciPy more accessible to those who might not have access to its more restrictive and expensive alternatives. CZI supports the science and technology that will make it possible to cure, prevent, or manage all diseases by the end of this century. Disease affects everyone, but under-resourced communities are disproportionately affected. Moreover, due to systemic barriers, the scientific enterprise is not a place where all voices and talents thrive.
CZI, Dr. Haberland, and the research community at Cal Poly believe that the strongest scientific teams incorporate a wide range of backgrounds, experiences, and perspectives. They also want to empower community partners to engage in science. An open-source, powerful, free, cross-platform option such as SciPy aligns well with these goals and needs.
“Open-source software is exciting to me because every contribution could be a step toward discovering a new planet or curing a disease.” - Professor Matt Haberland
Dr. Haberland and his colleague Pamphile Roy are both SciPy core developers with commit rights, meaning they have access and permission to modify the SciPy library. To accomplish their first goal of making SciPy faster, more accurate, and more reliable, the researchers searched the source code and issue trackers for CZI-supported projects. They also surveyed biomedical Python project maintainers to better understand what is needed from SciPy. This helped Haberland and Roy better understand existing issues with SciPy.
To determine what new features are most needed, they reviewed open access articles from the March issues of Biomedical Engineering, Nature Biotechnology, and the New England Journal of Medicine to characterize the types of statistical analysis tools being used in modern research. They also searched 10,000+ citations of SciPy's 2020 Nature Methods article for direct use of SciPy in biomedicine and surveyed corresponding authors of SciPy-citing papers to better understand their computing needs, their current uses of SciPy, and improvements they can expect from SciPy.
From this review, Haberland and Roy found that “scipy.stats” is the part of SciPy most used by biomedical researchers. The two will be adding new statistical features to scipy.stats. They will also be performing essential maintenance throughout the SciPy library and fixing bugs under the guidance of the survey results.
Dr. Haberland is a member of Cal Poly's BioResource and Agricultural Engineering Department (BRAE), and his work will continue to strengthen the relationship between statistics and the world of agriculture, food, and environmental concerns. "Much of statistics development has been inspired by the needs of agriculture and related industries," he says. "For instance, many undergraduate students will be familiar with 'Student's t-test'. What they might not know is that 'Student' was a pseudonym of William Sealy Gosset, who developed the test while working on quality control for the Guinness Brewery."
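For readers curious what Student's t-test looks like in scipy.stats, the following is a small sketch using scipy.stats.ttest_ind. The two sets of measurements are made-up numbers chosen for illustration (imagine a quality metric from two batches); they are not data from Dr. Haberland's work.

```python
from scipy import stats

# Hypothetical quality measurements from two batches (invented values).
batch_a = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0]
batch_b = [5.4, 5.6, 5.3, 5.5, 5.7, 5.5]

# Two-sample Student's t-test, assuming equal variances (the default).
result = stats.ttest_ind(batch_a, batch_b)

print("t statistic:", result.statistic)
print("p-value:", result.pvalue)
```

A small p-value suggests the difference between the batch means is unlikely to be due to chance alone, under the test's assumptions.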
As Haberland and Roy perform this work, they will disseminate the results by adding examples and tutorials to the SciPy documentation, presenting the improvements at conferences and CZI meetings, and hosting “office hours” in which biomedical researchers can get live help.
Dr. Haberland became interested in this work out of necessity. “In grad school, I used a MATLAB function to schedule shifts for the pub in my dorm. Years later, when I no longer had access to MATLAB, I looked for a similar function in SciPy,” he said. “It didn't exist yet, so I decided to write it myself. Why not let everyone else use it, too?” He has since become one of the most active contributors to SciPy, having made thousands of contributions over the past few years.
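The article does not say which function Dr. Haberland wrote, so the following is only a sketch of how a small shift-scheduling problem might be posed in SciPy today, assuming it is modeled as a linear program and solved with scipy.optimize.linprog. The three workers, three shifts, and the "cost" of each pairing are hypothetical values invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical cost[w, s]: how much worker w dislikes shift s (invented values).
cost = np.array([[1, 3, 2],
                 [2, 1, 3],
                 [3, 2, 1]], dtype=float)
n_workers, n_shifts = cost.shape
n_vars = n_workers * n_shifts  # x[w, s] flattened row by row

# Every shift must be covered by exactly one worker.
A_eq = np.zeros((n_shifts, n_vars))
for s in range(n_shifts):
    A_eq[s, s::n_shifts] = 1.0
b_eq = np.ones(n_shifts)

# Every worker takes at most one shift.
A_ub = np.zeros((n_workers, n_vars))
for w in range(n_workers):
    A_ub[w, w * n_shifts:(w + 1) * n_shifts] = 1.0
b_ub = np.ones(n_workers)

result = linprog(cost.ravel(), A_ub=A_ub, b_ub=b_ub,
                 A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")

# For this assignment-style problem the solution comes out as 0s and 1s.
schedule = result.x.reshape(n_workers, n_shifts).round().astype(int)
print(schedule)
print("total cost:", result.fun)
```

Each row of the printed schedule is a worker and each column a shift; a 1 marks an assignment.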
Dr. Haberland's work will continue through late 2024, and will involve Cal Poly students in addition to his and Pamphile Roy's efforts. To find out more about SciPy, visit scipy.org.
Acknowledgement Statement: This project was made possible by the work of the units in the Cal Poly Division of Research, Economic Development & Graduate Education to support student research, Learn-by-Doing, the Teacher-Scholar Model, proposal submission, award negotiation, compliance review, and post-award management. See more at research.calpoly.edu.