Collaborative Proposal: EarthCube Integration: Pangeo: An Open Source Big Data Climate Science Platform

Lead PI: Ryan Abernathey, Dr. Richard Seager, Michael Tippett, Chiara Lepore, Naomi Henderson

Unit Affiliation: Ocean and Climate Physics, Lamont-Doherty Earth Observatory (LDEO)

September 2017 - August 2021
Project Type: Research

DESCRIPTION: Climate, weather, and ocean simulations (Earth System Models; ESMs) are crucial tools for the study of the Earth system, providing both scientific insight into fundamental dynamics and valuable practical predictions about Earth's future. Continuous increases in ESM spatial resolution have led to more realistic, more detailed physical representations of Earth system processes, while the proliferation of statistical ensembles of simulations has greatly enhanced understanding of uncertainty and internal variability. Hand in hand with this progress has come the generation of petabytes of simulation data, resulting in huge downstream challenges for geoscience researchers. The task of mining ESM output for scientific insights has now itself become a serious Big Data problem. Existing Big Data tools cannot easily be applied to the analysis of ESM data, leading to a growing crisis across a wide range of geoscience fields. This is exactly the sort of problem EarthCube was conceived to address. The project will integrate a suite of open-source software tools (the "Pangeo Platform") which together can tackle petabyte-scale ESM datasets. Additionally, training and educational materials for these tools will be developed, distributed widely online, and integrated into existing educational curricula at Columbia. A workshop at NCAR in the final year will help inform the broader community about Pangeo. Collaborators at other US climate modeling centers will encourage their scientists to adopt and participate in the Pangeo project. Beyond climate and related fields, multidimensional numeric arrays are common in many fields of science (e.g. astronomy, materials science, microscopy). However, the dominant Big Data software stack (Hadoop) is oriented towards tabular text-based data structures and cannot easily ingest petabyte-scale multidimensional numeric arrays.
The proposed work thus has the potential to transform Data Science itself, enabling analysis of such datasets via a novel, highly scalable, highly flexible tool with a syntax familiar to disciplinary researchers. The core technologies are the Python packages Dask, a flexible parallel computing library which provides dynamic task scheduling, and XArray, a wrapper layer over Dask data structures which provides user-friendly metadata tracking, indexing, and visualization. These tools interface with netCDF datasets and understand CF conventions. They will be brought to bear on four high-impact Geoscience Use Cases in atmospheric science, land-surface hydrology, and physical oceanography. Disciplinary scientists will define workflows for each use case and interact with computational scientists to demonstrate, benchmark, and optimize the software. The resulting software improvements will be contributed back to the upstream open source projects, ensuring long-term sustainability of the platform. The end result will be a robust new software toolkit for climate science and beyond. This toolkit will enhance the Data Science aspect of EarthCube. Implementation of these tools on the cloud will also be tested, taking advantage of an agreement between commercial cloud service providers and NSF for the BIGDATA solicitation.
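
As a rough illustration of how the two core packages fit together, the following minimal sketch builds a small labeled array with XArray, splits it into Dask chunks, and computes a monthly climatology lazily. It is not taken from the project itself: the variable name `sst`, the grid dimensions, the chunk size, and the random data are all assumptions chosen for illustration, and the example assumes that `numpy`, `pandas`, `xarray`, and `dask` are installed. With real model output, the array would more likely come from a netCDF file opened with chunking enabled.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for ESM output (hypothetical): two years of daily
# values on a coarse latitude/longitude grid.
time = pd.date_range("2000-01-01", periods=730, freq="D")
lat = np.linspace(-88.0, 88.0, 45)
lon = np.linspace(0.0, 356.0, 90)
rng = np.random.default_rng(0)
values = rng.standard_normal((time.size, lat.size, lon.size))

# XArray attaches dimension names and coordinate labels to the raw array.
sst = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="sst",
)

# Chunking along time backs the variable with a lazy Dask array.
sst_lazy = sst.chunk({"time": 73})

# Label-based operations on the chunked array only build a task graph.
monthly_clim = sst_lazy.groupby("time.month").mean("time")

# compute() hands the graph to the Dask scheduler, which evaluates the
# chunks and returns an in-memory result.
result = monthly_clim.compute()
```

The same label-based syntax applies whether the data fit in memory or are spread across many chunks, which is the property the platform relies on for scaling to larger datasets.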