Machine learning is being used in nearly every discipline in science, from biology and environmental science to chemistry, cosmology and particle physics. Scientific data sets continue to grow exponentially due to improvements in detectors, accelerators, imaging, and sequencing as well as networks of embedded sensors and personal devices. In some domains, large data sets are being constructed, curated, and shared with the scientific community and data may be reused for multiple problems using emerging algorithms and tools for new insights. Machine learning adds a powerful set of techniques to the scientific toolbox, used to analyze complex, high-dimensional data, automate and control experiments, approximate expensive experiments, and augment physical models with those learned from data.
On the systems side, scientists have always demanded some of the fastest computers for large and complex simulations and more recently for high throughput simulations that produce databases of annotated materials and more. Now the desire to train machine learning models on scientific data sets and for robotics, speech and vision, has created a new set of users and demands for high end computing. The changing architectural landscape has increased node level parallelism, added new forms of hardware specialization, and continued the ever-growing gap between the cost of computation and data movement at all levels. These changes are being reflected in both commercial clouds and HPC facilities—including upcoming exascale facilities—and also placing new requirements on scientific applications, whether they are performing physics-based simulations, traditional data analytics, or machine learning. Using examples from my own research in bioinformatics for the microbiome, I will describe some of the algorithmic challenges and how these machines are being used to deliver new scientific capabilities.