Funded PhD position: Big data and climate modelling

When climate models execute on 100 million cores and generate exabytes of data, how will we work with this data? How will we account for the diverse numerical schemes used to produce it? How will the users of our research know that our calculations were valid and that our results can be relied on?

Applications are sought for a fully funded PhD position in automatic code generation applied to calculations on the results of massively parallel climate simulations. The position will start in autumn 2014.

The Candidate

The successful candidate will have good results in a master's degree in mathematics, computer science, physics or a similar highly numerate discipline. You will have experience in programming, ideally for scientific and/or high performance applications. You will have an interest in applying advanced mathematical and computational techniques to climate science problems of extremely high societal relevance, and you will have a broad interest in the climate science context of your research. Funding for this programme is usually limited to UK and EU students.

The Programme

You will join a well-motivated and highly performing team of students and post-doctoral scientists spanning the departments of mathematics and computing at Imperial College. In addition to conducting high-level original research on this project, you will participate in the Science and Solutions for a Changing Planet Doctoral Training Programme of the Grantham Institute for Climate Change at Imperial. This will provide key multidisciplinary insights into the climate system and the role your research and expertise can play. The programme will also develop transferable skills in communicating science in writing and orally, working with data and statistics, and working independently on large projects.

Depending on your background and preference, you will be based in either the Department of Mathematics or the Department of Computing at Imperial and will benefit from the full facilities of a leading research institution.

Benefits and opportunities

You will receive a tax-free stipend of £15 700 per year for at least 3 years, funding to present your work at international conferences and access to high performance supercomputing facilities to conduct your research.

Contact

For more information about applying for this position, please contact the primary supervisor Dr David Ham (David.Ham@imperial.ac.uk). The deadline for applications is 20 January!

Project Description

Conducting new analyses of climate simulations is a core mechanism for developing understanding of the climate system. As computers become larger and the models behind these simulations become ever more sophisticated, the ability of scientists to work effectively with the data is increasingly frustrated. The last international Climate Model Intercomparison Project (CMIP5) was estimated to produce 3 petabytes of data (1000 state-of-the-art hard drives) and future simulation sets will be far larger. Key countries including the UK, the US and Germany are currently rebuilding their climate model software on the basis of more sophisticated numerics. This will produce more accurate simulations, but also data sets which are more complex to process correctly.

A climate statistic is a mathematical statement which a climate scientist can typically express in a few lines of mathematics. By contrast, the current approach to the evaluation of this statement is for a scientist to spend weeks or months developing a bespoke script and tuning it to the particular data structure of each climate model to which it is to be applied. This is labour-intensive and requires reworking for each new statistic and each new model. Most critically, there is no effective mechanism for users of the results to verify that the statistic is correctly evaluated. Furthermore, this approach typically requires the data to be downloaded by each research group, an increasingly infeasible task.

The missing link in this process is the ability to take the mathematical statement of the statistic and automatically and efficiently evaluate it correctly in the light of the discrete data representation of each model. The student on this project will make a major contribution to the solution of this problem by producing a system which generates climate data query software from the high-level mathematical specification of the diagnostic to be calculated. They will leverage the existing Firedrake project to automatically generate mathematically correct parallel implementations from mathematical statements in the UFL language. The student will create a semantic description of currently employed model data formats and employ this to create an expressive, verifiable language for climate query. The resulting system will be:
Efficient: rather than spending months on coding, climate scientists will be able to move directly from formulating the question to studying the outputs.
Model portable: the same mathematical statement can be run on different models. This is essential for reliable and trustworthy intercomparisons.
Verifiably correct: the statistics will be correctly calculated from the underlying numerics, and this will be testable through extensive test suites. The scientist will be able to publish the actual mathematical code in their papers, so the provenance of their results is established and testable.
Cloud-ready: statistics can be calculated and processed where the data is archived, without downloading huge data sets.
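
To illustrate the model portability idea only, the following is a minimal Python sketch in which one statistic (an area-weighted mean) is written once and evaluated against two different discrete data layouts. It is not the Firedrake or UFL interface, and it does not generate code. All names here (LatLonField, UnstructuredField, area_weighted_mean) are hypothetical and chosen for illustration, and the cell areas assume a unit sphere.

```python
import math
from typing import Iterator, Protocol, Sequence, Tuple


class Field(Protocol):
    """Hypothetical adapter interface: yields (value, cell_area) pairs."""

    def cells(self) -> Iterator[Tuple[float, float]]:
        ...


class LatLonField:
    """Values on a regular latitude-longitude grid (one row per latitude band)."""

    def __init__(self, lat_edges_deg: Sequence[float],
                 values: Sequence[Sequence[float]]) -> None:
        if len(lat_edges_deg) != len(values) + 1:
            raise ValueError("need one more latitude edge than latitude bands")
        self.lat_edges = [math.radians(edge) for edge in lat_edges_deg]
        self.values = values

    def cells(self) -> Iterator[Tuple[float, float]]:
        for band, row in enumerate(self.values):
            south, north = self.lat_edges[band], self.lat_edges[band + 1]
            dlon = 2.0 * math.pi / len(row)
            # Exact area of a latitude-longitude cell on the unit sphere.
            area = dlon * (math.sin(north) - math.sin(south))
            for value in row:
                yield value, area


class UnstructuredField:
    """Values on an unstructured mesh, with one stored area per cell."""

    def __init__(self, areas: Sequence[float], values: Sequence[float]) -> None:
        if len(areas) != len(values):
            raise ValueError("areas and values must have the same length")
        self.areas = areas
        self.values = values

    def cells(self) -> Iterator[Tuple[float, float]]:
        yield from zip(self.values, self.areas)


def area_weighted_mean(field: Field) -> float:
    """The statistic, stated once: sum(value * area) / sum(area)."""
    weighted_sum = 0.0
    total_area = 0.0
    for value, area in field.cells():
        weighted_sum += value * area
        total_area += area
    return weighted_sum / total_area


# The same statement applied to two different model layouts.
grid_mean = area_weighted_mean(
    LatLonField([-90.0, 0.0, 90.0], [[1.0] * 4, [3.0] * 4]))
mesh_mean = area_weighted_mean(
    UnstructuredField([1.0, 3.0], [0.0, 4.0]))
```

In this sketch the knowledge of each model's data layout lives in the adapter classes, while the statistic itself is independent of the model. The system proposed above would go further by deriving the evaluation code from the mathematical statement and a semantic description of the data format, rather than relying on hand-written adapters.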

If individual scientists are to continue to do innovative work with climate model data on which the users of climate science can rely, solving the problems this project addresses is essential.