Research

I am a Lecturer in Computing at Imperial College. My research focuses on developing novel algorithms and methods to manage and analyze unprecedented amounts of data so that it can be turned into knowledge. This research is strongly connected to and driven by applications, typically from the scientific domain (e.g., the neuroscientists of the Blue Brain Project) but also from business.

I hold a Ph.D. and an M.Sc. in Computer Science, both from the Swiss Federal Institute of Technology in Zürich (ETH Zürich), and was a postdoctoral fellow in the DIAS Lab at EPFL. In 2004 I was awarded a Fulbright scholarship to visit Purdue University.

Tackling Big Data Challenges

My research centers on three major directions:

Large-Scale Scientific Data/Spatial Data

My current research develops novel data management algorithms for querying and analyzing big scientific data. The unprecedented size and growth of scientific datasets make analyzing them a challenging data management problem. Current algorithms are not efficient for today’s data and will not scale to the rapidly growing datasets of the future. I therefore want to develop next-generation data management tools and techniques able to manage tomorrow’s scientific data, thereby putting efficient and scalable big data analytics at the fingertips of scientists so they can again focus on their science, unperturbed by the challenges of big data.

Key to my research is that the algorithms and methods developed are inspired by real use cases from other sciences and that they are also implemented and put to use by scientists. The algorithms developed so far are inspired by a collaboration with the neuroscientists of the Blue Brain Project (BBP), who attempt to simulate the human brain on a supercomputer.

Large-Scale Data Analytics

Analysing massive amounts of data to extract value has become key across different disciplines. As the amounts of data grow rapidly, however, current approaches to data analysis struggle. I am consequently interested in developing new methods for the large-scale analysis of high-dimensional data. One direction is high-performance data analytics (HPDA) on supercomputing infrastructure. To do so, existing algorithms need to be fundamentally redesigned and reimplemented to exploit the massive parallelism that large-scale supercomputing infrastructure provides, i.e., analytics problems need to be formulated as embarrassingly parallel problems with little need for synchronization and communication. A second direction is to introduce approximation to enable analyses at scale: even minimal approximation can accelerate analyses by several orders of magnitude, and the goal of this direction is therefore to develop new approximate analytics algorithms with tight error bounds.
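The idea of trading a little accuracy for a large speedup, while keeping the error bounded, can be illustrated with a minimal sketch. This is not one of the algorithms developed in this research, just a textbook sampling-based estimator: it computes an approximate mean from a uniform random sample and reports a confidence-interval margin alongside the estimate.

```python
import math
import random

def approximate_mean(data, sample_size, z=1.96):
    """Estimate the mean of `data` from a uniform random sample.

    Returns (estimate, margin): the true mean lies within
    estimate +/- margin with ~95% confidence (z = 1.96),
    while only touching `sample_size` elements instead of all of them.
    """
    sample = random.sample(data, sample_size)
    est = sum(sample) / sample_size
    # Sample variance (Bessel-corrected) drives the error bound.
    var = sum((x - est) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(var / sample_size)
    return est, margin

# Stand-in for a large dataset: scanning 1% of it already gives
# a tight bound around the true mean of 499999.5.
data = list(range(1_000_000))
est, margin = approximate_mean(data, 10_000)
```

The key property, and what the research sharpens, is that the estimator comes with an explicit error bound rather than an unqualified approximation.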

Data Management on Novel Hardware

Hardware underlying data processing and analysis evolves at a rapid pace. New hardware offers interesting trade-offs that enable new data analysis techniques. Novel algorithms have to be developed for new storage technologies like Flash, PCM, 3D memory, or even just abundantly available main memory. Similarly, new algorithms have to be developed for new processing technologies like neuromorphic hardware.

Research Highlights

FLAT: Accelerating Range Queries For Brain Simulations

Neuroscientists increasingly use computational tools to build and simulate models of the brain. The amounts of data involved in their simulations are immense, and managing them efficiently is paramount.

(Figure: FLAT overview)

One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well, even on today’s small models containing only a few million densely packed spatial elements. The problem with current approaches is that as the level of detail in the models increases, the overlap in the tree structure also increases, ultimately slowing down query execution. The neuroscientists’ need to work with bigger and, more importantly, increasingly detailed (denser) models motivated us to develop a new indexing approach.

To this end we have developed FLAT, a scalable indexing approach for dense data sets. We based the development of FLAT on the key observation that current approaches suffer from overlap on dense data sets. We hence designed FLAT as a two-phase approach, with each phase independent of density.
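To see why overlap matters, consider the simplest overlap-free alternative: a uniform grid. The sketch below is purely illustrative, not the FLAT algorithm; it buckets 3D points into non-overlapping cells, so a range query visits only the cells the query box touches, regardless of how densely the data is packed. (The cell size and point layout are arbitrary choices for the example.)

```python
from collections import defaultdict

CELL = 10.0  # illustrative grid cell size

def build_grid(points):
    """Bucket 3D points into a uniform grid.

    Cells never overlap, so, unlike an overlapping tree structure,
    query cost depends on the query extent, not on data density.
    """
    grid = defaultdict(list)
    for p in points:
        key = tuple(int(c // CELL) for c in p)
        grid[key].append(p)
    return grid

def range_query(grid, lo, hi):
    """Return all points inside the axis-aligned box [lo, hi]."""
    out = []
    # Ranges of cell indices the query box intersects in x, y, z.
    spans = [range(int(l // CELL), int(h // CELL) + 1)
             for l, h in zip(lo, hi)]
    for i in spans[0]:
        for j in spans[1]:
            for k in spans[2]:
                for p in grid.get((i, j, k), []):
                    # Refine: keep only points actually inside the box.
                    if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
                        out.append(p)
    return out

points = [(1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (50.0, 50.0, 50.0)]
grid = build_grid(points)
result = range_query(grid, (0, 0, 0), (10, 10, 10))
```

A plain grid, of course, handles skewed data poorly; the point of the sketch is only the density-independence that overlap-free partitioning buys.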

Our experimental results confirm that FLAT is independent of data set size as well as density and that it outperforms R-Tree variants in terms of I/O overhead by a factor of two to eight. Find out more here…

SCOUT: Prefetching for Latent Structure Following Queries

Today’s scientists are quickly moving from in vitro to in silico experimentation: they no longer analyze natural phenomena in a petri dish; instead, they build models and simulate them. Managing and analyzing the massive amounts of data involved in these simulations is a major task, yet scientists lack the tools to work efficiently with data of this size.

One problem many scientists share is the analysis of the massive spatial models they build. For several types of analysis they need to interactively follow the structures in the spatial model (e.g., the arterial tree, neuron fibers) and issue range queries along the way. Each query takes a long time to execute, and the total time for executing a sequence of queries significantly delays data analysis. Prefetching the spatial data reduces the response time considerably, but known approaches do not prefetch with high accuracy.

(Figure: SCOUT examples)

We have therefore developed SCOUT, a structure-aware method for prefetching data along interactive spatial query sequences. SCOUT uses an approximate graph model of the structures involved in past queries and attempts to identify which particular structure the user is following. Our experiments with neuroscience data show that SCOUT prefetches with an accuracy of 71% to 92%, which translates into a speedup of 4x-15x. SCOUT also improves prefetching accuracy on datasets from other scientific domains, such as medicine and biology. Continue reading here…
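The core intuition of structure-following prefetching can be sketched in a few lines. The sketch below is a hypothetical simplification, not SCOUT itself: it extrapolates the user’s movement from the last two query centroids and picks the neighbouring structure segment closest to that prediction as the region to prefetch.

```python
def predict_next(history, neighbours):
    """Pick the neighbouring region to prefetch next.

    history:    recent query centroids, oldest to newest, as (x, y, z)
    neighbours: centroids of candidate next segments along the structure
    """
    (x0, y0, z0), (x1, y1, z1) = history[-2], history[-1]
    # Linearly extrapolate the direction the user is moving in.
    guess = (2 * x1 - x0, 2 * y1 - y0, 2 * z1 - z0)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    # Prefetch the candidate segment closest to the predicted position.
    return min(neighbours, key=lambda n: dist(n, guess))

# A user moving along the x-axis: the branch continuing in x wins
# over the branch turning off in y.
choice = predict_next([(0, 0, 0), (1, 0, 0)], [(2, 0, 0), (0, 2, 0)])
```

SCOUT’s actual graph model is richer than this two-point extrapolation, but the sketch captures the idea of exploiting the structure being followed rather than treating queries as independent.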