
Please add your proposed PhD projects here.

CS: Note to Amos: My first project is on the ILCC Web as well, so it is possible that that one is already posted on the School list.

IM: There was an automatic wiki link on the word PhD before, which meant people had split projects across that page and this one. I've merged the pages.

Statistical NLP for Programming Languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text. Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematize this process and apply it at a large scale. We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing an autocomplete functionality that can suggest entire function bodies. Statistical techniques involved include language modeling, data mining, and Bayesian nonparametrics. This also raises some deep and interesting questions in software engineering, e.g., why do syntactic patterns occur in professionally written software when they could be refactored away?
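To make the language-modeling idea concrete, here is a minimal sketch of a bigram model over code tokens that suggests likely next tokens. It is only an illustration under simple assumptions, not the method the project would use: the toy corpus, the pre-tokenized input, and the function names `train_bigram_model` and `suggest_next` are all hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(token_sequences):
    """Count how often each token directly follows each other token."""
    follow_counts = defaultdict(Counter)
    for tokens in token_sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def suggest_next(follow_counts, prev_token, k=3):
    """Return up to k of the most frequent successors of prev_token."""
    successors = follow_counts.get(prev_token, Counter())
    return [tok for tok, _ in successors.most_common(k)]


# Hypothetical toy corpus: each entry is one already-tokenized line of code.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "items", ":"],
    ["for", "i", "in", "range", "(", "10", ")", ":"],
]

model = train_bigram_model(corpus)
suggestions = suggest_next(model, "in", k=2)
print(suggestions)
```

A bigram model only sees one token of context, so it cannot suggest whole function bodies; it is shown here because it is the simplest instance of the counting idea behind statistical autocomplete.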

Structure Learning for Computer Systems

Supervisor: Charles Sutton

Automatically determine the structure of models to describe the performance of warehouse-scale and cloud applications.

Modern computer systems have become more complex than ever before, with distributed systems becoming a mainstream computing tool. Low latency is a crucial design goal for these systems, because users will not adopt an interactive Web service that is slow. Understanding the performance of a distributed system is extremely difficult because of the many interactions between components. In this project, we will address this problem by attempting to learn the structure of models that describe the performance of these systems. Possible structures include networks of nonparametric regression models, networks of queues, or more complex performance models such as stochastic process algebras. The idea is that the learned structure will be useful for visualisation, i.e., that it will provide a compact, interpretable description of the system's performance, so that performance bugs in the system will be visually apparent as bottlenecks in the learned queueing network. Essentially, the learned model will serve as a summary of the large amount of performance data used to generate it. Structure learning is a notoriously complex problem in machine learning, so this new application may serve as a challenge problem for this area.
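As a small illustration of how a queueing network can expose a bottleneck, the sketch below treats a request path as a chain of M/M/1 queues and reports the station with the largest mean response time. This assumes the structure and rates are already known, whereas the project is about learning them from data. The station names, the rates, and the function `mm1_stats` are hypothetical.

```python
def mm1_stats(arrival_rate, service_rate):
    """Utilisation and mean response time of a stable M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    utilisation = arrival_rate / service_rate
    # Standard M/M/1 result: mean time in system is 1 / (mu - lambda).
    mean_response = 1.0 / (service_rate - arrival_rate)
    return utilisation, mean_response


# Hypothetical service rates (requests per second) for three stations in series.
service_rates = {"frontend": 200.0, "app": 120.0, "database": 105.0}
arrival_rate = 100.0  # hypothetical request rate entering the chain

report = {
    name: mm1_stats(arrival_rate, mu) for name, mu in service_rates.items()
}
# The bottleneck is the station where requests spend the most time on average.
bottleneck = max(report, key=lambda name: report[name][1])

for name, (rho, resp) in report.items():
    print(f"{name}: utilisation={rho:.2f}, mean response={resp * 1000:.0f} ms")
print("bottleneck:", bottleneck)
```

Even in this toy setting, the station closest to saturation dominates end-to-end latency, which is the kind of pattern a learned queueing network is meant to make visible.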

Computational epigenetics

Supervisor: Guido Sanguinetti

The overwhelming majority of quantitative biology has focused on studying molecules like mRNA, which decay within hours at most. How can this help us explain phenomena that take years to establish, e.g. ageing, cancer, neurodegenerative diseases? People increasingly think that a determining factor is so-called "epigenetics", i.e. changes in the spatial organisation or chemical state of DNA (e.g. how it is wrapped around histones, or its methylation state; for a very accessible review see here). Data about these epigenetic factors are becoming increasingly available thanks to next-generation sequencing. Can we use computational methods to discover whether there are networks connecting these various epigenetic factors, and connecting epigenetics with genetics?

Machine learning for spatio-temporal systems

Supervisor: Guido Sanguinetti

Advances in remote sensing technologies mean that there is an increasing number of data sets detailing physical processes at high spatial and temporal resolution. As an example, our collaborator Dr John Quinn, Makerere University Kampala, is gathering a very large data set in the following way: farmers in Uganda often own GPS phones, and they are asked to send photographs of suspect cassava plants (the main staple in East Africa) to a server in Kampala, where a computer vision algorithm classifies the pictures into a number of disease classes. We therefore get a nation-scale data set of occurrences of diseased plants as events in space and time. How do we analyse such data and extract information, e.g. about the dynamics of the spread? Can we make online predictions that are useful to decision makers? I would be very interested in working on these questions, perhaps building on this online general estimation tool for a class of spatio-temporal models I recently worked on with collaborators in systems engineering.

Unsupervised learning for hierarchical image modelling

Supervisor: Chris Williams

Develop models for shapes and appearances of image regions and objects.

It is highly desirable to frame image understanding in terms of hierarchical generative probabilistic models. These allow top-down and bottom-up flows of information to take place, in order to provide a scene interpretation. Encoded within such a model would be knowledge at various levels, e.g. lower-level models of regions and boundaries, and at a higher level the shape and appearance of object classes, and their contextual relationships. Due to the difficulties in obtaining appropriate annotated data, such models should be learned in a largely unsupervised fashion from image data. Hinton's "deep learning" agenda is attractive here in that it provides an upgrade path from lower-level to higher-level regularities.

The specific PhD project would develop components that would fit into this framework; for example one might decompose an image into regions based on visual texture, and at a higher level model the typical shapes and appearances of co-occurring regions that arise from object classes.

Models for Understanding Time Series from Intensive Care Units

Supervisor: Chris Williams

Identify physiological and artifactual events in patient monitoring data so as to make "smart alarms" for medical staff possible.

Patients in intensive care are monitored by many sensors (heart rate, blood pressure, temperature, etc.), giving rise to time-series data that has rich structure. The goal of this project is to identify various events in the data streams, both physiological and artifactual. If this can be achieved reliably, then identified or predicted physiological events could be flagged to medical staff as a "smart alarm". Artifactual events (such as a probe recalibration) need to be identified and then discounted. The methods for this work will be based on the Factorial Switching Linear Dynamical System (FSLDS; Quinn, Williams and McIntosh, 2009), but there are many new directions to explore. The work will be carried out in collaboration with Intensive Care Units in Scotland.

-- AmosStorkey - 02 Nov 2011

Topic revision: r3 - 02 Dec 2011 - 11:21:31 - Main.imurray2