
The secret of life - part 2: AI provides a solution to the protein folding problem

Professor Tim Hubbard

07 December 2020

20 years ago, the sequencing of the human genome gave us our own blueprint, right?

Wrong — it gave us an encrypted blueprint! Every gene is there, but information about its function is effectively encrypted. An artificial intelligence (AI) network developed by Google AI offshoot DeepMind has recently made a gigantic leap in solving one of biology’s biggest challenges — determining a protein’s 3D shape from its amino-acid sequence, unlocking a universal decryption key, which scientists have been hunting for 50 years.

Biology is hierarchical: the genome specifies how to make RNA transcripts, which are translated to make proteins; proteins are the machines that compose and build cells; cells together form organs, which make up a whole organism. In theory, a complete genome contains all the information to specify a model of a complete organism, provided the rules to build each layer of the hierarchy are understood. The rules of translation (how the 4 letters of DNA encode the 20 types of amino acids that make up proteins, both DNA and proteins being linear molecules) were decoded by 1966. However, the sequence of a protein also encodes how it spontaneously folds up in 3D, with the resulting shape determining its function. Until the CASP14 findings (see below), we hadn’t been able to decode the rules of this folding process, at least not well enough to systematically infer the next layer of the biological system: the complete set of protein structures. At this level, the genome has remained the equivalent of an encrypted disk drive without the decryption key.
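The translation rules mentioned above are simple enough to write down as a lookup table, which is what makes the contrast with folding so stark. The following is a minimal illustrative sketch, not part of the original article: it builds the standard genetic code (64 three-letter DNA codons mapping to 20 amino acids plus stop signals) and translates a short, made-up DNA sequence. The names `CODON_TABLE` and `translate` are hypothetical, chosen for this example only.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# Each character is a one-letter amino-acid code; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)

# Map every 3-letter codon (e.g. "ATG") to its amino acid (e.g. "M").
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into a protein sequence.

    Reads codons from the first base, stops at the first stop codon,
    and ignores any trailing bases that do not form a full codon.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up example sequence: ATG GCC TGG TAA
    print(translate("ATGGCCTGGTAA"))
```

The whole rule set fits in a 64-entry table. No comparably compact table exists for going from the resulting amino-acid string to its 3D shape, which is the gap the folding problem describes.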

Researchers have been able to partially work around this, because protein 3D structures can be determined by a variety of experimental methods: X-ray crystallography, NMR and, most recently, cryo-electron microscopy. However, experimental processes are hard and slow, so the set of protein structures is far from complete for any organism. For humans, for example, only about 17% of the protein sequence encoded by the genome has an experimental 3D structure despite decades of effort.

A schematic of what this has meant for biological research is shown in figure 1 below. The absence of a complete set of 3D protein structures has limited our ability to project information from the genome upwards to directly infer higher organisational structures of biology. Instead, advances in understanding have depended on experimental collection of intermediates: transcripts, epigenetic states, structures, cells etc. and the experimental investigation of different components by tens of thousands of researchers worldwide.


Figure 1: Schematic of organisational layers of biology until now, using human as an example. Despite knowing the complete human genome sequence, our inability to predict protein structure has severely limited our ability to infer higher organisational layers directly (red). Advances have instead depended on inference from other intermediate data sources (blue).

Hence the 50-year worldwide hunt for a method to decode protein folding, the last 26 years of which have been carried out under the auspices of CASP, a biennial blind-test evaluation of methods. The new announcement at CASP14, that DeepMind’s AlphaFold artificial intelligence (AI) algorithm can predict most protein structures to experimental accuracy, represents an amazing breakthrough and brings the prospect of complete sets of protein 3D structures being rapidly generated for all organisms, including human.

A schematic of what the solution to the folding problem can mean is shown in figure 2 below. With a complete set of protein 3D structures, it becomes practical to build complete mechanistic models of biological processes such as transcription and regulation, and to progressively infer higher organisational structures of biology directly, relying less on the generation of intermediate datasets. Mapping variants onto more complete mechanistic models will also enhance our ability to infer the consequences of sequence differences in the genome, with implications for personalised healthcare.


Figure 2: Schematic of organisational layers of biology post CASP14, using human as an example. Accurate structure prediction allows improved direct inference of higher organisational layers of biology (red), reducing dependence on inference from other intermediate data sources (blue). Improved mechanistic models improve the ability to interpret the clinical consequences of genome sequence variants in individuals (green).

Beyond these implications, there are many other aspects of the scientific breakthrough announced by DeepMind and CASP. It’s a lesson in the benefits of structures that support team science: engagement across an entire world community, embedded evaluation and openness of results. The successful approach builds on the progress exposed through 26 years of the CASP process. It’s also a demonstration of the positive and world-changing power of using AI methods to model systems where classical physics approaches arguably fail, with implications for many other problems.

There are still gaps in the current AI solution: extending from predicting structures of isolated protein monomers to multimers, and extending from predicting structures to predicting interactions between structures. However, there is huge activity and progress, and while DeepMind are well ahead in CASP14, it’s clear from progress by other groups using AI techniques developed since CASP13 that there isn’t going to be a monopoly in algorithm development in this space, which will also help drive further refinement and extension. Collectively, these applications of AI will progressively transform our ability to model biological systems, similar to the sweeping impact of the wide availability of genome sequences and genome sequencing.

Tim Hubbard is Professor of Bioinformatics within the School of Basic & Medical Biosciences at King’s College London, with roles at Genomics England and Health Data Research UK. He was a co-organiser of CASP 1996–2007.

This article was first published on Medium. Read the original article.
