A Big Time for Big Data
Science has firmly entered the ‘Big Data age’ – where the volumes of data we can generate from experiments vastly outstrip the ability of traditional methods to analyse them and extract information. Across the sessions at SEB Prague 2020, various researchers will be demonstrating how – with ingenuity and new approaches – we can make new discoveries from the data deluge. By Caroline Wood.
Stressed under the surface?
Big Data is proving especially useful for measuring organisms’ responses where there are no observable changes, or where the responses take place over prolonged timescales. “It is particularly challenging to assess how Antarctic marine species respond to environmental change as they tend to be long-lived species with low-energy lifestyles” says Melody Clark (British Antarctic Survey), who will be speaking on ‘Life in the slow lane’. Part of her research investigates how these marine communities will be affected by global climate change and increased sea temperatures. One method is to analyse the fitness of filter-feeding species that colonise heated settlement panels to form an ‘encrusting community’. These panels heat a thin layer of water above them to 1 and 2 °C above ambient temperatures, mimicking the 50- and 100-year predictions for warming in the Southern Ocean.
An initial, surprising result seemed to indicate that these communities thrived on the +1 °C panels, with markedly increased growth rates: the colonies of the bryozoan Fenestrulina rugula, for instance, were more than double those on the control panels. What wasn’t clear, however, was whether this represented true acclimation to the warmed conditions. “It can be difficult to know what’s going on under the surface of these species as they have solid calcified exoskeletons, meaning there are no visual signs of stress” Melody says. “The only way to be sure is to go to the molecular level”. Differential expression profiling of RNA-sequencing data revealed that the species on the warmed panels showed a considerably more active response, even after 18 months, with thousands of genes up-regulated compared to the controls, including elements of the cellular stress response, such as heat shock proteins and antioxidants.
“This transcriptional profile indicated that these animals were experiencing cellular stress and difficulty in acclimating to the warmer conditions” Melody says. “Rather than thriving, they were potentially at, or close to, a tipping point to survive”. This was validated using upper thermal limit trials: animals grown on the warmed panels showed no difference in the maximum temperature they could tolerate before becoming unresponsive to external stimuli, suggesting there had been no acclimation at the whole organism level. “These results demonstrate that even a +1°C increase in the environment of these species significantly affects cellular homoeostasis, a situation that is likely not sustainable long term, particularly if they have to maintain enhanced growth rates” says Melody.
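To give a flavour of what ‘differential expression profiling’ involves, the sketch below compares average read counts per gene between control and warmed samples and reports the log2 fold change. It is a deliberately minimal illustration and not the analysis used in the study: the gene names and counts are invented, the pseudocount and the two-fold cut-off are arbitrary assumptions, and real analyses rely on dedicated statistical models that account for replicate variability and multiple testing.

```python
from math import log2
from statistics import mean


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return the log2 fold change (treated vs control) for every gene.

    A pseudocount is added to each mean so that genes with zero reads
    do not cause a division by zero.
    """
    return {
        gene: log2((mean(treated[gene]) + pseudocount)
                   / (mean(control[gene]) + pseudocount))
        for gene in control
    }


# Hypothetical read counts for three replicate samples per condition.
# Gene names and values are invented purely for illustration.
control = {"hsp70_like": [10, 12, 11], "actin_like": [200, 210, 190]}
warmed = {"hsp70_like": [90, 95, 100], "actin_like": [205, 195, 200]}

lfc = log2_fold_changes(control, warmed)

# Flag genes that are at least two-fold higher (log2 fold change >= 1)
# in the warmed samples; the cut-off is an assumption.
up_regulated = [gene for gene, change in lfc.items() if change >= 1.0]
print(up_regulated)
```

In this toy example only the heat-shock-like gene passes the cut-off, which mirrors in miniature the kind of stress-response signal described above.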
Her current work seeks to investigate the extent to which epigenetic factors influence the ability of Antarctic species to adapt to changed environments. In a previous study, she found that intertidal and subtidal limpets showed significant DNA methylation differences when transplanted to the opposite tidal zone. “This suggests that epigenetic factors play an important role in physiological flexibility associated with environmental niche, but we don’t know how widespread this is among Antarctic species” Melody says. She intends to start by using bisulphite sequencing to assess epigenetic variation between Laternula elliptica clams of different ages, since older clams appear to respond more poorly to environmental change. “We often have to rely on generating huge data sets and using Big Data techniques to study these species because their genomes haven’t been sequenced yet” Melody says. “Hopefully, as more genomes and molecular tools become available, we will be able to uncover the biochemistry of these animals’ responses, rather than just determining the end product – whether there is a stress or acclimation response”.
Making sense at the molecular level
Big Data technologies also have the potential to revolutionise the fields of proteomics and metabolomics, with the goal of understanding how changes at the molecular level determine whole-organism fitness and responses to environmental perturbations. Such knowledge has a broad range of potential applications, from breeding crops resistant to stresses and diseases, to developing novel drug therapies. But capturing the molecular diversity within cells, not to mention linking this to gene transcription data, remains a challenge, although it is one that SEB researchers are making great strides towards meeting.
“I work at the intersection of plant metabolism and stress responses” says Nick Smirnoff (University of Exeter, UK). “I am particularly interested in the involvement of reactive oxygen species and antioxidants in the response of plants to stresses such as high light, drought and temperature extremes.” During the talk ‘Using metabolomics to investigate plant stress responses’, he hopes to present the results of a project to profile metabolite changes in the model plant Arabidopsis thaliana, to understand the effect of vitamin C deficiency on stress responses. This primarily involves the targeted analysis of plant extracts subjected to mass spectrometry coupled to gas or liquid chromatography. However, given that typically only 10-20% of compounds can be firmly identified in these experiments, Nick has also been using Big Data for “untargeted” profiling. “This enables us to detect thousands of chemical features in a plant extract and follow their changes under different conditions” Nick says.
His results so far suggest that vitamin C-deficient plants accumulate more anti-pathogen compounds such as phytoalexins, but accumulate pigments such as anthocyanins much more slowly, possibly because of an impairment of redox signalling processes. “We have also found a significant number of vitamin status-responsive compounds that have not yet been identified” says Nick. “This is where that ability to mine big datasets would be helpful, particularly to see if the same compounds have changed under other conditions.”
Whilst these projects demonstrate the power of Big Data, Nick argues that greater impact can only be realised if different researchers can work collaboratively to combine different datasets. “The value of Big Data can be massively increased by joining up results from different labs in public repositories” he says. “We ultimately hope to combine our metabolomics data with transcriptomics, proteomics and phenomics data from plants under stress to identify patterns of response”. But this may require a fundamental shift in how datasets are published and shared. “The default is typically to publish datasets in the supplementary information of journal articles, but these are hard to access and reuse, and they are often only available in formats that can’t be indexed and searched”. What’s more, supplementary materials may not undergo the same rigorous peer-review process as the main body of the paper, resulting in a lack of quality control. Instead, he suggests there should be more funding for ‘community databases’, typically developed from the bottom up to organically meet the evolving and immediate needs of researchers. The Ensembl platform, for instance, was originally a resource for human biology, but has recently expanded to include plants and other organisms. “At the conference, I will be interested to see what other researchers across disciplines consider to be Big Data and to explore how these datasets are being archived to facilitate re-use, analysis across experiments and integration of different types of data to improve understanding” he concludes.
An exercise in discovery…
As our knowledge of gene regulation expands, it is becoming increasingly clear that complex gene regulatory networks and their interaction with environmental factors are at the heart of an organism’s biology, from its physiology to behaviour. Understanding how these complex networks work together remains a challenge, but one that Big Data shows great potential for addressing. During the talk ‘Omics approaches to ecophysiology’, Francesco Falciani (University of Liverpool, UK) will illustrate this using case studies from his research. “Recently, for instance, we wanted to use computational biology techniques to investigate which genes determine the response of skeletal muscle to physical exercise, and how these are affected by physiological factors such as insulin and hormone balance” he says. Through the USA HERITAGE Family Study, Francesco was able to access a comprehensive dataset involving over 700 individuals. Besides extensive genotyping (GWAS) and skeletal muscle gene expression profiling, the data included hundreds of physiological measurements taken both before and after physical training. “To make sense of this very large dataset, we used a computational technique called network inference” says Francesco. “This applies mathematics to learn the structure of an underlying network from observational data”.
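For readers unfamiliar with network inference, the sketch below shows one of the simplest possible forms of it: linking two genes whenever their expression profiles across samples are strongly correlated. This is only a stand-in to illustrate the idea of learning network structure from observational data. The article does not specify which inference method was used, and the gene names, expression values and the 0.9 threshold here are all invented assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def infer_network(profiles, threshold=0.9):
    """Return an edge (gene_a, gene_b, r) for every gene pair whose
    absolute correlation across samples reaches the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_a], profiles[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Hypothetical expression values for three genes measured in five samples.
profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [2.1, 3.9, 6.2, 8.0, 9.9],
    "gene_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

edges = infer_network(profiles)
print(edges)
```

Correlation alone cannot distinguish a regulator from its target, which is one reason why predictions from any inferred network still need to be tested experimentally.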
According to the model developed by this approach, the most likely regulator of energy metabolism is Eukaryotic Translation Initiation Factor 6 (EIF6). “We got very excited at this point because this gene had never been implicated in this process before” says Francesco. “Either we had discovered something new or our model was completely wrong”. It was now time to go back to the lab to confirm the result, using a mouse strain carrying only one active copy of the EIF6 gene. Satisfyingly, the experiment validated the model’s predictions: “The ability of the muscles in these mice to generate energy efficiently was compromised, an observation that was consistent with the fact that they were not very good at performing physical exercise” says Francesco.
Having demonstrated the power of Big Data to identify new gene regulators, Francesco now intends to apply this technique to microRNAs: small regulatory RNA molecules that don’t produce functional proteins. “We are interested in these because every tissue in an organism releases microRNAs into the bloodstream, so sequencing the microRNAs in the HERITAGE study’s blood samples could yield valuable tissue-specific information” he says. A long-term goal is to combine multiple datasets (such as metabolomics, proteomics and transcriptomics) and to investigate inter-organism networks, for instance between humans and their microbiome. But this will require highly sophisticated analytical tools. “Our ability to generate data is still far superior to our ability to integrate it and generate conceptual models to explain how these levels of complexity work” Francesco explains. He argues that we need to move beyond simply looking for patterns in complex datasets: there is an urgent need for a conceptual model that reflects how information flows between different levels, for instance from genes to proteins. “Once we have the conceptual framework in place, then we can simply feed the data into it” Francesco says. He hopes the sessions at SEB Prague will provide opportunities to share ideas and discuss how such a model could be developed.
Hunting for answers
“More and more people are interested in individual differences in animal behaviour and automated tracking is the best way to collect reliable data. As a result, over the last several years we've seen a large increase in the number of software packages designed for tracking animal movements” says ecophysiologist Shaun Killen (University of Glasgow, UK), one of the organisers for the session ‘Automated animal tracking in behavioural studies’. His co-organiser, marine ecophysiologist Stefano Marras (Italian National Research Council, CNR), for instance, is using automated tracking to decipher the kinematics of predator-prey interactions. “Prey maximize their individual fitness by spending their time foraging for food or finding a mating partner and reproducing. However, they need to avoid becoming the meal of a predator and have to evaluate the trade-offs between feeding and survival” he explains. Consequently, both predators and prey have evolved behaviours to increase their individual fitness. Traditionally, observing these behaviours required laborious manual quantification of field observations or video footage. But automated animal tracking is swiftly replacing these methods, supported by ever-more sophisticated computational analysis.
Stefano’s research model is the hunting behaviour of sailfish (Istiophorus platypterus) and striped marlin (Kajikia audax) whilst attacking schooling sardine prey. Both these fish, popular with sport-fishers, are characterised by an extended bill, widely thought to function in prey capture. To investigate this, Stefano compared the hunting strategies of the two fish, to see if any differences could be linked to morphological variations in their bills. The first step was to transform recorded video footage of the predators hunting into quantitative data. To do this, he used a computational technique which broke down the sequences into discrete elements, such as approach, dash, bill use and prey contact. The video data were then characterised using a Markov chain model, where the probability of each event in a sequence depends only on the state attained in the previous event. In this way, 150 attack sequences for sailfish and 665 for striped marlin were tracked and analysed automatically, saving countless hours of human labour.
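The defining property of a Markov chain, that each event depends only on the one before it, can be made concrete with a short sketch. The code below counts how often one behavioural element follows another in a set of labelled sequences and converts those counts into transition probabilities. The three attack sequences are invented for illustration and the function name is hypothetical; this is a minimal sketch of the general technique, not the pipeline used in the study.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    For every state, count which state follows it across all sequences,
    then normalise the counts so each row sums to 1.
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    probabilities = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probabilities[state] = {
            nxt: n / total for nxt, n in followers.items()
        }
    return probabilities


# Hypothetical attack sequences, using the element names from the text.
attacks = [
    ["approach", "dash", "prey_contact"],
    ["approach", "bill_use", "prey_contact"],
    ["approach", "bill_use", "bill_use", "prey_contact"],
]

probs = transition_probabilities(attacks)
for state, row in probs.items():
    print(state, row)
```

Comparing the resulting probability tables between two species is one way to ask whether their attack strategies differ, for example whether an approach is more likely to lead to bill use or to a dash.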
“We found that sailfish tended to position themselves on the outside of the school, using their bill to perform many lateral slashes” Stefano says. “Whereas striped marlin tended to dash through the school in a ‘rush and grab’ movement that dispersed the fish”. Various morphological features may account for these differences. For example, the sailfish bill is thinner and rounder, which may camouflage it sufficiently so that it can be inserted into the school without the prey dispersing. In addition, sailfish have a more laterally compressed shape compared with striped marlin, which would give greater manoeuvrability for turning sideways. “Our analysis has shown that Markov chains have considerable potential for analysing predator attack strategies because they can go beyond simple binary tests to compare complex, dynamic behaviour patterns” Stefano says. In his current work, he hopes to advance these methods from analysing individuals to modelling the dynamics of entire schools of fish. “Using automated tracking methods, we can analyse the movement of each individual within the school and merge these behaviours into a collective group response to define the effect of physiological assortment on the school’s dynamics” he says. This could be used, for instance, to investigate the effect of climate change on the social dynamics of large groups of animals.
Meanwhile, the complementary session ‘Putting animal biology in ecological context with advances in animal tracking and bio-logging’ will focus on monitoring animals in the wild using bio-logging and telemetry. “It would be great if the researchers attending these sessions can share strategies for dealing with the enormous datasets collected in completely different systems” says Shaun. “There is a huge amount of biological information in these automated traces of animal movement, but often the trick is determining how to best deal with sampling error to reveal the most accurate picture of what the animals are actually doing.”