Study uses computer learning to provide quality control for genetic databases

Summary: A new study helps to shed light on the transcriptomic differences between different tissues in Arabidopsis, an important model organism, by creating a standardized “atlas” that can automatically annotate samples to include lost metadata such as tissue type.

DNA doesn’t exist in a vacuum: even though every cell contains the entire genome of its host organism, cells know how to differentiate, to become part of an eye, or a bone, or a leaf. These differences are related to each cell’s transcriptome, the set of messenger RNA (mRNA) molecules that shows which parts of the genome are being expressed on their way to being translated into proteins.

A new study published in The Plant Journal helps to shed light on the transcriptomic differences between different tissues in Arabidopsis, an important model organism, by creating a standardized “atlas” that can automatically annotate samples to include lost metadata such as tissue type. By combining data from over 7,000 samples and 200 labs, this work represents a way to leverage the increasing amounts of publicly available ‘omics data while improving quality control, allowing for large-scale studies and data reuse.

“As more and more ‘omics data are hosted in the public databases, it becomes increasingly difficult to leverage those data. One big obstacle is the lack of consistent metadata,” says first author and Brookhaven National Laboratory research associate Fei He. “Our study shows that metadata might be detected based on the data itself, opening the door for automatic metadata re-annotation.”

The study focuses on data from microarray analyses, an early high-throughput genetic analysis technique that remains in common use. Such data are often made publicly available through repositories such as the National Center for Biotechnology Information’s Gene Expression Omnibus (GEO), which over time accumulates vast amounts of information from thousands of studies.

Though this abundance of data opens the door for large and inexpensive studies, integrating multiple data sets is often problematic. As University of Illinois bioengineer and Carl R. Woese Institute for Genomic Biology affiliate Sergei Maslov explains, “tissue type is a major metadata point for a sample. However, different researchers use different vocabularies to describe the same tissue, [… and] errors exist during the data submission process.”

Because the sheer amount of data precludes manual correction or quality control, Maslov, He and collaborators were inspired to create an automated solution that could deduce metadata from the expression profiles themselves by identifying similarities between tissue types. Their findings suggest that expression profiles remain remarkably similar between samples of the same tissue type, even when taken from plants grown under very different conditions.

By identifying the most similar samples with tissue types already annotated, researchers were able to teach their algorithm to identify other samples of the same type with an excellent degree of accuracy. The team generated over 10,000 entries of metadata, and was even able to correct some mistaken annotations in another lab’s study after confirming them with the original author. The end result is a massive “atlas” of well-annotated data that can be used for future studies.
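
The idea of inferring a missing tissue label from the most similar annotated samples can be illustrated with a small nearest-neighbor sketch in Python. This is not the study's actual algorithm: the function name `predict_tissue`, the use of Pearson correlation as the similarity measure, the majority vote over `k` neighbors, and the toy expression values are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def predict_tissue(expr, labels, k=3):
    """Fill in missing tissue labels by majority vote among the k most
    similar annotated samples (hypothetical helper, not from the study).

    expr   : (n_samples, n_genes) expression values
    labels : one entry per sample; None marks a missing annotation
    """
    expr = np.asarray(expr, dtype=float)
    corr = np.corrcoef(expr)  # sample-by-sample Pearson correlation
    annotated = [i for i, lab in enumerate(labels) if lab is not None]
    predicted = list(labels)
    for i, lab in enumerate(labels):
        if lab is not None:
            continue  # keep existing annotations untouched
        # Rank annotated samples by similarity to sample i, keep the top k.
        nearest = sorted(annotated, key=lambda j: corr[i, j], reverse=True)[:k]
        votes = Counter(labels[j] for j in nearest)
        predicted[i] = votes.most_common(1)[0][0]
    return predicted


# Toy data (made up): four genes, two tissues with opposite expression patterns.
expr = [
    [9.0, 8.0, 1.0, 2.0],  # annotated root
    [8.5, 8.2, 1.5, 1.8],  # annotated root
    [1.0, 2.0, 9.0, 8.0],  # annotated leaf
    [1.4, 1.9, 8.7, 8.3],  # annotated leaf
    [8.8, 7.9, 1.2, 2.1],  # metadata lost, root-like profile
    [1.2, 2.2, 9.1, 7.8],  # metadata lost, leaf-like profile
]
labels = ["root", "root", "leaf", "leaf", None, None]

predicted = predict_tissue(expr, labels, k=3)
print(predicted)  # ['root', 'root', 'leaf', 'leaf', 'root', 'leaf']
```

In a sketch like this, the two unlabeled samples inherit the label of the profiles they correlate with most strongly, which mirrors the article's point that samples of the same tissue look alike even across different experiments.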

“Our ultimate goal is to provide cloud-based computer infrastructure for the study of energy/agriculture related plants, such as poplar and maize,” says Maslov. “If our strategies have been successfully applied on Arabidopsis, they can be applied on other species as well.”

Meanwhile, adds He, their integrated Arabidopsis atlas is itself an important contribution to plant genetics. “It can be used for constructing coexpression networks, one of the popular methods to leverage transcriptome data for annotation of gene function. We hope it will become a gold standard dataset in many applications.”
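
For readers unfamiliar with coexpression networks, the following Python sketch shows one common way such a network can be built: correlate every pair of genes across samples and link the pairs whose correlation is strong. The function name `coexpression_edges`, the 0.9 threshold, the gene names, and the expression values are hypothetical and chosen only for illustration; the article does not describe how the authors would construct their networks.

```python
import numpy as np


def coexpression_edges(expr, gene_names, threshold=0.9):
    """Return (gene_a, gene_b, correlation) for every gene pair whose
    absolute Pearson correlation across samples meets the threshold.

    expr : (n_samples, n_genes) expression values
    """
    expr = np.asarray(expr, dtype=float)
    corr = np.corrcoef(expr.T)  # gene-by-gene correlation across samples
    edges = []
    n_genes = len(gene_names)
    for a in range(n_genes):
        for b in range(a + 1, n_genes):
            if abs(corr[a, b]) >= threshold:
                edges.append(
                    (gene_names[a], gene_names[b], round(float(corr[a, b]), 3))
                )
    return edges


# Toy data (made up): five samples, three genes.
# geneB rises and falls with geneA; geneC is uncorrelated with both.
gene_names = ["geneA", "geneB", "geneC"]
expr = [
    [1.0, 3.0, 2.0],
    [2.0, 5.0, -1.0],
    [3.0, 7.0, 0.0],
    [4.0, 9.0, -1.0],
    [5.0, 11.0, 2.0],
]

edges = coexpression_edges(expr, gene_names)
print(edges)
```

Genes that end up linked in such a network tend to be expressed together, which is why a gene of unknown function can sometimes be tentatively annotated from its better-characterized neighbors.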

Saurabh