The Bioinformatics aspect at TARA
The National Biobank of Thailand (NBT) was established last year with the goal of sustainably conserving, over the long term, and promoting the efficient use of Thailand’s rich bio-resources, including plants, microbes, animal cell lines and human genomes. NBT houses advanced storage and cryopreservation technologies to ensure the viability of all its living biomaterials. Every specimen entry must be accompanied by a wide variety of valuable, well-curated biodata, ranging from phenotypic data and molecular-based classification information to DNA and functional characterization data; this biodata is the heart of NBT’s core data service.
Collected biodata such as phenotypes and demographic information can often be stored directly in our databases. The genetic data of these specimens, however, are trickier, as computational preprocessing is required before they can be used to annotate the underlying specimens. In layperson’s terms, the millions of nucleotides in a genome are converted (sequenced) into a long string containing only four letters: A, T, C and G. These letters must be read into RAM for further processing, so memory requirements grow with genome size. The term bioinformatics was coined for the study in which data at the molecular level, e.g., genomics (DNA), transcriptomics (RNA) or proteomics (protein), are processed. Some common bioinformatics tasks are inherently NP-hard, while others are solvable in polynomial time, yet the large genome size makes them impractical to run on a conventional personal notebook or desktop computer. Thus, NBT requires high-performance computing (HPC) services to facilitate its day-to-day operations.
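To make the memory point concrete, the following is a minimal Python sketch, not NBT's actual pipeline. It compares holding a genome as plain text (one byte per letter) with packing each letter into 2 bits. The genome length of roughly 3.1 billion bases is an assumption based on the commonly cited size of the human genome and does not come from the text above; the function names are hypothetical.

```python
# Assumption (not from the text): a human haploid genome is ~3.1 billion bases.
GENOME_BASES = 3_100_000_000

# Hypothetical 2-bit code for the four letters.
TWO_BIT_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bytes_as_text(n_bases: int) -> int:
    """Bytes needed when each base is stored as one ASCII character."""
    return n_bases


def bytes_two_bit(n_bases: int) -> int:
    """Bytes needed when four bases are packed into each byte (2 bits per base)."""
    return (n_bases + 3) // 4


def pack_two_bit(seq: str) -> bytes:
    """Pack a sequence of A/C/G/T letters into bytes, four bases per byte."""
    packed = bytearray(bytes_two_bit(len(seq)))
    for i, base in enumerate(seq):
        # Each base occupies 2 bits; position within the byte is i % 4.
        packed[i // 4] |= TWO_BIT_CODE[base] << (2 * (i % 4))
    return bytes(packed)


if __name__ == "__main__":
    print("plain text :", bytes_as_text(GENOME_BASES) / 1e9, "GB")
    print("2-bit pack :", bytes_two_bit(GENOME_BASES) / 1e9, "GB")
    print("packed ACGTACG uses", len(pack_two_bit("ACGTACG")), "bytes")
```

Even under this simplified model, a single genome held as text occupies a few gigabytes before any analysis begins, and real sequencing output is many times larger because each position is read repeatedly.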
The arrival of Genomics Thailand, the largest population genomic study conducted in Thailand, drives the country’s HPC demand further, as it opens up a new big-data challenge for Thailand. Sequencing fifty thousand Thai individuals would take up a sheer 5 petabytes of storage space, not to mention other pertinent information produced during the analyses. For a common task such as transforming raw sequencing data into an individual’s genetic variation profile, termed variant discovery, specialized hardware such as field-programmable gate arrays (FPGAs) or graphics processing units (GPUs), together with software specifically designed to maximize computation on such resources, must be used to expedite the computation. To put this burden into perspective, a 32-core machine with 3 TB of RAM would require 40 hours or more to perform variant discovery on one individual’s data, whereas the aforementioned hardware completes the same task in under two hours.
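The figures quoted above can be turned into a rough back-of-the-envelope calculation. The Python sketch below uses only the numbers stated in the text (50,000 individuals, 5 petabytes, 40 hours versus 2 hours per individual). Two things are assumptions for illustration: decimal units (1 PB = 1,000,000 GB) and one sample processed at a time on a single machine. Because the text gives "40 hours or more" and "under two hours", the resulting speedup should be read as a lower bound.

```python
# Figures quoted in the text.
INDIVIDUALS = 50_000
TOTAL_STORAGE_PB = 5
CPU_HOURS_PER_SAMPLE = 40      # "40 hours or more" on a 32-core, 3 TB machine
ACCEL_HOURS_PER_SAMPLE = 2     # "under two hours" on FPGA/GPU hardware

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000
HOURS_PER_YEAR = 24 * 365


def storage_per_individual_gb() -> float:
    """Average storage per sequenced individual, in gigabytes."""
    return TOTAL_STORAGE_PB * GB_PER_PB / INDIVIDUALS


def speedup() -> float:
    """Lower bound on the per-sample speedup from specialized hardware."""
    return CPU_HOURS_PER_SAMPLE / ACCEL_HOURS_PER_SAMPLE


def total_machine_hours(hours_per_sample: float) -> float:
    """Machine-hours to process every individual, one sample at a time."""
    return INDIVIDUALS * hours_per_sample


if __name__ == "__main__":
    print("storage per individual (GB):", storage_per_individual_gb())
    print("speedup (at least):", speedup())
    for label, hours in (("CPU", CPU_HOURS_PER_SAMPLE),
                         ("accelerated", ACCEL_HOURS_PER_SAMPLE)):
        total = total_machine_hours(hours)
        print(f"{label}: {total:,.0f} machine-hours "
              f"(~{total / HOURS_PER_YEAR:.1f} machine-years)")
```

Under these assumptions the cohort averages about 100 GB per individual, and processing everyone serially would take on the order of two million machine-hours on the CPU system versus about one hundred thousand on the accelerated one.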
Demand on the computational side will keep growing as the data service at NBT grows. TARA’s HPC is an important piece of Thailand’s advanced computing infrastructure, and it will surely foster activities that help us reach the Thailand 4.0 goal.