Distributed Processing of Next Generation Sequencing Data Set

Hadadian Nejad Yousefi, Mostafa; Goudarzi, Maziar; Motahari, Abolfazl

Hadadian Nejad Yousefi, Mostafa | 2017

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 50168 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Goudarzi, Maziar; Motahari, Abolfazl
Abstract:
DNA analysis plays a significant role in fields such as pharmacy, agriculture, genealogy, and forensics. Next-generation sequencing datasets cover each gene several times because of the large number of reads. The initial data volume is therefore several times the amount of memory required to store the DNA strand itself. First, the DNA sequence of a sample must be constructed from the raw data; then the sample sequence is compared with the reference DNA sequence to find the differences between them. From these differences, one can extract the characteristics of the tested species. The extracted characteristics are valuable to genetics researchers. For example, they make it possible to produce drugs tailored to a patient's characteristics, so that the patient can be treated faster and with fewer side effects. In this study, we implemented this application from scratch and made it up to 6 times faster than the nearest competitor. We also identified its performance bottlenecks and provided solutions for them. We designed a distributed processing model for this application and implemented it on the Apache Spark platform. We achieved a 2x speedup on a 4-node cluster relative to a 2-node cluster.
Keywords:
Bioinformatics; Big Data; Genome Analysis; Gene Expression Data; Apache Spark; Aligner
