Distributed Processing of Next Generation Sequencing Data Set

Hadadian Nejad Yousefi, Mostafa | 2017

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 50168 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Goudarzi, Maziar; Motahari, Abolfazl
  7. Abstract: DNA analysis plays a significant role in fields such as pharmacy, agriculture, genealogy, and forensics. Next-generation sequencing datasets cover a gene several times because of the large number of reads, so the raw data volume is several times the amount of memory required to store the DNA strand itself. First, the DNA sequence of a sample must be reconstructed from the raw data; then the sample sequence is compared with the reference DNA sequence to find the differences between them. From these differences, one can extract the characteristics of the tested species. The extracted characteristics are valuable to genetics researchers: for example, they make it possible to produce drugs tailored to a patient's characteristics, so that the patient can be treated faster and with fewer side effects. In this study, we implemented this application from scratch and made it up to 6 times faster than the nearest competitor. We also identified its performance bottlenecks and provided solutions to them. We designed a distributed processing model for this application and implemented it on the Apache Spark platform, achieving a 2x speedup on a 4-node cluster relative to a 2-node cluster.
  8. Keywords: Bioinformatics; Big Data; Genome Analysis; Gene Expression Data; Apache Spark; Aligner