About the Course

  • Title: Bioinformatics: Practical Application of Data Mining & Simulation

  • Instructor: Gerstein, Mark

  • TAs: Huang, Xiu & Lee, Donghoon

  • Introduction: Bioinformatics encompasses the analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. It represents a major practical application for modern techniques in data mining and simulation. Specific topics to be covered include sequence alignment, large-scale processing, next-generation sequencing data, comparative genomics, phylogenetics, biological database design, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, normalization of microarray data, mining of functional genomics data sets, and machine-learning approaches to data integration.

  • Check out our awesome course website.

  • Check out our post on bioinformatics.

About the Final Project

  • Why?

    Instead of generating papers or code that nobody would ever read (except for the TAs), we want to encourage students to build innovative products that could benefit the bioinformatics community.

  • When?

    Released: April 14th

    Due: May 9th 11:59PM

  • How?

    Each student will collaborate with classmates on three different projects (or four for extra credit). The resulting code and documents will be published on this website as resources for later students and researchers.

  • What?

    Project topics are as follows. Students can choose their three or four favorite projects to work on.

For each sub-project, a group of three people will work on:

  1. R card: sample input, source code in R, sample output, and documentation on how to execute your code

  2. Python card: sample input, source code in Python, sample output, and documentation on how to execute your code

  3. English card: methodology and background introductory documentation

Topics and Grouping Assignment

1. QC steps

1.1 Propose a tool that removes the barcode or sequence identifier from a FastQ file. (Discard)

1.2 Propose a tool that generates “quality control statistics” from a FastQ file. (E:Aparna, P:Peter, R:Dan)

1.3 Propose a tool that trims reads from a FastQ file based on quality score. (E:Nathan, P:Heather, R:Dan)
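As a possible starting point for topics 1.2 and 1.3, the sketch below decodes a FastQ quality string into Phred scores and trims low-quality bases from the 3' end of a read. It assumes Phred+33 encoding and a simple "cut trailing bases below a threshold" rule; the function names and the default threshold of 20 are illustrative choices, not course requirements.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FastQ quality string to a list of Phred scores.

    The offset of 33 (Phred+33) is an assumption about the encoding.
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Mean Phred score of one read, a simple per-read QC statistic."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def trim_3prime(sequence, quality_line, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_line, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]
```

Real trimming tools often use sliding windows or running-sum rules instead; whichever rule a group picks should be described in the English card.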

2. Sequence Analysis

2.1 Propose a tool that generates pileup format from a SAM file. (Discard)

2.2 Propose a tool that calculates FPKM (or TPM; justify your choice) from given SAM and GTF files. (E:Edmond Dantes, P:Kevin, R:Julian)

2.3 Propose a tool that calculates the intersection between two BED files. (Discard)

2.4 Propose a tool that calls SNVs from a pileup file and generates the output in VCF format. (Discard)

2.5 Propose a tool that identifies differentially expressed genes from a GCT file of gene expression data. (E:Edmond Dantes, P:Heather, R:Calvin)

2.6 Propose a tool that finds k-mer motif enrichment in a given nucleotide sequence. (E:Nathan, P:ELK, R:Julian)
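For topic 2.2, the two normalizations differ only in the order of the scaling steps. The sketch below assumes that per-gene fragment counts and gene lengths (in bases) have already been extracted from the SAM and GTF files; the dictionaries `counts` and `lengths` are hypothetical inputs, and the total number of mapped fragments is taken to be the sum of the counts.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM_i = count_i * 1e9 / (length_i * total_mapped_fragments)
    Assumes total mapped fragments equals the sum of all gene counts.
    """
    total = sum(counts.values())
    return {gene: counts[gene] * 1e9 / (lengths[gene] * total) for gene in counts}


def tpm(counts, lengths):
    """Transcripts per million.

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    scale = sum(rates.values())
    return {gene: rates[gene] / scale * 1e6 for gene in counts}
```

Because TPM values sum to one million in every sample, they are easier to compare across samples than FPKM values, which is one argument a group could make when justifying its choice.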

3. Network Analysis

3.1 Propose a tool that calculates a co-expressed gene network from a GCT file of gene expression data. (E:Aparna, P:ELK, R:Dan)

3.2 Propose a tool that calculates the degree centrality and betweenness centrality of proteins from a PPI file. PPI data can be downloaded from the DIP, BIND, MIPS, MINT, and IntAct databases. (E:Edmond Dantes, P:ELK, R:Julian)

3.3 Propose a tool that calculates the enrichment level of gene expression data given pre-defined gene sets (http://software.broadinstitute.org/gsea/msigdb). (E:Aparna, P:Kevin, R:Calvin)
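For topic 3.2, both centrality measures can be computed with the standard library alone. The sketch below assumes the PPI file has already been parsed into a list of undirected `(protein_a, protein_b)` pairs, which is a hypothetical input format. Degree centrality is normalized by n - 1, and betweenness is the raw (unnormalized) count obtained with Brandes' algorithm for unweighted graphs.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops and duplicates."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def betweenness_centrality(adjacency):
    """Raw betweenness via Brandes' algorithm (unweighted, undirected)."""
    betweenness = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in BFS visit order
        predecessors = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}      # number of shortest paths
        distance = {node: -1 for node in adjacency}
        sigma[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        delta = {node: 0.0 for node in adjacency}
        for w in reversed(order):                    # accumulate dependencies
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                betweenness[w] += delta[w]
    # Each unordered pair is counted from both endpoints in an undirected graph.
    return {node: value / 2 for node, value in betweenness.items()}
```

On a star network with hub A and leaves B, C, and D, the hub has degree centrality 1.0 and betweenness 3, since all three leaf pairs communicate through it.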

4. Structure Analysis

4.1 Propose a tool that calculates the distance between two alpha carbons from a PDB file. (The program should output the distance between the two atoms in angstroms.) (E:Gawain, P:Peter, R:Calvin)

4.2 Propose a tool that calculates the Lennard-Jones potential based on the input of a PDB file consisting of just alpha carbons and a query point’s xyz coordinates. (E:Nathan, P:Heather, R:Gawain)

4.3 Propose a tool that calculates the dihedral angle based on the input of four points’ xyz coordinates in PDB format. (E:Gawain, P:Peter, R:Kevin)
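The three structure topics share the same geometry primitives, sketched below. The parser relies on the fixed-column layout of PDB ATOM records (atom name in columns 13-16, x, y, z in columns 31-54). The Lennard-Jones form V(r) = 4ε[(σ/r)^12 - (σ/r)^6] and the default values ε = 1 and σ = 1 are assumptions for illustration; the topic description does not specify parameters. Requires Python 3.8 or later for `math.dist`.

```python
import math


def parse_ca_coordinates(pdb_lines):
    """Return (x, y, z) tuples for alpha carbons found in ATOM records."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def distance(p, q):
    """Euclidean distance, in angstroms when the inputs are PDB coordinates."""
    return math.dist(p, q)


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Pairwise Lennard-Jones potential; epsilon and sigma are placeholders."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def lennard_jones_at_point(query, coords, epsilon=1.0, sigma=1.0):
    """Sum of pairwise potentials between a query point and every atom."""
    return sum(lennard_jones(distance(query, c), epsilon, sigma) for c in coords)


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 axis (cis = 0)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Using `atan2` rather than `acos` keeps the sign of the dihedral angle and avoids numerical trouble near 0 and 180 degrees. The sketch does not guard against a query point that coincides with an atom (r = 0) or against collinear points in the dihedral calculation.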

Grouping Assignment Table

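Name           Primary  1.1  1.2  1.3  2.1  2.2  2.3  2.4  2.5  2.6  3.1  3.2  3.3  4.1  4.2  4.3
Julian         R        -    -    -    -    R    -    -    -    R    -    R    -    -    -    -
Kevin          P/R      -    -    -    -    P    -    -    -    -    -    -    P    -    -    R
Edmond Dantes  E        -    -    -    -    E    -    -    E    -    -    E    -    -    -    -
ELK            P/E      -    -    -    -    -    -    -    -    P    P    P    -    -    -    -
Aparna         E        -    E    -    -    -    -    -    -    -    E    -    E    -    -    -
Dan            R        -    R    R    -    -    -    -    -    -    R    -    -    -    -    -
Nathan         E        -    -    E    -    -    -    -    -    E    -    -    -    -    E    -
Heather        P/E      -    -    P    -    -    -    -    P    -    -    -    -    -    P    -
Peter          P/E      -    P    -    -    -    -    -    -    -    -    -    -    P    -    P
Calvin         R/E      -    -    -    -    -    -    -    R    -    -    -    R    R    -    -
Gawain         R/E      -    -    -    -    -    -    -    -    -    -    -    -    E    R    E

(E: English card, P: Python card, R: R card, -: not assigned)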