CCB Bioinformatics BLAST processing Workflow
This is a simple workflow demonstrating a sequence search with the Basic Local Alignment Search Tool (BLAST).
Problem addressed by this workflow
This workflow shows an example of a common bioinformatics pipeline that uses tools from several different institutions. It starts by formatting the NCIBI/NCBI Escherichia coli (E. coli) database, then creates a database index table, uses FASTA query instructions to create a filtering file, and finally runs miBLAST, an efficient Basic Local Alignment Search Tool (BLAST) for batches of nucleotide sequence queries. Such batch workloads contain a large number of query sequences and can be evaluated by running BLAST on each individual query, one at a time.
We have integrated NCBI BLAST with miBLAST (University of Michigan) to improve sequence search and alignment efficiency without any loss in sensitivity. miBLAST employs q-gram indexing and a filtering algorithm to quickly detect sequence similarity between the query sequences and the database sequences, which results in a substantial increase in overall performance.
This pipeline workflow specifies a FASTA database, the size of the search word, and a set of search instructions; it then performs the P+BLAST search and completes in 1 minute. The inset images illustrate a fragment of the final alignment results of this pipeline.
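The q-gram indexing and filtering idea mentioned above can be sketched in a few lines of Python. This is a toy illustration of the general technique only, not miBLAST's actual implementation: the function names, the choice of q = 4, the `min_shared` threshold, and the example sequences are all assumptions made for this sketch. The index maps each q-gram to the database sequences that contain it, and the filter keeps only those sequences that share enough q-grams with a query to be worth passing to a full alignment step.

```python
from collections import defaultdict


def qgrams(seq, q):
    """Yield every overlapping substring of length q in seq."""
    for i in range(len(seq) - q + 1):
        yield seq[i:i + q]


def build_index(database, q):
    """Map each q-gram to the set of database sequence IDs that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for gram in qgrams(seq, q):
            index[gram].add(seq_id)
    return index


def filter_candidates(query, index, q, min_shared=1):
    """Return IDs of sequences sharing at least min_shared distinct q-grams with query."""
    hits = defaultdict(int)
    for gram in set(qgrams(query, q)):
        # .get() avoids creating empty entries in the index for unseen q-grams
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return {seq_id for seq_id, count in hits.items() if count >= min_shared}


# Hypothetical toy database and batch of queries (not real E. coli data)
database = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTTTGGGGCC",
    "seq3": "GGACGTTTAA",
}
index = build_index(database, q=4)

queries = {"q1": "CCACGTAC", "q2": "AAAAAAAA"}
for name, query in queries.items():
    # Only the surviving candidates would be passed on to the alignment step
    print(name, sorted(filter_candidates(query, index, q=4)))
```

Because the index is built once and reused for every query in the batch, the cost of scanning the database is shared across all queries. In this sketch, raising `min_shared` makes the filter stricter, at the risk of discarding weak but genuine matches.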
Detailed Workflow Usage & Specifications
- Outputs and results:
- Expected times: Workflow takes about 10 minutes
- Contact person/group: SIG-FLOW Team
- Pubs: Publications
- Tools/packages used in this workflow:
Common informatics/genomics data formats
gcg, embl, swissprot, fasta, ncbi, genbank, nbrf, codata, strider, clustal, phylip, acedb, msf, ig, staden, text, raw, asis.
Additional Bioinformatics tools and workflows
We plan to include additional Bioinformatics tools in the Pipeline Informatics Environment. Such software is useful for basic sequence analysis, phylogenetic and population genetics analyses, protein structure modeling, expression array analysis, statistics and mathematical modeling. Examples include:
- Genomic databases:
- Gene information retrieval: HGNC.
- Associations of genes with disease: GAD database
- Information about deletions, duplications, and indels: DGV
- 1000 Genomes. The data section also includes a browser for exploring regions of the genome and the evidence obtained by the 1000 Genomes Project (a project that sequenced 1000 genomes, allowing the identification of SNPs rarer than those identified by the HapMap project).
- MOTHUR, Catchall
- General Sequence Analysis Packages (EMBOSS, etc.)
- Database Access (EMBL, PDB, SCOP, GenBank, etc.)
- Phylogenetic Inference (PHYLIP, PAUP*, MrBayes, fastDNAml, GeneTree, MODELTEST, P4, PAML, Seq-Gen, TreeView)
- Population Genetics (Migrate, Fluctuate, Recombine, Lamarc, GeneConv)
- Sequence Alignment (HMMER, ClustalW, MAFFT, MUSCLE, etc.)
- Sequence Assembly (Phred/Phrap/Consed, RepeatMasker)
- Protein Structure Visualization (Amber, Charmm, Cn3D, Rasmol, 3D Molecular Viewer)
- Statistical/Mathematical Packages (R, Matlab, and S3).