Thursday, 7 May 2026

Bioinformatics Ultimate Cheat Sheet

🌟 Premium DBT JRF Masterclass

1000+ words of essential computational biology. Master Sequence Alignment, BLAST Algorithms, Biological Databases, and Phylogenetics to secure guaranteed marks in DBT BET 2026.

1. Biological Databases & Formats

Bioinformatics relies on storing and retrieving massive amounts of omics data. Databases are classified into three main types: Primary, Secondary, and Composite. You must know which database falls into which category.

Classification of Databases

  • Primary Databases: Store raw, experimentally derived data (Nucleotide or Protein sequences, or 3D structures). Examples include GenBank (NCBI), EMBL (EBI), DDBJ (Japan), and the PDB (Protein Data Bank).
  • Secondary Databases: Store derived information from primary databases, such as conserved patterns, motifs, signatures, and domains. Examples include PROSITE, Pfam, PRINTS, and CATH/SCOP.
  • Composite Databases: Combine data from multiple primary databases to eliminate redundancy. Examples include UniProt and OWL.

FASTA Format: The most widely used text-based format for representing nucleotide or peptide sequences. Each record MUST begin with a greater-than symbol (>) followed by a single description line (ideally starting with a unique identifier). The sequence itself begins on the next line and may wrap over several lines.
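
The FASTA rule above can be made concrete with a short parser. This is only a minimal sketch: the function name parse_fasta and the two demo records are made up for illustration, and real files are usually read with an established library rather than hand-written code.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description line: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # description line, without the leading '>'
            records[header] = ""
        elif header is None:
            raise ValueError("FASTA text must start with a '>' description line")
        else:
            records[header] += line  # a sequence may wrap over several lines
    return records


# Hypothetical demo records (not real database entries).
example = """>seq1 hypothetical demo record
ATGGCC
TTAA
>seq2 another demo record
MKVL
"""
records = parse_fasta(example.splitlines())
print(records)
```

Note that the wrapped lines of seq1 are joined into one sequence, which is why the description line is the only reliable record separator.
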

Database Name | Data Type | Host Organization / Notes
GenBank | Nucleotide sequences | NCBI (USA)
SWISS-PROT | Curated (manually annotated) protein sequences | SIB (Swiss Institute of Bioinformatics)
TrEMBL | Uncurated (computationally annotated) protein sequences | EBI (Europe)
PDB | 3D macromolecular structures | RCSB (structures determined via X-ray crystallography, NMR, or cryo-EM)
Pfam | Protein families and domains | Built using Hidden Markov Models (HMMs)

2. Sequence Alignment Algorithms

Sequence alignment is the core task of bioinformatics: it reveals evolutionary or functional relationships between sequences. Optimal pairwise alignment relies on Dynamic Programming algorithms.

Global vs. Local Alignment

Global Alignment (Needleman-Wunsch Algorithm): Attempts to align every single residue in both sequences from end to end. It is best used when two sequences are of similar length and highly conserved.
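
To see what "end to end" means in practice, here is a minimal sketch of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions, not standard defaults, and real tools use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a simple linear gap penalty (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align the two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Traceback from the bottom-right corner back to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
print(nw_score)
print(nw_a)
print(nw_b)
```

The traceback always starts at the last cell and ends at the first, which is exactly why every residue of both sequences ends up in the alignment.
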

Local Alignment (Smith-Waterman Algorithm): Finds the most highly conserved sub-regions (domains/motifs) between two sequences, ignoring the poorly aligned ends. Best for sequences of differing lengths.
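
The local variant differs from the global one in only two places: scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. A minimal sketch, assuming illustrative scores (match +2, mismatch -1, linear gap -2):

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a simple linear gap penalty (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a new local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


sw_score, sw_a, sw_b = smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC")
print(sw_score, sw_a, sw_b)
```

In this made-up example the flanking T and C runs do not match each other, so only the shared core is reported and the poorly aligned ends are ignored.
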

Scoring Matrices: PAM vs. BLOSUM (Extremely Important!)

To align protein sequences, we need a mathematical model to score matches, mismatches, and gaps. Amino acid substitutions are scored based on their evolutionary probability.

Feature | PAM (Point Accepted Mutation) | BLOSUM (Blocks Substitution Matrix)
Origin Data | Derived from global alignments of very closely related proteins. | Derived from local alignments of highly conserved blocks (domains).
Evolutionary Model | Explicit evolutionary model (mutations over time). | Based on observed frequencies in multiple sequence alignments.
Numbering System | Higher number = Greater evolutionary distance (e.g., PAM250 is for distant sequences). | Lower number = Greater evolutionary distance (e.g., BLOSUM45 is for distant sequences).
Standard Default | PAM250 is standard for distant searches. | BLOSUM62 is the default matrix used in BLASTp.

Exam Trick: Remember that PAM250 is roughly equivalent to BLOSUM45.

3. BLAST (Basic Local Alignment Search Tool)

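Dynamic programming (like Smith-Waterman) guarantees the optimal alignment but is far too slow for searching large databases, because its running time grows with the product of the two sequence lengths. BLAST is a Heuristic algorithm. It trades the guarantee of an optimal result for extreme speed by first finding short word matches (called 'words' or 'k-mers'; exact matches in BLASTn, high-scoring similar words in BLASTp) and then extending them into longer alignments.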
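
The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself: the function names are made up, seeds are exact k-mer matches only, and the extension stops at the first mismatch instead of using BLAST's score-based drop-off rule.

```python
from collections import defaultdict


def index_words(subject: str, k: int):
    """Map every k-mer in the subject to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_seeds(query: str, subject: str, k: int):
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query: str, subject: str, q: int, s: int, k: int) -> str:
    """Extend a seed left and right while residues stay identical (simplified)."""
    start_q, start_s = q, s
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = q + k, s + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


# Hypothetical toy sequences.
query, subject = "ACGTTGCA", "TTACGTTGAA"
seeds = find_seeds(query, subject, k=4)
print(seeds)
print(extend_seed(query, subject, *seeds[0], k=4))
```

The speed comes from the index lookup: only positions that share a word with the query are ever examined, instead of filling a full dynamic programming matrix.
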

Variants of BLAST

  • BLASTn: Nucleotide query against a Nucleotide database.
  • BLASTp: Protein query against a Protein database.
  • BLASTx: Translated Nucleotide query against a Protein database (Used when you have a DNA sequence and want to find what protein it codes for).
  • tBLASTn: Protein query against a Translated Nucleotide database.
  • tBLASTx: Translated Nucleotide query against a Translated Nucleotide database (Most computationally intensive).
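
"Translated" in BLASTx, tBLASTn, and tBLASTx means the nucleotide sequence is translated in all six reading frames (three on each strand) before comparison. A minimal sketch of that step, assuming the standard genetic code and uppercase A/C/G/T input (the function names are illustrative):

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the six conceptual translations keyed as +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames)
```

tBLASTx has to do this for both the query and every database sequence, which is why it is the most computationally intensive variant.
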
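
The E-value (Expectation Value): The number of hits with an equal or better score that one can "expect" to see by pure chance when searching a database of a particular size. The lower the E-value, the more significant the match. A reported E-value of 0.0 means the value is too small to display (effectively zero), so the match is extremely unlikely to be due to random chance; it does not by itself mean the sequences are identical. For a given alignment score, the E-value increases as database size increases.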
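
The dependence on database size follows from the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, and S the raw alignment score. A minimal sketch; the lambda and K values used here are illustrative assumptions, since the real values depend on the scoring matrix and gap penalties:

```python
import math


def e_value(raw_score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative (assumed) statistical parameters and search sizes.
LAM, K = 0.267, 0.041
small_db = e_value(50, query_len=100, db_len=1_000_000, lam=LAM, k=K)
large_db = e_value(50, query_len=100, db_len=2_000_000, lam=LAM, k=K)
print(small_db, large_db)
```

Doubling the database length doubles the E-value for the same score, while each additional bit of score halves it.
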

4. Phylogenetic Trees & Evolution

Phylogenetics uses sequence data to reconstruct the evolutionary history of species (Taxa/OTUs). Trees can be Rooted (showing a common ancestor) or Unrooted.

[Figure: rooted phylogenetic tree with labels Root, Internal Node, Taxon A, Taxon B, Taxon C (Outgroup), and Clade]

Tree Building Methods

  • Distance-Based Methods: Calculate a genetic distance matrix first, then build the tree. Fast, but less accurate for deep evolutionary time. Examples: UPGMA (assumes a constant molecular clock) and Neighbor-Joining (NJ) (does not assume a constant clock).
  • Character-Based Methods: Evaluate each aligned column (character) individually. Computationally heavy but generally more accurate. Examples: Maximum Parsimony (selects the tree requiring the fewest evolutionary changes/mutations) and Maximum Likelihood (selects the tree that makes the observed sequences most probable under an explicit substitution model).
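
The distance-based approach can be illustrated with a minimal UPGMA sketch. It returns only the tree topology as nested tuples (branch lengths are omitted for brevity), and the taxa and distance values are made up for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    Returns the topology as nested tuples.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for other in clusters:
            # New distance is the size-weighted average of the old distances.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distance matrix for four taxa.
distances = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

Because every join uses a plain average of distances, the result is only reliable when all lineages evolve at roughly the same rate, which is the constant molecular clock assumption mentioned above.
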

5. Protein Structure Prediction

If you have an amino acid sequence but no experimentally determined structure (e.g., from X-ray crystallography or NMR), bioinformatics provides three main approaches to predict its 3D folding.

  • Homology Modeling (Comparative Modeling): The most accurate method. Used when your target sequence shares > 30% sequence identity with a protein whose 3D structure is already known (the template) in the PDB.
  • Threading (Fold Recognition): Used when sequence identity is low (< 30%), but you suspect the protein adopts a known structural fold. It forces the sequence through a library of known 3D folds to see which fits best based on energy scores.
  • Ab Initio (De Novo) Prediction: Used when no homologous templates exist. It predicts the structure entirely from scratch using physical and thermodynamic principles (seeking the lowest free energy state). Exceptionally difficult and requires immense computing power.
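
The choice between the three approaches can be written as a simple decision rule. This is a rough sketch of the rule of thumb above, not a validated pipeline: the function names are hypothetical, and percent identity is computed here over aligned non-gap columns, which is only one of several conventions.

```python
from typing import Optional


def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_target, aligned_template)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def suggest_method(best_identity: Optional[float], known_fold_suspected: bool = False) -> str:
    """Apply the 30% rule of thumb; best_identity is None when no template exists."""
    if best_identity is not None and best_identity > 30.0:
        return "homology modeling"
    if known_fold_suspected:
        return "threading"
    return "ab initio"


identity = percent_identity("ACDE-F", "ACDQGF")  # toy aligned pair
print(identity, suggest_method(identity))
```
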

Guaranteed Exam Hits

PYQ (Previous Year Question) Direct Statements (These questions will definitely come up!)
  • INSDC (International Nucleotide Sequence Database Collaboration): It is a joint effort comprising GenBank (USA), EMBL-Bank (Europe), and DDBJ (Japan). Data submitted to any one of these is automatically synchronized daily across all three.
  • Gap Penalties: In alignment algorithms, a Gap Opening Penalty is typically set much higher than a Gap Extension Penalty. This reflects biology: one large deletion of 5 bases is more likely to occur than 5 separate deletions of 1 base.
  • Entrez: It is not a database itself; rather, it is the federated search engine and retrieval system developed by NCBI that allows users to search across GenBank, PubMed, PubChem, and structural databases simultaneously.
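  • Bootstrapping: A statistical resampling method used in phylogenetics to test the reliability/confidence of the branches in a tree. Alignment columns are resampled with replacement and the tree is rebuilt many times; a bootstrap value > 70% is conventionally taken to indicate strong support for that specific clade.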
  • SCOP vs. CATH: Both are structural classification databases. SCOP (Structural Classification of Proteins) is largely curated manually by experts. CATH (Class, Architecture, Topology, Homologous superfamily) uses a mix of manual curation and automated algorithms.
  • Dot Plot: The simplest graphical method for comparing two sequences. A diagonal line represents a region of similarity. Parallel diagonal lines indicate tandem repeats.
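
A dot plot is simple enough to draw as text. The sketch below is a minimal illustration with a hypothetical dot_plot function: it marks a cell with '*' when the residues (or windows of residues) at that pair of positions are identical.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1):
    """Return text rows: '*' where windows of seq_x and seq_y match, '.' otherwise."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            row += "*" if seq_y[i:i + window] == seq_x[j:j + window] else "."
        rows.append(row)
    return rows


# Comparing a sequence containing a tandem repeat (ACG ACG) with itself.
plot = dot_plot("ACGACG", "ACGACG")
print("\n".join(plot))
```

The main diagonal shows the sequence matching itself, and the shorter diagonals running parallel to it are the signature of the repeated ACG unit. Increasing the window size is a common way to suppress random single-residue matches.
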
