Processing Stockholm alignments
R2DT can process Stockholm-format multiple sequence alignments that contain named secondary structure regions. This is particularly useful for viral genome alignments where different RNA structures have been manually annotated with names.
Overview
Many viral RNA databases use Stockholm alignments with special #=GC annotation lines to mark named structural elements. R2DT uses the following annotation lines:
#=GC SS_cons: Consensus secondary structure in dot-bracket notation
#=GC structureID: Individual structure names (e.g. SLI, IRES), separated by |
#=GC regionID: Parent genomic region names (e.g. 5’UTR, NS5B), separated by |
R2DT extracts each named region, computes an RF-style consensus sequence using IUPAC ambiguity codes, and generates template-free visualizations for each structure. The outputs can be automatically stitched into a single combined diagram.
Quick start
r2dt.py stockholm examples/hcv-alignment.stk output/hcv/
This command:
Parses the Stockholm alignment
Extracts named structural elements from structureID
Assigns parent regions from regionID
Computes RF consensus sequences with IUPAC codes (R for purine, Y for pyrimidine, etc.)
Generates template-free 2D diagrams for each valid region
Stitches all diagrams into a combined view
Input format
The Stockholm file must contain #=GC SS_cons. How regions are named depends on which optional annotation is present:
#=GC structureID: for multi-region alignments (e.g. viral genomes)
#=GC knownSS_names: legacy equivalent of structureID
Neither: the entire alignment is treated as a single structure (e.g. Rfam seed alignments)
structureID + regionID
The format uses two #=GC annotation lines:
structureID labels individual structural elements (stem-loops, junctions, etc.)
regionID labels the broader genomic region each structure belongs to
# STOCKHOLM 1.0
seq1/1-200 AUCGAUCG...AUCGAUCG...AUCGAUCG...
seq2/1-200 AUCGAUCG...AUCGAUCG...AUCGAUCG...
#=GC SS_cons ...(((...)))......(((....)))......(((.....)))...
#=GC structureID |...SLI.........|......SLII......|.....SLIII...|
#=GC regionID |.........5'UTR...................|..core_protein.|
//
Both lines use the same pipe-delimited format:
| characters mark boundaries between named segments
Text between pipes (after stripping dots and whitespace) becomes the name
Dots (.) are filler characters; only the text content matters
Columns in structureID/regionID correspond 1:1 with alignment and SS_cons columns
In this example, SLI and SLII both fall within the 5’UTR region, while SLIII belongs to core_protein. R2DT uses the midpoint of each structure to determine its parent region.
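The midpoint rule can be sketched in a few lines of Python. This is an illustration, not R2DT's own code: the function names `parse_gc_line` and `parent_region`, the 0-based inclusive column convention, and the toy annotation strings are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# (name, first column, last column), 0-based and inclusive (assumed convention)
Segment = Tuple[str, int, int]


def parse_gc_line(annotation: str) -> List[Segment]:
    """Split a pipe-delimited annotation string into named segments."""
    segments = []
    start = 0
    for chunk in annotation.split("|"):
        end = start + len(chunk) - 1
        # Strip dots and whitespace; whatever remains is the name.
        name = "".join(chunk.replace(".", "").split())
        if name:  # segments holding only filler are unnamed and skipped
            segments.append((name, start, end))
        start += len(chunk) + 1  # step over the chunk and the pipe after it
    return segments


def parent_region(structure: Segment, regions: List[Segment]) -> Optional[str]:
    """Return the region whose span contains the structure's midpoint."""
    _, start, end = structure
    midpoint = (start + end) // 2
    for name, region_start, region_end in regions:
        if region_start <= midpoint <= region_end:
            return name
    return None


# Toy annotation strings (hypothetical, same width as each other).
structure_line = "|..SLI..|..SLII..|..SLIII..|"
region_line = "|.....5'UTR......|..core...|"

structures = parse_gc_line(structure_line)
regions = parse_gc_line(region_line)
parents = {s[0]: parent_region(s, regions) for s in structures}
```

Using the midpoint instead of the start or end column means a structure that slightly overlaps a region boundary is still assigned to the region that holds most of it.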
Format rules
The | characters in the annotation lines mark the boundaries between regions:
| Feature | Description |
|---|---|
| Delimiters | Pipe characters mark segment boundaries |
| Filler | Dots are filler and carry no meaning |
| Names | Non-dot, non-whitespace text between pipes is the name |
| Empty segments | Segments with only dots/whitespace are unnamed (skipped) |
| Column alignment | Each character position corresponds to an alignment column |
Simple alignment (e.g. Rfam seed)
When neither structureID nor knownSS_names is present, R2DT treats the entire alignment as a single structure. This works out of the box for Rfam seed alignments and any other Stockholm file with SS_cons.
# STOCKHOLM 1.0
#=GF ID SAM
#=GF AC RF00162
seq1/1-108 CUCUUAUCAAGAG...
seq2/1-108 ACCUUAUUUUGAG...
#=GC SS_cons (.(((((((,,,,<.<<<.<.--......
#=GC RF c.ucUuAUcaAGAG.gGG.c.gG......
//
R2DT uses these annotations:
| Annotation | Purpose |
|---|---|
| #=GF ID | Region name (e.g. “SAM”) |
| #=GF AC | Fallback name if ID is absent (e.g. “RF00162”) |
| #=GC RF | Reference annotation: uppercase/lowercase positions define match columns; dots mark inserts to remove |
| #=GC SS_cons | Consensus structure, may use WUSS notation |
If #=GC RF is present, match columns are determined by non-dot RF positions. Otherwise, R2DT computes an IUPAC consensus and strips all-gap columns.
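As a rough illustration of the RF branch, the sketch below selects match columns from an RF line and applies the same column filter to another row, such as SS_cons. The helper names `match_columns` and `keep_columns` are hypothetical, and the fallback path (IUPAC consensus plus removal of all-gap columns) is not shown.

```python
from typing import List


def match_columns(rf: str) -> List[int]:
    """Indices of RF positions that are not dots (match columns)."""
    return [i for i, ch in enumerate(rf) if ch != "."]


def keep_columns(row: str, columns: List[int]) -> str:
    """Project any alignment row (sequence, SS_cons, RF) onto the kept columns."""
    return "".join(row[i] for i in columns)


# Toy example: columns 1 and 6 are inserts and get removed.
rf_line = "c.ucUu.A"
ss_line = "(.<<>>.)"
columns = match_columns(rf_line)
trimmed_ss = keep_columns(ss_line, columns)
```

Applying one shared column list to every row keeps the sequence and the structure the same length after insert columns are dropped.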
Since only one region is produced, stitching is automatically skipped — no need to pass --no-stitch.
WUSS notation
Rfam and Infernal use WUSS (Washington University Secondary Structure) notation in SS_cons. R2DT automatically converts WUSS to standard dot-bracket:
| WUSS character | Meaning | Converted to |
|---|---|---|
| < > | Base pair (depth 1) | ( ) |
| ( ) | Base pair (depth 2) | ( ) |
| [ ] | Base pair (depth 3) | ( ) |
| { } | Base pair (depth 4) | ( ) |
| A a, B b, ... | Pseudoknot (letter pair) | Kept as-is |
| . | Unpaired | . |
| , _ - : ~ | WUSS unpaired variants | . |
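A conversion of this kind can be written as a single character translation. The sketch below is an assumption about how such a step might look, not R2DT's implementation: the function name `wuss_to_dot_bracket` is hypothetical, and the character sets follow the WUSS description in the Infernal documentation (four bracket types for pairs, six symbols for unpaired positions).

```python
# Every WUSS bracket type becomes ( or ), every unpaired symbol becomes a dot.
# Letters (pseudoknot annotation) are not in the table, so they pass through.
_WUSS_TABLE = str.maketrans(
    "<([{" + ">)]}" + ".,:_-~",
    "((((" + "))))" + "......",
)


def wuss_to_dot_bracket(wuss: str) -> str:
    """Translate a WUSS string to plain dot-bracket, keeping pseudoknot letters."""
    return wuss.translate(_WUSS_TABLE)


converted = wuss_to_dot_bracket("<<<___>>>,,AA..aa")
```

Because the translation is per character, the output always has the same length as the input, so column positions stay aligned with the sequence.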
Example: HCV structural elements
The Hepatitis C virus genome contains numerous conserved RNA secondary structures. An example alignment of 57 HCV sequences with named structures is included at examples/hcv-alignment.stk. This file uses the structureID + regionID format.
Step 1: Run stockholm command
r2dt.py stockholm examples/hcv-alignment.stk output/hcv/
Output:
Processing Stockholm alignment: HCV Nucleotide Structures.stk
Found 57 sequences
SS_cons length: 9831
Found 32 named regions
Processing region: SLI
✓ Generated SVG: SLI_13-27.svg
Processing region: SLII
✓ Generated SVG: SLII_52-125.svg
...
Processing region: X-tail
Skipping: Unmatched ')' at position 23 (no matching '(')
Summary:
Processed: 28 regions
Skipped: 4 regions
Stitching SVG outputs...
✓ Stitched SVG written to: output/hcv/stitched.svg
Step 2: View the stitched output
The stitched SVG shows all valid structural elements in alignment order, with each panel labelled by its parent genomic region (from regionID) and structure name:
All named RNA structures from an HCV alignment, automatically stitched into a single diagram. Each panel shows the consensus secondary structure for one named region, with IUPAC ambiguity codes reflecting conservation across 57 sequences.
Step 3: Use the thumbnail view
The stockholm command also produces a thumbnail version of the stitched diagram. The thumbnail strips away text, numbering, base-pair lines, and pseudoknots, leaving only the backbone outline of each structure — a compact silhouette that is ideal for embedding in web pages, gallery views, or overview figures.
Thumbnail view of the same HCV structures. Each panel is reduced to its backbone outline, making it easy to compare the shapes and relative sizes of all structural elements at a glance.
The thumbnail is written to stitched-thumbnail.svg alongside the full diagram. It is generated automatically — no extra flags are needed.
What the thumbnail removes
| Element | Reason |
|---|---|
| Nucleotide letters | Clutter at small sizes |
| Numbering lines and tick marks | Not needed for shape overview |
| Base-pair lines | Simplify the silhouette |
| Pseudoknot arcs | Remove visual noise |
| Circle markers | Keep only the backbone path |
The connecting outline between panels and the break glyphs are preserved so the overall genome layout remains clear.
Command options
r2dt.py stockholm [OPTIONS] STOCKHOLM_INPUT OUTPUT_FOLDER
| Option | Description |
|---|---|
| --stitch / --no-stitch | Enable/disable automatic stitching (default: enabled) |
|  | Custom path for stitched SVG |
| --monochrome / --color | Monochrome output (default) or preserve colors |
| --color-by | Auto-colour panels: structure or region |
| --color-config | TSV file mapping structure/region names to SVG colours |
|  | Suppress progress output |
Output files
The stockholm command produces:
output/
├── processing_summary.txt    # Summary of processed/skipped regions
├── stitched.svg              # Combined diagram (if --stitch)
├── stitched-outline.svg      # Outline-only version (no fill colors)
├── stitched-thumbnail.svg    # Backbone silhouette for galleries/embeds
├── results/
│   ├── svg/                  # Individual SVG diagrams
│   │   ├── SLI_13-27.svg
│   │   ├── SLII_52-125.svg
│   │   └── ...
│   └── fasta/                # Consensus sequences with structures
│       ├── SLI.fasta
│       ├── SLII.fasta
│       └── ...
└── regions/                  # Full working files per region
    ├── SLI/
    │   ├── SLI.fasta
    │   └── r2r/              # R2R intermediate files
    └── ...
RF consensus computation
R2DT computes consensus sequences using IUPAC ambiguity codes, similar to Infernal’s RF line:
| Code | Nucleotides | Description |
|---|---|---|
| A, C, G, U | Single | Conserved nucleotide (>50% frequency) |
| R | A, G | Purine |
| Y | C, U | Pyrimidine |
| S | C, G | Strong |
| W | A, U | Weak |
| K | G, U | Keto |
| M | A, C | Amino |
| N | A, C, G, U | Any nucleotide |
| - | Gap | Majority gaps |
Lowercase letters indicate positions where the consensus nucleotide is present in 50-80% of sequences.
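The sketch below shows one way a per-column consensus along these lines could be computed. Only the thresholds stated above (more than 50% for a single nucleotide, lowercase up to 80%, gap on majority gaps) come from this page. Everything else is an assumption for illustration: the function names, computing nucleotide frequencies over non-gap residues, treating exactly 80% as lowercase, and the rule that a two-letter code is used when the two most frequent nucleotides together exceed 50%.

```python
from collections import Counter
from typing import List

# Two-nucleotide IUPAC codes from the table above.
TWO_LETTER = {
    frozenset("AG"): "R", frozenset("CU"): "Y", frozenset("CG"): "S",
    frozenset("AU"): "W", frozenset("GU"): "K", frozenset("AC"): "M",
}
GAP_CHARS = set("-.")


def consensus_symbol(column: str) -> str:
    """Consensus symbol for one alignment column (assumed rules, see lead-in)."""
    residues = [c.upper().replace("T", "U") for c in column]
    gaps = sum(1 for c in residues if c in GAP_CHARS)
    if gaps * 2 > len(residues):  # majority gaps
        return "-"
    counts = Counter(c for c in residues if c in "ACGU")
    total = sum(counts.values())
    if total == 0:
        return "N"
    ranked = counts.most_common()
    top, top_count = ranked[0]
    top_freq = top_count / total
    if top_freq > 0.5:
        # Uppercase only when the nucleotide is present in more than 80%.
        return top if top_freq > 0.8 else top.lower()
    if len(ranked) >= 2:
        second, second_count = ranked[1]
        if (top_count + second_count) / total > 0.5:
            return TWO_LETTER[frozenset((top, second))]
    return "N"


def consensus(rows: List[str]) -> str:
    """Column-wise consensus for equally long aligned sequences."""
    return "".join(consensus_symbol("".join(col)) for col in zip(*rows))


example = consensus(["AAGC-", "AGGU-", "AAGCA"])
```

In the toy alignment, the fully conserved columns come out uppercase, the two-out-of-three columns come out lowercase, and the last column becomes a gap because two of three sequences have no residue there.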
Validation and skipping
Regions are skipped if they have:
Unbalanced brackets: mismatched ( and ) pairs
No base pairs: structure is all dots
Length mismatch: sequence and structure have different lengths
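These three checks can be expressed as a small validator. The version below is a sketch under assumptions: the function name `skip_reason` is hypothetical, positions are reported 1-based, and only the wording of the unmatched `)` message is modelled on the log output shown earlier. The other messages are invented for the example.

```python
from typing import Optional


def skip_reason(sequence: str, structure: str) -> Optional[str]:
    """Return why a region would be skipped, or None if it looks valid."""
    if len(sequence) != len(structure):
        return (
            f"Length mismatch: sequence {len(sequence)}, "
            f"structure {len(structure)}"
        )
    open_positions = []  # stack of 1-based positions of unmatched '('
    pairs = 0
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if not open_positions:
                return f"Unmatched ')' at position {pos} (no matching '(')"
            open_positions.pop()
            pairs += 1
    if open_positions:
        return (
            f"Unmatched '(' at position {open_positions[-1]} "
            "(no matching ')')"
        )
    if pairs == 0:
        return "No base pairs"
    return None


ok = skip_reason("GGGAAACCC", "(((...)))")
unpaired = skip_reason("AAAA", "....")
unbalanced = skip_reason("GGAC", "(.))")
```

A stack of open-bracket positions is enough here because the structure has already been reduced to plain dot-bracket; pseudoknot letters are ignored by this check.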
Coloring panels
By default, stitched diagrams use monochrome styling. R2DT can colour each panel in the stitched output according to its structure name, genomic region, or a custom colour palette. The colour is applied to nucleotide letters, backbone lines, and the connecting outline between panels.
Three coloring modes are available:
Auto-colour by structure name
Assign a deterministic colour to each panel based on its structureID name. Every unique structure gets a different colour:
r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ --color-by structure
Auto-colour by genomic region
All structures that share a regionID get the same colour. This groups related structures visually (for example, all stem-loops within the 5′ UTR share one colour):
r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ --color-by region
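How the automatic colours are chosen is not described here. As one possible approach, the sketch below spreads hues evenly over the unique names in order of first appearance, so the same list of names always yields the same palette and no two names share a colour. The function name `assign_colors` and the lightness and saturation values are assumptions, not R2DT's actual palette.

```python
import colorsys
from typing import Dict, List


def assign_colors(names: List[str]) -> Dict[str, str]:
    """Map each unique name to a hex colour with evenly spaced hues."""
    unique = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
    palette = {}
    for index, name in enumerate(unique):
        hue = index / len(unique)
        # Fixed lightness and saturation (arbitrary choices for this sketch).
        red, green, blue = colorsys.hls_to_rgb(hue, 0.45, 0.6)
        palette[name] = "#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)
        )
    return palette


# Passing structure names colours every structure differently; passing the
# region name of each panel instead makes panels in one region share a colour.
by_structure = assign_colors(["SLI", "SLII", "SLIII"])
by_region = assign_colors(["5'UTR", "5'UTR", "core_protein"])
```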
Custom colour palette from a TSV file
For full control, provide a tab-separated file that maps structure or region names to specific SVG colours:
r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ \
--color-config examples/color-config.tsv
The colour config file has two tab-separated columns — name and colour — with no header row:
# Example colour configuration for HCV genome regions.
# Lines starting with # are comments.
5'UTR steelblue
core_protein #e07a5f
E1_protein #81b29a
E2_protein #f2cc8f
NS3_protease/helicase #457b9d
NS5B_RNA-dependent_RNA_polymerase #2a9d8f
3'UTR #e76f51
* gray
Colour config rules:
| Feature | Description |
|---|---|
| Columns | Two tab-separated columns: name and SVG colour |
| Comments | Lines starting with # are ignored |
| Name matching | Each panel is matched by its structureID or regionID name |
| Default colour | The special key * applies to panels with no matching name |
| Colour values | Any SVG-valid colour: named (steelblue, gray) or hex (#e07a5f) |
An example configuration file is provided at examples/color-config.tsv.
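A file in this format is simple to read by hand. The sketch below is illustrative only: `load_color_config` and `panel_color` are hypothetical names, and the lookup order (structure name, then region name, then `*`) is an assumption about precedence.

```python
import io
from typing import Dict, Optional, TextIO


def load_color_config(handle: TextIO) -> Dict[str, str]:
    """Read name<TAB>colour lines, skipping blank lines and # comments."""
    colors = {}
    for line in handle:
        # Only a leading # marks a comment; hex colours sit in column two.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, _, color = line.rstrip("\n").partition("\t")
        if color.strip():
            colors[name.strip()] = color.strip()
    return colors


def panel_color(
    colors: Dict[str, str], structure: str, region: Optional[str]
) -> Optional[str]:
    """Look up a panel colour: structure, then region, then the * default."""
    for key in (structure, region, "*"):
        if key in colors:
            return colors[key]
    return None


# In-memory stand-in for a config file (hypothetical content).
sample = io.StringIO(
    "# comment line\n"
    "5'UTR\tsteelblue\n"
    "core_protein\t#e07a5f\n"
    "*\tgray\n"
)
config = load_color_config(sample)
```

With this configuration, a panel in the 5'UTR region resolves to steelblue, and a panel whose structure and region are both unlisted falls back to gray.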
What gets coloured
When panel colours are active, the following elements are styled:
Nucleotide letters — text colour matches the panel accent
Backbone lines — the “gray” backbone strokes use the panel colour
Connecting outline — the path tracing through nucleotide positions uses per-panel colours instead of a single flat gray
Thumbnail: the backbone silhouette in stitched-thumbnail.svg also reflects panel colours
Note
--color-by and --color-config both override --monochrome. You do not need to pass --color explicitly.