Processing Stockholm alignments

R2DT can process Stockholm-format multiple sequence alignments that contain named secondary structure regions. This is particularly useful for viral genome alignments where different RNA structures have been manually annotated with names.

Overview

Many viral RNA databases use Stockholm alignments with special #=GC annotation lines to mark named structural elements. R2DT uses the following annotation lines:

  • #=GC SS_cons — Consensus secondary structure in dot-bracket notation

  • #=GC structureID — Individual structure names (e.g. SLI, IRES), separated by |

  • #=GC regionID — Parent genomic region names (e.g. 5’UTR, NS5B), separated by |

R2DT extracts each named region, computes an RF-style consensus sequence using IUPAC ambiguity codes, and generates template-free visualizations for each structure. The outputs can be automatically stitched into a single combined diagram.

Quick start

r2dt.py stockholm examples/hcv-alignment.stk output/hcv/

This command:

  1. Parses the Stockholm alignment

  2. Extracts named structural elements from structureID

  3. Assigns parent regions from regionID

  4. Computes RF consensus sequences with IUPAC codes (R for purine, Y for pyrimidine, etc.)

  5. Generates template-free 2D diagrams for each valid region

  6. Stitches all diagrams into a combined view

Input format

The Stockholm file must contain #=GC SS_cons and at least one of:

  • #=GC structureID — for multi-region alignments (e.g. viral genomes)

  • #=GC knownSS_names — legacy format equivalent

  • Neither — the entire alignment is treated as a single structure (e.g. Rfam seed alignments)

structureID + regionID

The format uses two #=GC annotation lines:

  • structureID labels individual structural elements (stem-loops, junctions, etc.)

  • regionID labels the broader genomic region each structure belongs to

# STOCKHOLM 1.0

seq1/1-200      AUCGAUCG...AUCGAUCG...AUCGAUCG...
seq2/1-200      AUCGAUCG...AUCGAUCG...AUCGAUCG...

#=GC SS_cons         ...(((...)))......(((....)))......(((.....)))...
#=GC structureID     |...SLI.........|......SLII......|.....SLIII...|
#=GC regionID        |.........5'UTR...................|..core_protein.|
//

Both lines use the same pipe-delimited format:

  • | characters mark boundaries between named segments

  • Text between pipes (after stripping dots and whitespace) becomes the name

  • Dots (.) are filler characters — only the text content matters

  • Columns in structureID / regionID correspond 1:1 with alignment and SS_cons columns

In this example, SLI and SLII both fall within the 5’UTR region, while SLIII belongs to core_protein. R2DT uses the midpoint of each structure to determine its parent region.

Format rules

The | characters in the annotation lines mark the boundaries between regions:

Feature

Description

Delimiters

Pipe | characters separate named segments

Filler

Dots . pad segments to the correct column width

Names

Non-dot, non-whitespace text between pipes is the name

Empty segments

Segments with only dots/whitespace are unnamed (skipped)

Column alignment

Each character position corresponds to an alignment column

Simple alignment (e.g. Rfam seed)

When neither structureID nor knownSS_names is present, R2DT treats the entire alignment as a single structure. This works out of the box for Rfam seed alignments and any other Stockholm file with SS_cons.

# STOCKHOLM 1.0
#=GF ID SAM
#=GF AC RF00162

seq1/1-108      CUCUUAUCAAGAG...
seq2/1-108      ACCUUAUUUUGAG...

#=GC SS_cons    (.(((((((,,,,<.<<<.<.--......
#=GC RF         c.ucUuAUcaAGAG.gGG.c.gG......
//

R2DT uses these annotations:

Annotation

Purpose

#=GF ID

Region name (e.g. “SAM”)

#=GF AC

Fallback name if ID is absent (e.g. “RF00162”)

#=GC RF

Reference annotation — uppercase/lowercase positions define match columns; dots mark inserts to remove

#=GC SS_cons

Consensus structure, may use WUSS notation

If #=GC RF is present, match columns are determined by non-dot RF positions. Otherwise, R2DT computes an IUPAC consensus and strips all-gap columns.

Since only one region is produced, stitching is automatically skipped — no need to pass --no-stitch.

WUSS notation

Rfam and Infernal use WUSS (Washington University Secondary Structure) notation in SS_cons. R2DT automatically converts WUSS to standard dot-bracket:

WUSS character

Meaning

Converted to

( )

Base pair (depth 1)

( )

< >

Base pair (depth 2)

( )

[ ]

Base pair (depth 3)

( )

{ }

Base pair (depth 4)

( )

AZ / az

Pseudoknot (letter pair)

Kept as-is

.

Unpaired

.

, - _ : ~

WUSS unpaired variants

.

Example: HCV structural elements

The Hepatitis C virus genome contains numerous conserved RNA secondary structures. An example alignment of 57 HCV sequences with named structures is included at examples/hcv-alignment.stk. This file uses the structureID + regionID format.

Step 1: Run stockholm command

r2dt.py stockholm examples/hcv-alignment.stk output/hcv/

Output:

Processing Stockholm alignment: HCV Nucleotide Structures.stk
  Found 57 sequences
  SS_cons length: 9831
  Found 32 named regions
  Processing region: SLI
    ✓ Generated SVG: SLI_13-27.svg
  Processing region: SLII
    ✓ Generated SVG: SLII_52-125.svg
  ...
  Processing region: X-tail
    Skipping: Unmatched ')' at position 23 (no matching '(')

Summary:
  Processed: 28 regions
  Skipped: 4 regions

Stitching SVG outputs...
✓ Stitched SVG written to: output/hcv/stitched.svg

Step 2: View the stitched output

The stitched SVG shows all valid structural elements in alignment order, with each panel labelled by its parent genomic region (from regionID) and structure name:

HCV RNA secondary structures generated from a Stockholm alignment

All named RNA structures from an HCV alignment, automatically stitched into a single diagram. Each panel shows the consensus secondary structure for one named region, with IUPAC ambiguity codes reflecting conservation across 57 sequences.

Step 3: Use the thumbnail view

The stockholm command also produces a thumbnail version of the stitched diagram. The thumbnail strips away text, numbering, base-pair lines, and pseudoknots, leaving only the backbone outline of each structure — a compact silhouette that is ideal for embedding in web pages, gallery views, or overview figures.

HCV RNA secondary structures — thumbnail view

Thumbnail view of the same HCV structures. Each panel is reduced to its backbone outline, making it easy to compare the shapes and relative sizes of all structural elements at a glance.

The thumbnail is written to stitched-thumbnail.svg alongside the full diagram. It is generated automatically — no extra flags are needed.

What the thumbnail removes

Element

Reason

Nucleotide letters

Clutter at small sizes

Numbering lines and tick marks

Not needed for shape overview

Base-pair lines

Simplify the silhouette

Pseudoknot arcs

Remove visual noise

Circle markers

Keep only the backbone path

The connecting outline between panels and the break glyphs are preserved so the overall genome layout remains clear.

Command options

r2dt.py stockholm [OPTIONS] STOCKHOLM_INPUT OUTPUT_FOLDER

Option

Description

--stitch/--no-stitch

Enable/disable automatic stitching (default: enabled)

--stitch-output PATH

Custom path for stitched SVG

--monochrome/--color

Monochrome output (default) or preserve colors

--color-by MODE

Auto-colour panels: none (default), structure, or region

--color-config PATH

TSV file mapping structure/region names to SVG colours

--quiet

Suppress progress output

Output files

The stockholm command produces:

output/
├── processing_summary.txt    # Summary of processed/skipped regions
├── stitched.svg              # Combined diagram (if --stitch)
├── stitched-outline.svg      # Outline-only version (no fill colors)
├── stitched-thumbnail.svg    # Backbone silhouette for galleries/embeds
├── results/
│   ├── svg/                  # Individual SVG diagrams
│   │   ├── SLI_13-27.svg
│   │   ├── SLII_52-125.svg
│   │   └── ...
│   └── fasta/                # Consensus sequences with structures
│       ├── SLI.fasta
│       ├── SLII.fasta
│       └── ...
└── regions/                  # Full working files per region
    ├── SLI/
    │   ├── SLI.fasta
    │   └── r2r/              # R2R intermediate files
    └── ...

RF consensus computation

R2DT computes consensus sequences using IUPAC ambiguity codes, similar to Infernal’s RF line:

Code

Nucleotides

Description

A, C, G, U

Single

Conserved nucleotide (>50% frequency)

R

A, G

Purine

Y

C, U

Pyrimidine

S

C, G

Strong

W

A, U

Weak

K

G, U

Keto

M

A, C

Amino

N

A, C, G, U

Any nucleotide

-

Gap

Majority gaps

Lowercase letters indicate positions where the consensus nucleotide is present in 50-80% of sequences.

Validation and skipping

Regions are skipped if they have:

  • Unbalanced brackets — Mismatched ( and ) pairs

  • No base pairs — Structure is all dots

  • Length mismatch — Sequence and structure have different lengths

Coloring panels

By default, stitched diagrams use monochrome styling. R2DT can colour each panel in the stitched output according to its structure name, genomic region, or a custom colour palette. The colour is applied to nucleotide letters, backbone lines, and the connecting outline between panels.

Three coloring modes are available:

Auto-colour by structure name

Assign a deterministic colour to each panel based on its structureID name. Every unique structure gets a different colour:

r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ --color-by structure

Auto-colour by genomic region

All structures that share a regionID get the same colour. This groups related structures visually (for example, all stem-loops within the 5′ UTR share one colour):

r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ --color-by region

Custom colour palette from a TSV file

For full control, provide a tab-separated file that maps structure or region names to specific SVG colours:

r2dt.py stockholm examples/hcv-alignment.stk output/hcv-colored/ \
    --color-config examples/color-config.tsv

The colour config file has two tab-separated columns — name and colour — with no header row:

# Example colour configuration for HCV genome regions.
# Lines starting with # are comments.
5'UTR	steelblue
core_protein	#e07a5f
E1_protein	#81b29a
E2_protein	#f2cc8f
NS3_protease/helicase	#457b9d
NS5B_RNA-dependent_RNA_polymerase	#2a9d8f
3'UTR	#e76f51
*	gray

Colour config rules:

Feature

Description

Columns

Two tab-separated columns: name and SVG colour

Comments

Lines starting with # are ignored

Name matching

Each panel is matched first by its structureID, then by its regionID

Default colour

The special key * sets the fallback colour for unmatched panels

Colour values

Any SVG-valid colour: named (steelblue), hex (#e07a5f), rgb(), etc.

An example configuration file is provided at examples/color-config.tsv.

What gets coloured

When panel colours are active, the following elements are styled:

  • Nucleotide letters — text colour matches the panel accent

  • Backbone lines — the “gray” backbone strokes use the panel colour

  • Connecting outline — the path tracing through nucleotide positions uses per-panel colours instead of a single flat gray

  • Thumbnail — the backbone silhouette in stitched-thumbnail.svg also reflects panel colours

Note

--color-by and --color-config both override --monochrome. You do not need to pass --color explicitly.