Annotating viral genomes

R2DT can automatically scan viral genomes to identify and visualise non-coding RNA elements using the Rfam database. This tutorial demonstrates the complete workflow from a viral genome FASTA file to a diagram showing all RNA structures.

Overview

Many viruses contain conserved RNA secondary structures that play critical roles in their life cycles:

5′ and 3′ UTRs - Untranslated regions with regulatory structures
Frameshift elements - Programmed ribosomal frameshifting signals
Internal ribosome entry sites (IRES) - Cap-independent translation initiation
Packaging signals - RNA structures involved in genome packaging

R2DT uses Infernal’s cmscan tool to search viral genomes against the Rfam covariance model library, then generates secondary structure diagrams for each identified RNA family.

Quick start

r2dt.py viral-annotate genome.fasta output/

This command:

Scans the genome against Rfam using GA (gathering) thresholds
Filters and ranks overlapping hits
Generates 2D diagrams for each RNA family found
Optionally stitches all diagrams into a single combined view

Example: SARS-CoV-2 coronavirus

The SARS-CoV-2 genome (~30,000 nt) contains several well-characterised RNA structures that are important for viral replication.

Step 1: Prepare input

This example uses the SARS-CoV-2 genome with accession OX309346.1, which is included in the R2DT repository at examples/viral/coronavirus.fasta.

Step 2: Run viral-annotate

r2dt.py viral-annotate examples/viral/coronavirus.fasta output/coronavirus/

Output:

# R2DT :: visualise RNA secondary structure using templates
# Version 2.2 (2026)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Step 1: Calculating genome size
  Genome size: 29,903 nt
  Database size (Z): 0.059806 Mb

Step 2: Running cmscan with Rfam GA thresholds
  Running: cmscan... (this may take several minutes)

Step 3: Parsing cmscan results
  Found 3 RNA family hits:
    RF03120 (Sarbecovirus-5UTR): 26-299 + score=309.9
    RF00507 (Corona_FSE): 13,469-13,546 + score=77.6
    RF03125 (Sarbecovirus-3UTR): 29,536-29,870 + score=406.9

Step 4: Generating 2D diagrams for each hit
  ✓ RF03120 (Sarbecovirus-5UTR)
  ✓ RF00507 (Corona_FSE)
  ✓ RF03125 (Sarbecovirus-3UTR)

Summary
  Genome: OX309346.1 (29,903 nt)
  RNA families found: 3
  Diagrams generated: 3
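
The "Database size (Z)" value in the log is consistent with counting both strands of the genome and expressing the total in megabases: 2 × 29,903 / 1,000,000 = 0.059806. The sketch below reproduces that arithmetic. The function names are hypothetical and are not part of R2DT; the both-strands formula is an assumption inferred from the numbers shown above.

```python
def read_fasta_length(lines):
    """Count residues in FASTA-formatted lines, skipping headers and blank lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


def cmscan_database_size_mb(genome_length_nt):
    """Search space in megabases, assuming both strands are scanned."""
    return genome_length_nt * 2 / 1_000_000


# Genome length taken from the log above (OX309346.1).
genome_length = 29_903
z = cmscan_database_size_mb(genome_length)
print(f"Genome size: {genome_length:,} nt")
print(f"Database size (Z): {z:.6f} Mb")
```

Fixing Z to the size of the genome being annotated means the reported E-values reflect the search space of that single genome rather than whatever database size cmscan would otherwise assume.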

Step 3: Examine the output

The output/coronavirus/ directory contains:

output/coronavirus/
├── cmscan.tblout          # Tabular cmscan results
├── cmscan.out             # Full cmscan output
└── rfam/                  # Individual RNA diagrams
    ├── RF03120_26-299.colored.svg
    ├── RF00507_13469-13546.colored.svg
    └── RF03125_29536-29870.colored.svg

Each SVG file shows the secondary structure of one RNA element:

RF03120 (Sarbecovirus-5UTR): The 5′ untranslated region containing stem-loops SL1-SL5

Figure: 5′ UTR of SARS-CoV-2 (nucleotides 26-299)

RF00507 (Corona_FSE): The programmed ribosomal frameshift element between ORF1a and ORF1b

Figure: Coronavirus frameshift element (FSE), nucleotides 13,469-13,546

RF03125 (Sarbecovirus-3UTR): The 3′ untranslated region containing the s2m element

Figure: 3′ UTR of SARS-CoV-2 (nucleotides 29,536-29,870)

Step 4: Create a stitched diagram

To combine all diagrams into a single panoramic view:

r2dt.py stitch \
    output/coronavirus/rfam/*.colored.svg \
    -o coronavirus-stitched.svg \
    --sort \
    --captions "5′ UTR" --captions "FSE" --captions "3′ UTR"

The --sort flag arranges panels by genomic coordinates (extracted from filenames), and --captions adds labels above each panel.

The resulting stitched diagram shows all three RNA structures in genomic order:

Combined view of SARS-CoV-2 RNA structures: 5′ UTR, FSE (frameshift element), and 3′ UTR.

Two approaches to viral RNA annotation

R2DT supports two complementary approaches for annotating viral genomes:

Approach	Command	Best for
FASTA + Rfam	`viral-annotate`	Automatic discovery — finds known RNA families in any genome
Stockholm alignment	`stockholm`	Expert curation — uses manually annotated structures and regions

The FASTA approach (shown above with SARS-CoV-2) scans the genome against Rfam and requires no manual annotation. However, it can only find structures that already have Rfam models.

The Stockholm approach uses a curated multiple sequence alignment where structures have been manually annotated with #=GC structureID and #=GC regionID lines. This can capture structures that are not in Rfam, and groups them by genomic region (e.g. 5′UTR, NS5B).

Example: HCV using Stockholm alignment

The HCV genome contains over 40 annotated RNA structures across the 5′UTR, coding regions, and 3′UTR. An alignment of 57 HCV sequences with named structures is included at examples/hcv-alignment.stk.

r2dt.py stockholm examples/hcv-alignment.stk output/hcv-stockholm/

Each structure is labelled with its parent genomic region, producing captions like “SLI (5′UTR)” or “5BSL3.1 (NS5B)” in the stitched output:

Figure: HCV RNA structures generated from a Stockholm alignment using `#=GC structureID` and `#=GC regionID` annotations. Structures are labelled with their parent genomic region.

Compare this with the Rfam-based HCV diagram in the gallery below, which shows 12 structures. The Stockholm approach captures more (over 40 in this alignment) because it supports manually curated annotations, including structures that have not yet been added to Rfam.

Command reference

viral-annotate

r2dt.py viral-annotate <genome.fasta> <output_folder> [OPTIONS]

Arguments:

genome.fasta - Input viral genome in FASTA format (one sequence)
output_folder - Directory for output files

Options:

Option	Default	Description
`--stitch-output PATH`	None	Path for stitched SVG output
`--cm-library PATH`	data/rfam/cms/all.cm	Path to Rfam CM library
`--clanin PATH`	None	Path to Rfam.clanin for clan competition
`--cpu N`	4	Number of CPUs for cmscan
`--evalue`, `-E`	None	E-value threshold (default: use Rfam GA thresholds)
`--monochrome/--color`	monochrome	Monochrome (default) or preserve original colors
`--quiet`	False	Suppress progress messages

stitch

See the stitch command reference for all available options.

Understanding the results

cmscan.tblout format

The tabular output from cmscan contains detailed hit information:

#idx  target name         accession  query name  ...  seq from  seq to  strand  ...  score  E-value
   Sarbecovirus-5UTR   RF03120    OX309346.1  ...  26        299     +       ...  309.9  3.2e-76
   Corona_FSE          RF00507    OX309346.1  ...  13469     13546   +       ...  77.6   1.1e-15
   Sarbecovirus-3UTR   RF03125    OX309346.1  ...  29536     29870   +       ...  406.9  1.5e-98
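
For downstream scripting, the tabular file can be read with a few lines of Python. The sketch below is not R2DT's own parser. It assumes the whitespace-separated column order that Infernal writes with --fmt 2 (idx, target name, accession, query name, accession, clan name, mdl, mdl from, mdl to, seq from, seq to, strand, trunc, pass, gc, bias, score, E-value, ...), and the `Hit` type and `parse_tblout` function are hypothetical names. The sample line reuses the Corona_FSE hit from the table above; the values in the columns elided there ("...") are placeholders, not real output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str
    accession: str
    query: str
    seq_from: int
    seq_to: int
    strand: str
    score: float
    evalue: float


def parse_tblout(lines):
    """Parse cmscan --fmt 2 tabular lines into Hit records (assumed column order)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 18:
            continue  # not a complete hit line
        hits.append(
            Hit(
                family=fields[1],
                accession=fields[2],
                query=fields[3],
                seq_from=int(fields[9]),
                seq_to=int(fields[10]),
                strand=fields[11],
                score=float(fields[16]),
                evalue=float(fields[17]),
            )
        )
    return hits


# Synthetic example line: placeholder values in the columns the table above elides.
sample = [
    "#idx target name accession query name ...",
    "1 Corona_FSE RF00507 OX309346.1 - - cm 1 82 13469 13546 + no 1 0.45 0.0 77.6 1.1e-15 ! * - - - - - -",
]
for hit in parse_tblout(sample):
    print(f"{hit.accession} ({hit.family}): {hit.seq_from:,}-{hit.seq_to:,} {hit.strand} score={hit.score}")
```

Only the first 18 columns are read, so the sketch does not depend on the trailing overlap and description fields.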

Genomic coordinates in filenames

SVG filenames encode the genomic coordinates:

RF00507_13469-13546.colored.svg
   │      │     │
   │      │     └── End position (1-based)
   │      └── Start position (1-based)
   └── Rfam accession

The stitch command uses these coordinates to:

Order panels by genomic position (--sort)
Calculate nucleotide distances between panels
Display gap labels showing the distance in nucleotides
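
The filename convention and the ordering and gap steps above can be illustrated with a short sketch. It is not the stitch implementation: the function names are hypothetical, the regular expression assumes filenames of the form shown above with start ≤ end, and the gap is assumed to be the number of nucleotides strictly between two panels (R2DT may define the distance differently).

```python
import re
from pathlib import PurePath

# <Rfam accession>_<start>-<end>.<anything>, e.g. RF00507_13469-13546.colored.svg
_NAME_PATTERN = re.compile(r"^(RF\d{5})_(\d+)-(\d+)\.")


def parse_svg_name(path):
    """Return (accession, start, end) parsed from an SVG filename."""
    match = _NAME_PATTERN.match(PurePath(path).name)
    if match is None:
        raise ValueError(f"unexpected filename: {path}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def panels_with_gaps(paths):
    """Sort panels by start coordinate and compute nucleotide gaps between neighbours."""
    panels = sorted((parse_svg_name(p) for p in paths), key=lambda panel: panel[1])
    # Assumed definition: nucleotides strictly between one panel's end and the next start.
    gaps = [nxt[1] - prev[2] - 1 for prev, nxt in zip(panels, panels[1:])]
    return panels, gaps


files = [
    "output/coronavirus/rfam/RF03125_29536-29870.colored.svg",
    "output/coronavirus/rfam/RF03120_26-299.colored.svg",
    "output/coronavirus/rfam/RF00507_13469-13546.colored.svg",
]
panels, gaps = panels_with_gaps(files)
for accession, start, end in panels:
    print(accession, start, end)
print("gaps (nt):", gaps)
```

Sorting on the parsed integer start avoids the lexical ordering that a plain filename sort would give.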

Advanced usage

Using a custom CM library

To scan against a subset of Rfam families or custom models:

r2dt.py viral-annotate genome.fasta output/ --cm-library my-models.cm

The CM library will be automatically indexed with cmpress if needed.

Customising the stitched output

The stitch command provides many options for customising the combined diagram:

r2dt.py stitch \
    output/rfam/*.svg \
    -o stitched.svg \
    --sort \
    --gap 150 \
    --glyph break \
    --captions "5′ UTR" --captions "FSE" --captions "3′ UTR" \
    --color \
    --no-outline

By default, stitched diagrams are monochrome (black and white), which is suitable for publications. Use --color to preserve the original nucleotide coloring:

Figure: Colored version of the stitched diagram, generated with the --color flag.

See Stitching multiple diagrams for the full option reference.

Integrating with other R2DT workflows

The viral-annotate command is a convenience wrapper. For more control, you can run the steps manually:

Run cmscan separately:

cmscan -Z 0.06 --cut_ga --rfam --nohmmonly \
    --tblout hits.tblout --fmt 2 --cpu 4 \
    data/rfam/cms/all.cm genome.fasta > cmscan.out

Extract hit regions and visualise:

# For each hit, extract the sequence and run:
r2dt.py rfam draw RF00507 hit.fasta output/

Stitch the results:

r2dt.py stitch output/results/svg/*.colored.svg -o combined.svg --sort
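
The extraction step in the manual workflow above is left to the reader. A minimal sketch of it is given below, under these assumptions: the genome FASTA holds a single sequence, coordinates are 1-based and inclusive as in the cmscan table, and a minus-strand hit is reported with "seq from" greater than "seq to" and should be reverse-complemented. All function names are hypothetical and are not part of R2DT.

```python
# Complement table for DNA/RNA letters; other symbols (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def read_single_fasta(lines):
    """Return (name, sequence) from FASTA lines that contain exactly one record."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                raise ValueError("expected exactly one sequence")
            parts = line[1:].split()
            name = parts[0] if parts else ""
        else:
            chunks.append(line)
    if name is None:
        raise ValueError("no FASTA header found")
    return name, "".join(chunks)


def extract_hit(sequence, seq_from, seq_to):
    """Extract a hit using 1-based inclusive coordinates; reverse-complement if seq_from > seq_to."""
    if seq_from <= seq_to:
        return sequence[seq_from - 1:seq_to]
    return sequence[seq_to - 1:seq_from].translate(_COMPLEMENT)[::-1]


def hit_to_fasta(name, sequence, seq_from, seq_to):
    """Format the extracted region as a one-record FASTA string."""
    return f">{name}/{seq_from}-{seq_to}\n{extract_hit(sequence, seq_from, seq_to)}\n"


# Toy genome used only to demonstrate the coordinate handling.
name, genome = read_single_fasta([">toy genome", "AAACCC", "GGGTTT"])
print(hit_to_fasta(name, genome, 4, 6), end="")
print(hit_to_fasta(name, genome, 6, 4), end="")
```

Writing the string returned by `hit_to_fasta` to hit.fasta gives an input for the `r2dt.py rfam draw` command shown above.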

Supported virus families

R2DT can identify RNA structures in many virus families. The Rfam database contains models for:

Coronaviridae - 5′/3′ UTRs, frameshift elements, s2m
Flaviviridae - 5′/3′ UTRs, pseudoknots
Picornaviridae - IRES elements, cre
Retroviridae - TAR, RRE, packaging signals
And many more…

Search Rfam to find RNA families associated with your virus of interest.

Troubleshooting

No hits found

If cmscan finds no hits:

The genome may not contain characterised RNA families in Rfam
Try relaxing the threshold, for example with --evalue 1, to see marginal hits (this replaces the default Rfam GA thresholds with an E-value cutoff)

cmscan is slow

For large genomes or when annotating many genomes:

Increase --cpu to use more processors
Consider searching with a smaller CM library targeting expected families

Missing diagrams

If some hits don’t produce diagrams:

The Rfam family may not have a template in R2DT
Check output/rfam/ for error messages in the draw logs
Some very divergent sequences may fail alignment

Gallery

Examples of RNA structure annotations for different viral genomes, generated with --normalize-font-size for consistent visual appearance.

Note

All gallery images are regenerated from the example inputs by running `just docs-images`. See Updating documentation for details.

SARS-CoV-2 coronavirus

Genome: OX309346.1 (29,903 nt) | RNA structures: 3

Hepatitis C virus (HCV)

Genome: NC_038882.1 (9,646 nt) | RNA structures: 12

Dengue virus serotype 2

Genome: NC_001474.2 (10,723 nt) | RNA structures: 4
