Annotating viral genomes
R2DT can automatically scan viral genomes to identify and visualise non-coding RNA elements using the Rfam database. This tutorial demonstrates the complete workflow from a viral genome FASTA file to a diagram showing all RNA structures.
Overview
Many viruses contain conserved RNA secondary structures that play critical roles in their life cycles:
5′ and 3′ UTRs - Untranslated regions with regulatory structures
Frameshift elements - Programmed ribosomal frameshifting signals
Internal ribosome entry sites (IRES) - Cap-independent translation initiation
Packaging signals - RNA structures involved in genome packaging
R2DT uses Infernal’s cmscan tool to search viral genomes against the Rfam covariance model library, then generates secondary structure diagrams for each identified RNA family.
Quick start
r2dt.py viral-annotate genome.fasta output/
This command:
Scans the genome against Rfam using GA (gathering) thresholds
Filters and ranks overlapping hits
Generates 2D diagrams for each RNA family found
Optionally stitches all diagrams into a single combined view
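The tutorial does not spell out how overlapping hits are filtered and ranked. One common approach is a greedy filter that keeps the best-scoring hit in each overlapping group. The sketch below illustrates that idea only; the `Hit` type, the function names, and the weak overlapping hit are hypothetical, and R2DT's actual rules (for example clan competition) may differ.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two closed intervals share at least one nucleotide."""
    return a.start <= b.end and b.start <= a.end


def filter_overlapping(hits: list[Hit]) -> list[Hit]:
    """Greedy filter: visit hits from best to worst score and keep a hit
    only if it does not overlap any hit that was already kept."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    # Report the survivors in genomic order.
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("RF03120", 26, 299, 309.9),
    Hit("hypothetical_weak_hit", 250, 320, 41.0),  # invented for illustration
    Hit("RF00507", 13469, 13546, 77.6),
]
for hit in filter_overlapping(hits):
    print(hit.family, hit.start, hit.end, hit.score)
```

The invented weak hit overlaps the 5′ UTR hit and scores lower, so it is dropped while the two non-overlapping hits are kept.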
Example: SARS-CoV-2 coronavirus
The SARS-CoV-2 genome (~30,000 nt) contains several well-characterised RNA structures that are important for viral replication.
Step 1: Prepare input
This example uses the SARS-CoV-2 genome with accession OX309346.1, which is included in the R2DT repository at examples/viral/coronavirus.fasta.
Step 2: Run viral-annotate
r2dt.py viral-annotate examples/viral/coronavirus.fasta output/coronavirus/
Output:
# R2DT :: visualise RNA secondary structure using templates
# Version 2.2 (2026)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Step 1: Calculating genome size
Genome size: 29,903 nt
Database size (Z): 0.059806 Mb
Step 2: Running cmscan with Rfam GA thresholds
Running: cmscan... (this may take several minutes)
Step 3: Parsing cmscan results
Found 3 RNA family hits:
RF03120 (Sarbecovirus-5UTR): 26-299 + score=309.9
RF00507 (Corona_FSE): 13,469-13,546 + score=77.6
RF03125 (Sarbecovirus-3UTR): 29,536-29,870 + score=406.9
Step 4: Generating 2D diagrams for each hit
✓ RF03120 (Sarbecovirus-5UTR)
✓ RF00507 (Corona_FSE)
✓ RF03125 (Sarbecovirus-3UTR)
Summary
Genome: OX309346.1 (29,903 nt)
RNA families found: 3
Diagrams generated: 3
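The database size reported in the log is consistent with the usual Rfam convention for genome annotation: the total number of nucleotides, counted on both strands, expressed in megabases. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and R2DT's own implementation may differ.

```python
def cmscan_z_megabases(genome_length_nt: int) -> float:
    """Search-space size for cmscan -Z: both strands, in megabases."""
    return 2 * genome_length_nt / 1_000_000


z = cmscan_z_megabases(29_903)
print(f"Database size (Z): {z:.6f} Mb")
```

For the 29,903 nt genome this gives 2 × 29,903 / 1,000,000 = 0.059806, matching the value in the log.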
Step 3: Examine the output
The output/coronavirus/ directory contains:
output/coronavirus/
├── cmscan.tblout          # Tabular cmscan results
├── cmscan.out             # Full cmscan output
└── rfam/                  # Individual RNA diagrams
    ├── RF03120_26-299.colored.svg
    ├── RF00507_13469-13546.colored.svg
    └── RF03125_29536-29870.colored.svg
Each SVG file shows the secondary structure of one RNA element:
RF03120 (Sarbecovirus-5UTR): The 5′ untranslated region containing stem-loops SL1-SL5
5′ UTR of SARS-CoV-2 (nucleotides 26-299)
RF00507 (Corona_FSE): The programmed ribosomal frameshift element between ORF1a and ORF1b
FSE - Frameshift element (nucleotides 13,469-13,546)
RF03125 (Sarbecovirus-3UTR): The 3′ untranslated region containing the s2m element
3′ UTR of SARS-CoV-2 (nucleotides 29,536-29,870)
Step 4: Create a stitched diagram
To combine all diagrams into a single panoramic view:
r2dt.py stitch \
output/coronavirus/rfam/*.colored.svg \
-o coronavirus-stitched.svg \
--sort \
--captions "5′ UTR" --captions "FSE" --captions "3′ UTR"
The --sort flag arranges panels by genomic coordinates (extracted from filenames), and --captions adds labels above each panel.
The resulting stitched diagram shows all three RNA structures in genomic order:
Combined view of SARS-CoV-2 RNA structures: 5′ UTR, FSE (frameshift element), and 3′ UTR.
Two approaches to viral RNA annotation
R2DT supports two complementary approaches for annotating viral genomes:
| Approach | Command | Best for |
|---|---|---|
| FASTA + Rfam | r2dt.py viral-annotate | Automatic discovery: finds known RNA families in any genome |
| Stockholm alignment | r2dt.py stockholm | Expert curation: uses manually annotated structures and regions |
The FASTA approach (shown above with SARS-CoV-2) scans the genome against Rfam and needs no manual input. However, it can only find structures that already have Rfam models.
The Stockholm approach uses a curated multiple sequence alignment where structures have been manually annotated with #=GC structureID and #=GC regionID lines. This can capture structures that are not in Rfam, and groups them by genomic region (e.g. 5′UTR, NS5B).
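In the Stockholm format, `#=GC` lines carry one annotation character per alignment column and may be split across interleaved blocks. The sketch below is a generic reader that collects every `#=GC` feature into a single string. It is not R2DT's parser, and the `structureID` values in the toy alignment are placeholders, because the encoding of these lines is not described in this tutorial.

```python
from collections import defaultdict


def read_gc_annotations(stockholm_text: str) -> dict[str, str]:
    """Collect per-column #=GC annotations, joining interleaved blocks."""
    parts: dict[str, list[str]] = defaultdict(list)
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GC "):
            continue
        fields = line.split(None, 2)  # "#=GC", feature name, annotation
        if len(fields) == 3:
            parts[fields[1]].append(fields[2].strip())
    return {feature: "".join(chunks) for feature, chunks in parts.items()}


# Toy two-block alignment, invented for illustration only.
example = """# STOCKHOLM 1.0
seq1             GGGA
#=GC SS_cons     ((..
#=GC structureID AAAA

seq1             AACC
#=GC SS_cons     ..))
#=GC structureID AAAA
//
"""

for feature, annotation in read_gc_annotations(example).items():
    print(feature, annotation)
```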
Example: HCV using Stockholm alignment
The HCV genome contains over 40 annotated RNA structures across the 5′UTR, coding regions, and 3′UTR. An alignment of 57 HCV sequences with named structures is included at examples/hcv-alignment.stk.
r2dt.py stockholm examples/hcv-alignment.stk output/hcv-stockholm/
Each structure is labelled with its parent genomic region, producing captions like “SLI (5′UTR)” or “5BSL3.1 (NS5B)” in the stitched output:
HCV RNA structures generated from a Stockholm alignment using #=GC structureID and #=GC regionID annotations. Structures are labelled with their parent genomic region.
Compare this with the Rfam-based HCV diagram in the gallery below, which shows 12 structures. The Stockholm approach captures more because it supports manually curated annotations, including families that have not yet been added to Rfam.
Command reference
viral-annotate
r2dt.py viral-annotate <genome.fasta> <output_folder> [OPTIONS]
Arguments:
genome.fasta - Input viral genome in FASTA format (one sequence)
output_folder - Directory for output files
Options:
| Option | Default | Description |
|---|---|---|
| | None | Path for stitched SVG output |
| --cm-library | data/rfam/cms/all.cm | Path to Rfam CM library |
| | None | Path to Rfam.clanin for clan competition |
| --cpu | 4 | Number of CPUs for cmscan |
| --evalue | None | E-value threshold (default: use Rfam GA thresholds) |
| | monochrome | Monochrome (default) or preserve original colors |
| | False | Suppress progress messages |
stitch
See the stitch command reference for all available options.
Understanding the results
cmscan.tblout format
The tabular output from cmscan contains detailed hit information:
#idx target name accession query name ... seq from seq to strand ... score E-value
1 Sarbecovirus-5UTR RF03120 OX309346.1 ... 26 299 + ... 309.9 3.2e-76
2 Corona_FSE RF00507 OX309346.1 ... 13469 13546 + ... 77.6 1.1e-15
3 Sarbecovirus-3UTR RF03125 OX309346.1 ... 29536 29870 + ... 406.9 1.5e-98
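The columns shown above can be pulled out of `cmscan.tblout` with a few lines of Python. This is a sketch, not R2DT's parser: it assumes the whitespace-separated `--fmt 2` layout from Infernal, with `seq from`, `seq to`, `strand`, `score` and `E-value` at zero-based positions 9, 10, 11, 16 and 17. The columns elided with `...` in the listing are filled with made-up placeholder values in the sample line.

```python
from typing import NamedTuple


class TbloutHit(NamedTuple):
    target_name: str
    accession: str
    query_name: str
    seq_from: int
    seq_to: int
    strand: str
    score: float
    evalue: float


def parse_tblout(text: str) -> list[TbloutHit]:
    """Parse cmscan --fmt 2 tabular output (column positions are an assumption)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split()
        hits.append(
            TbloutHit(
                target_name=f[1],
                accession=f[2],
                query_name=f[3],
                seq_from=int(f[9]),
                seq_to=int(f[10]),
                strand=f[11],
                score=float(f[16]),
                evalue=float(f[17]),
            )
        )
    return hits


# Placeholder values stand in for the columns the tutorial elides.
sample = (
    "#idx target name accession query name ...\n"
    "2 Corona_FSE RF00507 OX309346.1 - - cm 1 82 13469 13546 + no 1 "
    "0.45 0.0 77.6 1.1e-15 ! * - - - - - - -\n"
)

for hit in parse_tblout(sample):
    print(hit.accession, hit.seq_from, hit.seq_to, hit.strand, hit.score)
```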
Genomic coordinates in filenames
SVG filenames encode the genomic coordinates:
RF00507_13469-13546.colored.svg
│       │     │
│       │     └── End position (1-based)
│       └── Start position (1-based)
└── Rfam accession
The stitch command uses these coordinates to:
Order panels by genomic position (--sort)
Calculate nucleotide distances between panels
Display gap labels showing the distance in nucleotides
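The filename convention is easy to decode outside R2DT as well. The sketch below parses the accession and coordinates, orders the panels, and computes the number of nucleotides lying strictly between consecutive elements. The regular expression and function names are hypothetical, and the gap definition (next start minus previous end minus one) is an assumption; the stitch command may measure distance differently.

```python
import re
from pathlib import Path

# Matches names such as RF00507_13469-13546.colored.svg
FILENAME_RE = re.compile(r"^(RF\d{5})_(\d+)-(\d+)\.colored\.svg$")


def parse_coords(path: str) -> tuple[str, int, int]:
    """Return (accession, start, end) encoded in an SVG filename."""
    match = FILENAME_RE.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected filename: {path}")
    accession, start, end = match.groups()
    return accession, int(start), int(end)


def order_and_gaps(paths):
    """Sort panels by start position and compute gaps between neighbours."""
    panels = sorted((parse_coords(p) for p in paths), key=lambda p: p[1])
    # Assumed definition: nucleotides strictly between two elements.
    gaps = [nxt[1] - cur[2] - 1 for cur, nxt in zip(panels, panels[1:])]
    return panels, gaps


files = [
    "output/coronavirus/rfam/RF03125_29536-29870.colored.svg",
    "output/coronavirus/rfam/RF03120_26-299.colored.svg",
    "output/coronavirus/rfam/RF00507_13469-13546.colored.svg",
]
panels, gaps = order_and_gaps(files)
print([p[0] for p in panels])
print(gaps)
```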
Advanced usage
Using a custom CM library
To scan against a subset of Rfam families or custom models:
r2dt.py viral-annotate genome.fasta output/ --cm-library my-models.cm
The CM library will be automatically indexed with cmpress if needed.
Customising the stitched output
The stitch command provides many options for customising the combined diagram:
r2dt.py stitch \
output/rfam/*.svg \
-o stitched.svg \
--sort \
--gap 150 \
--glyph break \
--captions "5′ UTR" --captions "FSE" --captions "3′ UTR" \
--color \
--no-outline
By default, stitched diagrams are monochrome (black and white), which is suitable for publications. Use --color to preserve the original nucleotide coloring:
Colored version of the stitched diagram using --color flag.
See Stitching multiple diagrams for the full option reference.
Integrating with other R2DT workflows
The viral-annotate command is a convenience wrapper. For more control, you can run the steps manually:
Run cmscan separately:
cmscan -Z 0.06 --cut_ga --rfam --nohmmonly \
  --tblout hits.tblout --fmt 2 --cpu 4 \
  data/rfam/cms/all.cm genome.fasta > cmscan.out
Extract hit regions and visualise:
# For each hit, extract the sequence and run:
r2dt.py rfam draw RF00507 hit.fasta output/
Stitch the results:
r2dt.py stitch output/results/svg/*.colored.svg -o combined.svg --sort
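The extraction step in the manual workflow is left to the reader. A minimal Python sketch is given below, assuming a single-sequence FASTA and the 1-based inclusive coordinates reported by cmscan (where minus-strand hits have `seq from` greater than `seq to`). The helper names are hypothetical, and the result would be written to a file such as `hit.fasta` before calling `r2dt.py rfam draw`.

```python
# DNA complement table; U is mapped to A so RNA input is tolerated.
COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def read_single_fasta(text: str) -> tuple[str, str]:
    """Return (identifier, sequence) from a FASTA string with one record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input is not FASTA")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one sequence")
    identifier = (lines[0][1:].split() or [""])[0]
    return identifier, "".join(lines[1:])


def extract_hit(sequence: str, seq_from: int, seq_to: int, strand: str) -> str:
    """Cut out a hit using 1-based inclusive coordinates."""
    start, end = sorted((seq_from, seq_to))
    region = sequence[start - 1:end]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]  # reverse complement
    return region


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record with fixed-width sequence lines."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(rows) + "\n"


# Toy genome, invented for illustration.
identifier, genome = read_single_fasta(">toy_genome example\nAAACC\nCGG\n")
plus = extract_hit(genome, 2, 5, "+")
minus = extract_hit(genome, 5, 2, "-")
print(format_fasta(f"{identifier}/2-5", plus), end="")
print(format_fasta(f"{identifier}/5-2", minus), end="")
```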
Supported virus families
R2DT can identify RNA structures in many virus families. The Rfam database contains models for:
Coronaviridae - 5′/3′ UTRs, frameshift elements, s2m
Flaviviridae - 5′/3′ UTRs, pseudoknots
Picornaviridae - IRES elements, cre
Retroviridae - TAR, RRE, packaging signals
And many more…
Search Rfam to find RNA families associated with your virus of interest.
Troubleshooting
No hits found
If cmscan finds no hits:
The genome may not contain characterised RNA families in Rfam
Try relaxing the threshold with --evalue 1 to see marginal hits
cmscan is slow
For large genomes or many sequences:
Increase --cpu to use more processors
Consider searching with a smaller CM library targeting expected families
Missing diagrams
If some hits don’t produce diagrams:
The Rfam family may not have a template in R2DT
Check output/rfam/ for error messages in the draw logs
Some very divergent sequences may fail alignment
Gallery
Examples of RNA structure annotations for different viral genomes, generated with --normalize-font-size for consistent visual appearance.
Note
All gallery images are regenerated from example inputs with just docs-images. See Updating documentation for details.
SARS-CoV-2 coronavirus
Genome: OX309346.1 (29,903 nt) | RNA structures: 3
Hepatitis C virus (HCV)
Genome: NC_038882.1 (9,646 nt) | RNA structures: 12
Dengue virus serotype 2
Genome: NC_001474.2 (10,723 nt) | RNA structures: 4
See also
Stitching multiple diagrams - Full stitch command reference
Rfam cmscan documentation - Rfam genome annotation guide