Annotating viral genomes

R2DT can automatically scan viral genomes to identify and visualise non-coding RNA elements using the Rfam database. This tutorial demonstrates the complete workflow from a viral genome FASTA file to a diagram showing all RNA structures.

Overview

Many viruses contain conserved RNA secondary structures that play critical roles in their life cycles:

  • 5′ and 3′ UTRs - Untranslated regions with regulatory structures

  • Frameshift elements - Programmed ribosomal frameshifting signals

  • Internal ribosome entry sites (IRES) - Cap-independent translation initiation

  • Packaging signals - RNA structures involved in genome packaging

R2DT uses Infernal’s cmscan tool to search viral genomes against the Rfam covariance model library, then generates secondary structure diagrams for each identified RNA family.

Quick start

r2dt.py viral-annotate genome.fasta output/

This command:

  1. Scans the genome against Rfam using GA (gathering) thresholds

  2. Filters and ranks overlapping hits

  3. Generates 2D diagrams for each RNA family found

  4. Optionally stitches all diagrams into a single combined view
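
This tutorial does not spell out how step 2 resolves overlapping hits. As a rough illustration only, a greedy filter that keeps higher-scoring hits and drops anything overlapping them could look like the sketch below. The `Hit` class and `filter_overlaps` function are hypothetical names, and R2DT's actual ranking rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    accession: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """Return True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def filter_overlaps(hits: list[Hit]) -> list[Hit]:
    """Visit hits in descending score order, skipping any hit that
    overlaps one already kept; return survivors in genomic order."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

A greedy pass like this is simple and deterministic, but it ignores clan membership, which is what the --clanin option is intended to handle.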

Example: SARS-CoV-2 coronavirus

The SARS-CoV-2 genome (~30,000 nt) contains several well-characterised RNA structures that are important for viral replication.

Step 1: Prepare input

This example uses the SARS-CoV-2 genome with accession OX309346.1, which is included in the R2DT repository at examples/viral/coronavirus.fasta.

Step 2: Run viral-annotate

r2dt.py viral-annotate examples/viral/coronavirus.fasta output/coronavirus/

Output:

# R2DT :: visualise RNA secondary structure using templates
# Version 2.2 (2026)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Step 1: Calculating genome size
  Genome size: 29,903 nt
  Database size (Z): 0.059806 Mb

Step 2: Running cmscan with Rfam GA thresholds
  Running: cmscan... (this may take several minutes)

Step 3: Parsing cmscan results
  Found 3 RNA family hits:
    RF03120 (Sarbecovirus-5UTR): 26-299 + score=309.9
    RF00507 (Corona_FSE): 13,469-13,546 + score=77.6
    RF03125 (Sarbecovirus-3UTR): 29,536-29,870 + score=406.9

Step 4: Generating 2D diagrams for each hit
  ✓ RF03120 (Sarbecovirus-5UTR)
  ✓ RF00507 (Corona_FSE)
  ✓ RF03125 (Sarbecovirus-3UTR)

Summary
  Genome: OX309346.1 (29,903 nt)
  RNA families found: 3
  Diagrams generated: 3
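
The database size reported in the log (0.059806 Mb for a 29,903 nt genome) is consistent with counting both strands: 2 × 29,903 / 1,000,000. The sketch below reproduces that arithmetic. The formula is inferred from the values in the log rather than taken from R2DT's source, and the function names are hypothetical.

```python
def genome_length(fasta_path: str) -> int:
    """Count residues in a FASTA file that must hold exactly one sequence."""
    length = 0
    headers = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                headers += 1
            else:
                length += len(line)
    if headers != 1:
        raise ValueError(f"expected one sequence, found {headers}")
    return length


def database_size_mb(length_nt: int) -> float:
    """Search space in megabases, assuming both strands are counted."""
    return 2 * length_nt / 1_000_000
```

For example, `database_size_mb(29903)` gives the 0.059806 shown in the log, which rounds to the `-Z 0.06` used when running cmscan by hand.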

Step 3: Examine the output

The output/coronavirus/ directory contains:

output/coronavirus/
├── cmscan.tblout          # Tabular cmscan results
├── cmscan.out             # Full cmscan output
└── rfam/                  # Individual RNA diagrams
    ├── RF03120_26-299.colored.svg
    ├── RF00507_13469-13546.colored.svg
    └── RF03125_29536-29870.colored.svg

Each SVG file shows the secondary structure of one RNA element:

  • RF03120 (Sarbecovirus-5UTR): The 5′ untranslated region containing stem-loops SL1-SL5

5′ UTR of SARS-CoV-2 (nucleotides 26-299)

  • RF00507 (Corona_FSE): The programmed ribosomal frameshift element between ORF1a and ORF1b

FSE - Frameshift element (nucleotides 13,469-13,546)

  • RF03125 (Sarbecovirus-3UTR): The 3′ untranslated region containing the s2m element

3′ UTR of SARS-CoV-2 (nucleotides 29,536-29,870)

Step 4: Create a stitched diagram

To combine all diagrams into a single panoramic view:

r2dt.py stitch \
    output/coronavirus/rfam/*.colored.svg \
    -o coronavirus-stitched.svg \
    --sort \
    --captions "5′ UTR" --captions "FSE" --captions "3′ UTR"

The --sort flag arranges panels by genomic coordinates (extracted from filenames), and --captions adds labels above each panel.

The resulting stitched diagram shows all three RNA structures in genomic order:

Combined view of SARS-CoV-2 RNA structures: 5′ UTR, FSE (frameshift element), and 3′ UTR.

Two approaches to viral RNA annotation

R2DT supports two complementary approaches for annotating viral genomes:

Approach             Command         Best for
FASTA + Rfam         viral-annotate  Automatic discovery: finds known RNA families in any genome
Stockholm alignment  stockholm       Expert curation: uses manually annotated structures and regions

The FASTA approach (shown above with SARS-CoV-2) scans the genome against Rfam and needs no manual input. However, it can only find structures that already have Rfam models.

The Stockholm approach uses a curated multiple sequence alignment where structures have been manually annotated with #=GC structureID and #=GC regionID lines. This can capture structures that are not in Rfam, and groups them by genomic region (e.g. 5′UTR, NS5B).
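
For readers who want to inspect such an alignment before running R2DT, the sketch below collects every per-column `#=GC` line from a Stockholm file. It relies only on the general Stockholm convention (`#=GC <feature> <text>`, possibly repeated across interleaved blocks). How R2DT interprets the structureID and regionID strings is not described here, so the function simply returns them unparsed, and its name is hypothetical.

```python
from collections import defaultdict


def read_gc_annotations(path: str) -> dict[str, str]:
    """Return {feature: annotation}, joining the pieces of interleaved blocks."""
    chunks: dict[str, list[str]] = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            if not line.startswith("#=GC "):
                continue
            fields = line.split(None, 2)  # tag, feature name, annotation text
            if len(fields) == 3:
                chunks[fields[1]].append(fields[2].rstrip())
    return {feature: "".join(pieces) for feature, pieces in chunks.items()}
```

Calling `read_gc_annotations("examples/hcv-alignment.stk")` should return a dictionary whose keys include `structureID` and `regionID`, assuming the file carries the annotations described above.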

Example: HCV using Stockholm alignment

The HCV genome contains over 40 annotated RNA structures across the 5′UTR, coding regions, and 3′UTR. An alignment of 57 HCV sequences with named structures is included at examples/hcv-alignment.stk.

r2dt.py stockholm examples/hcv-alignment.stk output/hcv-stockholm/

Each structure is labelled with its parent genomic region, producing captions like “SLI (5′UTR)” or “5BSL3.1 (NS5B)” in the stitched output:

HCV RNA structures generated from a Stockholm alignment using #=GC structureID and #=GC regionID annotations. Structures are labelled with their parent genomic region.

Compare this with the Rfam-based HCV diagram in the gallery below — the Stockholm approach captures significantly more structures because it supports manually curated annotations, including families that have not yet been added to Rfam.

Command reference

viral-annotate

r2dt.py viral-annotate <genome.fasta> <output_folder> [OPTIONS]

Arguments:

  • genome.fasta - Input viral genome in FASTA format (one sequence)

  • output_folder - Directory for output files

Options:

Option                Default               Description
--stitch-output PATH  None                  Path for stitched SVG output
--cm-library PATH     data/rfam/cms/all.cm  Path to Rfam CM library
--clanin PATH         None                  Path to Rfam.clanin for clan competition
--cpu N               4                     Number of CPUs for cmscan
--evalue, -E          None                  E-value threshold (default: use Rfam GA thresholds)
--monochrome/--color  monochrome            Monochrome (default) or preserve original colors
--quiet               False                 Suppress progress messages

stitch

See the stitch command reference for all available options.

Understanding the results

cmscan.tblout format

The tabular output from cmscan contains detailed hit information:

#idx  target name         accession  query name  ...  seq from  seq to  strand  ...  score  E-value
1     Sarbecovirus-5UTR   RF03120    OX309346.1  ...  26        299     +       ...  309.9  3.2e-76
2     Corona_FSE          RF00507    OX309346.1  ...  13469     13546   +       ...  77.6   1.1e-15
3     Sarbecovirus-3UTR   RF03125    OX309346.1  ...  29536     29870   +       ...  406.9  1.5e-98
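
To work with these hits programmatically, the table can be read with a few lines of Python. The sketch below assumes the column order of Infernal's `--fmt 2` layout (the one whose header begins with `idx`), in which seq from, seq to, strand, score and E-value are the 10th, 11th, 12th, 17th and 18th whitespace-separated fields. `TbloutHit` and `parse_tblout` are hypothetical names, not part of R2DT.

```python
from typing import NamedTuple


class TbloutHit(NamedTuple):
    family: str
    accession: str
    query: str
    seq_from: int
    seq_to: int
    strand: str
    score: float
    evalue: float


def parse_tblout(path: str) -> list[TbloutHit]:
    """Parse a cmscan table, skipping comment and blank lines."""
    hits = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            f = line.split()
            # Only the first 18 fields are read, so spaces in the trailing
            # free-text description column do not matter.
            hits.append(TbloutHit(f[1], f[2], f[3], int(f[9]), int(f[10]),
                                  f[11], float(f[16]), float(f[17])))
    return hits
```

Each returned hit exposes named fields, so a report line such as `f"{h.accession} {h.seq_from}-{h.seq_to}"` can be built without counting columns again.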

Genomic coordinates in filenames

SVG filenames encode the genomic coordinates:

RF00507_13469-13546.colored.svg
   │      │     │
   │      │     └── End position (1-based)
   │      └── Start position (1-based)
   └── Rfam accession

The stitch command uses these coordinates to:

  • Order panels by genomic position (--sort)

  • Calculate nucleotide distances between panels

  • Display gap labels showing the distance in nucleotides
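
The filename convention above can be decoded with a short regular expression. The sketch below parses the coordinates and computes the distance between consecutive panels, assuming the gap is the number of nucleotides strictly between one panel's end and the next panel's start. That convention and the function names are illustrative assumptions, not necessarily how the stitch command computes its labels.

```python
import re
from pathlib import Path

# Rfam accession, then start-end, e.g. RF00507_13469-13546.colored.svg
PATTERN = re.compile(r"^(RF\d{5})_(\d+)-(\d+)\.")


def parse_coords(filename: str) -> tuple[str, int, int]:
    """Return (accession, start, end) decoded from an SVG filename."""
    match = PATTERN.match(Path(filename).name)
    if match is None:
        raise ValueError(f"no coordinates in {filename!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def gaps_between(filenames: list[str]) -> list[int]:
    """Sort panels by start position and return the nucleotide count
    between each consecutive pair."""
    panels = sorted((parse_coords(name) for name in filenames),
                    key=lambda panel: panel[1])
    return [nxt[1] - prev[2] - 1 for prev, nxt in zip(panels, panels[1:])]
```

Under that assumption, the three SARS-CoV-2 files from this tutorial give gaps of 13,169 nt and 15,989 nt.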

Advanced usage

Using a custom CM library

To scan against a subset of Rfam families or custom models:

r2dt.py viral-annotate genome.fasta output/ --cm-library my-models.cm

The CM library will be automatically indexed with cmpress if needed.

Customising the stitched output

The stitch command provides many options for customising the combined diagram:

r2dt.py stitch \
    output/rfam/*.svg \
    -o stitched.svg \
    --sort \
    --gap 150 \
    --glyph break \
    --captions "5′ UTR" --captions "FSE" --captions "3′ UTR" \
    --color \
    --no-outline

By default, stitched diagrams are monochrome (black and white), which is suitable for publications. Use --color to preserve the original nucleotide coloring:

Colored version of the stitched diagram using --color flag.

See Stitching multiple diagrams for the full option reference.

Integrating with other R2DT workflows

The viral-annotate command is a convenience wrapper. For more control, you can run the steps manually:

  1. Run cmscan separately:

    cmscan -Z 0.06 --cut_ga --rfam --nohmmonly \
        --tblout hits.tblout --fmt 2 --cpu 4 \
        data/rfam/cms/all.cm genome.fasta > cmscan.out
    
  2. Extract hit regions and visualise:

    # For each hit, extract the sequence and run:
    r2dt.py rfam draw RF00507 hit.fasta output/
    
  3. Stitch the results:

    r2dt.py stitch output/results/svg/*.colored.svg -o combined.svg --sort
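
The extraction in step 2 is left to the reader. A minimal sketch is shown below, assuming cmscan's 1-based inclusive coordinates, a DNA-alphabet genome file (A, C, G, T), and that minus-strand hits should be reverse-complemented. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_single_fasta(path: str) -> tuple[str, str]:
    """Return (identifier, sequence) from a single-sequence FASTA file."""
    name, chunks = "", []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
            elif line:
                chunks.append(line)
    return name, "".join(chunks)


def extract_hit(sequence: str, seq_from: int, seq_to: int, strand: str) -> str:
    """Slice a hit region; minus-strand hits are reverse-complemented."""
    start, end = sorted((seq_from, seq_to))  # minus-strand hits list from > to
    region = sequence[start - 1:end]         # convert to 0-based, half-open
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region


def write_hit_fasta(path: str, name: str, region: str) -> None:
    """Write the extracted region as a one-record FASTA file."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n{region}\n")
```

For the frameshift element, `extract_hit(genome, 13469, 13546, "+")` yields a 78 nt region that can be written to hit.fasta and passed to the draw command in step 2.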
    

Supported virus families

R2DT can identify RNA structures in many virus families. The Rfam database contains models for:

  • Coronaviridae - 5′/3′ UTRs, frameshift elements, s2m

  • Flaviviridae - 5′/3′ UTRs, pseudoknots

  • Picornaviridae - IRES elements, cre

  • Retroviridae - TAR, RRE, packaging signals

  • And many more…

Search Rfam to find RNA families associated with your virus of interest.

Troubleshooting

No hits found

If cmscan finds no hits:

  • The genome may not contain characterised RNA families in Rfam

  • Try relaxing the threshold with --evalue 1, which replaces the Rfam GA cutoffs with a permissive E-value cutoff, to see marginal hits

cmscan is slow

For large genomes:

  • Increase --cpu to use more processors

  • Consider searching with a smaller CM library targeting expected families

Missing diagrams

If some hits don’t produce diagrams:

  • The Rfam family may not have a template in R2DT

  • Check output/rfam/ for error messages in the draw logs

  • Some very divergent sequences may fail alignment

See also