A SARS-CoV-2 Mutation Pattern Query Tool
The vdb program is designed to query the SARS-CoV-2 mutational landscape. It runs as a command shell in a terminal, and it allows customized searches for mutation patterns over the entire SARS-CoV-2 genome dataset or subsets thereof. These pattern searches can be for spike protein mutations or nucleotide mutations over the whole genome.
The vdb tool uses a natural syntax, permitting quick searches over various subsets of the data. The two main types of objects in vdb are groups of viruses (“clusters”) and groups of mutations (“patterns”). Clusters can be obtained by searching for patterns, and patterns can be obtained by examining clusters. The program does NOT automatically scan for some pre-defined pattern. Instead, the goal of the program is to make it very easy to look around the spike mutational landscape and see what’s there. The vdb program can be thought of as a “viewer” (a device for looking), even though it's entirely text-based.
The default cluster to search is the collection of all sequenced SARS-CoV-2 viruses (“world”). Alternatively, a country or a US state can be specified.
To search for all viruses from the United States, enter "from US" or just "us" as part of the search command. A cluster or pattern can be assigned to a variable using an equal sign.
Clusters can be filtered by date, number of mutations, country, and Pango lineage. For example, to find all viruses collected in the US containing both mutations E484K and D614G, and then to see what mutation patterns this set has, use the following two commands:
VDB> a = us w/ E484K D614G
VDB> patterns a
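The relationship between clusters and patterns can be pictured with a small Python sketch. This is not how vdb is implemented (vdb is written in Swift). The toy virus records and the helper names from_country, with_mutations, and common_patterns are assumptions made for illustration, and the pattern count shown here is only one plausible reading of what the patterns command reports.

```python
from collections import Counter

# Toy data, invented for illustration: each virus is a
# (name, country, set of spike mutations) record.
VIRUSES = [
    ("virus1", "USA", frozenset({"E484K", "D614G", "A701V"})),
    ("virus2", "USA", frozenset({"E484K", "D614G", "A701V"})),
    ("virus3", "USA", frozenset({"D614G"})),
    ("virus4", "Brazil", frozenset({"E484K", "D614G", "N501Y"})),
]

def from_country(cluster, country):
    """Keep only the viruses collected in the given country."""
    return [v for v in cluster if v[1] == country]

def with_mutations(cluster, pattern):
    """Keep only the viruses that carry every mutation in the pattern."""
    wanted = frozenset(pattern)
    return [v for v in cluster if wanted <= v[2]]

def common_patterns(cluster):
    """Count how often each complete mutation set occurs, most frequent first."""
    counts = Counter(v[2] for v in cluster)
    return counts.most_common()

# Rough analogue of "a = us w/ E484K D614G" followed by "patterns a".
a = with_mutations(from_country(VIRUSES, "USA"), ["E484K", "D614G"])
top = common_patterns(a)
```

In this sketch a cluster is a list of viruses and a pattern is a set of mutations, so searching a cluster with a pattern yields a smaller cluster, and examining a cluster yields patterns.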
2. Documentation, reference, and web server
vdb is described in the bioRxiv manuscript Detection and characterization of the SARS-CoV-2 lineage B.1.526 in New York.
A live version of vdb running with GISAID sequences is at vdb.live.
Questions about vdb can be sent to [email protected].
There are two programs:
vdbCreate - this converts multiple sequence alignments (MSA) of SARS-CoV-2 genomes into a file listing spike mutations
vdb - this is the query tool
These programs are written in Swift and are run in a terminal. Swift is available at https://swift.org/download/ or as part of Xcode. The programs can be compiled with Swift version 5.3 and higher. To simplify installation each program is distributed as a single, stand-alone source file. If vdb is run with nucleotide mutation data, then the file "nuclref.wiv04" should be in the working directory.
To compile the programs, first check that the Swift compiler (swiftc) is part of your path. On an Ubuntu system, a command similar to the following (adjusting the path as necessary) is appropriate for a bash shell:
Next, download the vdb repository ("Download ZIP" under the "Code" button on the top level vdb page). Unzip the file. Then to compile the programs, run these commands (these take < 1 minute):
swiftc -O vdbCreate.swift
swiftc -O vdb.swift
4. Data files
On the GISAID EpiCov “Downloads” window, select “MSA full0405 (64MB)” or the latest version in the "Alignment and proteins" section.
Also download the “metadata” file in the "Download packages" section or in the "Genomic epidemiology" section. Uncompress the files and place the FASTA file and the metadata file in the same directory that will be used to run vdb. One can also download selected sequences from GISAID, add the WIV04 reference sequence, and align these with MAFFT. It is possible to load both the large dataset from the main MSA and a local, manually aligned set. The FASTA sequence identifier lines must have the same format as used by GISAID:
Manually added sequences without GISAID-assigned accession numbers should use a provisional number slightly greater than the highest accession number in the current dataset.
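One way to pick such a provisional number is sketched below in Python. It assumes accession IDs of the form EPI_ISL_ followed by digits. The function name provisional_accession and its offset parameter are hypothetical, and whether vdb expects the prefix on a provisional number is not specified here.

```python
def provisional_accession(existing_ids, offset=1):
    """Return an ID that is `offset` above the highest EPI_ISL number seen.

    IDs that do not look like 'EPI_ISL_<digits>' are ignored.
    """
    prefix = "EPI_ISL_"
    numbers = [
        int(acc[len(prefix):])
        for acc in existing_ids
        if acc.startswith(prefix) and acc[len(prefix):].isdigit()
    ]
    if not numbers:
        raise ValueError("no EPI_ISL accession numbers found")
    return f"{prefix}{max(numbers) + offset}"
```

The accession IDs would typically come from the metadata file or from the FASTA identifier lines of the current dataset.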
Other files included in this repository are:
nuclref.wiv04 This is the SARS-CoV-2 genomic sequence reference, which is used when vdb is run in nucleotide mode
ref_wiv04 This is the same reference in fasta format, to be used for manual alignments of GISAID sequences
5. Running the programs
To run vdbCreate to create the mutations list (this takes about 10 minutes for a million sequences):
For the vdb program, you can either tell the program what file(s) to load on the command line, or if you do not give a file on the command line, the program will load the most recently modified file with the name vdb_mmddyy.txt:
./vdb vdb_040521.txt
./vdb
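The default file choice described above can be approximated in a few lines of Python. This is a sketch of the stated behavior, not vdb's own code. The function name latest_vdb_file is hypothetical, and the decision to match only names of the exact form vdb_mmddyy.txt (skipping files such as vdb_040521_nucl.txt) is an assumption.

```python
import re
from pathlib import Path

# Matches names such as vdb_040521.txt. Files with extra suffixes, for
# example vdb_040521_nucl.txt, are deliberately not matched in this sketch.
VDB_FILE = re.compile(r"^vdb_\d{6}\.txt$")

def latest_vdb_file(directory="."):
    """Return the most recently modified vdb_mmddyy.txt, or None if absent."""
    candidates = [
        p for p in Path(directory).iterdir()
        if p.is_file() and VDB_FILE.match(p.name)
    ]
    if not candidates:
        return None
    # Selection is by modification time, not by the date in the file name.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Because selection is by modification time, an older dataset that was touched recently would be picked over a newer one, so naming the file on the command line is the safer choice when several are present.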
The vdb programs can also be used to examine nucleotide mutations. To produce the nucleotide mutation list file, use the -n or -N flag:
./vdbCreate -N msa_0405.fasta
The -n flag excludes ambiguous bases, while the -N flag includes them. The -N flag is necessary to have protein mutations match what is listed in GISAID. The file produced by -N is much larger. This can be useful for checking whether a certain region was not resolved in a particular strain, but it is also slower because of the file size. Probably the best option is to generate the mutation list file with the -N flag and then trim this file using vdb, which keeps a very small subset of the Ns. This prevents mutation calls at codons such as NNC, which could happen if these Ns are dropped. The trim command takes about 30 seconds on a million sequences, and this only needs to be done once since the results can be saved. The suggested workflow is:
./vdbCreate -N msa_0405.fasta
./vdb vdb_040521_nucl.txt
VDB> trim
VDB> save world vdb_040521_trimmed_nucl.txt
VDB> quit
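The reason for keeping some Ns can be seen with a single codon. The Python sketch below illustrates the problem only and does not reproduce the algorithm used by the trim command. The function name call_codon, the two-entry codon table, and the example codons are assumptions chosen for illustration.

```python
# Minimal codon table covering only the codons used in this example.
CODON_TABLE = {"TTG": "L", "TTC": "F"}

def call_codon(ref_codon, observed_codon, drop_ambiguous):
    """Return an amino-acid change such as 'L>F', or None if no call is made.

    With drop_ambiguous=True, every N is replaced by the reference base,
    which is the effect of omitting Ns from the mutation list. Otherwise a
    codon containing any N is treated as unresolved and yields no call.
    """
    if drop_ambiguous:
        observed_codon = "".join(
            r if o == "N" else o for r, o in zip(ref_codon, observed_codon)
        )
    elif "N" in observed_codon:
        return None
    ref_aa = CODON_TABLE[ref_codon]
    obs_aa = CODON_TABLE[observed_codon]
    return None if ref_aa == obs_aa else f"{ref_aa}>{obs_aa}"

# Reference codon TTG (Leu), observed codon NNC.
spurious = call_codon("TTG", "NNC", drop_ambiguous=True)    # Ns dropped
guarded = call_codon("TTG", "NNC", drop_ambiguous=False)    # Ns kept
```

With the Ns dropped, the codon is read as TTC and a Leu to Phe change is reported even though two of the three bases were never resolved. With the Ns kept, no call is made.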
To read the resulting file into vdb and thereby analyze mutations in nucleotide mode:
or if the trimmed file has not been generated:
6. Usage notes
One should be aware that the SARS-CoV-2 genome dataset has some artefacts in the sequences and some errors in the metadata. Obvious examples include viruses with incorrect or partial collection date information. Anomalies in the sequences are less obvious, but there is a way to guard against this problem. Unusual sequences are less likely to be an artefact if they have been deposited by multiple laboratories. A virus name often gives an indication of the organization which deposited the sequence.
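A rough version of this check is sketched below in Python. It assumes virus names shaped like hCoV-19/country/ID/year, with the ID beginning with a tag for the depositing organization and ending in a run of digits. Real names vary, so the result is only a hint. The function names submitter_tag and distinct_submitters are hypothetical.

```python
import re

def submitter_tag(virus_name):
    """Guess a submitter tag from the ID field of a virus name.

    Takes everything in the ID field before the trailing run of digits
    (and an optional separator) as the tag. Returns None when the name
    does not have four '/'-separated fields or no tag remains.
    """
    parts = virus_name.split("/")
    if len(parts) != 4:
        return None
    tag = re.sub(r"[-_]?\d+$", "", parts[2])
    return tag or None

def distinct_submitters(virus_names):
    """Collect the set of guessed submitter tags for a cluster."""
    tags = {submitter_tag(name) for name in virus_names}
    tags.discard(None)
    return tags
```

A cluster whose names yield two or more distinct tags is more likely to reflect a real variant than one in which every sequence comes from a single submitter.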