I-Introduction

I.I-Purpose of the App

We built the Wormsynteny App to visualize the synteny or genomic conservation among eleven nematode Caenorhabditis species. Our previous study identified orthologous genes in the eleven species and assigned them into orthogroups based on protein sequence similarities (Ma, Lau, and Zheng, Cell Genomics, 2024). Those analyses, however, did not consider synteny. Thus, one of the purposes of this App is to integrate the orthology and synteny information by mapping the genomic position of the orthologs across species to assess the conservation of the genomic blocks they are located in. In the alignment plot, we provide the orthogroup information for protein sequence conservation and the synteny information by the links among aligned regions. Genomic alignment is done using Cactus, a reference-free whole-genome alignment program, and the output is visualized in an alignment plot and several downloadable tables.

The Wormsynteny App is built in a C. elegans-centric manner. It only takes C. elegans genes or genomic regions and aligns them with the other ten Caenorhabditis species to find syntenic regions. The App then annotates genes within the region and maps them to orthogroups. Although not possible in its current form, future updates may allow the inquiry using sequences from the other ten species. We also plan to include more nematode species in the App. Given that Wormbase does not have a function to visualize synteny across multiple species, Wormsynteny App may be useful for evolutionary studies in the worm community and in genome biology in general.

I.II-Input data (Progressive Cactus output)

The following 11 genome assemblies and a guide tree (according to Ma et al., Cell Genomics, 2024; PMID: 38190105) are used as the input for Progressive Cactus (v2.2.0), which was developed by Armstrong et al., Nature 2020 (PMID: 33177663).

The software was run following the step-by-step protocol provided at https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md

Caenorhabditis becei (PRJEB28243)

Caenorhabditis bovis (PRJEB34497)

Caenorhabditis briggsae (PRJNA10731)

Caenorhabditis elegans (PRJNA13758)

Caenorhabditis inopinata (PRJDB5687)

Caenorhabditis latens (PRJNA248912)

Caenorhabditis nigoni (PRJNA384657)

Caenorhabditis panamensis (PRJEB28259)

Caenorhabditis remanei (PRJNA57750)

Caenorhabditis tribulationis (PRJEB12608)

Caenorhabditis tropicalis (PRJNA53597)

The output file of Progressive Cactus is a HAL file, which is a graph-based representation of multi-genome alignment. The HAL file can be queried using the HAL toolkit (v2.2) (Hickey et al., Bioinformatics, 2013; PMID: 23505295) given a set of genomic coordinates. The WormSynteny App takes C. elegans genomic coordinates, searches the HAL file, and exports pairwise alignment information into PSL files using the halLiftover function in the HAL toolkit. Ten pairwise alignment PSL files are generated for the alignment between C. elegans and the other ten Caenorhabditis species; these PSL files are then fed into the pipeline to be assembled into long homologous fragments. The links between sister species are created by taking the coordinates of the fragments aligned to C. elegans query sequences as query sequences to search the HAL file. Ten additional pairwise alignment PSL files are thus generated storing the alignment between sister species.
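As a rough illustration of this query step, the sketch below formats a C. elegans region as a BED line and builds halLiftover commands that write PSL output. The HAL file name, the genome labels, the output file names, and the helper functions are hypothetical; the labels must match those stored in the actual HAL file, and the App's own implementation may differ.

```python
from typing import List


def bed_line(chrom: str, start: int, end: int) -> str:
    """Format a region as a BED line (BED uses 0-based, end-exclusive coordinates)."""
    return f"{chrom}\t{start}\t{end}"


def hal_liftover_cmd(hal: str, src: str, src_bed: str, tgt: str, out_psl: str) -> List[str]:
    """Build the argument list for a halLiftover call that writes PSL instead of BED."""
    return ["halLiftover", "--outPSL", hal, src, src_bed, tgt, out_psl]


# Hypothetical file names and genome labels, for illustration only.
targets = ["briggsae", "nigoni"]
commands = [
    hal_liftover_cmd("caenorhabditis.hal", "elegans", "query.bed", t, f"elegans_vs_{t}.psl")
    for t in targets
]
# Each command could then be executed with subprocess.run(cmd, check=True).
```

One command per target species yields one pairwise PSL file per species, which mirrors the ten pairwise files described above.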

I.III-Pipeline for visualizing the alignment

To visualize the alignment, the Wormsynteny App uses gggenomes, a versatile graphics package for comparative genomics. It extends the popular R visualization package ggplot2, which makes it quicker to learn and allows us to map genes to orthogroups using different colors in one comprehensive plot.

A gggenomes plot typically consists of three types of data tracks, each of which contains a single dataset:

seqs: sequences such as contigs or chromosomes
feats or genes tracks: annotations of locations on sequences
links: annotations that connect two locations between two different sequences

The Wormsynteny App uses a pipeline to transform the input data (a Cactus-generated HAL file storing the alignments between species, and GTF files storing the gene annotations of the eleven species) into three tables: aligned genes, aligned sequences, and links. These three tables are then passed to the gggenomes function to create the plots.

The aligned sequences table contains, for each chromosome, the region spanning the consecutive homologous fragments, that is, the interval from the minimum start to the maximum end. The aligned genes table contains the genes that intersect the aligned regions, together with their orthogroup data. Here is the orthogroups table mapped to genes for the eleven Caenorhabditis species. Finally, the links table contains all the link coordinates between phylogenetically adjacent species.

The process of data transformation involves three primary steps. First, we retrieve from the HAL file the fragments of the ten other species that are homologous to the C. elegans query sequence. Second, the links table is created from the raw homologous fragments by recovering the homologous fragments between phylogenetically adjacent species.

Two main hurdles can interfere with the accurate visualization of genes. First, very short homologous fragments may be indiscernible on the plot and clutter it with superfluous information. Second, fragments that lie far from the other fragments on the same chromosome may produce an aligned sequence whose length is disproportionate to the actual length of the genes, which hinders their precise visualization.

To address these issues, we assemble neighboring homologous fragments whose separation does not exceed a predetermined maximum length, referred to as the gap value. We also remove fragments that fall below a minimum length, which is expressed as a percentage of the inquiry length (the filtering percentage).
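The assembly and filtering logic can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the App's actual code: fragments are (start, end) pairs on a single chromosome, the separation between two fragments is measured from the end of one to the start of the next, and the function names are hypothetical.

```python
from typing import List, Tuple

Fragment = Tuple[int, int]  # (start, end) on one chromosome


def assemble(fragments: List[Fragment], gap: int) -> List[Fragment]:
    """Merge neighboring fragments separated by at most `gap` bp."""
    merged: List[Fragment] = []
    for start, end in sorted(fragments):
        if merged and start - merged[-1][1] <= gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def filter_short(fragments: List[Fragment], inquiry_length: int, percent: float) -> List[Fragment]:
    """Keep fragments whose length is at least `percent` % of the inquiry length."""
    min_length = inquiry_length * percent / 100
    return [(s, e) for s, e in fragments if e - s >= min_length]


fragments = [(100, 400), (500, 900), (5000, 5050)]
assembled = assemble(fragments, gap=200)   # first two fragments are 100 bp apart and merge
kept = filter_short(assembled, inquiry_length=1000, percent=16)  # drops the 50 bp fragment
```

With a gap value of 200 and a filtering percentage of 16 on a 1000 bp inquiry, only the assembled fragment (100, 900) remains in this toy example.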

Once the homologous fragments are filtered, the last step is to generate the tables:

by intersecting genes from the GTF files with the homologous fragments (aligned genes table)
by taking the total span of the homologous fragments on the same chromosome (aligned sequences table)
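The two table-building operations listed above can be sketched as follows. The tuple layouts and function names are assumptions made for illustration, and a gene is treated here as intersecting when it overlaps a fragment on the same chromosome by at least one position.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[str, int, int]     # (seqname, start, end)
Gene = Tuple[str, str, int, int]    # (gene_id, seqname, start, end)


def aligned_sequences(fragments: List[Fragment]) -> Dict[str, Tuple[int, int]]:
    """Span from the minimum start to the maximum end for each chromosome."""
    spans: Dict[str, Tuple[int, int]] = {}
    for seqname, start, end in fragments:
        if seqname in spans:
            lo, hi = spans[seqname]
            spans[seqname] = (min(lo, start), max(hi, end))
        else:
            spans[seqname] = (start, end)
    return spans


def aligned_genes(genes: List[Gene], fragments: List[Fragment]) -> List[str]:
    """IDs of genes overlapping at least one fragment on the same chromosome."""
    hits = []
    for gene_id, g_seq, g_start, g_end in genes:
        if any(g_seq == f_seq and g_start <= f_end and f_start <= g_end
               for f_seq, f_start, f_end in fragments):
            hits.append(gene_id)
    return hits


# Toy data with hypothetical gene IDs.
fragments = [("IV", 100, 900), ("IV", 1200, 1500)]
genes = [("geneA", "IV", 850, 1300), ("geneB", "IV", 2000, 2500), ("geneC", "X", 100, 200)]
spans = aligned_sequences(fragments)
hits = aligned_genes(genes, fragments)
```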

A flowchart below summarizes the details of the pipeline.

I.IV-Data outputs

The Wormsynteny pipeline generates three tables (aligned genes, aligned sequences, and links), as seen in the previous section. These tables are then passed to the gggenomes function from the gggenomes package, which creates the synteny plot.

Table outputs

The aligned genes table contains the following columns:

bin_id: the species names
seqname: the sequence names (chromosome names)
start: the starts of the aligned genes
end: the ends of the aligned genes
length: the lengths of the aligned genes
type: the feature types (either gene or CDS here)
gene_id: the Wormbase gene IDs
genes_name: the Wormbase C. elegans gene names
gene_biotype: the gene biotypes (protein coding, etc.)
orthogroup: the orthogroups to which the aligned genes are assigned
coverages (see III.II-Finding the Ends of Genes section)

The aligned sequences table contains the following columns:

bin_id: the species names
seqname: the sequence names (chromosome names)
start: the starts of the aligned regions
end: the ends of the aligned regions
length: the lengths of the aligned regions

The links table contains the following columns:

bin_id: the species names of the first paired species
seqname: the sequence names (chromosome names) of the first paired species
start: the starts of the homologous fragments of the first paired species
end: the ends of the homologous fragments of the first paired species
length: the lengths of the homologous fragments of the first paired species
strand: the strands (+ or -) of the homologous fragments of the first paired species
bin_id2: the species names of the second paired species
seqname2: the sequence names (chromosome names) of the second paired species
start2: the starts of the homologous fragments of the second paired species
end2: the ends of the homologous fragments of the second paired species
length2: the lengths of the homologous fragments of the second paired species
strand2: the strands (+ or -) of the homologous fragments of the second paired species

Plot outputs

There are two types of plot outputs: macrosynteny plots and microsynteny plots.

Below is an example with the gene gap-1 (gap = 200, filter = 16) at the macrosynteny level.

The scale is based on the position of the C. elegans inquiry sequence on its chromosome, which is indicated in the title. The aligned sequences follow the order of the phylogenetic tree and are labeled on the left with the species names (bin_ids). The aligned genes are shown as colored loci on each sequence, and their orthogroups are listed in the Orthogroups legend. The C. elegans genes legend gives the C. elegans gene names for the orthogroups of the same color. Finally, the gray areas between sequences represent the links, that is, the homology between two fragments.

Below is an example with the gene gap-1 (gap = 200, filter = 16) at the microsynteny level.

Similarly, the aligned sequences are sorted in the order of the phylogenetic tree and labeled with their respective species names. Here, however, the loci colored by orthogroup are of type CDS, that is, coding exons. The spaces between exons are introns. Together, they form the genes.

Another example is shown below with the gene Y39A3CL.7 (gap = 500, filter = 5) at the macrosynteny level. It has additional fragments, marked by the (filtered) tags next to the sequences. These additional fragments are also sorted according to the phylogenetic tree.

II-Inquiry parameters

II.I-Key definitions

The region of inquiry

The Wormsynteny App is used to create plots that show the alignment of genes among eleven Caenorhabditis species with Caenorhabditis elegans as the query species. Thus, the user needs to specify a sequence of inquiry from the C. elegans genome to align with the other ten species and to retrieve homologous fragments from them. This sequence can be entered as genomic coordinates (chromosome, start, end) or by selecting a C. elegans gene from the dropdown list. If a gene is selected, its coding region will be used as the inquiry sequence.

Setting the region of inquiry and clicking the “Plot” button launches the alignment pipeline to generate a plot. Two other important parameters that affect the pipeline are “Gap value” and “Filtering percentage” (explained below).

The history list

The app is built around a history list that stores the plots and data generated by the pipeline. The stored data include the genomic coordinates of the region of inquiry, the plot, and the aligned genes, aligned sequences, and links tables. The plot can be modified (e.g., by selecting certain species to display) using the data stored in the history list.

Clicking the “Plot” button will always create a new set of alignment data based on the region of inquiry and load the new data into the history list.

II.II-Generate a plot

There are two ways to select a region of inquiry for the alignment plot.

Typing in genomic coordinates

The inquiry section contains the following fields:

Chromosome is the C. elegans chromosome where the region of inquiry is located
Start is the start of the inquiry
End is the end of the inquiry
Gene selection (see next paragraph)
Gap value is the maximum length (bp) allowed for two adjacent genomic fragments (that are aligned to the inquiry) to be assembled into one longer fragment; increasing gap values increases the chance of assembling all the aligned fragments into one long aligned region
Filtering percentage is the minimal length requirement (in the percentage of the inquiry length) for the assembled fragments to be included in the final output; increasing filtering percentage increases the chance of removing small assembled fragments.

When the website is first loaded, the coordinates are pre-filled with those of the mec-17 gene as an example, together with one of the optimal combinations of gap value and filtering percentage (see the Wormsynteny article for details). You can change the settings to the desired parameters before clicking the “Plot” button to launch the pipeline. A plot will be generated in 10-20 seconds.

After the plot is generated, you can retrieve the coordinates of the inquiry and the other parameters in the “Explanations” section below the plot.

Selecting a gene

A second way to input a region of inquiry is to select one of the ~47,000 C. elegans genes in the dropdown list at “Gene selection”. Start typing the gene name until the desired gene appears. Select the gene and click the “Plot” button. The coding region of the gene will be used for the alignment.

Please note that when a gene (e.g., mec-2) is selected, the field for genomic coordinates is ignored. After the plot is generated, the gene field is emptied, and the coordinates field is automatically updated to show the coordinates of the coding region of the gene selected for the alignment.

After the plot is generated, you can retrieve the gene name and the other parameters in the “Explanations” section below the plot.

II.III- Changing inquiry coordinates

Via Zoom buttons

You can use the Zoom function to modify the inquiry coordinates. Clicking the “Zoom In” and “Zoom Out” buttons will automatically shrink and enlarge the region of inquiry by 500 bp at both ends, respectively. So, there is no need to change the coordinates manually.

For example, the default mec-17 coordinates are IV: 7983802 … 7987183.

After clicking the “Zoom Out” button, the start position is reduced by 500, whereas the end position is increased by 500. The “Zoom In” function does the opposite.
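The arithmetic behind the two buttons can be written out as a short sketch, using the default mec-17 coordinates as a worked example. The function name and signature are hypothetical, and boundary cases (chromosome ends, or regions too short to shrink) are not handled here.

```python
from typing import Tuple


def zoom(start: int, end: int, step: int = 500, zoom_in: bool = False) -> Tuple[int, int]:
    """Shrink (zoom in) or enlarge (zoom out) a region by `step` bp at both ends."""
    if zoom_in:
        return start + step, end - step
    return start - step, end + step


start, end = 7983802, 7987183                 # default mec-17 coordinates on chromosome IV
zoomed_out = zoom(start, end)                 # (7983302, 7987683)
zoomed_in = zoom(start, end, zoom_in=True)    # (7984302, 7986683)
```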

The “Explanations” panel records updated coordinates after the Zooming.

Via back and reset

Pressing the “Back” button brings back the last entry in the history list and displays all the data from the previous inquiry, including the plot, aligned genes, sequences, and links. It also automatically fills in the coordinates of the last inquiry.

For example, when the current inquiry corresponds to the mec-2 coordinates, pressing the “Back” button brings back the mec-17 coordinates from the last inquiry.

The “Reset” button will simply fill the coordinates with the mec-17 default coordinates.

III-Plot control

III.I- Synteny scale

Wormsynteny can display two levels of synteny, macrosynteny at the gene level and microsynteny at the exon/intron level. Synteny at the macro level does not necessarily indicate synteny at the micro level (illustrated by the examples below adapted from the Genomics and Comparative Genomics website).

Moreover, the microsynteny option allows the analysis of genomic changes at a small scale (e.g., intron elongation, exon number changes) among species. To switch between visualization of synteny at the gene and exon/intron levels, click the “Macrosynteny” or “Microsynteny” button. Please note that the default plot will be in the macrosynteny style for a new inquiry.

Below is the gap-1 gene plot as an example of using the “Microsynteny” button to go from the macrosynteny level to the microsynteny level.

Introns are represented by the spaces between exons, which are shown as colored loci on each sequence.

III.II- Finding the Ends of Genes

The output of the pipeline is a set of aligned regions that are assembled from smaller aligned fragments based on the gap value and are filtered for length based on the filtering percentage. Genes are then annotated on these aligned regions, which may or may not include the entirety of the genes. This cannot be easily discerned from the plot, so we added a column called Coverage in the “Aligned genes” tab of the output to indicate whether the genes are fully covered by the aligned region.

If we select the gene mec-1 as the inquiry, we will have the “Aligned genes” table below in the output. If the last column Coverage shows “FULL”, it means the displayed gene block shows the entire gene. Otherwise, “PARTIAL” is shown with the percentages of the gene shown in the gene block.
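A coverage value of this kind could be derived along the following lines. This is only a sketch: the function name, the way lengths are measured, and the exact text of the PARTIAL label are assumptions, not the App's actual output format.

```python
def coverage(gene_start: int, gene_end: int, region_start: int, region_end: int) -> str:
    """Report whether a gene lies fully inside an aligned region."""
    overlap = min(gene_end, region_end) - max(gene_start, region_start)
    gene_length = gene_end - gene_start
    if overlap >= gene_length:
        return "FULL"
    # Share of the gene that falls inside the region (0 if there is no overlap).
    percent = 100 * max(overlap, 0) / gene_length
    return f"PARTIAL ({percent:.0f}%)"


full = coverage(100, 200, 50, 300)      # gene entirely inside the region
partial = coverage(100, 200, 150, 300)  # only the second half of the gene is covered
```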

To find the ends of the genes and display them in their entirety in the aligned region, we created a “Gene-centered visualization” button that extends any gene present in the plot to its full size.

Once the button is clicked, the plot is updated and the “Aligned genes” table shows “FULL” coverage for every gene in the aligned region (see below). Please note that this button does not change the alignment results and does not adjust the gap value or the filtering percentage. It simply extends the aligned regions to include full genes.
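The extension step can be pictured with the sketch below, which widens a region just enough to contain every gene that overlaps it. The function name and the data layout are hypothetical, and genes that do not overlap the original region are left out, which is an assumption about the intended behavior.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end)


def extend_region(region: Interval, genes: List[Interval]) -> Interval:
    """Extend a region so that every gene overlapping it is included in full."""
    start, end = region
    # Overlap is tested against the original region only.
    overlapping = [(gs, ge) for gs, ge in genes if gs <= end and start <= ge]
    for gs, ge in overlapping:
        start, end = min(start, gs), max(end, ge)
    return start, end


extended = extend_region((1000, 2000), [(800, 1200), (1900, 2600), (3000, 3500)])
```

In this toy case the region grows to cover the two genes that cross its boundaries, while the distant third gene is ignored.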

III.III- Alignment management

In this section, we discuss how to plot selected alignments when some species still have multiple aligned regions after the length-filtering step, which is controlled by the filtering percentage. Due to visualization constraints, we can only plot one aligned region per species and show the synteny links among them. When multiple regions from one species align to the C. elegans inquiry, the default setting is to select the longest aligned region for the plot, place the others at the bottom, and label them with “(filtered)”, because these fragments have passed the filtering criteria although they are not the longest.

These extra fragments may be adjacent to the plotted fragment, in which case they can be connected with the longest aligned region if the gap value is increased. Alternatively, they may be located far away from the plotted fragment and thus represent a true second aligned site for the sequence of inquiry.
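The default selection described above amounts to picking the longest region per species and setting the rest aside. The sketch below illustrates this for one species; the function name and tuple layout are hypothetical, and how the App breaks ties between regions of equal length is not specified here.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (seqname, start, end)


def pick_main_region(regions: List[Region]) -> Tuple[Region, List[Region]]:
    """Split regions into the longest one and the remaining "(filtered)" extras."""
    ordered = sorted(regions, key=lambda r: r[2] - r[1], reverse=True)
    return ordered[0], ordered[1:]


# Toy regions for a single species.
regions = [("II", 100, 400), ("V", 5000, 6200), ("II", 900, 1000)]
main, extras = pick_main_region(regions)
```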

We created two ways to manage the display of those extra fragments using the “Hide” and “Switch” buttons.

Hide button

When it is not necessary to show the shorter fragments, one can click the “Hide” button to hide them in the plot and display only the longest aligned region from each species. Below is an example for the gene bath-38 (gap value = 200, filtering value = 16). The pipeline generated extra aligned regions in three species, which contain genes from orthogroups that are different from the genes in the main aligned region.

After clicking the “Hide” button, all the extra alignments are removed and only one aligned region and one gene from each species are shown.

To have them back, press the “Back” button or go through the history list by clicking the “History” button.

Switch button

In some cases, the desired fragment may not be the longest one and is thus placed at the bottom with the “(filtered)” label. To use these shorter fragments in the alignment plot, you can click the “Switch” button to replace the default, longest aligned fragment with the first extra fragment. If there are multiple extra fragments, you can keep clicking the “Switch” button until the desired fragment is used in the alignment plot to show the links.

Below is an example with the gene Y39A3CL.7 (gap = 500, filter = 5).

If the user wants to use the extra C. remanei fragment in the alignment, they can select “C. remanei” in the selection panel (1) and then press the “Switch” button (2) as below.

After clicking the “Switch” button, the original aligned fragment is replaced by the extra fragment, placed at the bottom of the plot, and labeled “(filtered)”. Clicking the “Switch” button again switches the fragment back into the alignment.

Show button

The default pipeline shows the alignment of C. elegans query sequences with all ten other nematode species. If the user wants to show only selected species, they can select the desired species in the selection panel and click the “Show” button to display only these species in the plot. The order of the species in the plot still follows the phylogenetic tree.

Here is an example with the mec-17 gene (gap = 200, filter = 5).

One can select species in the selection panel (1) and press the Show button (2) as below.

The plot is then updated to show only the aligned regions from the selected species. Please note that you do not need to select C. elegans, since it is the query species and will always be shown.

IV-Download outputs

The app offers several options to download the output.

Download current data

The app allows the user to download the plot in pdf format as well as the tables (aligned genes, aligned sequences, links, etc.) in csv format. To do so, simply click on the download button below the plot and the tables.

Download historical data

When users adjust the parameters to modify the plots, each version of the plot and its associated data are stored in the History list, which allows the user to go back to a previous version without remaking the plot. To download a specific version of the data, press the “History” tab to see the stored data in a table format. The table lists the coordinate information as well as the last adjustment made to the plot in the Modifications column, which helps to differentiate plots.

Using the mec-17 gene as an example, we ran a few variations of the plot, which are shown in the History table. Notice that the coordinates and pipeline parameters did not change because we only made adjustments to the plot display. The types of adjustments are shown in the Modifications column.

To download data, select a plot version by selecting a row in the table, then press one or more “download links” below the table to retrieve relevant data.
