Table of Contents
• Mutation Data
• Expression Data
Running and Data Processing
• Status and Run Records
• Data Processing
◊ In-silico Translation and Alignment
◊ Variant Annotation
◊ Context-Specific Data Integration
• Alignment Summary
• Variant Info
• Input Variants
• Transcript Annotation
Clonotator is a free web-based platform that allows for efficient screening of the quality and suitability of genetic constructs across many different contexts (e.g. spatial, temporal, disease-specific) by integrating several bioinformatics tools with mutation annotation and transcriptomic data. The tool allows users to quickly identify which transcript variant a given cDNA construct most accurately models and how completely it does so, and to detect amino acid-level mutations in the given sequence (as compared to the reference). Using the gnomAD database, it can also annotate the population frequency and predicted likelihood of damage caused by these variants. Finally, by integrating user-provided 1) disease-specific mutation data and 2) context-specific transcription data (the BrainSpan dataset is also incorporated internally), users can better assess whether a given construct is a suitable model of the target gene in a given context.
The X11 License (X11)
Copyright © 2018 Willsey Lab
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name(s) of the above copyright holders shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization.
There are three possible inputs, one of which is required: (1) Fasta File (required), (2) Mutation Data (optional), and (3) Expression Data (optional). The main input page is shown below in Figure 1, followed by a description of each input.
Figure 1 - Input Page. This figure depicts the display for the input page in which a cDNA construct sequence and context-specific mutations and/or transcriptomic data can be entered/selected.
1. Fasta (Required)
Here's an example of the basic nucleic acid sequence input in FASTA format:
2. Mutation Data (Optional)
Optionally, the user can input a list of mutations of interest to assess whether the construct entered (or which of the other reference constructs for the gene) spans the locus containing the mutation(s). The mutation locations are highlighted in a visual depiction of the transcripts, which helps the user assess which constructs are more or less relevant to a phenotype (often a disease/disorder) associated with the mutation.
Since we use the GRCh38/hg38 genome assembly internally, we strongly recommend that users enter mutations in GRCh38/hg38 coordinates. Mutations submitted in GRCh37/hg19 coordinates will be converted to GRCh38/hg38 by liftover; hg18 and older assemblies are not supported. Whichever assembly is used, mutations must be entered in the following format(s) (tab- or space-separated):
Figure 2 - Format for Input Variant Entry. This table depicts the required formatting for input mutation entries for each of the different genome assemblies. Note that the mutation type is optional and can be set as either “LGD” or “Mis3” (Sanders et al. 2015).
| Chromosome | Start Coordinate | End Coordinate | Assembly | Mutation Type |
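As a rough illustration of the entry format above, the following Python sketch parses a single mutation line. It assumes that the columns appear in the order listed in Figure 2 and that the mutation type, when present, is the last field. The names `MutationEntry` and `parse_mutation_line`, the assembly label spelling, and the sample coordinates are hypothetical and are not taken from Clonotator itself.

```python
from typing import NamedTuple, Optional

# Mutation types named in Figure 2 (the field itself is optional).
ALLOWED_TYPES = {"LGD", "Mis3"}


class MutationEntry(NamedTuple):
    chromosome: str
    start: int
    end: int
    assembly: str
    mutation_type: Optional[str]


def parse_mutation_line(line: str) -> MutationEntry:
    """Parse one tab- or space-separated mutation entry.

    Assumed column order: chromosome, start, end, assembly, [mutation type].
    """
    fields = line.split()  # splits on any run of tabs or spaces
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}: {line!r}")
    chromosome, start, end, assembly = fields[:4]
    mutation_type = fields[4] if len(fields) == 5 else None
    if mutation_type is not None and mutation_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown mutation type: {mutation_type!r}")
    return MutationEntry(chromosome, int(start), int(end), assembly, mutation_type)


# Made-up coordinates, for illustration only.
print(parse_mutation_line("chr1\t1000\t1000\thg38\tLGD"))
print(parse_mutation_line("chr1 2000 2005 hg19"))
```

The assembly label is stored without validation here, because the exact spelling the input form expects is not stated in this document.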
Previously saved mutational datasets entered on a registered account can also be selected on the alignment selection page (see below). The records for saved mutations can be found under the “Mutations” tab on the left (only available for registered accounts), and include the gene in which the mutation occurs, mutation type, location, and the List ID – a unique identifier corresponding to the dataset the mutation is from, entered by the user on the input page.
Figure 3 - Input Mutations Record Page. This page depicts the Input Mutations Record Page, which tracks input mutations along with the gene in which they occur, their type, their coordinates, and their List ID (a unique identifier entered by the user when the mutations are entered). Each registered account will have its own mutation records, which can be reaccessed and queried at any time (guest users do not have saved mutations).
3. Expression Data (Optional)
We have integrated expression data from the BrainSpan project internally. BrainSpan has collected transcriptome profiles for different brain regions across developmental stages 2A-11 (see BrainSpan for details). Users can apply these data by selecting up to 3 spatio-temporal contexts for use in the final visualization of the construct (and the other reference transcripts for the gene). By visually aligning regions of expression in a given context (e.g. mid-fetal prefrontal cortex) to the transcripts, users can assess which transcript(s) contain the regions of high (or low) expression in that context and select a transcript accordingly.
Users can also submit expression data for a run using the given template. Note that this data will not be stored and must be re-entered for each run. Please make sure the header line is not deleted. Here's an example of the custom spreadsheet.
The start and end are GRCh38/hg38 coordinates corresponding to each region (such as an exon). To facilitate comparisons across multiple contexts, we support up to 3 different “samples,” meaning that a user can submit at most 3 expression datasets, one per context. Our pipeline does not perform normalization, so the user can submit either raw or normalized data here.
Once all the required inputs have been entered, the user can click the run button at the bottom of the page. This leads to the alignment selection page, which displays all the possible alignments of the translated input sequence to the reference protein isoforms from Ensembl for the gene of interest. The user can select the preferred alignment (generally, the longest complete alignment), as well as the combination of mutational datasets (see above) to use for context-specific mutation annotation. The next page reports whether the run succeeded and provides a link to the run records page.
Figure 4 - Alignment Selection Page. This page displays the possible complete alignments between the submitted cDNA construct sequence (“Query”) and the reference sequences (“Subject”) from Ensembl, along with their lengths and number of identities. The radio button allows for the selection of a particular alignment. Also, mutation lists of interest (including those submitted for the current run, as well as saved lists) are displayed. Multiple lists can be selected to be included in the final annotation.
Figure 5 - Run Result Page. This page displays the run ID, as well as a link to access the final output page. If an error occurs during the run, it will show here.
2. Status and Run Records
The run records page summarizes the user-selected alignment from the run, including the gene symbol (for the relevant gene of interest), the Run ID (a unique ID for that run), the Ensembl ID of the reference protein for the alignment, the number of identities in the alignment, the variant count in the alignment, and the run date.
The “Status” field contains a button that also conveys the current status of the run. The status can be “Check,” “Warn,” “Pass,” or “Fail”; in every case, clicking the button takes the user to the final output page. “Check” means that the user has not yet fully evaluated the run (which can be done by viewing the final output; see below). “Warn” means that a potential variant has been detected in the alignment and should be reviewed in the final output (again by clicking the button). “Pass” means that the user has evaluated the final output and approved the construct, and “Fail” means that the user has evaluated it and not approved it. The option for indicating whether a construct passes or fails is provided on the final output page.
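The four status values can be summarized in a short Python sketch. The precedence used here (a recorded user decision overrides “Warn,” which in turn overrides “Check”) is an assumption inferred from the descriptions above, and the names `RunStatus` and `run_status` are hypothetical rather than part of the platform.

```python
from enum import Enum
from typing import Optional


class RunStatus(Enum):
    CHECK = "Check"
    WARN = "Warn"
    PASS = "Pass"
    FAIL = "Fail"


def run_status(variant_count: int, user_passed: Optional[bool] = None) -> RunStatus:
    """Derive the status shown for a run record.

    variant_count: number of variants detected in the selected alignment.
    user_passed:   True/False once the user has answered "Pass?" on the
                   final output page, None if not yet evaluated.
    """
    if user_passed is True:
        return RunStatus.PASS
    if user_passed is False:
        return RunStatus.FAIL
    # Not yet evaluated: flag runs with detected variants (assumed rule).
    if variant_count > 0:
        return RunStatus.WARN
    return RunStatus.CHECK


print(run_status(0).value)
print(run_status(2).value)
print(run_status(2, user_passed=True).value)
```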
The updated status and run record for every entered construct are stored under the results tab on the left, and can be queried, reaccessed, and exported at any time.
Figure 6 - Run Records Page. This page displays the records of a run, including its “status” (see above), the gene symbol for the construct, the input sequence ID, the Ensembl Protein ID (of the reference transcript), the number of amino acid identities, the number of variants in the alignment, and the run date.
3. Data Processing
Figure 7 shows how the tool works. The pipeline contains three main steps: 1) in-silico translation and alignment of a submitted sequence; 2) variant annotation; and 3) integration with context-specific data.
Figure 7 - Overview of Pipeline. This figure depicts an overview of the flow of information from input to output for the Clone Checker platform. Inputs are displayed in orange, intermediate steps blue, and outputs green (note that the same item may have multiple roles). Items that are used for processing (e.g. a database that is queried) are shown as cylinders and primary items as rectangles.
i. In-silico Translation and Alignment
The submitted sequence in FASTA format is translated to a protein sequence using Biopython. In principle, every DNA sequence can be translated into six protein sequences, one per reading frame (three on each strand). By default, we use the longest one as the product encoded by the given sequence.
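The six-frame translation step can be sketched as follows. The pipeline itself uses Biopython; to stay dependency-free, this sketch instead builds the standard genetic code by hand. It also assumes that “longest” means the longest stretch uninterrupted by a stop codon, which this document does not state explicitly. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> list:
    """Return translations for 3 forward and 3 reverse-strand frames."""
    dna = dna.upper()
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement(dna))
        for offset in range(3)
    ]


def longest_product(dna: str) -> str:
    """Longest stop-free stretch across all six frames (assumed criterion)."""
    segments = [
        segment
        for frame in six_frame_translations(dna)
        for segment in frame.split("*")
    ]
    return max(segments, key=len)


print(six_frame_translations("ATGGCCAAATAG"))
print(longest_product("ATGGCCAAATAG"))
```

For very short inputs such as the toy sequence above, the longest stretch can come from the reverse strand, which is one reason the user still confirms the alignment on the next page.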
Simultaneously, the gene symbol in the submitted FASTA input is used for fetching reference protein sequence(s) from Ensembl. The alignments are carried out pairwise by local blastp between the translated product and the reference sequence(s). We only include the complete transcripts from Ensembl. The aligned results will be displayed on the next page where the user can select the best alignment for downstream analysis. Generally, this corresponds to the longest complete alignment.
ii. Variant Annotation
Once a hit is selected, the alignment is scanned for all mismatches, which can be introduced by amino acid substitution, insertion, or deletion. Any variants detected are further annotated using gnomAD to determine the population allele frequency, as well as the predicted level of disruption (missense Z score, pLI score, PolyPhen-2 score). Not every possible variant is contained in the database. To avoid confusing extremely rare variants with variants that happen to occur in regions of poor coverage (neither of which would appear in the database), we also report the depth of coverage among the tested population (the percentage of individuals in gnomAD with >10X coverage) at the specific loci where the variants occur.
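The mismatch scan can be illustrated with a minimal Python sketch. It assumes the alignment is available as two gapped strings of equal length (query and subject), with “-” as the gap character, as in typical blastp pairwise output. The function name and the tuple layout are hypothetical, and the gnomAD annotation step is not shown.

```python
def find_variants(query_aln: str, subject_aln: str, subject_start: int = 1) -> list:
    """Classify each mismatching column of a gapped pairwise alignment.

    Returns tuples of (kind, reference_position, reference_residue, query_residue).
    For insertions, the position is that of the preceding reference residue.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")

    variants = []
    position = subject_start - 1  # last reference residue consumed so far
    for query_residue, ref_residue in zip(query_aln, subject_aln):
        if ref_residue != "-":
            position += 1
        if query_residue == ref_residue:
            continue
        if ref_residue == "-":
            kind = "insertion"      # residue present only in the construct
        elif query_residue == "-":
            kind = "deletion"       # reference residue missing in the construct
        else:
            kind = "substitution"
        variants.append((kind, position, ref_residue, query_residue))
    return variants


# Toy alignment, for illustration only.
for variant in find_variants("MAK-LV", "MTKQL-"):
    print(variant)
```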
iii. Context-Specific Data Integration
Context-specific data can be uploaded in the form of both context-specific mutations (usually disease/disorder-specific) and context-specific expression data (e.g. tissue type, developmental timepoint, etc.).
The coordinates of uploaded/selected context-specific mutations are mapped onto the aligned exons and introns of the reference coding transcript variants in Ensembl for the gene of interest, and displayed on the output graphic. For now, we support sequence mutations (SNVs or INDELs) in hg19/GRCh37 or hg38/GRCh38 coordinates. Variant coordinates are lifted over to hg38/GRCh38 if necessary. We strongly recommend uploading variants in hg38/GRCh38, as some loci may be lost in the liftover process.
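Mapping a mutation onto exons and introns amounts to an interval overlap test, sketched below in Python. The sketch assumes inclusive coordinates that are already in hg38/GRCh38 and a transcript given as a list of (start, end) exon intervals. The function name is hypothetical, the coordinates are made up, and exons are numbered in genomic order, which for minus-strand genes differs from transcript order.

```python
def locate_mutation(start: int, end: int, exons: list) -> tuple:
    """Report whether a mutation overlaps an exon, an intron, or neither.

    exons: list of (exon_start, exon_end) tuples, inclusive coordinates.
    Returns ("exon", n), ("intron", n) or ("outside", None), where intron n
    lies between exon n and exon n + 1 in genomic order.
    """
    ordered = sorted(exons)

    # Exons take precedence when a mutation spans an exon boundary.
    for number, (exon_start, exon_end) in enumerate(ordered, start=1):
        if start <= exon_end and end >= exon_start:
            return ("exon", number)

    for number in range(1, len(ordered)):
        intron_start = ordered[number - 1][1] + 1
        intron_end = ordered[number][0] - 1
        if start <= intron_end and end >= intron_start:
            return ("intron", number)

    return ("outside", None)


# Made-up transcript with two exons.
toy_exons = [(100, 200), (300, 400)]
print(locate_mutation(150, 150, toy_exons))
print(locate_mutation(250, 250, toy_exons))
print(locate_mutation(500, 500, toy_exons))
```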
For context-specific expression data, we have integrated the expression data from the BrainSpan project, from which the user can select specific contexts, and have also provided the option of uploading custom expression datasets (see above). The uploaded expression data may be normalized, but this is not necessary; by default, we use normalized data for the BrainSpan analysis. The overlapping regions of context-specific expression are aligned to the transcripts from Ensembl for the gene of interest by hg38/GRCh38 coordinates, and integrated into the final output graphic in the form of a heatmap.
1. Alignment Summary
At the top of the final output page, the gene of interest is hyperlinked to its GeneCards entry. Below this is a summary of the alignment, including the Ensembl IDs for the reference DNA and protein transcripts selected by the user (on the alignment selection page), the identity count, the variant count (detected by the alignment), the FASTA input entered, and the full protein-level alignment with a graphical representation (the last two are hidden by default). Included at the top are the “Status” and “Pass?” fields. The button options next to “Pass?” (Yes/No) allow the user to record whether or not the construct is suitable for the user’s purposes after assessing all of the output information, and update the “Status” for that run record on the run records page (see above) to “Pass” or “Fail” accordingly.