According to their location and the kind of genetic alteration they cause, DNA variations are divided into five groups: non-synonymous, synonymous, splice site, intronic, and untranslated. Variations classified into the first three groups are those most likely to have significant functional impact. The first group may substantially affect protein function [10]: a non-synonymous variation leads to an amino acid change or creates a premature stop codon. Variations of the second [11] and the third categories [12] can affect pre-mRNA splicing. For better prioritization, FANS nominates variations as one of five risk types, with nine risk subtypes and five risk levels (Table ) [13]. Each analyzed variation is assigned one risk type and an accompanying risk level. The impact analysis offered by FANS includes examination of diminished ESE or ESS, altered protein function, protein domain abolition, and GT-AG splice site variation. A very important design feature of FANS is that the analysis is always based on the most up-to-date data retrieved from the source websites NCBI (National Center for Biotechnology Information) [15], Ensembl [16], UCSC BLAT [17], Rescue-ESE [5], Fas-ESS [6], and SIFT [7].
Risk types and risk levels used for prioritization of variations
Users of FANS can submit a single variation or a batch query. Currently, FANS offers users the option to search two species, human and mouse. After choosing the species, users need to input a sequence in FASTA format or provide information for each variation including chromosome number, physical position, orientation, and allele change, either through input fields on the web page or by uploading a batch file. For a batch file, each variation occupies a line that must contain all the required information for that variation. An example of a batch file is given online to illustrate the data format accepted by FANS. As all gene information and prediction results are acquired through the Internet in real time, the maximum allowed number of variations for each batch query is currently 100 to avoid overloading the source websites.
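As an illustration of the batch input described above, the following minimal sketch parses one variation per line and enforces the 100-variation limit. The tab-separated layout, the "+"/"-" orientation encoding, the "A/G" allele notation, and all names (Variation, parse_batch) are assumptions made for this example; the actual format accepted by FANS is defined by the example file provided online.

```python
from dataclasses import dataclass

MAX_VARIATIONS = 100  # batch limit stated in the text


@dataclass(frozen=True)
class Variation:
    chromosome: str
    position: int
    orientation: str  # "+" or "-" (assumed encoding)
    ref_allele: str
    alt_allele: str


def parse_batch(text: str) -> list[Variation]:
    """Parse one variation per line from a hypothetical tab-separated layout:
    chromosome, position, orientation, allele change (e.g. "A/G")."""
    variations = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        chromosome, position, orientation, change = fields
        if orientation not in ("+", "-"):
            raise ValueError(f"line {line_no}: orientation must be '+' or '-'")
        ref, sep, alt = change.partition("/")
        if not (sep and ref and alt):
            raise ValueError(f"line {line_no}: allele change must look like 'A/G'")
        variations.append(
            Variation(chromosome, int(position), orientation, ref.upper(), alt.upper())
        )
    if len(variations) > MAX_VARIATIONS:
        raise ValueError(f"at most {MAX_VARIATIONS} variations are allowed per batch")
    return variations
```

Rejecting an oversized batch before any remote query is issued keeps the load on the source websites bounded, which is the stated reason for the limit.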
Figure outlines the analysis flow that is started when a query is submitted. For a sequence query, the FASTA format sequence is sent to a UCSC BLAT [17] search. FANS automatically generates the required information about the novel variations from all the returned BLAT results, and users only need to choose a group of variations in one of the BLAT results for the next analysis stage. For each variation, FANS uses the submitted information to search NCBI Map Viewer to retrieve all transcripts of the gene covering the variation. If the variation falls in a non-coding region, it will be checked for its GT-AG splice site risk. Otherwise, the analysis will translate the coding region and then follow either a 'synonymous' or a 'non-synonymous' flow. A synonymous coding variation, which does not alter the amino acid sequence, is first subjected to ESE and ESS analysis. The exonic DNA fragment of eleven nucleotides centered on the variation is extracted and transmitted to Rescue-ESE [5] and Fas-ESS [6] for ESE and ESS hexamer pattern matching. Any variation found to be covered by an ESE or ESS motif that is likely to diminish exon splicing is subsequently sent to protein domain abolition analysis. At this stage, FANS utilizes protein domain information from NCBI GenPept [18] to check protein domain abolition that may be caused by splicing regulation when the diminished exon splicing changes the amino acid sequence in a protein domain.
FANS prioritizing and analysis flow path.
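The eleven-nucleotide window used in the ESE and ESS step can be sketched as follows: a fragment of that length contains exactly six hexamers that overlap its central position. This is only an illustration of the idea. FANS itself submits the fragment to Rescue-ESE and Fas-ESS, whereas the sketch compares against a locally supplied motif set; the function names, the clipping at exon ends, and the one-element motif set in the usage note are assumptions.

```python
def window_around(exon: str, index: int, flank: int = 5) -> tuple[str, int]:
    """Return the fragment of up to 2*flank+1 nt centred on `index` (0-based)
    and the offset of the variant inside it. The fragment is clipped at the
    exon ends (an assumption of this sketch)."""
    start = max(0, index - flank)
    end = min(len(exon), index + flank + 1)
    return exon[start:end], index - start


def covering_hexamers(fragment: str, offset: int) -> list[str]:
    """List every hexamer of `fragment` that includes position `offset`."""
    first = max(0, offset - 5)
    last = min(offset, len(fragment) - 6)
    return [fragment[i:i + 6] for i in range(first, last + 1)]


def lost_motifs(exon: str, index: int, alt: str, motifs: set[str]) -> list[str]:
    """Return reference hexamers that match `motifs` but no longer match
    after the base at `index` is replaced by `alt`."""
    ref_fragment, offset = window_around(exon, index)
    alt_fragment = ref_fragment[:offset] + alt + ref_fragment[offset + 1:]
    ref_hexamers = covering_hexamers(ref_fragment, offset)
    alt_hexamers = covering_hexamers(alt_fragment, offset)
    return [
        ref for ref, new in zip(ref_hexamers, alt_hexamers)
        if ref in motifs and new not in motifs
    ]
```

For example, with the toy exon "CCCCCGAAGAACCCCC" and the illustrative motif set {"GAAGAA"}, calling lost_motifs(exon, 7, "T", {"GAAGAA"}) reports that the hexamer GAAGAA is disrupted by the substitution. The same comparison could be run against an ESS motif set.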
A non-synonymous variation that produces a stop codon, resulting in very serious protein structure alteration, is categorized as risk type "Non-sense" with risk level "Very High". Other than the "Non-sense" type, FANS takes advantage of SIFT [7] to see if a particular amino acid replacement may affect protein function. Those variations predicted to affect protein function are categorized as "Mis-sense (non-conservative change)". If no significant functional impact is found despite the substitution of an amino acid, then ESE, ESS, and protein domain abolition analysis will be carried out.
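The branching between the 'synonymous' and 'non-synonymous' flows, and the subsequent handling of stop codons, can be sketched with the standard genetic code. The sketch below is not the FANS implementation (which uses BioJava for translation and SIFT for the functional prediction); the function names and the boolean sift_affects_function argument, which stands in for a SIFT result, are assumptions. A variation that removes an existing stop codon is not treated separately here.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Classify a single-base substitution within one codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos_in_codon] + alt_base.upper() + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-sense"  # the substitution creates a stop codon
    return "mis-sense"


def next_step(kind: str, sift_affects_function: bool = False) -> str:
    """Map a classification to the outcome or next stage described in the text."""
    if kind == "non-sense":
        return 'risk type "Non-sense", risk level "Very High"'
    if kind == "mis-sense" and sift_affects_function:
        return 'risk type "Mis-sense (non-conservative change)"'
    # Synonymous changes, and amino acid changes without predicted functional
    # impact, continue to the splicing-related checks.
    return "ESE, ESS, and protein domain abolition analysis"
```

For instance, classify_codon_change("TGG", 2, "A") yields "non-sense" because TGG (tryptophan) becomes the stop codon TGA, whereas classify_codon_change("GAA", 2, "G") yields "synonymous" because GAA and GAG both encode glutamate.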
By utilizing retrieved information and analyzed results from six websites, FANS efficiently prioritizes novel variations according to their risk levels in a few seconds with just a few clicks. The integrated results are divided into four parts for easy visualization: Genome View, Gene View, Transcript View, and Variation View (Figure ).
The FANS results page. The output of an analysis of eighteen variations in eight chromosomes.
The first screen, Genome View, shows an overview of the chromosome locations and the risk levels of the queried variations. The higher the risk level, the warmer the color that is used, with red color representing a very high risk level and blue a very low risk. Clicking on a variation label will move from Genome View to Gene View to display gene information of that variation. Gene View also displays the scale, location and all transcripts accommodating the selected variation. A gene selection list is provided for a user to choose the gene he/she is interested in. The transcript accession numbers from NCBI are printed on the left side of the transcript picture and transcript IDs from Ensembl are marked on the other side. Users can link to the source web pages by clicking each ID.
Next, in Transcript View, FANS offers separate transcript tabs for transcript selection (provided that a gene has more than one transcript). Gene structure comprising introns and exons is shown here. Extended exonic regions together with colored vertical lines are drawn for better illustration of variation locations. When the mouse is moved over a line, a pop-up window will show its risk level and risk type. In addition, an upwards-pointing arrow further distinguishes the position of the selected variation.
Finally, Variation View depicts the analysis flow path as well as the collected results of an analyzed variation. Selecting a variation of interest from the variation list box will bring up its associated results and analysis flow path. FANS colors the flow path that an analysis has gone through according to the final risk level outcome. The highlighted path likewise provides corresponding description pop-up windows when pointed at with the mouse. Moreover, users can download not only all analyzed results in CSV format (by clicking the icon to the right of Transcript View) but also a 200-bp flanking sequence of any variation listed in Variation View.
The following software was used to construct FANS:
i. Red Hat Enterprise Linux 4 AS Update 6
ii. Eclipse 3.2.2
iii. Subversion 1.2.1
iv. Tomcat 5.5.20
v. RoboSuite 5.5 SP2
vi. Java SE Development Kit 1.5.0_11
vii. Struts Framework 1.3
viii. BioJava 1.5
Java is the core language for integrating and processing data from the different websites and for generating the final results. BioJava, an open-source project, is used for amino acid translation. RoboSuite handles the submission and extraction of data from websites. All development was done in Eclipse, and Subversion was adopted as our revision control system.