The basic working principle of AF1 is shown in . The AF1 server has three main
components: (1) a restricted alignment-based module, (2) a probabilistic modeler, and (3) a classifier. The user supplies an input sequence
in FASTA format, either by pasting the sequence or by loading a sequence file. The server has been designed to accept a single large sequence. The input
query sequence is searched for exact-match seeds using a library of overlapping words generated from an Alu prototype sequence, unlike
other database search tools, which instead break the query into words to scan databases. For every hit, only the flanking 300 bp regions on
both sides are taken for further analysis through alignment. Each of these subsequences is first scanned for the longest possible region of
continuous match, which nucleates the alignment. Unlike other well-known methods that detect multiple nuclei, only one nucleus needs to be
located here, and the alignment is extended around it. The scoring matrices used are specific to Alu and were derived from 5000 Alu sequences. If Alu is not detected
by this alignment, the entire alignment is scanned for a small subregion with reasonable identity. If one is present, the aligned
sequence is subjected to probabilistic scanning, in which a position weight matrix (PWM) derived from an alignment of 5000 Alu sequences is
applied with overlapping windows of width 32 on both the matrix and the sequence, treating every position in the matrix and in the sequence
as a possible start position. The score is compared with a random score obtained from a randomized matrix of the same dimensions and
composition, and is evaluated against a threshold value to decide whether the region is identified as an Alu. The detected Alu repeats are
presented for both directions, and a link is provided for each. Clicking a link shows tabulated results giving the start and end positions
of each Alu found in that region. Probabilistic approaches work well when sequences are not very similar, which for Alus is the case when
they are old and highly diverged.
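The seed search step described above can be illustrated with a minimal sketch. It assumes that the word library is a simple lookup table of every overlapping word in the prototype and that the region kept for alignment is the seed plus up to 300 bp on either side. The function names, the default word size and the toy sequences are hypothetical and chosen only for illustration; the text does not state the word length used by AF1.

```python
from typing import Dict, List, Tuple

WORD_SIZE = 11  # assumed word length; not stated in the text
FLANK = 300     # flank length in bp, as stated in the text


def build_word_library(prototype: str, word_size: int = WORD_SIZE) -> Dict[str, List[int]]:
    """Map every overlapping word of the prototype to its start offsets."""
    library: Dict[str, List[int]] = {}
    for i in range(len(prototype) - word_size + 1):
        library.setdefault(prototype[i:i + word_size], []).append(i)
    return library


def find_seed_hits(query: str, library: Dict[str, List[int]],
                   word_size: int = WORD_SIZE) -> List[Tuple[int, int]]:
    """Return (query_pos, prototype_pos) for every exact word match."""
    hits: List[Tuple[int, int]] = []
    for q in range(len(query) - word_size + 1):
        for p in library.get(query[q:q + word_size], ()):
            hits.append((q, p))
    return hits


def flanked_region(query: str, hit_pos: int, word_size: int = WORD_SIZE,
                   flank: int = FLANK) -> Tuple[int, int]:
    """Half-open (start, end) of the seed plus `flank` bp per side, clipped to the query."""
    start = max(0, hit_pos - flank)
    end = min(len(query), hit_pos + word_size + flank)
    return start, end


if __name__ == "__main__":
    toy_prototype = "ACGTTGCAAGGCTTAC"             # toy string, not a real Alu prototype
    toy_query = "CCCCC" + "TGCAAGGC" + "AAAAA"     # toy query containing one prototype word
    lib = build_word_library(toy_prototype, word_size=8)
    for q_pos, p_pos in find_seed_hits(toy_query, lib, word_size=8):
        print(q_pos, p_pos, flanked_region(toy_query, q_pos, word_size=8))
```

Indexing the prototype rather than the query keeps the lookup table small and fixed, so a long query is scanned in a single pass; this is the design choice the text contrasts with tools that break the query into words.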
The last stage is classification, in which the query is converted into an encoded sequence via alignment with the Alu Sx prototype. The
same is done for all known Alu subfamilies. Finally, the encoded query is aligned to the encoded subfamily library, where only the
diagnostic positions are allowed to guide the alignment and determine the classification. The classification step runs automatically once
the first step, Alu identification, is complete. Its output consists of the start of the region, the end of the region, the assigned
subfamily, and the sequence.
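The classification idea can be sketched as follows. The encoding scheme is not specified in the text, so this sketch assumes one possible encoding: a sequence already aligned to the prototype without gaps is written as "." where it agrees with the prototype and as the observed base where it differs. Diagnostic positions are taken here to be the columns at which the encoded subfamilies disagree with one another. All names, sequences and subfamily labels below are hypothetical toy values, not real Alu data.

```python
from typing import Dict, List


def encode(aligned: str, prototype: str) -> str:
    """Write '.' where the aligned sequence matches the prototype, else the observed base."""
    if len(aligned) != len(prototype):
        raise ValueError("sequence must be aligned to the prototype length")
    return "".join("." if a == p else a for a, p in zip(aligned, prototype))


def diagnostic_positions(encoded_library: Dict[str, str]) -> List[int]:
    """Columns at which the encoded subfamilies are not all identical."""
    length = len(next(iter(encoded_library.values())))
    return [i for i in range(length)
            if len({enc[i] for enc in encoded_library.values()}) > 1]


def classify(encoded_query: str, encoded_library: Dict[str, str]) -> str:
    """Pick the subfamily that agrees with the query at the most diagnostic columns."""
    positions = diagnostic_positions(encoded_library)
    scores = {name: sum(encoded_query[i] == enc[i] for i in positions)
              for name, enc in encoded_library.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    toy_prototype = "ACGTACGTAC"  # toy stand-in for the prototype
    toy_subfamilies = {           # hypothetical subfamily consensus sequences
        "toy_ref": "ACGTACGTAC",
        "toy_A": "ACGTTCGTAC",
        "toy_B": "ACGTACGGAC",
    }
    library = {name: encode(seq, toy_prototype) for name, seq in toy_subfamilies.items()}
    query = encode("ACCTTCGTAC", toy_prototype)  # one diagnostic and one non-diagnostic change
    print(query, diagnostic_positions(library), classify(query, library))
```

Because only diagnostic columns are scored, a random mutation elsewhere in the query (position 2 in the toy example) does not affect the assignment.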
The server has been implemented on Tomcat with JSP, while the core programs have been written in C++, Python and Perl.
Details, comparisons and the algorithm of the program are available on the server page. The program achieved sensitivity and specificity
above 0.9 when validated against experimental data from various sources. These data have also been made available on the server.
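For reference, the two validation measures can be computed as in the short sketch below. It assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the text does not state which definitions were used, and the function names are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true Alu regions that were detected: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive cases to evaluate")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-Alu regions correctly rejected: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no negative cases to evaluate")
    return tn / (tn + fp)
```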