Full description
sim4 is a similarity-based tool for aligning an expressed DNA sequence
(EST, cDNA, mRNA) with a genomic sequence for the gene. It also detects
end matches when the two input sequences overlap at one end (i.e., the
start of one sequence overlaps the end of the other). sim4 employs a
BLAST-based technique to first determine the basic matching blocks
representing the "exon cores". In this first stage, it detects all
possible exact matches of W-mers (i.e., DNA words of size W) between
the two sequences and extends them to maximal scoring gap-free
segments. In the second stage, the exon cores are extended into the
adjacent as-yet-unmatched fragments using greedy alignment algorithms,
and heuristics are used to favor configurations that conform to the
splice-site recognition signals (GT-AG, CT-AC). If necessary, the
process is repeated with less stringent parameters on the unmatched
fragments.
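
The first stage described above (exact W-mer matches extended to maximal
scoring gap-free segments) can be illustrated with a minimal Python sketch.
This is not sim4's actual code: the function names, the match/mismatch
scores, and the X-drop cutoff used to stop an extension are all assumptions
chosen for illustration, and the second stage (greedy extension and
splice-site heuristics) is not shown.

```python
from collections import defaultdict


def find_seeds(est, genome, w):
    """Return (est_pos, genome_pos) for every exact W-mer shared by both sequences."""
    index = defaultdict(list)
    for j in range(len(genome) - w + 1):
        index[genome[j:j + w]].append(j)
    seeds = []
    for i in range(len(est) - w + 1):
        for j in index.get(est[i:i + w], ()):
            seeds.append((i, j))
    return seeds


def _extend(est, genome, i, j, step, match, mismatch, xdrop):
    """Walk gap-free from (i, j) in direction `step`; return the best-scoring length."""
    score = best = best_len = k = 0
    while 0 <= i < len(est) and 0 <= j < len(genome):
        score += match if est[i] == genome[j] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > xdrop:  # assumed cutoff, not a sim4 parameter
            break
        i += step
        j += step
    return best_len


def exon_cores(est, genome, w, match=1, mismatch=-5, xdrop=10):
    """Extend each seed in both directions; return unique (est_start, genome_start, length)."""
    cores = set()
    for i, j in find_seeds(est, genome, w):
        right = _extend(est, genome, i + w, j + w, 1, match, mismatch, xdrop)
        left = _extend(est, genome, i - 1, j - 1, -1, match, mismatch, xdrop)
        cores.add((i - left, j - left, w + left + right))
    return sorted(cores)


if __name__ == "__main__":
    # Toy input: one "exon" flanked by unrelated genomic sequence.
    genome = "TTTTT" + "ACGTGCATGCA" + "GGGGG"
    est = "ACGTGCATGCA"
    for est_start, genome_start, length in exon_cores(est, genome, w=6):
        print(est_start, genome_start, length)
```

Overlapping seeds on the same diagonal extend to the same segment, which is
why the sketch collects results in a set before returning them.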