Reverse complements can be the bane of any informatics projects. High-throughput sequence processing and alignment projects need to be confident that reverse strand sequences are corrected or else the results of the analyses will be junk. There are a number of ways that researchers attempt to overcome these problems. Here, I won’t discuss all of these ways but instead will discuss some of the solutions that have been coded into
phyx and contrast them with the nice options in
MAFFT has two very good solutions for reverse complement correction that are described here.
mafft --adjustdirection input > output, is relatively fast because it uses 6-mer counting (so it is looking at chunks of the sequence of length 6) instead of the whole sequence. This is generally useful but is not 100% accurate. Case in point, I have a set of sequences that includes 2,297 sequences. MAFFT with
adjustdirection takes 1m12.046s but does not get all the reverse comps.
The second method that MAFFT uses,
mafft --adjustdirectionaccurately input > output, employs the more standard dynamic programming algorithm and as a result is slower. As the name implies, however, it is more accurate than the previous method. I have found this to generally be the case but the speed does render it less useful for large analyses and pipelines. For the alignment mentioned above, if we use MAFFT with
adjustdirectionaccurately it takes more than 30 minutes (!) but does get all the reverse complements.
As part of the
phyx package and the
pxrevcomp program, there are three options for automatically fixing reverse complements. We have incorporated
Edlib into the procedures here (Sosic and Sikic, 2017).
The first option for
-g which will use the first sequence as a guide and then will align the others in the direction of the first. You can use
pxssort to sort the sequences by length to get the first sequence to be the longest sequence. You can do this by running
pxssort -s seq.fa -b 4 > outfile. You can pipe this to the
pxrevcomp program so the whole thing looks like
pxssort -s seq.fa -b 4 | pxrevcomp -g > outfile. If we take the set of sequences above and run
time pxssort -s seqs.fa -b 4 | pxrevcomp -g > tg && time mafft tg > tg.aln, it takes 0m0.822s for the phyx part and then 0m51.439s to align with mafft and all the rev comps are found. So, faster than the fastest MAFFT option.
The next option is
-p which will compare a sequence to a growing list of corrected sequences. This will take longer than
-g but, if you sort first as mentioned above, should be accurate. If we run this on the seqs mentioned above, it takes 1m59.701s instead of 0m0.822s for the phyx part. So definitely slower but not like the slowest MAFFT option and all the rev comps are found.
The final option is
-m which is a sampled guessing procedure. Instead of looking at the first sequence or progressively, this will take a sample of sequences and correct against the sample. This will probably be faster than
-p but slower than
-g. If we run this option on the seqs used above it takes 0m26.046s instead of 0m0.822s for
-g and 1m59.701s for
-p. So, in between the two. This also finds the rev comps. This one seems to be about the speed of the MAFFT fastest (if you include MAFFT alignment along with the