Reverse complements

Reverse complements

Reverse complements can be the bane of any informatics projects. High-throughput sequence processing and alignment projects need to be confident that reverse strand sequences are corrected or else the results of the analyses will be junk. There are a number of ways that researchers attempt to overcome these problems. Here, I won’t discuss all of these ways but instead will discuss some of the solutions that have been coded into phyx and contrast them with the nice options in mafft.

MAFFT solutions

MAFFT has two very good solutions for reverse complement correction that are described here.

adjustdirection

The first, mafft --adjustdirection input > output, is relatively fast because it uses 6-mer counting (so it is looking at chunks of the sequence of length 6) instead of the whole sequence. This is generally useful but is not 100% accurate. Case in point, I have a set of sequences that includes 2,297 sequences. MAFFT with adjustdirection takes 1m12.046s but does not get all the reverse comps.

adjustdirectionaccurately

The second method that MAFFT uses, mafft --adjustdirectionaccurately input > output, employs the more standard dynamic programming algorithm and as a result is slower. As the name implies, however, it is more accurate than the previous method. I have found this to generally be the case but the speed does render it less useful for large analyses and pipelines. For the alignment mentioned above, if we use MAFFT with adjustdirectionaccurately it takes more than 30 minutes (!) but does get all the reverse complements.

phyx solutions

As part of the phyx package and the pxrevcomp program, there are three options for automatically fixing reverse complements. We have incorporated Edlib into the procedures here (Sosic and Sikic, 2017).

guess

The first option for pxrevcomp is -g which will use the first sequence as a guide and then will align the others in the direction of the first. You can use pxssort to sort the sequences by length to get the first sequence to be the longest sequence. You can do this by running pxssort -s seq.fa -b 4 > outfile. You can pipe this to the pxrevcomp program so the whole thing looks like pxssort -s seq.fa -b 4 | pxrevcomp -g > outfile. If we take the set of sequences above and run time pxssort -s seqs.fa -b 4 | pxrevcomp -g > tg && time mafft tg > tg.aln, it takes 0m0.822s for the phyx part and then 0m51.439s to align with mafft and all the rev comps are found. So, faster than the fastest MAFFT option.

pguess

The next option is -p which will compare a sequence to a growing list of corrected sequences. This will take longer than -g but, if you sort first as mentioned above, should be accurate. If we run this on the seqs mentioned above, it takes 1m59.701s instead of 0m0.822s for the phyx part. So definitely slower but not like the slowest MAFFT option and all the rev comps are found.

sguess

The final option is -m which is a sampled guessing procedure. Instead of looking at the first sequence or progressively, this will take a sample of sequences and correct against the sample. This will probably be faster than -p but slower than -g. If we run this option on the seqs used above it takes 0m26.046s instead of 0m0.822s for -g and 1m59.701s for -p. So, in between the two. This also finds the rev comps. This one seems to be about the speed of the MAFFT fastest (if you include MAFFT alignment along with the phyx part).

blackrim Written by:

Be First to Comment

    Leave a Reply

    Your email address will not be published. Required fields are marked *