I got an email from a colleague requesting help with an informatics issue, and I thought it might be useful to post it here. I have been thinking about starting a section of this blog on “Technical Help Requests” or something like that, so I guess this is a test.
Here is the request:
I have a list of sequences of a set of > 1 million short repeat elements in a large eukaryotic genome, and I need to find a ~60 bp region which is most conserved among these elements. While they are “repeat elements”, they can be fairly diverse in specific sequence, but I only need a subset that contain the (near-perfect) conserved sequence. What method or software would you recommend to find this region? All the ones I usually use can’t handle that many lines of input.
Is the 60 bp sequence already known, or is the task to find the 60 bp conserved region?
It is basically known (at least, representatives of the repeat are known).
The goal is to find a >60 bp conserved region (or the most conserved region) among a known set of repeat elements. The conserved region itself is undefined. Thanks!
If you have representative sequences, then something like kallisto (https://pachterlab.github.io/kallisto/about.html) or other pseudoalignment software would work well. Most pseudoalignment software scales to millions of sequences pretty easily.
We are currently using kallisto to sift through millions of transcriptomes for conserved fragments.
Thanks! We’re taking a look into kallisto now for our seq list. We also just found an R package “DECIPHER” that seems pretty powerful.
You are welcome! vsearch (https://github.com/torognes/vsearch) might also be useful but I don’t know how well it scales.
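For anyone who wants a quick first look before setting up one of the tools above, here is a minimal Python sketch of an alignment-free pass. This is not what kallisto, DECIPHER, or vsearch do internally. It is just plain k-mer counting, swapped in as an illustration. It streams a FASTA file and counts how many elements contain each k-mer, so the best-supported k-mers point at the conserved region. The function names (read_fasta, kmer_support) and the toy sequences are made up for this example. It only catches exact matches, so for a "near-perfect" 60 bp region a shorter k would probably be the place to start, and with more than a million elements the counter can get large, so trying it on a subsample first seems sensible.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one uppercase sequence per FASTA record, streaming line by line."""
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def kmer_support(seqs: Iterable[str], k: int) -> Counter:
    """Count, for each k-mer, how many sequences contain it at least once."""
    support: Counter = Counter()
    for seq in seqs:
        # Use a set so a k-mer repeated inside one element is counted once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


# Toy data (hypothetical); a real run would pass an open file handle instead.
toy = [">a", "TTACGTACGG", ">b", "CCACGTACGA", ">c", "GGGACGTACT"]
support = kmer_support(read_fasta(toy), k=6)
print(support.most_common(1))  # [('ACGTAC', 3)]
```

For a real file, something like kmer_support(read_fasta(open("repeats.fa")), k=20) would be the equivalent call, where repeats.fa is a placeholder name. Elements containing the top k-mers could then be pulled out as the subset to hand to a proper aligner or clustering tool.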