Beware of mis-assembled genomes: still valid today!
This article was originally published at https://lcolladotor.github.io/
It’s a short note published in 2005, but damn, can anyone deny that it fits perfectly for today’s state of the art in the de novo genome assembly field? I bet no one will. For instance, it’s a solid statement to say:

> The source of most mis-assemblies is, as it has always been, repeats.

They didn’t add a “will always be” or a “will be for at least 7 more years” now that we are in 2012, but it feels like this will be the case until we can get accurate (and cheap) reads that span even the longest repeat. Then again, maybe we don’t need such huge reads, since people have already been able to find large genome duplications.
And as I said in my previous post, I’m still surprised by how carelessly the human genome assembly was carried out, given that they didn’t track their own steps. I was hoping that wasn’t the case, but clearly it is:

> Indeed, many of the original assemblies of parts of the human genome were done in the mid- and late-1990s, and are now lost.

I’m also impressed by how accurate Steven and James’ prediction was when they foresaw that people were going to be misled into judging assembly quality by contig size without taking mis-assemblies into account.
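To make the contig-size point concrete, here is a minimal sketch in Python. It assumes N50 as the size statistic (the note itself doesn’t prescribe a particular metric), and the contig lengths are made up purely for illustration.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


# Hypothetical contig lengths in kb for a correct assembly.
correct = [100, 80, 60, 40, 20]

# Same data, but the 80 kb and 60 kb contigs were wrongly
# joined across a repeat into a single 140 kb contig.
misjoined = [140, 100, 40, 20]

print("N50 (correct):  ", n50(correct))
print("N50 (misjoined):", n50(misjoined))
```

With these made-up numbers the N50 goes up from 80 kb to 100 kb, even though the second assembly is the one with the error. A size statistic on its own rewards the mis-join.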
They also called upon the bioinformatics community to take action in evaluating genome assemblies. Given the amount of data nowadays, it feels like an inhuman task (well, “incluster”, as in infeasible even for high-powered clusters :P; okay, incomputable is the correct term). But with some funding, I bet Salzberg and colleagues could find a way to do so. At least partially. Yet, as with anything, you need motivation, and I’m not sure they are motivated to clean up the mess.