A complete genome sequence is the ultimate genomic resource. If such a sequence is available, it is possible to conduct many large-scale, in-depth ecological and evolutionary genomics studies. For example, Ellegren et al. used NGS-sequenced genomes of two flycatcher species to identify approximately 50 high-divergence genomic regions that might be related to the speciation process (Ellegren et al. 2012). However, even with NGS technologies, a de novo sequencing project on a large eukaryotic genome still represents a considerable investment for a small research group. The need for multiple sequencing libraries, high-performance computational facilities and bioinformatic tools to handle the sheer volume of data generated may limit the number of non-specialized labs currently able to embark on such a project.
Therefore, most whole-genome sequencing projects relying mainly or solely on NGS data are still large collaborative efforts, with many people from several research groups featuring in the author lists. Published NGS-based whole genomes exemplifying this include giant panda (Li et al. 2010a), cod (Star et al. 2011), naked mole rat (Kim et al. 2011), macaque (Yan et al. 2011), Tasmanian devil (Miller et al. 2011), budgerigar (Koren et al. 2012), Puerto Rican parrot (Oleksyk et al. 2012), Heliconius butterfly (Dasmahapatra et al. 2012) and aye-aye (Perry et al. 2012), as well as the 29 mammalian genomes recently sequenced at the Broad Institute (Lindblad-Toh et al. 2011).
A small, ecologically focused research group with limited resources may benefit from a more economical strategy for applying NGS technologies to whole-genome sequencing. As a large number of genome sequences, from both model and non-model organisms, are now publicly available, it is prudent to use this information as much as possible when performing genomic investigations in related organisms. A good strategy is to use the genome sequence of a related model organism as a reference in the assembly of short-read data from the focal species, which is known as reference-guided (or reference-assisted) assembly (Gnerre et al. 2009; Schneeberger et al. 2011).
In theory, there are two ways of using the reference sequence to guide the assembly process. Under an “align-then-assemble” strategy, the reads are first mapped to the reference, and clusters of reads mapping to the same location are then extracted and assembled de novo. Alternatively, in the “assemble-then-align” strategy, the reads are first assembled de novo and the resulting contigs are then aligned to the reference genome (Bioinformatic analysis) to close gaps and create scaffolds (Martin and Wang 2011).
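The reference-dependent step of each strategy can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the procedure of any cited study or tool: mapping and de novo assembly are assumed to have been done by external software, alignments are treated as ungapped (so a sequence covers exactly its own length on the reference), and the names `PlacedSeq`, `cluster_mapped_reads`, `scaffold_contigs`, `max_gap` and `min_gap` are hypothetical.

```python
from typing import List, NamedTuple


class PlacedSeq(NamedTuple):
    """A read or contig with its (assumed ungapped) position on the reference."""
    name: str
    ref_start: int  # 0-based start of the alignment on the reference
    sequence: str


def cluster_mapped_reads(reads: List[PlacedSeq], max_gap: int = 0) -> List[List[PlacedSeq]]:
    """Align-then-assemble: group reads that map to the same reference location.

    Reads whose reference intervals overlap, abut, or lie within max_gap bases
    of the current cluster are placed together. Each returned cluster would
    then be handed to a de novo assembler (not shown here).
    """
    clusters: List[List[PlacedSeq]] = []
    cluster_end = None  # rightmost reference coordinate covered by the current cluster
    for read in sorted(reads, key=lambda r: r.ref_start):
        read_end = read.ref_start + len(read.sequence)
        if cluster_end is None or read.ref_start > cluster_end + max_gap:
            clusters.append([read])  # too far from the current cluster: start a new one
            cluster_end = read_end
        else:
            clusters[-1].append(read)
            cluster_end = max(cluster_end, read_end)
    return clusters


def scaffold_contigs(contigs: List[PlacedSeq], min_gap: int = 1) -> str:
    """Assemble-then-align: order contigs by reference position into a scaffold.

    Gaps between consecutive contigs are filled with runs of 'N' whose length is
    taken from the reference coordinates (at least min_gap). Overlapping contigs
    are not merged in this simplified version.
    """
    pieces: List[str] = []
    prev_end = None
    for contig in sorted(contigs, key=lambda c: c.ref_start):
        if prev_end is not None:
            pieces.append("N" * max(contig.ref_start - prev_end, min_gap))
        pieces.append(contig.sequence)
        prev_end = contig.ref_start + len(contig.sequence)
    return "".join(pieces)
```

In this sketch, `cluster_mapped_reads` corresponds to the extraction step that precedes local de novo assembly, whereas `scaffold_contigs` corresponds to the ordering step that follows it. Real data would additionally require handling of gapped alignments, strand orientation and rearrangements between the focal species and the reference.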