How do you merge SNP data with a reference genome?
# My Data

I have a 23andMe file listing SNPs in the form:

    rsid chromosome position genotype
    rsXXXXX 1 PPPPPP CT
    rsXXXXX 1 PPPPPP GG

Fields are TAB-separated and each line corresponds to a single SNP. For each SNP, four fields of data are supplied.

1. An identifier (an rsid or an internal id)
2. Its location on the reference genome.
   - The chromosome it is located on.
   - The position within the chromosome it is located on.
3. The genotype call oriented with respect to the plus strand on the human reference sequence.

The reference genome is the human assembly build 37 (also known as Annotation Release 104).

# My Question

How do I merge the SNPs into the reference genome?

For example, take the first line in my SNP file:

`rsXXXXX 1 PPPPPP CT`

### Part 1

I can see that I need to replace the nucleotide at position PPPPPP on chromosome 1 of the reference genome with a nucleotide from the genotype field, but which nucleotide am I supposed to use? C or T? And why?

### Part 2

Where am I supposed to start counting from on the reference genome? Looking at chromosome 1 of the human assembly build 37, the first ~10,000 characters (excluding the first line description) are `N`. Is the first N number 1? E.g., if PPPPPP was 100,000, would I replace the 100,000th character in the reference genome with the correct nucleotide from **Part 1** of this question? Or should I start counting from the first non-N character in the FASTA file?
I am not sure I follow your question. If you need to merge a SNP (or any other sequence) into a reference genome, you can do it with standard Linux text-processing commands, or with Perl and regular expressions. Is that what you are asking? Sorry I cannot be of more help.

`N` means that the base at that position is not assigned to A, T, C, or G. In experiments, this happens when the sequencer cannot call the nucleotide as one of the four. Placing SNPs inside the runs of `N` will not help you in any way. Most aligners skip over long runs of `N`, which is why such a run can sit at the beginning of the reference genome for technical reasons.
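For what it is worth, here is a minimal Python sketch of the substitution the question describes. It is only an illustration under several assumptions, not a definitive method. It assumes positions are 1-based build 37 coordinates, so the leading `N` characters count and position 1 is the first character after the FASTA description line. It assumes a two-letter genotype is an unphased diploid call, so for a heterozygous call such as `CT` there is no single "correct" base; the sketch writes the IUPAC ambiguity code (`Y` for C/T), which is just one possible choice. Calls that are not plain A/C/G/T (for example a no-call written as `--`) are skipped. The function names `parse_snps`, `genotype_to_base`, and `apply_snps` are made up for this example.

```python
# IUPAC ambiguity codes for each unordered pair of distinct bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def parse_snps(lines):
    """Yield (rsid, chromosome, position, genotype) from TAB-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip the header row and malformed lines
        rsid, chrom, pos, genotype = fields
        yield rsid, chrom, int(pos), genotype


def genotype_to_base(genotype):
    """Return one symbol for a genotype call, or None if it is not usable."""
    alleles = set(genotype.upper())
    if not alleles or not alleles <= set("ACGT"):
        return None  # no-call or anything that is not a plain base
    if len(alleles) == 1:
        return next(iter(alleles))  # homozygous (or single-letter) call
    return IUPAC.get(frozenset(alleles))  # heterozygous: ambiguity code


def apply_snps(sequence, snps, chrom):
    """Return a copy of one chromosome's sequence with SNP calls written in.

    Positions are treated as 1-based and counted from the first character
    of the sequence, leading N's included (an assumption of this sketch).
    """
    bases = bytearray(sequence, "ascii")  # mutable and compact
    for _rsid, snp_chrom, pos, genotype in snps:
        base = genotype_to_base(genotype)
        if snp_chrom != chrom or base is None:
            continue
        if not 1 <= pos <= len(bases):
            continue  # position falls outside this sequence
        bases[pos - 1] = ord(base)  # convert 1-based position to 0-based index
    return bases.decode("ascii")


# Toy data: a 12-base "chromosome 1" that starts with four N's.
reference = "NNNNACGTACGT"
snp_lines = [
    "rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t5\tCT",   # heterozygous, becomes Y
    "rs2\t1\t8\tGG",   # homozygous, becomes G
    "rs3\t1\t6\t--",   # no-call, left unchanged
    "rs4\t2\t1\tAA",   # other chromosome, ignored
]
patched = apply_snps(reference, parse_snps(snp_lines), "1")
print(patched)  # NNNNYCGGACGT
```

On a real chromosome you would first read the FASTA record and join its sequence lines into a single string (dropping the description line and the newlines) before calling `apply_snps`. If you need an ordinary A/C/G/T sequence rather than ambiguity codes, you would have to pick one allele per heterozygous site, and the unphased genotype alone does not tell you which one belongs on which copy of the chromosome.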