Asked • 05/27/19

How do you merge SNP data with a reference genome?

# My Data

I have a 23andMe file listing SNPs in the form:

    rsid chromosome position genotype
    rsXXXXX 1 PPPPPP CT
    rsXXXXX 1 PPPPPP GG

Fields are TAB-separated and each line corresponds to a single SNP. For each SNP, four fields of data are supplied.

1. An identifier (an rsid or an internal id)
2. Its location on the reference genome.
   - The chromosome it is located on.
   - The position within the chromosome it is located on.
3. The genotype call oriented with respect to the plus strand on the human reference sequence.

The reference genome is the human assembly build 37 (also known as Annotation Release 104).

# My Question

How do I merge the SNPs into the reference genome?

For example, take the first line in my SNP file:

`rsXXXXX 1 PPPPPP CT`

### Part 1

I can see that I need to replace the nucleotide at position PPPPPP on chromosome 1 of the reference genome with a nucleotide from the genotype field, but which nucleotide am I supposed to use? C or T? And why?

### Part 2

Where am I supposed to start counting from on the reference genome? Looking at chromosome 1 of the human assembly build 37, the first ~10,000 characters (excluding the first line description) are `N`. Is the first N number 1? E.g., if PPPPPP were 100,000, would I replace the 100,000th character in the reference genome with the correct nucleotide from **Part 1** of this question? Or should I start counting from the first non-N character in the FASTA file?
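To make the file layout and the two parts of the question concrete, here is a minimal Python sketch. It rests on assumptions that are not confirmed in this thread: positions are taken to be 1-based and to count every character of the chromosome sequence, leading `N`s included (the usual convention for reference coordinates, and exactly what Part 2 asks about), and the chromosome sequence is assumed to be already loaded into a single string with the FASTA description line and line breaks removed. The function names `parse_snps`, `base_at`, and `non_reference_alleles` are hypothetical helpers, not part of any library. The sketch only reports which genotype alleles differ from the reference base; it does not decide which one to substitute.

```python
import io


def parse_snps(handle):
    """Yield (rsid, chromosome, position, genotype) from TAB-separated SNP lines."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        # Skip an uncommented header row such as "rsid chromosome position genotype".
        if rsid == "rsid":
            continue
        yield rsid, chrom, int(pos), genotype


def base_at(sequence, position):
    """Return the base at a 1-based position (ASSUMPTION: leading N's are counted)."""
    return sequence[position - 1]


def non_reference_alleles(sequence, position, genotype):
    """Return (reference base, sorted genotype alleles that differ from it)."""
    ref = base_at(sequence, position).upper()
    # Keep only plain nucleotide calls; anything else (e.g. a no-call) is ignored.
    alleles = {a for a in genotype.upper() if a in "ACGT"}
    return ref, sorted(alleles - {ref})


if __name__ == "__main__":
    # Toy data only: a made-up 10-character "chromosome" and two made-up SNP rows.
    toy_chromosome = "NNNNACGTAC"
    toy_file = io.StringIO(
        "rsid\tchromosome\tposition\tgenotype\n"
        "rs1\t1\t6\tCT\n"
        "rs2\t1\t7\tGG\n"
    )
    for rsid, chrom, pos, genotype in parse_snps(toy_file):
        ref, alts = non_reference_alleles(toy_chromosome, pos, genotype)
        print(rsid, chrom, pos, genotype, "ref:", ref, "differs:", alts)
```

With the toy data, the `CT` genotype at position 6 has one allele matching the reference base and one that differs, while the `GG` genotype at position 7 matches the reference on both copies, so there is nothing to substitute there.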

1 Expert Answer


Aref E. answered • 05/29/19

