|Puerto Rican Harvest stock virus||soconnor||2017-02-17|
We wanted to determine which variants were present in the Puerto Rican stock virus that was prepared at UW-Madison. Using the short amplicon method developed by Quick et al., we were able to amplify and sequence two replicates of the harvest virus, called ‘PR-ABC59-Harvestvirus’.
For each replicate, we used the following number of vRNA templates as input for the cDNA synthesis reaction:
PR-ABC59 harvest virus: 1e6
animal 566628: 5,598
animal 634675: 6,787
animal 311413: 23,123
The data were analyzed using the Zequencer pipeline (see the tools attached as a .zip file below), which performs the following steps:
1. Trims and merges the paired reads from the FASTQ data.
2. Extracts 1000 reads (if they are present in the sequence data) spanning each of the 35 amplicons that were generated by the amplification protocol.
3. Maps the extracted reads (up to 1000 reads x 35 amplicons) to a full reference genome, KU50125.
4. Calls the SNP positions using SnpEff, and generates a VCF file.
5. Generates a BAM file that can be viewed in a program like Geneious.
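Step 2 can be pictured with a small sketch. This is not the Zequencer code. It assumes each read is already described by a (read_id, start, end) tuple in reference coordinates and each amplicon by a (start, end) pair, and the function names are made up for illustration. The actual pipeline may pick reads differently, for example before mapping.

```python
import random

def reads_spanning(reads, amp_start, amp_end):
    """Return the reads whose interval covers the whole amplicon."""
    return [r for r in reads if r[1] <= amp_start and r[2] >= amp_end]

def subsample_per_amplicon(reads, amplicons, max_reads=1000, seed=0):
    """Keep at most max_reads spanning reads for every amplicon.

    reads:     list of (read_id, start, end) tuples (assumed layout)
    amplicons: dict of amplicon name -> (start, end)
    """
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    picked = {}
    for name, (start, end) in amplicons.items():
        spanning = reads_spanning(reads, start, end)
        if len(spanning) > max_reads:
            spanning = rng.sample(spanning, max_reads)
        picked[name] = spanning  # fewer than max_reads if that is all there is
    return picked
```

Capping every amplicon at the same number of reads keeps the depth roughly even across the genome, so a heavily amplified amplicon does not dominate the frequency estimates.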
Once the BAM file has been created, the VCF file and the BAM file are opened in Geneious, and an annotation table is created with the following columns:
a. Track Name
e. FREQ (this is the variant frequency)
f. You can export other columns, such as Sample depth.
The exported CSV files are concatenated and a pivot table is created so that you can see the frequency of variation at each of the positions in the genome.
The pivot table is attached and labeled ‘PR_Stock&Animals_Wiki.xlsx.’
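For anyone who would rather build the pivot table in a script than in Excel, here is a minimal sketch using only the Python standard library. The ‘Track Name’ and FREQ columns come from the annotation table described above. The ‘Position’ column name is an assumption, as is the guess that FREQ may be exported with a trailing percent sign, so adjust both to match your actual export.

```python
import csv
from collections import defaultdict

def pivot_frequencies(csv_paths, sample_col="Track Name",
                      pos_col="Position", freq_col="FREQ"):
    """Concatenate exported CSV files into {position: {sample: frequency}}.

    pos_col is a hypothetical column name; change it to whatever the
    export uses for the variant position.
    """
    table = defaultdict(dict)
    for path in csv_paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                # Tolerate "1,964"-style positions and "12.5%"-style frequencies.
                position = int(row[pos_col].replace(",", ""))
                freq = float(row[freq_col].rstrip("%"))
                table[position][row[sample_col]] = freq
    # Sort by genome position so the rows read like the spreadsheet.
    return dict(sorted(table.items()))
```

A sample that has no entry at a position simply had no variant called there, which is how a site can appear in the animals but not in the stock.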
There are four samples in the table, each sequenced in 2 replicates:
PR-ABC59-Harvestvirus - the stock virus
634675 - animal 1 infected with the PR-ABC59 harvest virus
566628 - animal 2 infected with the PR-ABC59 harvest virus
311413 - animal 3 infected with the PR-ABC59 harvest virus
Here are some key points to observe:
1. First, the less interesting information. There are three sites where the variants are likely artifacts of the sequencing method. I put double asterisks next to their positions in the Excel table.
Site 118: This is a string of A’s, and there appears to be a low-level insertion that could be a consequence of slippage during PCR or sequencing.
Site 405: This is in a region that is high in G/A content. There is an occasional deletion that appears in one amplicon but not in the overlapping amplicon, so it seems to be an artifact of one of the amplicons generated during the process.
Site 9343: This is also present in one amplicon but not in the overlapping amplicon, so it is likely not real either.
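The reasoning used for sites 405 and 9343 (a variant seen in one amplicon but not in the amplicon that overlaps it) can be written down as a simple check. This is only a sketch: the input layout, the function name, and the 1% floor are assumptions chosen for illustration, and the calls above were made by eye in Geneious rather than with this code.

```python
def amplicon_specific(per_amplicon, min_freq=0.05, floor=0.01):
    """Flag a site whose variant is supported by only one overlapping amplicon.

    per_amplicon maps amplicon name -> (variant_reads, total_reads) at one site.
    min_freq mirrors the 5% calling cutoff; floor (1%) is an assumed value for
    "essentially absent" in the other amplicon.
    """
    freqs = [alt / depth for alt, depth in per_amplicon.values() if depth > 0]
    if len(freqs) < 2:
        return False  # the site is not in an overlap, so there is nothing to compare
    return max(freqs) >= min_freq and min(freqs) < floor
```

A site that passes this check is a candidate artifact, not a proven one, so it is still worth looking at the reads.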
2. Second, we have the more interesting sites:
Site 1964: This site is a real difference from the reference, but the reported frequencies may be inaccurate. It falls at the end of amplicon 6 but in the middle of amplicon 7, and the nucleotide is deleted in some of the amplicon 6 sequences. To get an accurate frequency, we need to look solely at amplicon 7.
Site 2780: This looks to be a real variant whose frequency increases in the animals relative to the stock.
Site 3147: Looks real, but not too variable across animals.
Site 3281: There is a variant at >5% in some of the animals, but it does not show up in the stock. This is a consequence of setting our cutoff to call a SNP at 5%. The stock does have variation at this site, but at just under 5%.
Site 5679: Looks real, and the frequency changes in 2 animals.
Site 7915: Looks real, but not too variable across animals.
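Two of the points above are easier to see with numbers: restricting a frequency estimate to a single amplicon (site 1964) and the effect of the 5% calling cutoff (site 3281). The sketch below uses hypothetical function names and an assumed input layout. Treating the cutoff as "greater than or equal to 5%" is also an assumption.

```python
def variant_frequency(counts_by_amplicon, only=None):
    """Percent of reads carrying the variant at one site.

    counts_by_amplicon maps amplicon name -> (variant_reads, total_reads).
    Pass only="amplicon_7" to ignore every other amplicon covering the site.
    """
    if only is None:
        selected = counts_by_amplicon
    else:
        selected = {only: counts_by_amplicon[only]}
    alt = sum(a for a, _ in selected.values())
    depth = sum(d for _, d in selected.values())
    return 100.0 * alt / depth if depth else 0.0

def passes_cutoff(freq_percent, cutoff=5.0):
    """True if a variant at this frequency would be reported as a SNP."""
    return freq_percent >= cutoff
```

With made-up counts such as `{"amplicon_6": (600, 1000), "amplicon_7": (990, 1000)}`, pooling both amplicons would dilute the estimate, while `only="amplicon_7"` reports what the reliable amplicon shows. These counts are illustrative and are not taken from the data set.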
Those are the interesting sites to consider. What does this mean?
This Puerto Rican Zika virus stock has a few sites of variation greater than 5%, and these are maintained in the animals. There are no new fixed changes in the animals, consistent with the idea that the stock virus is competent to grow in animals. Since we only looked at day 3 post-infection, we don’t know whether other changes arise as the virus continues to replicate in the animals. Future studies would need to address that.
For your viewing pleasure, I’ve attached the following:
1. The Excel file ‘PR_Stock&Animals_Wiki’ that has the information about variation at the individual sites.
2. The Zequencer.zip file that was used to generate the mapped reads.
3. The BAM and VCF files needed to view this data set.