Identification of genetic variants by whole genome sequencing in Ankamali pigs of Kerala

. Identification of genetic variants by whole genome sequencing in Ankamali pigs of Kerala


Keywords: GATK Haplotype caller, single nucleotide variants, indels, haematological traits, missense, quantitative trait loci (QTL)
Domestication of pigs is thought to have begun in the Near East around 9000 years ago and may have occurred repeatedly from local populations of wild boars. The domestic pig (Sus scrofa) belongs to the Suidae family and order Artiodactyla. Pigs are one of the most common livestock species reared for meat in Kerala. Ankamali pig is a domesticated native variety of Kerala notable for its lean meat, disease resistance and adaptability to humid tropical environments (Behl et al., 2006). However, exotic breeds are preferred over desi pig breeds due to their excellent production performance. Due to this there is a drastic decline in the Ankamali pig population. Thus, conservation of Ankamali pigs is essential for future animal production and conservation of the genetic variety.
Recent breakthrough in genome sequencing technologies have created unparalleled prospects for characterizing individual genomic landscapes and identifying mutations. Whole exome sequencing (WES) is a cost-effective approach as it sequence only the coding regions of the genome (Reshma et al., 2020). However, Whole-genome sequencing (WGS) is becoming increasingly attractive as an alternative, due to its broader coverage and decreasing cost. Whole genome sequencing (WGS) approach allows the detection of nearly an organism's entire genome sequence (Park and Kim, 2016). One of the primary goals of sequencing-based studies is to find genetic variants that differ between individuals. Several studies have been conducted in many native breeds of different species including Vechur cattle, Malabari and Attapady black goats to gain a better understanding of the genetic variances responsible for the unique features such as short stature of Vechur cattle, high prolificacy and low-fat meat of Malabari goat and high disease resistance of Attapady black goats (Reshma et al., 2020;Marykutty et al., 2021). Although a wide range of major and smallscale genomic sequence changes can affect gene function, variants that change amino acid sequences via missense, nonsense, frameshift and splice site variants are among those most likely to affect function (Bickhart and Liu, 2014). Animal genomic sequences can be easily aligned to a high-quality, annotated reference genome assembly Sus scrofa (Sscrofa 11.1), to identify DNA polymorphisms encoding these protein variations. Several variant calling tools have been created over the years, including SAMtools/BCFtools, CLC Genomics Workbench (Qiagen), FreeBayes, GATK, LoFreq, SNVer, VarDict and VarScan. GATK HaplotypeCaller (HC) is a common haplotype based variant caller that finds variants between a DNA sequence and a reference sequence (Lefouili and Nam, 2022). GATK HaplotypeCaller (GATK HC) was created by the Broad Institute of MIT and Harvard. It has high accuracy in variant calling compared to other variant calling tools, but its feasibility is limited by the long execution time needed for the analysis. The haplotypebased variant detection method can be used with single or many samples (Ren et al., 2019). The aim of the present study was to identify the genetic variants in Ankamali pigs using whole genome sequencing and analyzing the rate of variants among different chromosomes.

DNA extraction and whole -genome sequencing
The blood samples were collected from 12 Ankamali pigs reared at Centre for pig production and research, Mannuthy. Approximately, 5 ml of blood was collected using a sterile 20-gauge needle and syringe from the ear vein of each animal into vacutainer tubes containing EDTA as an anticoagulant (1mg/ml of blood). The samples were preserved in ice and brought to the laboratory and were stored at -40 °C. The genomic DNA were extracted through Qiagen DNeasy blood and tissue kit (Qiagen, Germany). Quality of the isolated genomic DNA was checked by agarose gel electrophoresis and the concentration and purity was estimated by spectrophotometer (NanoDrop TM 2000C) and pooled in equimolar quantities (0.25 µl/ per sample). A single library was generated for the pooled sample using QIAseq FX DNA Library Kit for Illumina (Average fragment size a 342bp) and sequenced on NovaSeq 6000 instrument 525 (Paired -end 2× 150bp) (Clevergene, Bengaluru)
Using Burrow Wheeler Aligner (BWA v0.717) software high quality paired-end reads were mapped to the pig reference genome Sscrofa11.1 ). Following alignment, SAMtools were used to convert from SAM format to BAM format . The resultant BAM files were sorted using Picard 'SortSam', and duplicate reads were removed by Picard 'MarkDuplicates'.

Variant calling and annotation
The GATK Haplotype caller with default parameters was used to call variants for single nucleotide polymorphisms (SNPs) (McKenna et al., 2010). SnpEff was used to annotate the identified SNPs (Cingolani et al., 2012). The variant rate in Ankamali genome was estimated from the sequence length and total number of variants. The significance of different variant rate on 18 chromosomes were analysed using the chi-square statistics (Snedecor and Cochran, 1989). The value of the Chi-square (χ 2 ) statistic was calculated using the formula:

Transition and transversion ratio
Transition and transversion are two-point mutations that occur in DNA due to substitution errors. Transition mutation occurs due to an interchange of two-ring purines (A ↔ G) or one-ring pyrimidines (C ↔ T). Transversion mutation occurs due to interchanges of pyrimidine for purines or purines for pyrimidines. Transition mutations are more frequent than transversions and two out of three SNPs are caused by transitional mutations. However, transition mutations are less likely to cause amino acid sequence changes. Transversion occurs in two possible ways since two pyrimidines and two purines are present. This is more likely to cause amino acid sequence changes (Baes et al., 2014) Transition-transversion ratio is the ratio of the number of transitions to the number of transversions for a pair of sequences. The ratio becomes 0.5 when there is no bias towards either transitional or transversional substitution. Across the entire genome the ratio of transitions to transversions is typically around 2 and in protein coding regions, this ratio is typically higher, often a little above 3 (Bainbridge et al., 2011).
From the base change matrix of SNPs obtained by the WGS data, transitiontransversion ratio was calculated as: Ts/Tv = sum of 4 transitions/sum of 8 transversions

Results and discussion
The whole genome sequencing of Ankamali pig generated 205.66 Gbs raw paired-end reads. During sequencing, 95.47 per cent of R1 reads and 94.27 per cent of R2 reads were obtained through paired end sequencing having the base quality score (Q) of >30. In DNA sequencing, the Phred quality score was used to determine the measure of base quality. Phred score of 30 (Q30) indicate, the risk of an inaccurate base call is one in 1000 with base call accuracy of 99.9 per cent (Cock et al., 2009). The GC content of both R1 and R2 reads were 43 per cent. Total GC content, which ranged between 39 and 59 percent, might be utilized as a measure for assessing the quality of sequenced reads (Guo et al., 2013). Any variation from this range suggested that contamination from other sources was present. The GC content of raw sequence data obtained in this study was 43 per cent.
After variant calling, more than 26 million SNVs were identified in the Ankamali pig's genome. Of the total number of variants discovered, 21,303,641 (80.08 %) were single nucleotide polymorphisms (SNPs), 3,056,981 (11.4%) were insertions and 2,243,967 were deletions (8.43%) ( Table 1). In Wannan black pig more than 21 million SNV's were identified of which 16,249,548 (76.23%) were single nucleotide polymorphisms (SNPs), 2,898,582 (13.59%) of insertions and 2,168,624 (10.17%) of deletions (Zhang et al., 2020). The number of variants in Ankamali pigs and Wannan black revealed that the number of variants were almost similar in both breeds. Because the Ankamali and Wannan black pigs (Chinese local breed) were compared to a European breed-based reference (Duroc), the practically identical number of variations reported in these breeds could be explained.
The number of different transitions and transversions in Ankamali pig genome were presented in Table. 2 and Fig. 1. The total number of transitions and transversions obtained were more than 14 million (14,824,878) and 6 million (6,478,763), respectively. The number of A to G and T to C transitions were 3,906,144 (18.34%) and 3,511,198 (18.33%), respectively. The number of A to T and A to C transversions were 729,849 (3.43%) and 901,472 (4.23%), respectively. Similarly, the number of G to C and G to T transversions were 734,780 (3.45%) and 877,366 (4.12%), respectively. Ts/Tv ratio for SNPs found using whole-genome sequencing should be at least 2 (DePristo et al., 2011). In general, a greater Ts/Tv ratio means more accuracy. The average Ts/Tv ratio observed in Ankamali pigs was 2.29, which was consistent with the previous report in Anquing-six-end pigs (Zhang et al., 2020). Given that only 2% of the genome codes for proteins, the majority of the discovered SNVs were found in noncoding areas (46.6 per cent in intron and 39.2 per cent in intergenic regions). Intergenic and intron regions are anticipated to play essential roles in a range of cell activities, notably in gene expression, transcriptional control and gene splicing (Bartonicek et al., 2017). This is consistent with data from several whole genome sequencing studies, which showed more than 50 per cent of single nucleotide variants in cattle and pig were found in non-coding regions (Mei et al., 2018;Yu et al., 2020).
On whole genome sequencing of Ankamali pigs, the total genome length obtained was more than 2.5 billion with an average variant rate of one variant in every 94 bases. Chromosome wise distribution of variants is depicted in the Fig 2. The overall  Several quantitative trait loci (QTL) for haematological traits such as haematocrit (HCT), haemoglobin (HCB), mean corpuscular volume (MCV) and blood platelet counts (PLT) were discovered on SSC10, all of which are important components of the animal immune system. A QTL related to teat number was also identified on SSC10. A sow must provide a fair opportunity for all of its piglets to access nipple throughout the suckling phase, so teat number has been recognised as one of the most important factors in measuring swine mothering skill (Rohrer, 2000). On SSC13, some QTLs associated with back fat thickness, growth and meat quality was discovered (Yu et al., 1999). Multiple meat quality traits (SHEAR, MOIST, DRIP, MARB, and CFAT) were impacted by SSC12 QTL This co-localization of numerous QTL for meat quality-related characteristics shows a genetic correlation between these traits. The variability of many genes and QTLs located on chromosomes 10 and 12 may contribute the phenotypic and genetic uniqueness of Ankamali animals. 528 Identification of genetic variants by WGS in Ankamali pigs ________________________________________

Conclusion
Whole genome sequencing of Ankamali pigs yielded more than 1.37 billion paired end reads and 99.77% of QC passed reads were successfully aligned to the Sus scrofa (Sscrofa11.1) reference genome. The sequencing revealed more than 21 million SNPs and five million indels. Majority of the identified SNVs (85.8 per cent) were present in the non-coding regions The Ts/Tv ratio obtained was 2.29 and overall variable rate was one in every 94 bases. Significantly lower variable rate was observed on chromosome 13 whereas; significantly higher variable rate was observed on chromosome 10 and 12. The variability of many genes and QTLs located on these chromosomes might contribute to the unique phenotypic and genetic characteristics of Ankamali pigs. The study reveals the importance of conservation of this native variety.