Introduction

Exploration of the targeted STRs using capillary electrophoresis (CE) has been currently considered as the gold standard technology in the forensic DNA analysis. This technology employs the polymerase chain reaction (PCR) followed by CE for the detection of individual-specific length variations at the STR markers. Despite many advantages, CE technology is inexpedient in the analysis of multiple genetic polymorphisms (STRs/SNPs) in a single reaction, simultaneous generation of sequence information of STR alleles, loss of a generation of valuable genetic information from the degraded samples, and generation of low-resolution results in mtDNA and mixture analysis1. Since the CE technology does not provide the information on the base-pair variations at the STR alleles, it underestimates the genetic diversity and variations present at that genetic locus. Besides, homoplasmy i.e., similar-sized DNA fragments with varied sequence compositions can be misinterpreted as homozygous due to the generation of a single peak in CE results2.

In context to abovesaid drawbacks associated with CE, Next-generation sequencing (NGS) appears to be a suitable alternative technique. It provides information from numerous STRs and SNPs simultaneously. Sequencing of STR alleles provides in-depth genetic information in terms of internal sequence variation and mutations in the samples. NGS is also useful in mtDNA sequencing which is expedient in degraded samples due to the presence of thousands of copies of mtDNA per cell3. In addition to the use in routine forensic identification, the NGS technology promises many other applications of forensic relevance such as age estimation4, body fluid identification5, forensic genealogy6, DNA phenotyping7, detection of geographic origin and ancestry of an individual7. The use of NGS technology decreases the probability of false-positive matches in the DNA profiling due to high resolution in distinguishing between DNA mixtures8. Based on these merits, the technology has been considered as the future of forensic DNA analysis.

The major advantage of NGS over CE technology is that there exists no limitation in the number of STR markers to be multiplexed in a single reaction. Therefore, many new STR marker sets have been included in the commercially available sequencing kits besides the recommended 20 core CODIS STR loci. However, before their forensic application, these loci and their aptness at the population level should be understood utterly. The inclusion of more markers could increase the discrimination power of a multiplex system. However, a limited number of genetic markers can be accommodated in a single multiplex reaction due to the involvement of different dye sets and limited channels for detection. This could be overcome by NGS analysis where numerous genetic markers can be analyzed simultaneously.

Several attempts have been made to assess the sequence-based allele frequency data for the autosomal STR markers. Most of the studies such as for the US population9, Native Americans from West-Central Arizona10, Yavapai native Americans11, White British and British Chinese populations12 and Danish population13 have used ForenSeq DNA Signature Prep Kit on a MiSeq FGx instrument (Illumina, San Diego, CA). On the contrary, limited studies are available for sequence-based allele data using Precision ID Global Filer™ NGS STR panel (Thermo Scientific, US) for the Spanish population14 and Han population15. Indian population has not yet been explored for their sequence-based STR allelic data. Therefore, an attempt was made in the present study to analyze 31 autosomal STR markers simultaneously i.e., D12S391, D13S317, D8S1179, D21S11, D3S1358, D5S818, D1S1656, D2S1338, vWA, D2S441, D5S2800, D7S820, D16S539, D6S474, D12ATA63, D4S2408, D6S1043, D19S433, D14S1434, CSF1PO, D10S1248, D18S51, D1S1677, D22S1045, D2S1776, D3S4529, FGA, Penta D, Penta E, TH01 and TPOX in the central Indian population (Fig. 1). Madhya Pradesh is the second largest geographical state of India and the fifth largest in terms of population. Being located in the middle of India, Madhya Pradesh shares its boundary with five other states including Uttar Pradesh in North, Chhattisgarh in East, Maharashtra in south, and Gujarat and Rajasthan in West. For this region, the state experiences an admixed of populations to represent mini-India. Understanding the genetic diversity of central Indian population gives a representation of the genetic print pan-India. The study aimed to generate sequence-based allele frequency data, population-specific characteristics, sequence variations, and SNPs in the flanking regions for the forensic casework applications in the studied population.

Figure 1
figure 1

Map of India highlighting the central Indian population residing in the state Madhya Pradesh, India (https://en.wikipedia.org/wiki/Madhya_Pradesh#/media/File:IN-MP.svg).

Results and discussion

Sequencing performance of precision ID NGS STR panel v2

Quality control parameters such as Locus balance (LB), Heterozygous balance (HB) and Stutter ratio of the 31 autosomal STR markers have been mentioned in Fig. 2. Out of all the STR markers, D4S2408 showed the most perfect average LB value (0.992) whereas, D16S539 showed greatest deviation from the ideal LB value (1.0), with an average value of 1.925. Other markers which showed a greater deviation from the ideal LB value included D18S51 (0.394), D2S1338 (0.411), D3S1358 (1.513), FGA (0.371), Penta D (0.167), Penta E (0.371), TH01 (1.708), and TPOX (1.579). With an ideal value of 1.0, STR markers showed HB value in the range of 1.031 (D8S1179) and 1.722 (TH01). Out of 31 STR markers tested, relatively higher heterozygous imbalance was observed in the D12ATA63 (1.396), D19S433 (1.307), D1S1656 (1.376), D22S1045 (1.325), and TH01 (1.722). None of the markers showed a deviation for the threshold set for the stutter ratio i.e., 1.4. The occurrence of the stutter products was observed to be highest in the number for D1S1656 and null stutter product was observed for D3S4529. The average value of stutter ratio ranged from 0.104 (D16S539) to 0.127 (D6S474). As the use of NGS technology is still at its nascent stage in the forensic DNA applications, quality issues of some STR markers need to be addressed by the kit manufacturers prior to their efficient use in routine forensic casework.

Figure 2
figure 2

Sequence performance of Precision ID NGS STR panel v2. (a) Locus Balance for all STRs measured as coverage of each locus divided by average coverage of all locus per sample. (b) Heterozygous Balance for all STRs measured as coverage ratio of one allele to another, for heterozygous genotypes only. (c) Stutter ratio measured as ratio of coverage of stutter peak and allele peak.

Concordance study, allele frequency, forensic and paternity parameters

Out of 31 autosomal STR markers viz. CSF1PO, D10S1248, D12ATA63, D12S391, D13S317, D14S1434, D16S539, D18S51, D19S433, D1S1656, D1S1677, D21S11, D22S1045, D2S1338, D2S1776, D2S441, D3S1358, D3S4529, D4S2408, D5S2800, D5S818, D6S1043, D6S474, D7S820, D8S1179, FGA, Penta D, Penta E, TH01, TPOX and vWA analyzed in this study; 22 overlapped STRs were compared with the length-based allele data obtained by the CE analysis. For all the samples, the length-based allele data was found to be consistent irrespective of the CE analysis or NGS data. To the best of our knowledge, this is the first report wherein sequence-based analysis of the 31 STR markers has been carried out on studied markers in any Indian population. Besides, this is also the first allelic report on nine STR markers i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474 in the Indian population. The calculated length-based allele frequency values are given in the Supplementary Table S1. Forensic and paternity parameters of the length-based and sequence-based alleles have been provided in Table 1. The average total allele number of all the genetic markers was calculated as 9.26 and the highest number of size-based alleles (18) was observed on marker Penta E, whereas, D1S1677, D4S2408, and D6S474 showed the lowest number of alleles i.e., 6 (Fig. 3). The newly analyzed markers i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474 generated a total allele number of 8, 7, 6, 8, 7, 6, 8, 11, and 6 respectively. Besides, Penta E showed the highest power of discrimination (0.978), polymorphic information content (0.90), Expected Heterozygosity (0.905) value, and the lowest matching probability (0.022), whereas, FGA showed the highest value for Power of Exclusion (0.778), Typical Paternity index (4.60) and observed heterozygosity (0.891). These findings suggested the usefulness of Penta E and FGA marker in the central Indian population based on the length-based analysis of alleles. D2S441 showed its least usefulness in the terms of polymorphic information content (0.64), power of exclusion (0.329), typical paternity index (1.35), observed and expected heterozygosity (0.630 and 0.690). Similarly, the calculated power of discrimination (0.855) and matching probability (0.145) values did not advocate the usefulness of the D5S818 marker in the studied population. On the contrary, when sequence-based forensic and paternity parameters were calculated in 31 autosomal STR markers, D2S1338 emerged to be the most useful marker in the studied population with the highest values of power of discrimination (0.984), polymorphic information content (0.920), power of exclusion (0.822), and typical paternity index (5.75), and the lowest matching probability (0.016). This suggested that the individual markers should be assessed on the basis of sequence-based alleles to get a clear idea on their usefulness in a specific population.

Table 1 Calculated forensic and paternity parameters of the 31 autosomal STR based on length-based (LB) and sequence-based (SB) alleles in the central Indian population (n = 138).
Figure 3
figure 3

Allele gains by sequences at 31 autosomal STR markers due to SNPs in flanking regions and isometric heterozygous conditions.

The previous studies also suggested the utility of the Penta E marker with higher forensic and paternity parameters in the Indian population16,17,18. This marker has already been established with high forensic efficiency for its effective use in the personal identification in the Portuguese population19, Austrian Caucasian population20, Northern Italy population21 and Mexican population22. When the newly inducted STR markers i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474 were analyzed, they showed a similar allelic range and other statistical parameters in the limited published literature from Inner Mongolia, China23, Tujia population24.

Out of 81 male samples, four samples were found to be of AMELY deletion cases; where, AMELY could not be amplified, but a positive amplification was present in three alternative sex-determining markers i.e., DYS391, SRY, and Y InDel. This result was found to be consistent with the corresponding CE data. Allele no. 10 was found to be present dominantly in 63 samples followed by allele 11 (16 samples) and allele 9 (2 samples). Similarly, Y InDel showed allele 2 in 74 samples and allele 1 in only 7 male samples. AMELY deletion is a global problem25 and simultaneous amplification of the alternative sex-determining markers26,27 is highly useful in assigning the sex of a sample appropriately as evidenced in four samples of the current study.

Increment in allele number by sequencing

A huge increase in the sequence-based allele number was detected in the studied STRs in comparison to the length-based allele numbers (Fig. 3). It has been previously studied that the presence of SNPs in STR flanking regions and allele sequence variation with similar length, majorly contribute to such increment in the allele numbers28. Substantial gain in allele numbers has been detected at D13S317, D16S539, D1S1656, D5S2800, D5S818, D7S820, and vWA with D5S2800 showing a significant increase in allele numbers due to the variation in flanking region and D3S1358 showed the highest allele gain due to the differing repeat sequence conditions. On the contrary, the genetic markers which showed no gain in allele numbers either by SNPs in flanking regions or sequence length variation included CSF1PO, D18S51, D19S433, D1S1677, D22S1045, D22S1045, D3S4529, D6S1043, FGA, Penta D, Penta E, and TPOX. Besides, the markers which showed an increment in allele number only due to SNPs in flanking regions were D10S1248, D13S317, D14S1434, and D7S820. The increased allele number in D12ATA63, D12S391, D21S11, D2S1338, D3S1358, D4S2408, D8S1179, and TH01, was due to the variation in the repeat sequences only.

Short nucleotide polymorphism (SNPs) associated with the flanking region of STRs has widely been reported throughout the globe13,29,30. The SNP-STR links SNPs with the STR polymorphism which allows the generation of an STR allele subtype, based on the observed SNP allele in the flanking region. Although many other marker combinations such as deletion-insertion polymorphisms amplified with STRs (DIP-STR) are used widely, a recent study advocated the use of SNP-STRs for forensic application, where an imbalanced DNA mixture is expected31. In this regard, the current study depicted the existence of many SNPs in the flanking region of STRs in the studied population (Table 2). rs25768 showed the highest occurrence in the central Indian population associated at upstream of D5S818 marker, whereas, rs73250432, rs369257353, and rs561924992 located at upstream of D13S317, downstream of D5S818, and downstream of D16S539 respectively showed their least occurrence.

Table 2 Observed SNPs and their characteristics in flanking regions of STRs in the current study.

Detection of alleles with identical size but different internal sequence variation has been acknowledged as one of the advantages of using NGS for studying STRs32,33. The marker-wise isoalleles observed in the central Indian population have been reported in the Table S2. Out of 31 autosomal STR markers analyzed in this study, the isometric heterozygous pattern was observed at only 16 loci i.e., D3S1358, D21S11, vWA, D5S2800, D6S474, D2S441, D12ATA63, D2S1338, D1S1656, D16S539, D8S1179, D12S391, D2S1776, TH01, D5S818, and D4S2408. Allele no. 15 of D3S1358, allele no. 19 of D2S1338, and allele no. 22 of D12S391 showed a maximum number of isoalleles with the same size and different intervening sequences (Fig. 4).

Figure 4
figure 4

A representative image showing allele sequence variations with the same length (a) 15 repeats at D3S1358, (b) 19 repeats at D2S1338, and (c) 22 repeats at D12S391.

A previous report has suggested a correlation between the allele number and various paternity and forensic parameters of an STR marker such as total possible genotypes, Power of discrimination, Matching probability, Polymorphic information content, power of exclusion, total paternity index, and gene diversity18. Keeping this in view, a substantial increase in sequence-based allele numbers in the STRs as observed in the present study increased their evidentiary value. With the increase in the allele number, the potential forensic and paternity applications of the STR markers are substantially increased. An increase in the allele number has further been correlated with the increase in heterozygosity of an STR marker which also increased its informativeness9.

Population genetics

When the observed size-based allelic data were compared at 15 consistent STR markers of the different populations and a neighbor-joining tree was constructed (Fig. 5a), the dendogram showed two distinct branches of the population clusters. One cluster included the population of Tibet, Nepal, China Han population from Yunnan Province, Southwest China, northeastern Thai people of Thailand, Hainan Li population from China, Kathmanduand Newar population, Nepal. The studied Central Indian population showed a close affinity with the population of Rajasthan, India, and the population of Odisha, India. Further, a consistent result was obtained in PCA plot based on the component 1 and component 2 (Fig. 5b), where, clustering of populations from Madhya Pradesh (Gond), Jharkhand, Uttar Pradesh, Tamilnadu, Rajasthan, Himachal Pradesh and Odisha states was observed. Therefore, the genetic sharing largely mimiced the geographical clustering. The heat map drawn using Nei’s Da distance matrix has been shown in Fig. 6. The overall result of the heat map was found in concordance with the outcomes of the NJ and PCA plot for the interpopulation comparison.

Figure 5
figure 5

(a) Neighbour Joining phylogenetic tree, and (b) PCA plot showing relatedness of the observed size-based alleles of consistent 15 autosomal STR markers with different populations.

Figure 6
figure 6

Interpopulation genetic structure at 15 consistent autosomal STR loci.

Conclusions

This first report to the best of our knowledge of sequence-based allelic data on the Central Indian population holds prominent usefulness in the forensic case works. Data obtained in this study further emphasized the implementation of NGS-based studies of STRs for forensic application. The size-based alleles showed concordance between the CE analysis as well as the NGS data. Some STR markers demonstrated a substantial variation in the repeat motifs as well as SNPs in the STR flanking regions in this study. A significant increase in the allele number further increased the statistical values of the studied forensic and paternity parameters of the STRs, thus, increasing their usefulness in the forensic applications. As per the recommendations of the ISFG, it is utmost importance to enrich the allelic data of the sequence-based STR genotypes. An increase in the allele number as evidenced in the present study also suggested the population-specific and sequence-based studies of the STR markers. In this context, the present study would be useful for providing the pioneer sequence-based data on the central Indian population.

Materials and methods

Sample collection and ethical statement

The current study and the experimental protocols were approved by the Ethics Committee of Banaras Hindu University, Varanasi, India (Ref. No. I.Sc./ECM-XII/2018–19/06). All the experimental procedures were carried out in accordance with the relevant guidelines and regulations laid by the ethical committee. Before the collection of the blood samples, written informed consent was obtained from each sample donor. Peripheral blood samples of 138 unrelated adult individuals consisting of 81 males and 57 females were collected in K2EDTA vacutainers and were stored at 4ºC till further use. Such samples were considered from the routine forensic cases at DNA Fingerprinting Unit, Forensic Science Laboratory, Bhopal, Madhya Pradesh, India and included in this study. The study was conducted following all the required quality control measures at Forensic Science Laboratory, Bhopal, M.P., India.

DNA extraction and quantification

Genomic DNA was extracted using PrepFiler Express™ Forensic DNA Extraction Kit (Thermo Scientific, US) following the manufacturer’s guidelines. The extracted DNA was quantified using Quantifiler® Trio DNA Quantification Kit (Thermo Scientific, US) and QuantStudio™ 3 Real-Time PCR System (Thermo Scientific, US) according to the manufacturer’s instructions. Further, the concentration of DNA samples wwas adjusted to 1.0 ng/µl using TE buffer and stored at − 20 °C until further use. The authors have passed the Academia Iberoamericana de Criminalística y EstudiosForenses (AICEF) DNA Proficiency test of the de BIOLOGIA y QUÍMICA FORENSE (GITAD), Spain (http://gitad.ugr.es/principal.htm).

Library preparation and quantitation

Genomic DNA isolated from the sample was converted to a sequencing library by targeted amplification of the regions of interest by using Precision ID DL8 kit and Precision ID GlobalFiler™ NGS STR panel v2 (Thermo Scientific, US) following manufacturer’s protocol on HID Ion Chef™ System (Thermo Scientific, US). Before that, each DNA sample was normalized to 1 ng in 15 µl volume followed by transfer of 15 µl normalized DNA into one of eight wells (A1-H1 position) of the IonCode™ Barcode Adapters plate. Subsequently, the plate with loaded DNA and other consumables was loaded at the designated places in the HID Ion Chef™ System to start the process of library preparation. The library preparation was carried out similarly for other 18 runs. Each library contained eight samples except the 18th library which had only 2 samples. Once the library preparation was completed, they were stored at -20° till further use. Further, the pooled libraries were quantified on the QuantStudio 5 Real-Time PCR system using Ion Library TaqMan® Quantitation kit (Thermo Scientific, US) following the manufacturer’s recommendations.

Template preparation, sequencing, and data analysis

Libraries that were prepared by automation were clonally amplified on the Ion Chef System by emulsion PCR of library molecules captured on the beads. The pooled libraries were diluted to 50 pM and mixed according to the group of barcode adaptors to accommodate 32 samples. 25 µl of each diluted library pool was loaded onto the Position A and Position B of the Ion S5™ Precision ID Chef Reagents along with other recommended plastic wares and reagents at the designated places onto the Ion Chef™ system. The Ion Chef System automated all template preparation steps, including creating the emulsion mixture, performing the PCR, carrying out the post-PCR purifications, and finally loading the purified templated beads onto the two Ion 530 chips accordingly using the manufacturer’s guidelines.

Sequencing

A sequencing run on the Ion S5 systems was initiated by loading a reagent cartridge, buffer, cleaning solution, and waste container as per the Ion S5™ Precision ID Sequencing Kit protocol of the manufacturer. The Ion S5 chip was then loaded and the run started using 200 bp chemistry with 650 flow according to the human identification GlobalFiler™ NGS STR sequencing format.

The raw data was extracted from the S5 Torrent Server v5.10.0 (Thermo Fisher Scientific) and were input into the Converge™ software v2.1 (Thermo Fisher Scientific) for sequence analysis with Homo sapiens hg19 genome. The HID Genotyper plugin v2.1 (Thermo Fisher Scientific) was applied to the analysis procedure at the default thresholds, in which the relative analytical and stochastic thresholds were both 0.05 and the stutter ratio was set as 0.14. Further sequencing performance of Precision ID NGS STR panel v2 was assessed by analyzing locus balance (LB), heterozygous balance (HB), and stutter ratio of the obtained sequences following Avila et al.34 and Brookes et al.35.

Concordance analysis with capillary electrophoresis (CE)

All the 138 samples were studied to assess the concordance between CE-STR data and NGS-STR data. All these samples were analyzed using the PowerPlex Fusion 6C System (Promega, USA) following the manufacturer’s guidelines. 0.5–1.0 ng of genomic DNA was used to amplify the samples on Veriti 96 well Thermal Cycler (Thermo Scientific, USA). Capillary electrophoresis of the amplified DNA fragments was performed using a 3500xL Genetic Analyzer (Thermo Scientific, USA). The generated STR fragments were analyzed using GeneMapper ID-X v.1.5 software maintaining a threshold of 200 RFU for all the dye sets. The CE based allelic values were compared with the sequencing-based allelic values at 23 consistent loci between Fusion 6C System and GlobalFiler NGS STR panel i.e., CSF1PO, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D1S1656, D21S11, D22S1045, D2S1338, D2S441, D3S1358, D5S818, D7S820, D8S1179, DYS391, FGA, Penta D, Penta E, TH01, TPOX, vWA, and sex determining marker Amelogenin.

Statistical analysis

Obtained sequence and allele data were evaluated for the presence of isometric heterozygous alleles and the presence/absence of SNPs in the flanking regions. Besides, various forensic and paternity parameters such as Allele frequency, Power of Discrimination (PD), Polymorphism information content (PIC), Power of exclusion (PE), Typical paternity index (PI), Observed (Ho), Matching Probability (Pm) were calculated using GenAlEx 6.5 software36, Arlequin v3.5 software37 and AMOVA for both length-based and sequence-based alleles. The observed size-based allele frequencies of the 15 consistent genetic markers were compared with the data obtained in the previously published literature by using the Fst pairwise distance.

The compared allele frequency data of the published populations included Balmiki population, Punjab, India38, Konkanastha Brahmin population, Maharashtra, India38, Naikpod Gond, Andhra Pradesh, India39, Gond, Madhya Pradesh, India40, Population of Jharkhand, India41, Populations of Uttar Pradesh, India42, population of Himachal Pradesh, India43, Tamil population, Tamil Nadu, India44, Tibetan population, Nepal45, population of Newar, Nepal46, population of Rajasthan, India47, population of Odisha, India48, Nepalese population49, Chinese Han population from Yunnan Province, Southwest China50, northeastern Thai people of Thailand51 and Hainan Li population from China52.