Methodology

ImmuneDiscover

The ImmuneDiscover method is a novel NGS-based technique designed to sequence genes in complex regions from large numbers of subjects in a multiplexed manner. The technique has been developed to require only nanogram quantities of genomic DNA as template and utilizes a unique indexing procedure that enables the combinatorial sequencing of samples from over 1000 individuals in a single analysis. The number of cases that can be simultaneously analyzed means that ImmuneDiscover is highly suitable for the genotypic analysis of cohorts containing hundreds or thousands of individuals, for example population or disease-based collections.

IGHV genotyping of each of the 1KGP samples was performed by sequencing the genomic DNA encompassing the functional V genes of each case using the ImmuneDiscover technique. This involves the targeted amplification of a multiplex of genomic segments that cover all functional V genes using a specially designed primer mix. The forward and reverse primers contain index-specific F and R tails, respectively, enabling the indexing of each amplicon library in a 96-well plate format. The indexing procedure additionally enables the use of both well- and plate-specific indices, so that the individual case-specific libraries in each well of a 96-well plate can be combined with the libraries from multiple plates and sequenced simultaneously. The ImmuneDiscover software demultiplexes each well- and plate-specific library and identifies known and novel variants from all cases, thereby enabling individualized genotyping of all samples.
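The combinatorial use of well- and plate-specific indices can be illustrated with a minimal Python sketch. This is not the ImmuneDiscover software itself: the index sequences, the index lengths, the read layout (plate index, then well index, then insert) and the function name `demultiplex` are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical index sequences; the real ImmuneDiscover indices are not
# given in the text.
WELL_INDICES = {"ACGTAC": "A01", "TGCATG": "A02"}
PLATE_INDICES = {"GATTAC": "plate1", "CCTGAA": "plate2"}


def demultiplex(reads, plate_len=6, well_len=6):
    """Group reads by (plate, well).

    Assumed layout: each read starts with the plate index, followed by the
    well index, followed by the amplicon insert.
    """
    libraries = defaultdict(list)
    for read in reads:
        plate = PLATE_INDICES.get(read[:plate_len])
        well = WELL_INDICES.get(read[plate_len:plate_len + well_len])
        if plate is None or well is None:
            continue  # unrecognized index combination, discard the read
        libraries[(plate, well)].append(read[plate_len + well_len:])
    return dict(libraries)


reads = [
    "GATTAC" + "ACGTAC" + "CAGGTG",  # plate1, well A01
    "CCTGAA" + "TGCATG" + "GAGGTC",  # plate2, well A02
    "NNNNNN" + "ACGTAC" + "CAGGTG",  # unknown plate index, dropped
]
result = demultiplex(reads)
```

Under such a scheme the number of addressable samples is the product of the two index sets: 96 well indices combined with 11 plate indices would give 1,056 distinct libraries, consistent with sequencing more than 1000 individuals in a single run.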

IgSNPer

Multiple research projects have accumulated data on the frequency of single nucleotide polymorphisms (SNPs) in collected cohorts of individuals. These datasets provide an important resource for analyzing individual and population level variation, provided that the individual SNPs can be accurately assigned to specific gene loci. We have taken advantage of this resource in two steps. First, we amalgamated the various SNP databases into a single searchable program based on the assignment of individual SNPs to the human GRCh37 and GRCh38 assemblies. Second, we utilized the ability to assign IGHV, IGHD and IGHJ sequences to defined genomic loci within the same assemblies, thereby confirming that the individual SNP variant locations are present within verified IG gene variants.

The resultant program, termed IgSNPer, analyzes each nucleotide position within a complete variant allele, identifying whether specific nucleotide variations can be confirmed by the presence of a SNP variant within the amalgamated SNP databases above a particular frequency. IgSNPer uses dbSNP build 156 for humans, which contains 1,130,597,309 reference SNP (rs) calls in total. Of these, 33,714,196 are on chromosome 14, where the immunoglobulin heavy chain (IGH) genes are located. This accumulated SNP reference set includes data from various population databases, including:

dbGaP_PopFreq: Aggregated frequency data on over 1 million individuals.
ExAC: The Exome Aggregation Consortium (ExAC) dataset accumulated SNP variation data on 60,706 unrelated individuals that were sequenced during the analysis of several disease specific and population related projects.
TOPMED: The TOPMED dataset utilizes data from 158,000 individuals of mainly European, African, Hispanic/Latino or East Asian ancestry.
TOMMO: An allele frequency panel produced from the genomic sequence analysis of 8,380 Japanese individuals.
KOREAN Reference Genome Database: Contains SNP variation data on 1,465 Korean individuals.
GoESP: An exon sequencing project dataset containing 6,503 individuals.

The program examines each full-length allelic sequence, identifying both common SNP variants, as defined by the amalgamated SNP database, and uncommon variations, which are either present at low frequency within the database or absent from it altogether. Each uncommon variation, that is, each variant not present in the amalgamated SNP database above a minimal frequency, is given an IgSNPer score of one. The full IgSNPer score for each allelic sequence is the sum of all uncommon variant nucleotides for that sequence. The IgSNPer output therefore indicates whether one or more rare SNP variants are present in that allelic sequence, with a score of zero indicating that the allele-specific variation is found above the cutoff frequency among the individuals in the amalgamated SNP database.
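The scoring rule described above can be sketched in a few lines of Python. This is an illustrative approximation rather than the IgSNPer implementation: the function name `igsnper_score`, the dictionary layout of `snp_freqs`, the comparison against an ungapped reference sequence, and the default cutoff of 0.001 are all assumptions, since the text does not state the cutoff value or the internal data structures.

```python
def igsnper_score(allele, reference, snp_freqs, start, cutoff=0.001):
    """Count the uncommon variant nucleotides in an allele.

    allele, reference: ungapped sequences of equal length (an assumption).
    snp_freqs: hypothetical lookup mapping (genomic position, base) to the
        frequency of that base in the amalgamated SNP database.
    start: genomic coordinate of the first base of the sequences.
    cutoff: hypothetical minimal frequency; not specified in the text.
    """
    if len(allele) != len(reference):
        raise ValueError("sketch assumes ungapped sequences of equal length")
    score = 0
    for offset, (ref_base, alt_base) in enumerate(zip(reference, allele)):
        if ref_base == alt_base:
            continue  # no variation at this position
        # A variant missing from the database is treated as frequency 0.
        freq = snp_freqs.get((start + offset, alt_base), 0.0)
        if freq < cutoff:
            score += 1  # uncommon variation contributes one to the score
    return score


reference = "ACGTACGT"
allele = "ATGTACCT"  # differs from the reference at offsets 1 and 6
snp_freqs = {(101, "T"): 0.12}  # only the first variant is a known common SNP
score = igsnper_score(allele, reference, snp_freqs, start=100)
```

In this example the variant at position 101 is supported by a database SNP above the cutoff, whereas the variant at position 106 is absent from the lookup, so the allele receives a score of one.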

The IgSNPer output serves several purposes. First, allelic sequences with a score of zero indicate that the variation is likely present in the human population at a frequency above the cutoff. Second, verified alleles with low IgSNPer scores (usually 1) reveal variations that are present at low frequency among the individuals in the reference set, for example alleles that may be specific to a population group that is not represented in the reference dataset. Finally, high IgSNPer scores are indicative of sequences that have accumulated nucleotide variations that are absent from the amalgamated SNP database. Extreme caution should be taken with such alleles, as a high IgSNPer score indicates the possibility of technical issues, such as truncated or incomplete alleles or sequencing errors. IgSNPer scores for the alleles in the IMGT, AIRR-C and KIARVA databases are available in Corcoran et al. Immunity 2026, Figure 2I and Table S2.