Frequently Asked Questions
How are alleles named within KIARVA?
Nucleotide names
The IGHV alleles are provided in two formats within the KIARVA resource: nucleotide sequences and collapsed amino acid sequences. Nucleotide differences between IGHV alleles can result in synonymous or non-synonymous changes in the coding sequence and are useful for population and genotyping purposes, even when the variation results in functionally equivalent allelic sequences. Allele names are based on the nucleotide variation from the closest allele in the initial reference database used for the ImmuneDiscover process in Corcoran et al. Immunity 2026, namely the AIRR-C database (Collins et al. 2023, release version 9, downloaded on 10/12/2024). Alleles that are identical to AIRR-C sequences have the same names in KIARVA. Variant sequences that differ from AIRR-C reference sequences were given names with a suffix of the form _SXXXX, where the ’X’s are integers provided by the ImmuneDiscover software discovery setting.
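As an illustration of the naming scheme, the following minimal Python sketch splits an allele name into its gene, allele number and optional _SXXXX suffix. The regular expression, the `AlleleName` type and the `parse_allele_name` helper are assumptions made for this example and are not part of KIARVA or ImmuneDiscover.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, based only on the naming described above:
# <gene>*<allele number>, optionally followed by _S<integers>.
ALLELE_RE = re.compile(
    r"^(?P<gene>IGH[VDJ][^*]+)\*(?P<allele>\d+)(?:_S(?P<suffix>\d+))?$"
)


class AlleleName(NamedTuple):
    gene: str               # e.g. "IGHV1-69"
    allele: str             # e.g. "13" (the number of the closest reference allele)
    suffix: Optional[str]   # e.g. "7425", or None for an unmodified reference name

    @property
    def has_variant_suffix(self) -> bool:
        """True if the name carries an _SXXXX suffix."""
        return self.suffix is not None


def parse_allele_name(name: str) -> AlleleName:
    """Split an allele name into gene, allele number and optional suffix."""
    match = ALLELE_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognised allele name: {name!r}")
    return AlleleName(match["gene"], match["allele"], match["suffix"])


reference = parse_allele_name("IGHV1-69*13")
variant = parse_allele_name("IGHV1-69*13_S7425")
print(reference, reference.has_variant_suffix)
print(variant, variant.has_variant_suffix)
```

Under these assumptions, a name without a suffix corresponds to a sequence identical to an AIRR-C entry, while a name with a suffix marks a variant of the closest AIRR-C allele.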
Amino acid-collapsed names
Amino acid (AA) collapsing of IGHV sequences enables allelic variants that differ only by synonymous nucleotide changes to be collapsed into a single coding sequence. This allows IGHV alleles that have the same translated AA sequence, and are therefore functionally equivalent in their naïve unmutated state, to be separated from alleles containing non-synonymous variations, which may have functional effects. AA-collapsed alleles are named after the lowest-numbered allele present within each collapsed set. For example, IGHV1-69*01, IGHV1-69*12, IGHV1-69*13 and IGHV1-69*13_S7425 all encode the same AA sequence, which is hence designated IGHV1-69*01.
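The collapsing principle can be sketched in Python as follows. This is a simplified illustration and not the KIARVA implementation: the sequences are short toy strings rather than real allele sequences, they are assumed to be in frame from the first base, and the `translate`, `allele_sort_key` and `collapse_by_aa` helpers are hypothetical names introduced for this example.

```python
import re
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

# Standard genetic code, built in TCAG order (stop codons shown as '*').
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate an in-frame nucleotide string; trailing partial codons are ignored."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def allele_sort_key(name: str) -> Tuple[int, bool, int]:
    """Order by allele number; unsuffixed names sort before _SXXXX variants."""
    match = re.match(r"^[^*]+\*(\d+)(?:_S(\d+))?$", name)
    if match is None:
        raise ValueError(f"Unrecognised allele name: {name!r}")
    return (int(match.group(1)), match.group(2) is not None, int(match.group(2) or 0))


def collapse_by_aa(alleles: Dict[str, str]) -> Dict[str, List[str]]:
    """Group alleles of the same gene by translated sequence.

    Each group is keyed by its lowest-numbered member.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, seq in alleles.items():
        gene = name.split("*", 1)[0]
        groups[(gene, translate(seq))].append(name)
    return {
        min(names, key=allele_sort_key): sorted(names, key=allele_sort_key)
        for names in groups.values()
    }


# Toy sequences for illustration only; these are NOT the real allele sequences.
toy_alleles = {
    "IGHV1-69*01": "CAGGTGCAGCTG",
    "IGHV1-69*12": "CAGGTCCAGCTG",        # synonymous change (GTG -> GTC)
    "IGHV1-69*13": "CAGGTGCAGCTT",        # synonymous change (CTG -> CTT)
    "IGHV1-69*13_S7425": "CAAGTGCAGCTG",  # synonymous change (CAG -> CAA)
    "IGHV1-69*02": "CAGGTGCAGATG",        # non-synonymous change (CTG -> ATG)
}
print(collapse_by_aa(toy_alleles))
```

With this toy input, the four synonymous variants fall into one set named IGHV1-69*01, while the non-synonymous toy variant forms its own set.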
What does ’DEL’ in the allele dropdown menu mean?
Several common structural variants exist within the human IG locus, including some that result in the segmental loss of one or more genes. These deletions are frequent enough that many individuals are homozygous for them, with the result that the genes located within these segments are absent from the genotypic output for these cases. We have chosen to represent homozygous deletions as ’DEL’ in the frequency plots for IGHV7-4-1, IGHV4-30-2, IGHV4-30-4, IGHV4-31, IGHV3-30-3, IGHV3-33, IGHV3-64D, IGHV5-10-1, IGHV1-69-2, IGHV3-9 and IGHV1-8.
What about duplicated genes?
The ImmuneDiscover technique produces accurate sequence analysis of full-length unrearranged genes. It does not establish physical linkage between genes, so it does not provide phased haplotypes. This means that it does not distinguish variants of different genes that are identical in sequence (e.g. IGHV3-23*01/IGHV3-23D*01 or IGHV1-69*01/IGHV1-69D*01). This issue is limited to very few alleles and gene ’groups’, namely IGHV1-69, IGHV3-23 and IGHV3-30, which are associated with genomic duplications. For KIARVA and the 1KGP dataset, we have chosen to use a single designation to represent allelic sequences that belong to such groups.
Can we identify all IG genes from lymphoblastoid cell line (LCL) samples?
Analysis of the 1KGP samples using ImmuneDiscover revealed several critical points that both enable and limit the use of LCL samples in IG genotyping. The samples are derived from EBV-transformed polyclonal B cells that have undergone different IG rearrangements on one haplotype in each cell. In some cases, the cell line may be oligoclonal and contain cells that have undergone rearrangement on both haplotypes. Our analysis of the frequency of DJ rearrangements, which occur prior to VDJ rearrangements, revealed that DNA from LCL samples of the vast majority of the 1KGP cases contained enough unrearranged template to allow complete IGHV genotyping and calculation of population frequencies for each allelic variant (Corcoran et al. Immunity 2026, Figure S2C-F). However, the frequency of DJ rearrangements means that J and D genotyping is less reliable for 1KGP cases, and we therefore decided not to calculate IGHJ and IGHD allele frequencies from the 1KGP genotyping results.
Is potential SHM present in lymphoblastoid cell lines an issue?
LCL samples contain a mixture of DNA templates that have or have not acquired somatic hypermutation (SHM)-associated mutations within the rearranged V, D and J genes. Thus, care must be taken in the analysis process to handle this. ImmuneDiscover genotyping has two independent approaches to avoid including false-positive, mutated sequences in its genotyping output.
- ImmuneDiscover libraries are produced using primers that target unrearranged V, D and J genes, while rearranged sequences are not targeted and therefore not amplified. This approach excludes the vast majority of SHM-associated variation.
- SHM of unrearranged genes is largely limited to cases where a V or J gene is positioned in the immediate physical vicinity of a rearranged gene. This is particularly the case for J genes if downstream unrearranged genes are present within approximately 3 kb of a rearranged neighboring J gene. In the case of V genes, the distance between the genes is sufficient in most cases to avoid SHM ’spreading’ to unrearranged neighbors. One exception is IGHV2-70/IGHV2-70D, which is located within 9 kb of the frequently rearranged IGHV1-69/IGHV1-69D gene. In the KIARVA analysis of the 1KGP dataset, we therefore required that all IGHV allelic sequences were present in at least 2 cases, with a higher stringency applied for IGHV1-69 (4 cases required) and IGHV2-70/2-70D (10 cases required).
The issue of localized SHM spreading to unrearranged genes is specific to LCL samples and is not an issue when using ImmuneDiscover for IG genotyping of other genomic DNA templates.
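The case-count requirement described above can be expressed as a simple per-gene threshold filter. The Python sketch below is illustrative only: the thresholds are taken from the text, but the function names, the input format (a mapping from allele name to the number of cases carrying it) and the example counts and allele names are assumptions and do not reflect the actual ImmuneDiscover or KIARVA code or data. How IGHV1-69D designations are handled is not stated here, so the sketch only lists the gene names given in the text.

```python
from typing import Dict, List

# Thresholds as stated in the text; the data structure itself is an assumption.
DEFAULT_MIN_CASES = 2
GENE_MIN_CASES = {
    "IGHV1-69": 4,
    "IGHV2-70": 10,
    "IGHV2-70D": 10,
}


def gene_of(allele: str) -> str:
    """Return the gene part of an allele name, e.g. 'IGHV2-70D' for 'IGHV2-70D*04'."""
    return allele.split("*", 1)[0]


def required_cases(allele: str) -> int:
    """Minimum number of cases in which an allele must be observed to be kept."""
    return GENE_MIN_CASES.get(gene_of(allele), DEFAULT_MIN_CASES)


def filter_alleles(case_counts: Dict[str, int]) -> List[str]:
    """Keep alleles observed in at least the required number of cases."""
    return [
        allele
        for allele, n_cases in case_counts.items()
        if n_cases >= required_cases(allele)
    ]


# Hypothetical counts and hypothetical _S names, for illustration only.
example_counts = {
    "IGHV3-23*01": 2,         # kept: meets the default threshold of 2
    "IGHV3-23*01_S0001": 1,   # dropped: seen in a single case
    "IGHV1-69*01_S0002": 3,   # dropped: IGHV1-69 requires 4 cases
    "IGHV1-69*02": 4,         # kept
    "IGHV2-70*01_S0003": 9,   # dropped: IGHV2-70 requires 10 cases
    "IGHV2-70D*04": 10,       # kept
}
print(filter_alleles(example_counts))
```

Keeping the thresholds in a lookup table keyed by gene makes the stricter treatment of the SHM-prone genes explicit while leaving a single default for all other genes.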
How do we know that the alleles are accurate?
Care has been taken to ensure the accuracy of the alleles present within KIARVA, with several steps of cross-validation.
- The accuracy of the ImmuneDiscover procedure was confirmed using a validation cohort of 90 cases, the KI cohort. These 90 samples were genotyped using two independent methods, IgDiscover (using two independent IgM libraries for each case) and ImmuneDiscover, resulting in highly concordant results (Corcoran et al. Immunity 2026, Figure 1E and G).
- We further used the IgSNPer program to screen all identified alleles and confirm the presence of allele-specific SNP variants in an amalgamated cohort of over 1.2 million individuals.
- All IGHV allele variants present in the KIARVA database were confirmed to be present in at least two individuals, with the IGHV1-69 and IGHV2-70 variants (those most susceptible to SHM spreading from neighbouring rearranged genes) requiring presence in four and ten individuals, respectively.
How many IGHV alleles are there yet to identify?
Saturation analysis in the accompanying paper indicates that, for the population groups analyzed, the set of alleles present in at least two individuals is approaching saturation (Corcoran et al. Immunity 2026, Figure 2H). The populations included in the 1KGP set represent the major continental human population groups. However, it is likely that additional populations that are not represented in the 1KGP set, for example individuals from South-East Asia, the Middle East and Oceania, contain variants that are local and even frequent within their own populations.