Germline variant calling

Analysis steps

Preprocessing and variant calling

Bcbio nextgen is used to produce a joint vcf of quality filtered calls across all samples included in the project.

The following steps are run:

Mapping with bwa mem
Duplicate marking with bamsormadup
Base quality recalibration with gatk
Variant calling using gatk HaplotypeCaller and FreeBayes. When multiple samples are present, batch or pooled calling will be used. For WES, variant calling is limited to the intervals specified in the provided bed file. The following regions are excluded in all datasets:

polyx: Regions of single nucleotide stretches > 50. Size of excluded regions is 1.6MB.
lcr: Low complexity regions as defined in Li 2014. Size of excluded regions is 228MB.
highdepth: ENCODE blacklist. This includes repetitive regions and regions of low mappability.
altcontigs: Non-standard chromosomes.

Filtering of vcfs. Bcbio automatically determines if the data are WGS or WES. For WGS, gatk VariantQualityScoreRecalibration will be applied to the calls from gatk HaplotypeCaller. For WES data and for all calls from FreeBayes, cut-off based soft filters will be used. Note that these filters consider the position as a whole, i.e. how confident are we that there is an ALT allele. They do not consider the quality of genotype calls in individual samples.
Create an ensemble vcf containing the union of the calls from gatk HaplotypeCaller and FreeBayes which passed the quality filtering in step 5. Multiallelic variants are split to multiple rows, MNPs are split to multiple rows and indels are normalised as outlined here.

Annotation

Annotation with Ensembl Variant Effect Predictor (VEP). A single consequence per variant is retained, selected following the criteria here.
Import of variants into a gemini database
Write a tab-delimited table of all variants with annotations. The file name ends in .fullVariantTable.txt.

Variant prioritization

Note that no filters are applied to the quality of individual genotype calls in any of the following tables. This can easily be done further downstream using the gt_quals columns.

From the full list of variants, retain only those that change the protein. More precisely, we retain all variants that have a consequence classified as having a medium (MED) or HIGH impact as detailed here. The file name ends in .proteinChanging.txt.
From the full list of variants, retain only those that change the protein and are rare in the general population. The file name ends in proteinChanging.Rare.txt. A variant is considered rare, if the variant meets one of these criteria:

Absent from the considered databases
Frequency of the ALT allele below the user-defined threshold in all considered databases (default 0.05).
Frequency of the ALT allele is above (1 - user-defined threshold) and at least 1 copy of the REF allele is observed (default 0.05). This criterion ensures that we include positions where the REF allele is rare in the general population.

The following databases are considered:

1000 Genomes Phase 3 European populations
Genome Aggregation Database (gnomAD) non-Finnish European population. Includes exome data only.

Columns in variant tables

Header	Description
chrom	The chromosome on which the variant resides
start	The 0-based start position
end	The 1-based end position
ref	Reference allele
alt	Alternative (non-reference) allele
existing_variation	Identifier of variant in public databases e.g. dbSNP, COSMIC (if available)
type	Type of the variant [snp, indel]
sub_type	Subtype of the variant. If type is snp: [ts=transition, tv=transversion]; if type is indel: [ins=insertion, del=deletion]
gene	Corresponding gene name of the transcript with the most severe predicted effect
ensembl_gene_id	Ensembl ID for gene
transcript	The variant transcript with most severe predicted effect
hgvsc	HGVS genomic nomenclature
hgvsp	HGVS protein nomenclature
is_exonic	Does the variant affect an exon of at least 1 transcript? [0=no, 1=yes]
is_coding	Does the variant fall in a coding region (excl. 3’ & 5’ UTRs) of at least 1 transcript? [0=no, 1=yes]
is_lof	Is the variant predicted to cause a loss of function (LOF) in at least 1 transcript? [0=no, 1=yes]
is_splicing	Does the variant affect a canonical or possible splice site? Set to 1 if variant is annotated as any of splice_acceptor_variant, splice_donor_variant, or splice_region_variant.
is_canonical	Set to 1 if the transcript is denoted as the canonical transcript for this gene
codon_change	What is the codon change caused by the variant?
aa_change	What is the amino acid change caused by the variant (for SNPs only)?
biotype	Biotype of the gene (e.g., protein-coding, pseudogene etc)
impact	The consequence of the most severely affected transcript. An overview of all impact categories is available here
impact_severity	Severity of the highest order observed for the variant. For details see here
polyphen_pred	PolyPhen-2 predictions (for SNPs)
polyphen_score	PolyPhen-2 scores (for SNPs)
sift_pred	SIFT predictions (for SNPs)
sift_score	SIFT scores (for SNPs)
af	Global allele frequency of the ALT in 1000 Genomes Phase 3 data
eur_af	Allele frequency of the ALT in 1000 Genomes Phase 3 European populations
gnomad_af	Global allele frequency of the ALT in Genome Aggregation Database (gnomAD). Note that only exome populations are included. [-1 = missing in database]
gnomad_nfe_af	Allele frequency of the ALT in Genome Aggregation Database (gnomAD) Non-Finnish European populations. Note that only exome populations are included.
max_af	Highest allele frequeny observed for the ALT in any population from 1000 Genomes, ESP or gnomAD.
max_af_pops	Population in which the highest ALT allele frequency is observed.
clin_sig	Clinical significance
pubmed	Report Pubmed IDs for publications that cite existing variant
gts.	Genotype observed in the first sample. [./. = missing genotype]. Other samples in consecutive columns.
…
gts.
gt_depths.	Number of reads covering the position in the first sample. [-1 = position not covered]. Other samples in consecutive columns.
…
gt_depths.
gt_quals.	Phred quality score for the genotype in the first sample [-1 = no genotype called]. Other samples in consecutive columns.
…
gt_quals.