Analysis steps


Preprocessing and variant calling

Bcbio nextgen is used to produce a joint vcf of quality filtered calls across all samples included in the project.

The following steps are run:

  1. Mapping with bwa mem

  2. Duplicate marking with bamsormadup

  3. Base quality recalibration with gatk

  4. Variant calling using gatk HaplotypeCaller and FreeBayes. When multiple samples are present, batch or pooled calling will be used. For WES, variant calling is limited to the intervals specified in the provided bed file. The following regions are excluded in all datasets:

  • polyx: Regions of single nucleotide stretches > 50. Size of excluded regions is 1.6MB.
  • lcr: Low complexity regions as defined in Li 2014. Size of excluded regions is 228MB.
  • highdepth: ENCODE blacklist. This includes repetitive regions and regions of low mappability.
  • altcontigs: Non-standard chromosomes.
  1. Filtering of vcfs. Bcbio automatically determines if the data are WGS or WES. For WGS, gatk VariantQualityScoreRecalibration will be applied to the calls from gatk HaplotypeCaller. For WES data and for all calls from FreeBayes, cut-off based soft filters will be used. Note that these filters consider the position as a whole, i.e. how confident are we that there is an ALT allele. They do not consider the quality of genotype calls in individual samples.
  2. Create an ensemble vcf containing the union of the calls from gatk HaplotypeCaller and FreeBayes which passed the quality filtering in step 5. Multiallelic variants are split to multiple rows, MNPs are split to multiple rows and indels are normalised as outlined here.

Annotation

  1. Annotation with Ensembl Variant Effect Predictor (VEP). A single consequence per variant is retained, selected following the criteria here.
  2. Import of variants into a gemini database
  3. Write a tab-delimited table of all variants with annotations. The file name ends in .fullVariantTable.txt.

Variant prioritization

Note that no filters are applied to the quality of individual genotype calls in any of the following tables. This can easily be done further downstream using the gt_quals columns.

  1. From the full list of variants, retain only those that change the protein. More precisely, we retain all variants that have a consequence classified as having a medium (MED) or HIGH impact as detailed here. The file name ends in .proteinChanging.txt.
  2. From the full list of variants, retain only those that change the protein and are rare in the general population. The file name ends in proteinChanging.Rare.txt. A variant is considered rare, if the variant meets one of these criteria:
  • Absent from the considered databases
  • Frequency of the ALT allele below the user-defined threshold in all considered databases (default 0.05).
  • Frequency of the ALT allele is above (1 - user-defined threshold) and at least 1 copy of the REF allele is observed (default 0.05). This criterion ensures that we include positions where the REF allele is rare in the general population.

The following databases are considered:

  • 1000 Genomes Phase 3 European populations
  • Genome Aggregation Database (gnomAD) non-Finnish European population. Includes exome data only.

Columns in variant tables
Header Description
chrom The chromosome on which the variant resides
start The 0-based start position
end The 1-based end position
ref Reference allele
alt Alternative (non-reference) allele
existing_variation Identifier of variant in public databases e.g. dbSNP, COSMIC (if available)
type Type of the variant [snp, indel]
sub_type Subtype of the variant. If type is snp: [ts=transition, tv=transversion]; if type is indel: [ins=insertion, del=deletion]
gene Corresponding gene name of the transcript with the most severe predicted effect
ensembl_gene_id Ensembl ID for gene
transcript The variant transcript with most severe predicted effect
hgvsc HGVS genomic nomenclature
hgvsp HGVS protein nomenclature
is_exonic Does the variant affect an exon of at least 1 transcript? [0=no, 1=yes]
is_coding Does the variant fall in a coding region (excl. 3’ & 5’ UTRs) of at least 1 transcript? [0=no, 1=yes]
is_lof Is the variant predicted to cause a loss of function (LOF) in at least 1 transcript? [0=no, 1=yes]
is_splicing Does the variant affect a canonical or possible splice site? Set to 1 if variant is annotated as any of splice_acceptor_variant, splice_donor_variant, or splice_region_variant.
is_canonical Set to 1 if the transcript is denoted as the canonical transcript for this gene
codon_change What is the codon change caused by the variant?
aa_change What is the amino acid change caused by the variant (for SNPs only)?
biotype Biotype of the gene (e.g., protein-coding, pseudogene etc)
impact The consequence of the most severely affected transcript. An overview of all impact categories is available here
impact_severity Severity of the highest order observed for the variant. For details see here
polyphen_pred PolyPhen-2 predictions (for SNPs)
polyphen_score PolyPhen-2 scores (for SNPs)
sift_pred SIFT predictions (for SNPs)
sift_score SIFT scores (for SNPs)
af Global allele frequency of the ALT in 1000 Genomes Phase 3 data
eur_af Allele frequency of the ALT in 1000 Genomes Phase 3 European populations
gnomad_af Global allele frequency of the ALT in Genome Aggregation Database (gnomAD). Note that only exome populations are included. [-1 = missing in database]
gnomad_nfe_af Allele frequency of the ALT in Genome Aggregation Database (gnomAD) Non-Finnish European populations. Note that only exome populations are included.
max_af Highest allele frequeny observed for the ALT in any population from 1000 Genomes, ESP or gnomAD.
max_af_pops Population in which the highest ALT allele frequency is observed.
clin_sig Clinical significance
pubmed Report Pubmed IDs for publications that cite existing variant
gts. Genotype observed in the first sample. [./. = missing genotype]. Other samples in consecutive columns.
gts.
gt_depths. Number of reads covering the position in the first sample. [-1 = position not covered]. Other samples in consecutive columns.
gt_depths.
gt_quals. Phred quality score for the genotype in the first sample [-1 = no genotype called]. Other samples in consecutive columns.
gt_quals.