bcbio-nextgen is used to produce a joint VCF of quality-filtered calls across all samples included in the project.
The following steps are run:
1. Mapping with bwa mem
2. Duplicate marking with bamsormadup
3. Base quality recalibration with GATK
4. Variant calling with GATK HaplotypeCaller and FreeBayes

When multiple samples are present, batch or pooled calling is used. For WES, variant calling is limited to the intervals specified in the provided BED file. The following regions are excluded in all datasets:
Note that no filters are applied to the quality of individual genotype calls in any of the following tables. This can easily be done further downstream using the gt_quals columns.
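As a minimal sketch of such a downstream genotype filter, the Python below sets a genotype to missing when its quality falls below a threshold. It assumes the table is exported as tab-separated text and that each sample has a `gts.` and a `gt_quals.` column ending in the same sample name; the sample names `s1` and `s2`, the helper `mask_low_quality_genotypes`, and the threshold of 20 are hypothetical and chosen only for illustration.

```python
import csv
import io

MISSING_GT = "./."  # missing-genotype code used in the gts. columns


def mask_low_quality_genotypes(rows, min_gq=20):
    """Yield copies of rows with low-quality genotypes set to missing.

    Assumes (hypothetically) that every gt_quals.<name> column has a
    matching gts.<name> column for the same sample.
    """
    for row in rows:
        masked = dict(row)
        for col, value in row.items():
            if not col.startswith("gt_quals."):
                continue
            sample = col[len("gt_quals."):]
            gt_col = "gts." + sample
            # A quality of -1 means no genotype was called, so it is masked too.
            if gt_col in masked and float(value) < min_gq:
                masked[gt_col] = MISSING_GT
        yield masked


# Small made-up table standing in for a real export.
example = (
    "chrom\tstart\tend\tgts.s1\tgts.s2\tgt_quals.s1\tgt_quals.s2\n"
    "chr1\t99\t100\tA/G\tA/A\t99\t12\n"
    "chr2\t499\t500\tC/T\tC/T\t-1\t45\n"
)
reader = csv.DictReader(io.StringIO(example), delimiter="\t")
masked_rows = list(mask_low_quality_genotypes(reader))

for row in masked_rows:
    print(row["chrom"], row["gts.s1"], row["gts.s2"])
```

Masking rather than dropping rows keeps the variant in the table when at least one sample still has a confident genotype.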
The following databases are considered:
Header | Description |
---|---|
chrom | The chromosome on which the variant resides |
start | The 0-based start position |
end | The 1-based end position |
ref | Reference allele |
alt | Alternative (non-reference) allele |
existing_variation | Identifier of variant in public databases e.g. dbSNP, COSMIC (if available) |
type | Type of the variant [snp, indel] |
sub_type | Subtype of the variant. If type is snp: [ts=transition, tv=transversion]; if type is indel: [ins=insertion, del=deletion] |
gene | Corresponding gene name of the transcript with the most severe predicted effect |
ensembl_gene_id | Ensembl ID for gene |
transcript | The variant transcript with most severe predicted effect |
hgvsc | HGVS coding sequence nomenclature |
hgvsp | HGVS protein nomenclature |
is_exonic | Does the variant affect an exon of at least 1 transcript? [0=no, 1=yes] |
is_coding | Does the variant fall in a coding region (excl. 3’ & 5’ UTRs) of at least 1 transcript? [0=no, 1=yes] |
is_lof | Is the variant predicted to cause a loss of function (LOF) in at least 1 transcript? [0=no, 1=yes] |
is_splicing | Does the variant affect a canonical or possible splice site? Set to 1 if variant is annotated as any of splice_acceptor_variant, splice_donor_variant, or splice_region_variant. |
is_canonical | Set to 1 if the transcript is denoted as the canonical transcript for this gene |
codon_change | What is the codon change caused by the variant? |
aa_change | What is the amino acid change caused by the variant (for SNPs only)? |
biotype | Biotype of the gene (e.g., protein-coding, pseudogene) |
impact | The consequence of the most severely affected transcript. An overview of all impact categories is available here |
impact_severity | Severity of the highest order observed for the variant. For details see here |
polyphen_pred | PolyPhen-2 predictions (for SNPs) |
polyphen_score | PolyPhen-2 scores (for SNPs) |
sift_pred | SIFT predictions (for SNPs) |
sift_score | SIFT scores (for SNPs) |
af | Global allele frequency of the ALT in 1000 Genomes Phase 3 data |
eur_af | Allele frequency of the ALT in 1000 Genomes Phase 3 European populations |
gnomad_af | Global allele frequency of the ALT in Genome Aggregation Database (gnomAD). Note that only exome populations are included. [-1 = missing in database] |
gnomad_nfe_af | Allele frequency of the ALT in Genome Aggregation Database (gnomAD) Non-Finnish European populations. Note that only exome populations are included. |
max_af | Highest allele frequency observed for the ALT in any population from 1000 Genomes, ESP or gnomAD. |
max_af_pops | Population in which the highest ALT allele frequency is observed. |
clin_sig | Clinical significance |
pubmed | PubMed IDs of publications that cite the existing variant |
gts. | Genotype observed in the first sample. [./. = missing genotype]. Other samples in consecutive columns. |
… | |
gts. | |
gt_depths. | Number of reads covering the position in the first sample. [-1 = position not covered]. Other samples in consecutive columns. |
… | |
gt_depths. | |
gt_quals. | Phred quality score for the genotype in the first sample [-1 = no genotype called]. Other samples in consecutive columns. |
… | |
gt_quals. | |
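
The columns above can be combined for simple downstream selection. The following Python sketch is an illustration only: it assumes a tab-separated export, keeps variants flagged `is_lof = 1` that are rare in gnomAD, and converts the 0-based `start` / 1-based `end` pair into a 1-based region string. The helper names `is_rare_lof` and `to_one_based_region`, the 1% frequency cut-off, and the choice to keep variants missing from gnomAD (`gnomad_af = -1`) are assumptions, not part of the pipeline.

```python
import csv
import io


def is_rare_lof(row, max_gnomad_af=0.01):
    """Return True for predicted loss-of-function variants that are rare in gnomAD."""
    if row["is_lof"] != "1":
        return False
    af = float(row["gnomad_af"])
    # -1 marks a variant missing from gnomAD; this sketch treats it as rare.
    return af == -1 or af <= max_gnomad_af


def to_one_based_region(row):
    """Convert the 0-based start and 1-based end into a 1-based, inclusive region."""
    return f'{row["chrom"]}:{int(row["start"]) + 1}-{row["end"]}'


# Small made-up table standing in for a real export.
example = (
    "chrom\tstart\tend\tis_lof\tgnomad_af\n"
    "chr1\t99\t100\t1\t0.0004\n"
    "chr1\t199\t200\t1\t0.25\n"
    "chr2\t499\t502\t1\t-1\n"
    "chr3\t9\t10\t0\t0.0001\n"
)
reader = csv.DictReader(io.StringIO(example), delimiter="\t")
selected = [to_one_based_region(row) for row in reader if is_rare_lof(row)]
print(selected)
```

Whether variants absent from gnomAD should count as rare depends on the analysis; if they should be excluded instead, drop the `af == -1` condition and require `0 <= af <= max_gnomad_af`.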