Preparing GBRS Pseudo-References

Last updated on 2024-09-06 | Edit this page

Estimated time: 75 minutes

Overview

Questions

  • Why do we need to align to a non-reference genome?
  • What is a pseudo-reference genome?
  • How do we create a pseudo-reference genome?

Objectives

  • Construct a pseudo-reference genome for Diversity Outbred mice.

Introduction


Genome-to-Genome Tools, (g2gtools) is a suite of tools which incorporates SNPs and indels from mouse strains into the reference genome. Inserting these variants may lead to changes in the length of the genome and g2gtools is able to map positions between the reference genome and the modified genome.

In this tutorial, we will show you how to use the pre-built nextflow pipeline to build a pseudo-reference on the JAX computing cluster (sumner2). Then, we will show you how each step work. We expect that most users will use the nextflow pipeline and have included the step-by-step pipeline for advanced users who may want to use non-standard pipeline options.

Creating a Pseudo-Reference for Diversity Outbred Mice


The nextflow pipeline developed by Computational Sciences contains default values to make pseudo-reference generation for Diversity Outbred mice easy. You can run the commands with minimal arguments.

There is a Quick Start Guide for JAX users at https://github.com/TheJacksonLaboratory/cs-nf-pipelines/wiki/Pipeline-Documentation#quick-start-for-jax-users. This describes the essentials of running a nextflow pipeline on sumner2 at JAX.

The documentation for the GBRS nextflow pipeline is at https://github.com/TheJacksonLaboratory/cs-nf-pipelines/wiki/Generate-Pseudoreference-Pipeline-ReadMe. This page lists the default values, which point reference files on sumner2 and list the eight Diversity Outbred founder strains by default. Some of the default files are listed in the table below.

File Type Argument Name File Path Description
Reference FASTA –primary_reference_fasta /projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/v105/Mus_musculus.GRCm39.dna.primary_assembly.fa primary reference fasta file for GRCm39
SNP VCF –snp_vcf /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_snps.vcf.gz VCF containing SNPs in mouse strains from the Mouse Genomes Project version 8
Indel VCF –indel_vcf /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_indels.vcf.gz VCF containing Indels in mouse strains from the Mouse Genomes Project version 8
Reference GTF –primary_reference_gtf /projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/ensembl/v105/Mus_musculus.GRCm39.105.gtf Gene annotation for Ensembl 105 in GTF

The default nextflow pipeline also include the DO founder strains as a comma separated list:

Argument Value
–strain 129S1_SvImJ,A_J,CAST_EiJ,NOD_ShiLtJ,NZO_HlLtJ,PWK_PhJ,WSB_EiJ

To run the nextflow pipeline to generate a pseudo-reference for DO mice on sumner2, use the following commands.

First, we load the nextflow module.

module use --append /projects/omics_share/meta/modules
module load nextflow

Next, change into a directory where you will be working. We suggest using /flashscratch and creating a directory for this.

Create a variable for your user ID on sumner2:

USERID=<your User ID>
cd /flashscratch/${USERID}

Then use git to clone the cs-nf-pipelines repository.

git clone https://github.com/TheJacksonLaboratory/cs-nf-pipelines.git

Next, we need to create a variable for the output directory name and create the directory.

OUTPUT_DIR=/flashscratch/${USERID}/gbrs
mkdir -p ${OUTPUT_DIR}

The last step before running the pipeline is to create a temporary working directory. This should be on /flashscratch since many large files are created.

WORKING_DIR=/flashscratch/${USERID}/tmp
mkdir -p ${WORKING_DIR}

There are two types of arguments in the nextflow call below. Arguments to nextflow have a single ‘-’. These are:

Argument Value Description
-profile sumner2 This sets the resource values for the JAX sumner2 computing cluster
-w ${WORKING_DIR} The working directory for temporary files

Arguments which start with ‘–’ go to the nextflow pipeline. These are:

Argument Value Description
–workflow generate_pseudoreference This tells nextflow which workflow to run
–pubdir ${OUTPUT_DIR} The directory where output should be written

With this preparation, we can run the nextflow pipeline.

nextflow cs-nf-pipelines/main.nf \
         -profile sumner2 \
         -w ${WORKING_DIR} \
         --workflow generate_pseudoreference \
         --pubdir ${OUTPUT_DIR}

The entire process should take around 30 minutes. It should produce the following output on the terminal:

OUTPUT

N E X T F L O W  ~  version 22.04.3
Launching `cs-nf-pipelines/main.nf` [gloomy_panini] DSL2 - revision: 6b82c2c015


 _____ _______ _     _     _______ _______     __   _  ______ _______      _____   _____  _______
   |   |_____|  \___/  ___ |       |______ ___ | \  | |  ____ |______ ___ |     | |_____| |______
 __|   |     | _/   \_     |_____  ______|     |  \_| |_____| ______|     |_____| |       ______|


GENERATE PSEUDOREFERENCE PARAMETER LOG

--comment:

Results Published to: /flashscratch/dgatti/gbrs
______________________________________________________
--workflow                      generate_pseudoreference
-w                              /flashscratch/dgatti/tmp
-c                              config/generate_pseudoreference.config
--snp_vcf                       /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_snps.vcf.gz
--indel_vcf                     /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_indels.vcf.gz
--primary_reference_fasta       /projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/v105/Mus_musculus.GRCm39.dna.primary_assembly.fa
--primary_reference_gtf         /projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/ensembl/v105/Mus_musculus.GRCm39.105.gtf
--gtf_biotype_include           protein_coding,lncRNA,IG_C_gene,IG_D_gene,IG_J_gene,IG_LV_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene
--strain                        129S1_SvImJ,A_J,CAST_EiJ,NOD_ShiLtJ,NZO_HlLtJ,PWK_PhJ,WSB_EiJ
--genome_version                39
--append_chromosomes            true
--diploid                       false
--keep_fails                    false
--pass_only                     false
--quality_filter
--region
--bed

--keep_intermediate             false

Project Directory: /gpfs/ctgs0/flashscratch/dgatti/nextflow/cs-nf-pipelines

Command line call:
nextflow cs-nf-pipelines/main.nf -profile sumner2 -w /flashscratch/dgatti/tmp --workflow generate_pseudoreference --pubdir /flashscratch/dgatti/gbrs
______________________________________________________

executor >  slurm (72)
[9b/2185dd] process > GENERATE_PSEUDOREFERENCE:FILTER_GTF (Filtering GTF)    [100%] 1 of 1 ✔
[86/d616d1] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_VCF2VCI (WSB_EiJ)    [100%] 7 of 7 ✔
[4d/bce7df] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_PATCH (WSB_EiJ)      [100%] 7 of 7 ✔
[5c/6c9c51] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_TRANSFORM (WSB_EiJ)  [100%] 7 of 7 ✔
[97/91a924] process > GENERATE_PSEUDOREFERENCE:SAMTOOLS_FAIDX_G2GTOOLS (W... [100%] 7 of 7 ✔
[22/380cb8] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_CONVERT (CAST_EiJ)   [100%] 7 of 7 ✔
[6f/5a91f3] process > GENERATE_PSEUDOREFERENCE:APPEND_DROPPED_CHROMS (CAS... [100%] 7 of 7 ✔
[fa/69434d] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_GTF2DB (CAST_EiJ)    [100%] 7 of 7 ✔
[42/d32cda] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_EXTRACT_GENES (CA... [100%] 7 of 7 ✔
[c5/15243b] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_EXTRACT_TRANSCRIP... [100%] 7 of 7 ✔
[67/e947ad] process > GENERATE_PSEUDOREFERENCE:G2GTOOLS_EXTRACT_EXONS (CA... [100%] 7 of 7 ✔
[54/203383] process > GENERATE_PSEUDOREFERENCE:SAMTOOLS_FAIDX (primary_st... [100%] 1 of 1 ✔
Completed at: 14-Sep-2023 14:18:58
Duration    : 23m 40s
CPU hours   : 14.7
Succeeded   : 72

Let’s look at the files in the output directory.

ls ${OUTPUT_DIR}

You should see the following files and directories:

OUTPUT

g2gtools  generate_pseudoreference_report.html  Mus_musculus.GRCm39.105.filtered.gtf  trace

Look for the reference files in the ‘g2gtools’ directory.

ls ${OUTPUT_DIR}/g2gtools

OUTPUT

129S1_SvImJ.39_DroppedChromAppended.gtf          NOD_ShiLtJ.39.genes.fa
129S1_SvImJ.39.exons.fa                          NOD_ShiLtJ.39.gtf.db
129S1_SvImJ.39.fa                                NOD_ShiLtJ.39.transcripts.fa
129S1_SvImJ.39.fa.fai                            NOD_ShiLtJ.39.vci.gz
129S1_SvImJ.39.genes.fa                          NOD_ShiLtJ.39.vci.gz.tbi
129S1_SvImJ.39.gtf.db                            NZO_HlLtJ.39_DroppedChromAppended.gtf
129S1_SvImJ.39.transcripts.fa                    NZO_HlLtJ.39.exons.fa
129S1_SvImJ.39.vci.gz                            NZO_HlLtJ.39.fa
129S1_SvImJ.39.vci.gz.tbi                        NZO_HlLtJ.39.fa.fai
A_J.39_DroppedChromAppended.gtf                  NZO_HlLtJ.39.genes.fa
A_J.39.exons.fa                                  NZO_HlLtJ.39.gtf.db
A_J.39.fa                                        NZO_HlLtJ.39.transcripts.fa
A_J.39.fa.fai                                    NZO_HlLtJ.39.vci.gz
A_J.39.genes.fa                                  NZO_HlLtJ.39.vci.gz.tbi
A_J.39.gtf.db                                    PWK_PhJ.39_DroppedChromAppended.gtf
A_J.39.transcripts.fa                            PWK_PhJ.39.exons.fa
A_J.39.vci.gz                                    PWK_PhJ.39.fa
A_J.39.vci.gz.tbi                                PWK_PhJ.39.fa.fai
CAST_EiJ.39_DroppedChromAppended.gtf             PWK_PhJ.39.genes.fa
CAST_EiJ.39.exons.fa                             PWK_PhJ.39.gtf.db
CAST_EiJ.39.fa                                   PWK_PhJ.39.transcripts.fa
CAST_EiJ.39.fa.fai                               PWK_PhJ.39.vci.gz
CAST_EiJ.39.genes.fa                             PWK_PhJ.39.vci.gz.tbi
CAST_EiJ.39.gtf.db                               WSB_EiJ.39_DroppedChromAppended.gtf
CAST_EiJ.39.transcripts.fa                       WSB_EiJ.39.exons.fa
CAST_EiJ.39.vci.gz                               WSB_EiJ.39.fa
CAST_EiJ.39.vci.gz.tbi                           WSB_EiJ.39.fa.fai
Mus_musculus.GRCm39.dna.primary_assembly.fa.fai  WSB_EiJ.39.genes.fa
NOD_ShiLtJ.39_DroppedChromAppended.gtf           WSB_EiJ.39.gtf.db
NOD_ShiLtJ.39.exons.fa                           WSB_EiJ.39.transcripts.fa
NOD_ShiLtJ.39.fa                                 WSB_EiJ.39.vci.gz
NOD_ShiLtJ.39.fa.fai                             WSB_EiJ.39.vci.gz.tbi

Creating Pseudo-Reference Genomes and Transcriptomes for Other Strains


If you are working with a cross comprised of other strains, you can change the “–strain” argument.

Challenge 1: Viewing Strain Names in a VCF

There are 50 strains in the VCF SNP file. What are their names? The strain names are stored in the VCF header. The samtools container also includes a tool called tabix, which queries and indexes tab-delimited files. Run the samtools container and look at the tabix help page. Find a command that will print out the SNP VCF header and find the strain names.

singularity run ${SAMTOOLS} tabix --help

Call tabix with the ‘–only-header’ option and pass in the SNP VCF. In this course, we are using long option names. You could also use ‘-H’ to achieve the same result.

$ singularity run ${SAMTOOLS} tabix --only-header ${VCF_SNPS}

OUTPUT

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.13+htslib-1.13
##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500
##reference=Mus_musculus.GRCm39.dna.toplevel.fa.gz
##contig=<ID=1,length=195154279>
##contig=<ID=2,length=181755017>
##contig=<ID=3,length=159745316>
##contig=<ID=4,length=156860686>
##contig=<ID=5,length=151758149>
##contig=<ID=6,length=149588044>
##contig=<ID=7,length=144995196>
##contig=<ID=8,length=130127694>
##contig=<ID=9,length=124359700>
##contig=<ID=10,length=130530862>
##contig=<ID=11,length=121973369>
##contig=<ID=12,length=120092757>
##contig=<ID=13,length=120883175>
##contig=<ID=14,length=125139656>
##contig=<ID=15,length=104073951>
##contig=<ID=16,length=98008968>
##contig=<ID=17,length=95294699>
##contig=<ID=18,length=90720763>
##contig=<ID=19,length=61420004>
##contig=<ID=X,length=169476592>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99; Date=Wed Aug 11 21:20:03 2021
##bcftools_normCommand=norm --fasta-ref Mus_musculus.GRCm39.dna.toplevel.fa.gz -m +indels; Date=Fri Aug 13 11:11:49 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="High confidence (1) or low confidence (0) based on soft filtering values">
##FILTER=<ID=LowQual,Description="Low quality variants">
##VEP="v104" time="2021-08-30 23:27:00" cache="mus_musculus/104_GRCm39" ensembl-funcgen=104.59ae779 ensembl-variation=104.6154f8b ensembl=104.1af1dce ensembl-io=104.1d3bb6e assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -i 'FORMAT/FI[*] = 1' mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -x INFO/VDB,INFO/SGB,INFO/RPBZ,INFO/MQBZ,INFO/MQBZ,INFO/MQSBZ,INFO/BQBZ,INFO/SCBZ,INFO/FS,INFO/MQOF,INFO/AC,INFO/AN,FORMAT/SP,FORMAT/ADF,FORMAT/ADR,FORMAT/GP; Date=Sat Dec 18 19:08:09 2021
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewCommand=view -a -Oz -o final_mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateCommand=annotate -x INFO/MQ0F -Oz -o final_mgp_REL2021_snps.vcf.gz mgp_REL2021_snps.vcf.gz; Date=Mon Dec 20 07:12:23 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  129P2_OlaHsd    129S1_SvImJ    129S5SvEvBrd     A_J     AKR_J   B10.RIII        BALB_cByJ       BALB_cJ BTBR_T+_Itpr3tf_J       BUB_BnJC3H_HeH  C3H_HeJ C57BL_10J       C57BL_10SnJ     C57BL_6NJ       C57BR_cdJ       C57L_J  C58_J   CAST_EiJCBA_J   CE_J    CZECHII_EiJ     DBA_1J  DBA_2J  FVB_NJ  I_LnJ   JF1_MsJ KK_HiJ  LEWES_EiJ       LG_J   LP_J     MAMy_J  MOLF_EiJ        NOD_ShiLtJ      NON_LtJ NZB_B1NJ        NZO_HlLtJ       NZW_LacJ       PL_J     PWK_PhJ QSi3    QSi5    RF_J    RIIIS_J SEA_GnJ SJL_J   SM_J    SPRET_EiJ       ST_bJ   SWR_J  WSB_EiJ  ZALENDE_EiJ

The VCF header contains information about the file format version, chromosomes, metadata, and tools used to create the VCF. The column names are on the last line of the VCF header and this line contains the strain names. If you had a cross comprised of other strains, you would query the VCF to find their names.

Step-by-Step Pseudo-Reference Generation


In general, most users will not use this strategy. We have included this section to document the steps for advanced users who may want to modify specific steps in the pipeline.

The first step in creating the reference files for GBRS is to create a set of genomes and transcriptomes with SNPs and Indels from other strains inserted into the reference genome.

Setup

We will use the g2gtools container stored in the public reference area on sumner2.

G2GTOOLS=/projects/omics_share/meta/containers/quay.io-jaxcompsci-g2gtools-74926ad.img

We will also use a tool called samtools to query and modify sequence and annotation files.

SAMTOOLS=/projects/omics_share/meta/containers/quay.io-biocontainers-samtools-1.14--hb421002_0.img

We need two input files and we will produce one output file per strain.

The first file is the reference genome in FASTA format.

REF_FASTA=/projects/omics_share/mouse/GRCm39/genome/sequence/ensembl/v105/Mus_musculus.GRCm39.dna.primary_assembly.fa

The second file is a VCF containing the SNPs of one or more strains. This file should contain the strain(s) that you have in your mouse cross.

VCF_INDELS=/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_indels.vcf.gz
VCF_SNPS=/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_snps.vcf.gz

Next, we need to pass in the name of the strain for which we are creating the VCI file. In this case, we will use DBA/2J. The strain name must exactly match the strain name in the VCF.

STRAIN=DBA_2J

This Challenge is used in above as well in “Creating Pseudo-Reference Genomes and Transcriptomes for Other Strains”.

Challenge 2: Viewing Strain Names in a VCF

There are 50 strains in the VCF SNP file. What are their names? The strain names are stored in the VCF header. The samtools container also includes a tool called tabix, which queries and indexes tab-delimited files. Run the samtools container and look at the tabix help page. Find a command that will print out the SNP VCF header and find the strain names.

singularity run ${SAMTOOLS} tabix --help

Call tabix with the ‘–only-header’ option and pass in the SNP VCF. In this course, we are using long option names. You could also use ‘-H’ to achieve the same result.

$ singularity run ${SAMTOOLS} tabix --only-header ${VCF_SNPS}

OUTPUT

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.13+htslib-1.13
##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500
##reference=Mus_musculus.GRCm39.dna.toplevel.fa.gz
##contig=<ID=1,length=195154279>
##contig=<ID=2,length=181755017>
##contig=<ID=3,length=159745316>
##contig=<ID=4,length=156860686>
##contig=<ID=5,length=151758149>
##contig=<ID=6,length=149588044>
##contig=<ID=7,length=144995196>
##contig=<ID=8,length=130127694>
##contig=<ID=9,length=124359700>
##contig=<ID=10,length=130530862>
##contig=<ID=11,length=121973369>
##contig=<ID=12,length=120092757>
##contig=<ID=13,length=120883175>
##contig=<ID=14,length=125139656>
##contig=<ID=15,length=104073951>
##contig=<ID=16,length=98008968>
##contig=<ID=17,length=95294699>
##contig=<ID=18,length=90720763>
##contig=<ID=19,length=61420004>
##contig=<ID=X,length=169476592>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99; Date=Wed Aug 11 21:20:03 2021
##bcftools_normCommand=norm --fasta-ref Mus_musculus.GRCm39.dna.toplevel.fa.gz -m +indels; Date=Fri Aug 13 11:11:49 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="High confidence (1) or low confidence (0) based on soft filtering values">
##FILTER=<ID=LowQual,Description="Low quality variants">
##VEP="v104" time="2021-08-30 23:27:00" cache="mus_musculus/104_GRCm39" ensembl-funcgen=104.59ae779 ensembl-variation=104.6154f8b ensembl=104.1af1dce ensembl-io=104.1d3bb6e assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -i 'FORMAT/FI[*] = 1' mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -x INFO/VDB,INFO/SGB,INFO/RPBZ,INFO/MQBZ,INFO/MQBZ,INFO/MQSBZ,INFO/BQBZ,INFO/SCBZ,INFO/FS,INFO/MQOF,INFO/AC,INFO/AN,FORMAT/SP,FORMAT/ADF,FORMAT/ADR,FORMAT/GP; Date=Sat Dec 18 19:08:09 2021
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewCommand=view -a -Oz -o final_mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateCommand=annotate -x INFO/MQ0F -Oz -o final_mgp_REL2021_snps.vcf.gz mgp_REL2021_snps.vcf.gz; Date=Mon Dec 20 07:12:23 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  129P2_OlaHsd    129S1_SvImJ    129S5SvEvBrd     A_J     AKR_J   B10.RIII        BALB_cByJ       BALB_cJ BTBR_T+_Itpr3tf_J       BUB_BnJC3H_HeH  C3H_HeJ C57BL_10J       C57BL_10SnJ     C57BL_6NJ       C57BR_cdJ       C57L_J  C58_J   CAST_EiJCBA_J   CE_J    CZECHII_EiJ     DBA_1J  DBA_2J  FVB_NJ  I_LnJ   JF1_MsJ KK_HiJ  LEWES_EiJ       LG_J   LP_J     MAMy_J  MOLF_EiJ        NOD_ShiLtJ      NON_LtJ NZB_B1NJ        NZO_HlLtJ       NZW_LacJ       PL_J     PWK_PhJ QSi3    QSi5    RF_J    RIIIS_J SEA_GnJ SJL_J   SM_J    SPRET_EiJ       ST_bJ   SWR_J  WSB_EiJ  ZALENDE_EiJ

The VCF header contains information about the file format version, chromosomes, metadata, and tools used to create the VCF. The column names are on the last line of the VCF header and this line contains the strain names. If you had a cross comprised of other strains, you would query the VCF to find their names.

We will work on the sumner2 /flashscratch area. Create a directory with your user ID. Then create a directory called ‘gbrs’. For example:

mkdir -p /flashscratch/dgatti/gbrs/${STRAIN}

Then change into the directory that you just created.

cd /flashscratch/dgatti/gbrs

Load in the Singularity software.

module load singularity

Use g2gtools vcf2vci to Gather Strain-Specific SNPs and Indels

In the first step, we insert indels from a specific strain into the reference genome using the vcf2vci command. This is written out to a text file in a format called “VCI”.

The arguments are:

$ g2gtools vcf2vci --help
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --vcf       -i      FILE     VCF files can seperate files by "," or have multiple -i [default: None]  │
│                                 [required]                                                               │
│ *  --fasta     -f      FILE     Fasta file matching VCF information [default: None] [required]           │
│ *  --strain    -s      TEXT     Name of strain/sample (column in VCF file) [default: None] [required]    │
│    --output    -o      FILE     Name of output file [default: None]                                      │
│    --diploid   -d               Create diploid VCI file                                                  │
│    --keep                       Keep track of VCF lines that could not be converted to VCI file          │
│    --pass                       Use only VCF lines that have a PASS for the filter value                 │
│    --quality                    Filter on quality, FI=PASS                                               │
│    --no-bgzip                   DO NOT compress and index output                                         │
│    --verbose   -v      INTEGER  specify multiple times for more verbose output [default: 0]              │
│    --help                       Show this message and exit.                                              │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

We will use a subset of the arguments, passing in the reference FASTA file, the SNPD and indel VCFs, the strain name to search for in the indel VCF, and the output file path.

Inputs | Argument | Description | |———-|————-| | –fasta | path to the GRCm39 reference FASTA file | | –vcf | path(s) to the Sanger VCFs for SNPs and indels. Both can be passed in. |

Output | Argument | Description | |———-|————-| | –output | A VCI file, which is a custom file format, akin to a BED format, which lists the positions and sequences of indels. Note that, while we list the output file extension as “vci”, g2gtools will gzip and index the file, so we will need to add the ‘.gz’ extension when we use the file in downstream commands. |

STRAIN_VCI=${STRAIN}/REF-to-${STRAIN}.vci
singularity run ${G2GTOOLS} g2gtools vcf2vci \
                            --fasta ${REF_FASTA} \
                            --vcf ${VCF_INDELS} \
                            --vcf ${VCF_SNPS} \
                            --strain ${STRAIN}  \
                            --output ${STRAIN_VCI}

Use g2gtools patch to Insert Strain-Specific SNPs into Reference

Next, we insert SNPs from a specific strain into the reference genome using the ‘patch’ command.

The arguments are:

$ g2gtools patch --help
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input    -i      FILE     Fasta file to extract from [default: None] [required]             │
│ *  --vci      -c      FILE     VCI File to use [default: None] [required]                        │
│    --output   -o      FILE     Name of output file [default: None]                               │
│    --bed      -b      FILE     BED file name [default: None]                                     │
│    --region   -r      TEXT     Region to extract in chromosome:start-end format [default: None]  │
│    --reverse                   Reverse the direction of VCI file                                 │
│    --bgzip                     compress and index output                                         │
│    --verbose  -v      INTEGER  specify multiple times for more verbose output [default: 0]       │
│    --help                      Show this message and exit.                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯

We will use the following arguments:

Inputs | Argument | Description | |———-|————-| | –input | path to the GRCm39 reference FASTA file | | –vci | path to the gzipped VCI file created by vcf2vci. |

Output | Argument | Description | |———-|————-| | –output | path to patched FASTA file with SNPs inserted into the reference genome sequence. |

PATCHED_FASTA=${STRAIN}/${STRAIN}.patched.fa
singularity run ${G2GTOOLS} g2gtools patch \
                            --input ${REF_FASTA} \
                            --vci ${STRAIN_VCI}.gz \
                            --output ${PATCHED_FASTA}

Use g2gtools transform to Insert Strain-Specific Indels into Strain FASTA

The ‘transform’ function takes the strain-specific SNP-patched FASTA file, inserts indels from the VCI file, and outputs a FASTA file. This FASTA file contains the inferred sequence of a specific strain.

Next, we insert SNPs from a specific strain into the reference genome using the ‘patch’ command.

The arguments are:

$ g2gtools transform --help
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input    -i      FILE     Fasta file to extract from [default: None] [required]             │
│ *  --vci      -c      FILE     VCI File to use [default: None] [required]                        │
│    --output   -o      FILE     Name of output file [default: None]                               │
│    --bed      -b      FILE     BED file name [default: None]                                     │
│    --region   -r      TEXT     Region to extract in chromosome:start-end format [default: None]  │
│    --reverse                   Reverse the direction of VCI file                                 │
│    --bgzip                     compress and index output                                         │
│    --verbose  -v      INTEGER  specify multiple times for more verbose output [default: 0]       │
│    --help                      Show this message and exit.                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯

We will use the following arguments:

Inputs | Argument | Description | |———-|————-| | –input | path to the strain-specific patched FASTA file created by patch. | | –vci | path to the gzipped VCI file created by vcf2vci. |

Output | Argument | Description | |———-|————-| | –output | path to transformed FASTA file with SNPs and indels inserted into the reference genome sequence. |

We need to create a genome version variable to create informative file names.

GENOME_VERSION=GRCm39

Next, we will run the transform command.

STRAIN_FASTA=${STRAIN}/${STRAIN}.${GENOME_VERSION}.fa
singularity run ${G2GTOOLS} g2gtools transform \
                            --input ${PATCHED_FASTA} \
                            --vci ${STRAIN_VCI}.gz \
                            --output ${STRAIN_FASTA}

We now have a pseudo-reference genome in the strain-specific FASTA file.

Use samtools faidx to Index the Strain-Specific FASTA file

This is a step which uses the samtools suite of tools to index the FASTA file.

Remember that we created a variable for the samtools Singularity container at the top of the lesson.

Then we will use samtools to index the FASTA file.

singularity run ${SAMTOOLS} samtools faidx ${STRAIN_FASTA}

Use g2gtools convert to Create a Strain-Specific GTF File

The convert function takes the strain-specific FASTA file, the VCI file, and the reference GTF and creates a strain-specific annotation file in GTF.

The arguments are:

$ g2gtools convert --help
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input-file   -i      FILE                   Input file to convert to new coordinates [default: None] [required]  │
│ *  --vci-file     -c      FILE                   VCI file [default: None] [required]                                  │
│ *  --file-format  -f      [BAM|SAM|GFF|GTF|BED]  Input file format [default: None] [required]                         │
│    --output       -o      FILE                   Name of output file [default: None]                                  │
│    --reverse      -r                             Reverse the direction of the conversion                              │
│    --verbose      -v      INTEGER                specify multiple times for more verbose output [default: 0]          │
│    --help                                        Show this message and exit.                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

We will use the following arguments:

Inputs | Argument | Description | |———-|————-| | –input-file | path to the reference GTF. | | –vci-file | path to the gzipped VCI file created by vcf2vci. | | –file-format | string indicating a GTF file. Lower case since it will be used as a file name extension. |

Output | Argument | Description | |———-|————-| | –output | path to converted GTF file with SNPs and indels inserted into the reference transcript and coordinates shifted to the strain-specific coordinates. |

Next, we need the path to the reference GTF. We have selected the Ensembl 105 annotation for this lesson.

REF_GTF=/projects/omics_share/mouse/GRCm39/transcriptome/annotation/ensembl/v105/Mus_musculus.GRCm39.105.chr.gtf

We also need to specify the annotation file format in lower case because it will be used as a file name extension.

ANNOT_FORMAT=gtf
STRAIN_GTF=${STRAIN}/${STRAIN}.${GENOME_VERSION}.${ANNOT_FORMAT}
singularity run ${G2GTOOLS} g2gtools convert \
                            --input-file ${REF_GTF} \
                            --vci-file ${STRAIN_VCI}.gz \
                            --file-format ${ANNOT_FORMAT} \
                            --output ${STRAIN_GTF}

We now have two key files that are used by GBRS:

  • the strain-specific FASTA file, which contains the strain’s inferred sequence, and
  • the strain-specific GTF file, which contains the strain’s inferred annotation.

Challenge 3: Create a set of reference files for another strain in the VCF.

You will need to look back at Challenge 1 to see how to list the strains in the VCFs. You can query either the SNP or Indel VCF. Create an entire script on your compute instance to run the commands above

OUTPUT

[1] "This new lesson looks good"
g2gtools transform -i ${STRAIN}/${STRAIN}.patched.fa -c ${STRAIN}/REF-to-${STRAIN}.chain -o ${STRAIN}/${STRAIN}.fa
g2gtools convert   -c ${STRAIN}/REF-to-${STRAIN}.chain -i ${GTF} -f gtf -o ${STRAIN}/${STRAIN}.gtf

g2gtools gtf2db                -i ${STRAIN}/${STRAIN}.gtf -o ${STRAIN}/${STRAIN}.gtf.db
g2gtools extract --transcripts -i ${STRAIN}/${STRAIN}.fa -db ${STRAIN}/${STRAIN}.gtf.db > ${STRAIN}/${STRAIN}.transcripts.fa
g2gtools extract --genes       -i ${STRAIN}/${STRAIN}.fa -db ${STRAIN}/${STRAIN}.gtf.db > ${STRAIN}/${STRAIN}.genes.fa
g2gtools extract --exons       -i ${STRAIN}/${STRAIN}.fa -db ${STRAIN}/${STRAIN}.gtf.db > ${STRAIN}/${STRAIN}.exons.fa
 g2gtools --help

 Usage: g2gtools [OPTIONS] COMMAND [ARGS]...

 g2gtools

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --version                                                                                                 │
│ --install-completion        [bash|zsh|fish|powershell|pwsh]  Install completion for the specified shell.  │
│                                                              [default: None]                              │
│ --show-completion           [bash|zsh|fish|powershell|pwsh]  Show completion for the specified shell, to  │
│                                                              copy it or customize the installation.       │
│                                                              [default: None]                              │
│ --help                                                       Show this message and exit.                  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────╮
│ convert           Convert coordinates of BAM|SAM|GTF|GFF|BED files                                        │
│ extract           Extract subsequence from a fasta file given a region                                    │
│ fasta-format      Reformat a Fasta file                                                                   │
│ gtf2db            Convert a GTF file to a G2G DB file                                                     │
│ parse-region      Parse a region from the command line.  Useful for verifying regions.                    │
│ patch             Patch SNPs onto the reference sequence                                                  │
│ transform         Incorporate indels onto the input sequence                                              │
│ vcf2vci           Create VCI file from VCF file(s)                                                        │
│ vci-query         Query a VCI file.                                                                       │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

For each strain the following is produced:

A_J.39.exons.fa
A_J.39.fa
A_J.39.fa.fai
A_J.39.genes.fa
A_J.39.gtf
A_J.39.gtf.db
A_J.39.gtf.unmapped
A_J.39.patched.fa
A_J.39.patched.fa.fai
A_J.39.transcripts.fa
A_J.39.vci.gz
A_J.39.vci.gz.tbi

Key Points

  • Use the default nextflow pipeline if working with DO mice on sumner2.
  • You can query the VCFs and get the strain names using tabix.
  • You can specify another set of strains and use the default reference genome.
  • Advanced users can run the pipeline line-by-line and specify different options.