Prepare EMASE Reference Files

Last updated on 2024-03-28

Estimated time: 12 minutes

Overview

Questions

What does the Prepare EMASE process do?
What input files are required to prepare the EMASE reference?
What output files are created by the Prepare EMASE process?

Objectives

Explain what tasks Prepare EMASE performs.
Understand the input files needed to prepare an EMASE reference.
Understand the output files produced by the Prepare EMASE process.

Introduction

Expectation Maximization for Allele-Specific Expression (EMASE) is software that estimates allele-specific transcript expression in genetically diverse populations. It was designed for genetically diverse mice whose genomes are composed of segments from two or more inbred strains. EMASE uses strain-specific genomes and GTF files to create strain-specific transcriptomes.

Before running EMASE, we must prepare the reference files which EMASE needs. In the “prepare-pseudo-reference” process, we used the reference genome and strain-specific VCFs to create strain-specific genomes and GTF files, which contain strain-specific sequences and transcript annotation. The Prepare EMASE process will take each founder strain genome and its corresponding GTF file and will create a bowtie index and a pooled transcript list.

Input Files

Prepare EMASE requires strain-specific genome sequences and transcript annotation files. Genome sequences are stored in FASTA files and transcript annotation is stored in Gene Transfer Format (GTF) files.

On sumner2, the GRCm39 strain-specific genome FASTA files are in: /projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8. These files contain the genome sequence of each strain with the strain-specific SNPs and Indels inserted. These were created by the “prepare-pseudo-reference” process.

The corresponding GTFs, which were also created by the “prepare-pseudo-reference” process, are in: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/imputed/rel_2112_v8

We will create two variables for these paths:

GENOME_REF_DIR=/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8

GTF_REF_DIR=/projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/imputed/rel_2112_v8

The Prepare EMASE process requires the following files:

File Type	Argument Name	File Path	Description
Genome FASTA files	--genome_file_list	`${GENOME_REF_DIR}/A_J.39.fa,${GENOME_REF_DIR}/C57BL_6J.39.fa,${GENOME_REF_DIR}/129S1_SvImJ.39.fa,${GENOME_REF_DIR}/NOD_ShiLtJ.39.fa,${GENOME_REF_DIR}/NZO_HlLtJ.39.fa,${GENOME_REF_DIR}/CAST_EiJ.39.fa,${GENOME_REF_DIR}/PWK_PhJ.39.fa,${GENOME_REF_DIR}/WSB_EiJ.39.fa`	Comma-separated list of FASTA files for each input strain
Transcript GTF files	--gtf_file_list	`${GTF_REF_DIR}/A_J.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/Mus_musculus.GRCm39.105.filtered.gtf,${GTF_REF_DIR}/129S1_SvImJ.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/NOD_ShiLtJ.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/NZO_HlLtJ.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/CAST_EiJ.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/PWK_PhJ.39_DroppedChromAppended.gtf,${GTF_REF_DIR}/WSB_EiJ.39_DroppedChromAppended.gtf`	Comma-separated list of GTF files for each input strain
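
Typing these long comma-separated lists by hand is error-prone, so one option is to build them in a loop. The following is a minimal sketch, assuming bash (it uses arrays) and the file naming shown in the table above. The `STRAINS` array and the loop are illustrative and are not part of the pipeline itself. Note that the C57BL/6J entry uses the reference GTF rather than a `_DroppedChromAppended` file.

```shell
GENOME_REF_DIR=/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8
GTF_REF_DIR=/projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/imputed/rel_2112_v8

# Strain names in the same order as the table above (illustrative variable).
STRAINS=(A_J C57BL_6J 129S1_SvImJ NOD_ShiLtJ NZO_HlLtJ CAST_EiJ PWK_PhJ WSB_EiJ)

genome_file_list=""
gtf_file_list=""
for strain in "${STRAINS[@]}"; do
    genome_file_list+="${GENOME_REF_DIR}/${strain}.39.fa,"
    if [ "${strain}" = "C57BL_6J" ]; then
        # The reference strain uses the filtered Ensembl GTF.
        gtf_file_list+="${GTF_REF_DIR}/Mus_musculus.GRCm39.105.filtered.gtf,"
    else
        gtf_file_list+="${GTF_REF_DIR}/${strain}.39_DroppedChromAppended.gtf,"
    fi
done

# Strip the trailing comma left by the loop.
genome_file_list="${genome_file_list%,}"
gtf_file_list="${gtf_file_list%,}"

echo "${genome_file_list}"
echo "${gtf_file_list}"
```

Because both lists are generated from the same array, the genome and GTF entries stay in the same strain order.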

We also need to provide a set of letters that EMASE will use as short abbreviations for each strain. These abbreviations are used to annotate transcripts which come from each founder strain. Note that the order of the strains in the arguments above must match the order of the strain letters below.

Argument Name	Value	Description
--haplotype_list	A,B,C,D,E,F,G,H	Comma-separated list of letters to be used as an abbreviation for each strain
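
Since the three lists must line up entry by entry, a quick sanity check before running the pipeline can catch a dropped or extra item. The sketch below assumes bash. The helper names `count_items` and `check_list_lengths` are hypothetical and are not part of EMASE or the pipeline. It only compares the number of entries, so it cannot detect strains listed in a different order.

```shell
# Count the entries in a comma-separated string.
count_items() {
    local -a items
    IFS=',' read -ra items <<< "$1"
    echo "${#items[@]}"
}

# Succeed only when the genome, GTF, and haplotype lists have equal length.
check_list_lengths() {
    local n_genome n_gtf n_hap
    n_genome=$(count_items "$1")
    n_gtf=$(count_items "$2")
    n_hap=$(count_items "$3")
    if [ "${n_genome}" -ne "${n_gtf}" ] || [ "${n_genome}" -ne "${n_hap}" ]; then
        echo "Length mismatch: ${n_genome} genomes, ${n_gtf} GTFs, ${n_hap} haplotype letters" >&2
        return 1
    fi
    echo "All lists contain ${n_genome} entries"
}

# Placeholder values for illustration only.
check_list_lengths "a.fa,b.fa" "a.gtf,b.gtf" "A,B"
```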

Running Prepare EMASE using Nextflow

The JAX CS group has a sample script to run the Prepare EMASE process. The full script is listed below.

#!/bin/bash
#SBATCH --mail-user=first.last@jax.org
#SBATCH --job-name=gbrs_mouse
#SBATCH --mail-type=END,FAIL
#SBATCH -p compute
#SBATCH -q batch
#SBATCH -t 72:00:00
#SBATCH --mem=1G
#SBATCH --ntasks=1

cd $SLURM_SUBMIT_DIR

# LOAD NEXTFLOW
module use --append /projects/omics_share/meta/modules
module load nextflow

# RUN PIPELINE
nextflow run TheJacksonLaboratory/cs-nf-pipelines \
-profile sumner \
--workflow prepare_emase \
--pubdir "/flashscratch/${USER}/outputDir" \
-w /flashscratch/${USER}/outputDir/work \
--genome_file_list "/path/to/genome/A.fa,/path/to/genome/B.fa,..." \
--gtf_file_list "/path/to/gtf/A.gtf,/path/to/gtf/B.gtf,..." \
--haplotype_list "A,B,..." \
--comment "This script will run prepare_emase to generate multiway references based on default parameters"
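
A job that waits in the queue only to fail on a mistyped path wastes time, so it can help to confirm that every listed file is readable before submitting. The following is a minimal sketch, assuming bash. The function name `check_files_exist` and the script name in the usage comment are hypothetical.

```shell
# Report every path in a comma-separated list that is not a readable file.
# Returns non-zero if at least one path is missing or unreadable.
check_files_exist() {
    local -a paths
    local missing=0 path
    IFS=',' read -ra paths <<< "$1"
    for path in "${paths[@]}"; do
        if [ ! -r "${path}" ]; then
            echo "Missing or unreadable: ${path}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example usage (variable and script names are assumptions):
# check_files_exist "${genome_file_list}" \
#     && check_files_exist "${gtf_file_list}" \
#     && sbatch run_prepare_emase.sh
```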

Running Prepare EMASE

Below is an example of the Prepare EMASE command using these arguments:

emase prepare --genome-file ${genome_file_list} \
              --gtf-file ${gtf_file_list} \
              --haplotype-char ${haplotype_list} \
              --out_dir ./ \
              --save-g2tmap \
              --no-bowtie-index
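
When a command spans many lines, a single missing backslash silently splits it in two. One way to avoid this, sketched here under the assumption that bash is the shell, is to hold the command in an array and print it for inspection before running it. The list values below are placeholders, and only flags shown in the command above are used.

```shell
# Placeholder values for illustration; substitute the real lists.
genome_file_list="A.fa,B.fa"
gtf_file_list="A.gtf,B.gtf"
haplotype_list="A,B"

# Each array element is one argument, so no line-continuation backslashes are needed.
cmd=(emase prepare
     --genome-file "${genome_file_list}"
     --gtf-file "${gtf_file_list}"
     --haplotype-char "${haplotype_list}"
     --out_dir ./
     --save-g2tmap
     --no-bowtie-index)

# Dry run: print the assembled command without executing it.
printf '%s ' "${cmd[@]}"
printf '\n'
```

Once the printed command looks right, it can be executed with `"${cmd[@]}"`.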

The command above uses only a subset of the arguments that emase prepare accepts. The full list of arguments is:

╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --genome-file      -G      FILE     Genome files, can seperate files by "," or have multiple -G      │
│                                     [default: None] [required]                                       │
│ --haplotype-char   -s      TEXT     haplotype, either one per -h option, i.e. -h A -h B -h C, or a   │
│                                     shortcut -h A,B,C [default: None]                                │
│ --gtf-file         -g      FILE     Gene Annotation File files, can seperate files by "," or have    │
│                                     multiple -G [default: None]                                      │
│ --out_dir          -o      TEXT     Output folder to store results (default: the current working     │
│                                     directory) [default: None]                                       │
│ --save-g2tmap      -m               saves gene id to transcript id list in a tab-delimited text file │
│ --save-dbs         -d               save dbs                                                         │
│ --no-bowtie-index  -m               skips building bowtie index                                      │
│ --verbose          -v      INTEGER  specify multiple times for more verbose output [default: 0]      │
│ --help                              Show this message and exit.                                      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯

In the Nextflow pipeline, this command is wrapped by the PREPARE_EMASE workflow, which then builds the bowtie index from the pooled transcript FASTA and cleans the transcript lists:

workflow PREPARE_EMASE {
    // Prepare emase reference, given list of genomes and gtf files.
    EMASE_PREPARE_EMASE()
    BOWTIE_BUILD(EMASE_PREPARE_EMASE.out.pooled_transcript_fasta, 'bowtie.transcripts')
    // clean transcript lists to add transcripts absent from certain haplotypes.
    CLEAN_TRANSCRIPT_LISTS(EMASE_PREPARE_EMASE.out.pooled_transcript_info)
}