RNA Seq Part 1

#install.packages("devtools")
library(devtools)

Loading required package: usethis

#install_github("danmaclean/rbioinfcookbook")
library(rbioinfcookbook)

Installing Packages

installing packages needed to do analysis and installing a package from the R Bioinformatics Cookbook. Then installing other packages to use to analyze data. installing forcats and then use Bioconductor as a package installer like a pipefitter.

#install.packages("forcats")
#if (!require("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
#BiocManager::install("edgeR")
#if (!require("BiocManager", quietly = TRUE))
  #install.packages("BiocManager")
#BiocManager::install("Biobase")
#library(forcats)
#library(edgeR)

#install.packages("forcats")
#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
#BiocManager::install("edgeR")
#if (!require("BiocManager", quietly = TRUE))
  #install.packages("BiocManager")
#BiocManager::install("Biobase")
library(forcats)
library(edgeR)

Loading required package: limma

library(Biobase)

Loading required package: BiocGenerics


Attaching package: 'BiocGenerics'

The following object is masked from 'package:limma':

    plotMA

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

edgeR Estimate Differential Expression (EDE)

Used edgeR means that we can see how two or more samples differ in what genes are being expressed. Going to use the sample data set from cookbook modencodefly. Other way to put it is use trimmed mean of M-values (TMM) and the generalized linear model (GLM) to estimate differential expression.

TMM is using the mean of log-transformed gene expression values to help scale data and account for differences in data collection between samples. Most likely will see this in the sample data as the size of the library or the amount of data collected for each sample?
GLM gen form of linear model that works well for count data, going to be used to compare differences between treatments or the type of samples in the sample data set modencodefly?

Using edgeR from a count table

Start by loading a transcript count table, meaning raw data that has the RNA transcripts (tcs) recorded in the samples and and a count for each tcs. First step is loading data set from cookbook and dataframe below has a dataset converted to a matrix readable to R.

Dataframe is the RNA tcs counts from the Drosophila (dp) experiment.

genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes

Then we have a data table with 14,869 observations (these are the different gene transcripts) and the 147 variables are the different fly samples.

The first line is saying that we want to focus on the two experiments about the larvae and their larval stage. Ran into error code (Error in h(simpleError(msg, call)) : error in evaluating the argument ‘x’ in selecting a method for function ‘which’: object ‘experiments_of_interest’ not found)

Third line sets state as the grouping factor or categorical variable used to compare between samples.

experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which(pheno_data[['stage']] %in% experiments_of_interest)
grouping <- pheno_data[["stage"]] [columns_of_interest] |> forcats::as_factor()

Next chunk is subseting data to pull out solumns interested in. Going to form subset of count_matrix data that only includes interest, counts_of_interest.

counts_of_interest <- count_matrix[,counts = columns_of_interest]

To assemble all our data into one object use the DGElist function in edgeR to assemble into one object. Makes it easier to perform analysis.

count_dge <- edgeR::DGEList(counts = counts_of_interest, group = grouping)

Doing the count_dge allows us to begin our differential expression analysis. Going to use five functions from edgeR package.

design <- model.matrix(~grouping)
eset_dge <- edgeR::estimateDisp(count_dge, design)
fit <- edgeR::glmQLFit(eset_dge, design)
result <- edgeR::glmQLFTest(fit, coef=2)
topTags(result)

Coefficient:  groupingL2Larvae 
               logFC    logCPM        F       PValue          FDR
FBgn0027527 6.318665 11.148756 42854.72 1.132951e-41 1.684584e-37
FBgn0037424 6.417770  9.715826 33791.15 2.152507e-40 1.518091e-36
FBgn0037430 6.557774  9.109132 32483.00 3.510727e-40 1.518091e-36
FBgn0037414 6.337846 10.704514 32088.92 4.083908e-40 1.518091e-36
FBgn0029807 6.334590  9.008720 27648.19 2.585312e-39 7.688200e-36
FBgn0037224 7.055635  9.195077 24593.62 1.102456e-38 2.732070e-35
FBgn0037429 6.623619  8.525136 24122.44 1.400960e-38 2.975840e-35
FBgn0030340 6.176390  8.500866 23111.55 2.380783e-38 4.424983e-35
FBgn0029716 5.167089  8.977840 22556.01 3.218122e-38 5.316695e-35
FBgn0243586 6.966873  7.769756 21465.47 5.945208e-38 8.839930e-35

Interpreting the data

Column 1 shows the gene tcs of interest
logFC = log Fold Change meaning a measure of the differences between the two samples, basic average of expression
logCPM = log Counts per Million, (similar (sim) to the average (av) expression value of gene across samples) accounts for large number of gene tcs.
Column 4, 5, 6 are the results of statistical test.
1. Column 6, FDR (false detection rate) is most relevant because it is a p-value that has been corrected for the number of tests we’re conducting, but makes more sense as adjusted to decrease the probability of false detection.
FDR can see that a number of gene tcs are differentially expressed genes between two treatments

Doing some Preliminary Research

Looking up gene names on the flybase database.

FBgn0000022, it makes sense that this gene would be differentially expressed between instar 1 and instar 2 as the nervous system grows and changes in the instar step.

FBgn0027527, this gene is differential expressed between instar 1 and 2 because it seems like if there is one copy or three copies the dp will die. (Flies with either one copy or three copies of Tpl die as late embryos or early first instar larvae. (Adapted from FBrf0173211 and FBrf0217655).)

FBgn0037424, protein coding gene, and is differentially expressed because of the morphological changes between instar 1 and 2, it is a prevalent gene in many significant structures.

Summary:

Learned how to use edgeR to analyze r-seq data and to compare RNA-seq data from two treatments of dp. I was able to use the final differential expression analysis and see the genes with the highest measure of differences between the two samples. Using flybase was interesting and gave further info on what each gene did and why or why not it was differentially regulated between each instar.