TRNReg: Building the Gene Regulatory Network Using Deep Learning
From September 2023 to May 2024, I conducted research in Professor Manolis Kellis’ lab at MIT as an undergraduate research scholar in MIT’s Advanced Undergraduate Research Opportunities Program (SuperUROP). There, I developed a Transformer-based deep-learning approach to infer gene regulatory networks.
For the SuperUROP program, I presented my research at the Fall Showcase (link to poster here) and detailed my results in a research report (link to report here). Below is a summary of my research.
Background & Motivation
Gene regulatory networks (GRNs) are complex biological networks that govern the extent to which a target gene is expressed as protein. Abstractly, a GRN can be visualized as a graph in which nodes are either (1) target genes or (2) regulatory elements, DNA regions that upregulate or downregulate the expression of a target gene. In this graph, every edge between a regulatory element E and a target gene G represents their interaction and can be characterized in one of two ways: either ‘E upregulates G’ or ‘E downregulates G’. Comprehensively identifying the interactions within GRNs is critical to understanding disease, since most disease-related mutations occur in regulatory elements and other non-coding DNA regions, causing target genes to be expressed at incorrect levels.
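This graph view can be sketched with a minimal signed edge list. The element and gene names below are hypothetical placeholders, not real annotations:

```python
# Minimal sketch of a GRN as a signed, directed graph.
# Each edge maps (regulatory_element, target_gene) to a regulation sign:
# +1 means 'E upregulates G', -1 means 'E downregulates G'.
# All names here are hypothetical placeholders.
grn_edges = {
    ("enhancer_chr1_1500", "GENE_A"): +1,   # upregulation
    ("silencer_chr1_9000", "GENE_A"): -1,   # downregulation
    ("enhancer_chr2_4200", "GENE_B"): +1,
}

def regulators_of(gene, edges):
    """Return the regulatory elements acting on `gene`, with their signs."""
    return {elem: sign for (elem, g), sign in edges.items() if g == gene}

print(regulators_of("GENE_A", grn_edges))
# {'enhancer_chr1_1500': 1, 'silencer_chr1_9000': -1}
```

Inferring a GRN then amounts to deciding, for each candidate element–gene pair, whether such an edge exists and with which sign.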
Objective & Approach
In this work, I developed TRNReg, a transformer-based method that links regulatory elements with target genes. TRNReg improves the prediction accuracy of an existing state-of-the-art method, GraphReg, by replacing GraphReg’s convolutional neural network (CNN) with a transformer, an increasingly popular deep learning architecture. The primary advantage of a transformer is that, unlike a CNN, which learns relationships between a genomic region and its direct neighbors, a transformer learns relationships between a region and many other regions upstream and downstream. This matters because the biological behavior of a region can be influenced by other regions that are not proximal to it, so the model learns a more robust representation of each genomic region.
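The global-context advantage can be illustrated with a bare-bones single-head self-attention computation in NumPy. This is a generic sketch, not TRNReg’s actual implementation; the sequence length and embedding size are arbitrary, and the query/key/value projections are taken to be the identity for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sequence: 10 genomic regions, each with a 4-dimensional embedding.
n_regions, dim = 10, 4
x = rng.normal(size=(n_regions, dim))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product self-attention with identity projections: every
# region's query is compared against every region's key, so the attention
# weights form a full n_regions x n_regions matrix.
scores = x @ x.T / np.sqrt(dim)
weights = softmax(scores, axis=-1)
out = weights @ x

# Unlike a CNN kernel, which mixes only a fixed local window, every output
# row here is a weighted combination of ALL regions, near or far.
assert weights.shape == (n_regions, n_regions)
assert np.all(weights > 0)                    # no region is out of reach
assert np.allclose(weights.sum(axis=1), 1.0)  # each row is a distribution
```

The strictly positive, full attention matrix is exactly the point: a distal region can contribute to any other region’s representation in a single layer, whereas a convolution would need many stacked layers to reach the same span.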
Methods
TRNReg takes as input epigenomic data that quantitatively describes the DNA structure and conformation of a particular genomic region, and predicts that region’s gene expression as output. The input data is forwarded to a transformer encoder consisting of multiple encoder layers, each with self-attention, layer normalization, and a feed-forward network. The encoder learns context-dependent representations of genomic regions and passes these representations to subsequent steps in the pipeline, ultimately yielding an output describing the region’s gene expression. The model was trained on genomic regions across eighteen chromosomes, validated on two chromosomes, and tested on two chromosomes in a human cell line. Performance was evaluated by computing the Pearson correlation coefficient between the predicted and ground-truth outputs. Because the original training data (approximately 1,000 regions across 18 chromosomes) was not sufficient to improve training accuracy, several approaches were explored to artificially augment the size of the training set.
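The evaluation step can be sketched as follows. The arrays are synthetic stand-ins for the model’s predicted and measured expression values, not real data from the experiments:

```python
import numpy as np

def pearson_r(pred, truth):
    """Pearson correlation coefficient between predicted and ground-truth expression."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.corrcoef(pred, truth)[0, 1])

# Synthetic stand-ins: ground-truth expression for a handful of test
# regions, and a prediction that tracks it with some noise.
truth = np.array([0.5, 1.2, 3.1, 0.2, 2.4, 1.8])
pred  = np.array([0.6, 1.0, 2.8, 0.4, 2.5, 1.6])

r = pearson_r(pred, truth)
print(f"Pearson r = {r:.3f}")
assert r > 0.9  # a well-fit model should correlate strongly with the truth
```

The same statistic computed on held-out chromosomes is what allows a fair head-to-head comparison between TRNReg and GraphReg.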
Results
I demonstrated that TRNReg surpasses GraphReg in prediction accuracy: TRNReg achieved a 21% higher Pearson correlation coefficient during training and a 1.5% higher correlation during testing. The biological interactions that TRNReg can discover serve as a stepping stone for advancing the current knowledge of gene regulation.