Our research lies at the intersection of machine learning and genomics. A genome is the totality of an organism’s DNA and effectively the “instructions” or software that control the development of an organism. One of the central goals of biology is to link changes in physical characteristics (phenotypes) of organisms to causative changes in the DNA. Genomic data, much like compiled software, are not easy for humans to interpret directly. Just as a decompiler can help us reverse engineer compiled software, machine learning can be used to identify, characterize, and annotate features of genomes and make analysis by humans easier. We aim to aid biologists in uncovering new knowledge and insights into the function of genomes by developing new methods that draw on machine learning, data science, and big data techniques.

We focus primarily on insect genomes, which present unique and interesting challenges for computational biologists. More importantly, many insects vector diseases or are important to food security (as either pollinators or pests). Through their saliva, mosquitoes such as Anopheles gambiae and Aedes aegypti transmit the parasites and viruses that cause malaria, dengue fever, and Zika, diseases that are major threats to public health. By partnering with biologists who study these insects, we can ensure that our work is useful and solves relevant problems.

Sequencing for Regulatory Genomics

In addition to genes, genomes contain regulatory elements that are involved in the mechanical process of gene expression. Regulatory elements can be divided into trans- and cis-acting categories. So-called DNA enrichment assays such as ChIP-Seq, ATAC-Seq, STARR-Seq, and FAIRE-Seq allow us to identify and characterize the activity of cis-regulatory elements such as promoters and enhancers.

My lab works on regulatory genomics from two angles. First, we are collaborating with biologists on several projects to analyze DNA enrichment assay data in order to annotate and characterize cis-regulatory elements in particular organisms. Second, we are applying new machine learning methods to create more sensitive and specific peak calling (filtering) techniques for DNA enrichment assay data.

Posters:

  • CR Beal, JG Peters, and RJ Nowling. Sequence Model Evaluation Framework for STARR-Seq Peak Calling. Poster presentation at the 12th Annual ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 2021, Virtual.
  • RJ Nowling, RR Geromel, and BS Halligan. Filtering STARR-Seq Peaks for Enhancers with Sequence Models. Poster presentation at the 11th Annual ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, September 2020, Virtual.
  • RJ Nowling, CR Beal, SJ Emrich, SK Behura, MS Halfon, and M Duman-Scheel. PeakMatcher: Matching Peaks Across Genome Assemblies. Poster presentation at the 11th Annual ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, September 2020, Virtual.

Detecting Chromosomal Inversions

Chromosomal inversions play an important role in ecological adaptation by enabling the accumulation of beneficial alleles (Love, et al. 2016; Fuller, et al. 2017) and reproductive isolation (Noor, et al. 2001). The 2La inversion in the Anopheles gambiae complex has been associated with thermal tolerance of larvae (Rocca, et al. 2009), enhanced desiccation resistance in adult mosquitoes (Gray, et al. 2009), and susceptibility to malaria (Riehle, et al. 2017). Additionally, inversions must be identified and accounted for to avoid bias in population inference and association testing (Seich al Basatena, et al. 2013).

We are working on methods for detecting inversions from population genomics data. Methods commonly used in humans rely on aligning reads to a reference genome. Even with long reads, such methods are less effective for insects, whose genomes are often full of problematic repetitive elements and whose assemblies may be fragmented. We have found that PCA-based methods can be quite effective for insects. Our eventual goal is to detect inversions even when the reference genome is highly fragmented. My lab maintains and develops Asaph, an open-source software package for SNP-based inversion detection.
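
To illustrate why PCA can reveal an inversion, here is a minimal sketch on simulated data. It is not Asaph's implementation: the function names (`simulate_genotypes`, `first_pc_scores`), the noise rate, and the sample sizes are all hypothetical, and the first principal component is found with plain power iteration to keep the example free of external libraries. The idea is that SNPs inside an inversion tend to track the inversion karyotype (0, 1, or 2 inverted copies), so samples may fall into three clusters along the first component.

```python
import math
import random


def simulate_genotypes(karyotypes, n_snps, rng, noise=0.05):
    """Simulate 0/1/2 genotypes for SNPs inside a hypothetical inversion.

    Each genotype copies the sample's karyotype (number of inverted
    chromosome copies), except that a fraction `noise` of calls is random.
    """
    return [
        [rng.choice((0, 1, 2)) if rng.random() < noise else k
         for _ in range(n_snps)]
        for k in karyotypes
    ]


def first_pc_scores(genotypes, rng, n_iter=50):
    """Project samples onto the first principal component.

    Uses power iteration on X^T X, where X is the column-centered
    genotype matrix (rows are samples, columns are SNPs).
    """
    n = len(genotypes)
    means = [sum(col) / n for col in zip(*genotypes)]
    X = [[g - m for g, m in zip(row, means)] for row in genotypes]

    v = [rng.random() for _ in means]  # random starting direction
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, X))           # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:
            raise ValueError("genotype matrix has no variance")
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in X]


if __name__ == "__main__":
    rng = random.Random(0)
    karyotypes = [0] * 10 + [1] * 10 + [2] * 10
    scores = first_pc_scores(simulate_genotypes(karyotypes, 50, rng), rng)
    # Print the range of PC1 scores for each simulated karyotype.
    for k in (0, 1, 2):
        group = [s for s, kk in zip(scores, karyotypes) if kk == k]
        print(k, round(min(group), 2), round(max(group), 2))
```

The sign of a principal component is arbitrary, so only the ordering and separation of the three groups is meaningful, not which group lands on the positive side. Real data would also include SNPs outside the inversion and population structure unrelated to it, which this sketch ignores.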

Likelihood-Ratio Tests Adjusted for Missing Data

Single-SNP association tests are a popular and powerful statistical technique for identifying genomic variants that are associated with population structure. Recently, I proposed an adjusted likelihood-ratio test that handles unknown (missing) genotypes by using an uninformative (uniform) prior over all possible genotypes. I demonstrated that this approach significantly reduces false positives when compared with more commonly used techniques such as \(F_{ST}\) and can uncover variation missed by other methods.
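
The general idea can be sketched with a small example. This is one possible reading, not necessarily the published method: it uses a likelihood-ratio (G) test of independence between population label and genotype for two populations, and it assumes that a missing genotype is spread as one third of a count over each of the three genotypes, which is how the uniform prior is interpreted here. The function names `genotype_table` and `g_test` are hypothetical, and the chi-square p-value is only an approximation when counts are fractional.

```python
import math

GENOTYPES = (0, 1, 2)  # copies of the alternate allele


def genotype_table(genotypes, labels, adjust_missing=True):
    """Tabulate population-by-genotype counts; None marks a missing call.

    With adjust_missing=True, each missing call adds 1/3 to every genotype
    (uniform prior). Otherwise missing calls are simply dropped.
    """
    table = {pop: [0.0, 0.0, 0.0] for pop in sorted(set(labels))}
    for g, pop in zip(genotypes, labels):
        if g is None:
            if adjust_missing:
                for k in GENOTYPES:
                    table[pop][k] += 1.0 / 3.0
        else:
            table[pop][g] += 1.0
    return table


def g_test(table):
    """Likelihood-ratio (G) test of independence for two populations.

    Returns (G statistic, degrees of freedom, approximate p-value).
    """
    pops = list(table)
    if len(pops) != 2:
        raise ValueError("this sketch only handles two populations")
    # Ignore genotypes that were never observed in either population.
    cols = [k for k in GENOTYPES if sum(table[p][k] for p in pops) > 0]
    row_tot = {p: sum(table[p][k] for k in cols) for p in pops}
    col_tot = {k: sum(table[p][k] for p in pops) for k in cols}
    total = sum(row_tot.values())

    g = 0.0
    for p in pops:
        for k in cols:
            observed = table[p][k]
            if observed > 0:
                expected = row_tot[p] * col_tot[k] / total
                g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error

    df = len(cols) - 1
    if df <= 0:
        p_value = 1.0
    elif df == 1:
        p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square tail, 1 df
    else:
        p_value = math.exp(-g / 2.0)             # chi-square tail, 2 df
    return g, df, p_value
```

For example, `g_test(genotype_table([0, 0, None, 2, 2, None], ["A", "A", "A", "B", "B", "B"]))` scores a SNP in which each population has one missing call. Under this reading, missing calls pull the two populations' genotype distributions toward each other, so a SNP with many missing calls yields a smaller test statistic than the same SNP fully observed.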

Transmembrane Receptors

Transmembrane receptors are proteins that sit in the cell membrane and are often activated or deactivated by the binding of a ligand to the extracellular side. Examples include G protein-coupled receptors (GPCRs), which transfer signals by releasing G proteins inside the cell, and ligand-gated ion channels, which act as valves for the transfer of ions such as potassium and chloride into the cell. GPCRs are a common target of drugs, including beta blockers (hypertension) and antihistamines (allergies).

As part of a project to develop novel insecticides, I worked on developing a classifier for GPCR protein sequences to aid in identifying and annotating GPCRs in insect genomes. Using an ensemble approach, I adapted several classifiers that had been trained primarily on GPCRs from humans and other model organisms.
