Dr. Huo of the University of Florida recently published a preprint on a new methodology for meta-analysis of genomic data:

Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals

He explains the benefits of his method over existing methods and what data you need, and points to some code to get you started! Check out the article here and the code here.



What inspired you to want to solve this problem (i.e., how did you first become aware of the problem and come across the solution in your Bayesian approach)?


My research focuses mainly on genomic data integration, including genomic meta-analysis. However, some of the traditional frequentist methods for genomic meta-analysis suffer from a non-complementary null and alternative space for decision making, which may inflate the type I error rate or the false discovery rate. A Bayesian approach offers a natural way to address this issue and to obtain correct false discovery rate control.


What is the benefit of Bayesian modeling compared with the traditional, frequentist approach to this type of problem?


The answer to this question may be a little technical. The benefits are twofold. First, under several hypothesis testing settings in genomic meta-analysis, traditional (frequentist) approaches have a non-complementary null and alternative space, which may inflate the type I error rate or the false discovery rate. A Bayesian approach offers a natural way to address this issue and to obtain correct false discovery rate control. Second, the biomarkers detected in a genomic meta-analysis show distinct differential expression patterns across studies. Our approach is among the first to capture these differential meta-patterns, which are informative for guiding further biological investigation.


What situations would this method be useful for? Are there any situations that are not appropriate to use this method?


There are two situations in which I would recommend that biologists try this method. The first is when the user wants to perform a genomic meta-analysis. The second is when the user wants to categorize the detected biomarkers according to their differential expression patterns across studies. There are also scenarios where the method is not appropriate. Because it is designed for genome-wide meta-analysis, it is not suitable for single-study or single-marker analysis.


What type or amount of data would you need to perform this analysis on a system?


Since the method is designed for genome-wide meta-analysis, genome-wide profiles of transcriptomic, epigenomic or metabolomic data from multiple studies are suitable. The specific input should be a p-value matrix (from the differential expression analysis of each individual study), with each row representing a gene and each column representing a study. Modeling through p-values is flexible enough to accommodate different platforms (RNA-seq and microarray) and potential study batch effects.
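To make the input format concrete, here is a minimal sketch of assembling a gene-by-study p-value matrix. It is written in Python purely for illustration and is not part of the BayesMP R package. The study names, gene names and z statistics are made-up toy values, and the use of two-sided normal p-values is an assumption; check the package documentation for the exact kind of p-value it expects.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-study z statistics keyed by gene (toy values, not real data).
z_by_study = {
    "study_A": {"GENE1": 3.2, "GENE2": 0.4, "GENE3": -2.5},
    "study_B": {"GENE1": 2.9, "GENE2": -0.1, "GENE3": 0.3},
    "study_C": {"GENE1": 3.5, "GENE2": 0.7, "GENE3": -2.8},
}

# Columns are studies; rows are the genes measured in every study.
studies = sorted(z_by_study)
genes = sorted(set.intersection(*(set(z_by_study[s]) for s in studies)))

# p_matrix[i][j] is the p-value of gene i in study j.
p_matrix = [[two_sided_p(z_by_study[s][g]) for s in studies] for g in genes]

for gene, row in zip(genes, p_matrix):
    print(gene, ["%.4f" % p for p in row])
```

In this toy matrix, GENE1 has small p-values in all three studies while GENE3 has small p-values in only two of them, which is the kind of study-specific pattern the method aims to separate.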


Are you planning to write a vignette for your BayesMP R package? Are there any facets of the R package that may be tricky? In what form should the data be to use in your BayesMP R package?


Yes, these are very good suggestions. The BayesMP R package is currently hosted on GitHub, with an example of how to run it, but a vignette would be very beneficial to potential users. The paper is currently under minor revision, and I will prepare a detailed vignette soon (maybe when the paper is accepted). I don't expect other facets of the R package to be tricky. The input data should be a p-value matrix. All of this will be introduced in detail in the vignette. I think this interview is definitely useful for me as well, to improve the package for a broader audience.


What was the most difficult part of completing this project?


Bayesian methods rely on many iterations of Markov chain Monte Carlo (MCMC), and R is slow at running loops, so the computing burden turned out to be a major concern in this project. We solved this issue by incorporating C++ in R, which improved the computational efficiency dramatically.


Do you have any advice for biologists looking to improve their computational skill set?


Since the majority of statistical software is available in R (which is free), I would suggest that biologists pick up some R knowledge through free online resources. After that, I recommend a problem-driven learning approach that relies on Google. For example, if a biologist wants to perform some task (e.g. run a t.test or draw a heat map), just Google it and some sample code will pop up. I don't have formal training in R. I picked up my R by relying heavily on Google.

Example of heatmaps created using R.

Do you have any questions for Dr. Huo about his work? Comment below!