Kelsey and Marko (co-first authors) and Roger (corresponding author) at Emory University recently published an article in Plant Cell that used high-throughput techniques (INTACT and ATAC-seq) to identify conserved transcriptional regulatory elements in the roots of four angiosperms. In addition to addressing a compelling biological question, Kelsey, Marko, and their colleagues had to cope with all of the traditional Big Data problems. We asked them for more insight into what motivated their work and how they went about acquiring, transforming, and sharing their data.
What spurred this question?
Kelsey (KM): Our lab has been interested in the similarities between root epidermal hair and non-hair cells, two cell types that share a close developmental lineage but display radically different morphologies in their mature forms. By coupling ATAC-seq with the cellular precision that INTACT provides, we were able to investigate the subtleties of the cis-regulatory patterns that drive the differentiation of these cell types.
Marko (MB): Our comparisons across species stem from a collaboration with several labs interested in plant development and stress responses, specifically those of Julia Bailey-Serres (UC Riverside), Siobhan Brady (UC Davis), and Neelima Sinha (UC Davis). The collaboration was established so that we could compare different crop species under control and stress conditions. We wanted to quantify regulation of chromatin, transcription, and translation under normal conditions and submergence stress. ATAC-seq was a novel technique when we were optimizing ChIP-seq experiments in the different crop species, so we decided to try it. We quickly realized the benefit of ATAC-seq: it is a fast and easy protocol to perform. We decided to examine chromatin accessibility in the different plant species because we could isolate nuclei from the same location in each species, and by comparing chromatin accessibility gain a better understanding of how chromatin regulation is conserved between species and how much it varies.
Roger (RD): This all started out as three separate projects, the first being a technical optimization of ATAC-seq using crude or INTACT-purified nuclei. Once we had ATAC-seq worked out, we decided to apply it to questions about transcriptional regulation across species and cell types, as Kelsey and Marko mentioned. In the end, we decided to roll all of this work into a single paper.
How did you decide upon the species you used in your analyses? Was it something to do with genome quality, resources available, etc.?
MB: Arabidopsis was chosen because of the plethora of resources available for it: it has a small, well-annotated genome, and our lab does a lot of work with this model organism. Medicago, rice, and tomato were chosen because of their agricultural importance and the experience of our collaborators with these plants. Importantly, these four species vary greatly in genome size and intergenic space, and we wanted to understand how such genomic differences might manifest in differences in transcriptional regulation.
For compact genomes like Arabidopsis, how do you determine to which gene a THS (transposase hypersensitive site) corresponds?
KM: This is a great question. Unlike many highly annotated animal genomes, there is very little information on regulatory elements in plant species and their potential target genes. This makes it very challenging to assign potential regulatory regions – areas rich in THSs – to target genes. In the absence of this information, we chose to assign each THS to the gene with the closest TSS, as previous studies have shown that regulatory elements preferentially influence their most proximal genes. While this is an imperfect system, it provides us with a reasonable starting point from which to launch our investigation.
MB: We used the PeakAnnotator program to assign each THS to the nearest TSS. PeakAnnotator identifies only one nearest TSS per THS, and we recognize that this is a limitation given that a single THS could regulate more than one gene. To improve the resolution, our pipeline has since been updated to split THSs into smaller sub-THSs using the “regionRes” option of the HOMER peak-finding program. This typically identifies several smaller regions within a larger THS, which can then be individually assigned to their nearest TSSs.
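To make the nearest-TSS assignment concrete, here is a minimal Python sketch of the idea. This is not the actual PeakAnnotator or HOMER code, and the coordinates and gene names are hypothetical; it simply illustrates the distance minimization being described:

```python
# Toy nearest-TSS assignment: each THS is mapped to the gene whose
# transcription start site (TSS) lies closest to the THS midpoint.
# All coordinates and gene IDs below are illustrative, not real data.

def ths_midpoint(start, end):
    """Midpoint of a THS interval."""
    return (start + end) // 2

def assign_nearest_tss(ths_list, tss_list):
    """Assign each THS to the gene with the nearest TSS.

    ths_list: list of (chrom, start, end) tuples
    tss_list: list of (chrom, tss_position, gene_id) tuples
    Returns a list of ((chrom, start, end), gene_id, distance).
    """
    assignments = []
    for chrom, start, end in ths_list:
        mid = ths_midpoint(start, end)
        # Only TSSs on the same chromosome are candidates.
        candidates = [(abs(pos - mid), gene)
                      for c, pos, gene in tss_list if c == chrom]
        if not candidates:
            continue  # THS on a scaffold with no annotated genes
        dist, gene = min(candidates)
        assignments.append(((chrom, start, end), gene, dist))
    return assignments

# Example: a THS at Chr1:100-300 (midpoint 200) is assigned to GeneA,
# whose TSS at position 150 is closer than GeneB's at 900.
hits = assign_nearest_tss(
    [("Chr1", 100, 300)],
    [("Chr1", 150, "GeneA"), ("Chr1", 900, "GeneB")],
)
```

Real tools also account for strand, overlapping features, and ties, but the core logic is this per-chromosome distance minimization.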
RD: In the future, we would like to use high resolution chromosome conformation capture methods to more definitively assign THSs to target promoters by examining physical contacts between them.
What are your preferred ways of storing and sharing the data generated, particularly across so many groups?
MB: We’ve mainly analyzed and stored raw data on our local computers and have used Box to share smaller files between the labs. We’ve often used CyVerse to store and share larger files.
RD: Globus is another platform we’ve used for transferring large files, and this works quite well too.
Was there any particular computer language or tool that you found most useful to transform your data to work with different analysis programs?
KM: I wrote a Python script to automate the early plug-and-chug stages of our pipeline, and that turned out to be a huge timesaver for me. Being able to go from raw .fastq files to normalized .bw files with only a few keystrokes not only spared me the tedium of sitting at the computer for hours on end but also made troubleshooting much more straightforward once our code was standardized across all our files. While many of the software tools available now don’t require knowledge of any particular programming language – a development that I think is hugely important in making these tools more accessible to bench scientists – I do think it’s worth the time investment to become at least passingly familiar with one or two languages of your choice. This will not only help you understand the tools you’re using more fully, but will greatly reduce the amount of time you spend each day making sure that all your different programs play nicely with each other.
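As an illustration of the kind of automation Kelsey describes, here is a hedged Python sketch that assembles the shell commands needed to take one paired-end sample from raw reads to a normalized bigWig track. The tools invoked (bowtie2, samtools, and deepTools’ bamCoverage) are real, but the file-naming scheme, index path, and normalization choice are illustrative assumptions, not the pipeline actually used in the paper:

```python
# Hypothetical sketch of an ATAC-seq automation script: build the shell
# commands for one sample (raw FASTQ -> sorted BAM -> normalized bigWig).
# File names and parameters are assumptions for illustration only.
import subprocess

def build_pipeline(sample, genome_index):
    """Return the list of commands for one paired-end sample."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    bw = f"{sample}.bw"
    return [
        # Align paired-end reads to the reference genome.
        ["bowtie2", "-x", genome_index,
         "-1", f"{sample}_R1.fastq.gz", "-2", f"{sample}_R2.fastq.gz",
         "-S", sam],
        # Coordinate-sort and index the alignments.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Convert to a depth-normalized bigWig coverage track.
        ["bamCoverage", "-b", bam, "-o", bw,
         "--normalizeUsing", "RPKM"],
    ]

def run_pipeline(sample, genome_index):
    """Execute each step in order, stopping on the first failure."""
    for cmd in build_pipeline(sample, genome_index):
        subprocess.run(cmd, check=True)
```

Separating command construction (`build_pipeline`) from execution (`run_pipeline`) makes it easy to print and inspect the commands as a dry run while troubleshooting, which is much of where the timesaving comes from.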
You used a variety of tools to analyze your data. Did you have trouble using any of them, and if so, did you contact the creators of the program, collaborate with someone who knew the tool already, or figure it out by some trial and error type method?
KM: I think much of our process for this manuscript was trial and error. This was my first experience generating and analyzing data on this scale, and the largest challenge at the beginning of the project was simply gaining familiarity with the tools that are available. We spent a lot of time exploring the different software programs on our own, and then would meet back periodically to share what insights we had gained. It was an exciting time, being able to watch the libraries that I had made at my bench turn into real, meaningful data right before our eyes – there’s nothing quite like that thrill.
MB: When we first began calling THSs, we tested HOMER, MACS, and HOTSPOT. HOTSPOT proved the most difficult to use; we only got it working after reaching out to the creators of the program. With all the other programs, there were always certain issues that we encountered along the way. Our first approach to fixing any issue was to dig through the options in the help output and make sure everything was input correctly. If that failed, we went online and searched for pages or message boards where a similar issue had been addressed. Finally, there were species-specific issues that needed to be fixed, again either by checking the options or looking online. One such issue was calling THSs in Medicago, which has around two thousand unplaced genomic scaffolds and requires an additional option when creating a tag directory in HOMER; otherwise peak calling fails.
RD: Reaching out to the authors of published tools has been a huge source of help for those problems that we really got stuck on. It’s wonderful to see how responsive and helpful the bioinformatics community is.
What was the most difficult/challenging part of completing this project?
KM: One of the biggest challenges with this project was finding a way to compare chromatin accessibility data across species in a meaningful way. Because plant genomes have such large gene families, it is no simple matter to find an orthologous gene between two species, let alone across four! At first, we decided to examine a subset of ~370 syntenic orthologs, each one meticulously identified and vetted by our collaborators (particularly Maggie Woodhouse), but soon found that these revealed no obvious patterns in upstream chromatin accessibility across species. Since our gene-focused approach wasn’t working, we then decided to go a level deeper, and look into the accessible sites themselves: specifically, what recurring motifs did we find across these sites, and how did that compare across species? Focusing on motif analysis became a hugely fruitful line of investigation, and we feel that we have only barely begun to scratch the surface in using this approach to reveal regulatory patterns in both development and speciation.
RD: Certainly, trying to identify the “same” gene across four different species was a challenge, particularly in the sense that so few can be reliably identified in all the species. Another big challenge was in the fine details of analyzing the ATAC-seq data in each of these species. For example, there were issues of which genome build and annotation to use for each, and how to deal with unplaced contigs in the tomato and Medicago genomes. Differences in the depth of functional annotation for these genomes also caused some headaches in trying to compare particular gene sets across species.
Do you have any advice or insight for graduate students, post-docs, or budding computational biologists looking to know more about computation?
KM: My advice for trainees who are just starting out in the computational realm, as I was at the beginning of this project, is to not be afraid to get your hands dirty and play around with different software tools. Be sure to get some sort of structured introduction to the basics first – I tried a couple of online courses as well as a class offered at my university – but after that, try some things out and have a good time. Become familiar with the manual pages of your favorite software tools, and find a reliable forum (BioStars in particular was a big help to me), because you will certainly run into roadblocks along the way. But as long as you are eager, determined, and willing to experiment, you can teach yourself how to do almost anything, including computational analysis.
MB: There are several pieces of advice that I would give:
1) Always keep a log of the commands you run and the pipeline changes you make.
2) Name your files so that they can be easily identified later.
3) Use the help option of each program and read about all the available options and commands.
4) Ask for help from others, whether they are at your university or somewhere far away; also ask which options they used and why they chose them.
5) If you do not have access to servers or clusters, you can combine mobility with performance by remotely accessing lab computers from your desktop or phone. Chrome Remote Desktop is a free and easy-to-use app that makes this possible.
6) Have someone else check your results along the way.
RD: Read as much as you can, talk to experts as much as possible, and just practice and experiment with different tools. A book that I found incredibly helpful in getting started was “Practical Computing for Biologists”. I would recommend this for all biologists who want to get into computation but don’t have much background in it.
And just like in bench experiments, much effort often leads to frustration and going back to square one. These bouts can actually be important learning experiences, and with perseverance you will eventually become an expert. At least that’s what I keep telling myself :).