Dr. Pankaj Jaiswal's group at Oregon State University recently published an article in Frontiers in Plant Science (find it here) that combines two types of RNA-sequencing techniques to shed light on how the poplar transcriptome responds to abiotic stress. Sergei Filichkin and Michael Hamilton, the first authors of the study, demonstrated that a great deal of alternative splicing and intron retention occurs in response to stress, and that these events can dramatically alter isoform abundance. Interestingly, the authors observed alternative splicing occurring within the very genes thought to regulate responses to the different applied stresses. Are alternative splicing and differential intron retention part of poplar's programmed response to stress?

Below I asked Pankaj for more insight into the transcriptional changes his group observed in stressed plants. In addition, we talked about his thoughts on the two RNA-seq platforms that he utilized: PacBio's Iso-Seq and Illumina's short-read sequencing. Finally, we cover the traditional topics associated with managing large amounts of data. Enjoy!


Could you expand on your preferred hypothesis for why you observed an uptick in differential intron retention (DIR) during stress?

Jaiswal: Our hypothesis, and our observation, is that during stress there is an uptick in unspliced introns (intron retention events). Moreover, the ratio of spliced to unspliced introns is differentially regulated, possibly to find a suitable condition for full splicing or to adapt to the stress. Considering that many intron retention events would lead to unproductive ORFs (with shifted reading frames and possible premature termination), this may be one way of regulating the amount and availability of the gene product for carrying out its function. I wish we could complement this transcriptome study with proteomics data from the same samples for further confirmation, but that's for the future.
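The spliced-versus-unspliced ratio Jaiswal describes can be made concrete with a small sketch. This is purely illustrative, not the study's actual pipeline: the function name and read counts below are hypothetical, and real analyses would work from junction-level alignments across many replicates.

```python
def intron_retention_ratio(unspliced_reads, spliced_reads):
    """Fraction of reads supporting retention of one intron.

    unspliced_reads: reads crossing an exon-intron boundary
    spliced_reads:   reads spanning the spliced exon-exon junction
    Returns a value in [0, 1]; higher means more retention.
    """
    total = unspliced_reads + spliced_reads
    if total == 0:
        return None  # no coverage, ratio undefined
    return unspliced_reads / total

# Hypothetical counts for one intron, control vs. stress:
control = intron_retention_ratio(unspliced_reads=5, spliced_reads=95)    # 0.05
stressed = intron_retention_ratio(unspliced_reads=40, spliced_reads=60)  # 0.40
```

A shift like 0.05 to 0.40 between conditions is the kind of differential intron retention (DIR) signal the interview refers to.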


PacBio’s Iso-Seq is a relatively new technology. Were there any obstacles or extra considerations with Iso-Seq that you encountered of which the readers should be aware?

Jaiswal: The technology and kits have improved substantially, including on the newer platforms. It now allows barcoding and pooling of samples, thus reducing costs. By our estimates, this is a great technology for full-length cDNA sequencing, though our de novo and reference-based transcript assemblies from short-read sequencing were fairly accurate and a lot cheaper. The one advantage of Iso-Seq was that each transcript sequence came from an actual full-length single cDNA molecule. For exploration and genome annotation purposes, Iso-Seq is perfect. The technology may not yet be ready for gene expression analysis; for that, Illumina's RNA-Seq is still the best and cheapest option.


You used a variety of public and proprietary (PacBio) tools to analyze your data. Did you have trouble using any of them, and if so, did you contact the creators of the program, collaborate with someone who knew the tool already, or figure it out through trial and error?

Jaiswal: The Oregon State group was not very well trained in this type of work, so we reached out to ASN Reddy's and Asa Ben-Hur's labs at Colorado State University for help. They were already working on Sorghum data, and we saw it as a good match. Besides, we were already working with them on alternative splicing data from our RNA-Seq-based transcriptome assemblies.


Of all the tools you used, what was your favorite and why?

Jaiswal: There was no real favorite. They are all good; each has a certain flavor and does things a bit differently from the others. Just make sure that for any given analysis you use at least two or three different software packages to check for consistency and overlap. Always use the most recent versions of tools and software, and if something does not work, engage with the developers.
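The "consistency and overlaps" check Jaiswal recommends often boils down to simple set comparisons of each tool's calls. A minimal sketch, with invented tool outputs and gene IDs purely for illustration:

```python
# Hypothetical differentially-expressed gene calls from three tools
tool_a = {"PtGene1", "PtGene2", "PtGene3", "PtGene4"}
tool_b = {"PtGene2", "PtGene3", "PtGene4", "PtGene5"}
tool_c = {"PtGene3", "PtGene4", "PtGene6"}

consensus = tool_a & tool_b & tool_c        # called by all three tools
any_call = tool_a | tool_b | tool_c         # called by at least one tool
agreement = len(consensus) / len(any_call)  # crude concordance measure
```

Genes in the three-way consensus are the safest to carry forward; calls unique to a single tool deserve extra scrutiny before being reported.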


Was there any particular computer language or tool that you found most useful to transform your data to work with different analysis programs?

Jaiswal: Mostly custom Java, Perl, and Python scripts. We encourage students and postdocs to learn at least one of these languages to help them in their day-to-day bioinformatics analysis.
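As a toy example of the glue scripting Jaiswal describes, here is a hedged Python sketch that reshapes a tab-delimited counts table into the long format some downstream programs expect. The column names, gene IDs, and counts are made up for illustration; this is not code from the study.

```python
import csv
import io

# Hypothetical input: a tab-delimited counts table from one tool.
raw = "gene\tcontrol\tstress\nPtGeneA\t120\t340\nPtGeneB\t15\t3\n"

# Reshape into a comma-separated long format (one row per
# gene/condition pair), a common input layout for plotting
# and statistics packages.
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["gene_id", "condition", "count"])
for row in reader:
    writer.writerow([row["gene"], "control", row["control"]])
    writer.writerow([row["gene"], "stress", row["stress"]])

print(out.getvalue())
```

In practice the same pattern applies to any pairwise format conversion: parse with one module, restructure in plain data types, write with another.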


What are your preferred ways of storing and sharing the data generated, particularly with your collaborators?

Jaiswal: There are many ways, depending on the data type. For some data types, like raw sequence reads, there are well-known archives such as NCBI-GEO and EMBL-ENA. However, there are not many resources that accept annotated datasets such as those generated by transcriptome studies, including functional annotations, transcript assemblies, homology, and alternatively spliced forms. ENA has been the best in terms of submitting the data; EMBL's ArrayExpress resource was especially helpful. Other alternatives are DataDryad, Harvard Dataverse, or, best of all, CyVerse, which has a more thorough implementation of data standards and QC.


What was the most difficult/challenging part of completing this project?

Jaiswal: Getting the data ready for analysis was the biggest challenge in terms of computational effort, along with making sure all our biological samples were treated in a consistent manner under standardized conditions. We had to do a lot of dry runs before we attempted to go full scale with 81 biological samples.


Do you have any advice or insight for graduate students, post-docs, or budding computational biologists looking to know more about computation?

Jaiswal: Start early. Have a biological question as you embark on learning data analytics and bioinformatics, including programming. Pick a publicly available dataset with an appropriate number of biological replicates (three or more). Learn to use at least one programming language. Always seek help whenever you get stuck. Remember, you don't have to write code from scratch: you can begin by reusing and citing existing open-source code, then adapt it to address your needs. If you can return the favor by providing access to your own code, no matter how rough it is, it will help others as long as it works. Always maintain your code in a version-control repository (such as GitHub), whether private or open, and use comments to highlight what a given part of the code is expected to do. Take Data Carpentry training early on. Practice, practice, practice on real data. Apply at least two or three different software packages to the same analysis; this will give you some idea of consistency and overlap. Always use ontologies in your metadata descriptors and use standardized data formats.