CircParser: a novel streamlined pipeline for circular RNA structure and host gene prediction in non-model organisms

Artem Nedoluzhko; Fedor Sharko; Md. Golam Rbbani; Anton Teslyuk; Ioannis Konstantinidis; Jorge M.O. Fernandes

doi:10.7717/peerj.8757

Artem Nedoluzhko¹, Fedor Sharko²,³, Md. Golam Rbbani¹, Anton Teslyuk², Ioannis Konstantinidis¹, Jorge M.O. Fernandes¹

¹Faculty of Biosciences and Aquaculture, Nord University, Bodø, Norway

²Complex of NBICS Technologies, National Research Centre “Kurchatov Institute”, Moscow, Russia

³Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia

Published: 2020-03-16
Accepted: 2020-02-16
Received: 2019-12-04

Academic Editor: Yuriy Orlov

Subject Areas: Bioinformatics, Computational Biology, Genetics, Genomics
Keywords: Circular RNAs, Host gene, Prediction, Annotation, Structural components

Copyright: © 2020 Nedoluzhko et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Nedoluzhko A, Sharko F, Rbbani MG, Teslyuk A, Konstantinidis I, Fernandes JMO. 2020. CircParser: a novel streamlined pipeline for circular RNA structure and host gene prediction in non-model organisms. PeerJ 8:e8757 https://doi.org/10.7717/peerj.8757

The authors have chosen to make the review history of this article public.

Abstract

Circular RNAs (circRNAs) are long noncoding RNAs that play a significant role in various biological processes, including embryonic development and stress responses. These regulatory molecules can modulate microRNA activity and are involved in different molecular pathways as indirect regulators of gene expression. Thousands of circRNAs have been described in diverse taxa owing to recent advances in high-throughput sequencing technologies, which have made a large variety of total RNA sequencing datasets publicly available. A number of de novo circRNA and host gene prediction tools are available to date, but their ability to accurately predict circRNA host genes is limited in the case of low-quality genome assemblies or annotations. Here, we present CircParser, a simple and fast Unix/Linux pipeline that uses the outputs from the most common circular RNA in silico prediction tools (CIRI, CIRI2, CircExplorer2, find_circ, and circFinder) to annotate circular RNAs, assigning presumptive host genes from local or public databases such as that of the National Center for Biotechnology Information (NCBI). The pipeline can also discriminate circular RNAs based on their structural components (exonic, intronic, exon-intronic or intergenic) using a genome annotation file.

Introduction

De novo genome sequencing has become a routine procedure, due to a decrease in sequencing costs, diversification of high-throughput sequencing platforms and improvement of bioinformatic tools (Ekblom & Wolf, 2014). However, genome assemblies of non-model species and, as a result, their annotations are often of unsatisfactory quality, because of (1) repetitive sequences, including transposons and short sequence repeats (SSRs); (2) gene and genome duplications; (3) single-nucleotide polymorphisms (SNPs) and genome rearrangements (Lien et al., 2016; Negrisolo et al., 2010; Rodriguez & Arkhipova, 2018; Yahav & Privman, 2019).

CircRNAs are relatively poorly studied members of the non-coding RNA family. These unique single-stranded molecules are generated through back-splicing of pre-mRNAs in a wide range of eukaryotic and prokaryotic taxa (Danan et al., 2012; Holdt, Kohlmaier & Teupser, 2018), and even in viruses (Huang et al., 2019). CircRNAs play a significant role in the regulation of molecular pathways, not only by modulating microRNA and protein activity, but also by affecting transcription or splicing (Holdt, Kohlmaier & Teupser, 2018).

These regulatory molecules have been known for decades, but the development of high-throughput DNA analysis methods led to a rapid increase in the number of studies related to this type of non-coding RNA. This, in turn, created a need for additional circRNA prediction tools. miARma-Seq (Andres-Leon & Rojas, 2019) with the CIRI predictor (Gao, Wang & Zhao, 2015), circRNA_finder (Westholm et al., 2014), find_circ (Memczak et al., 2013), CIRCexplorer2 (Zhang et al., 2016), and other tools are now widely used for prediction of circRNA sequences from transcriptomic data (Hansen et al., 2016; Szabo & Salzman, 2016), despite significant differences in their outputs. Several circRNA predictors (CIRI, CIRI2, and CircExplorer2) can use genome annotation files for host gene prediction, but they are useful only for well-annotated genomes, and some tools, such as CircView (Feng et al., 2018) or circMeta (Chen et al., 2019), have been designed specifically for such genomes.

Here we describe CircParser, a novel and easy-to-use Unix/Linux pipeline for prediction of circular RNA host genes using the blastn program and the freely available bedtools software (Quinlan & Hall, 2010). Because of its versatile output files, CircParser can also be implemented as a part of pipelines for de novo prediction of circular RNAs. CircParser is most useful for circRNA host gene prediction in whole transcriptomic datasets from species with low-quality genome assemblies or poorly annotated genomes. It sorts and joins overlapping circular RNA sequences and predicts the host gene name for overrepresented circRNAs, while identifying their structural components. We demonstrate the prediction capacity of CircParser on a recently published transcriptomic data set from the fast muscle of wild and domesticated female Nile tilapia (Oreochromis niloticus) (Konstantinidis et al., under review) using the five most popular circRNA in silico prediction tools: CIRI, CIRI2, CircExplorer2, find_circ, and circFinder.

Materials & Methods

Illumina sequencing reads from twelve ribosomal RNA-depleted RNA-seq libraries were downloaded from Gene Expression Omnibus (accession number GSE135811). The reads were filtered by quality (phred > 20) and library adapters were trimmed using Cutadapt software (version 1.12) (Marcel, 2011). The Nile tilapia reference genome (ASM185804v2) and its gene annotation (ref_O_niloticus_UMD_NMBU_top_level.gff3) were used in the following analysis.

CircRNA prediction was performed for each ribosomal RNA-depleted RNA-seq library using the circRNA in silico prediction tools (i) CIRI (Gao, Wang & Zhao, 2015), which is linked to the miARma-Seq pipeline (Andres-Leon & Rojas, 2019), (ii) CIRI2 (Gao, Zhang & Zhao, 2018), (iii) CircExplorer2 (Zhang et al., 2016), (iv) find_circ (Memczak et al., 2013), and (v) circFinder (Westholm et al., 2014). Prediction output files from all libraries were converted separately to coordinate file format. After sorting, the coordinate files from the different prediction algorithms were merged, library by library, using bedtools multiinter (Quinlan & Hall, 2010) to determine a joint prediction output from CIRI, CIRI2, CircExplorer2, find_circ, and circFinder (see Table S1).
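The idea of a joint prediction across tools can be illustrated with a minimal Python sketch. It is not the bedtools multiinter implementation and reproduces only one aspect of it: reporting the genomic segments covered by every tool. The function name `consensus_intervals`, the half-open (chrom, start, end) tuples and the toy coordinates are assumptions made for illustration.

```python
from collections import defaultdict


def consensus_intervals(predictions):
    """Return (chrom, start, end) segments covered by every tool.

    predictions: dict mapping a tool name to a list of half-open,
    BED-style (chrom, start, end) intervals. Hypothetical helper,
    not part of CircParser or bedtools.
    """
    n_tools = len(predictions)
    events = defaultdict(list)  # chrom -> [(position, tool, +1 or -1)]
    for tool, intervals in predictions.items():
        for chrom, start, end in intervals:
            events[chrom].append((start, tool, +1))
            events[chrom].append((end, tool, -1))

    shared = []
    for chrom in sorted(events):
        depth = defaultdict(int)  # coverage depth per tool
        active = 0                # number of tools currently covering
        prev = None
        # Sweep left to right; interval ends sort before starts at a tie.
        for pos, tool, delta in sorted(events[chrom], key=lambda e: (e[0], e[2])):
            if prev is not None and pos > prev and active == n_tools:
                if shared and shared[-1][0] == chrom and shared[-1][2] == prev:
                    # Extend the previous segment when they are contiguous.
                    shared[-1] = (chrom, shared[-1][1], pos)
                else:
                    shared.append((chrom, prev, pos))
            before = depth[tool]
            depth[tool] += delta
            if before == 0 and depth[tool] == 1:
                active += 1
            elif before == 1 and depth[tool] == 0:
                active -= 1
            prev = pos
    return shared


# Toy coordinates, invented for illustration only.
toy = {
    "CIRI2": [("LG1", 100, 500)],
    "find_circ": [("LG1", 200, 600)],
    "CircExplorer2": [("LG1", 150, 450), ("LG2", 10, 50)],
}
print(consensus_intervals(toy))  # [('LG1', 200, 450)]
```

Requiring coverage by all tools is a conservative choice; a real analysis may instead require identical back-splice coordinates or support from a subset of tools.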

We developed CircParser as a streamlined pipeline that makes use of output files from the most popular circRNA in silico predictors. CircParser runs on Linux/Unix systems and its parameters are presented in Table 1.

Table 1:

CircParser.pl usage. Required and optional parameters.

Parameter	Parameter description
-h, --help	Show this help message and exit
-b	CircRNA input file (required)
-g, --genome	Reference genome file (required)
-t, --tax	NCBI TaxID (optional)
-a	Genome annotation file, gff/gff3 file (optional)
--np	Prohibition for coordinate merging (optional)
-c, --ciri	Input circRNA from the CIRI or CIRI2 in silico predictors (default: input from CircExplorer2, find_circ, circFinder, and BED files)
--threads	Number of threads (CPUs) for BLAST search (optional)
-v, --version	Current CircParser version

DOI: 10.7717/peerj.8757/table-1

Usage: perl CircParser.pl [-h] -b INPUT_FILE --genome REF_GENOME

At the first stage of the pipeline, CircParser can merge overlapping circRNA coordinates from circRNA predictor outputs using bedtools merge (Quinlan & Hall, 2010); this ensures that they are related to the same host gene and creates separate coordinate files (bed files) with the overlapping circRNA coordinates. In addition, a dedicated option allows merging of circRNAs that do not overlap but are located in a contiguous genomic locus.
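The merging step can be sketched in a few lines of Python. This is an illustration of interval merging in general, not the code used by CircParser or bedtools; the function name `merge_intervals` and the `max_gap` parameter (a stand-in for merging nearby, non-overlapping circRNAs) are hypothetical, as are the toy coordinates.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (chrom, start, end) intervals that overlap or lie within
    max_gap bases of each other.

    Returns (chrom, start, end, n_merged) tuples, where n_merged is the
    number of input circRNAs collapsed into the merged locus.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + max_gap:
            last = merged[-1]
            # Same locus: extend the end and count one more circRNA.
            merged[-1] = (chrom, last[1], max(last[2], end), last[3] + 1)
        else:
            merged.append((chrom, start, end, 1))
    return merged


# Toy coordinates, invented for illustration only.
circ = [("LG1", 100, 400), ("LG1", 350, 700), ("LG1", 900, 1200), ("LG2", 50, 80)]
print(merge_intervals(circ))
# [('LG1', 100, 700, 2), ('LG1', 900, 1200, 1), ('LG2', 50, 80, 1)]
print(merge_intervals(circ, max_gap=200))
# [('LG1', 100, 1200, 3), ('LG2', 50, 80, 1)]
```

Keeping the count of merged intervals per locus is one simple way to report how many different circRNAs are assigned to the same locus.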

The separate coordinate files (bed files) are converted to fasta files using bedtools getfasta (Quinlan & Hall, 2010). Finally, CircParser uses the fasta files to predict circRNA host genes by searching an NCBI database, which is the longest stage of the pipeline (Fig. 1A). CircParser works by default with the NCBI online database, but it can optionally use a custom database or a pre-compiled NCBI database installed locally. CircParser uses the following blast parameters, which are necessary for host gene prediction, and assigns the matching sequences to the respective circRNA: -perc_identity 90 (the minimum percent identity of matches to report is 90%); -max_target_seqs 1000 (the maximum number of aligned sequences to keep is 1000); and -max_hsps 1. CircParser also filters out non-informative blast results, such as “uncharacterized”, “clone”, “linkage group” and others, from the output table.
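The filtering of non-informative hits can be illustrated with a short Python sketch. It assumes that each blast hit has already been reduced to a (description, percent identity) pair ordered from best to worst; the function name `pick_host_gene`, the substring-based matching, the fallback label and the toy hit descriptions are assumptions for illustration and do not reproduce CircParser's Perl code.

```python
# Terms named in the text; the real pipeline filters additional ones.
NON_INFORMATIVE = ("uncharacterized", "clone", "linkage group")


def pick_host_gene(hits, min_identity=90.0, stop_terms=NON_INFORMATIVE):
    """Return the first informative hit description, or 'NOT ASSIGNED'.

    hits: iterable of (description, percent_identity) pairs, best hit first.
    """
    for description, identity in hits:
        if identity < min_identity:
            continue  # below the identity threshold
        lowered = description.lower()
        if any(term in lowered for term in stop_terms):
            continue  # non-informative description
        return description
    return "NOT ASSIGNED"


# Toy hit descriptions, invented for illustration only.
toy_hits = [
    ("uncharacterized locus (toy example), mRNA", 99.1),
    ("BAC clone (toy example), genomic sequence", 98.0),
    ("troponin T3 (toy example), mRNA", 97.5),
]
print(pick_host_gene(toy_hits))      # troponin T3 (toy example), mRNA
print(pick_host_gene(toy_hits[:2]))  # NOT ASSIGNED
```

Plain substring matching is crude (a term such as "clone" could occur inside an informative gene name), so a production filter would likely need word-boundary matching and a longer, curated term list.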

Figure 1: An overview of the CircParser pipeline.
(A) The pipeline includes merging of the circRNAs with overlapping genome coordinates and presents the number of different circRNAs originating from one host gene. (B) CircParser includes the prediction of circRNA structural components using a genome annotation gff/gff3 file.

DOI: 10.7717/peerj.8757/fig-1

CircParser can also discriminate circular RNAs by their structural components (exonic, intronic, exon-intronic or intergenic) using a genome annotation gff/gff3 file (-a parameter). In this case, the user should disable circRNA coordinate merging (using the --np parameter) when running the pipeline to obtain correct results (Fig. 1B).

Usage: perl CircParser.pl --np -b INPUT_FILE --genome REF_GENOME -a GENOME.gff

However, a poor-quality annotation file can lead to errors in the circRNA structure analysis.
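One possible way to assign the four structural classes from annotation coordinates is sketched below in Python. The decision rules are an assumption made for illustration (the text does not specify the exact rules CircParser applies), and the function names, half-open coordinates and toy annotation are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def inside_any(pos, intervals):
    """True if position pos falls within any (start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def classify_circrna(start, end, genes, exons):
    """Classify one circRNA on a single chromosome.

    Assumed rules (not taken from CircParser):
      intergenic    - overlaps no annotated gene
      intronic      - overlaps a gene but no exon
      exonic        - both circRNA ends lie inside annotated exons
      exon-intronic - overlaps an exon, but at least one end lies outside exons
    """
    if not any(overlaps(start, end, s, e) for s, e in genes):
        return "intergenic"
    if not any(overlaps(start, end, s, e) for s, e in exons):
        return "intronic"
    if inside_any(start, exons) and inside_any(end - 1, exons):
        return "exonic"
    return "exon-intronic"


# Toy annotation, invented for illustration only.
genes = [(1000, 5000)]
exons = [(1000, 1200), (2000, 2300), (4000, 5000)]
print(classify_circrna(2000, 5000, genes, exons))  # exonic
print(classify_circrna(1300, 1900, genes, exons))  # intronic
print(classify_circrna(1100, 1500, genes, exons))  # exon-intronic
print(classify_circrna(6000, 7000, genes, exons))  # intergenic
```

Because every rule depends on annotated gene and exon boundaries, a missing or misplaced exon in the gff/gff3 file changes the assigned class directly, which is why annotation quality matters at this step.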

The Perl implementation of CircParser is available at https://github.com/SharkoTools/CircParser.

Results

We applied CircParser to twelve merged coordinate files that contained the joint coordinates of circRNAs predicted using CircExplorer2, miARma-Seq (with the CIRI predictor), CIRI2, find_circ, and circFinder. The five algorithms predicted on average ∼131 (CircExplorer2), ∼501 (CIRI), ∼706 (CIRI2), ∼257 (find_circ), and ∼398 (circFinder) circRNAs per sample, with a small overlap of ∼37 circRNAs (Fig. 2; Table S1), in agreement with previously published comparisons (Hansen, 2018; Hansen et al., 2016).

Figure 2: Number of circular RNAs that have been predicted by CIRI, CIRI2, CircExplorer2, find_circ, circFinder, and that are common between all prediction algorithms.

DOI: 10.7717/peerj.8757/fig-2

To assess the host genes of circular RNAs and to reduce false-positive rates, only the overlapping circRNAs (Fig. 2) were used in CircParser. The pipeline eliminates non-informative outputs (e.g., hits that contain only a chromosome/contig name, the number of an uncharacterized locus, or the name of a BAC clone), while keeping the more relevant blast results and retrieving the likely host gene name for each circular RNA. When no matching sequence can be found in the database, the tool marks the sequence as NOT ASSIGNED.

Discussion

The CircParser results also allow us to determine the number of circRNA types from one host gene and their minimum and maximum size in base pairs (bp). We showed that our algorithm detected presumptive host gene names for the vast majority of predicted circRNAs. Moreover, most of the host genes were related to muscle function (e.g., calcium/calmodulin-dependent protein kinase, troponin T3, myocyte-specific enhancer factor 2C, and others) or to immunity (MHC class IA antigen), and were consistently found among different individuals (Table S2), despite the relatively low sequencing coverage of the data used for circRNA analysis (Mahmoudi & Cairns, 2019). An example of circRNA structure analysis for CIRI, CIRI2, CircExplorer2, find_circ, and circFinder outputs is presented in Supplementary Table S3.
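The per-host-gene summary described above (number of circRNA types and their size range) can be sketched in Python as follows. The record layout, the function name `summarise_by_host_gene` and the toy values are assumptions for illustration and do not reflect the actual CircParser output format.

```python
from collections import defaultdict


def summarise_by_host_gene(records):
    """Count circRNAs per host gene and report their size range in bp.

    records: iterable of (host_gene, start, end) with half-open coordinates.
    """
    sizes = defaultdict(list)
    for host_gene, start, end in records:
        sizes[host_gene].append(end - start)
    return {
        gene: {"n_circrnas": len(v), "min_bp": min(v), "max_bp": max(v)}
        for gene, v in sizes.items()
    }


# Toy records, invented for illustration only.
toy_records = [
    ("troponin T3", 100, 400),
    ("troponin T3", 1000, 1900),
    ("NOT ASSIGNED", 50, 250),
]
print(summarise_by_host_gene(toy_records))
# {'troponin T3': {'n_circrnas': 2, 'min_bp': 300, 'max_bp': 900},
#  'NOT ASSIGNED': {'n_circrnas': 1, 'min_bp': 200, 'max_bp': 200}}
```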

To estimate the capacity of our pipeline, we compared the number of host genes predicted by CircExplorer2 and by CircParser (with CircExplorer2 outputs used as input files) for the same O. niloticus fast muscle datasets used earlier. CircParser showed greater efficiency for Nile tilapia, increasing the number of predicted host genes up to two-fold (Fig. 3).

Figure 3: CircParser capacity: number of host genes that were predicted by CircExplorer2 and CircParser.

DOI: 10.7717/peerj.8757/fig-3

Another equally important aspect of CircParser is its accuracy. The well-annotated zebrafish reference genome (assembly GRCz11) and a zebrafish muscle transcriptomic dataset (ERR145655) were used for accuracy estimation, i.e., to measure the agreement between the annotation file and the CircParser output. In this case, the CircParser host gene prediction was confirmed in 82.4% of cases.

Conclusions

Thus, we conclude that CircParser represents a reproducible workflow that enables researchers to effectively predict the host genes for circular RNAs, even in non-model organisms with poorly annotated genome assemblies.

Supplemental Information

Sequencing and quality trimming statistics

Number of circular RNAs predicted by CircExplorer2, CIRI/miARma-Seq, CIRI2, find_circ, and circFinder software. The number of common circular RNAs predicted by all methods is also indicated.

DOI: 10.7717/peerj.8757/supp-1

The CircParser pipeline output table for overlapping circRNAs (without structural component analysis)

DOI: 10.7717/peerj.8757/supp-2

The CircParser pipeline output table for overlapping circRNAs (with structural component analysis)

DOI: 10.7717/peerj.8757/supp-3

[1] Andres-Leon E, Rojas AM. 2019. miARma-Seq, a comprehensive pipeline for the simultaneous study and integration of miRNA and mRNA expression data. Methods 152:31-40

[2] Chen L, Wang F, Bruggeman EC, Li C, Yao B. 2019. circMeta: a unified computational framework for genomic feature annotation and differential expression analysis of circular RNAs. Bioinformatics 36(2):539-545

[3] Danan M, Schwartz S, Edelheit S, Sorek R. 2012. Transcriptome-wide discovery of circular RNAs in Archaea. Nucleic Acids Research 40:3131-3142

[4] Ekblom R, Wolf JB. 2014. A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7:1026-1042

[5] Feng J, Xiang Y, Xia S, Liu H, Wang J, Ozguc FM, Lei L, Kong R, Diao L, He C, Han L. 2018. CircView: a visualization and exploration tool for circular RNAs. Briefings in Bioinformatics 19:1310-1316

[6] Gao Y, Wang J, Zhao F. 2015. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology 16:4

[7] Gao Y, Zhang J, Zhao F. 2018. Circular RNA identification based on multiple seed matching. Briefings in Bioinformatics 19:803-810

[8] Hansen TB. 2018. Improved circRNA identification by combining prediction algorithms. Frontiers in Cell and Developmental Biology 6:20

[9] Hansen TB, Veno MT, Damgaard CK, Kjems J. 2016. Comparison of circular RNA prediction tools. Nucleic Acids Research 44:e58

[10] Holdt LM, Kohlmaier A, Teupser D. 2018. Molecular roles and function of circular RNAs in eukaryotic cells. Cellular and Molecular Life Sciences 75:1071-1098

[11] Huang JT, Chen JN, Gong LP, Bi YH, Liang J, Zhou L, He D, Shao CK. 2019. Identification of virus-encoded circular RNA. Virology 529:144-151

[12] Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, Hvidsten TR, Leong JS, Minkley DR, Zimin A, Grammes F, Grove H, Gjuvsland A, Walenz B, Hermansen RA, Von Schalburg K, Rondeau EB, Di Genova A, Samy JK, Olav Vik J, Vigeland MD, Caler L, Grimholt U, Jentoft S, Vage DI, De Jong P, Moen T, Baranski M, Palti Y, Smith DR, Yorke JA, Nederbragt AJ, Tooming-Klunderud A, Jakobsen KS, Jiang X, Fan D, Hu Y, Liberles DA, Vidal R, Iturra P, Jones SJ, Jonassen I, Maass A, Omholt SW, Davidson WS. 2016. The Atlantic salmon genome provides insights into rediploidization. Nature 533:200-205

[13] Mahmoudi E, Cairns MJ. 2019. Circular RNAs are temporospatially regulated throughout development and ageing in the rat. Scientific Reports 9:2564

[14] Marcel M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10-12

[15] Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, Loewer A, Ziebold U, Landthaler M, Kocks C, Le Noble F, Rajewsky N. 2013. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495:333-338

[16] Negrisolo E, Kuhl H, Forcato C, Vitulo N, Reinhardt R, Patarnello T, Bargelloni L. 2010. Different phylogenomic approaches to resolve the evolutionary relationships among model fish species. Molecular Biology and Evolution 27:2757-2774

[17] Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841-842

[18] Rodriguez F, Arkhipova IR. 2018. Transposable elements and polyploid evolution in animals. Current Opinion in Genetics & Development 49:115-123