A Simple Method for Analyzing Actives in Random RNAi Screens: Introducing the “H Score” for Hit Nomination & Gene Prioritization

Combinatorial Chemistry & High Throughput Screening, 2012, 15, 686-704

Bhavneet Bhinder and Hakim Djaballah*

HTS Core Facility, Molecular Pharmacology and Chemistry Program, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, USA

Abstract: Hit identification from random RNAi screening faces numerous challenges. We have examined current practices and found a variety of methodologies employed and published in many reports; unfortunately, the majority of them do not address the minimum criteria for hit nomination, which may well explain the lack of confirmation and follow-up studies currently facing the RNAi field. Overall, we find that these criteria or parameters are not well defined and are in most cases arbitrary in nature, rendering it extremely difficult to judge the quality of, and confidence in, nominated hits across published studies. For this purpose, we have developed a simple method to score actives independent of assay readout, providing, for the first time, a homogeneous platform enabling cross-comparison of active gene lists resulting from different RNAi screening technologies. Here, we report on this recently developed method dedicated to RNAi data output analysis, referred to as the BDA method, which is applicable to both arrayed and pooled RNAi technologies and addresses the concerns pertaining to inconsistent hit nomination and off-target silencing, in conjunction with minimal activity criteria to identify a high value target. In this report, a combined hit rate per gene, called the “H score”, is introduced and defined.
The H score provides a very useful tool for stringent active gene nomination, gene list comparison across multiple studies, prioritization of hits, and evaluation of the quality of the nominated gene hits.

Keywords: 3’UTR, BDA method, esiRNA, H score, HCS, heptamer, HTS, miRNA, off-target effect, randomness, RNAi, screening, seed sequence, shRNA, siRNA.

INTRODUCTION

RNAi screening technology is viewed by many as a promising exploratory tool, and researchers worldwide have embarked on it to reap the benefits of its ability to knock down single genes. RNAi, often referred to as the scientist’s Holy Grail, has since been adapted to conduct screens up to genome-wide scale to study the entire genome repertoire and to advance our current understanding of gene function and its role in various disease states [1]. Despite the encouraging progress in the discovery of a wide array of gene candidates, none of them has come to fruition, especially as novel molecular targets for therapeutic intervention; in most cases, they failed further validation. More recently, the RNAi field has been faced with the challenge of poor reproducibility and a lack of confirmatory studies [2-4]. In 2008, three HIV host-virus interaction siRNA screens were published by Konig and co-workers [5], Brass and co-workers [6], and Zhou and co-workers [7]; in the following year, an additional shRNA screen was reported by Yeung and co-workers [8]. Intuitively, one would have expected to observe a significant overlap among the genes causing the strongest phenotypes across the four screens, irrespective of the type of RNAi technology used.
Surprisingly, none of the genes overlapped across all four screens, while only three genes, namely RELA, MED6 and MED7, overlapped across the three siRNA screens [2, 3]. This has obviously caused a big dent in the field; it called into question the sophistication of the sequence-predicting algorithms used by the various vendors in the first place, and was followed by a strong call for standardization of the RNAi field. Additionally, a follow-up study summarizing the screening metadata from three reports (Konig, Brass, and Zhou) concluded that the three screens, though not having identified identical genes, nevertheless identified their molecular pathways, which was offered as an explanation for the lack of overlap [2]. Unfortunately, a few more examples have posed similar concerns of poor cross-study overlap, and now raise the question of the true merits of hits identified through RNAi screening [4, 9, 10].

From a screening analysis perspective, the observed discordances in screening output could well be explained by the following three statements: 1) inconsistent methods of phenotypic scoring, 2) minimal criteria of overall gene activity, and 3) disregard for prevalent off-target effects (OTEs) in nominated hits. Each statement has its own merit in identifying artifactual hits from random RNAi screening. During the early days of RNAi screening, investigators adopted the common hit selection practices that had been applied for many years to chemical screening. However, by doing so, a major difference was totally ignored: the 1:1 relationship of compound to observed activity versus the 1:many relationship of gene to RNAi targeting sequences.

*Address correspondence to this author at the HTS Core Facility, Molecular Pharmacology and Chemistry Program, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, USA; Tel: (646) 888-2198; E-mail: djaballh@mskcc.org
© 2012 Bentham Science Publishers

Various groups still use multiple methods for hit selection in RNAi screening, such as percentage inhibition of activity, z-score, B-score, statistical tests, and various other ranking methods [11]. In 2007, the strictly standardized mean difference (SSMD) method was developed specifically for RNAi screening data analysis as an siRNA duplex ranking method based on the duplex’s effect size relative to the negative control, and was proposed to yield hits with reduced false positive and false negative rates when compared to traditional methods like z-score and percentage inhibition [12]. Although this method enabled active duplex identification, it did not address the minimal active gene identification criteria for RNAi experiments. Later in the same year, the redundant siRNA activity (RSA) method was introduced; it partly addressed the combinatorial nature of RNAi screens for the first time, as it assigned a p-value to a gene based on the performance of all its corresponding duplexes [13]. However, because the RSA method ranks a gene based on the collective activity of all its corresponding duplexes, the performance of ill-behaved duplexes may skew the analysis results. Therefore, although the minimal criterion for determining gene activity is crucial due to the inherent combinatorial nature of RNAi screening, it has remained a widely ignored parameter in published data analysis workflows.

Off-target silencing has become a major handicap in RNAi screening, and efforts have been directed towards designing oligonucleotides with maximal target specificity [14].
Recent studies have widened our understanding of off-target silencing, attributing this behavior to a microRNA (miRNA)-like activity of the exogenous oligonucleotides [15, 16]. Meanwhile, the role of seed sequences in determining target specificity has also been described. Based on these observations, various computational approaches have been developed to identify off-target effects in screening results. These approaches rely on one or another of the factors that putatively lead to off-target effects (OTEs), such as seed over-representation in hits, miRNA enrichment, or 3’UTR enrichment [17-19]. Despite the development of these computational tools for OTE identification in RNAi screening data, none of the reported strategies has yet been incorporated into a standardized hit nomination workflow. One of the major requirements of the field is to incorporate all three of these factors as integral steps in a comprehensive hit nomination workflow, dedicated solely to identifying high confidence targets from random RNAi screening campaigns.

In this report, we introduce a simple method, referred to as the BDA method (Fig. 1), as a standardized workflow pipeline encompassing most of the issues described above; we also introduce the H score and OTE filtering, which specifically enhance confidence in hit nomination from random RNAi screening. We perform a control-based analysis to best address the system's heterogeneity while also incorporating OTE filtering to address the prevalence of OTEs, and have streamlined a workflow to standardize hit selection methodologies. We also introduce the concept of hit rate per gene, referred to as the H score, to address the combinatorial nature of RNAi screening and to avoid the pitfall of calling an outlier a high value gene target.
We have applied our newly developed methodology to data obtained from two published shRNA screens as case studies: one report using an arrayed shRNA approach against 19 cell lines [20], and the second using a pooled shRNA approach against 102 cell lines [21]. We report on the hits nominated using the BDA method versus the published gene lists.

MATERIALS AND METHODS

Sequence Databases

Human genome-wide 3’UTR sequences were obtained from the University of California at Santa Cruz (UCSC) genome browser, assembly GRCh37/hg19 (genome.ucsc.edu [22]). Nucleotide (nt) sequences less than 10 nt in length were excluded from the analysis. Human microRNA (miRNA) sequences were obtained from miRBase release 18 (mirbase.org [23]), and the information relating to their experimentally validated targets was obtained from TarBase 6.0 [24]. The 330,687 oligonucleotide sequences for the TRC library were downloaded from the Broad Institute portal (www.broadinstitute.org/IGP/home).

Seed Sequence Heptamer Selection

The 7-mer seed sequence, referred to as the seed heptamer, was selected from the antisense (guide) strand (Fig. 2A). The guide strand was selected over the passenger strand due to its predominant role in OTEs [14, 17, 25]. The choice of a heptamer seed over a 6-mer or 8-mer seed was based on previous findings regarding the higher specificity of a heptamer [25, 26]. For siRNA duplexes, the seed heptamer was defined as the 7 nt long sequence starting at the second nt from the 5’ end of the guide strand. For shRNA hairpins, the seed heptamer was defined as the 7 nt long sequence with its start position determined by two methods: 1) theoretically, based on the ideal seed start position on the oligonucleotide, and 2) empirically, based on the duplex performance in the screen using a method developed by Dr. Eugen Buehler (NIH, MD), referred to hereafter as empirical seed-selection (ESS) (Fig. 2B).
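As an illustration of the siRNA seed definition above (7 nt beginning at the second nt from the 5’ end of the guide strand), a minimal Python sketch follows. The function name `seed_heptamer`, its `start` parameter, and the example sequence are assumptions made for illustration and are not part of the published method; the adjustable `start` merely stands in for a theoretically or empirically chosen shRNA start position, and the ESS procedure itself is not reproduced here.

```python
def seed_heptamer(guide_strand: str, start: int = 2, length: int = 7) -> str:
    """Return the seed starting at 1-based position `start` from the 5' end."""
    seq = guide_strand.strip().upper()
    begin = start - 1  # convert the 1-based position to a 0-based index
    if begin < 0 or begin + length > len(seq):
        raise ValueError("guide strand is too short for the requested seed")
    return seq[begin:begin + length]


# Illustrative 22-nt guide strand written 5'->3' (not taken from the paper).
guide = "UGAGGUAGUAGGUUGUAUAGUU"
print(seed_heptamer(guide))           # GAGGUAG (positions 2-8)
print(seed_heptamer(guide, start=3))  # AGGUAGU (a shifted start position)
```

Keeping the start position as a parameter allows the same routine to produce both the theoretical and the empirical seed lists for shRNA hairpins, once those start positions are known.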
Determining the shRNA seed start position by both methods enables us to generate two lists of seed heptamers, to perform the OTE filtering analysis on each list separately, and to merge the results into one list of High Confidence OTEs (HC_OTEs).

The BDA Method

The BDA method comprises five steps (Fig. 1), defined below, and takes into account the activities of “duplexes”, a term referring to siRNA duplexes, shRNA hairpins, or esiRNA duplexes.

Active Duplex Identification

The active duplexes were scored based on a threshold set at the mean (μ) ± 2 standard deviations (2σ) of the controls. Outliers in the control data may be identified and removed using the interquartile range before determining thresholds. The selection of controls and the determination of the threshold are screen dependent. RNAi screening raw data do not necessarily follow a classic Gaussian distribution and are more often bimodal in nature [27]. Therefore, we incorporated the control-based analysis, which allows for hit selection in the two distributions observed, independent of the duplex performance distribution.

By setting a control-based threshold for active duplex identification, we would miss duplexes with activities just below the cut-offs. Thus, after actives identification, we find it important to examine breakpoints in the output values of all the duplexes to assess where strong activity breakpoints are located. This analysis is done in order to determine clear breaks in the readout values and to score such duplexes as active if no clear performance differential is observed. We present two examples of 10 genes each: one for a gain-of-function assay measuring EGFP fluorescence enhancement against an siRNA library, with a control-based threshold set at > 259; and the second for a lethal shRNA hairpin assay measuring residual nuclei count, with a control-based threshold set at 2,440.
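The control-based thresholding described above (μ ± 2σ of the controls, with optional interquartile-range outlier removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1.5 × IQR fence, the use of the sample standard deviation, the `direction` argument, and all function names are choices made here for the example. The breakpoint re-scoring step is not covered.

```python
from statistics import mean, quantiles, stdev


def remove_iqr_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumption."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


def control_thresholds(controls, n_sd=2.0):
    """Return (lower, upper) cut-offs at mean -/+ n_sd sample SDs of the controls."""
    kept = remove_iqr_outliers(controls)
    mu, sigma = mean(kept), stdev(kept)
    return mu - n_sd * sigma, mu + n_sd * sigma


def active_duplexes(readouts, controls, direction="up"):
    """Score duplexes as active when they fall beyond the control-based cut-off.

    `readouts` maps a duplex ID to its readout value; `direction` selects
    gain-of-signal ("up") or loss-of-signal ("down") actives.
    """
    lower, upper = control_thresholds(controls)
    if direction == "up":
        return {d for d, v in readouts.items() if v > upper}
    return {d for d, v in readouts.items() if v < lower}


# Hypothetical control wells and duplex readouts.
controls = [100, 102, 98, 101, 99, 100, 250]   # 250 is removed as an outlier
readouts = {"dup1": 110, "dup2": 101, "dup3": 90}
print(active_duplexes(readouts, controls, "up"))
print(active_duplexes(readouts, controls, "down"))
```

Because the cut-offs are derived from the controls only, the scoring does not depend on the shape of the duplex readout distribution, which is the motivation given in the text for the control-based approach.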
The breakpoint analysis identifies and re-scores as active those duplexes which have values around the set threshold (Suppl Fig. 1).

Active Gene Identification

Active genes were identified from the active duplexes obtained in step 1, based on the two criteria described as follows:

a) H score to identify active genes

The active genes were nominated from the active duplex list using a hit rate per gene score (H score) with a threshold set at ≥ 60. An H score of 60 translates into 2 active siRNA/esiRNA duplexes or 3 active shRNA hairpins in a typical RNAi library comprising at least 3 siRNA/esiRNA duplexes or 5 shRNA hairpins targeting each gene, respectively, yielding a hit rate of ≥ 60% under each scenario (Fig. 3). The H score is defined as follows:

$$\text{H score} = \frac{\text{Number of active duplexes}}{\text{Total number of duplexes}} \times 100$$

Considering the inherent gene coverage heterogeneity of most RNAi libraries (Table 1), where the percentage coverage of > 3 duplexes for si/esiRNA and > 5 duplexes for shRNA hairpins ranges from 0.11 up to 21%, we have made provisions to the H score analysis whereby a t-test is performed to determine if the performance of the active duplexes was significantly different from the performance of the inactive ones, as described below. Genes with coverage of < 2 duplexes in any given RNAi library were completely excluded from the analysis to maintain a high level of stringency.

Fig. (1). Schematic workflow of the developed BDA method. The five steps of the BDA method are depicted. HC_OTE: High Confidence Off-Target Effects; LC_OTE: Low Confidence Off-Target Effects; No OTE: no Off-Target Effects.

Fig. (2). Seed heptamer sequence attributes and location on resulting RNAi duplexes. A) Seed heptamer determination in siRNA duplexes. B) Seed heptamer determination in shRNA hairpins based on differential dicer cleavage scenarios considered in the BDA method. Seed heptamers are depicted in red. C) Correlation assessment of seed heptamer starting nucleotide performed using the ESS method in four representative screened cell lines from the Barbie screen. The red line indicates the nucleotide position with the highest correlation value in the guide strand. [Panel C plots: 786-O, HMEC-TERT, THP-1, and MDA-MB-231; x-axis: oligonucleotide position (0 to 60); y-axis: correlation values.]

b) Statistical test to assess duplex performance

On average, an RNAi library contains either 3 siRNA duplexes or 5 shRNA hairpins per gene, although these numbers do vary. It is important, however, to account for differential performance amongst duplexes for such genes, especially in scenarios where high H scores of ≥ 80 would otherwise be expected; that is, 4 active shRNA hairpins or 3 active duplexes based on the average library statistics. This also helps to assess whether the inactive duplexes of genes targeted by high numbers of duplexes in the library are truly inactive. The duplexes targeting such genes were divided into two categories, those active in the screen and those inactive in the screen, and a t-test was applied to assess the difference in performance between the two categories. The null hypothesis (H0) was defined as no difference in mean performance between the two categories, and H0 was rejected at a p-value threshold set at < 0.05 [28].
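To make the H score criterion of step 2a concrete, the following Python sketch computes the hit rate per gene and applies the coverage filter. It is offered as an illustration only: the function names, the input layout (a mapping from duplex ID to gene symbol), and the gene and duplex identifiers are hypothetical, and the nomination threshold is taken as H ≥ 60 on the reading that 3 active hairpins out of 5 qualify. The t-test of step 2b is not included.

```python
from collections import defaultdict


def h_scores(duplex_to_gene, active, min_coverage=2):
    """Return {gene: H score}, where H = 100 * active duplexes / total duplexes.

    Genes covered by fewer than `min_coverage` duplexes are excluded,
    mirroring the stringency rule stated in the text.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for duplex, gene in duplex_to_gene.items():
        totals[gene] += 1
        if duplex in active:
            hits[gene] += 1
    return {g: 100.0 * hits[g] / n for g, n in totals.items() if n >= min_coverage}


def nominate(scores, threshold=60.0):
    """Return the genes whose H score meets the threshold (assumed >= 60)."""
    return sorted(g for g, h in scores.items() if h >= threshold)


# Hypothetical library: two genes with 5 hairpins each, one with a single duplex.
library = {f"A{i}": "GENE_A" for i in range(1, 6)}
library.update({f"B{i}": "GENE_B" for i in range(1, 6)})
library["C1"] = "GENE_C"
active = {"A1", "A2", "A3", "B1", "B2", "C1"}

scores = h_scores(library, active)
print(scores)            # GENE_A: 60.0, GENE_B: 40.0; GENE_C is excluded
print(nominate(scores))  # ['GENE_A']
```

Note that GENE_C is dropped even though its only duplex is active: a single active duplex would otherwise give an H score of 100, which is exactly the outlier-driven nomination the coverage rule is meant to avoid.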
The t-test was performed using the Statistics::TTest module in Perl.

OTE Filtering

The active duplexes corresponding to the active genes nominated in step 2 were assessed for OTEs. The OTE activity was defined for the seed heptamer corresponding to each individual duplex. The seed heptamer was subjected to three analyses for OTE filtering, defined as follows:

a) Seed heptamer enrichment in hits

The seed heptamer enrichment was determined based on the hypergeometric distribution [29], to find the probability of obtaining a number of matches for the seed heptamer in the active duplexes at least as extreme as that actually observed. H0 was defined as observing the number of seed heptamer matches in the active duplexes by chance, and the threshold for rejecting H0 was set at p < 0.05 [28], therefore indicating an over-representation of a seed heptamer in the list of active duplexes versus the inactive duplexes. If l is a seed heptamer, N is the total number of seed heptamers in the library, n is the total number of seed heptamers in the active duplexes, k is the number of l in the library, and x is the number of l in the active duplexes, then the p-value for l is calculated as follows:

$$p\text{-value} = P(X \ge x) = \sum_{i=0}^{\min(n,k)} p(i) - \sum_{i=0}^{x-1} p(i)$$

where p(i) is the hypergeometric probability of observing exactly i occurrences of l in the active duplexes.

b) Seed heptamer enrichment in 3’UTR sequences

The seed heptamer enrichment was determined in the 3’UTR sequences, and the percent (%) 3’UTR enrichment is calculated as follows:

$$\%\ 3'\text{UTR enrichment} = \frac{\text{number of active seed matches in } 3'\text{UTR sequences}}{\text{total number of } 3'\text{UTR sequences } (> 10\ \text{nt})}$$

Multiple seed heptamer matches within a single 3’UTR sequence were considered. We calculated the % 3’UTR enrichment of the unique seed heptamers obtained from the four RNAi libraries. The distribution plot for the

Table 1. Duplex Coverage, Frequency and Library Attributes of Four Analyzed RNAi Libraries

Provider | Library | Technology | Coverage | Duplex Coverage | Duplex Frequency (%) | Validated Duplexes (%)
Ambion | Silencer Select | siRNA duplex | Genome-wide (21,565 genes) | 1 duplex | 0.02 | 4
 | | | | 2 duplexes | 0.03 |
 | | | | 3 duplexes | 99.83 |
 | | | | > 3 duplexes | 0.12 |
MSKCC | MSK | siRNA duplex | Druggable Genome (6,016 genes) | 1 duplex | 6.05 | Unknown
 | | | | 2 duplexes | 8.99 |
 | | | | 3 duplexes | 64.33 |
 | | | | > 3 duplexes | 20.63 |
Sigma-Aldrich | Druggable Genome | siRNA duplex | Druggable Genome (6,623 genes) | 1 duplex | 0.39 | Unknown
 | | | | 2 duplexes | 0.95 |
 | | | | 3 duplexes | 97.57 |
 | | | | > 3 duplexes | 1.09 |
Sigma-Aldrich | TRC 1.0 | shRNA hairpin | Genome-wide (16,039 genes) | 1 hairpin | 0.16 | 28
 | | | | 2 hairpins | 0.41 |
 | | | | 3 hairpins | 1.63 |
 | | | | 4 hairpins | 9.95 |
 | | | | 5 hairpins | 84.19 |
 | | | | > 5 hairpins | 3.66 |
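The seed heptamer enrichment test described under OTE Filtering (a) can be sketched in Python as below. This is an illustrative reading of the stated hypergeometric test, not the authors' code: the function names are hypothetical, the symbols N, k, n and x follow the definitions given in the text, and the upper tail is summed directly from x to min(n, k), which is algebraically the same as the difference of the two sums in the formula.

```python
from math import comb


def hypergeom_pmf(i, N, k, n):
    """P(X = i): exactly i copies of a seed among n actives drawn from N,
    where the library contains k copies of that seed."""
    return comb(k, i) * comb(N - k, n - i) / comb(N, n)


def seed_enrichment_pvalue(x, N, k, n):
    """Upper-tail probability P(X >= x) for a seed observed x times in the actives."""
    return sum(hypergeom_pmf(i, N, k, n) for i in range(x, min(n, k) + 1))


# Hypothetical toy numbers: a library of 10 seeds, 4 of them the seed of interest,
# 3 active duplexes, 2 of which carry that seed.
p = seed_enrichment_pvalue(x=2, N=10, k=4, n=3)
print(p)
print("over-represented" if p < 0.05 else "not significant at 0.05")
```

Exact integer binomial coefficients keep the sketch free of external dependencies; for library-scale values of N and n, a dedicated statistics library would likely be faster.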