Tools for genetics and genomics: Gene expression profiling

Authors:
Katrina Steiling, MD, MSc
Stephanie Christenson, MD, MAS
Section Editor:
Benjamin A Raby, MD, MPH
Deputy Editor:
Jennifer S Tirnauer, MD
Literature review current through: Dec 2022. | This topic last updated: Apr 02, 2021.

INTRODUCTION — The genetic basis for disease is determined by the inheritance of genes containing specific sequences of DNA. The phenotypic expression of these genes, through the synthesis of specific proteins, involves interaction with environmental signals that trigger activation of particular genes.

According to the central dogma of biology, ribonucleic acid (RNA) is transcribed from a DNA template; messenger RNA (mRNA) is then translated into protein (figure 1). Transcription and translation underlie gene expression. (See "Basic genetics concepts: DNA regulation and gene expression", section on 'Gene expression'.)

Approximately 3 to 5 percent of genes are active in a particular cell, even though all cells have the same information contained in their DNA. Most of the genome is selectively repressed, a property that is governed by the regulation of gene expression, mostly at the level of transcription (ie, the production of messenger RNA from the DNA). In response to a cellular perturbation, changes in gene expression take place that result in the expression of hundreds of gene products and the suppression of others. This molecular heterogeneity can affect when and how a disease presents clinically in an individual with genetic predisposition to a condition and how individuals with a given disease will respond to specific treatments.

Analyses of gene expression can be clinically useful for disease classification, diagnosis, prognosis, and tailoring treatment to underlying genetic determinants of pharmacologic response.

This topic will focus on the role of mRNA in the cell, platforms for profiling mRNA expression, the challenges in interpreting the data from these analyses, and the emerging clinical applications of gene expression measurements.

Other molecular tools for evaluating genetic disorders are presented in separate topic reviews:

Cytogenetics – (See "Tools for genetics and genomics: Cytogenetics and molecular genetics".)

PCR – (See "Tools for genetics and genomics: Polymerase chain reaction".)

Next generation sequencing – (See "Next-generation DNA sequencing (NGS): Principles and clinical applications".)

RNA IN CELL FUNCTION — There are several classes of ribonucleic acid (RNA). Messenger RNA (mRNA) is the RNA that is translated into protein. Non-coding RNAs are not translated into protein and serve other functions in the cell. Classes of non-coding RNAs include the following [1-6]:

Transfer RNAs (tRNAs)

Ribosomal RNAs (rRNAs)

Small nuclear RNAs (snRNAs)

Small nucleolar RNAs (snoRNAs)

MicroRNAs

Piwi-interacting RNAs (piRNAs)

Long noncoding RNAs (lncRNAs), which include the subclass large intergenic non-coding RNAs (lincRNAs)

mRNA accounts for approximately 1 percent of the total RNA in a cell [7] and is transcribed from approximately 20,000 to 25,000 protein-coding genes in the human genome [8]. After mRNA is transcribed from DNA, it typically undergoes further modifications including the addition of a methyl-guanosine cap (5' cap), the addition of a series of adenines to the 3' end of RNA (poly-A tail), and the splicing out of introns (figure 2) [1]. mRNA is then transported from the nucleus to the cytoplasm where it is translated into protein. mRNA serves as a transient intermediate between DNA and protein and is degraded in minutes to hours [1].

MicroRNAs (miRNAs) are small, endogenous, non-coding RNAs (length 18 to 24 nucleotides) that regulate gene expression by binding to the untranslated regions of mRNAs and inducing mRNA degradation or inhibiting protein translation, in turn reducing gene expression. Profiling of miRNAs can be done using similar methods to those used for mRNAs. A searchable database of characterized miRNAs is available (www.mirbase.org/).

lncRNAs are non-coding RNAs, defined by a length of >200 nucleotides, that have diverse functions. lincRNAs are a subset of lncRNAs defined by the lack of overlap with protein-coding genes [5]. However, protein-coding genes can also produce non-coding transcript variants, further increasing lncRNA diversity. The functional relevance of the vast majority of the >19,000 lncRNAs is unknown, but the ability of different lncRNAs to adopt various structures and participate in various interactions seems to influence many cellular processes [6,9]. As with miRNAs, lncRNAs can be profiled using techniques similar to those used for mRNAs.

MEASURING GENE EXPRESSION — Since mRNA represents the functional bridge between DNA and protein, alterations in mRNA may serve as markers for the activation or inhibition of a particular gene.

The challenge in measuring RNA relates to the susceptibility of RNAs to degradation by ribonucleases (RNases). Methods of RNA detection take advantage of the single stranded structure of RNA and its complementarity to the DNA from which it was transcribed.

Expression profiling of single genes or small gene panels — Prior to the development of microarray and whole genome sequencing technology, available methods of measuring gene expression included:

Northern blot (see 'Northern blot' below)

Ribonuclease protection assay (see 'Ribonuclease protection assay' below)

In-situ hybridization (see 'In-situ hybridization' below)

Reverse-transcription quantitative polymerase chain reaction (RT-PCR) (see 'Real-time reverse transcription polymerase chain reaction' below)

Spotted cDNA arrays (see 'Spotted cDNA arrays' below)

A comparison of these methods is summarized in a table (table 1).

Northern blot — Northern blots allow the determination of both the presence of an RNA molecule and its size [10]. RNA molecules from a patient sample are first separated based on size using gel electrophoresis. The size-separated RNA molecules are then transferred and cross-linked to a nylon membrane. The RNA of interest is detected by incubating the membrane with a labeled single-stranded DNA probe that is complementary to this RNA. Probes bound to the RNA of interest can then be detected using chemiluminescence or autoradiography.

Ribonuclease protection assay — Whereas the Northern blot uses complementary DNA probes, the ribonuclease protection assay (RPA) uses antisense RNA probes, referred to as riboprobes [7]. These probes are single-stranded radiolabeled RNA molecules complementary to the RNA of interest. The riboprobe is incubated with the RNA from the sample and binds to complementary RNA to form double-stranded RNA complexes. Incubating the mixture with ribonucleases degrades unbound single-stranded RNA from both the sample and excess probe. The remaining double-stranded RNA complexes are size-separated by electrophoresis and detected by autoradiography.

In-situ hybridization — In-situ hybridization (ISH) uses a labeled nucleic acid probe to detect complementary nucleic acid sequences in a tissue section. ISH can localize the RNA of interest at the anatomic or cellular level. The tissue section is fixed to preserve tissue morphology and nucleic acid integrity [11,12]. The sample is then treated with proteases to eliminate proteins bound to the RNA of interest [11,12]. A labeled probe is hybridized to the sample and detected using autoradiography or chemiluminescence [12]. In-situ hybridization using a fluorescently labeled probe is also called fluorescence in-situ hybridization (FISH). The use of FISH to detect gene mutations is discussed separately. (See "Tools for genetics and genomics: Cytogenetics and molecular genetics", section on 'Fluorescence in situ hybridization'.)

Real-time reverse transcription polymerase chain reaction — Real-time reverse transcription polymerase chain reaction (RT-PCR) is a relatively simple approach that can be used to assay small or large numbers of genes from a single sample [13]. After isolating RNA from a sample, complementary DNAs (cDNAs) are synthesized by reverse transcription with an RNA-dependent DNA polymerase. This cDNA mixture is then combined with a DNA-dependent DNA polymerase and fluorescently-labeled oligonucleotide primers [14]. These primers are short sequences of nucleotides complementary to a portion of the cDNA and allow amplification. Fluorescence increases as the cDNA of interest is amplified with PCR. The fluorescence intensity is monitored and the total number of PCR cycles is counted [7].

The point at which the PCR cycler can distinguish fluorescence related to gene amplification from background is the cycle threshold, and this number can be used to estimate the relative starting quantity of the RNA of interest [13]. Careful primer selection is required to prevent amplification of related genes [7]. (See "Tools for genetics and genomics: Polymerase chain reaction".)
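
As a rough illustration of how cycle threshold values can be converted into a relative quantity, the following sketch uses the comparative cycle threshold (2^-ΔΔCt) calculation. This calculation is not described above and is introduced here as an assumption; it presumes that the PCR product approximately doubles with each cycle and that a stably expressed reference gene has been measured in both samples. The function name and the cycle threshold values are hypothetical.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a target gene in a test sample relative to a control sample.

    Assumes the product doubles every cycle and that the reference gene
    is expressed equally in both samples (both are assumptions).
    """
    # Normalize the target gene to the reference gene within each sample.
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    # Compare the normalized values between the two samples.
    delta_delta = delta_test - delta_ctrl
    # A lower cycle threshold means more starting template, hence the minus sign.
    return 2.0 ** (-delta_delta)


# Hypothetical values: the target crosses the threshold 2 cycles earlier
# in the test sample, while the reference gene is unchanged.
fold = relative_expression(22.0, 18.0, 24.0, 18.0)
print(fold)  # 4.0
```

In this hypothetical example, a difference of two cycles corresponds to an approximately fourfold higher starting quantity of the RNA of interest in the test sample.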

Spotted cDNA arrays — Unlike Northern blotting, RPA, or ISH, spotted cDNA arrays are capable of testing the relative expression levels, between two conditions, of several hundred genes. With increased knowledge of which sequences are expressed from the genome, it became possible to create cDNA probes targeting the expressed DNA sequences from which RNA is transcribed.

cDNA probes are amplified using PCR and spotted onto a glass slide [15]. RNA is then isolated from two samples representing different conditions, and the mRNA from each sample is labeled with one of two fluorescent dyes (green or red) [16]. The samples are then mixed together and co-hybridized to the cDNA probes on the glass slide [15]. This approach directly compares gene expression in the first condition to the second condition and allows the detection of as many genes as there are probes on the array. However, reproducibility is limited across arrays because of the need to manually spot probes on slides.

Genome-wide gene expression profiling — Platforms for profiling gene expression take advantage of increased knowledge of the sequence of the human genome and require smaller quantities of starting RNA. Current platforms for profiling gene expression include:

Oligonucleotide arrays (microarrays)

Transcriptome sequencing

A comparison of these methods is summarized in a table (table 1).

While these technologies were originally developed using samples prepared from "bulk" tissues, in which RNA is isolated from samples composed of multiple cell types, expression profiling at the single-cell level is also available. This approach to expression profiling provides unique insights into how individual cells and cell types contribute to human health and disease beyond that which is possible with bulk sequencing. For example, sequencing of clinical samples with an admixture of cell types will not be able to determine whether gene expression differences between healthy tissue and diseased tissue are due to changes in the abundance of cell types or to changes in the gene expression levels in a specific cell type. The most common profiling technique for individual cells, single cell RNA sequencing (scRNA-seq), is described below. (See 'Single cell sequencing' below.)

Oligonucleotide arrays (microarrays) — Oligonucleotide arrays operate on a similar principle to spotted cDNA arrays, but differ in how they are produced. Rather than spotting probes onto a glass slide, the short probes are synthesized directly on the slide [17,18]. Depending on the commercial manufacturer, probes vary from approximately 20 to 60 base pairs in length. Several types of arrays are commercially available.

Sample preparation begins with the isolation of RNA from the tissue of interest, resulting in an extract that contains transcripts from all of the genes being transcribed in the tissue at the time the RNA is isolated. The RNA is then reverse-transcribed into cDNA and amplified using the polymerase chain reaction (PCR). Finally, a biotin label is incorporated through an in vitro transcription process, which converts cDNA into labeled cRNA.

A single sample of the labeled cRNA is applied to each array. Hybridization occurs between the labeled cRNA from the sample and complementary probes on the array. This is followed by binding to an avidin-conjugated fluorophore and a washing step that removes any unbound material. The fluorophore is excited by a laser scanner coupled to a computer that captures an image of the fluorophores linked to hybridized target molecules on the array, thus enabling the detection of the expression of thousands of genes simultaneously.

In general, the greater the amount of mRNA from a particular gene (ie, the higher the expression of that gene), the more fluorescently-labeled material corresponding to that gene will bind to complementary probes on the array. Background fluorescence or nonspecific binding may limit detection of lowly expressed transcripts. Probe-based detection for gene expression limits analysis to genes that are known.

Transcriptome sequencing — An alternative for measuring gene expression is the direct sequencing and quantification of RNA molecules. This method of measuring gene expression has also been referred to as "RNA-seq," "massively parallel sequencing," "next-generation sequencing (NGS)," or "deep sequencing," and several commercial platforms are available. The details of each system vary. In general, the sample is prepared so that many sequencing reactions occur simultaneously and yield millions of RNA sequence reads obtained by laser scanning [19]. NGS is discussed in more detail separately. (See "Next-generation DNA sequencing (NGS): Principles and clinical applications".)

Transcriptome sequencing allows improved detection of low abundance transcripts, as well as detection of novel transcripts and polymorphisms within a transcript's sequence. Advances in sample processing techniques also allow for preservation of the identity of the sense and antisense strands [20].

Single cell sequencing — Multiple protocols are available for single cell sequencing. In general, these systems start with isolation of single cells separated manually (eg, by serial or microwell dilution), by fluorescence-activated cell sorting (FACS), or by automated microfluidics-based technologies [21-23]. This is often followed by a confirmatory procedure such as microscopy to ensure that single cells were indeed isolated. This helps to prevent spurious conclusions based on the evaluation of chambers that are either empty or contain multiple cells. After separation, cells are lysed, the RNA fraction is converted to cDNA by reverse transcription, and the cDNA is amplified and sequenced [21,23]. Microfluidics and other microwell-plate-based technologies, along with transcript barcoding, which tags the cell of origin, allow for parallel sequencing of large numbers of individual cells [24-26].
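
The role of transcript barcoding can be sketched as follows, assuming that each sequencing read has already been aligned to a gene and carries the barcode of its cell of origin. The input format, function name, barcodes, and gene names are hypothetical and do not reflect any specific commercial platform.

```python
from collections import defaultdict


def count_by_cell(tagged_reads):
    """Build a per-cell table of read counts.

    tagged_reads: iterable of (cell_barcode, gene_name) pairs, one pair
    per aligned read (a hypothetical, simplified input format).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, gene in tagged_reads:
        # Reads that share a barcode are attributed to the same cell.
        counts[barcode][gene] += 1
    # Convert nested defaultdicts to plain dictionaries for readability.
    return {barcode: dict(genes) for barcode, genes in counts.items()}


reads = [
    ("AAAC", "GENE_A"),
    ("AAAC", "GENE_A"),
    ("AAAC", "GENE_B"),
    ("GGTT", "GENE_B"),
]
cell_counts = count_by_cell(reads)
print(cell_counts)
```

Because the barcode travels with each read, cells can be pooled and sequenced together and then separated computationally, which is what permits the parallel processing of large numbers of cells.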

Genome-wide gene expression analysis and interpretation in bulk tissue — Transcriptome sequencing has begun to surpass microarrays as the platform most often used for gene expression profiling of clinical specimens. As the cost of sequencing declines, the use of this platform is expected to increase. Both sequencing and microarrays can assay large numbers of genes with relatively high throughput. Typically, investigators involved with transcriptome profiling experiments are interested in comparing gene expression across different conditions [27]. While there are many approaches to the data analysis in order to accomplish this goal, there are generally several analytical steps that must first be taken (figure 3).

There are four general considerations in approaching transcriptome profiling data analysis and interpretation:

Preprocessing of raw data (see 'Preprocessing of raw data' below)

Data storage and analysis (see 'Data storage and analysis' below)

Multiple comparison problem (see 'The multiple comparison problem' below)

Biologic interpretation (see 'Biologic interpretation' below)

Preprocessing of raw data — Preprocessing prepares the raw data for statistical analysis. Preprocessing steps include quantification of expression levels and quality assessment of the raw data. The expression levels of genes are quantified differently for microarrays and transcriptome sequencing. For microarrays, a process called normalization adjusts measured fluorescence intensities so that they are comparable across different experiments. Quality assessment eliminates low-quality microarrays, poorly aligning sequencing reads, or outlier measurements.

Quantification of expression — For microarray data analysis, each microarray can be considered a separate experiment that contains slightly different amounts of starting RNA and different labeling efficiencies [27]. Data normalization adjusts the fluorescence intensities representing the amount of RNA bound to each probe so that these intensities are comparable across different arrays.

There are several methods for normalizing microarray data, including:

Scaling – Adjusts intensities by a constant factor so that the average expression level across microarrays is similar.

Quantile normalization – Adjusts the distribution of intensities across microarrays. As illustrated in the figure (figure 4), this is accomplished by ranking the probe intensities from highest to lowest for each array. A numerical value is assigned to represent this intensity on an individual array based on the behavior across all arrays and the rank of that probe on the individual array.

LOWESS – Locally weighted scatterplot smoothing (LOWESS) adjusts the brightness or darkness of different fluorescent labels for two-color array experiments.
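
The quantile normalization procedure described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: all arrays contain the same probes in the same order, and tied intensities are simply kept in their original order rather than averaged, as more complete implementations may do. The function name and intensity values are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays.

    arrays: list of equal-length lists, one list of probe intensities
    per microarray, with probes in the same order on every array.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average the values found at each rank across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # order[rank] is the index of the probe holding that rank on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the cross-array mean for its rank.
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays measuring the same four probes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After normalization, every array contains the same set of values and differs only in which probe holds which value, so the distribution of intensities is identical across arrays while the within-array ranking of probes is preserved.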

For transcriptome sequencing, each sample generates millions of sequencing reads, which are used to estimate expression levels of each gene or isoform. First, high-quality sequencing reads are aligned to the reference genome using one of many available sequence aligners [28]. Next, expression levels of each gene or isoform are calculated, usually by counting the reads aligned to a particular gene or isoform. For bulk sequencing, several methods are available for normalization, including:

Gene/transcript length normalization – For the most common of this family of methods, reads in a sample are first normalized for sequencing depth. The depth-normalized reads are then divided by the length of the corresponding gene or isoform in kilobases. This yields reads per kilobase per million reads (RPKM) [28]. This method has largely fallen out of favor for bulk-sequencing analyses due to the biases it generates in between-sample differential expression analyses [29].

Trimmed mean of M-values (TMM) – This method uses the weighted average of log expression ratios for each gene calculated for all samples against one reference sample (M-values). Genes with outlier values are thrown out, and a weighted average for all M-values is set for each sample [29,30].

DESeq – A per-sample scaling factor is calculated as the median of the ratios of each gene's read count over its geometric mean across all samples [31].

Variance modeling at the observational level (voom) – In the voom method, log counts are first normalized for sequencing depth [32]. Then a precision weight incorporating the mean-variance trend for each normalized observation is generated and both the normalized counts and precision weights are entered into the analysis pipeline. This method is particularly useful for small sample sizes or datasets where the between-sample sequencing depth is highly variable.
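
Two of the calculations above, RPKM and the median-of-ratios scaling factor, can be illustrated with a short sketch. The functions below are simplified illustrations written for this purpose and are not the interface of any published software package; the counts and gene lengths are hypothetical, and the handling of genes with zero counts (skipping them when computing scaling factors) is an assumption.

```python
import math
from statistics import median


def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase per million reads for one sample.

    counts: read counts per gene; gene_lengths_bp: gene lengths in base pairs.
    """
    total_millions = sum(counts) / 1e6          # sequencing depth, in millions
    return [
        count / total_millions / (length / 1e3)  # depth, then length in kilobases
        for count, length in zip(counts, gene_lengths_bp)
    ]


def median_of_ratios_size_factors(count_matrix):
    """Per-sample scaling factors from the median of count-to-geometric-mean ratios.

    count_matrix: list of samples, each a list of read counts per gene,
    with genes in the same order in every sample.
    """
    n_samples = len(count_matrix)
    n_genes = len(count_matrix[0])
    log_geo_means = []
    for g in range(n_genes):
        values = [sample[g] for sample in count_matrix]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / n_samples)
        else:
            log_geo_means.append(None)  # assumption: skip genes with any zero count
    factors = []
    for sample in count_matrix:
        log_ratios = [
            math.log(sample[g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical data: sample B was sequenced twice as deeply as sample A.
sample_a = [100, 200, 300, 400]
sample_b = [200, 400, 600, 800]
factors = median_of_ratios_size_factors([sample_a, sample_b])
print(factors)
print(rpkm([500000, 500000], [1000, 2000]))
```

Dividing each sample's counts by its scaling factor places the samples on a common scale; in this hypothetical example, the factor for sample B is twice that for sample A, reflecting its greater sequencing depth.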

Single cell sequencing uses similar processes for expression quantification as bulk sequencing, with the caveat that the normalization procedure must account for the high proportion of zero read counts. This so-called "zero-inflation" is a result of two factors:

Not all cells express the same genes.

Relatively low-abundance transcripts frequently are not captured/sequenced in any given cell.

While many scRNA-seq normalization methods use scaling factors as described for bulk sequencing and microarrays, additional methods have been developed to manage zero-inflation and the other biases inherent in single cell sequencing data [33]. This is an active area of method development and study. (See 'Single cell sequencing' above.)

Quality assessment — Quality assessment occurs both before and after data normalization.

Pre-normalization quality assessment evaluates the quality of the raw data before preprocessing. For microarrays, the array itself is inspected to ensure there are no bubbles, scratches, or other artifacts on the array. Some commercial arrays also contain controls inserted during sample processing ("spike-in" controls) to ensure that all steps leading to the hybridization were successful. For transcriptome sequencing, each base pair call and individual sequencing read is considered a separate experiment that must be quality controlled [28]. This is performed with tools such as FastQC or NGSQC. Sequencing reads may then be "trimmed" to remove the leading or trailing adapter sequences added during sample processing or to remove lower-quality base pairs at the ends of each sequencing read [28,34].
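
Read trimming can be sketched as follows. This is a simplified illustration that assumes quality scores are encoded as Phred values with an ASCII offset of 33, that the adapter appears as an exact match, and that a fixed quality cutoff is applied from the 3' end; dedicated trimming tools are typically more tolerant of mismatches and more sophisticated. The function names, adapter sequence, read, and cutoff are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into numeric scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, adapter=None, min_quality=20):
    """Remove a trailing adapter and low-quality bases from the 3' end of a read."""
    # Remove an exact copy of the adapter and everything 3' of it, if present.
    if adapter:
        position = sequence.find(adapter)
        if position != -1:
            sequence = sequence[:position]
            quality_string = quality_string[:position]
    # Walk back from the 3' end while base quality is below the cutoff.
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 8 bases of insert followed by a 10-base adapter.
# 'I' encodes a score of 40 and '#' a score of 2 under the assumed offset.
read = "ACGTACGTAGATCGGAAG"
quality = "IIIIII##IIIIIIIIII"
trimmed = trim_read(read, quality, adapter="AGATCGGAAG")
print(trimmed)  # ('ACGTAC', 'IIIIII')
```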

Post-normalization quality assessment evaluates processed data from a microarray or transcriptome sequencing sample relative to others in the experiment. This helps to identify outlier samples or differences in batches of microarrays or sequencing. Samples identified as significantly different from others can be adjusted statistically or excluded from the analysis.

Data transformation — Many common statistical procedures assume a normal and continuous distribution of data. Gene expression levels from microarrays or transcriptome sequencing can be mathematically transformed, often using a logarithmic scale, so that they more closely approximate a normal distribution. Transcriptome sequencing data, which consist of read counts rather than continuous numerical values, can be filtered to include only higher read counts, which may approximate continuous data. Alternatively, sequencing data can be modeled using a distribution more suitable for count data, such as the negative binomial distribution. Preprocessing can also include filtering out low-quality probe sets or genes with low variability across all samples in the experiment.
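
A logarithmic transformation followed by removal of genes with low variability might look like the sketch below. The pseudocount added before taking the logarithm (so that zero counts remain defined) and the variance cutoff are assumptions chosen for illustration, as are the function names and data.

```python
import math
from statistics import pvariance


def log2_transform(values, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount is an assumption."""
    return [math.log2(v + pseudocount) for v in values]


def filter_low_variance(expression, min_variance):
    """Keep genes whose variance across samples meets a chosen cutoff.

    expression: dict mapping gene name to a list of values across samples.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if pvariance(values) >= min_variance
    }


# Hypothetical read counts for three genes across four samples.
counts = {
    "GENE_A": [0, 1, 3, 7],
    "GENE_B": [15, 15, 15, 15],   # no variability across samples
    "GENE_C": [3, 63, 3, 63],
}
log_expr = {gene: log2_transform(values) for gene, values in counts.items()}
kept = filter_low_variance(log_expr, min_variance=0.5)
print(sorted(kept))  # ['GENE_A', 'GENE_C']
```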

Data storage and analysis — Microarray and transcriptome sequencing experiments require computational tools to store raw data, analyze gene expression, and ensure uniformity across different laboratories.

Data storage — Scanning an oligonucleotide array or sequencing flow cell with a laser scanner generates fluorescence intensities that are captured in an image file. Most scientific journals specify that raw data be made publicly available as a requirement for publication [35]. A typical microarray raw data file, called a CEL file, is 0.1 to 1 gigabytes (GB) per array [19]. A typical text-based raw sequencing data file, called a FASTQ file, is approximately 1 to 5 GB per sample. Thus, these experiments generate a large amount of data that must be stored. Beyond the raw data, stored files also include the clinical variables associated with each sample and the normalized, quality-assessed, preprocessed expression levels for each array. This is often accomplished with the use of a database capable of storing and integrating both gene expression data and clinical variables.

Data analysis — There are several possible levels of data analysis, ranging from simple statistical tests that can be performed with commercial software packages, to advanced analyses and the development of novel algorithms. Advanced analyses and novel algorithms are implemented with a variety of programming languages, such as Perl and Python, and computational software, such as R [36] and Matlab [37]. The flexibility to write, modify, and share algorithms using these tools makes them particularly well suited for gene expression data analysis.

Differential expression – One of the most common analyses performed on gene expression data is to determine which genes are altered in one condition as compared with another. This can be accomplished by performing a t-test, ANOVA, or linear model for continuous data, or negative binomial models for count data.

Class prediction – In this type of analysis, samples from two conditions are split into a training set and a test set. A list of genes that distinguishes the two conditions is derived from the training set of samples, and the accuracy of this gene expression signature is assessed on the test set of samples.

Class discovery – Genome wide gene expression data can be used to explore novel molecular phenotypes. By evaluating genes across all samples regardless of their clinical phenotype, it can be determined which samples are most closely correlated with each other based on gene expression alone. Samples that share similar patterns of gene expression may represent previously unrecognized subtypes of the disease.

Network analysis – The number of genes assayed by microarrays and transcriptome sequencing allows the entire dataset to be harnessed to make new predictions about how genes might interact. These approaches often operate on the premise that highly correlated genes in a network of gene-gene interactions are involved in the same or overlapping biologic pathways. One approach, Weighted Gene Co-Expression Network Analysis (WGCNA), works by clustering highly correlated genes, defined as "modules," within a gene network [38]. Genome-wide gene expression data can also be integrated with other data types, such as DNA methylation, proteomics, and metabolomics [34].
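
As an illustration of the first of these analyses, differential expression between two conditions can be sketched with a per-gene t-statistic. The example below computes an unequal-variance (Welch) t-statistic and the difference in mean log2 expression for each gene; the choice of the Welch form is an assumption, and p-values are omitted because they would require a statistical library. The function names, labels, and expression values are hypothetical, and each group is assumed to have at least two samples with nonzero variance.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Unequal-variance t-statistic comparing two groups of values."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression, labels):
    """Rank genes by the magnitude of their t-statistic.

    expression: dict mapping gene name to log2 values, one per sample.
    labels: list of "case" or "control", one per sample, in the same order.
    Returns (gene, difference in mean log2 expression, t-statistic) tuples.
    """
    results = []
    for gene, values in expression.items():
        case = [v for v, label in zip(values, labels) if label == "case"]
        control = [v for v, label in zip(values, labels) if label == "control"]
        results.append((gene, mean(case) - mean(control), welch_t(case, control)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


labels = ["case", "case", "case", "control", "control", "control"]
expression = {
    "GENE_FLAT": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],
    "GENE_UP": [8.0, 9.0, 10.0, 5.0, 6.0, 7.0],
}
ranked = rank_genes(expression, labels)
for gene, log2_difference, t_stat in ranked:
    print(gene, log2_difference, round(t_stat, 3))
```

Because a test of this kind is repeated for every gene assayed, the resulting statistics must be interpreted in light of the number of tests performed.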

In addition to these general types of analyses of gene expression data, transcriptome sequencing also allows for more advanced analyses, such as discovery of novel transcripts or isoforms, detection of alternative splicing, and de novo reconstruction of the transcriptome [34].

Single cell analysis — Single cell RNA sequencing (scRNA-seq) enables the identification of specific cells or cell types and their functions by interrogating cell-specific molecular signatures [21]. Cell type identification is often done using clustering methods that exploit latent-class modeling to identify cells with similar gene expression patterns. Modified versions of tools for bulk sequencing enable differential expression and network analyses to characterize cell-specific differences in gene expression and function. Methods for scRNA-seq analysis are still in their infancy and are an active field of study. (See 'Single cell sequencing' above.)

The multiple comparison problem — Statistical analyses of several thousand genes pose unique problems in the interpretation of the statistical results because of the large number of tests performed. This is because every statistical test has a small possibility of leading to the conclusion that an association is present when no such association actually exists, and when thousands of genes are tested with a microarray, an unacceptably high number of false-positive associations may be produced. An overview of statistical principles relevant to the multiple comparison problem is presented separately. (See "Proof, p-values, and hypothesis testing".)
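
The scale of the problem can be made concrete with a short calculation, followed by one commonly used correction. The Benjamini-Hochberg procedure for controlling the false discovery rate is not described above and is introduced here as an assumption; the function names, number of genes, thresholds, and p-values are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no gene is truly associated."""
    return n_tests * alpha


def benjamini_hochberg(p_values, fdr=0.05):
    """Mark which tests are called significant at a given false discovery rate.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    # Every test ranked at or below the cutoff is called significant.
    for index in order[:cutoff_rank]:
        significant[index] = True
    return significant


# Testing 20,000 genes at a threshold of 0.05 with no true associations.
print(expected_false_positives(20000, 0.05))

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
calls = benjamini_hochberg(p_values, fdr=0.05)
print(calls)  # [True, True, False, False, False]
```

In this hypothetical example, four of the five p-values fall below 0.05, but only two remain significant after adjustment for the number of tests performed.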

Biologic interpretation — The final step in gene expression data analysis is to interpret the results in a biologically meaningful context. Making biologic sense of a whole transcriptome profiling-derived gene list is one of the more challenging aspects of the analysis. While there are many strategies for accomplishing this goal, two broad approaches are discussed below. Additional studies are often required to validate biologic predictions that are made from the microarray or sequencing data.

Comparison with other genome-wide gene expression datasets — Several tools exist for comparing gene expression datasets, including large databases containing gene expression data to look for a shared gene expression signature [39,40], alternative gene probes [41], and analytic tools that incorporate phenotypic associations [42-44].

Enrichment ranking — Gene set enrichment analysis (GSEA) is a method by which gene expression data is ranked by association with phenotypes, and is used as a means to identify biologically relevant pathways [42,43]. Other techniques provide other mechanisms to enrich pathways or functional categories [45] or to visualize previously published interactions between genes of interest [46]. Data visualization using heat maps, which organize samples by columns and genes by rows according to similarity in gene expression, is also useful for determining which groups of genes or samples share similar patterns of expression. Gene set variation analysis (GSVA) uses a similar approach to identify pathway enrichment in a gene-expression dataset [47].
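
The idea of ranking-based enrichment can be sketched with a simplified running-sum score. This is an unweighted illustration only and is not the published GSEA implementation, which weights genes by the strength of their association with the phenotype and assesses significance by permutation. The function name, gene names, and gene sets are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a gene set.

    ranked_genes: genes ordered from most to least associated with a phenotype.
    gene_set: set of genes belonging to a pathway of interest.
    Returns the running-sum value with the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up when a gene is in the set, down when it is not.
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
top_score = enrichment_score(ranked, {"G1", "G2"})
bottom_score = enrichment_score(ranked, {"G5", "G6"})
print(top_score)     # 1.0
print(bottom_score)  # -1.0
```

A gene set concentrated at the top of the ranked list yields a strongly positive score, and a set concentrated at the bottom yields a strongly negative score; sets scattered throughout the list yield scores closer to zero.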

OVERVIEW OF CLINICAL APPLICATIONS — Gene expression profiling within clinical specimens has the potential to be used for disease screening, diagnosis, prognostication, and optimizing treatment regimens. As the platforms for measuring gene expression continue to evolve, personalized approaches to the diagnosis and treatment of complex human disease will increasingly find their place in routine clinical practice.

The promise of using gene expression profiling to identify individuals at risk for disease has yet to be fully reached. However, in certain circumstances, this tool has been incorporated into routine clinical evaluation or prognostication.

Diagnosis — Examples of the use of gene expression profiling to target selected patients for more conservative monitoring and less invasive diagnostic testing include the following:

The Oncotype Dx assay is used to guide the evaluation and management of subsets of patients with breast cancer. (See "Prognostic and predictive factors in metastatic breast cancer", section on 'Predictive factors' and "Prognostic and predictive factors in early, non-metastatic breast cancer", section on 'Receptor status' and "Deciding when to use adjuvant chemotherapy for hormone receptor-positive, HER2-negative breast cancer".)

Routine endomyocardial biopsy surveillance is used for cardiac transplant patients to detect the presence of acute cellular rejection. A study of gene expression profiling of ribonucleic acid (RNA) from peripheral blood mononuclear cells in heart transplant recipients revealed that an 11-gene expression profile could predict the presence of rejection [48]. Use of this gene expression profile has the potential to decrease the number of transplant patients who would need to undergo invasive myocardial biopsy to confirm the diagnosis of rejection.

Smokers suspected of having lung cancer on the basis of an abnormal chest computed tomography (CT) scan often need to undergo invasive diagnostic procedures, beyond bronchoscopy, to achieve a final diagnosis. A gene expression signature resulting from gene expression profiling of histologically normal airway epithelium obtained during bronchoscopy was capable of distinguishing between smokers with and without lung cancer [49]. Combining results of this gene expression biomarker with clinical variables, in an integrated clinico-genomic model, improved the discriminatory potential to predict lung cancer [50]. Refinement of this signature using an independent dataset and application in a prospective multicenter validation trial resulted in a diagnostic biomarker with a 91 percent negative predictive value in patients with an intermediate pretest probability of lung cancer [51,52]. Gene expression profiling may help to stratify patients with an abnormal chest CT scan who should undergo invasive diagnostic testing for a potential lung cancer and those for whom imaging surveillance would be appropriate. (See "Diagnostic evaluation of the incidental pulmonary nodule" and "Overview of the initial evaluation, diagnosis, and staging of patients with suspected lung cancer".)

Thyroid nodules are frequently evaluated using fine-needle aspiration biopsy, but this approach sometimes yields indeterminate results, in which case thyroid surgery may be required to achieve a definitive diagnosis. A diagnostic biomarker resulting from gene expression profiling of indeterminate thyroid nodule aspirates demonstrated a 92 percent sensitivity and high negative predictive value, suggesting that patients with indeterminate results from a fine-needle aspirate of a thyroid nodule can be monitored less invasively, potentially avoiding unnecessary surgery [53].
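
As a rough illustration of why a negative predictive value is always tied to a stated pretest probability, the sketch below applies Bayes' rule to a test with the 92 percent sensitivity quoted above. The specificity and the pretest probabilities are hypothetical values chosen only for illustration and are not results from the cited studies; the function name is likewise an assumption.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Return P(no disease | negative test) using Bayes' rule.

    All three arguments are proportions between 0 and 1.
    """
    true_negatives = specificity * (1.0 - prevalence)
    false_negatives = (1.0 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# Sensitivity is the figure quoted in the text.
SENSITIVITY = 0.92
# Hypothetical specificity, assumed for illustration only.
HYPOTHETICAL_SPECIFICITY = 0.50

# Hypothetical pretest probabilities (low, intermediate, high).
for prevalence in (0.10, 0.25, 0.50):
    npv = negative_predictive_value(
        SENSITIVITY, HYPOTHETICAL_SPECIFICITY, prevalence
    )
    print(f"pretest probability {prevalence:.2f}: NPV {npv:.3f}")
```

Under these assumed inputs, the negative predictive value falls as the pretest probability rises, which is one reason such biomarkers are typically reported for a defined risk group rather than for all patients.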

Prognosis — The most advanced application of gene expression profiling is in predicting outcome from disease. The risk of certain therapies might be outweighed by the potential benefit for patients at high risk for relapse or with a poor prognosis, whereas the risks might outweigh the benefits for patients with a relatively good prognosis. Gene expression profiling has been helpful in targeting appropriate therapy for patients with acute leukemia, prostate cancer, colon cancer, breast cancer, lung carcinoma, and lymphoma.

MicroRNAs (miRNAs) have also been found to be dysregulated in a number of solid tumors and hematologic malignancies [54-66]. (See 'RNA in cell function' above.)

A 2012 systematic review of available studies examining associations between miRNA profiling and cancer prognosis found associations between certain miRNAs and poor outcomes, including decreased overall survival, but noted several potential sources of bias [67]. Further work will be needed to validate these findings before they can be used in clinical applications.

SUMMARY

There are a variety of methods for measuring ribonucleic acid (RNA) to evaluate gene expression (table 1). These methods differ in their requirements for the amount of starting RNA, their sensitivity to detect the RNA of interest, and the computational requirements for data analysis. (See 'Measuring gene expression' above.)

While oligonucleotide array profiling and RNA sequencing are commonly used biomarker discovery platforms, there are numerous challenges and pitfalls in the analysis and interpretation of the large volume of data generated. Considerations with data processing, storage, and analysis are discussed above. (See 'Genome-wide gene expression analysis and interpretation in bulk tissue' above.)

Gene expression profiling is emerging as a potential approach for the diagnosis and prognosis of complex human disease. However, a number of important barriers remain, including the need to validate these biomarkers in prospective multicenter studies to demonstrate their reproducibility and accuracy across multiple sites and operators. (See 'Overview of clinical applications' above.)

ACKNOWLEDGMENT — The UpToDate editorial staff acknowledges Avrum Spira, MD, MSc, who contributed to an earlier version of this topic review.

  1. Brown TA. Genomes 3, 3rd ed, Garland Science, 2007.
  2. Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993; 75:843.
  3. Girard A, Sachidanandam R, Hannon GJ, Carmell MA. A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature 2006; 442:199.
  4. Aravin A, Gaidatzis D, Pfeffer S, et al. A novel class of small RNAs bind to MILI protein in mouse testes. Nature 2006; 442:203.
  5. Khalil AM, Guttman M, Huarte M, et al. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci U S A 2009; 106:11667.
  6. Hon CC, Ramilowski JA, Harshbarger J, et al. An atlas of human long non-coding RNAs with accurate 5' ends. Nature 2017; 543:199.
  7. Dvorák Z, Pascussi JM, Modrianský M. Approaches to messenger RNA detection - comparison of methods. Biomed Pap Med Fac Univ Palacky Olomouc Czech Repub 2003; 147:131.
  8. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004; 431:931.
  9. Marchese FP, Raimondi I, Huarte M. The multidimensional mechanisms of long noncoding RNA function. Genome Biol 2017; 18:206.
  10. Alwine JC, Kemp DJ, Stark GR. Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc Natl Acad Sci U S A 1977; 74:5350.
  11. Gall JG, Pardue ML. Formation and detection of RNA-DNA hybrid molecules in cytological preparations. Proc Natl Acad Sci U S A 1969; 63:378.
  12. Jin L, Lloyd RV. In situ hybridization: methods and applications. J Clin Lab Anal 1997; 11:2.
  13. Nolan T, Hands RE, Bustin SA. Quantification of mRNA using real-time RT-PCR. Nat Protoc 2006; 1:1559.
  14. VanGuilder HD, Vrana KE, Freeman WM. Twenty-five years of quantitative PCR for gene expression analysis. Biotechniques 2008; 44:619.
  15. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270:467.
  16. Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet 2002; 32 Suppl:490.
  17. Pease AC, Solas D, Sullivan EJ, et al. Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 1994; 91:5022.
  18. Nuwaysir EF, Huang W, Albert TJ, et al. Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res 2002; 12:1749.
  19. Wilhelm BT, Landry JR. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 2009; 48:249.
  20. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011; 12:87.
  21. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 2015; 16:133.
  22. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med 2018; 50:1.
  23. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016; 17:175.
  24. Islam S, Kjällquist U, Moliner A, et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res 2011; 21:1160.
  25. Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep 2012; 2:666.
  26. Macosko EZ, Basu A, Satija R, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 2015; 161:1202.
  27. Quackenbush J. Microarray data normalization and transformation. Nat Genet 2002; 32 Suppl:496.
  28. Yang IS, Kim S. Analysis of Whole Transcriptome Sequencing Data: Workflow and Software. Genomics Inform 2015; 13:119.
  29. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010; 11:R25.
  30. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26:139.
  31. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014; 15:550.
  32. Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 2014; 15:R29.
  33. Vallejos CA, Risso D, Scialdone A, et al. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat Methods 2017; 14:565.
  34. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016; 17:13.
  35. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29:365.
  36. R Development Core Team. R: A Language and Environment for Statistical Computing 2009. R Foundation for Statistical Computing. Available at: www.R-project.org (Accessed on December 14, 2009).
  37. The Mathworks I. The MathWorks - MATLAB and Simulink for Technical Computing 2009. Available at: www.mathworks.com (Accessed on December 14, 2009).
  38. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008; 9:559.
  39. Barrett T, Troup DB, Wilhite SE, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009; 37:D885.
  40. Gower AC, Spira A, Lenburg ME. Discovering biological connections between experimental conditions based on common patterns of differential gene expression. BMC Bioinformatics 2011; 12:381.
  41. Dai M, Wang P, Boyd AD, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005; 33:e175.
  42. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005; 102:15545.
  43. Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003; 34:267.
  44. Lamb J, Crawford ED, Peck D, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006; 313:1929.
  45. Dennis G Jr, Sherman BT, Hosack DA, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003; 4:P3.
  46. Ingenuity Systems. Ingenuity Pathway Analysis Software 2009. Available at: www.ingenuity.com (Accessed on December 14, 2009).
  47. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 2013; 14:7.
  48. Deng MC, Eisen HJ, Mehra MR, et al. Noninvasive discrimination of rejection in cardiac allograft recipients using gene expression profiling. Am J Transplant 2006; 6:150.
  49. Spira A, Beane JE, Shah V, et al. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat Med 2007; 13:361.
  50. Beane J, Sebastiani P, Whitfield TH, et al. A prediction model for lung cancer diagnosis that integrates genomic and clinical features. Cancer Prev Res (Phila) 2008; 1:56.
  51. Whitney DH, Elashoff MR, Porta-Smith K, et al. Derivation of a bronchial genomic classifier for lung cancer in a prospective study of patients undergoing diagnostic bronchoscopy. BMC Med Genomics 2015; 8:18.
  52. Silvestri GA, Vachani A, Whitney D, et al. A Bronchial Genomic Classifier for the Diagnostic Evaluation of Lung Cancer. N Engl J Med 2015; 373:243.
  53. Alexander EK, Kennedy GC, Baloch ZW, et al. Preoperative diagnosis of benign thyroid nodules with indeterminate cytology. N Engl J Med 2012; 367:705.
  54. Calin GA, Sevignani C, Dumitru CD, et al. Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers. Proc Natl Acad Sci U S A 2004; 101:2999.
  55. Lovat F, Valeri N, Croce CM. MicroRNAs in the pathogenesis of cancer. Semin Oncol 2011; 38:724.
  56. Esquela-Kerscher A, Slack FJ. Oncomirs - microRNAs with a role in cancer. Nat Rev Cancer 2006; 6:259.
  57. Calin GA, Dumitru CD, Shimizu M, et al. Frequent deletions and down-regulation of micro- RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A 2002; 99:15524.
  58. Calin GA, Ferracin M, Cimmino A, et al. A MicroRNA signature associated with prognosis and progression in chronic lymphocytic leukemia. N Engl J Med 2005; 353:1793.
  59. Garzon R, Volinia S, Liu CG, et al. MicroRNA signatures associated with cytogenetics and prognosis in acute myeloid leukemia. Blood 2008; 111:3183.
  60. Yanaihara N, Caplen N, Bowman E, et al. Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell 2006; 9:189.
  61. Yu SL, Chen HY, Chang GC, et al. MicroRNA signature predicts survival and relapse in lung cancer. Cancer Cell 2008; 13:48.
  62. Raponi M, Dossey L, Jatkoe T, et al. MicroRNA classifiers for predicting prognosis of squamous cell lung cancer. Cancer Res 2009; 69:5776.
  63. Fanini F, Vannini I, Amadori D, Fabbri M. Clinical implications of microRNAs in lung cancer. Semin Oncol 2011; 38:776.
  64. Boeri M, Pastorino U, Sozzi G. Role of microRNAs in lung cancer: microRNA signatures in cancer prognosis. Cancer J 2012; 18:268.
  65. Castañeda CA, Agullo-Ortuño MT, Fresno Vara JA, et al. Implication of miRNA in the diagnosis and treatment of breast cancer. Expert Rev Anticancer Ther 2011; 11:1265.
  66. Sandhu S, Garzon R. Potential applications of microRNAs in cancer diagnosis, prognosis, and treatment. Semin Oncol 2011; 38:781.
  67. Nair VS, Maeda LS, Ioannidis JP. Clinical outcome prediction by microRNAs in human cancer: a systematic review. J Natl Cancer Inst 2012; 104:528.
Topic 14602 Version 30.0
