The cell line transcriptome

The term transcriptome refers to the full set of transcribed RNA molecules within a cell at a given time point. In contrast to the genome, which is largely stable across the cells of an organism, the transcriptome varies greatly. This plasticity makes the transcriptome appealing to study, since it can serve as a proxy for cellular identity and diversity. In the Cell Atlas, all 19628 protein-coding human genes are classified according to their expression across a large number of in vitro cultured cell lines (Figure 1) (Uhlén M et al, 2015). The cell lines were harvested during the log phase of growth, and extracted high-quality mRNA was used as input material for library construction and subsequent sequencing. The expression level of gene-specific transcripts is given as Transcripts Per Million (TPM) values. Genes with a TPM value ≥1 are considered detected. Altogether, the transcriptomes of 56 cell lines have been analyzed to form the basis of the different expression categories.
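The TPM normalization and the detection cutoff described above can be illustrated with a minimal Python sketch. It uses the standard TPM definition (length-normalized counts rescaled to sum to one million per sample). The gene names, read counts and transcript lengths are invented for illustration, and the function names `tpm` and `detected` are hypothetical and not part of any Cell Atlas pipeline.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to Transcripts Per Million.

    counts:     dict gene -> read count (toy input, not Atlas data)
    lengths_kb: dict gene -> transcript length in kilobases
    """
    # Step 1: normalise each count by transcript length (reads per kilobase).
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    # Step 2: rescale so the values of one sample sum to one million.
    return {gene: value * 1_000_000 / total for gene, value in rpk.items()}


def detected(tpm_values, threshold=1.0):
    """Return the genes whose TPM is at or above the cutoff (TPM >= 1 in the text)."""
    return {gene for gene, value in tpm_values.items() if value >= threshold}


# Toy sample with made-up gene names and numbers.
counts = {"GENE_A": 900, "GENE_B": 100, "GENE_C": 0}
lengths_kb = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 2.0}
sample_tpm = tpm(counts, lengths_kb)
detected_genes = detected(sample_tpm)
```

Because TPM values always sum to one million within a sample, they describe the relative share of each transcript, which is what makes a fixed cutoff such as TPM ≥1 comparable across cell lines.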

Approximately one third of all protein-coding genes (n=6833) were expressed in all cell lines, consistent with a "housekeeping" function for the corresponding proteins. A further 11% (n=2090) of all genes were not detected in any of the analyzed cell lines, suggesting that the corresponding proteins are only expressed in highly specialized cell types, during specific developmental stages or under specific conditions such as cell stress. Another 42% (n=8267) of the protein-coding genes show a more restricted pattern of expression across the analyzed cell lines, with some expressed in only a few or even just a single cell line. Table 1 shows the specific expression profile for each analyzed cell line, with clickable numbers for total detected genes, cell line enriched genes, group enriched genes and cell line enhanced genes.
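The split into genes detected everywhere, nowhere, or only in some cell lines can be sketched as follows. This is a simplified illustration that assumes a dictionary of TPM values per cell line. The three buckets are coarser than the Atlas categories (the "some" bucket is not the same as the 42% restricted category quoted above), and all names and toy values are hypothetical.

```python
TOTAL_GENES = 19628  # protein-coding genes, as stated in the text


def classify_by_detection(tpm_by_line, threshold=1.0):
    """Bucket genes by how many cell lines detect them.

    tpm_by_line: dict cell_line -> dict gene -> TPM (toy input).
    Returns a dict with the keys "all", "some" and "none".
    """
    genes = set().union(*(values.keys() for values in tpm_by_line.values()))
    n_lines = len(tpm_by_line)
    groups = {"all": set(), "some": set(), "none": set()}
    for gene in genes:
        # Count the cell lines in which this gene passes the cutoff.
        hits = sum(1 for values in tpm_by_line.values()
                   if values.get(gene, 0.0) >= threshold)
        if hits == n_lines:
            groups["all"].add(gene)
        elif hits == 0:
            groups["none"].add(gene)
        else:
            groups["some"].add(gene)
    return groups


def percent_of_total(n, total=TOTAL_GENES):
    """Share of all protein-coding genes, in percent with one decimal."""
    return round(100 * n / total, 1)


# Toy data: two invented cell lines and three invented genes.
toy = {
    "line_1": {"G1": 5.0, "G2": 0.2, "G3": 3.0},
    "line_2": {"G1": 12.0, "G2": 0.0, "G3": 0.4},
}
groups = classify_by_detection(toy)
```

Applied to the gene counts quoted in the text, `percent_of_total(6833)`, `percent_of_total(2090)` and `percent_of_total(8267)` reproduce the stated shares of roughly one third, 11% and 42%.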

  • 233 genes found only in cell lines and not tissues
  • 1225 genes found only in tissues and not cell lines

The cell line transcriptome was compared with the transcriptome of 37 different normal tissues and organs. A total of 233 genes were expressed only in cell lines and not in any of the analyzed normal tissue types. These genes serve as an interesting starting point for studying the function and role of the corresponding proteins in human biology. Conversely, 1225 genes were expressed only in normal human tissues and not in any of the analyzed cell lines. Several of the proteins corresponding to these genes have functions associated with differentiated cells in specialized tissues or subcompartments of tissues. Examples are ACR (acrosin), the major proteinase present in the acrosome of mature spermatozoa in normal testis, and ABCB11, the major canalicular bile salt export pump in normal liver.
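The comparison between cell lines and tissues amounts to two set differences over the genes detected in each collection. The sketch below assumes that each sample has already been reduced to a set of detected genes. The sample names, the gene sets and the placeholder gene names are invented, with ACR and ABCB11 used only because the text names them as tissue-only examples.

```python
def detected_anywhere(detected_by_sample):
    """Union of the detected-gene sets over all samples in a collection."""
    result = set()
    for genes in detected_by_sample.values():
        result |= genes
    return result


def exclusive_genes(cell_line_sets, tissue_sets):
    """Return (genes only in cell lines, genes only in tissues)."""
    in_cells = detected_anywhere(cell_line_sets)
    in_tissues = detected_anywhere(tissue_sets)
    return in_cells - in_tissues, in_tissues - in_cells


# Toy input; GENE_X and GENE_Y are placeholder names, not real results.
cell_lines = {
    "line_1": {"GENE_X", "GENE_Y"},
    "line_2": {"GENE_Y"},
}
tissues = {
    "testis": {"ACR", "GENE_Y"},
    "liver": {"ABCB11", "GENE_Y"},
}
only_cells, only_tissues = exclusive_genes(cell_lines, tissues)
```

On the real data, the sizes of the two returned sets would correspond to the 233 and 1225 genes reported above.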

Figure 1. Pie chart showing the number of genes in the different RNA-based categories of gene expression in the panel of cell lines.

Table 1. Table showing the number of detected genes per cell line based on RNA sequencing (TPM ≥1), and the number of genes in the enriched and enhanced categories.

Cell line | Detectable genes | Enriched genes | Group enriched genes | Enhanced genes
A-431 | 11477 | 8 | 42 | 150
A549 | 11893 | 8 | 37 | 197
AF22 | 12047 | 23 | 100 | 382
AN3-CA | 11310 | 13 | 30 | 217
ASC TERT1 | 11350 | 21 | 40 | 251
BEWO | 11822 | 57 | 80 | 402
BJ | 11684 | 2 | 28 | 192
BJ hTERT+ | 11634 | 0 | 0 | 0
BJ hTERT+ SV40 Large T+ | 11621 | 0 | 0 | 0
BJ hTERT+ SV40 Large T+ RasG12V | 11716 | 0 | 0 | 0
CACO-2 | 11532 | 20 | 80 | 225
CAPAN-2 | 11934 | 10 | 64 | 328
Daudi | 10132 | 6 | 63 | 181
EFO-21 | 12407 | 14 | 74 | 274
HaCaT | 11769 | 21 | 75 | 295
HBF TERT88 | 11383 | 0 | 4 | 67
HDLM-2 | 11216 | 73 | 79 | 370
HEK 293 | 12097 | 9 | 31 | 205
HEL | 11267 | 41 | 131 | 295
HeLa | 11544 | 9 | 26 | 130
Hep G2 | 11174 | 89 | 110 | 255
HL-60 | 10305 | 1 | 27 | 101
HMC-1 | 11559 | 55 | 97 | 402
hTCEpi | 11379 | 19 | 61 | 178
HUVEC TERT2 | 11227 | 14 | 74 | 221
K-562 | 10965 | 17 | 80 | 191
Karpas-707 | 10736 | 19 | 66 | 269
LHCN-M2 | 11438 | 9 | 34 | 171
MCF7 | 11315 | 7 | 16 | 205
MOLT-4 | 10542 | 32 | 53 | 153
NB-4 | 10912 | 13 | 53 | 181
NTERA-2 | 12596 | 50 | 119 | 354
PC-3 | 11871 | 5 | 38 | 188
REH | 11043 | 15 | 50 | 198
RH-30 | 11373 | 35 | 51 | 245
RPMI-8226 | 10994 | 20 | 67 | 217
RPTEC TERT1 | 11798 | 32 | 77 | 260
RT4 | 11614 | 32 | 73 | 313
SCLC-21H | 12810 | 94 | 193 | 696
SH-SY5Y | 12385 | 52 | 149 | 479
SiHa | 11705 | 6 | 29 | 195
SK-BR-3 | 11180 | 27 | 46 | 230
SK-MEL-30 | 11402 | 26 | 40 | 177
T-47d | 11890 | 21 | 45 | 328
THP-1 | 11371 | 23 | 58 | 215
TIME | 11334 | 2 | 69 | 284
U-138 MG | 11575 | 4 | 29 | 204
U-2 OS | 12795 | 24 | 76 | 276
U-2197 | 11460 | 20 | 40 | 237
U-251 MG | 11247 | 1 | 9 | 71
U-266/70 | 11451 | 21 | 90 | 361
U-266/84 | 10959 | 18 | 83 | 197
U-698 | 10154 | 16 | 59 | 170
U-87 MG | 11916 | 16 | 51 | 273
U-937 | 10874 | 15 | 61 | 206
WM-115 | 11937 | 16 | 45 | 274

A diversity of cell lines

The 56 different cell lines used in the Human Protein Atlas have been selected to represent various cell populations in different tissue types and organs of the human body. A vast majority of the selected cell lines have been derived from human cancer and are thus best described as human cancer cell lines with limited resemblance to normal cell types. Cell lines are in general adapted to cultivation in vitro and can only approximate the lives of normal cells that perform their function in a complex tissue context. As cancer is a composite tissue with heterogeneous cancer cell populations in addition to the stromal component, it is not surprising that several features of the normal cell corresponding to the putative progenitor cell are lacking in the cancer-derived cell line. Despite the evident differences between primary cells in tissue and in vitro cultured cell lines, an unbiased hierarchical clustering analysis (Figure 2) shows that cell lines do in fact cluster as expected from similarities in origin and phenotype of the cancer cells from which the respective cell line was derived. This can be exemplified by the derivatives of the isogenic BJ fibroblast model, which mimics the four stages of malignant transformation (normal, immortalized, transformed and metastasizing) by cumulative addition of defined genetic elements (Hahn WC et al, 1999). At the highest level of separation, cell lines that grow in suspension and represent hematopoietic and lymphoid cell systems cluster together and separate into two major clusters depending on myeloid or lymphoid origin/phenotype. Moreover, several related cell lines cluster together, such as the versions of immortalized and transformed fibroblastic cell lines (BJ derivatives), glioma (U-138 MG and U-251 MG), melanoma (WM-115 and SK-MEL-30), breast cancer (SK-BR-3, MCF7 and T-47d) and endothelial cell lines (TIME and HUVEC TERT2).
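To illustrate how an unbiased hierarchical clustering groups related expression profiles, the following minimal Python sketch performs agglomerative clustering on toy data. The text does not state which distance measure or linkage rule was used for Figure 2, so the choices here (log2(TPM+1) transformation, 1 minus Pearson correlation as distance, average linkage) are assumptions. The cell line names and TPM values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster(profiles):
    """Average-linkage agglomerative clustering (assumed settings).

    profiles: dict name -> list of TPM values in the same gene order.
    Returns the merges in order as (members_a, members_b, distance).
    """
    logged = {name: [math.log2(v + 1) for v in values]
              for name, values in profiles.items()}
    names = list(logged)
    # Pairwise distance: 1 - Pearson correlation of log-transformed TPM.
    dist = {frozenset((a, b)): 1 - pearson(logged[a], logged[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def linkage(c1, c2):
        # Average distance over all pairs of members from the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        d, i, j = min((linkage(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy profiles: two invented "type A" lines and two invented "type B" lines.
toy_profiles = {
    "line_A1": [100, 90, 1, 0, 5],
    "line_A2": [120, 80, 0, 2, 6],
    "line_B1": [1, 0, 150, 130, 4],
    "line_B2": [0, 3, 140, 160, 5],
}
merges = cluster(toy_profiles)
```

In this toy example the two lines of each type merge first, and the final merge joins the two types, which mirrors the pattern described above where lines of similar origin end up in the same branch.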

The human cancer cell lines for the Cell Atlas were selected to correspond to the origin and phenotype of the solid cancer types represented in the Pathology Atlas of the Human Protein Atlas. Special emphasis has been placed on representing cells of the hematopoietic and immune system, as the corresponding tumor types are more scarcely represented in the Cancer Atlas. Data from altogether 7 and 8 cell lines representing different stages of myeloid and lymphoid differentiation, respectively, have been generated and analyzed. In addition to cancer-derived cell lines, there are also a number of cell lines that have been generated through in vitro protocols for immortalization of growing cells as well as stem cells. Details regarding the different cell lines can be found here.

Figure 2. Hierarchical clustering based on RNA sequencing data for the 56 cell lines. The color of the cell line name represents its origin: red - myeloid, yellow - lymphoid, brown - lung, dark blue - brain, turquoise - renal, urinary and male reproductive system, green - breast and female reproductive system, pink - sarcoma, purple - fibroblast, dark blue - abdominal, black - miscellaneous. Cells immortalized by the introduction of telomerase are indicated by an asterisk (*).

Cell line enriched genes

A majority of the cell line enriched genes also belong to the tissue elevated gene expression categories (tissue enriched, group enriched and tissue enhanced). The expression pattern in normal tissues and the function of these proteins relate to the specific traits and functions of the corresponding normal tissue type and organ. Examples are presented in Figure 3 and include:

  • The secreted proteins AHSG and ALB, which are only expressed in normal liver and the liver-derived cell line Hep G2, where immunofluorescent analysis shows localization to the Golgi apparatus and vesicles, respectively.
  • The transcription factor HOXB13, which is only expressed in the nuclei of prostate, colon and rectum tissue as well as in the prostate-derived cell line PC-3.
  • The adhesion glycoprotein CDH15, which is enriched in skeletal muscle tissue and in the sarcoma cell line RH-30.
  • The enzyme TYR, which is exclusively expressed in skin and in the melanoma-derived cell line SK-MEL-30.
  • The epidermal growth factor receptor EGFR, which is enriched in female tissues and skin, and in the skin-derived cell line A-431.

The RNA-seq data for all 56 cell lines expressing 89% (n=17538) of all protein-coding human genes are presented in the Cell Atlas and can be used as a tool for selection of suitable cell lines for an experiment involving a particular gene or pathway or for further studies on the transcriptome of established human cell lines.

Image panels: AHSG - Hep G2; ALB - Hep G2; HOXB13 - PC-3; CDH15 - RH-30; EGFR - A-431

Figure 3. Examples of proteins with enriched expression in a cell line and the corresponding tissue of origin. The proteins are AHSG, ALB, HOXB13, CDH15, TYR, and EGFR. The immunohistochemical (IHC) staining shows the protein expression pattern in tissue in brown. The immunofluorescent (IF) staining shows the protein subcellular expression pattern in cell lines in green. The nucleus and microtubules are shown in blue and red respectively in the IF images.

Relevant links and publications

Hahn WC et al, 1999. Creation of human tumour cells with defined genetic elements. Nature.
PubMed: 10440377 DOI: 10.1038/22780

Uhlén M et al, 2015. Tissue-based map of the human proteome. Science.
PubMed: 25613900 DOI: 10.1126/science.1260419