Assays and annotation - The Human Protein Atlas

Immunocytochemistry/IF - cells
Annotation
Knowledge-based annotation
Reliability score
Immunohistochemistry - tissues
Annotation
Knowledge-based annotation
Reliability score
Immunohistochemistry/IF - mouse brain
Annotation
Reliability score
Immunohistochemistry - cells
Annotation
Protein array
Western blot
HPA RNA-seq data
GTEx RNA-seq data
FANTOM5 CAGE data
TCGA RNA-seq data
Survival
Evidence

Immunocytochemistry/IF - cells

Besides the immunohistochemically stained tissues, the protein atlas displays high resolution, multicolor images of immunofluorescently labeled proteins at the single cell level. This provides spatial information on protein expression patterns on a fine cellular and subcellular level.

Originally three cell lines, U-2 OS, A-431 and U-251 MG, originating from different human tissues, were chosen for the immunofluorescent analysis. In 2012, the cell line panel was expanded to include additional cell lines: A-549, BJ, CACO-2, HaCaT, HEK 293, HeLa, Hep-G2, MCF-7, PC-3, RH-30, RT-4, SH-SY5Y, SiHa, SK-MEL-30 and TIME. In the HPA16 release, 12 additional cell lines were added. To increase the probability that a large number of proteins are expressed, the cell lines were selected from different lineages, e.g. tumor cell lines from mesenchymal, epithelial and glial tumors. The selection was furthermore based on morphological characteristics, widespread use and the multitude of publications using these cell lines. Information regarding sex and age of the donor, cellular origin and source is listed here. Based on mRNA expression data, two suitable cell lines from the cell line panel are selected for the immunofluorescent analysis of each protein. In order to localize the whole human proteome on a subcellular level in one specific cell line, a third cell line, U-2 OS, is always chosen.

In addition to the human cell lines, the mouse cell line NIH 3T3 is also stained. This is only done for the antibodies corresponding to genes where the mouse and human genes are orthologous.

In order to facilitate the annotation of the subcellular localization of the protein targeted by the HPA antibody, the cells are also stained with reference markers. The following probes/organelles are used as references: (i) DAPI for the nucleus, (ii) anti-tubulin antibody as internal control and marker of microtubules, and (iii) anti-calreticulin or anti-KDEL for the endoplasmic reticulum (ER).

The resulting confocal images are single slice images representing one optical section of the cells. The microscope settings are optimized for each sample. The different organelle probes are displayed as different channels in the multicolor images; the HPA antibody staining is shown in green, nuclear stain in blue, microtubules in red and ER in yellow.

Annotation

In order to provide an interpretation of the staining patterns, all images of immunofluorescently stained cell lines are manually annotated. For each cell line and antibody, the intensity and subcellular location of the staining are described. The staining intensity is classified as negative, weak, moderate or strong based on the laser power and detector gain settings used for image acquisition in combination with the visual appearance of the image. The subcellular location is further combined with parameters describing the staining characteristics (e.g. smooth, granular, speckled or fibrous). The table below lists the subcellular locations used, links to the cell structure dictionary and corresponding GO terms.

Subcellular location	GO term
Actin filaments	GO:0015629
Aggresome	GO:0016235
Cell Junctions	GO:0030054
Centrosome	GO:0005813
Cytokinetic bridge	GO:0045171
Cytoplasmic bodies	GO:0036464
Cytosol	GO:0005829
Endoplasmic reticulum	GO:0005783
Endosomes	GO:0005768
Focal adhesion sites	GO:0005925
Golgi apparatus	GO:0005794
Intermediate filaments	GO:0045111
Lipid droplets	GO:0005811
Lysosomes	GO:0005764
Microtubule organizing center	GO:0005815
Microtubules	GO:0015630
Midbody	GO:0030496
Midbody ring	GO:0090543
Mitochondria	GO:0005739
Mitotic spindle	GO:0072686
Nuclear bodies	GO:0016604
Nuclear membrane	GO:0031965
Nuclear speckles	GO:0016607
Nucleoli	GO:0005730
Nucleoli fibrillar center	GO:0001650
Nucleoplasm	GO:0005654
Nucleus	GO:0005634
Peroxisomes	GO:0005777
Plasma membrane	GO:0005886
Vesicles	GO:0043231

Knowledge-based annotation

Knowledge-based annotation of subcellular location aims to provide an interpretation of the subcellular localization of a specific protein in at least three human cell lines. Combining immunofluorescence data from two or more antibody sources directed towards the same protein with a review of available protein/gene characterization data allows for a knowledge-based interpretation of the subcellular location.

Reliability score

A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data.

The reliability of the annotated protein expression data is also scored depending on similarity in immunostaining patterns and consistency with available experimental gene/protein characterization data in the UniProtKB/Swiss-Prot database.

The overall score encompasses several factors: reproducibility of the antibody staining in different cell lines (and whether signal strength correlates with RNA expression levels); reproducibility of the staining using antibodies binding to different epitopes on the same target protein; validation data for the specificity of the antibody (including knockdown/knockout methods and matching signal with fluorescent-tagged protein); and experimental evidence for the location described in literature. Consideration is also given to whether the antibody performs in non-IF methods such as Western blot or immunohistochemistry.

The final score leads to the assignment into one of the following four classes:

Validated - If there is more than one source supporting the localization; for example, two independent antibodies show the same localization and this is also observed in experiments outside the Human Protein Atlas project.
Supported - If there is one source supporting the localization.
Approved - If the localization of the protein has not been previously described and was detected by only one antibody without additional antibody validation.
Uncertain - If the antibody-staining pattern contradicts experimental data or no expression is detected on the RNA level.

Immunohistochemistry - tissues

The Human Protein Atlas contains images of histological sections from normal and cancer tissues obtained by immunohistochemistry. Antibodies are labeled with DAB (3,3'-diaminobenzidine) and the resulting brown staining indicates where an antibody has bound to its corresponding antigen. The section is furthermore counterstained with hematoxylin to enable visualization of microscopical features.

Tissue microarrays are used to show antibody staining in samples from 144 individuals corresponding to 44 different normal tissue types, and samples from 216 cancer patients corresponding to 20 different types of cancer (movie about tissue microarray production and immunohistochemical staining). Each sample is represented by 1 mm tissue cores, resulting in a total number of 576 images for each antibody. Normal tissues are represented by samples from three individuals each, one core per individual, except for endometrium, skin, soft tissue and stomach, which are represented by samples from six individuals each and parathyroid gland, which is represented by one sample. Protein expression is annotated in 76 different normal cell types present in these tissue samples. For cancer tissues, two cores are sampled from each individual and protein expression is annotated in tumor cells. A small fraction of the 576 images are missing for most antibodies due to technical issues. Specimens containing normal and cancer tissue have been collected and sampled from anonymized paraffin embedded material of surgical specimens, in accordance with approval from the local ethics committee.
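The total of 576 images per antibody follows from the sampling scheme described above: one core for each of the 144 normal-tissue individuals plus two cores for each of the 216 cancer patients. The arithmetic can be checked with a minimal sketch (the function and constant names are illustrative, not part of any Human Protein Atlas software):

```python
# Counts taken from the paragraph above.
NORMAL_INDIVIDUALS = 144          # one core sampled per individual
CANCER_PATIENTS = 216
CORES_PER_CANCER_PATIENT = 2      # two cores sampled per individual


def images_per_antibody(
    normal_individuals: int = NORMAL_INDIVIDUALS,
    cancer_patients: int = CANCER_PATIENTS,
    cores_per_cancer_patient: int = CORES_PER_CANCER_PATIENT,
) -> int:
    """Return the nominal number of tissue core images for one antibody."""
    normal_cores = normal_individuals * 1
    cancer_cores = cancer_patients * cores_per_cancer_patient
    return normal_cores + cancer_cores


if __name__ == "__main__":
    print(images_per_antibody())
```

As noted above, this is the nominal count; a small fraction of images is missing for most antibodies.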

For selected proteins extended tissue profiling is performed in addition to standard tissue microarrays. Examined tissues include mouse brain, human lactating breast, eye, hair and extended samples of adrenal gland, skin and brain.

Since specimens are derived from surgical material, normal is here defined as non-neoplastic and morphologically normal. It is not always possible to obtain fully normal tissues and thus several of the tissues denoted as normal will include alterations due to inflammation, degeneration and tissue remodeling. In rare tissues, hyperplasia or benign proliferations are included as exceptions. It should also be noted that within normal morphology there may exist interindividual differences and variations due to primary diseases, age, sex etc. Such differences may also affect protein expression and thereby immunohistochemical staining patterns.

Samples from cancer are also derived from surgical material. Due to subgroups and heterogeneity of tumors within each cancer type, included cases represent a typical mix of specimens from surgical pathology. The inclusion of tumors is based on availability and representativity; however, an effort has been made to include high and low grade malignancies where applicable. In certain tumor groups, subtypes have been included, e.g. breast cancer includes both ductal and lobular cancer, lung cancer includes both squamous cell carcinoma and adenocarcinoma, and liver cancer includes both hepatocellular and cholangiocellular carcinoma. Tumor heterogeneity and interindividual differences may be reflected in diverse expression of proteins resulting in variable immunohistochemical staining patterns.

Annotation

In order to provide an overview of protein expression patterns, all images of immunohistochemically stained tissues are manually annotated by a specialist followed by verification by a second specialist. Annotation of each different normal and cancer tissue is performed using fixed guidelines for classification of immunohistochemical results. Each tissue is examined for representativity, and subsequently immunoreactivity in the different cell types present in normal or cancer tissues is annotated. Basic annotation parameters include an evaluation of i) staining intensity (negative, weak, moderate or strong), ii) fraction of stained cells (rare, <25%, 25-75% or >75%) and iii) subcellular localization (nuclear and/or cytoplasmic/membranous). The manual annotation also provides two summarizing texts describing the staining pattern for each antibody in normal tissues and in cancer tissues.

The terminology and ontology used is compliant with standards used in pathology and medical science. SNOMED classification is used for assignment of topography and morphology. SNOMED classification also underlies the given original diagnosis from which normal as well as cancer samples were collected.

A histological dictionary used in the annotation is available as a PDF-document, containing images which are immunohistochemically stained with antibodies included in the Human Protein Atlas. The dictionary displays subtypes of cells distinguishable from each other and also shows specific expression patterns in different intracellular structures. Annotation dictionary: screen usage (15 MB), printing (95 MB).

Knowledge-based annotation

Knowledge-based annotation aims to create a comprehensive overview of protein expression patterns in normal human tissues. This is achieved by stringent evaluation of immunohistochemical staining pattern, RNA-seq data from internal and external sources and available protein/gene characterization data, with special emphasis on RNA-seq. Protein expression profiles are annotated using single antibodies as well as independent antibodies (two or more independent antibodies directed against different, non-overlapping epitopes on the same protein). For independent antibodies, the immunohistochemical data from all the different antibodies are taken into consideration. The immunohistochemical staining pattern in normal tissues is subjectively annotated according to strict guidelines. It is based on the experienced evaluation of positive immunohistochemical signals in the 76 normal cell types analyzed. The review also takes suboptimal experimental procedures and interindividual variations into consideration.

The final annotated protein expression is considered a best estimate and as such reflects the most probable histological distribution and relative expression level for each protein. To enable a protein expression profile, one or several of the following additional data sources are necessary: i) an independent antibody targeting another epitope of the same protein, ii) RNA-seq data, and iii) available protein/gene characterization data. The result of the knowledge-based annotation is considered inconclusive when the information available at the time of analysis is evaluated as not sufficient for verification of the staining pattern and an estimation of the expected protein expression.

The knowledge-based protein expression profiles are created using fixed guidelines on evaluation and presentation of the resulting expression profiles. Standardized explanatory sentences are used when necessary to provide additional information required for full understanding of the expression profile. A reliability score of Supported (visualized with a yellow star symbol), Approved, or Uncertain is set for each annotated protein expression profile based on evaluation of all available data.

Reliability score

A reliability score is manually set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available RNA-seq data and/or protein/gene characterization data.

The reliability score is divided into Supported, Approved, or Uncertain. Supported is indicated by a star on the image that links to the tissue atlas data for a particular gene. If data are available from more than one antibody, the staining patterns of all antibodies are taken into consideration during evaluation of the reliability score.

Supported - Consistency with RNA-seq and/or protein/gene characterization data, in combination with similar staining pattern if independent antibodies are available.
Approved - Consistency with RNA-seq data in combination with inconsistency with, or lack of, protein/gene characterization data. Alternatively, consistency with protein/gene characterization data in combination with inconsistency with RNA-seq data. If independent antibodies are available, the staining pattern is partly similar or dissimilar.
Uncertain - Inconsistency with, or lack of, RNA-seq and/or protein/gene characterization data, in combination with dissimilar staining pattern if independent antibodies are available.

Immunohistochemistry/IF - mouse brain

As a complement to the immunohistochemically stained tissues, the protein atlas also includes the mouse brain atlas as a sub compartment of the normal tissue atlas, in which comprehensive profiles are available in mouse brain. A selected set of targets has been analyzed by applying the antibodies to serial sections of mouse brain covering 129 areas and subfields of the brain, several of which are difficult to cover in the human brain. In addition, pituitary, retina and trigeminal ganglia are included in recent and future image series but are not yet annotated.

The tissue microarray method used within the Human Protein Atlas has enabled the global mapping of proteins in the human body, including the brain. Currently, the human tissue atlas covers four areas of the human brain: cerebral cortex, hippocampus, caudate and cerebellum. Due to the heterogeneous structure of the brain, with many nuclei and cell types organized in complex networks, it is difficult to achieve a comprehensive overview in a 1 mm tissue sample. Analysis of more human brain samples, including smaller brain nuclei, is thus desirable in order to generate a more detailed map of protein distribution in the brain. Therefore, we have complemented the human brain atlas effort with a more comprehensive analysis of the mouse brain. A series of mouse brain sections is explored for protein expression and distribution in a large number of brain regions.

Antibodies are selected against proteins involved in normal brain physiology, brain development and neuropathological processes. A limit of 60% homology (human vs mouse) is used as the cut-off when comparing the PrEST sequence for the antibody targets.

Selected antibodies are applied to test sections containing brain regions or cell types with known expression based on in situ hybridization (Allen Brain Atlas) and single cell RNA-seq data (Linnarsson Lab and Barres Lab). Staining patterns are evaluated based on consistency between staining patterns of multiple antibodies against the same target and match to transcriptomics data. Antibody immunoreactivity is visualized using tyramide signal amplification, shown in green. A nuclear reference staining (DAPI) is visualized in blue. The immunofluorescence protocol is standardized, although antibody concentration and incubation time vary depending on protein abundance and antibody affinity as determined during the test staining. The complete mouse brain profile is represented by serial coronal sections of adult mouse brain, 16 µm thick. Stained slides are then scanned and digitized before further processing.

Table 1. Brain regions. Abbreviations are based on The Mouse Brain in Stereotaxic Coordinates, Third Edition: The coronal plates and diagrams (ISBN: 9780123742445)

Region			Abbreviation	Allen Brain Atlas
forebrain	olfactory bulb	anterior olfactory nucleus	aon	AON
forebrain	olfactory bulb	granule cell layer	gro	MOBgr
forebrain	olfactory bulb	internal plexiform layer	ipl	MOBipl
forebrain	olfactory bulb	mitral cell layer	mi	MOBmi
forebrain	olfactory bulb	glomerular layer	gl	MOBgl
forebrain	olfactory bulb	rostral migratory stream	rms	SEZ
forebrain	olfactory bulb	external plexiform layer	epl	MOBopl
forebrain	olfactory bulb	external plexiform layer of the accessory OB	epla
forebrain	olfactory bulb	granule cell layer of the accessory OB	gra	AOBgr
forebrain	olfactory bulb	glomerular layer of the accessory OB	gla	AOBgl
forebrain	basal forebrain	dorsal tenia tecta	dtt	TTd
forebrain	basal forebrain	caudate putamen	cpu	CP
forebrain	basal forebrain	accumbens nucleus, core	acbc	ACB
forebrain	basal forebrain	accumbens nucleus, shell	acbsh	ACB
forebrain	basal forebrain	island of Calleja	icj	isl
forebrain	basal forebrain	ventral pallidum	vp	PALv
forebrain	basal forebrain	medial septum	ms	MS
forebrain	basal forebrain	nucleus of the vertical limb of the diagonal band	vdb	NDB
forebrain	basal forebrain	lateral septum	ls	LS
forebrain	basal forebrain	nucleus of the horizontal limb of the diagonal band	hdb	NDB
forebrain	basal forebrain	globus pallidus	gp	PALd
forebrain	cerebral cortex	frontal association cortex	fra	FRP
forebrain	cerebral cortex	motor cortex	m	MO
forebrain	cerebral cortex	cingulate cortex	cg	ACA
forebrain	cerebral cortex	piriform cortex, L1	pirl1	PIR1
forebrain	cerebral cortex	piriform cortex, L2	pirl2	PIR2
forebrain	cerebral cortex	piriform cortex, L3	pirl3	PIR3
forebrain	cerebral cortex	insular cortex	i	AI
forebrain	cerebral cortex	somatosensory cortex	s	SS
forebrain	cerebral cortex	retrosplenial granular cortex	rsg	RSP
forebrain	cerebral cortex	parietal association cortex	p	PTLp
forebrain	cerebral cortex	entorhinal cortex	ent	ENT
forebrain	cerebral cortex	visual cortex	v	VIS
forebrain	hippocampus	polymorph layer of the dentate gyrus	podg	DG-po
forebrain	hippocampus	molecular layer of the dentate gyrus	modg	DG-mo
forebrain	hippocampus	granular dentate gyrus	grdg	DG-sg
forebrain	hippocampus	CA1 - oriens layer	ca1or	CA1so
forebrain	hippocampus	CA1 - pyramidal layer	ca1py	CA1sp
forebrain	hippocampus	CA1 - radiatum layer	ca1ra	CA1sr
forebrain	hippocampus	CA2 - oriens layer	ca2or	CA2so
forebrain	hippocampus	CA2 - pyramidal layer	ca2py	CA2sp
forebrain	hippocampus	CA2 - radiatum layer	ca2ra	CA2sr
forebrain	hippocampus	CA3 - oriens layer	ca3or	CA3so
forebrain	hippocampus	CA3 - pyramidal layer	ca3py	CA3sp
forebrain	hippocampus	CA3 - radiatum layer	ca3ra	CA3sr
forebrain	hippocampus	stratum lucidum	slu	CA3slu
forebrain	hippocampus	lacunosum moleculare	lmol	CA1slm
forebrain	hippocampus	subiculum	sub	SUB
forebrain	circumventricular organs	subfornical organ	sfo	SFO
forebrain	amygdala	nucleus of the lateral olfactory tract	lot	NLOT
forebrain	amygdala	basal medial amygdaloid nucleus	bma	BMA
forebrain	amygdala	basal lateral amygdaloid nucleus	bla	BLA
forebrain	amygdala	cortical amygdala	aco	COA
forebrain	amygdala	central amygdala	ce	CEA
forebrain	amygdala	medial amygdaloid nucleus	mea	MEA
interbrain	hypothalamus	dorsal tuberomammillary nucleus	dtm	TMd
interbrain	hypothalamus	mammillary nucleus	mn	MBO
interbrain	hypothalamus	periventricular hypothalamic nucleus	pe	PVi
interbrain	hypothalamus	supraoptic nucleus	so	SO
interbrain	hypothalamus	tuberal nucleus	tu	TU
interbrain	hypothalamus	ventral tuberomammillary nucleus	vtm	TMv
interbrain	hypothalamus	lateral preoptic area	lpo	LPO
interbrain	hypothalamus	medial preoptic area	mpo	MEPO
interbrain	hypothalamus	suprachiasmatic nucleus	sch	SCH
interbrain	hypothalamus	paraventricular hypothalamic nucleus	pa	PVH
interbrain	hypothalamus	anterior hypothalamic area, central	ahc	AHN
interbrain	hypothalamus	ventral medial hypothalamic nucleus	vmh	VMH
interbrain	hypothalamus	arcuate nucleus	arc	ARH
interbrain	hypothalamus	peduncular part of lateral hypothalamus	plh	PH
interbrain	hypothalamus	dorsal medial hypothalamic nucleus	dm	DMH
interbrain	circumventricular organs	subcommissural organ	sco
interbrain	circumventricular organs	median eminence	me	ME
interbrain	thalamus	medial geniculate nucleus	mg	MG
interbrain	thalamus	parafascicular thalamic nucleus	pf	PF
interbrain	thalamus	pregeniculate nucleus	pg	GENd
interbrain	thalamus	stria terminalis	st	st
interbrain	thalamus	zona incerta	zi	ZI
interbrain	thalamus	anterodorsal thalamic nucleus	ad	AD
interbrain	thalamus	reticular thalamic nucleus	rt	RT
interbrain	thalamus	ventral anterior thalamic nucleus	va	VAL
interbrain	thalamus	medial habenular nucleus	mhb	MH
interbrain	thalamus	laterodorsal thalamic area	ld	LD
interbrain	thalamus	paraventricular thalamic nucleus	pv	PVT
interbrain	thalamus	central medial thalamic area	cm	CM
interbrain	thalamus	ventral lateral thalamic area	vl	VP
interbrain	thalamus	ventral medial thalamic area	vm	VM
interbrain	thalamus	lateral habenular nucleus	lhb	LH
interbrain	thalamus	ventral posterior thalamus	vpt	VP
interbrain	thalamus	anterior pretectal nucleus	apt	PRT
interbrain	thalamus	retromammillary nucleus	rm	SUM
midbrain	midbrain motor	substantia nigra, reticular	snr	SNr
midbrain	midbrain motor	periaqueductal grey	pag	PAG
midbrain	midbrain motor	interpeduncular nucleus	ip	IPN
midbrain	midbrain motor	mesencephalic retic form	mrt	MRN
midbrain	midbrain motor	red nucleus	r	RN
midbrain	midbrain motor	oculomotor nucleus	3n	III
midbrain	midbrain motor	mesencephalic trigeminal nucleus	me5	MEV
midbrain	midbrain motor	ventral tegmental area	vta	VTA
midbrain	midbrain behavioral	substantia nigra, compact	snc	SNc
midbrain	midbrain behavioral	dorsal raphe nucleus	dr	DR
midbrain	midbrain behavioral	median raphe nucleus	mnr	CLI
midbrain	midbrain sensory	superior colliculi	sc
midbrain	midbrain sensory	external cortex of the inferior colliculus	ecic	ICe
hindbrain	cerebellum	molecular layer of the cerebellum	cemol	CBXmo
hindbrain	cerebellum	Purkinje layer of the cerebellum	cepur	CBXpu
hindbrain	cerebellum	granular layer of the cerebellum	cegr	CBXgr
hindbrain	circumventricular organs	medulla	ap	AP
hindbrain	pons	koelliker-fuse nucleus	kf	KF
hindbrain	pons	motor trigeminal nucleus	5n	V
hindbrain	pons	parabrachial nucleus	pbp	PB
hindbrain	pons	principal sensory trigeminal nucleus	pr5	PSV
hindbrain	pons	locus coeruleus	lc	LC
hindbrain	pons	pontine nucleus	pn	PG
hindbrain	pons	vestibular nucleus	ve	VNC
hindbrain	pons	pontine reticular nucleus, oral	pno	PRNr
hindbrain	pons	lateral lemniscus	ll	NLL
hindbrain	pons	superior paraolivary nucleus	spo	POR
hindbrain	medulla	nucleus of the solitary tract	sol	NTS
hindbrain	medulla	raphe magnus nucleus	rmg	RM
hindbrain	medulla	cochlear nucleus	cn	CN
hindbrain	medulla	lateral paragigantocellular nucleus	lpg	PGRNl
hindbrain	medulla	raphe pallidus nucleus	rpa	RPA
hindbrain	medulla	facial nucleus	7n	VII
hindbrain	medulla	hypoglossal nucleus	12n	XII
hindbrain	medulla	ambiguus nucleus	amb	AMB
hindbrain	medulla	external cuneate nucleus	ecu	CU
hindbrain	medulla	inferior olivary nucleus	io	IO
hindbrain	medulla	raphe obscurus nucleus	rob	RO
hindbrain	medulla	dorsal motor nucleus of vagus	10n	DMX

Annotation

The digitized images are processed (axel-adjusted and tissue edges defined) and regions of interest (ROIs) are then marked according to the table above. These ROIs are then used for image analysis and the relative fluorescence intensity is listed for each region. The relative fluorescence is defined as the intensity of the annotated region relative to the intensity of the region with the highest intensity.
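The normalization described above can be illustrated with a minimal sketch. It assumes that one intensity value per ROI has already been measured (how that value is derived from the pixels is not stated here, so the input is simply a number per region); the function name and the example values are hypothetical.

```python
def relative_fluorescence(region_intensity: dict[str, float]) -> dict[str, float]:
    """Scale each region's intensity by the highest intensity among all regions.

    region_intensity maps an ROI abbreviation (e.g. "cpu") to a measured
    intensity value. The brightest region gets a relative value of 1.0.
    """
    if not region_intensity:
        return {}
    peak = max(region_intensity.values())
    if peak <= 0:
        # No signal anywhere: report zero for every region instead of dividing by zero.
        return {region: 0.0 for region in region_intensity}
    return {region: value / peak for region, value in region_intensity.items()}


if __name__ == "__main__":
    # Invented example values, for illustration only.
    measured = {"cpu": 120.0, "ca1py": 480.0, "cepur": 240.0}
    for region, value in relative_fluorescence(measured).items():
        print(f"{region}: {value:.2f}")
```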

The overview and preserved orientation in the mouse brain have enabled us to annotate additional cell classes (ependymal), glial subpopulations (microglia, oligodendrocytes, and astrocytes), and additional brain-specific subcellular locations (axon, dendrite, synapse, and glial endfeet) for each investigated protein.

All images of immunofluorescence-stained sections are manually annotated by specially trained personnel, followed by review and verification by a second qualified member of the staff. The cellular and subcellular location of the immunoreactivity is defined and a summarizing text is provided describing the general staining pattern.

Specificity is validated by comparing the data with in situ hybridization data (Allen Brain Atlas) and/or available literature; support from other data leads to a Supported reliability score, while less well-characterized targets are viewed as Uncertain and await further validation.

Reliability score

A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data.

The reliability of the antibodies in the mouse brain atlas is scored as Supported or Uncertain depending on support from in situ hybridization data (Allen Brain Atlas) and/or previously published data, including the UniProtKB/Swiss-Prot database.

Immunohistochemistry - cells

For antibody validation purposes, a selection of 44 widely used and well-characterized human cell lines is used for immunohistochemical staining on cell microarrays. Included cell lines are derived from DSMZ, ATCC or academic research groups (kindly provided by cell line founders). Information regarding sex and age of the donor, tissue origin and source is listed here. All cells are fixed in 4% paraformaldehyde and dispersed in agarose prior to paraffin embedding and immunohistochemical staining.

The cell microarray (CMA) enables representation of leukemia and lymphoma cell lines, covering major hematopoietic neoplasms and even different stages of differentiation. Cell lines from solid tumors are also included in the CMA. A subset originates from solid tumors not represented in the tissue microarrays (TMAs), e.g. sarcoma, choriocarcinoma and small cell lung carcinoma, and the remaining cell lines are derived from tumor types also represented in the TMAs.

Antibodies are labeled with DAB (3,3'-diaminobenzidine) and the resulting brown staining indicates where an antibody has bound to its corresponding antigen. The section is furthermore counterstained with hematoxylin to enable visualization of microscopical features.

Annotation

All images of immunohistochemically stained cell lines are annotated using automated recognition software for image analysis. The image analysis software, TMAx (Beecher Instruments, Sun Prairie, WI, USA), built on an object-oriented image analysis engine from Definiens, uses rule-based operations and multiple iterative segmentation processes together with fuzzy logic to identify cells and immunohistochemical stain deposits.

The output parameters from the software, which are always displayed in conjunction with the annotated images, are:

number of objects defined as cells in the image
staining intensity (negative, weak, moderate or strong)
fraction (%) of positive cells

In addition, two overlay images with additional numerical information are presented to facilitate interpretation. The information displayed includes:

Cell: object based view representing fraction (%) of immunostained cells. The color code for each cell represents a range of immunoreactivity, blue (negative/very weak), yellow (weak/moderate), orange (moderate/strong) and red (strong) cells. This classification is based on areas of different intensities within each object (cell). This differs slightly from the subjective classification provided by manual annotation of cells in normal and cancer tissue.
Area: area-based view representing immunostained areas (%) within cells. The color code represents a range of immunoreactivity, yellow (weak/moderate), green (moderate/strong) and red (strong). Negative/very weak areas are transparent. The intensity score is generated from the total of this area based analysis.

The annotation data is used for formal antibody validation based on Pearson correlation between independent antibodies and an orthogonal method (RNA-seq). More information can be found here.
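As a reminder of what the Pearson correlation measures, the sketch below computes it for two equal-length series, for example one value per cell line from an antibody-based readout and one from RNA-seq. Which exact readouts are paired, and any acceptance threshold, are not specified here; the function name and inputs are illustrative assumptions.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Return the Pearson correlation coefficient of two paired series."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two paired observations are required")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 indicates that the two readouts rise and fall together across the cell lines, while a value near 0 indicates no linear relationship.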

Protein array

All purified antibodies are analyzed on antigen microarrays. The specificity profile for each antibody is determined based on the interaction with 384 different antigens including its own target. The antigens present on the arrays are consecutively exchanged in order to correspond to the next set of 384 purified antibodies. Each microarray is divided into 21 replicated subarrays, enabling the analysis of 21 antibodies simultaneously. The antibodies are detected through a fluorescently labeled secondary antibody and a dual color system is used in order to verify the presence of the spotted proteins. A specificity profile plot is generated for each antibody, where the signal from the binding to its own antigen is compared to any off-target interactions with all the other antigens. The vast majority (86%) of antibodies are given a pass and the remainder fail due to either low signal or low specificity.

Western blot

Western blot analysis of antibody specificity has been done using a routine sample setup composed of IgG/HSA-depleted human plasma and protein lysates from a limited number of human tissues and cell lines. Antibodies with an uncertain routine WB have been revalidated using an over-expression lysate (VERIFY Tagged Antigen(TM), OriGene Technologies, Rockville, MD) as a positive control. Antibody binding was visualized by chemiluminescence detection in a CCD-camera system using a peroxidase (HRP) labeled secondary antibody.

Antibodies included in the Human Protein Atlas have been analyzed without further efforts to optimize the procedure and therefore it cannot be excluded that certain observed binding properties are due to technical rather than biological reasons and that further optimization could result in a different outcome.

HPA RNA-seq data

In total, 56 cell lines and 37 tissues have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene.

For cell lines, early-split samples were used as duplicates and total RNA was extracted using the RNeasy mini kit. Information regarding cellular origin and source of each cell line is listed here.

For normal tissue, specimens were collected with consent from patients and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank and RNA samples were extracted from frozen tissue sections.

For a total of 115 cell line samples and 172 tissue samples, mRNA sequencing was performed on Illumina HiSeq2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases. Transcript abundance estimation was performed using Kallisto v0.42.4. For each gene, we report the abundance in 'Transcripts Per Million' (TPM) as the sum of the TPM values of all its protein-coding transcripts. For each cell line and tissue type, the average TPM value of the replicate samples was used as the abundance score. The threshold for detecting the presence of a transcript for a particular gene was set to ≥ 1 TPM.
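The aggregation described above can be illustrated with a minimal Python sketch. It is not the HPA pipeline: the function names, the input layout (one dictionary of transcript-level TPM values per sample) and the identifiers T1, G1 and so on are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

DETECTION_THRESHOLD_TPM = 1.0  # from the text: a gene is detected at >= 1 TPM


def gene_tpm_per_sample(transcript_tpm, transcript_to_gene, protein_coding):
    """Sum the TPM of all protein-coding transcripts of each gene in one sample."""
    totals = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        if transcript_id in protein_coding:
            totals[transcript_to_gene[transcript_id]] += tpm
    return dict(totals)


def abundance_score(replicate_gene_tpms, gene_id):
    """Average the gene-level TPM over the replicates of one tissue or cell line."""
    return mean(sample.get(gene_id, 0.0) for sample in replicate_gene_tpms)


def is_detected(score, threshold=DETECTION_THRESHOLD_TPM):
    """Apply the detection threshold to an abundance score."""
    return score >= threshold


# Hypothetical example: two replicates, gene G1 with two protein-coding
# transcripts (T1, T2) and one non-coding transcript (T3) that is ignored.
transcript_to_gene = {"T1": "G1", "T2": "G1", "T3": "G1", "T4": "G2"}
protein_coding = {"T1", "T2", "T4"}
replicates = [
    gene_tpm_per_sample({"T1": 0.5, "T2": 0.25, "T3": 10.0, "T4": 0.5},
                        transcript_to_gene, protein_coding),
    gene_tpm_per_sample({"T1": 1.0, "T2": 0.75, "T3": 5.0, "T4": 0.25},
                        transcript_to_gene, protein_coding),
]
g1_score = abundance_score(replicates, "G1")
g2_score = abundance_score(replicates, "G2")
```

In this example the non-coding transcript T3 does not contribute to G1, so a gene can fall below the detection threshold even when one of its non-coding transcripts is highly abundant.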

The RNA-seq data was used to classify all genes according to their tissue-specific expression into one of six different categories, defined based on the total set of all TPM values in 37 tissues:

tissue enriched (expression in one tissue at least five-fold higher than all other tissues)
group enriched (five-fold higher average TPM in a group of two to seven tissues compared to all other tissues)
tissue enhanced (five-fold higher average TPM in one or more tissues compared to the mean TPM of all tissues)
expressed in all (≥ 1 TPM in all tissues)
not detected (< 1 TPM in all tissues)
mixed (detected in at least one tissue and in none of the above categories)

An additional category "elevated", containing all genes in the first three categories (tissue enriched, group enriched and tissue enhanced), has been used for some parts of the analysis.
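The category definitions can be read as a decision procedure. The Python sketch below is one possible interpretation, not the HPA implementation, and it rests on several assumptions that the text does not state: the categories are tested in the order listed, "compared to all other tissues" means compared to the highest value among the remaining tissues, and "tissue enhanced" is tested on the single highest tissue. The function names are hypothetical.

```python
from statistics import mean

ELEVATED = {"tissue enriched", "group enriched", "tissue enhanced"}


def classify_gene(tpm_by_tissue, detect=1.0, fold=5.0, max_group=7):
    """Assign one of the six categories to a gene from its per-tissue TPM values."""
    values = sorted(tpm_by_tissue.values(), reverse=True)
    if values[0] < detect:
        return "not detected"
    # One tissue at least five-fold above every other tissue.
    if len(values) > 1 and values[0] >= fold * values[1]:
        return "tissue enriched"
    # Assumption: a group is the k highest tissues (k = 2..7), compared with
    # the highest tissue outside the group.
    for k in range(2, min(max_group, len(values) - 1) + 1):
        if mean(values[:k]) >= fold * values[k]:
            return "group enriched"
    # Assumption: tested on the single highest tissue versus the overall mean.
    if values[0] >= fold * mean(values):
        return "tissue enhanced"
    if values[-1] >= detect:
        return "expressed in all"
    return "mixed"


def is_elevated(category):
    """True for the three categories grouped as 'elevated' in the text."""
    return category in ELEVATED
```

For example, classify_gene({"liver": 100, "kidney": 10, "lung": 5}) returns "tissue enriched" under these assumptions, because the liver value is at least five times the next highest value.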

GTEx RNA-seq data

The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 31 of their tissues that have a corresponding tissue in the Human Protein Atlas have been included to allow for comparisons between the Human Protein Atlas data and the GTEx data.

The GTEx RNA-seq data has been mapped using the Ensembl gene ID available from GTEx, and the RPKM values (Reads Per Kilobase of gene model per Million mapped reads) for each gene were subsequently used to categorize the genes using the same classification as described above, but with 0.5 RPKM as the threshold for detection.
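As a reminder of what the RPKM unit expresses, the following Python sketch computes it from raw counts and applies the 0.5 RPKM detection threshold. The RPKM values used in the atlas are those supplied by GTEx; the function names and the numbers below are hypothetical and serve only to illustrate the definition.

```python
GTEX_DETECTION_THRESHOLD_RPKM = 0.5  # from the text


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_detected_rpkm(value, threshold=GTEX_DETECTION_THRESHOLD_RPKM):
    """Apply the GTEx detection threshold to an RPKM value."""
    return value >= threshold


# Hypothetical gene: 500 reads on a 2,000 bp gene model, 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
```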

Tissue	GTEx tissue	Number of samples
Adipose tissue	Adipose - Subcutaneous	350
	Adipose - Visceral (Omentum)	227
Adrenal gland	Adrenal Gland	145
Breast	Breast - Mammary Tissue	214
Caudate	Brain - Caudate (basal ganglia)	117
Cerebellum	Brain - Cerebellar Hemisphere	105
	Brain - Cerebellum	125
Cerebral cortex	Brain - Cortex	114
	Brain - Frontal Cortex (BA9)	108
Cervix, uterine	Cervix - Ectocervix	6
	Cervix - Endocervix	5
Colon	Colon - Sigmoid	149
	Colon - Transverse	196
Endometrium	Uterus - Endometrium	14
Esophagus	Esophagus - Mucosa	286
Fallopian tube	Fallopian Tube	6
Heart muscle	Heart - Atrial Appendage	194
	Heart - Left Ventricle	218
Hippocampus	Brain - Hippocampus	94
Hypothalamus	Brain - Hypothalamus	96
Kidney	Kidney - Cortex	32
Liver	Liver	119
Lung	Lung	320
Ovary	Ovary	97
Pancreas	Pancreas	171
Pituitary gland	Pituitary	103
Prostate	Prostate	106
Salivary gland	Minor Salivary Gland	57
Skeletal muscle	Muscle - Skeletal	430
Skin	Skin - Not Sun Exposed (Suprapubic)	250
	Skin - Sun Exposed (Lower leg)	357
Small intestine	Small Intestine - Terminal Ileum	88
Spleen	Spleen	104
Stomach	Stomach	193
Testis	Testis	172
Thyroid gland	Thyroid	323
Urinary bladder	Bladder	11
Vagina	Vagina	96

FANTOM5 CAGE data

The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi et al 2012), which is based on a series of full-length cDNA technologies developed at RIKEN. CAGE data for 35 of their tissues were obtained from the FANTOM5 repository and mapped to Ensembl. The normalized Tags Per Million for each gene were calculated and subsequently used to categorize the genes using the same classification as described above, with Tags Per Million ≥ 1 as the threshold for detection, to allow for comparisons with the Human Protein Atlas data.

Tissue	FANTOM5 tissue	Sample description	FANTOM5 sample id
Adipose tissue	Adipose tissue	65,65,76 years, mixed	FF:10010-101C1
Appendix	Appendix	29 years, male	FF:10189-103D9
Brain	Brain	77,79,81 years, mixed	FF:10012-101C3
Breast	Breast	77 years, female	FF:10080-102A8
Caudate	Caudate nucleus	76 years, female	FF:10164-103B2
Cerebellum	Cerebellum	22-68 years, mixed	FF:10083-102B2
	Cerebellum	76 years, female	FF:10166-103B4
Cervix, uterine	Cervix	40,46,57,65 years, female	FF:10013-101C4
Colon	Colon	62,83,84 years, mixed	FF:10014-101C5
Endometrium	Uterus	23-63 years, female	FF:10100-102D1
Epididymis	Epididymis	24 years, male	FF:10197-103E8
Esophagus	Esophagus	68,74,75 years, mixed	FF:10015-101C6
Gallbladder	Gall bladder	57 years, male	FF:10198-103E9
Heart muscle	Heart	70,73,74 years, mixed	FF:10016-101C7
	Left ventricle	73 years, female	FF:10078-102A6
Hippocampus	Hippocampus	76 years, female	FF:10153-102I9
	Hippocampus	60 years, female	FF:10169-103B7
Kidney	Kidney	60,62,63 years, female	FF:10017-101C8
Liver	Liver	64,69,70 years, mixed	FF:10018-101C9
Lung	Lung	46,65,94 years, mixed	FF:10019-101D1
	Lung - right lower lobe	29 years, male	FF:10075-102A3
Lymph node	Lymph node	30 years, male	FF:10077-102A5
Ovary	Ovary	47,75,84 years, female	FF:10020-101D2
Pancreas	Pancreas	52 years, male	FF:10049-101G4
Pituitary gland	Pituitary gland	76 years, female	FF:10162-103A9
Placenta	Placenta	female	FF:10021-101D3
Prostate	Prostate	73,79,93 years, male	FF:10022-101D4
Retina	Retina	24-65 years, mixed	FF:10030-101E3
Salivary gland	Salivary gland	16-60 years, mixed	FF:10093-102C3
Seminal vesicle	Seminal vesicle	24 years, male	FF:10201-103F3
Skeletal muscle	Skeletal muscle	55,79,79 years, mixed	FF:10023-101D5
	Skeletal muscle - soleus muscle	male	FF:10282-104F3
Small intestine	Small intestine	15,40,85 years, mixed	FF:10024-101D6
Smooth muscle	Smooth muscle	20-68 years, male	FF:10048-101G3
Spleen	Spleen	39,50,70 years, male	FF:10025-101D7
Testis	Testis	34,53,86 years, male	FF:10026-101D8
	Testis	14-64 years, male	FF:10096-102C6
Thyroid gland	Thyroid	67,68,78 years, mixed	FF:10028-101E1
Tonsil	Tonsil	22-61 years, mixed	FF:10047-101G2
Urinary bladder	Bladder	55,58,79 years, mixed	FF:10011-101C2
Vagina	Vagina	68 years, female	FF:10204-103F6

TCGA RNA-seq data

The Cancer Genome Atlas (TCGA) project of Genomic Data Commons (GDC) collects and analyzes multiple human cancer samples. RNA-seq data from 17 cancer types representing 21 cancer subtypes with a corresponding major cancer type in the Human Pathology Atlas were included to allow for comparisons between the protein staining data from the Human Protein Atlas and the RNA-seq data from TCGA.

The TCGA RNA-seq data was mapped using the Ensembl gene ID available from TCGA, and the FPKM values (Fragments Per Kilobase of exon per Million reads) for each gene were subsequently used for quantification of expression with a detection threshold of 1 FPKM. Genes were categorized using the same classification as described above.

HPA cancer type	TCGA cancer	No. of samples in TCGA
Breast cancer	Breast Invasive Carcinoma (BRCA)	1075
Cervical cancer	Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC)	291
Colorectal cancer	Colon Adenocarcinoma (COAD)	438
	Rectum Adenocarcinoma (READ)	159
Endometrial cancer	Uterine Corpus Endometrial Carcinoma (UCEC)	541
Glioma	Glioblastoma Multiforme (GBM)	153
Head and neck cancer	Head and Neck Squamous Cell Carcinoma (HNSC)	499
Liver cancer	Liver Hepatocellular Carcinoma (LIHC)	365
Lung cancer	Lung Adenocarcinoma (LUAD)	500
	Lung Squamous Cell Carcinoma (LUSC)	494
Melanoma	Skin Cutaneous Melanoma (SKCM)	102
Ovarian cancer	Ovarian Serous Cystadenocarcinoma (OV)	373
Pancreatic cancer	Pancreatic Adenocarcinoma (PAAD)	176
Prostate cancer	Prostate Adenocarcinoma (PRAD)	494
Renal cancer	Kidney Chromophobe (KICH)	64
	Kidney Renal Clear Cell Carcinoma (KIRC)	528
	Kidney Renal Papillary Cell Carcinoma (KIRP)	285
Stomach cancer	Stomach Adenocarcinoma (STAD)	354
Testis cancer	Testicular Germ Cell Tumor (TGCT)	134
Thyroid cancer	Thyroid Carcinoma (THCA)	501
Urothelial cancer	Bladder Urothelial Carcinoma (BLCA)	406

Survival

Based on the FPKM value of each gene, patients were classified into two expression groups and the correlation between expression level and patient survival was examined. Genes with a median expression of less than 1 FPKM were excluded. The prognosis of each group of patients was examined by Kaplan-Meier survival estimators, and the survival outcomes of the two groups were compared by log-rank tests. Both median and maximally separated Kaplan-Meier plots are presented in the Human Protein Atlas, and genes with log-rank P values less than 0.001 in the maximally separated Kaplan-Meier analysis were defined as prognostic genes. If the group of patients with high expression of a selected prognostic gene has more observed events than expected events, the gene is an unfavourable prognostic gene; otherwise, it is a favourable prognostic gene.
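The decision rule in this paragraph can be sketched in Python with a textbook two-group log-rank test. This is an illustration under stated assumptions, not the HPA analysis code: the function names are hypothetical, the expression cutoff is passed in as a parameter because the way the maximally separated cutoff is chosen is not described here, and assigning patients strictly above the cutoff to the high group is an assumption.

```python
import math
from statistics import median


def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    Returns (p_value, observed_events_a, expected_events_a).
    """
    rows = [(t, e, "a") for t, e in zip(times_a, events_a)]
    rows += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed = expected = variance = 0.0
    for t in sorted({t for t, e, _ in rows if e}):
        at_risk_a = sum(1 for ti, _, g in rows if ti >= t and g == "a")
        at_risk = sum(1 for ti, _, _ in rows if ti >= t)
        deaths_a = sum(1 for ti, e, g in rows if ti == t and e and g == "a")
        deaths = sum(1 for ti, e, _ in rows if ti == t and e)
        observed += deaths_a
        expected += deaths * at_risk_a / at_risk
        if at_risk > 1:
            frac_a = at_risk_a / at_risk
            variance += (deaths * frac_a * (1 - frac_a)
                         * (at_risk - deaths) / (at_risk - 1))
    if variance == 0:
        return 1.0, observed, expected
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2)), observed, expected


def prognostic_call(fpkm, times, events, cutoff, p_threshold=0.001):
    """Classify a gene from per-patient FPKM, follow-up time and event flag."""
    if median(fpkm) < 1:
        return "excluded"  # median expression below 1 FPKM
    high = [i for i, x in enumerate(fpkm) if x > cutoff]  # assumption: strict >
    low = [i for i, x in enumerate(fpkm) if x <= cutoff]
    if not high or not low:
        return "not prognostic"
    p, observed_high, expected_high = logrank(
        [times[i] for i in high], [events[i] for i in high],
        [times[i] for i in low], [events[i] for i in low],
    )
    if p >= p_threshold:
        return "not prognostic"
    return "unfavourable" if observed_high > expected_high else "favourable"
```

A call such as prognostic_call(fpkm, times, events, cutoff=median(fpkm)) would correspond to the median split; a maximally separated analysis would instead supply the cutoff that gives the lowest log-rank P value among the candidate cutoffs considered.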

Evidence

Protein evidence is calculated for each gene based on three different sources: UniProt protein existence (UniProt evidence); a Human Protein Atlas antibody- or RNA-based score (HPA evidence); and evidence based on two proteogenomics studies (MS evidence). In addition, a protein evidence summary score is given for each gene, based on the maximum level of evidence across the three independent evidence scores (Evidence summary).

All scores are classified into the following categories:

Evidence at protein level
Evidence at transcript level
No evidence
Not available

UniProt evidence is based on UniProt protein existence data, which uses five types of evidence for the existence of a protein. All genes in the classes "Experimental evidence at protein level" or "Experimental evidence at transcript level" are classified into the first two evidence categories, whereas genes from the "Inferred from homology", "Predicted", or "Uncertain" classes are classified as "No evidence". Genes where the gene identifier could not be mapped to UniProt from Ensembl version 83.38 are classified as "Not available".

The HPA evidence is calculated based on the manual curation of Western blot, tissue profiling and subcellular location as well as transcript profiling using RNA-seq. All genes with Data reliability "Supported" in one or both of the two methods immunohistochemistry and immunofluorescence, or standard validation "Supported" for the Western blot application (assays using over-expression lysates not included) are classified as "Evidence at protein level". For the remaining genes, all genes detected at TPM > 1 in at least one of the tissues or cell lines used in the RNA-seq analysis are classified as "Evidence at transcript level". The remaining genes are classified as "No evidence".

MS evidence is based on two proteogenomics studies, Kim et al 2014 and Ezkurdia et al 2014. Each gene detected by at least one of the MS-based studies is classified as "Evidence at protein level" and all remaining genes as "Not available".
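The three evidence scores and the summary score can be sketched as simple lookups in Python. This is an illustrative reading of the rules above, not HPA code. The function names and argument layout are hypothetical, and ranking "No evidence" above "Not available" when taking the maximum is an assumption, since the text only states that the summary uses the maximum level of evidence.

```python
PROTEIN = "Evidence at protein level"
TRANSCRIPT = "Evidence at transcript level"
NONE = "No evidence"
NOT_AVAILABLE = "Not available"

# Assumption: "Not available" ranks lowest when taking the maximum.
RANK = {NOT_AVAILABLE: 0, NONE: 1, TRANSCRIPT: 2, PROTEIN: 3}

UNIPROT_TO_EVIDENCE = {
    "Experimental evidence at protein level": PROTEIN,
    "Experimental evidence at transcript level": TRANSCRIPT,
    "Inferred from homology": NONE,
    "Predicted": NONE,
    "Uncertain": NONE,
}


def uniprot_evidence(existence_class):
    """Map a UniProt protein existence class; None means the gene was not mapped."""
    if existence_class is None:
        return NOT_AVAILABLE
    return UNIPROT_TO_EVIDENCE[existence_class]


def hpa_evidence(ihc_reliability, if_reliability, wb_validation, max_tpm):
    """HPA evidence from antibody-based reliability labels and RNA-seq."""
    if "Supported" in (ihc_reliability, if_reliability, wb_validation):
        return PROTEIN
    if max_tpm > 1:  # the text specifies TPM > 1 for this score
        return TRANSCRIPT
    return NONE


def ms_evidence(detected_in_any_study):
    """MS evidence from the two proteogenomics studies."""
    return PROTEIN if detected_in_any_study else NOT_AVAILABLE


def evidence_summary(uniprot, hpa, ms):
    """Highest-ranking category among the three evidence scores."""
    return max((uniprot, hpa, ms), key=RANK.__getitem__)
```

For a gene that is "Predicted" in UniProt, has no supported antibody data, reaches 3.2 TPM in at least one sample and was not detected in either MS study, the three scores would be "No evidence", "Evidence at transcript level" and "Not available", giving a summary of "Evidence at transcript level".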

Contact

contact@proteinatlas.org

The Human Protein Atlas project is funded by the Knut & Alice Wallenberg foundation.