Immunocytochemistry/IF - cells

The Cell Atlas displays high-resolution, multicolor images of proteins labeled by indirect immunocytochemistry/immunofluorescence (ICC-IF). These images provide spatial information on protein expression patterns, which is used to define the subcellular localization of each protein to cellular organelles and structures at the single-cell level.

Originally, three cell lines originating from different human tissues, U-2 OS, A-431 and U-251 MG, were chosen for the analysis of protein subcellular location by ICC-IF. The panel has since been expanded with additional cell lines to increase the probability that a large number of proteins are expressed. The cell lines were selected from different lineages, e.g. tumor cell lines from mesenchymal, epithelial and glial tumors, as well as cells immortalized by introduction of telomerase. The selection was furthermore based on morphological characteristics and widespread use of these cell lines. Information regarding sex and age of the donor, cellular origin and source is listed here. In order to localize the whole human proteome on a subcellular level in one specific cell line, all proteins are stained in U-2 OS. Two additional cell lines are selected based on mRNA expression data. In addition to the human cell lines, many proteins have been stained in the mouse cell line NIH 3T3, provided that the human and mouse genes are orthologous.

The standard immunostaining protocol for ICC can be found on the open access repository for science methods at protocols.io. For the great majority of antibodies, fixation is achieved with paraformaldehyde (PFA), but for a few antibodies, this is replaced by methanol (3 times 5 minutes) in order to better preserve the morphology of certain cellular structures. For each gene, the use of PFA or methanol, as well as dilution factors for the antibodies, are stated in the Antibodies and Validation section. In order to facilitate the annotation of the subcellular localization of the protein targeted by the HPA antibody, the cells are also stained with the following reference markers: (i) DAPI for the nucleus, (ii) anti-tubulin antibody as internal control and marker of microtubules, and (iii) anti-calreticulin or anti-KDEL for the endoplasmic reticulum (ER).

The resulting confocal images are single slice images representing one optical section of the cells. The microscope settings are optimized for each sample. The different organelle probes are displayed as different channels in the multicolor images; the HPA antibody staining is shown in green, nucleus in blue, microtubules in red and ER in yellow.

Annotation

In order to provide an interpretation of the staining patterns, all images generated by ICC-IF are manually annotated. For each cell line and antibody, the staining is described in terms of intensity, subcellular location and single-cell variability (SCV). The staining intensity is classified as negative, weak, moderate or strong based on the laser power and detector gain settings used for image acquisition in combination with the visual appearance of the image. The table below lists the subcellular locations used for annotation, links to the cell structure dictionary entry and corresponding GO terms. SCV within an immunofluorescence image is annotated either as intensity (variation in expression level between cells) or as spatial (variation in the spatial distribution between cells).

Subcellular location GO term
Actin filaments GO:0015629
Aggresome GO:0016235
Cell Junctions GO:0030054
Centriolar satellite GO:0034451
Centrosome GO:0005813
Cleavage furrow GO:0032154
Cytokinetic bridge GO:0045171
Cytoplasmic bodies GO:0036464
Cytosol GO:0005829
Endoplasmic reticulum GO:0005783
Endosomes GO:0005768
Focal adhesion sites GO:0005925
Golgi apparatus GO:0005794
Intermediate filaments GO:0045111
Kinetochore GO:0000776
Lipid droplets GO:0005811
Lysosomes GO:0005764
Microtubule ends GO:1990752
Microtubules GO:0015630
Midbody GO:0030496
Midbody ring GO:0090543
Mitochondria GO:0005739
Mitotic chromosome GO:0005694
Mitotic spindle GO:0072686
Nuclear bodies GO:0016604
Nuclear membrane GO:0031965
Nuclear speckles GO:0016607
Nucleoli GO:0005730
Nucleoli fibrillar center GO:0001650
Nucleoli rim GO:0005730
Nucleoplasm GO:0005654
Peroxisomes GO:0005777
Plasma membrane GO:0005886
Rods & Rings
Vesicles GO:0043231

Knowledge-based annotation

The knowledge-based annotation aims to provide an interpretation of the detected subcellular location of a protein. In the first step, stainings in different cell lines with the same antibody are reviewed and the results are compared with external experimental protein/gene characterization data for subcellular location, available in the UniProtKB/Swiss-Prot database. In the second step, all antibodies targeting the same protein are taken into consideration for the final annotation of the subcellular distribution of the protein. Each location is separately assigned one of the four reliability scores (Enhanced, Supported, Approved, and Uncertain). Together with additional factors (e.g. correlation of the signal strength to RNA-seq data, similarity between sibling antibodies), these per-location scores result in an overall gene reliability score.

Reliability score

A reliability score is set manually for all genes and indicates the level of reliability of the detected subcellular distribution pattern of the protein, based on available protein/RNA/gene characterization data from both HPA and the UniProtKB/Swiss-Prot database. The overall score encompasses several factors: reproducibility of the antibody staining in different cell lines (and whether signal strength correlates with RNA expression levels); assays for enhanced antibody validation, namely the use of antibodies binding to different epitopes on the same target protein (independent antibody validation), knockdown/knockout of the target protein (genetic validation) and matching of the signal with a GFP-tagged protein (recombinant expression validation); and experimental evidence for subcellular location described in the literature.

The final score leads to assignment into one of the following four classes:

  • Enhanced - One or more antibodies are enhanced validated and there is no contradicting data, such as literature describing experimental evidence for a different location.
  • Supported - There is no enhanced validation of the used antibody, but the annotated localization is reported in literature.
  • Approved - The localization of the protein has not been previously described and was detected by only one antibody without additional antibody validation.
  • Uncertain - The antibody-staining pattern contradicts experimental data or expression is not detected at RNA level.

Immunohistochemistry - tissues

The Human Protein Atlas contains images of histological sections from normal and cancer tissues obtained by immunohistochemistry. Antibodies are labeled with DAB (3,3'-diaminobenzidine) and the resulting brown staining indicates where an antibody has bound to its corresponding antigen. The section is furthermore counterstained with hematoxylin to enable visualization of microscopic features. Tissue microarrays are used to show antibody staining in samples from 144 individuals corresponding to 44 different normal tissue types, and samples from 216 cancer patients corresponding to 20 different types of cancer (movie about tissue microarray production and immunohistochemical staining). Each sample is represented by 1 mm tissue cores, resulting in a total of 576 images for each antibody. Normal tissues are represented by samples from three individuals each, one core per individual, except for endometrium, skin, soft tissue and stomach, which are represented by samples from six individuals each, and parathyroid gland, which is represented by one sample. Protein expression is annotated in 76 different normal cell types present in these tissue samples. For cancer tissues, two cores are sampled from each individual and protein expression is annotated in tumor cells. For most antibodies, a small fraction of the 576 images is missing due to technical issues. Specimens containing normal and cancer tissue have been collected and sampled from anonymized paraffin-embedded material of surgical specimens, in accordance with approval from the local ethics committee. For selected proteins, extended tissue profiling is performed in addition to standard tissue microarrays. Examined tissues include mouse brain, human lactating breast, eye, thymus and extended samples of adrenal gland, skin and brain.
Since specimens are derived from surgical material, normal is here defined as non-neoplastic and morphologically normal. It is not always possible to obtain fully normal tissues, and thus several of the tissues denoted as normal will include alterations due to inflammation, degeneration and tissue remodeling. In rare tissues, hyperplasia or benign proliferations are included as exceptions. It should also be noted that within normal morphology there may exist interindividual differences and variations due to primary diseases, age, sex etc. Such differences may also affect protein expression and thereby immunohistochemical staining patterns. Samples from cancer are also derived from surgical material. Due to subgroups and heterogeneity of tumors within each cancer type, included cases represent a typical mix of specimens from surgical pathology. The inclusion of tumors is based on availability and representativity; however, an effort has been made to include high and low grade malignancies where applicable. In certain tumor groups, subtypes have been included, e.g. breast cancer includes both ductal and lobular cancer, lung cancer includes both squamous cell carcinoma and adenocarcinoma, and liver cancer includes both hepatocellular and cholangiocellular carcinoma. Tumor heterogeneity and interindividual differences may be reflected in diverse expression of proteins, resulting in variable immunohistochemical staining patterns.
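
The total of 576 images per antibody quoted above follows from the sampling scheme: one core per individual for normal tissues and two cores per patient for cancer tissues. A minimal sketch of that arithmetic, using only the figures given in the text (the variable names are illustrative and not part of any HPA software):

```python
# Figures taken from the text above; variable names are illustrative only.
normal_individuals = 144   # normal tissue donors, one 1 mm core each
cancer_patients = 216      # cancer patients, two 1 mm cores each
cores_per_normal_individual = 1
cores_per_cancer_patient = 2

normal_images = normal_individuals * cores_per_normal_individual
cancer_images = cancer_patients * cores_per_cancer_patient
total_images = normal_images + cancer_images

print(normal_images, cancer_images, total_images)  # 144 432 576
```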

Annotation

In order to provide an overview of protein expression patterns, all images of tissues stained by immunohistochemistry are manually annotated by a specialist, followed by verification by a second specialist. Annotation of each normal and cancer tissue is performed using fixed guidelines for classification of immunohistochemical results. Each tissue is examined for representativity, and immunoreactivity in the different cell types present in normal or cancer tissues is subsequently annotated. Basic annotation parameters include an evaluation of i) staining intensity (negative, weak, moderate or strong), ii) fraction of stained cells (<25%, 25-75% or >75%) and iii) subcellular localization (nuclear and/or cytoplasmic/membranous). The manual annotation also provides two summarizing texts describing the staining pattern for each antibody in normal tissues and in cancer tissues.
The terminology and ontology used is compliant with standards used in pathology and medical science. SNOMED classification is used for assignment of topography and morphology. SNOMED classification also underlies the given original diagnosis from which normal as well as cancer samples were collected.
A histological dictionary used in the annotation is available as a PDF document, containing images stained by immunohistochemistry using antibodies included in the Human Protein Atlas. The dictionary displays subtypes of cells that are distinguishable from each other and also shows specific expression patterns in different intracellular structures. Annotation dictionary: screen usage (15 MB), printing (95 MB).
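
The three basic annotation parameters described above can be thought of as one small record per cell type. The following is a minimal sketch only: the class name, field names and validation logic are assumptions made for illustration, and only the allowed category values are taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple

# Allowed values as listed in the text above.
INTENSITIES = ("negative", "weak", "moderate", "strong")
FRACTIONS = ("<25%", "25-75%", ">75%")
LOCATIONS = ("nuclear", "cytoplasmic/membranous")


@dataclass(frozen=True)
class CellTypeAnnotation:
    """Hypothetical record of one manual annotation for one cell type."""
    cell_type: str
    intensity: str
    fraction: str
    locations: Tuple[str, ...]  # "nuclear and/or cytoplasmic/membranous"

    def __post_init__(self):
        # Reject values outside the categories named in the guidelines.
        if self.intensity not in INTENSITIES:
            raise ValueError(f"unknown intensity: {self.intensity!r}")
        if self.fraction not in FRACTIONS:
            raise ValueError(f"unknown fraction: {self.fraction!r}")
        if not self.locations or any(loc not in LOCATIONS for loc in self.locations):
            raise ValueError(f"unknown localization: {self.locations!r}")


example = CellTypeAnnotation(
    cell_type="hepatocytes",  # hypothetical example cell type
    intensity="moderate",
    fraction=">75%",
    locations=("cytoplasmic/membranous",),
)
print(example)
```

How a negative staining should be combined with a fraction value is not stated in the text, so the sketch does not enforce any rule for that case.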

Knowledge-based annotation

Knowledge-based annotation aims to create a comprehensive overview of protein expression patterns in normal human tissues. This is achieved by stringent evaluation of immunohistochemical staining patterns, RNA-seq data from internal and external sources and available protein/gene characterization data, with special emphasis on RNA-seq. Protein expression profiles are annotated using single antibodies as well as independent antibodies (two or more independent antibodies directed against different, non-overlapping epitopes on the same protein). For independent antibodies, the immunohistochemical data from all the different antibodies are taken into consideration. The immunohistochemical staining pattern in normal tissues is subjectively annotated according to strict guidelines. It is based on the experienced evaluation of positive immunohistochemical signals in the 76 normal cell types analyzed. The review also takes suboptimal experimental procedures and interindividual variations into consideration.
The final annotated protein expression is considered a best estimate and as such reflects the most probable histological distribution and relative expression level for each protein. To enable a protein expression profile, one or several of the following additional data sources are necessary: i) an independent antibody targeting another epitope of the same protein, ii) RNA-seq data, and iii) available protein/gene characterization data. The result of the knowledge-based annotation is considered inconclusive when the information available at the time of analysis is evaluated as not sufficient for verification of the staining pattern and an estimation of the expected protein expression. The knowledge-based protein expression profiles are produced using fixed guidelines on evaluation and presentation of the resulting expression profiles. Standardized explanatory sentences are used when necessary to provide additional information required for full understanding of the expression profile. A reliability score (Enhanced, Supported, Approved, or Uncertain) is set for each annotated protein expression profile based on evaluation of all available data.

Reliability score

A reliability score is manually set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on knowledge-based evaluation of available RNA-seq data, protein/gene characterization data and immunohistochemical data from one or several antibodies designed towards non-overlapping sequences of the same gene. The reliability score is based on the 44 normal tissues analyzed, and is displayed on both the Tissue Atlas and the Pathology Atlas.

The reliability score is divided into Enhanced, Supported, Approved, or Uncertain. If there is available data from more than one antibody, the staining patterns of all antibodies are taken into consideration during the evaluation of the reliability score.

Enhanced
One or several antibodies targeting non-overlapping sequences of the same gene have obtained enhanced validation based on either orthogonal or independent antibody validation methods.

Supported
If one of the following criteria is fulfilled:

  • At least one antibody shows high or medium consistency between RNA levels and staining pattern, but the antibody does not qualify for Orthogonal validation and staining pattern is consistent with valid literature, or there is no valid literature available
  • At least one antibody has RNA consistency defined as “Cannot be evaluated” and staining pattern is consistent with valid literature
  • Paired antibodies (several antibodies targeting non-overlapping sequences) show similar staining pattern, but the antibodies do not qualify for Independent antibody validation and staining pattern is consistent with valid literature, or there is no valid literature available

Approved
If one of the following criteria is fulfilled:

  • At least one antibody shows high or medium consistency between RNA levels and staining pattern and staining pattern is inconsistent with valid literature
  • At least one antibody shows low consistency between RNA levels and staining pattern and staining pattern is consistent with valid literature
  • At least one antibody has RNA consistency defined as “Cannot be evaluated” and staining pattern is partly consistent with valid literature, or consistent with limited literature
  • Paired antibodies show partly similar expression patterns

Uncertain
If one of the following criteria is fulfilled:

  • Only multi-targeting antibodies are available. Multi-targeting antibodies are used for genes where it was not possible to generate single-targeting antibodies due to high sequence identity among proteins belonging to different genes. These genes are in many cases closely related and belong to known gene families, and in these cases a multi-targeting antibody was produced that has >80% sequence identity to transcripts of the genes belonging to the family and low sequence identity to the transcripts of all other human genes.
  • At least one antibody shows low or very low consistency between RNA and staining pattern, or RNA consistency is defined as “Cannot be evaluated” and staining pattern is inconsistent with valid literature, or there is no valid literature available
  • Paired antibodies show dissimilar expression patterns
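
The single-antibody criteria listed above can be read as a small decision table. The score itself is set manually, so the following is only a reading aid: the function name, argument names and string labels are assumptions, the paired-antibody criteria are not modelled, and combinations that the lists above do not mention return None rather than a guessed class. Where a combination matches both an Approved and an Uncertain bullet (low RNA consistency with consistent literature), the sketch assumes the more specific Approved rule applies.

```python
HIGH_OR_MEDIUM = {"high", "medium"}


def tissue_reliability(enhanced_validation, rna_consistency, literature,
                       multi_targeting_only=False):
    """Sketch of the tissue reliability criteria for a single antibody.

    rna_consistency: "high", "medium", "low", "very low" or "cannot be evaluated"
    literature: "consistent", "partly consistent", "limited", "inconsistent" or "none"
    These labels are hypothetical encodings of the wording in the text.
    """
    if multi_targeting_only:
        return "Uncertain"
    if enhanced_validation:
        return "Enhanced"

    cannot_evaluate = rna_consistency == "cannot be evaluated"

    # Supported criteria
    if rna_consistency in HIGH_OR_MEDIUM and literature in {"consistent", "none"}:
        return "Supported"
    if cannot_evaluate and literature == "consistent":
        return "Supported"

    # Approved criteria
    if rna_consistency in HIGH_OR_MEDIUM and literature == "inconsistent":
        return "Approved"
    if rna_consistency == "low" and literature == "consistent":
        return "Approved"
    if cannot_evaluate and literature in {"partly consistent", "limited"}:
        return "Approved"

    # Uncertain criteria
    if rna_consistency in {"low", "very low"}:
        return "Uncertain"
    if cannot_evaluate and literature in {"inconsistent", "none"}:
        return "Uncertain"

    return None  # combination not covered by the listed criteria


print(tissue_reliability(False, "high", "consistent"))
print(tissue_reliability(False, "low", "consistent"))
```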

Immunohistochemistry/IF - mouse brain

As a complement to the immunohistochemically stained tissues, the Human Protein Atlas also includes the mouse brain atlas as a sub-compartment of the normal tissue atlas, in which comprehensive profiles are available for mouse brain. A selected set of targets has been analyzed by applying the antibodies to serial sections of mouse brain covering 129 areas and subfields of the brain, several of which are difficult to cover in the human brain. In addition, pituitary, retina and trigeminal ganglia are included in recent and future image series but are not yet annotated.

The tissue microarray method used within the Human Protein Atlas has enabled the global mapping of proteins in the human body, including the brain. Currently, the human tissue atlas covers four areas of the human brain: cerebral cortex, hippocampus, caudate and cerebellum. Due to the heterogeneous structure of the brain, with many nuclei and cell types organized in complex networks, it is difficult to achieve a comprehensive overview in a 1 mm tissue sample. Analysis of more human brain samples, including smaller brain nuclei, is thus desirable in order to generate a more detailed map of protein distribution in the brain. We have therefore complemented the human brain atlas effort with a more comprehensive analysis of the mouse brain, in which a series of mouse brain sections is explored for protein expression and distribution in a large number of brain regions.

Antibodies are selected against proteins involved in normal brain physiology, brain development and neuropathological processes. A limit of 60% homology (human vs mouse) is used as the cut-off when comparing the PrEST sequences of the antibody targets.

Selected antibodies are applied to test sections containing brain regions or cell types with known expression based on in situ hybridization (Allen Brain Atlas) and single cell RNA-seq data (Linnarsson Lab and Barres Lab). Staining patterns are evaluated based on consistency between staining patterns of multiple antibodies against the same target and match to transcriptomics data. Antibody immunoreactivity is visualized using tyramide signal amplification, shown in green. A nuclear reference staining (DAPI) is visualized in blue. The immunofluorescence protocol is standardized, though antibody concentration and incubation time vary depending on protein abundance and antibody affinity, as determined during the test staining. The complete mouse brain profile is represented by serial coronal sections of adult mouse brain, 16 µm thick. Stained slides are then scanned and digitized before further processing.

Table 1. Brain regions. Abbreviations are based on The Mouse Brain in Stereotaxic Coordinates, Third Edition: The coronal plates and diagrams (ISBN: 9780123742445)

Region                                                                           Abbreviation  Allen Brain Atlas
cerebral cortex   cerebral cortex  frontal association cortex                    fra           FRP
cerebral cortex   cerebral cortex  motor cortex                                  m             MO
cerebral cortex   cerebral cortex  cingulate cortex                              cg            ACA
cerebral cortex   cerebral cortex  piriform cortex, L1                           pirl1         PIR1
cerebral cortex   cerebral cortex  piriform cortex, L2                           pirl2         PIR2
cerebral cortex   cerebral cortex  piriform cortex, L3                           pirl3         PIR3
cerebral cortex   cerebral cortex  insular cortex                                i             AI
cerebral cortex   cerebral cortex  somatosensory cortex                          s             SS
cerebral cortex   cerebral cortex  retrosplenial granular cortex                 rsg           RSP
cerebral cortex   cerebral cortex  parietal association cortex                   p             PTLp
cerebral cortex   cerebral cortex  entorhinal cortex                             ent           ENT
cerebral cortex   cerebral cortex  visual cortex                                 v             VIS
olfactory region  olfactory bulb   anterior olfactory nucleus                    aon           AON
olfactory region  olfactory bulb   granule cell layer                            gro           MOBgr
olfactory region  olfactory bulb   internal plexiform layer                      ipl           MOBipl
olfactory region  olfactory bulb   mitral cell layer                             mi            MOBmi
olfactory region  olfactory bulb   glomerular layer                              gl            MOBgl
olfactory region  olfactory bulb   rostral migratory stream                      rms           SEZ
olfactory region  olfactory bulb   external plexiform layer                      epl           MOBopl
olfactory region  olfactory bulb   external plexiform layer of the accessory OB  epla
olfactory region  olfactory bulb   granule cell layer of the accessory OB        gra           AOBgr
olfactory region  olfactory bulb   glomerular layer of the accessory OB          gla           AOBgl

Annotation

The digitized images are processed (axel-adjusted and tissue edges defined) and regions of interest (ROIs) are then marked according to the table above. These ROIs are then used for image analysis, and the relative fluorescence intensity is listed for each region. The relative fluorescence is defined as the intensity of the annotated region relative to the intensity of the region with the highest intensity.
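
The definition of relative fluorescence above amounts to dividing each region's intensity by that of the brightest region. A minimal sketch, assuming one intensity value per ROI has already been measured; the function name and the example values are hypothetical, and how the per-region intensity is measured (e.g. mean signal, background handling) is not specified in the text:

```python
def relative_fluorescence(roi_intensity):
    """Scale each ROI intensity to the ROI with the highest intensity.

    roi_intensity maps a region abbreviation to a measured intensity.
    The brightest region gets 1.0; all others fall between 0 and 1.
    """
    if not roi_intensity:
        return {}
    max_intensity = max(roi_intensity.values())
    if max_intensity <= 0:
        raise ValueError("at least one region must have a positive intensity")
    return {region: value / max_intensity
            for region, value in roi_intensity.items()}


# Hypothetical intensities for three regions from the table above.
example = {"cg": 120.0, "m": 60.0, "pirl1": 30.0}
print(relative_fluorescence(example))
# {'cg': 1.0, 'm': 0.5, 'pirl1': 0.25}
```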

The overview and preserved orientation in the mouse brain has enabled us to annotate additional cell classes (ependymal), glial subpopulations (microglia, oligodendrocytes, and astrocytes), and additional brain specific subcellular locations (axon, dendrite, synapse, and glia endfeet) for each investigated protein.

All images of immunofluorescence stained sections were manually annotated by specially educated personnel followed by review and verification by a second qualified member of the staff. The cellular and subcellular location of the immunoreactivity is defined and a summarizing text is provided describing the general staining pattern.

Specificity is validated by comparing the data with in situ hybridization data (Allen Brain Atlas) and/or available literature. Support from other data leads to a Supported reliability score, while less well-characterized targets are scored as Uncertain and await further validation.

Reliability score

A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data.

Antibodies in the mouse brain atlas are scored as Supported or Uncertain, depending on support from in situ hybridization data (Allen Brain Atlas) and/or previously published data and the UniProtKB/Swiss-Prot database.

Protein array

All purified antibodies are analyzed on antigen microarrays. The specificity profile for each antibody is determined based on the interaction with 384 different antigens including its own target. The antigens present on the arrays are consecutively exchanged in order to correspond to the next set of 384 purified antibodies. Each microarray is divided into 21 replicated subarrays, enabling the analysis of 21 antibodies simultaneously. The antibodies are detected through a fluorescently labeled secondary antibody, and a dual color system is used in order to verify the presence of the spotted proteins. A specificity profile plot is generated for each antibody, where the signal from binding to its own antigen is compared to any off-target interactions with all the other antigens. The vast majority (86%) of antibodies are given a pass; the remainder fail due to either low signal or low specificity.
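
The comparison behind a specificity profile can be illustrated by expressing every antigen signal as a fraction of the signal from the antibody's own antigen. This is a sketch under stated assumptions: the function name, the antigen identifiers and in particular the 0.15 cut-off are hypothetical, since the actual pass/fail thresholds are not given in the text.

```python
def specificity_profile(signals, own_antigen, off_target_fraction=0.15):
    """Relate all antigen signals to the signal of the antibody's own antigen.

    signals: mapping of antigen id -> fluorescence signal on the array.
    off_target_fraction: HYPOTHETICAL cut-off; antigens whose signal reaches
    this fraction of the own-antigen signal are reported as off-target.
    Returns (relative_signals, off_targets).
    """
    own_signal = signals[own_antigen]
    if own_signal <= 0:
        raise ValueError("own-antigen signal must be positive")
    relative = {antigen: value / own_signal for antigen, value in signals.items()}
    off_targets = sorted(
        antigen for antigen, fraction in relative.items()
        if antigen != own_antigen and fraction >= off_target_fraction
    )
    return relative, off_targets


# Hypothetical signals for three of the 384 antigens on an array.
signals = {"antigen_A": 2000.0, "antigen_B": 100.0, "antigen_C": 500.0}
relative, off_targets = specificity_profile(signals, own_antigen="antigen_A")
print(relative)      # {'antigen_A': 1.0, 'antigen_B': 0.05, 'antigen_C': 0.25}
print(off_targets)   # ['antigen_C']
```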

Western blot

Western blot analysis of antibody specificity has been done using a routine sample setup composed of IgG/HSA-depleted human plasma and protein lysates from a limited number of human tissues and cell lines. Antibodies with an uncertain routine WB have been revalidated using an over-expression lysate (VERIFY Tagged Antigen(TM), OriGene Technologies, Rockville, MD) as a positive control. Antibody binding was visualized by chemiluminescence detection in a CCD-camera system using a peroxidase (HRP) labeled secondary antibody.

Antibodies included in the Human Protein Atlas have been analyzed without further efforts to optimize the procedure and therefore it cannot be excluded that certain observed binding properties are due to technical rather than biological reasons and that further optimization could result in a different outcome.

Transcriptomics

HPA RNA-seq data

In total, 69 cell lines, 37 human tissues and 18 blood cell types, as well as total peripheral blood mononuclear cells (PBMC), have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples, collected from the brain and retina of the animals, were analyzed by RNA-seq.

For cell lines, early-split samples were used as duplicates and total RNA was extracted using Qiagen RNeasy mini kit. Information regarding cellular origin and source of each cell line is listed here.

For normal tissue and blood samples, specimens were collected with consent from patients and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473 and ref #2015/1552-32) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank and RNA samples were extracted from frozen tissue sections. Blood samples were enriched for PBMC and granulocytes, labeled with antibodies and separated into subpopulations by flow sorting.

For mouse tissue, samples were collected and handled in accordance with Swedish laws and regulations, and all experiments were approved by the local ethical committee (Stockholms Norra Djurförsöksetiska Nämnd N183/14). The animal experiments conformed to the European Communities Council Directive (86/609/EEC), and all efforts were made to minimize the suffering and the number of animals used. WT male (n = 2) and female (n = 2) C57BL/6J mice (2 months old) were obtained from Charles River Laboratories and maintained under standard conditions on a 12-hour day/night cycle, with water and food ad libitum.

For a total number of 141 cell line samples, 483 tissue samples, and 109 blood cell type samples, mRNA sequencing was performed on Illumina HiSeq2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases. Blood cells mRNA sequencing was performed on an Illumina NovaSeq 6000 System in four S4 lanes with a read length of 2x150 bases. Transcript abundance estimation was performed using Kallisto v0.43.1. The 18 blood cell types are classified into six different lineages including B-cells, T-cells, NK-cells, monocytes, granulocytes and dendritic cells.

The pig tissue samples were collected and analyzed in collaboration with BGI. Pig brain samples used for mRNA analysis were collected and handled in accordance with national guidance for large experimental animals and under permission of the local ethical committee (ethical permission numbers No.44410500000078 and BGI-IRB18135), and the work was conducted in line with European directives and regulations. The experimental minipigs (Chinese Bama Minipig) were provided by the Peral Lab Animal Sci & Tech Co.,Ltd (Permit number SYXK2017-0123). Male (n = 2) and female (n = 2) Chinese Bama minipigs (1 year old) were housed in a specific pathogen-free stable facility under standard conditions.

The human prefrontal cortex dataset includes 165 samples from 3 male and 3 female donors providing a detailed overview of protein expression in 17 subregions of the prefrontal cortex and 3 reference cortical regions. The analysis is a collaboration with Human Brain Tissue Bank (HBTB; Semmelweis University, Budapest) in accordance with approval from the Committee of Science and Research Ethic of the Ministry of Health Hungary (ETT TUKEB: 189/KO/02.6008/2002/ETT) and the Semmelweis University Regional Committee of Science and Research Ethic (No. 32/1992/TUKEB) to remove human brain tissue samples, collect, store and use them for research. Samples were collected by Prof. Palkovits and RNA was extracted from frozen brain punches.

GTEx RNA-seq data

The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped using RSEM v1.2.22 (v7), and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes.

Tissue             GTEx tissue                                Number of samples
Adipose tissue     Adipose - Subcutaneous                     442
Adipose tissue     Adipose - Visceral (Omentum)               355
Adrenal gland      Adrenal Gland                              190
Amygdala           Brain - Amygdala                           100
Breast             Breast - Mammary Tissue                    290
Caudate            Brain - Caudate (basal ganglia)            160
Cerebellum         Brain - Cerebellar Hemisphere              136
Cerebellum         Brain - Cerebellum                         173
Cerebral cortex    Brain - Anterior cingulate cortex (BA24)   121
Cerebral cortex    Brain - Cortex                             158
Cerebral cortex    Brain - Frontal Cortex (BA9)               129
Cervix, uterine    Cervix - Ectocervix                        6
Cervix, uterine    Cervix - Endocervix                        5
Colon              Colon - Sigmoid                            233
Colon              Colon - Transverse                         274
Endometrium        Uterus - Endometrium                       16
Esophagus          Esophagus - Mucosa                         407
Fallopian tube     Fallopian Tube                             7
Heart muscle       Heart - Atrial Appendage                   297
Heart muscle       Heart - Left Ventricle                     303
Hippocampus        Brain - Hippocampus                        123
Hypothalamus       Brain - Hypothalamus                       121
Kidney             Kidney - Cortex                            45
Liver              Liver                                      175
Lung               Lung                                       427
Nucleus accumbens  Brain - Nucleus accumbens (basal ganglia)  147
Ovary              Ovary                                      133
Pancreas           Pancreas                                   248
Pituitary gland    Pituitary                                  183
Prostate           Prostate                                   152
Putamen            Brain - Putamen (basal ganglia)            124
Salivary gland     Minor Salivary Gland                       97
Skeletal muscle    Muscle - Skeletal                          564
Skin               Skin - Not Sun Exposed (Suprapubic)        387
Skin               Skin - Sun Exposed (Lower leg)             473
Small intestine    Small Intestine - Terminal Ileum           137
Spinal cord        Brain - Spinal cord (cervical c-1)         91
Spleen             Spleen                                     162
Stomach            Stomach                                    262
Substantia nigra   Brain - Substantia nigra                   88
Testis             Testis                                     259
Thyroid gland      Thyroid                                    446
Urinary bladder    Bladder                                    11
Vagina             Vagina                                     115

FANTOM5 CAGE data

The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed in RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to ENSEMBL. The normalized Tags Per Million for each gene were calculated and included in the Human Protein Atlas.
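
Tags Per Million scales each gene's CAGE tag count by the total number of tags in the sample. The following is a minimal sketch of that library-size scaling only; the function name and the example counts are hypothetical, and any further normalization applied by FANTOM5 or the Human Protein Atlas is not described here and is not modelled.

```python
def tags_per_million(tag_counts):
    """Convert raw per-gene tag counts of one sample to Tags Per Million."""
    total_tags = sum(tag_counts.values())
    if total_tags == 0:
        raise ValueError("sample contains no tags")
    return {gene: count * 1_000_000 / total_tags
            for gene, count in tag_counts.items()}


# Hypothetical tag counts for a sample with 1,000 tags in total.
counts = {"GENE_A": 50, "GENE_B": 150, "GENE_C": 800}
tpm = tags_per_million(counts)
print(tpm)
# {'GENE_A': 50000.0, 'GENE_B': 150000.0, 'GENE_C': 800000.0}
```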

Tissue FANTOM5 tissue Sample description FANTOM5 sample id
Adipose tissue Adipose tissue 65,65,76 years, mixed FF:10010-101C1
Amygdala Amygdala 76 years, female FF:10151-102I7
Appendix Appendix 29 years, male FF:10189-103D9
Breast Breast 77 years, female FF:10080-102A8
Caudate Caudate nucleus 76 years, female FF:10164-103B2
Cerebellum Cerebellum 22-68 years, mixed FF:10083-102B2
Cerebellum 76 years, female FF:10166-103B4
Cervix, uterine Cervix 40,46,57,65 years, female FF:10013-101C4
Colon Colon 62,83,84 years, mixed FF:10014-101C5
Corpus callosum Corpus callosum 24-68 years, mixed FF:10042-101F6
Ductus deferens Ductus deferens 24 years, male FF:10196-103E7
Endometrium Uterus 23-63 years, female FF:10100-102D1
Epididymis Epididymis 24 years, male FF:10197-103E8
Esophagus Esophagus 68,74,75 years, mixed FF:10015-101C6
Frontal lobe Frontal lobe 32-61 years, mixed FF:10040-101F4
Gallbladder Gall bladder 57 years, male FF:10198-103E9
Globus pallidus Globus pallidus 76 years, female FF:10161-103A8
Globus pallidus 60 years, female FF:10175-103C4
Heart muscle Heart 70,73,74 years, mixed FF:10016-101C7
Left ventricle 73 years, female FF:10078-102A6
Left atrium 40 years, male FF:10079-102A7
Hippocampus Hippocampus 76 years, female FF:10153-102I9
Hippocampus 60 years, female FF:10169-103B7
Insular cortex Insula 20-68 years, mixed FF:10039-101F3
Kidney Kidney 60,62,63 years, female FF:10017-101C8
Liver Liver 64,69,70 years, mixed FF:10018-101C9
Locus coeruleus Locus coeruleus 76 years, female FF:10165-103B3
Locus coeruleus 60 years, female FF:10182-103D2
Lung Lung 46,65,94 years, mixed FF:10019-101D1
Lung - right lower lobe 29 years, male FF:10075-102A3
Lymph node Lymph node 30 years, male FF:10077-102A5
Medial frontal gyrus Medial frontal gyrus 76 years, female FF:10150-102I6
Medial temporal gyrus Medial temporal gyrus 76 years, female FF:10156-103A3
Medial temporal gyrus 60 years, female FF:10183-103D3
Medulla oblongata Medulla oblongata 18-64 years, mixed FF:10038-101F2
Medulla oblongata 76 years, female FF:10155-103A2
Medulla oblongata 60 years, female FF:10174-103C3
Nucleus accumbens Nucleus accumbens 23-56 years, mixed FF:10037-101F1
Occipital cortex Occipital cortex 76 years, female FF:10163-103B1
Occipital lobe Occipital lobe 27 years, male FF:10076-102A4
Occipital pole Occipital pole 22-68 years, mixed FF:10036-101E9
Olfactory region Olfactory region 87 years, female FF:10195-103E6
Ovary Ovary 47,75,84 years, female FF:10020-101D2
Pancreas Pancreas 52 years, male FF:10049-101G4
Paracentral gyrus Paracentral gyrus 22-69 years, mixed FF:10035-101E8
Parietal lobe Parietal lobe 35-89 years, mixed FF:10034-101E7
Parietal lobe 76 years, female FF:10157-103A4
Parietal lobe 60 years, female FF:10171-103B9
Pituitary gland Pituitary gland 76 years, female FF:10162-103A9
Placenta Placenta female FF:10021-101D3
Pons Pons 18-54 years, mixed FF:10033-101E6
Postcentral gyrus Postcentral gyrus 44-52 years, mixed FF:10032-101E5
Prostate Prostate 73,79,93 years, male FF:10022-101D4
Putamen Putamen 60 years, female FF:10176-103C5
Retina Retina 24-65 years, mixed FF:10030-101E3
Salivary gland Salivary gland 16-60 years, mixed FF:10093-102C3
Parotid gland 23 years, male FF:10199-103F1
Submaxillary gland 24 years, male FF:10202-103F4
Seminal vesicle Seminal vesicle 24 years, male FF:10201-103F3
Skeletal muscle Skeletal muscle 55,79,79 years, mixed FF:10023-101D5
Skeletal muscle - soleus muscle male FF:10282-104F3
Small intestine Small intestine 15,40,85 years, mixed FF:10024-101D6
Smooth muscle Smooth muscle 20-68 years, male FF:10048-101G3
Spinal cord Spinal cord 76 years, female FF:10159-103A6
Spinal cord 60 years, female FF:10181-103D1
Spleen Spleen 39,50,70 years, male FF:10025-101D7
Substantia nigra Substantia nigra 76 years, female FF:10158-103A5
Temporal cortex Temporal lobe 32-61 years, mixed FF:10031-101E4
Testis Testis 34,53,86 years, male FF:10026-101D8
Testis 14-64 years, male FF:10096-102C6
Thalamus Thalamus 76 years, female FF:10154-103A1
Thymus Thymus 0.5,0.5,0.83 years (infant), male FF:10027-101D9
Thyroid gland Thyroid 67,68,78 years, mixed FF:10028-101E1
Tongue Tongue 28 years, male FF:10203-103F5
Tonsil Tonsil 22-61 years, mixed FF:10047-101G2
Urinary bladder Bladder 55,58,79 years, mixed FF:10011-101C2
Vagina Vagina 68 years, female FF:10204-103F6

scRNA-seq data

Inclusion criteria

The single cell RNA sequencing dataset is based on a meta-analysis of the literature on single cell RNA sequencing and of single cell databases that include healthy human tissue. To avoid technical bias and to ensure that the single cell dataset best represents the corresponding tissue, the following data selection criteria were applied: (1) single cell transcriptomic datasets were limited to those based on the Chromium single cell gene expression platform from 10X Genomics (version 2 or 3); (2) single cell RNA sequencing was performed on single cell suspensions from tissues without pre-enrichment of cell types; (3) only studies with >4,000 cells and 20 million read counts were included; (4) only datasets whose pseudo-bulk transcriptomic expression profile is highly correlated with the transcriptomic expression profile of the corresponding HPA tissue bulk sample were included. It should be noted that exceptions were made for lung (~7.3 million reads), pancreas (3,719 cells) and rectum (3,898 cells) to include various cell types in the analysis.

Single cell transcriptomic datasets

In total, single cell transcriptomics data for 13 tissues and peripheral blood mononuclear cells (PBMCs) were analyzed. These datasets were retrieved from the Single Cell Expression Atlas, the Human Cell Atlas, the Gene Expression Omnibus, and the European Genome-phenome Archive. The complete list of references is shown in the table below.

Tissue Data source No. of M reads No. of cells Correlation with HPA bulk RNA Reference
Colon GSE116222 13.1 11167 0.811 Parikh K et al. (2019)
Eye GSE137537 23.1 20091 Menon M et al. (2019)
Heart muscle GSE109816 396.8 9182 0.797 Wang L et al. (2020)
Small intestine GSE125970 60.1 6167 Wang Y et al. (2020)
Kidney GSE131685 56.7 25279 0.867 Liao J et al. (2020)
Liver GSE115469 28.7 8439 0.837 MacParland SA et al. (2018)
Lung GSE130148 6.9 4599 0.863 Vieira Braga FA et al. (2019)
Placenta E-MTAB-6701 346.9 18547 0.879 Vento-Tormo R et al. (2018)
Pancreas GSE131886 93.2 3719 0.829 Qadir MMF et al. (2020)
Skin GSE130973 57.1 15798 0.756 Solé-Boldo L et al. (2020)
Prostate GSE117403 179.3 35862 0.756 Henry GH et al. (2018)
Rectum GSE125970 60.9 3898 0.756 Wang Y et al. (2020)
PBMC GSE112845 19.5 4972 0.756 Chen J et al. (2018)
Testis GSE120508 71.8 6490 0.756 Guo J et al. (2018)

Clustering of single cell transcriptomics data

For each of the single cell transcriptomics datasets, the quantified raw sequencing data were downloaded from the corresponding repository based on the accession number provided by the corresponding study, in the available format (total cells, read, and feature counts, or count tables). Unfiltered data were used as input for downstream analysis with an in-house pipeline using Scanpy (Single-Cell Analysis in Python, version 1.4.4.post1) in Python 3.7.3. In the pipeline, the data were filtered using two criteria: a cell is considered valid if it has at least 200 genes, and a gene is considered valid if it is expressed in at least 10% of the cells. Subsequently, the cell counts were normalized to a total count per cell of 10000. The valid cells were then clustered using the Louvain clustering function in Scanpy. The total read counts for all genes in each cluster were calculated by adding up the read counts of each gene in all cells belonging to the corresponding cluster. Finally, the read counts were normalized to transcripts per million protein coding genes (pTPM) for each of the single cell clusters. When calculating the expression profile for pseudo-bulk samples based on single cell transcriptomics, the read counts for all genes from all cells of the sample were added up and normalized to pTPM in the same way as for the clusters.
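
The aggregation and pTPM steps described above can be sketched as follows. This is a minimal illustration and not the in-house pipeline: the function name and the dictionary layout are hypothetical, the counts are assumed to be restricted to protein-coding genes already, and "pTPM" is taken here to mean scaling each cluster to a sum of one million.

```python
from collections import defaultdict

def cluster_ptpm(counts, cluster_of):
    """Sum read counts per cluster, then scale each cluster to 1e6.

    counts:     {cell_id: {gene: read_count}}  (hypothetical layout,
                assumed to contain protein-coding genes only)
    cluster_of: {cell_id: cluster_label}
    """
    totals = defaultdict(lambda: defaultdict(float))
    for cell, gene_counts in counts.items():
        for gene, count in gene_counts.items():
            totals[cluster_of[cell]][gene] += count

    ptpm = {}
    for cluster, genes in totals.items():
        cluster_sum = sum(genes.values())
        ptpm[cluster] = {
            gene: (count * 1e6 / cluster_sum if cluster_sum else 0.0)
            for gene, count in genes.items()
        }
    return ptpm

example_counts = {
    "cell1": {"A": 1, "B": 3},
    "cell2": {"A": 1, "B": 3},
    "cell3": {"A": 5, "B": 5},
}
example_clusters = {"cell1": 0, "cell2": 0, "cell3": 1}
example_ptpm = cluster_ptpm(example_counts, example_clusters)
```

A pseudo-bulk profile follows from the same function by assigning every cell of the sample to a single cluster label.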

Defining cell types

Each of the 192 different cell type clusters was manually annotated based on an extensive survey of >500 well-known tissue and cell type-specific markers, including both markers from the original publications, and additional markers used in pathology diagnostics. For each cluster, one main cell type was chosen by taking into consideration the expression of different markers. For a few clusters, no main cell type could be selected, and these clusters were not used for classification. The most relevant markers are presented in a heatmap on the Cell Type Atlas, in order to clarify cluster annotation to visitors.

Cell type dendrogram

The cell type dendrogram presented on the Cell Type Atlas shows the relationship between the single cell types based on genome-wide expression. The dendrogram is based on agglomerative clustering of 1 - Spearman's rho between cell types using Ward's criterion. The dendrogram was then transformed into a hierarchical graph in which link distances were normalized to emphasize graph connections rather than link distances. Link width is proportional to the distance from the root, and links are colored according to cell type group if only one cell type group is present among the connected leaves.
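
The distance measure named above, 1 - Spearman's rho, can be illustrated with a small standard-library sketch. It covers only the pairwise distance, not the Ward clustering or the graph layout, and the function names are hypothetical. It assumes neither expression vector is constant, since the correlation is undefined in that case.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_distance(x, y):
    """1 - Spearman's rho, i.e. Pearson correlation of the ranks."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))
```

The resulting pairwise distances would then be passed to an agglomerative clustering routine that supports Ward's criterion.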

Normalization of transcriptomics data

For each of the three transcriptomics datasets (HPA, GTEx and FANTOM5), the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all the samples within each data source (HPA human tissues, HPA blood cells, GTEx, and FANTOM5 respectively) were TMM normalized, followed by Pareto scaling of each gene within each data source. Tissue data from the three transcriptomics datasets were subsequently integrated using batch correction through the removeBatchEffect function of R package Limma, using the data source as a batch parameter. The blood RNA-seq dataset was not limma-adjusted. The resulting transcript expression values, denoted Normalized eXpression (NX), were calculated for each gene in every sample.
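
As an illustration of the Pareto scaling step only (TMM normalization and the batch correction are not shown), the sketch below uses the textbook definition: divide by the square root of the standard deviation, after mean-centering. The source does not state whether the values were centered, so the `center` flag and the function name are assumptions made for this example.

```python
import math
from statistics import mean, stdev

def pareto_scale(values, center=True):
    """Pareto-scale one gene's values across the samples of a data source.

    Textbook form: (x - mean) / sqrt(sd). With center=False the values
    are only divided by sqrt(sd), which keeps them non-negative.
    Requires at least two values (sample standard deviation).
    """
    sd = stdev(values)
    if sd == 0:
        # A constant gene carries no variation to scale.
        return [0.0] * len(values) if center else list(values)
    offset = mean(values) if center else 0.0
    return [(v - offset) / math.sqrt(sd) for v in values]

centered = pareto_scale([1, 2, 3, 4, 5])
uncentered = pareto_scale([1, 2, 3, 4, 5], center=False)
```

Compared with unit-variance scaling, dividing by the square root of the standard deviation reduces, but does not remove, the influence of highly variable genes.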

In the Human Protein Atlas, NX values for every gene and tissue were calculated and are visualized on the gene summary page together with the pTPM values for the individual samples. Consensus transcript expression levels for each gene were summarized in 74 human tissues based on transcriptomics data from three sources: HPA, GTEx and FANTOM5. The consensus normalized expression (NX) value for each gene and organ/tissue represents the maximum NX value in the three data sources. For tissues with multiple sub-tissues (brain regions, blood cells, lymphoid tissues and intestine) the maximum of all sub-tissues is used for the tissue type. The total number of tissue types in the human tissue consensus set is 37 and the total number of human blood cell types is 18.
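
The two maximum rules above (maximum across data sources, then maximum across sub-tissues) can be sketched for a single gene as follows. The function names, the dictionary layout and the example grouping are hypothetical and chosen for illustration only.

```python
def consensus_nx(nx_by_source):
    """Maximum NX per tissue across data sources, for one gene.

    nx_by_source: {source: {tissue: nx}}; a tissue may be missing
    from some sources.
    """
    consensus = {}
    for per_tissue in nx_by_source.values():
        for tissue, value in per_tissue.items():
            consensus[tissue] = max(value, consensus.get(tissue, value))
    return consensus

def collapse_subtissues(nx, groups):
    """Maximum NX over the sub-tissues of each tissue type.

    groups: {tissue_type: [sub-tissue names]} (hypothetical grouping).
    """
    return {
        tissue_type: max(nx[sub] for sub in subs if sub in nx)
        for tissue_type, subs in groups.items()
        if any(sub in nx for sub in subs)
    }

per_source = {
    "HPA": {"liver": 10.0, "colon": 3.0},
    "GTEx": {"liver": 12.0},
    "FANTOM5": {"liver": 8.0, "colon": 5.0, "rectum": 7.0},
}
per_tissue = consensus_nx(per_source)
per_type = collapse_subtissues(
    per_tissue, {"intestine": ["colon", "rectum"], "liver": ["liver"]}
)
```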

The prefrontal cortex (PFC) dataset is based on multiple samples (PFC regions) from six donors in total. Trimmed mean of M values (TMM) and robust-linear-model normalization were carried out to eliminate batch effects caused by sampling, post-mortem intervals, and the differences in transcriptome size between different brain regions in the PFC project. Trimmed average normalized expressions of the protein-coding genes in the analyzed subregions were used as the consensus result of the 17 PFC subregions and 3 reference cortical regions.

Mouse and pig transcriptomic data, generated by the HPA in collaboration with BGI, were normalized separately according to the same procedure used for human tissues and cell types, except that no Limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 10 brain regions for mouse and pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region. Corpus callosum, spinal cord, retina and pituitary gland were also included in the analysis, but are not defined as any of the 10 brain regions.

Single cell type clusters were normalized separately using TMM. To generate expression values per cell type, clusters were aggregated per cell type by using the median expression of each gene.
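
The aggregation from clusters to cell types can be sketched as a per-gene median. The function name and data layout are hypothetical, and the input values are assumed to be already TMM-normalized.

```python
from statistics import median

def cell_type_expression(cluster_expr, cell_type_of):
    """Median expression per gene over all clusters of a cell type.

    cluster_expr: {cluster: {gene: value}}  (already normalized)
    cell_type_of: {cluster: annotated cell type}
    """
    collected = {}
    for cluster, genes in cluster_expr.items():
        per_gene = collected.setdefault(cell_type_of[cluster], {})
        for gene, value in genes.items():
            per_gene.setdefault(gene, []).append(value)
    return {
        cell_type: {gene: median(values) for gene, values in genes.items()}
        for cell_type, genes in collected.items()
    }

expr = {"c0": {"A": 1.0}, "c1": {"A": 5.0}, "c2": {"A": 3.0}, "c3": {"A": 9.0}}
types = {"c0": "T-cells", "c1": "T-cells", "c2": "T-cells", "c3": "B-cells"}
by_type = cell_type_expression(expr, types)
```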

Classification of transcriptomics data

The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, blood cell-specific or cell line-specific expression into two different schemas: specificity category and distribution category. These are defined based on the total set of all NX values in 37 tissues, 43 single cell types, 10 main regions of each mammalian brain, 18 blood cell types or 69 cell lines and using a cutoff value of 1 NX as a limit for detection across all tissues or cell types.

Explanation of the specificity category

Category Description
Enriched NX level in a particular tissue/region/cell type at least four times any other tissue/region/cell type
Group enriched NX levels of a group (of 2-5 tissues or 2-10 cell types or 2-5 brain regions) at least four times any other tissue/region/cell type
Enhanced NX levels of a group (of 1-5 tissues or 1-10 cell types or 1-5 brain regions) at least four times the mean of other tissue/region/cell types
Low specificity NX ≥ 1 in at least one tissue/region/cell type but not elevated in any tissue/region/cell type
Not detected NX < 1 in all tissue/cell/region types
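
One possible reading of the specificity rules in the table above is sketched below for a single gene. Several details are assumptions of this sketch and not statements of the source: the categories are tested in the order listed, a group must consist of detected tissues, "the mean of other" is taken as the mean over all tissues except the one being tested, and the function and parameter names are hypothetical. `max_group` would be 5 for tissues and brain regions and 10 for cell types.

```python
from statistics import mean

def specificity_category(nx, fold=4.0, detect=1.0, max_group=5):
    """Classify one gene from {tissue: NX}; needs at least two tissues."""
    if len(nx) < 2:
        raise ValueError("need NX values for at least two tissues")
    values = sorted(nx.values(), reverse=True)
    if values[0] < detect:
        return "Not detected"
    # Enriched: top tissue at least `fold` times every other tissue.
    if values[0] >= fold * values[1]:
        return "Enriched"
    # Group enriched: the k highest tissues (k = 2..max_group) all at
    # least `fold` times the highest tissue outside the group.
    for k in range(2, min(max_group, len(values) - 1) + 1):
        if values[k - 1] >= detect and values[k - 1] >= fold * values[k]:
            return "Group enriched"
    # Enhanced: 1..max_group tissues at least `fold` times the mean of
    # the remaining tissues.
    enhanced = [
        tissue for tissue, value in nx.items()
        if value >= detect
        and value >= fold * mean(v for t, v in nx.items() if t != tissue)
    ]
    if 1 <= len(enhanced) <= max_group:
        return "Enhanced"
    return "Low specificity"
```

For example, a gene with NX 100 in liver and at most 5 elsewhere is "Enriched" under this reading, while a gene with NX 100 and 80 in two tissues and at most 5 elsewhere is "Group enriched".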

An additional category "elevated", containing all genes in the first three categories (tissue/cell line/cell type enriched, group enriched and tissue/cell line/cell type enhanced), has been used for some parts of the analysis. TS/CS-score (Tissue Specificity/Cell Specificity score) is calculated for “elevated” tissues/cell lines. TS/CS-score is calculated as the fold change from the tissue/cell line with highest RNA to the tissue/cell line with second highest RNA.

Explanation of the distribution category

Category Description
Detected in single Detected in a single tissue/region/cell type
Detected in some Detected in more than one but less than one third of tissue/region/cell types
Detected in many Detected in at least a third but not all tissue/region/cell types
Detected in all Detected in all tissue/region/cell types
Not detected NX < 1 in all tissue/region/cell types

External blood RNA-seq data

In addition to the blood cell type data generated within the Human Protein Atlas project, data from 15 blood cell types by Schmiedel et al. and 29 blood cell types as well as total PBMC by Monaco et al. have been incorporated into the Blood Atlas.

The Schmiedel dataset is available at the DICE (Database of Immune Cell Expression, Expression quantitative trait loci (eQTLs) and Epigenomics) database, which was established to address how genetic variants associated with risk for human diseases affect gene expression in various cell types. The TPM values per gene for 15 immune cell types were mapped to the corresponding genes in the Ensembl version used in the Human Protein Atlas.

The Monaco dataset contains data for 29 immune cell types within the peripheral blood mononuclear cell (PBMC) fraction of healthy donors using RNA-seq and flow cytometry. TPM values per transcript for 29 immune cells as well as total PBMC were mapped to the corresponding transcripts in the Ensembl version used in the Human Protein Atlas and summarized to pTPM values based only on protein coding transcripts.

TCGA RNA-seq data

The Cancer Genome Atlas (TCGA) project of Genomic Data Commons (GDC) collects and analyzes multiple human cancer samples. RNA-seq data from 17 cancer types representing 21 cancer subtypes with a corresponding major cancer type in the Human Pathology Atlas were included to allow for comparisons between the protein staining data from the Human Protein Atlas and RNA-seq from TCGA data.

The TCGA RNA-seq data were mapped using the Ensembl gene id available from TCGA, and the FPKM values (fragments per kilobase of exon per million reads) for each gene were subsequently used for quantification of expression with a detection threshold of 1 FPKM. Genes were categorized using the same classification as described above.

HPA cancer type TCGA cancer No. of samples in TCGA
Breast cancer Breast Invasive Carcinoma (BRCA) 1075
Cervical cancer Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) 291
Colorectal cancer Colon Adenocarcinoma (COAD) 438
Rectum Adenocarcinoma (READ) 159
Endometrial cancer Uterine Corpus Endometrial Carcinoma (UCEC) 541
Glioma Glioblastoma Multiforme (GBM) 153
Head and neck cancer Head and Neck Squamous Cell Carcinoma (HNSC) 499
Liver cancer Liver Hepatocellular Carcinoma (LIHC) 365
Lung cancer Lung Adenocarcinoma (LUAD) 500
Lung Squamous Cell Carcinoma (LUSC) 494
Melanoma Skin Cutaneous Melanoma (SKCM) 102
Ovarian cancer Ovarian Serous Cystadenocarcinoma (OV) 373
Pancreatic cancer Pancreatic Adenocarcinoma (PAAD) 176
Prostate cancer Prostate Adenocarcinoma (PRAD) 494
Renal cancer Kidney Chromophobe (KICH) 64
Kidney Renal Clear Cell Carcinoma (KIRC) 528
Kidney Renal Papillary Cell Carcinoma (KIRP) 285
Stomach cancer Stomach Adenocarcinoma (STAD) 354
Testis cancer Testicular Germ Cell Tumor (TGCT) 134
Thyroid cancer Thyroid Carcinoma (THCA) 501
Urothelial cancer Bladder Urothelial Carcinoma (BLCA) 406

TCGA survival

Based on the FPKM value of each gene, patients were classified into two expression groups and the correlation between expression level and patient survival was examined. The prognosis of each group of patients was examined by Kaplan-Meier survival estimators, and the survival outcomes of the two groups were compared by log-rank tests. Both median and maximally separated Kaplan-Meier plots are presented in the Human Protein Atlas, and genes with log-rank P values less than 0.001 in the maximally separated Kaplan-Meier analysis were defined as prognostic genes. If the group of patients with high expression of a selected prognostic gene has more observed events than expected events, the gene is an unfavorable prognostic gene; otherwise, it is a favorable prognostic gene. Genes with a median expression below 1 FPKM were considered lowly expressed and classified as unprognostic in the database, even if they exhibited a significant prognostic effect in the survival analysis.
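
The comparison of observed and expected events can be illustrated with a two-group log-rank test written against the standard library. This is a sketch of the standard test, not the code used for the database: the function names are hypothetical, the search for the maximally separating expression cutoff is not shown, and the P value uses the chi-square distribution with one degree of freedom.

```python
import math

def logrank_high_vs_low(times, events, high):
    """Two-group log-rank test.

    times:  follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    high:   True if the patient is in the high-expression group
    Returns (observed, expected, p_value) for the high group.
    """
    observed = expected = variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                      # at risk
        n1 = sum(1 for u, h in zip(times, high) if u >= t and h)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(1 for u, e, h in zip(times, events, high)
                 if u == t and e and h)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return observed, expected, 1.0
    chi2 = (observed - expected) ** 2 / variance
    return observed, expected, math.erfc(math.sqrt(chi2 / 2))

def prognostic_label(observed, expected, p_value, median_fpkm):
    """Apply the thresholds stated in the text (P < 0.001, FPKM >= 1)."""
    if p_value >= 0.001 or median_fpkm < 1:
        return "unprognostic"
    return "unfavorable" if observed > expected else "favorable"

# Toy cohort: all events observed, high-expression patients die first.
toy = logrank_high_vs_low(
    times=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    events=[1] * 10,
    high=[True] * 5 + [False] * 5,
)
```

In the toy cohort the high group has more observed than expected events, but with only ten patients the P value stays above the 0.001 threshold, so the gene would not be called prognostic.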

Allen Mouse brain ISH dataset

The Allen Brain Atlas (ABA) is an open access database focusing on the brain, and includes both human and mouse expression data. The ABA is a part of the Allen Institute for Brain Science, which is one of the three branches of the Allen Institute. The Mouse brain In situ hybridization (ISH) data provides information on where in the adult mouse brain each gene is expressed (Lein ES et al. (2007)). We have imported the expression values available through the ABA API (© 2004 Allen Institute for Brain Science, Allen Mouse Brain Atlas) and show the regional expression grouped in the same manner as the other datasets visualized on the HPA Brain Atlas.

The Allen mouse brain ISH data was mapped to the mouse gene annotation of Ensembl version 92.38 using the probe nucleotide sequences provided through the Allen mouse brain API together with the BLAST program package. The mouse genes were then mapped to human genes using Ensembl orthologue data with a one-to-one restriction.

Evidence

Protein evidence is calculated for each gene based on three different sources: UniProt protein existence (UniProt evidence); a Human Protein Atlas antibody- or RNA based score (HPA evidence); and evidence based on PeptideAtlas (MS evidence). In addition, for each gene, a protein evidence summary score is based on the maximum level of evidence in all three independent evidence scores (Evidence summary).

All scores are classified into the following categories:

  • Evidence at protein level
  • Evidence at transcript level
  • No evidence
  • Not available
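
The evidence summary, defined above as the maximum level over the three scores, can be sketched as follows. The ranking of "Not available" below "No evidence" is an assumption of this sketch, as the source only lists the categories; the names are hypothetical.

```python
# Lowest to highest; the relative order of the first two is assumed.
EVIDENCE_LEVELS = [
    "Not available",
    "No evidence",
    "Evidence at transcript level",
    "Evidence at protein level",
]

def evidence_summary(uniprot, hpa, ms):
    """Return the highest of the three evidence categories."""
    return max((uniprot, hpa, ms), key=EVIDENCE_LEVELS.index)
```

For example, a gene with transcript-level UniProt evidence, no HPA evidence and protein-level MS evidence is summarized as "Evidence at protein level".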

UniProt evidence is based on UniProt protein existence data, which uses five types of evidence for the existence of a protein. All genes in the classes "Experimental evidence at protein level" or "Experimental evidence at transcript level" are classified into the first two evidence categories, whereas genes from the "Inferred from homology", "Predicted", or "Uncertain" classes are classified as "No evidence". Genes where the gene identifier could not be mapped to UniProt from Ensembl version 92.38 are classified as "Not available".

The HPA evidence is calculated based on the manual curation of Western blot, tissue profiling and subcellular location as well as transcript profiling. All genes with Data reliability "Supported" in one or both of the two methods immunohistochemistry and immunofluorescence, or standard validation "Supported" for the Western blot application (assays using over-expression lysates not included) are classified as "Evidence at protein level". For the remaining genes, all genes detected at NX > 1 in at least one of the tissue types, cell type or cell lines used in the RNA-seq analysis based on HPA, GTEx or FANTOM5 are classified as "Evidence at transcript level". The remaining genes are classified as "No evidence".
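
The HPA evidence rule above is a short decision cascade and can be sketched as follows. The function and parameter names are hypothetical, and the Western blot value is assumed to exclude assays using over-expression lysates, as stated in the text.

```python
def hpa_evidence(ihc_reliability, if_reliability, wb_validation, max_nx):
    """Classify HPA evidence for one gene.

    ihc_reliability, if_reliability: data reliability for
        immunohistochemistry and immunofluorescence (or None)
    wb_validation: standard Western blot validation (or None)
    max_nx: highest NX over all tissues, cell types and cell lines
    """
    if "Supported" in (ihc_reliability, if_reliability):
        return "Evidence at protein level"
    if wb_validation == "Supported":
        return "Evidence at protein level"
    if max_nx > 1:
        return "Evidence at transcript level"
    return "No evidence"
```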

The MS evidence displays protein evidence based on PeptideAtlas for all proteins with protein evidence in NextProt and with a minimum of two distinct peptides. Each gene detected on the protein level according to PeptideAtlas is classified as "Evidence at protein level" and all remaining genes as "Not available".