
Classifications, Ontologies and Standards

The Progenetix resource utilizes standardized diagnostic coding systems, with a move towards hierarchical ontologies. As part of the coding process we have developed and provide several code mapping resources through repositories, the Progenetix website and APIs.

In addition to diagnostic and other clinical concepts, Progenetix increasingly uses hierarchical terms and concepts to annotate and query technical parameters such as platform technologies. Overall, the resource uses a query syntax built around the Beacon v2 "filters" concept with a CURIE-based syntax.

List of filters recognized by different query endpoints

Public Ontologies with CURIE-based syntax

CURIE prefix | Code/Ontology | Examples
NCIT | NCIt Neoplasm [1] | NCIT:C27676
HP | HPO [2] | HP:0012209
PMID | NCBI Pubmed ID | PMID:18810378
geo | NCBI Gene Expression Omnibus [3] | geo:GPL6801, geo:GSE19399, geo:GSM491153
arrayexpress | EBI ArrayExpress [4] | arrayexpress:E-MEXP-1008
cellosaurus | Cellosaurus - a knowledge resource on cell lines [5] | cellosaurus:CVCL_1650
UBERON | Uberon Anatomical Ontology [6] | UBERON:0000992
cbioportal | cBioPortal [9] | cbioportal:msk_impact_2017
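
The CURIE-based filter values in the table above can be split into a prefix and a local part before they are sent to a query endpoint. The following is a minimal sketch, not the Progenetix implementation: the function names `parse_curie` and `build_filters` are hypothetical, the prefix set is simply copied from the table, and the `{"id": ...}` shape of each filter object is an assumption about how a Beacon v2 style request body would carry these values.

```python
import re

# Prefixes copied from the table above; illustrative only, not an
# authoritative list of what any Progenetix endpoint accepts.
PUBLIC_PREFIXES = {
    "NCIT", "HP", "PMID", "geo", "arrayexpress",
    "cellosaurus", "UBERON", "cbioportal",
}

# A CURIE is "prefix:local"; the local part may contain "-", "_" or ".".
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9]*):(?P<local>\S+)$")


def parse_curie(value):
    """Split 'PREFIX:local' into (prefix, local); raise ValueError otherwise."""
    match = CURIE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a CURIE: {value!r}")
    return match.group("prefix"), match.group("local")


def build_filters(values):
    """Return a list of {"id": ...} objects (assumed request shape)."""
    filters = []
    for value in values:
        prefix, local = parse_curie(value)
        if prefix not in PUBLIC_PREFIXES:
            raise ValueError(f"unknown prefix: {prefix!r}")
        filters.append({"id": f"{prefix}:{local}"})
    return filters
```

For example, `build_filters(["NCIT:C27676", "geo:GPL6801"])` yields two filter objects, while a value without a colon is rejected before any request is made.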

Private filters

Since some classifications cannot be referenced directly, and in accordance with the upcoming Beacon v2 concept of "private filters", Progenetix additionally uses a set of structured non-CURIE identifiers.

For terms with a pgx prefix, the resolver will

Filter prefix / local part | Code/Ontology | Example
pgx:icdom-... | ICD-O 3 [7] Morphologies (Progenetix) | pgx:icdom-81703
pgx:icdot-... | ICD-O 3 [7] Topographies (Progenetix) | pgx:icdot-C04.9
TCGA | The Cancer Genome Atlas (Progenetix) [8] | TCGA-000002fc-53a0-420e-b2aa-a40a358bba37
pgx:pgxcohort-... | Progenetix cohorts [10] | pgx:pgxcohort-arraymap

Diagnoses, Phenotypes and Histologies

NCIt coding of tumor samples

  • based on the NCIt neoplasm core, now extended to the whole "neoplasia" subtree of the NCI Thesaurus (NCIT:C3262 and child terms)
  • first implementation of NCIt concept mapping in January 2017, initially for a subset of arrayMap samples
  • now providing ICD-O 3 <=> NCIt mappings through the ICDOntologies mapping project, with a front-end and an API on the website

Current NCIt sample codes

ICD coding of tumor samples

The Progenetix resource primarily used the coding schemas of the International Classification of Diseases for Oncology (3rd edition; "ICD-O 3") to classify all biosamples for which experimental data is available. Users can get a list of ICD-O 3 codes in the Progenetix format through Progenetix collations.

The mappings used here for the ICD morphology codings (mapped to ICDMORPHOLOGY and ICDMORPHOLOGYCODE) are derived from the original source file last accessed on 2016-08-18 from the WHO. The primary codes have been updated from the 2011 update document ICDO3Updates2011.pdf.
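
The filter examples on this page (pgx:icdom-81703 and pgx:icdot-C04.9) suggest how ICD-O 3 codes are written in the Progenetix format. The sketch below assumes that the morphology identifier is the conventional "8170/3" notation with the slash removed, and that the topography identifier keeps the code unchanged. Both rules are inferred from those two examples only, and the function names are hypothetical.

```python
import re


def icdo_morphology_to_pgx(code):
    """Convert e.g. '8170/3' to 'pgx:icdom-81703' (assumed rule: drop the slash)."""
    if not re.fullmatch(r"\d{4}/\d", code):
        raise ValueError(f"unexpected ICD-O morphology code: {code!r}")
    return "pgx:icdom-" + code.replace("/", "")


def icdo_topography_to_pgx(code):
    """Convert e.g. 'C04.9' to 'pgx:icdot-C04.9' (assumed rule: prefix only)."""
    if not re.fullmatch(r"C\d{2}\.\d", code):
        raise ValueError(f"unexpected ICD-O topography code: {code!r}")
    return "pgx:icdot-" + code
```

Validating the input first keeps malformed codes from turning into identifiers that look plausible but resolve to nothing.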

Current ICD-O sample codes

UBERON codes

The organ sites of the original coding have been mapped to UBERON. The mappings are detailed in the related icdot2uberon project.

Current UBERON sample codes

Genomic Variations (CNV Ontology)

The Progenetix repository contains predominantly copy number variants. We previously limited CNV type annotations to the "minimum information content", i.e. the DUP and DEL categories indicating relative genomic copy number gains or losses, respectively. From 2022 Progenetix will move to a richer CNV classification in line with "common use practices". As members of the ELIXIR h-CNV community and contributors to the GA4GH Beacon project and the Variant Representation Specification (VRS), we have co-developed a "CNV assessment ontology", which was accepted into the Experimental Factor Ontology (EFO) in January 2022 and is under discussion at Sequence Ontology (SO) and for use in VRS.

In January 2022 we switched the internal representation of CNV states to EFO codes and implemented the respective search functionality in the bycon package. Future data updates will gradually add the more granular classes such as EFO:0030073 where they apply.

id: EFO:0030063
label: copy number assessment
  |-id: EFO:0030064
  | label: regional base ploidy
  |   |
  |   |-id: EFO:0030065
  |     label: copy-neutral loss of heterozygosity
  |-id: EFO:0030066
    label: relative copy number variation
      |-id: EFO:0030067
      | label: copy number loss
      |   |
      |   |-id: EFO:0030068
      |   | label: low-level copy number loss
      |   |
      |   |-id: EFO:0030069
      |     label: complete genomic deletion
      |-id: EFO:0030070
        label: copy number gain
          |-id: EFO:0030071
          | label: low-level copy number gain
          |-id: EFO:0030072
             label: high-level copy number gain
             note: commonly but not consistently used for >=5 copies on a bi-allelic genome region
              |-id: EFO:0030073
                 label: focal genome amplification
                 note: >-
                   commonly used for localized multi-copy genome amplification events where the
                   region does not extend >3Mb (varying 1-5Mb) and may exist in a large number of
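
The hierarchy above lends itself to queries in which a general term also matches its more specific child terms. The following is a sketch under stated assumptions: the parent links are transcribed from the tree above, the DUP/DEL mapping to "copy number gain" and "copy number loss" is inferred from the text, and the helper names are hypothetical. Whether the bycon package expands terms in exactly this way is not stated here.

```python
# Child -> parent links transcribed from the hierarchy shown above.
PARENT = {
    "EFO:0030064": "EFO:0030063",  # regional base ploidy
    "EFO:0030065": "EFO:0030064",  # copy-neutral loss of heterozygosity
    "EFO:0030066": "EFO:0030063",  # relative copy number variation
    "EFO:0030067": "EFO:0030066",  # copy number loss
    "EFO:0030068": "EFO:0030067",  # low-level copy number loss
    "EFO:0030069": "EFO:0030067",  # complete genomic deletion
    "EFO:0030070": "EFO:0030066",  # copy number gain
    "EFO:0030071": "EFO:0030070",  # low-level copy number gain
    "EFO:0030072": "EFO:0030070",  # high-level copy number gain
    "EFO:0030073": "EFO:0030072",  # focal genome amplification
}

# Assumed mapping of the earlier DUP/DEL categories to EFO terms.
LEGACY_TO_EFO = {"DUP": "EFO:0030070", "DEL": "EFO:0030067"}


def ancestors(term):
    """Return the chain of parent terms from `term` up to the root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def matches(query_term, annotated_term):
    """True if the annotation equals the query term or is one of its descendants."""
    return annotated_term == query_term or query_term in ancestors(annotated_term)


def expand(query_term):
    """Return the query term together with all of its descendant terms."""
    known = set(PARENT) | set(PARENT.values())
    return {term for term in known if matches(query_term, term)}
```

With this structure, a query for EFO:0030070 (copy number gain) also matches a variant annotated as EFO:0030073 (focal genome amplification), whereas a query for EFO:0030067 (copy number loss) does not.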

Geolocation Data

Provenance and use of geolocation data

Geographic point coordinates are assigned to each sample after reviewing the associated publications or repository entries for the sample's "best available" geographic origin, using the following precedence:

  1. sample specific data (e.g. from article text)
  2. experiment location
  3. first author proxy

For publications without accessible sample data, the "author proxy" is generally used unless specific annotations have been found in the article.
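
The precedence described above can be expressed as an ordered lookup that returns the first available source. This is only an illustrative sketch: the `GeoCandidates` class, its field names and the `best_available` function are hypothetical and do not reflect how the curation is actually carried out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GeoCandidates:
    """Possible location sources for one sample (hypothetical structure)."""
    sample_specific: Optional[str] = None
    experiment_location: Optional[str] = None
    first_author_proxy: Optional[str] = None


# Order mirrors the numbered precedence list above.
PRECEDENCE = ("sample_specific", "experiment_location", "first_author_proxy")


def best_available(candidates: GeoCandidates) -> Optional[Tuple[str, str]]:
    """Return (source, location) for the highest-precedence source that is set."""
    for source in PRECEDENCE:
        location = getattr(candidates, source)
        if location:
            return source, location
    return None
```

Returning the source name alongside the location keeps the provenance of each assignment visible, which matters when only the author proxy was available.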

A more detailed discussion of the problems and benefits of geographic provenance tagging can be found in Carrio-Cordo et al., DATABASE 2020.

Geolocations Service

The Progenetix API provides a service for retrieving geographic point coordinates for the majority of cities.
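
The geolocation sample shown in the GeoLocation schema section of this page has a GeoJSON-like layout: a "Feature" with a "Point" geometry and a properties block holding city, country, continent and ISO 3166 codes. The sketch below builds such an object in Python. The function name `make_geo_feature` is hypothetical, the coordinate values in the usage line are approximate and purely illustrative, and the longitude-first ordering follows the GeoJSON convention, which is assumed to apply here.

```python
import json


def make_geo_feature(longitude, latitude, city, country, continent, iso2, iso3):
    """Build a GeoJSON-style point Feature with the property keys used on this page."""
    if not (-180 <= longitude <= 180 and -90 <= latitude <= 90):
        raise ValueError("coordinates out of range")
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON positions are ordered [longitude, latitude].
            "coordinates": [longitude, latitude],
        },
        "properties": {
            "ISO3166alpha2": iso2,
            "ISO3166alpha3": iso3,
            "city": city,
            "continent": continent,
            "country": country,
        },
    }


# Illustrative values only; the coordinates are approximate.
feature = make_geo_feature(8.69, 49.41, "Heidelberg", "Germany", "Europe", "DE", "DEU")
print(json.dumps(feature, indent=2, sort_keys=True))
```

Checking the coordinate ranges up front catches the common mistake of swapping latitude and longitude for locations where the swap produces an out-of-range value.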

GeoLocation schema

The current version of the JSON Schema definition for the geolocation object can be accessed through the Progenetix services API.

"geometry": {
  "coordinates": [...],
  "type": "Point"
},
"properties": {
  "ISO3166alpha2": "DE",
  "ISO3166alpha3": "DEU",
  "city": "Heidelberg",
  "continent": "Europe",
  "country": "Germany"
},
"type": "Feature"

  1. National Cancer Institute Thesaurus (NCIt) Neoplasm

  2. Human Phenotype Ontology (HPO)

  3. Supported identifiers include platforms (GPL), series (GSE) and samples (GSM); see the GEO Overview.

  4. Supports ArrayExpress accession IDs; see ArrayExpress browse.

  5. Cellosaurus accession ID. 

  6. Uberon ID 

  7. International Classification of Diseases for Oncology, 3rd Edition ICD-O-3 

  8. Supports TCGA Sample UUID. 

  9. Supports cBioPortal Study ID. 

  10. Cohorts defined in Progenetix involving a collection of related samples. Currently includes (add pgx:cohort-): arraymap, 2021progenetix, DIPG, TCGA, TCGAcancers, gao2021signatures