Classifications, Ontologies and Standards

The Progenetix resource utilizes standardized diagnostic coding systems, with a dedicated move towards hierarchical ontologies. As part of the coding process we have developed and provide several code mapping resources through repositories, the Progenetix website and APIs.

In addition to diagnostic and other clinical concepts, Progenetix increasingly uses hierarchical terms and concepts for the annotation and querying of technical parameters such as platform technologies. Overall, the Progenetix resource uses a query style based on the Beacon v2 "filters" concept with a CURIE-based syntax.


List of filters recognized by different query endpoints

Public Ontologies with CURIE-based syntax

CURIE prefix | Code/Ontology | Examples
-------------|---------------|---------
NCIT | NCIt Neoplasm [1] | NCIT:C27676
HP | HPO [2] | HP:0012209
PMID | NCBI PubMed ID | PMID:18810378
geo | NCBI Gene Expression Omnibus [3] | geo:GPL6801, geo:GSE19399, geo:GSM491153
arrayexpress | EBI ArrayExpress [4] | arrayexpress:E-MEXP-1008
cellosaurus | Cellosaurus - a knowledge resource on cell lines [5] | cellosaurus:CVCL_1650
UBERON | Uberon Anatomical Ontology [6] | UBERON:0000992
cbioportal | cBioPortal [9] | cbioportal:msk_impact_2017
SO | Sequence Ontology [11] | SO:0000704
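
As an illustration of the CURIE-based filter syntax, the following minimal sketch splits a filter value into its prefix and local part and checks the prefix against the table above. The regular expression, the function names and the comma-joined output are assumptions made for this example; they are not taken from the Progenetix or bycon code.

```python
import re

# Prefixes copied from the table above. The pattern below is an
# illustrative assumption, not the one used by Progenetix.
KNOWN_PREFIXES = {
    "NCIT", "HP", "PMID", "geo", "arrayexpress",
    "cellosaurus", "UBERON", "cbioportal", "SO",
}

CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9]*):(?P<local>\S+)$")


def parse_curie(filter_value):
    """Split a CURIE-style filter into (prefix, local_id)."""
    match = CURIE_PATTERN.match(filter_value.strip())
    if match is None:
        raise ValueError(f"not a CURIE: {filter_value!r}")
    prefix, local = match.group("prefix"), match.group("local")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix, local


def join_filters(filter_values):
    """Validate every filter, then join them into one comma-separated string."""
    cleaned = [value.strip() for value in filter_values]
    for value in cleaned:
        parse_curie(value)
    return ",".join(cleaned)


if __name__ == "__main__":
    print(parse_curie("geo:GSE19399"))
    print(join_filters(["NCIT:C27676", "PMID:18810378"]))
```

Prefixes are matched case-sensitively here because the table mixes upper-case (NCIT, UBERON) and lower-case (geo, cellosaurus) prefixes.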

Private filters

Since some classifications cannot be referenced directly, and in accordance with the upcoming Beacon v2 concept of "private filters", Progenetix additionally uses a set of structured non-CURIE identifiers.

For terms with a pgx prefix, the identifiers.org resolver will

Filter prefix / local part | Code/Ontology | Example
---------------------------|---------------|--------
pgx:icdom-... | ICD-O 3 [7] Morphologies (Progenetix) | pgx:icdom-81703
pgx:icdot-... | ICD-O 3 [7] Topographies (Progenetix) | pgx:icdot-C04.9
TCGA | The Cancer Genome Atlas (Progenetix) [8] | TCGA-000002fc-53a0-420e-b2aa-a40a358bba37
pgx:pgxcohort-... | Progenetix cohorts [10] | pgx:pgxcohort-arraymap
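
The ICD-O filter patterns in this table can be sketched as simple string conversions. The example assumes that the local part of a `pgx:icdom-...` filter is the ICD-O 3 morphology code with the "/" removed (so that 8170/3 becomes `pgx:icdom-81703`) and that topography codes are appended unchanged. This reading is inferred from the two examples above, and the function names are hypothetical.

```python
import re

# Assumption inferred from the table examples: morphology codes drop the "/",
# topography codes are appended as they are.
MORPHOLOGY_CODE = re.compile(r"\d{4}/\d")
TOPOGRAPHY_CODE = re.compile(r"C\d{2}\.\d")
MORPHOLOGY_FILTER = re.compile(r"pgx:icdom-(\d{4})(\d)")


def icdo_morphology_to_filter(code):
    """Convert e.g. '8170/3' into 'pgx:icdom-81703'."""
    if not MORPHOLOGY_CODE.fullmatch(code):
        raise ValueError(f"unexpected morphology code: {code!r}")
    return "pgx:icdom-" + code.replace("/", "")


def icdo_topography_to_filter(code):
    """Convert e.g. 'C04.9' into 'pgx:icdot-C04.9'."""
    if not TOPOGRAPHY_CODE.fullmatch(code):
        raise ValueError(f"unexpected topography code: {code!r}")
    return "pgx:icdot-" + code


def filter_to_icdo_morphology(filter_value):
    """Convert e.g. 'pgx:icdom-81703' back into '8170/3'."""
    match = MORPHOLOGY_FILTER.fullmatch(filter_value)
    if match is None:
        raise ValueError(f"unexpected filter: {filter_value!r}")
    return f"{match.group(1)}/{match.group(2)}"


if __name__ == "__main__":
    print(icdo_morphology_to_filter("8170/3"))
    print(icdo_topography_to_filter("C04.9"))
    print(filter_to_icdo_morphology("pgx:icdom-81703"))
```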

Diagnoses, Phenotypes and Histologies

NCIt coding of tumor samples

  • originally based on the NCIt Neoplasm Core, now extended to the whole "neoplasia" subtree of the NCI Thesaurus (NCIT:C3262 and child terms)
  • first implementation of NCIt concept mapping in January 2017, initially for a subset of arrayMap samples
  • now providing ICD-O 3 <=> NCIt mappings through the ICDOntologies mapping project, with a front-end and an API on the website

Current NCIt sample codes

ICD coding of tumor samples

The Progenetix resource has primarily used the coding schemas of the International Classification of Diseases for Oncology (3rd edition; "ICD-O 3") to classify all biosamples for which experimental data is available. Users can get a list of ICD-O 3 codes in the Progenetix format through Progenetix collations.

The mappings used here for the ICD morphology codings (mapped to ICDMORPHOLOGY and ICDMORPHOLOGYCODE) are derived from the original WHO source file, last accessed on 2016-08-18. The primary codes have been updated from the 2011 update document ICDO3Updates2011.pdf.

Current ICD-O sample codes

UBERON codes

The organ sites of the original coding have been mapped to UBERON. The mappings are detailed in the related icdot2uberon project.

Current UBERON sample codes


Genomic Variations (CNV Ontology)

The Progenetix repository contains a large number of genomic copy number variants. We had previously limited CNV type annotations to the "minimum information content" (i.e. using DUP and DEL categories to indicate relative genomic copy number gains or losses, respectively). Since 2022 Progenetix has moved to a richer CNV classification in line with "common use practices".

As part of the ELIXIR h-CNV community and as contributors to the GA4GH Beacon project and the Variant Representation Specification (VRS), we have co-developed a "CNV assessment ontology". In January 2022 it was accepted into the Experimental Factor Ontology (EFO); it has also been adopted by the VRS 1.3 standard (with slight changes) and is under discussion at the Sequence Ontology (SO).

In January 2022 we switched the internal representation of CNV states to EFO codes and implemented the respective search functionality in the bycon package. Future data updates will gradually add the more granular classes such as EFO:0030073 where they apply.

id: EFO:0030063
label: copy number assessment
  |
  |-id: EFO:0030064
  | label: regional base ploidy
  |   |
  |   |-id: EFO:0030065
  |     label: copy-neutral loss of heterozygosity
  |
  |-id: EFO:0030066
    label: relative copy number variation
      |
      |-id: EFO:0030067
      | label: copy number loss
      |   |
      |   |-id: EFO:0030068
      |   | label: low-level copy number loss
      |   |
      |   |-id: EFO:0030069
      |     label: complete genomic deletion
      |
      |-id: EFO:0030070
        label: copy number gain
          |
          |-id: EFO:0030071
          | label: low-level copy number gain
          |
          |-id: EFO:0030072
             label: high-level copy number gain
             note: commonly but not consistently used for >=5 copies on a bi-allelic genome region
              |
              |-id: EFO:0030073
                 label: focal genome amplification
                 note: >-
                   commonly used for localized multi-copy genome amplification events where the
                   region does not extend >3Mb (varying 1-5Mb) and may exist in a large number of
                   copies
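
One practical consequence of a hierarchical CNV ontology is that a query for a parent term can also match its more granular child terms. The sketch below encodes the tree above as a child-to-parent map and expands a term into itself plus all descendants. The `expand_term` helper is hypothetical and only illustrates the idea; it does not reproduce the search logic of the bycon package.

```python
# Child -> parent relations transcribed from the hierarchy above.
CNV_PARENTS = {
    "EFO:0030064": "EFO:0030063",  # regional base ploidy
    "EFO:0030065": "EFO:0030064",  # copy-neutral loss of heterozygosity
    "EFO:0030066": "EFO:0030063",  # relative copy number variation
    "EFO:0030067": "EFO:0030066",  # copy number loss
    "EFO:0030068": "EFO:0030067",  # low-level copy number loss
    "EFO:0030069": "EFO:0030067",  # complete genomic deletion
    "EFO:0030070": "EFO:0030066",  # copy number gain
    "EFO:0030071": "EFO:0030070",  # low-level copy number gain
    "EFO:0030072": "EFO:0030070",  # high-level copy number gain
    "EFO:0030073": "EFO:0030072",  # focal genome amplification
}

KNOWN_TERMS = set(CNV_PARENTS) | set(CNV_PARENTS.values())


def expand_term(term_id):
    """Return the term and all of its descendants as a sorted list."""
    if term_id not in KNOWN_TERMS:
        raise KeyError(f"unknown CNV term: {term_id}")
    selected = {term_id}
    changed = True
    while changed:  # repeat until no further child terms are found
        changed = False
        for child, parent in CNV_PARENTS.items():
            if parent in selected and child not in selected:
                selected.add(child)
                changed = True
    return sorted(selected)


if __name__ == "__main__":
    # "copy number gain" also covers low-level, high-level and focal gains
    print(expand_term("EFO:0030070"))
```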

CNV terminology

Please see the variants annotation table at cnvar.org or in the Beacon v2 documentation.

Sequence Variation (SNV Ontology)

id: SO:0001059
label: sequence_alteration
  |
  |-id: SO:0000159
  | label: deletion
  |
  |-id: SO:0000667
  | label: insertion
  |
  |-id: SO:1000002
  | label: substitution
      |
      |-id: SO:0002007
      | label: MNV (multiple nucleotide variant)
      |
      |-id: SO:0001483
        label: SNV (single nucleotide variant)

id: SO:0001059
label: sequence_alteration
  |
  |-id: SO:0001744
  | label: UPD (uniparental disomy)
  |
  |-id: SO:0000159
  | label: deletion
  |
  |-id: SO:1000032
  | label: delins
  |
  |-id: SO:0000667
  | label: insertion
  |
  |-id: SO:1000036
  | label: inversion
  |
  |-id: SO:0000248
  | label: sequence_length_alteration
  |   |
  |   |-id: SO:0001019
  |   | label: copy_number_variation
  |   |   |
  |   |   |-id: SO:0001742
  |   |   | label: copy_number_gain
  |   |   |
  |   |   |-id: SO:0001743
  |   |   | label: copy_number_loss
  |   |   |
  |   |   |-id: SO:0002210
  |   |     label: presence_absence_variation
  |   |
  |   |-id: SO:0002096
  |   | label: short_tandem_repeat_variation
  |   |
  |   |-id: SO:0000207
  |     label: simple_sequence_length_variation
  |
  |-id: SO:0001785
  | label: structural_alteration
  |   |
  |   |-id: SO:0001784
  |   | label: complex_structural_alteration
  |   |   |
  |   |   |-id: SO:0002062
  |   |     label: complex_chromosomal_rearrangement
  |   |
  |   |-id: SO:0001872
  |   | label: rearrangement_region
  |   |
  |   |-id: SO:0000199
  |     label: translocation
  |       |
  |       |-id: SO:1000044
  |         label: chromosomal_translocation
  |
  |-id: SO:1000002
  | label: substitution
      |
      |-id: SO:0002007
      | label: MNV (multiple nucleotide variant)
      |
      |-id: SO:0001483
        label: SNV (single nucleotide variant)

Sequence Ontology | Label | Definition
------------------|-------|-----------
SO:0001059 | sequence_alteration | A sequence alteration is a sequence feature whose extent is the deviation from another sequence.
SO:0001483 | SNV | SNVs are single nucleotide positions in genomic DNA at which different sequence alternatives exist.
SO:0002007 | MNV | An MNV is a multiple nucleotide variant (substitution) in which the inserted sequence is the same length as the replaced sequence.
SO:0000159 | deletion | The point at which one or more contiguous nucleotides were excised.
SO:0000667 | insertion | The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence.
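
The definitions above suggest a simple rule for assigning a Sequence Ontology term from the reference and alternate sequences. The following sketch assumes normalized alleles in which an empty string stands for an absent sequence (no shared anchor bases); the function name and this convention are assumptions for illustration, not the Progenetix implementation.

```python
def classify_sequence_variant(reference_sequence, sequence):
    """Return an (SO id, label) pair for a reference/alternate allele pair.

    Assumption: alleles are normalized, and "" means "no sequence".
    """
    ref, alt = reference_sequence.upper(), sequence.upper()
    if ref == alt:
        raise ValueError("reference and alternate sequences are identical")
    if ref and not alt:
        return "SO:0000159", "deletion"
    if alt and not ref:
        return "SO:0000667", "insertion"
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "SO:0001483", "SNV"
        # same length, more than one base replaced
        return "SO:0002007", "MNV"
    # both present but of different length
    return "SO:1000032", "delins"


if __name__ == "__main__":
    print(classify_sequence_variant("G", "A"))
    print(classify_sequence_variant("AC", "GT"))
    print(classify_sequence_variant("ACG", ""))
```

Where this level of detail is not needed, the generic parent term SO:0001059 (sequence_alteration) remains a valid annotation.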

Variant Schemas

GA4GH "Variant Representation" schema

The "Genomic Knowledge Standards" (GKS) work stream of the Global Alliance for Genomics and Health (GA4GH) develops a modern schema for the unambiguous representation, transmission and recovery of sequence variants (genomic and beyond). The first release of the GA4GH Variation Representation Specification (vr-spec v1.0) does not yet include the option to represent structural variants. However, the internal roadmap of the project points towards an extension for CNV representation in 2020.

Ad-Hoc & "Community" Formats

Progenetix Variant schema

The Progenetix cancer genomics resource stores its millions of CNVs as data objects in MongoDB document databases. The format of the individual variants is based on the Beacon v2 default model with some modifications (e.g. incorporating the VRS 1.3 RelativeCopyNumber concept, but with slightly rewrapped components).

The Progenetix data serves as the repository behind Beacon+, the forward-looking implementation of the ELIXIR Beacon project. Accordingly, upon export through the API, variants are re-mapped to a Beacon v2 representation.

Progenetix CNV example

{
  "id": "pgxvar-5bab576a727983b2e00b8d32",
  "variant_internal_id": "11:52900000-134452384:DEL",
  "callset_id": "pgxcs-kftvldsu",
  "biosample_id": "pgxbs-kftva59y",
  "individual_id": "pgxind-kftx25eh",
  "variant_state": { "id": "EFO:0030067", "label": "copy number loss" },
  "relative_copy_class": "partial loss",
  "location": {
    "sequence_id": "refseq:NC_000011.10",
    "chromosome": "11",
    "start": 52900000,
    "end": 134452384
  },
  "updated": "2022-03-29T14:36:47.454674"
}
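
To show how the legacy DUP/DEL categories relate to the EFO-coded `variant_state`, the following sketch parses a `variant_internal_id` such as the one in the example. The mapping of DEL to EFO:0030067 and DUP to EFO:0030070 follows the ontology labels above. The parser itself and the length calculation (`end - start`, which assumes 0-based, half-open coordinates) are assumptions for illustration.

```python
# Legacy categories mapped to the general EFO gain/loss terms (see the
# CNV ontology above). More granular classes are not derivable from DUP/DEL.
LEGACY_TO_EFO = {
    "DEL": {"id": "EFO:0030067", "label": "copy number loss"},
    "DUP": {"id": "EFO:0030070", "label": "copy number gain"},
}


def parse_cnv_internal_id(variant_internal_id):
    """Parse 'chromosome:start-end:TYPE' into a small variant dictionary."""
    chromosome, interval, legacy_type = variant_internal_id.split(":")
    start, end = (int(value) for value in interval.split("-"))
    if end <= start:
        raise ValueError("end must be larger than start")
    if legacy_type not in LEGACY_TO_EFO:
        raise ValueError(f"unsupported CNV type: {legacy_type!r}")
    return {
        "variant_state": dict(LEGACY_TO_EFO[legacy_type]),
        "location": {"chromosome": chromosome, "start": start, "end": end},
        # assumption: 0-based, half-open interval, so length is end - start
        "length": end - start,
    }


if __name__ == "__main__":
    print(parse_cnv_internal_id("11:52900000-134452384:DEL"))
```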

Progenetix SNV example

  {
    "updated": "2023-05-25T17:03:45.096849",
    "callset_id": "pgxcs-kl8hg1r8",
    "biosample_id": "pgxbs-kl8hg1r4",
    "id": "pgxvar-5be1840772798347f0ed9d9d",
    "variant_internal_id": "5:67589139:G>A",
    "location": {
      "sequence_id": "refseq:NC_000005.10",
      "chromosome": "5",
      "start": 67589138,
      "end": 67589139
    },
    "individual_id": "pgxind-kl8hg1r5",
    "reference_sequence": "G",
    "sequence": "A",
    "variant_state": { "id": "SO:0001059", "label": "sequence_alteration" }
  }
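
In the SNV example, the `variant_internal_id` gives position 67589139 while the location spans `start` 67589138 to `end` 67589139. This is consistent with a 1-based position being stored as a 0-based, half-open interval. The sketch below reproduces that conversion under this assumption; the function name is hypothetical.

```python
def snv_internal_id_to_location(variant_internal_id):
    """Convert 'chromosome:position:REF>ALT' into location and alleles.

    Assumption: the position is 1-based and refers to the first reference
    base; the returned interval is 0-based and half-open.
    """
    chromosome, position, change = variant_internal_id.split(":")
    reference_sequence, sequence = change.split(">")
    start = int(position) - 1
    end = start + len(reference_sequence)
    return {
        "location": {"chromosome": chromosome, "start": start, "end": end},
        "reference_sequence": reference_sequence,
        "sequence": sequence,
    }


if __name__ == "__main__":
    print(snv_internal_id_to_location("5:67589139:G>A"))
```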

Geolocation Data

Provenance and use of geolocation data

Geographic point coordinates are assigned to each sample after reviewing the associated publications or repository entries for the "best available" geographic origin, using the following precedence:

  1. sample specific data (e.g. from article text)
  2. experiment location
  3. first author proxy

For publications without accessible sample data, the "author proxy" is generally used unless specific annotations have been found in the article.
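
The precedence described above can be sketched as a simple ordered lookup. The source keys (`sample`, `experiment`, `first_author`) and the function name are hypothetical labels chosen for this example.

```python
# Hypothetical source labels, ordered from most to least specific.
PRECEDENCE = ("sample", "experiment", "first_author")


def best_available_location(candidates):
    """Return (source, location) for the most specific location available.

    `candidates` maps a source label to a location, or to None if unknown.
    """
    for source in PRECEDENCE:
        location = candidates.get(source)
        if location is not None:
            return source, location
    return None, None


if __name__ == "__main__":
    candidates = {
        "sample": None,  # no sample-specific data found in the article
        "experiment": None,
        "first_author": {"city": "Heidelberg", "country": "Germany"},
    }
    print(best_available_location(candidates))
```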

A more detailed discussion of the problems and benefits of geographic provenance tagging can be found in Carrio-Cordo et al., DATABASE 2020.

Geolocations Service

The Progenetix API provides a service for retrieving the geographic point coordinates of most cities.

GeoLocation schema

The current version of the JSON Schema for the geolocation object can be accessed through the Progenetix services API.

{
  "geometry": {
    "coordinates": [
      8.69,
      49.41
    ],
    "type": "Point"
  },
  "properties": {
    "ISO3166alpha2": "DE",
    "ISO3166alpha3": "DEU",
    "city": "Heidelberg",
    "continent": "Europe",
    "country": "Germany"
  },
  "type": "Feature"
}

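As a usage illustration, point features of this shape can be compared by great-circle distance. The sketch below uses the haversine formula and assumes that `coordinates` follow the GeoJSON order of longitude, then latitude, which matches the Heidelberg example (8.69 E, 49.41 N). The function name and the mean Earth radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a rough estimate


def great_circle_km(feature_a, feature_b):
    """Haversine distance in km between two GeoJSON-style point features."""
    # assumption: coordinates are [longitude, latitude] as in GeoJSON
    lon_a, lat_a = feature_a["geometry"]["coordinates"]
    lon_b, lat_b = feature_b["geometry"]["coordinates"]
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


if __name__ == "__main__":
    heidelberg = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [8.69, 49.41]},
        "properties": {"city": "Heidelberg", "country": "Germany"},
    }
    print(great_circle_km(heidelberg, heidelberg))
```
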

  1. National Cancer Institute Thesaurus Neoplasm NCIt Neoplasm 

  2. Human phenotype ontology HPO 

  3. Supported identifiers include platforms (GPL), series (GSE) and samples (GSM). GEO Overview

  4. Supports ArrayExpress Accession ID. ArrayExpress browse 

  5. Cellosaurus accession ID. 

  6. Uberon ID 

  7. International Classification of Diseases for Oncology, 3rd Edition ICD-O-3 

  8. Supports TCGA Sample UUID. 

  9. Supports cBioPortal Study ID. 

  10. Cohorts defined in Progenetix involving a collection of related samples. Currently includes (add pgx:cohort-): arraymap, 2021progenetix, DIPG, TCGA, TCGAcancers, gao2021signatures

  11. Sequence Ontology ID