Classifications, Ontologies and Standards¶

The Progenetix resource utilizes standardized diagnostic coding systems, with a dedicated move towards hierarchical ontologies. As part of the coding process we have developed and provide several code mapping resources through repositories, the Progenetix website and APIs.

In addition to diagnostic and other clinical concepts, Progenetix increasingly uses hierarchical terms and concepts for the annotation and querying of technical parameters such as platform technologies. Overall, the Progenetix resource uses a query style based on the Beacon v2 "filters" concept with a CURIE-based syntax.

List of filters recognized by different query endpoints¶

Public Ontologies with CURIE-based syntax¶

CURIE prefix	Code/Ontology	Examples
NCIT	NCIt Neoplasm¹	NCIT:C27676
HP	HPO²	HP:0012209
pubmed	NCBI Pubmed ID	pubmed:18810378
geo	NCBI Gene Expression Omnibus³	geo:GPL6801, geo:GSE19399, geo:GSM491153
arrayexpress	EBI ArrayExpress⁴	arrayexpress:E-MEXP-1008
cellosaurus	Cellosaurus - a knowledge resource on cell lines ⁵	cellosaurus:CVCL_1650
UBERON	Uberon Anatomical Ontology⁶	UBERON:0000992
cbioportal	cBioPortal⁹	cbioportal:msk_impact_2017
SO	Sequence Ontology¹¹	SO:0000704
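
As an illustration of the CURIE-based filter syntax, the following minimal sketch splits a filter value into its prefix and local part and checks the prefix against the table above. The helper name `split_curie` and the set `KNOWN_PREFIXES` are hypothetical, and treating prefixes as case-sensitive is an assumption; this is not how the Progenetix API itself validates filters.

```python
# Prefixes copied from the table above; the set name is illustrative only.
KNOWN_PREFIXES = {
    "NCIT", "HP", "pubmed", "geo", "arrayexpress",
    "cellosaurus", "UBERON", "cbioportal", "SO",
}


def split_curie(term: str) -> tuple[str, str]:
    """Split 'PREFIX:local' at the first colon and check the prefix."""
    prefix, sep, local = term.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a CURIE: {term!r}")
    # Assumption: prefixes are matched case-sensitively, as written in the table.
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix, local


print(split_curie("NCIT:C27676"))
```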

Private filters¶

Since some classifications cannot be referenced directly, and in accordance with the upcoming Beacon v2 concept of "private filters", Progenetix additionally uses a set of structured non-CURIE identifiers.

For terms with a pgx prefix, the identifiers.org resolver will

Filter prefix / local part	Code/Ontology	Example
pgx:icdom-...	ICD-O 3⁷ Morphologies (Progenetix)	pgx:icdom-81703
pgx:icdot-...	ICD-O 3⁷ Topographies (Progenetix)	pgx:icdot-C04.9
TCGA	The Cancer Genome Atlas (Progenetix)⁸	TCGA-000002fc-53a0-420e-b2aa-a40a358bba37
pgx:cohort-...	Progenetix cohorts ¹⁰	pgx:cohort-arraymap
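
The `pgx`-prefixed examples in the table share a visible `pgx:<scheme>-<code>` shape. The sketch below parses that shape with a regular expression. The pattern is inferred only from the listed examples and is an assumption, not a published grammar; `parse_pgx_filter` is a hypothetical helper.

```python
import re

# Pattern inferred from the examples in the table above, e.g.
# "pgx:icdom-81703"; an assumption, not an official grammar.
PGX_FILTER = re.compile(r"^pgx:(?P<scheme>[a-z]+)-(?P<code>.+)$")


def parse_pgx_filter(value: str) -> tuple[str, str]:
    """Return (scheme, code) for a pgx-prefixed private filter."""
    match = PGX_FILTER.match(value)
    if match is None:
        raise ValueError(f"not a pgx filter: {value!r}")
    return match.group("scheme"), match.group("code")


for example in ("pgx:icdom-81703", "pgx:icdot-C04.9", "pgx:cohort-arraymap"):
    print(parse_pgx_filter(example))
```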

Diagnoses, Phenotypes and Histologies¶

NCIt coding of tumor samples¶

based on the NCIt neoplasm core, now extended to the whole "neoplasia" subtree of the NCI Thesaurus (NCIT:C3262 and child terms)
first implementation of NCIt concept mapping in January 2017, at that time for a subset of arrayMap samples
now providing ICD-O 3 <=> NCIt mappings through the ICDOntologies mapping project, with a front-end and an API on the website

Current NCIt sample codes¶

progenetix.org/services/collations?collationTypes=NCIT&deliveryKeys=id,label,count
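
The collation address above consists of a service path and two query parameters, `collationTypes` and `deliveryKeys`. A minimal sketch of assembling it follows. The `https://` scheme and the helper name `collations_url` are assumptions; only the path and parameter names come from the link itself.

```python
from urllib.parse import urlencode

# Assumption: the service is reachable over HTTPS at this base address.
BASE_URL = "https://progenetix.org/services/collations"


def collations_url(collation_type: str,
                   delivery_keys: tuple[str, ...] = ("id", "label", "count")) -> str:
    """Build a collations URL with the two query parameters shown above."""
    query = urlencode(
        {"collationTypes": collation_type, "deliveryKeys": ",".join(delivery_keys)},
        safe=",",  # keep the commas readable instead of encoding them as %2C
    )
    return f"{BASE_URL}?{query}"


print(collations_url("NCIT"))
```

The same helper covers other collation types used on this page by passing, for example, `icdom`, `icdot` or `UBERON`.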

ICD coding of tumor samples¶

The Progenetix resource has primarily used the coding schemas of the International Classification of Diseases for Oncology (3rd edition; "ICD-O 3") to classify all biosamples for which experimental data is available. Users can get a list of ICD-O 3 codes in the Progenetix format through Progenetix collations.

The mappings used here for the ICD morphology codings (mapped to ICDMORPHOLOGY and ICDMORPHOLOGYCODE) are derived from the original source file last accessed on 2016-08-18 from the WHO. The primary codes have been updated from the 2011 update document ICDO3Updates2011.pdf.

Current ICD-O sample codes¶

ICD-O Morphologies
- progenetix.org/services/collations?collationTypes=icdom&deliveryKeys=id,label,count
ICD-O Topographies
- progenetix.org/services/collations/?collationTypes=icdot&deliveryKeys=id,label,count

UBERON codes¶

The organ sites of the original coding have been mapped to UBERON. The mappings are detailed in the related icdot2uberon project.

Current UBERON sample codes¶

progenetix.org/services/collations?collationTypes=UBERON&deliveryKeys=id,label,count

Genomic Variations (CNV Ontology)¶

The Progenetix repository contains a large number of genomic copy number variants. While we had previously limited CNV type annotations to the "minimum information content" - i.e. using DUP and DEL categories to indicate relative genomic copy number gains or losses, respectively - from 2022 Progenetix has moved to a richer CNV classification in line with "common use practices".

As part of the ELIXIR h-CNV community and as contributors to the GA4GH Beacon project and the Variant Representation Specification (VRS), we have co-developed a "CNV assessment ontology" which was accepted into the Experimental Factor Ontology (EFO) in January 2022, has been adopted by the VRS 1.3 standard (with slight changes) and is under discussion at Sequence Ontology (SO).

In January 2022 we switched the internal representation of CNV states to EFO codes and implemented the respective search functionality in the bycon package. Future data updates will gradually add the more granular classes such as EFO:0030073 where they apply.

id: EFO:0030063
label: copy number assessment
  |
  |-id: EFO:0030064
  | label: regional base ploidy
  |   |
  |   |-id: EFO:0030065
  |     label: copy-neutral loss of heterozygosity
  |
  |-id: EFO:0030066
    label: relative copy number variation
      |
      |-id: EFO:0030067
      | label: copy number loss
      |   |
      |   |-id: EFO:0030068
      |   | label: low-level copy number loss
      |   |
      |   |-id: EFO:0030069
      |     label: complete genomic deletion
      |
      |-id: EFO:0030070
        label: copy number gain
          |
          |-id: EFO:0030071
          | label: low-level copy number gain
          |
          |-id: EFO:0030072
             label: high-level copy number gain
             note: commonly but not consistently used for >=5 copies on a bi-allelic genome region
              |
              |-id: EFO:0030073
                 label: focal genome amplification
                 note: >-
                   commonly used for localized multi-copy genome amplification events where the
                   region does not extend >3Mb (varying 1-5Mb) and may exist in a large number of
                   copies
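
Because the classes form a tree, a query for a broad term can in principle also match samples annotated with a more specific child term. The sketch below transcribes the parent relations from the tree above and tests whether an annotation falls under a query term. It only illustrates the idea of hierarchical matching and makes no claim about how the bycon package implements it; `PARENT`, `ancestors` and `matches` are hypothetical names.

```python
# Child -> parent relations transcribed from the tree above.
PARENT = {
    "EFO:0030064": "EFO:0030063",
    "EFO:0030065": "EFO:0030064",
    "EFO:0030066": "EFO:0030063",
    "EFO:0030067": "EFO:0030066",
    "EFO:0030068": "EFO:0030067",
    "EFO:0030069": "EFO:0030067",
    "EFO:0030070": "EFO:0030066",
    "EFO:0030071": "EFO:0030070",
    "EFO:0030072": "EFO:0030070",
    "EFO:0030073": "EFO:0030072",
}


def ancestors(term: str) -> list[str]:
    """Walk up the tree from a term to the root, nearest parent first."""
    path = []
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path


def matches(query_term: str, annotated_term: str) -> bool:
    """True if the annotation is the query term or one of its descendants."""
    return annotated_term == query_term or query_term in ancestors(annotated_term)


print(matches("EFO:0030070", "EFO:0030073"))  # focal amplification under gain
print(matches("EFO:0030067", "EFO:0030073"))  # focal amplification under loss
```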

CNV terminology

Please see the variants annotation table at cnvar.org or in the Beacon v2 documentation.

Sequence Variation (SNV Ontology)¶

Used terms

id: SO:0001059
label: sequence_alteration
  |
  |-id: SO:0000159
  | label: deletion
  |
  |-id: SO:0000667
  | label: insertion
  |
  |-id: SO:1000002
  | label: substitution
      |
      |-id: SO:0002007
      | label: MNV (multiple nucleotide variant)
      |
      |-id: SO:0001483
        label: SNV (single nucleotide variant)

Extended tree

id: SO:0001059
label: sequence_alteration
  |
  |-id: SO:0001744
  | label: UPD (uniparental disomy)
  |
  |-id: SO:0000159
  | label: deletion
  |
  |-id: SO:1000032
  | label: delins
  |
  |-id: SO:0000667
  | label: insertion
  |
  |-id: SO:1000036
  | label: inversion
  |
  |-id: SO:0000248
  | label: sequence_length_alteration
  |   |
  |   |-id: SO:0001019
  |   | label: copy_number_variation
  |   |   |
  |   |   |-id: SO:0001742
  |   |   | label: copy_number_gain
  |   |   |
  |   |   |-id: SO:0001743
  |   |   | label: copy_number_loss
  |   |   |
  |   |   |-id: SO:0002210
  |   |     label: presence_absence_variation
  |   |
  |   |-id: SO:0002096
  |   | label: short_tandem_repeat_variation
  |   |
  |   |-id: SO:0000207
  |     label: simple_sequence_length_variation
  |
  |-id: SO:0001785
  | label: structural_alteration
  |   |
  |   |-id: SO:0001784
  |   | label: complex_structural_alteration
  |   |   |
  |   |   |-id: SO:0002062
  |   |     label: complex_chromosomal_rearrangement
  |   |
  |   |-id: SO:0001872
  |   | label: rearrangement_region
  |   |
  |   |-id: SO:0000199
  |     label: translocation
  |       |
  |       |-id: SO:1000044
  |         label: chromosomal_translocation
  |
  |-id: SO:1000002
  | label: substitution
      |
      |-id: SO:0002007
      | label: MNV (multiple nucleotide variant)
      |
      |-id: SO:0001483
        label: SNV (single nucleotide variant)

Sequence Ontology	Definition
`SO:0001059` sequence_alteration	A sequence alteration is a sequence feature whose extent is the deviation from another sequence.
`SO:0001483` SNV	SNVs are single nucleotide positions in genomic DNA at which different sequence alternatives exist.
`SO:0002007` MNV	An MNV is a multiple nucleotide variant (substitution) in which the inserted sequence is the same length as the replaced sequence.
`SO:0000159` deletion	The point at which one or more contiguous nucleotides were excised.
`SO:0000667` insertion	The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence.
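
The definitions above can be turned into a simple length-based rule for choosing a term. The sketch below is one possible reading, assuming that an absent allele is written as an empty string; `classify` is a hypothetical helper. Note that the Progenetix SNV example on this page is annotated with the generic parent term `SO:0001059`, so the resource does not necessarily assign the most specific class.

```python
def classify(reference: str, alternate: str) -> str:
    """Pick a Sequence Ontology term from the lengths of the two alleles."""
    if reference == alternate:
        raise ValueError("alleles are identical, nothing to classify")
    # Assumption: an absent allele is represented by an empty string.
    if not alternate:
        return "SO:0000159"  # deletion
    if not reference:
        return "SO:0000667"  # insertion
    if len(reference) == len(alternate):
        # Equal length: single or multiple nucleotide substitution.
        return "SO:0001483" if len(reference) == 1 else "SO:0002007"
    return "SO:1000032"  # delins: both non-empty, lengths differ


print(classify("G", "A"))
```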

Variant Schemas¶

GA4GH "Variant Representation" schema¶

The "Genomic Knowledge Standards" (GKS) of the Global Alliance for Genomics and Health (GA4GH) develop a modern schema for the unambiguous representation, transmission and recovery of sequence variants (genomic and beyond). The first release of the GA4GH Variation Representation Specification (vr-spec v1.0) does not yet include an option to represent structural variants. However, the internal roadmap of the project points towards an extension for CNV representation in 2020.

Links¶

vr-spec repository
documentation

Ad-Hoc & "Community" Formats¶

Progenetix `Variant` schema¶

The Progenetix cancer genomics resource stores its millions of CNVs as data objects in MongoDB document databases. The format of the single variants is based on the Beacon v2 default model with some modifications (e.g. incorporating the VRS 1.3 RelativeCopyNumber concept but with slightly rewrapped components).

The Progenetix data serves as the repository behind Beacon+, the forward-looking implementation of the ELIXIR Beacon project. Accordingly, upon export through the API, variants are re-mapped to a Beacon v2 representation.

Progenetix CNV example¶

{
  "id": "pgxvar-5bab576a727983b2e00b8d32",
  "variant_internal_id": "11:52900000-134452384:DEL",
  "analysis_id": "pgxcs-kftvldsu",
  "biosample_id": "pgxbs-kftva59y",
  "individual_id": "pgxind-kftx25eh",
  "variant_state": { "id": "EFO:0030067", "label": "copy number loss" },
  "location": {
    "sequence_id": "refseq:NC_000011.10",
    "chromosome": "11",
    "start": 52900000,
    "end": 134452384
  },
  "updated": "2022-03-29T14:36:47.454674"
}

Progenetix SNV example¶

  {
    "updated": "2023-05-25T17:03:45.096849",
    "analysis_id": "pgxcs-kl8hg1r8",
    "biosample_id": "pgxbs-kl8hg1r4",
    "id": "pgxvar-5be1840772798347f0ed9d9d",
    "variant_internal_id": "5:67589139:G>A",
    "location": {
      "sequence_id": "refseq:NC_000005.10",
      "chromosome": "5",
      "start": 67589138,
      "end": 67589139
    },
    "individual_id": "pgxind-kl8hg1r5",
    "reference_sequence": "G",
    "sequence": "A",
    "variant_state": { "id": "SO:0001059", "label": "sequence_alteration" }
  }
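
In the SNV example, `variant_internal_id` names position 67589139 while `location` spans 67589138 to 67589139. This is consistent with a 0-based, half-open interval for a 1-based position, although the page does not state the convention explicitly. Under that assumption, the conversion can be sketched as follows; `to_interval` is a hypothetical helper.

```python
def to_interval(position: int, reference: str) -> tuple[int, int]:
    """Convert a 1-based position to a 0-based, half-open (start, end) pair."""
    if position < 1 or not reference:
        raise ValueError("need a 1-based position and a non-empty reference")
    start = position - 1
    # The interval covers exactly the bases of the reference allele.
    return start, start + len(reference)


print(to_interval(67589139, "G"))
```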

Links¶

schema in progenetix/bycon code repository

Geolocation Data¶

Provenance and use of geolocation data¶

Geographic point coordinates are assigned to each sample after review of existing information from associated publications or repositories for their "best available" geographic origin, using the following order of precedence:

sample specific data (e.g. from article text)
experiment location
first author proxy

For publications without accessible sample data, the "author proxy" is generally used unless specific annotations have been found in the article.

A more detailed discussion of the problems and benefits of geographic provenance tagging can be found in Carrio-Cordo et al., DATABASE 2020.

Geolocations Service¶

The Progenetix API provides a service that returns geographic point coordinates for the majority of cities.

`GeoLocation` schema¶

The current version of the JSON Schema data schema for the geolocation object can be accessed through the Progenetix services API.
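
When reading the geometry of a GeoLocation object such as the Heidelberg example in this section, note that GeoJSON point coordinates are ordered longitude first, then latitude. The minimal sketch below extracts a (latitude, longitude) pair from such a feature. The embedded feature is a shortened copy of that example, and `lat_lon` is a hypothetical helper, not part of the Progenetix API.

```python
import json

# Shortened copy of the Heidelberg example from this section.
feature = json.loads("""
{
  "geometry": {"coordinates": [8.69, 49.41], "type": "Point"},
  "properties": {"city": "Heidelberg", "country": "Germany"},
  "type": "Feature"
}
""")


def lat_lon(feature: dict) -> tuple[float, float]:
    """Return (latitude, longitude) from a GeoJSON Point feature."""
    geometry = feature["geometry"]
    if geometry["type"] != "Point":
        raise ValueError("expected a Point geometry")
    # GeoJSON order is [longitude, latitude].
    longitude, latitude = geometry["coordinates"][:2]
    return latitude, longitude


print(lat_lon(feature))
```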

"geometry": {
  "coordinates": [
    8.69,
    49.41
  ],
  "type": "Point"
},
"properties": {
  "ISO3166alpha2": "DE",
  "ISO3166alpha3": "DEU",
  "city": "Heidelberg",
  "continent": "Europe",
  "country": "Germany"
},
"type": "Feature"

¹ National Cancer Institute Thesaurus Neoplasm (NCIt Neoplasm)
² Human Phenotype Ontology (HPO)
³ Supported identifiers include platforms (GPL), series (GSE) and samples (GSM). (GEO Overview)
⁴ Supports ArrayExpress accession IDs. (ArrayExpress browse)
⁵ Cellosaurus accession ID.
⁶ Uberon ID
⁷ International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3)
⁸ Supports TCGA Sample UUID.
⁹ Supports cBioPortal Study ID.
¹⁰ Cohorts defined in Progenetix involving a collection of related samples. Currently includes (add pgx:cohort-): arraymap, 2021progenetix, DIPG, TCGA, TCGAcancers, gao2021signatures.
¹¹ Sequence Ontology ID
