The National Cancer Institute (NCI) Thesaurus (NCIt) is a biomedical reference terminology that has been under development for over a decade. Nearly every month from 2003 through 2012, the NCI published an updated version of the NCIt on the Web in the Web Ontology Language (OWL), as well as in other formats.
In order to investigate and characterise the evolution of the NCIt, we collected all OWL versions of the NCIt available (2003-2012) and conducted a cross-sectional study on this corpus. In particular, we gathered and analysed various axiom and entity statistics, and carried out a reasoner performance test over the corpus [1,2].
Additionally, using the diff tool Ecco (available as a web-based application here), we extracted the change sets between pairwise, consecutive NCIt versions. The diff notion driving Ecco first computes a purely syntactic difference (based on OWL’s notion of “structural equivalence”), and then verifies whether the additions or removals changed the set of entailments between versions. The tool also categorises changes based on a heuristic inference of the impact of each change [3]. The results of the diff study are available here.
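The two-step diff described above can be illustrated with a minimal Python sketch. It is a toy model, not Ecco's implementation: an ontology is assumed to be a set of atomic subclass axioms written as `(sub, sup)` pairs, structural equivalence is approximated by tuple equality, and entailment is approximated by reachability in the subclass graph. The names `entails` and `categorise_changes`, the "effectual"/"ineffectual" labels, and the example class names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def entails(axioms, sub, sup):
    """Toy entailment check: does `sub` subclass-of `sup` follow from the
    atomic subclass axioms by reflexivity and transitivity alone?"""
    if sub == sup:
        return True
    parents = defaultdict(set)
    for child, parent in axioms:
        parents[child].add(parent)
    seen, stack = {sub}, [sub]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent == sup:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def categorise_changes(old, new):
    """Step 1: syntactic diff (set difference over axioms).
    Step 2: check each change against the *other* version's entailments."""
    old, new = set(old), set(new)
    additions = new - old   # axioms present only in the newer version
    removals = old - new    # axioms present only in the older version
    return {
        # An addition changes entailments only if the old version did not already entail it.
        "effectual_additions": {ax for ax in additions if not entails(old, *ax)},
        "ineffectual_additions": {ax for ax in additions if entails(old, *ax)},
        # A removal changes entailments only if the new version no longer entails it.
        "effectual_removals": {ax for ax in removals if not entails(new, *ax)},
        "ineffectual_removals": {ax for ax in removals if entails(new, *ax)},
    }


# Hypothetical versions: v2 drops a redundant axiom and adds a new one.
v1 = {("Melanoma", "SkinCancer"), ("SkinCancer", "Cancer"), ("Melanoma", "Cancer")}
v2 = {("Melanoma", "SkinCancer"), ("SkinCancer", "Cancer"), ("Cancer", "Disease")}
changes = categorise_changes(v1, v2)
```

In this example the removed axiom `("Melanoma", "Cancer")` is still entailed by the newer version, so it lands in `ineffectual_removals`, whereas the added axiom `("Cancer", "Disease")` is a genuine logical change. A real OWL diff would need a reasoner rather than graph reachability to decide entailment.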
Looking at the entire NCIt history made it relatively straightforward to identify tool artefacts and significant events, and thus to disentangle accidental and essential features of the ontology. In the process we discovered a high level of “merely syntactic” removals and additions. In particular, we noticed a constant pruning of redundant information throughout the corpus. Overall, we not only obtain a rich, purely analytic characterisation of the change history of the NCIt, but also generate a realistic test corpus for incremental classification.
[1] See paper by Rafael S. Gonçalves, Bijan Parsia, and Uli Sattler: Analysing multiple versions of an ontology: A study of the NCI Thesaurus. In Proc. of DL, 2011.
[2] See paper by Rafael S. Gonçalves, Bijan Parsia, and Uli Sattler: Analysing the evolution of the NCI Thesaurus. In Proc. of CBMS, 2011.
[3] See paper by Rafael S. Gonçalves, Bijan Parsia, and Uli Sattler: Categorising logical differences between OWL ontologies. In Proc. of CIKM, 2011.