Anatomy of Knowledge Cleaning Ontology (AKCO)

Representing and organizing relevant aspects of knowledge cleaning assertion errors in knowledge graphs

Ontology Specification Draft

Abstract

The Anatomy of Knowledge Cleaning Ontology represents relevant aspects of cleaning assertion errors in knowledge graphs. Knowledge cleaning, i.e., error detection and correction, is an important part of the lifecycle of a knowledge graph. However, selecting suitable cleaning approaches requires different types of expertise, including consideration of possibilities, limitations, and underlying interdependencies. To address this, the ontology was developed based on an extensive literature review, designed following the Ontology 101 methodology, and aligned with the FAIR principles.

It enables users to streamline various tasks, such as gaining an overview of the aspects relevant to a specific knowledge graph or understanding which background knowledge a particular cleaning approach uses. The ontology comprises a set of newly introduced classes and properties and reuses existing vocabularies, such as the Simple Knowledge Organization System and the Data Quality Vocabulary. It currently contains over 100 instances that factualize a broad spectrum of literature, representing dimensions, background knowledge types, and various cleaning approaches. It includes rich metadata, versioning, and multiple access formats to support reuse and discoverability.

Introduction

Motivation

Knowledge graphs are a well-established approach for representing information of any domain in a machine-readable format. One important quality measure for a successful application is the correctness of the assertions. These assertions represent the factual knowledge of the domain and make up the majority of a knowledge graph. Knowledge cleaning, i.e., error detection and correction, improves their correctness. Different types of assertions can contain different errors, and various approaches have been developed to address these errors. Each approach targets a limited number of error types and typically performs either error detection or error correction.

Therefore, cleaning a knowledge graph of its different errors requires selecting several approaches. This selection should account for the various cleaning techniques and types of background knowledge available. A cleaning technique describes a specific methodology that is followed while processing an assertion. These methodologies use background knowledge as context, either to compare the assertion in question against or to identify replacements for erroneous entities. This background knowledge can be part of the knowledge graph itself, like other assertions, or come from external sources, like a text corpus. The adoption of a cleaning technique by an approach results in different dimensions, i.e., characteristics of this approach. For example, using only a given knowledge graph classifies an approach as internal, whereas using external sources classifies it as external. Considering all these different aspects makes the selection of approaches nontrivial. A great deal of expertise is required to understand the various concepts and their interdependencies.

Overview

To address these challenges, the Anatomy of Knowledge Cleaning Ontology (AKCO) defines the important concepts of knowledge cleaning and their properties. It enables one to gain insights into various aspects and to select approaches based on specific features. It is based on an extensive literature review [Sommer et al., 2025]. It is designed using the design principles discussed in "Ontology 101: A guide to creating your first ontology" [Noy et al., 2001] and "Best practices for implementing fair vocabularies and ontologies on the web" [Garijo et al., 2020].

The ontology streamlines the process of gaining important insights into possibilities, limitations, and dependencies of cleaning assertion errors in a given knowledge graph.

Foundation: the Anatomy of Knowledge Cleaning

Relevant aspects of knowledge cleaning are discussed in the survey [Sommer et al., 2025]. It provides the first qualitative analysis of knowledge cleaning. It introduces eleven cleaning techniques, seven types of background knowledge, four dimensions with at least two variations each, and various cleaning approaches. It follows a methodology for transparency. It employs a framework for context and structure, providing a discussion that highlights important insights into knowledge cleaning and potential future research areas. The Anatomy of Knowledge Cleaning, which is introduced in Section 2, captures the most important aspects of this publication.

Ontology Design Methodology

The design principles of [Noy et al., 2001] include defining basic specifications, like the domain of the ontology. The AKCO covers concepts and relations between those concepts that are relevant aspects of knowledge cleaning. It provides semantics for representing and relating different knowledge graph features, assertion errors, types of background knowledge, dimensions, cleaning techniques, and cleaning approaches. The instances (named individuals) of the concepts factualize knowledge cleaning. They provide insights into dependencies and support answering questions such as:

  • Without domain knowledge, which cleaning techniques are feasible to be applied?
  • Which errors can be targeted using a specific cleaning technique, like path-based approaches?
  • What types of background knowledge are needed for cleaning assertion errors without domain knowledge?
  • Should internal or external approaches be chosen for cleaning a knowledge graph?
  • Which approaches target semantic errors in property-value assertions?
  • How can a text document be used for error detection?
  • What types of background knowledge can be used to target property-value assertions?
  • Which approaches can be used to detect errors in property-value assertions that best support the features of a given knowledge graph?
  • What are knowledge-driven approaches for detecting errors in property-value assertions?

Vocabulary Reuse and Alignment

Based on those basic specifications, the methodology further recommends the reuse of existing ontologies. The AKCO, therefore, adopts concepts and relations from the Simple Knowledge Organization System (SKOS), the Data Quality Vocabulary (DQV), and Dublin Core Terms (DCTerms), complemented by terms from DCAT, OWL, and VANN. Each of these vocabularies adds a certain set of semantics to the AKCO.

Vocabulary   Purpose
SKOS         Modeling a taxonomy of errors, techniques, and approaches as concepts.
DQV          Expressing quality measures and metadata about datasets.
DCAT         Typing datasets used as background knowledge.
DCTerms      Capturing metadata of publications (title, issued date, citation).
OWL          Supporting versioning with owl:priorVersion and owl:versionIRI.
VANN         Annotating vocabulary-level metadata (e.g., namespace prefix and URI).

Reused Vocabularies in AKCO

Ontology Structure: Classes and Hierarchies

The AKCO can be interpreted as a taxonomy of concepts relevant to knowledge cleaning. Therefore, it makes the classes akco:Error, akco:Technique, and akco:Approach subclasses of skos:Concept. This allows these concepts to express hierarchical classifications. For example, an error in a tail vertical domain can be related to a more general semantic error using skos:broader (or, in the inverse direction, skos:narrower), and relations between different cleaning approaches or techniques can be defined in the same way.
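The hierarchy described above can be sketched in Turtle as follows. This is a minimal illustration under stated assumptions, not an excerpt from the published ontology: the akco: namespace IRI is a placeholder, and the two error individuals (akco:SemanticError and akco:TailVerticalDomainError) are hypothetical names chosen for this example.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Placeholder namespace (assumption); substitute the namespace published by the AKCO.
@prefix akco: <https://example.org/akco#> .

# The three taxonomy classes are subclasses of skos:Concept.
akco:Error     a owl:Class ; rdfs:subClassOf skos:Concept .
akco:Technique a owl:Class ; rdfs:subClassOf skos:Concept .
akco:Approach  a owl:Class ; rdfs:subClassOf skos:Concept .

# Hypothetical individuals: a general error and a more specific one.
akco:SemanticError a akco:Error ;
    skos:prefLabel "semantic error"@en ;
    skos:narrower akco:TailVerticalDomainError .

akco:TailVerticalDomainError a akco:Error ;
    skos:prefLabel "error in a tail vertical domain"@en ;
    skos:broader akco:SemanticError .
```

Because skos:broader and skos:narrower are inverses of each other, stating one direction is sufficient for a reasoner; both are shown here only for readability.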

Knowledge Graph and Background Knowledge modeling

The AKCO also needs to represent various quality measures of a knowledge graph and the associated background knowledge. Therefore, the DQV is adopted, which allows one to represent these quality measures flexibly. As a consequence, the concept akco:BackgroundKnowledge has two fundamentally different subtypes. The first can be represented as a dcat:Dataset and covers knowledge graphs and various external sources, such as documents or databases. The second is human expertise, which cannot be classified in the same manner. Here, individual instances should be created that represent the various forms of human expertise.
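The two kinds of background knowledge could be modeled roughly as shown below. This is a hedged sketch: the akco: namespace is a placeholder, the subclass axiom for akco:HumanExpertise is one reading of the text, and all individuals, the metric, and the measurement value are hypothetical. The DCAT class is written dcat:Dataset, as it is spelled in the DCAT specification.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
# Placeholder namespace (assumption); substitute the namespace published by the AKCO.
@prefix akco: <https://example.org/akco#> .

akco:BackgroundKnowledge a owl:Class .

# Dataset-like background knowledge: knowledge graphs and external sources.
dcat:Dataset        rdfs:subClassOf akco:BackgroundKnowledge .
akco:KnowledgeGraph a owl:Class ; rdfs:subClassOf dcat:Dataset .
akco:ExternalSource a owl:Class ; rdfs:subClassOf dcat:Dataset .

# Human expertise is kept apart from datasets (subclass axiom assumed here).
akco:HumanExpertise a owl:Class ; rdfs:subClassOf akco:BackgroundKnowledge .

# Hypothetical individuals.
akco:ExampleTextCorpus a akco:ExternalSource ; rdfs:label "text corpus"@en .
akco:ExampleExpert     a akco:HumanExpertise ; rdfs:label "domain expert"@en .

# Hypothetical DQV quality measurement attached to a knowledge graph.
akco:ExampleAccuracyMetric a dqv:Metric .
akco:ExampleKG a akco:KnowledgeGraph ;
    dqv:hasQualityMeasurement [
        a dqv:QualityMeasurement ;
        dqv:isMeasurementOf akco:ExampleAccuracyMetric ;
        dqv:value "0.9"^^xsd:double
    ] .
```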

Approach metadata modeling

Another integral part of the AKCO is its ability to capture the specification of various cleaning approaches. Therefore, the AKCO adopts the DCTerms vocabulary. It introduces a new property akco:isPublished, which has akco:Approach as its domain and dcterms:BibliographicResource as its range. The bibliographic resource describes an approach using dcterms:title, dcterms:source, dcterms:issued, and dcterms:bibliographicCitation. The source can be used to point to implementation sources, such as Git repositories, and the citation property can hold APA-style citations or similar.
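A sketch of how an approach and its publication metadata might look in Turtle is given below. Everything in it is illustrative: the akco: namespace is a placeholder, and the approach, its title, year, repository URL, and citation are hypothetical values that do not refer to a real publication.

```turtle
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
# Placeholder namespace (assumption); substitute the namespace published by the AKCO.
@prefix akco:    <https://example.org/akco#> .

# Hypothetical approach linked to its bibliographic resource.
akco:ExampleApproach a akco:Approach ;
    akco:isPublished akco:ExampleApproachPublication .

# Hypothetical bibliographic resource describing the approach.
akco:ExampleApproachPublication a dcterms:BibliographicResource ;
    dcterms:title "An example cleaning approach"@en ;
    dcterms:issued "2020"^^xsd:gYear ;
    # Implementation source, e.g., a Git repository (hypothetical URL).
    dcterms:source <https://example.org/git/example-approach> ;
    # Free-text citation, e.g., in APA style (hypothetical).
    dcterms:bibliographicCitation "Doe, J. (2020). An example cleaning approach." .
```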

Enumerated terms and instances

The design principles for an ontology suggested by [Noy et al., 2001] also include the enumeration of important terms. These terms can be taken from the Anatomy of Knowledge Cleaning (see Section 2). They include concepts such as techniques and errors, and extend to instances, including semantic errors, as well as various cleaning techniques, such as integrity constraint-based and embedding-based.

Defined classes and hierarchy

The AKCO defines eight classes, which are (1) akco:Approach, (2) akco:BackgroundKnowledge, (3) akco:Dimension, (4) akco:Error, (5) akco:ExternalSource, (6) akco:HumanExpertise, (7) akco:KnowledgeGraph, and (8) akco:Technique. These classes are all part of the Anatomy of Knowledge Cleaning. The AKCO uses a simple class hierarchy, which was already discussed as part of the adoption of other ontologies. First, akco:Approach, akco:Technique, and akco:Error are subclasses of skos:Concept. Second, following the adoption of the DQV, dcat:Dataset is a subclass of akco:BackgroundKnowledge, and akco:KnowledgeGraph and akco:ExternalSource are in turn subclasses of dcat:Dataset. These relations are also shown in the visualization (see Section 3.1).

Defined properties

The AKCO introduces six new object properties:

  • akco:isPublished defines the bibliographic resource of an akco:Approach.
  • akco:hasError defines the types of errors in an akco:KnowledgeGraph.
  • akco:targetsError defines which types of errors an akco:Approach can target.
  • akco:usesBackgroundKnowledge defines which akco:BackgroundKnowledge is used by an akco:Approach or an akco:Technique.
  • akco:usesTechnique defines which akco:Technique is used by an akco:Approach.
  • akco:hasDimension defines which akco:Dimension an akco:Approach has.

The AKCO also introduces three new datatype properties, which are (1) akco:errorNature, (2) akco:errorSource, and (3) akco:errorType. Each of those properties takes an xsd:string as its value. Further, akco:Approach, akco:Error, and akco:Technique are subclasses of skos:Concept, which allows them to be related with hierarchical properties.
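To illustrate how these properties fit together, the following Turtle sketch describes one knowledge graph, one error, and one approach. It is an assumption-laden example rather than data from the ontology: the akco: namespace is a placeholder, every individual name is hypothetical, and the string values of the three datatype properties are invented for illustration.

```turtle
# Placeholder namespace (assumption); substitute the namespace published by the AKCO.
@prefix akco: <https://example.org/akco#> .

# Hypothetical knowledge graph that contains a certain type of error.
akco:ExampleKG a akco:KnowledgeGraph ;
    akco:hasError akco:ExampleError .

# Hypothetical error described with the three datatype properties.
# Plain literals are of type xsd:string; the values are illustrative only.
akco:ExampleError a akco:Error ;
    akco:errorNature "semantic" ;
    akco:errorSource "automatic extraction" ;
    akco:errorType   "property-value assertion" .

# Hypothetical technique, dimension, and approach.
akco:ExampleTechnique a akco:Technique ;
    akco:usesBackgroundKnowledge akco:ExampleKG .

akco:ExampleInternalDimension a akco:Dimension .

akco:ExampleApproach a akco:Approach ;
    akco:targetsError akco:ExampleError ;
    akco:usesTechnique akco:ExampleTechnique ;
    akco:usesBackgroundKnowledge akco:ExampleKG ;
    akco:hasDimension akco:ExampleInternalDimension .
```

In this sketch the approach counts as internal because its only background knowledge is the knowledge graph being cleaned.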

Defined instances

The AKCO also contains over one hundred instances of various real-world knowledge cleaning concepts, such as eleven cleaning techniques, seven types of background knowledge, multiple different dimensions and errors, and close to one hundred different cleaning approaches. Each instance is also taken from the insights provided by the Anatomy of Knowledge Cleaning (see Section 2).

Support for FAIR Principles

The AKCO also adopts the insights from another ontology specification proposal. It follows the suggestions of [Garijo et al., 2020], who propose best practices for implementing Findable, Accessible, Interoperable, Reusable (FAIR) ontologies. As part of implementing these practices, the ontology documentation now includes the first update to the ontology, version 1.1.0. This version added various metadata semantics to the AKCO. It integrated properties from VANN, a vocabulary for annotating vocabulary descriptions. The AKCO uses it for namespace definitions by adopting vann:preferredNamespaceUri and vann:preferredNamespacePrefix. Another update was support for versioning the ontology, which included recording changes and providing a mechanism to obtain previous versions. For this purpose, the AKCO also adopted the properties owl:priorVersion and owl:versionIRI, both of which are provided by the Web Ontology Language (OWL). In addition, the update improved the dcterms:description property and changed the values of dcterms:creator and dcterms:contributor from literal values to IRIs. The AKCO now contains all recommended metadata for supporting the FAIR principles.
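The metadata described above might appear in an ontology header along the following lines. This is only a sketch: all IRIs are placeholders under example.org, the description text and the creator and contributor IRIs are hypothetical, and the prior version number 1.0.0 is an assumption. Note that VANN spells the namespace property vann:preferredNamespaceUri.

```turtle
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix vann:    <http://purl.org/vocab/vann/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Placeholder ontology IRI (assumption); the real IRI is published by the AKCO.
<https://example.org/akco> a owl:Ontology ;
    owl:versionInfo "1.1.0" ;
    # Versioning: the IRI of this version and of its predecessor (both assumed).
    owl:versionIRI   <https://example.org/akco/1.1.0> ;
    owl:priorVersion <https://example.org/akco/1.0.0> ;
    # Namespace annotations from VANN (namespace value is a placeholder).
    vann:preferredNamespacePrefix "akco" ;
    vann:preferredNamespaceUri    "https://example.org/akco#" ;
    # Creators and contributors as IRIs rather than literals (hypothetical IRIs).
    dcterms:creator     <https://example.org/people/creator> ;
    dcterms:contributor <https://example.org/people/contributor> ;
    dcterms:description "Example description of the ontology."@en .
```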

Documentation features

Furthermore, this documentation aligns with the FAIR principles. It contains two types of visualizations, one using WebVOWL and the other the Chowlk Visual Notation, which provide different views of the ontology. It provides the ontology in multiple formats, such as JSON-LD, RDF/XML, and N-Triples. It contains schema.org annotated metadata. The AKCO is also registered under a permanent URL: https://purl.archive.org/akco. It is also registered with prefix.cc.

Structure of the Documentation

The following sections of this documentation first present the Anatomy of Knowledge Cleaning, the theoretical foundation of the AKCO. They then give an overview of classes, properties, and instances, including the visualizations of the ontology. This is followed by a description of the AKCO and an application section. The application section shows how the ontology can be applied, for example, to answer the first four questions from the introduction. The final section provides the cross-references for all classes, properties, and instances.

Namespace declarations

The Anatomy of Knowledge Cleaning

The Anatomy of Knowledge Cleaning is the theoretical foundation of the AKCO. It introduces key aspects of cleaning assertion errors in knowledge graphs. It describes aspects such as cleaning techniques, discusses their possibilities and limitations, and highlights relations to other aspects. It is based on an extensive literature analysis, which is detailed in [Sommer et al., 2025].

The Anatomy of Knowledge Cleaning focuses on three main aspects of knowledge cleaning: cleaning techniques, background knowledge, and dimensions. Cleaning techniques define a methodology that is employed to detect or correct assertion errors. The techniques depend on various forms of background knowledge as context for processing. Dimensions are characteristics of cleaning approaches that describe a certain feature. For example, internal approaches use only a given knowledge graph as background knowledge, while external approaches use other sources.

Visualization

Anatomy of Knowledge Cleaning Ontology (AKCO): Application

The application of the AKCO showcases how the ontology can be applied to capture relevant aspects of knowledge cleaning. It is based on instances of the introduced classes, which exemplify the factualization (instantiation) of these classes. The ontology currently contains over 100 instances, which represent concepts such as different dimensions, techniques, or errors.

Section "Representing relevant aspects of knowledge cleaning" highlights how different knowledge cleaning aspects can be represented using the AKCO.

Section "Answering questions about knowledge cleaning" shows how the instances can be used to answer various questions related to knowledge cleaning.

The examples are expressed in Turtle syntax.

Representing relevant aspects of knowledge cleaning

Answering questions about knowledge cleaning

The AKCO can answer a variety of questions about knowledge cleaning. The following showcases how to apply the AKCO to answer the first four questions from the introduction.

Without domain knowledge, which cleaning techniques are feasible to be applied?
Which errors can be targeted using a specific cleaning technique, like path-based approaches?
What types of background knowledge are needed for cleaning assertion errors without domain knowledge?
Should internal or external approaches be chosen for cleaning a knowledge graph?

References

Acknowledgments

The authors would like to thank Silvio Peroni for developing LODE, a Live OWL Documentation Environment, which is used for representing the Cross Referencing Section of this document, and Daniel Garijo for developing Widoco, the program used to create the template used in this documentation.