Introduction
Motivation
Knowledge graphs are a well-established approach for representing information of any domain in a machine-readable format. One important quality measure for a successful application is the correctness of the assertions. These assertions represent the factual knowledge of the domain and make up the majority of a knowledge graph. Knowledge cleaning, i.e., error detection and correction, improves their correctness. Different types of assertions can have different errors, and various approaches have been developed to address them. Each approach targets a limited number of error types and typically performs either error detection or error correction.
Therefore, to clean a knowledge graph of the different errors, several approaches need to be selected. This selection should account for the various cleaning techniques and types of background knowledge available. A cleaning technique describes a specific methodology that is followed during the processing of an assertion. These methodologies use background knowledge as context, either to compare the assertion in question against or to identify replacements for erroneous entities. This background knowledge can be part of the knowledge graph itself, such as other assertions, or come from external sources, such as a text corpus. The way an approach adapts a cleaning technique determines its dimensions, i.e., the characteristics of this approach. For example, using only a given knowledge graph classifies an approach as internal, whereas using external sources classifies it as external. Considering all these different aspects makes the selection of approaches nontrivial. A great deal of expertise is required to gain an understanding of the various concepts and their interdependencies.
Overview
To address these challenges, the Anatomy of Knowledge Cleaning Ontology (AKCO) defines important concepts of knowledge cleaning and their properties. It enables one to gain insights into various aspects and select approaches based on specific features. It is based on an extensive literature review [Sommer et al., 2025]. It is designed using the design principles discussed in "Ontology Development 101: A Guide to Creating Your First Ontology" [Noy et al., 2001] and "Best Practices for Implementing FAIR Vocabularies and Ontologies on the Web" [Garijo et al., 2020].
The ontology streamlines the process of gaining important insights into possibilities, limitations, and dependencies of cleaning assertion errors in a given knowledge graph.
Foundation: the Anatomy of Knowledge Cleaning
Relevant aspects of knowledge cleaning are discussed in the survey [Sommer et al., 2025], which provides the first qualitative analysis of knowledge cleaning. It introduces eleven cleaning techniques, seven types of background knowledge, four dimensions with at least two variations each, and various cleaning approaches. It follows a methodology for transparency and employs a framework for context and structure, providing a discussion that highlights important insights into knowledge cleaning and potential future research areas. The Anatomy of Knowledge Cleaning, which is introduced in Section 2, captures the most important aspects of this publication.
Ontology Design Methodology
The design principles of [Noy et al., 2001] include defining basic specifications, like the domain of the ontology. The AKCO covers concepts and relations between those concepts that are relevant aspects of knowledge cleaning. It provides semantics for representing and relating different knowledge graph features, assertion errors, types of background knowledge, dimensions, cleaning techniques, and cleaning approaches. The instances (named individuals) of the concepts factualize knowledge cleaning. They provide insights into dependencies and support answering questions such as:
- Without domain knowledge, which cleaning techniques can feasibly be applied?
- Which errors can be targeted using a specific cleaning technique, like path-based approaches?
- What types of background knowledge are needed for cleaning assertion errors without domain knowledge?
- Should internal or external approaches be chosen for cleaning a knowledge graph?
- Which approaches target semantic errors in property-value assertions?
- How can a text document be used for error detection?
- What types of background knowledge can be used to target property-value assertions?
- Which approaches can be used to detect errors in property-value assertions that best support the features of a given knowledge graph?
- What are knowledge-driven approaches for detecting errors in property-value assertions?
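How such questions could be answered over AKCO-style data can be illustrated with a minimal sketch. It assumes that the instances are available as plain subject-predicate-object tuples; all `ex:` instance names and the helper functions are hypothetical and serve only as illustration, not as part of the AKCO itself.

```python
# A minimal in-memory sketch; every "ex:" instance below is hypothetical.
from typing import Optional

Triple = tuple[str, str, str]

TRIPLES: set[Triple] = {
    ("ex:ApproachA", "akco:targetsError", "ex:SemanticError"),
    ("ex:ApproachA", "akco:usesTechnique", "ex:PathBased"),
    ("ex:ApproachB", "akco:targetsError", "ex:SyntacticError"),
    ("ex:ApproachB", "akco:usesTechnique", "ex:EmbeddingBased"),
    ("ex:ApproachC", "akco:targetsError", "ex:SemanticError"),
    ("ex:ApproachC", "akco:usesTechnique", "ex:EmbeddingBased"),
}


def match(
    triples: set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def approaches_targeting(triples: set[Triple], error: str) -> list[str]:
    """Which approaches target the given error?"""
    return [s for s, _, _ in match(triples, p="akco:targetsError", o=error)]


def errors_targeted_by_technique(triples: set[Triple], technique: str) -> list[str]:
    """Which errors are targeted by approaches that use the given technique?"""
    approaches = {s for s, _, _ in match(triples, p="akco:usesTechnique", o=technique)}
    return sorted(
        {o for s, _, o in match(triples, p="akco:targetsError") if s in approaches}
    )


if __name__ == "__main__":
    print(approaches_targeting(TRIPLES, "ex:SemanticError"))
    print(errors_targeted_by_technique(TRIPLES, "ex:EmbeddingBased"))
```

In practice, the same questions would more likely be posed as queries against the published ontology file; the tuple-based matching here only mirrors the idea of a triple pattern.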
Vocabulary Reuse and Alignment
Based on those basic specifications, the methodology further recommends the reuse of existing ontologies. The AKCO, therefore, adopts concepts and relations from the Simple Knowledge Organization System (SKOS), the Data Quality Vocabulary (DQV), and Dublin Core Terms (DCTerms). Each of these ontologies adds a certain set of semantics to the AKCO. The following table summarizes these and the other reused vocabularies.
Vocabulary | Purpose |
---|---|
SKOS | Modeling a taxonomy of errors, techniques, and approaches as concepts. |
DQV | Expressing quality measures and metadata about datasets. |
DCAT | Typing datasets used as background knowledge. |
DCTerms | Capturing metadata of publications (title, issued date, citation). |
OWL | Supporting versioning with owl:priorVersion and owl:versionIRI. |
VANN | Annotating vocabulary-level metadata (e.g., namespace prefix and URI). |
Ontology Structure: Classes and Hierarchies
The AKCO can be interpreted as a taxonomy of concepts relevant to knowledge cleaning. Therefore, it makes the classes error, technique, and approach subclasses of skos:Concept. This adoption allows these concepts to express hierarchical classifications. For example, an error in a tail vertical domain can be represented using skos:narrower relating to a more general semantic error, or the relation between different cleaning approaches or techniques can be defined.
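The hierarchical classification described above can be sketched as a small lookup structure. The sketch stores links in the skos:broader direction (the inverse of skos:narrower) and assumes, for simplicity, that each concept has at most one broader concept, which SKOS itself does not require. The concept names are hypothetical.

```python
# Hypothetical skos:broader links (specific concept -> more general concept).
BROADER: dict[str, str] = {
    "ex:TailDomainError": "ex:SemanticError",
    "ex:SemanticError": "ex:GenericError",
}


def broader_transitive(concept: str, broader: dict[str, str]) -> list[str]:
    """Follow skos:broader links upward, nearest ancestor first."""
    ancestors: list[str] = []
    seen = {concept}
    current = concept
    while current in broader:
        current = broader[current]
        if current in seen:  # guard against accidental cycles in the data
            break
        seen.add(current)
        ancestors.append(current)
    return ancestors


def narrower(concept: str, broader: dict[str, str]) -> list[str]:
    """Return the direct skos:narrower concepts of the given concept."""
    return sorted(child for child, parent in broader.items() if parent == concept)


if __name__ == "__main__":
    print(broader_transitive("ex:TailDomainError", BROADER))
    print(narrower("ex:SemanticError", BROADER))
```

Walking the hierarchy upward in this way makes it possible, for example, to treat an approach that targets a general error as a candidate for its more specific variants.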
Knowledge Graph and Background Knowledge modeling
The AKCO also needs to represent various quality measures of a knowledge graph and the associated background knowledge. Therefore, the DQV is adopted, which allows one to represent these quality measures flexibly. The concept akco:BackgroundKnowledge has two fundamentally different subtypes. One can be represented as a dcat:DataSet, which covers knowledge graphs and various external sources, such as documents or databases. The other is human expertise, which cannot be classified in the same manner. Here, individual instances should be created to represent the various forms of human expertise.
Approach metadata modeling
Another integral part of the AKCO is its ability to capture the specification of various cleaning approaches. Therefore, the AKCO adopts the DCTerms vocabulary. It introduces a new property akco:isPublished, which has akco:Approach as its domain and dcterms:BibliographicResource as its range. The bibliographic resource describes an approach using dcterms:title, dcterms:source, dcterms:issued, and dcterms:bibliographicCitation. The source can be used to reference implementation sources, such as Git repositories, and the citation property can hold APA-style citations or similar.
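The metadata pattern described above could be sketched as follows. The dataclass, the helper function, and all values (IRIs, title, repository URL, date, citation text) are hypothetical placeholders; only the property names are taken from the description.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class BibliographicResource:
    """Illustrative container for the four DCTerms properties named above."""

    iri: str
    title: str
    source: str  # e.g., a repository holding the implementation
    issued: str  # publication date as a string
    bibliographic_citation: str


def describe_approach(approach_iri: str, resource: BibliographicResource) -> list[Triple]:
    """Link an approach to its bibliographic resource and emit its metadata."""
    return [
        (approach_iri, "akco:isPublished", resource.iri),
        (resource.iri, "rdf:type", "dcterms:BibliographicResource"),
        (resource.iri, "dcterms:title", resource.title),
        (resource.iri, "dcterms:source", resource.source),
        (resource.iri, "dcterms:issued", resource.issued),
        (resource.iri, "dcterms:bibliographicCitation", resource.bibliographic_citation),
    ]


if __name__ == "__main__":
    # Placeholder values only; they do not describe a real publication.
    paper = BibliographicResource(
        iri="ex:ApproachAPaper",
        title="Placeholder Title",
        source="https://example.org/approach-a.git",
        issued="2000-01-01",
        bibliographic_citation="Placeholder citation in APA style.",
    )
    for triple in describe_approach("ex:ApproachA", paper):
        print(triple)
```

Keeping the publication metadata on a separate resource, rather than directly on the approach, lets several approaches point to the same publication.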
Enumerated terms and instances
The design principles for an ontology suggested by [Noy et al., 2001] also include the enumeration of important terms. These terms can be taken from the Anatomy of Knowledge Cleaning (see Section 2). They include concepts such as techniques and errors, and extend to instances, including semantic errors, as well as various cleaning techniques, such as integrity constraint-based and embedding-based.
Defined classes and hierarchy
The AKCO defines eight classes: (1) akco:Approach, (2) akco:BackgroundKnowledge, (3) akco:Dimension, (4) akco:Error, (5) akco:ExternalSource, (6) akco:HumanExpertise, (7) akco:KnowledgeGraph, and (8) akco:Technique. These classes are all part of the Anatomy of Knowledge Cleaning. The AKCO uses a simple class hierarchy, which was already discussed as part of the adoption of other ontologies. First, Approach, Technique, and Error are subclasses of skos:Concept. Second, in relation to the DQV, akco:BackgroundKnowledge has dcat:DataSet as a subclass, and akco:KnowledgeGraph and akco:ExternalSource are in turn subclasses of dcat:DataSet. These relations are also shown in the visualization (see Section 3.1).
Defined properties
The AKCO introduces six new object properties: (1) akco:isPublished, (2) akco:hasError, (3) akco:targetsError, (4) akco:usesBackgroundKnowledge, (5) akco:usesTechnique, and (6) akco:hasDimension. (1) defines the bibliographic resource of an akco:Approach. (2) defines the types of errors in an akco:KnowledgeGraph. (3) defines which types of errors an akco:Approach can target. (4) defines which akco:BackgroundKnowledge is used by an akco:Approach or an akco:Technique. (5) defines which akco:Technique is used by an akco:Approach. (6) defines which akco:Dimension an akco:Approach has. The AKCO also introduces three new datatype properties: (1) akco:errorNature, (2) akco:errorSource, and (3) akco:errorType. Each of those properties takes an xsd:string as its value. Further, akco:Approach, akco:Error, and akco:Technique are subclasses of skos:Concept, which allows them to be related with hierarchical properties.
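The object properties and the classes they connect can be summarized in a small lookup table, sketched below. The domains follow the description above; the ranges are read from the prose and are therefore an assumption, as are the `ex:` instances. The check treats domains and ranges as constraints on explicitly typed instances, which is a simplification: OWL uses them for inference, and subclass reasoning is ignored here.

```python
Triple = tuple[str, str, str]

# property -> (allowed subject classes, expected object class)
# Ranges are inferred from the textual description (assumption).
SIGNATURES: dict[str, tuple[set[str], str]] = {
    "akco:isPublished": ({"akco:Approach"}, "dcterms:BibliographicResource"),
    "akco:hasError": ({"akco:KnowledgeGraph"}, "akco:Error"),
    "akco:targetsError": ({"akco:Approach"}, "akco:Error"),
    "akco:usesBackgroundKnowledge": (
        {"akco:Approach", "akco:Technique"},
        "akco:BackgroundKnowledge",
    ),
    "akco:usesTechnique": ({"akco:Approach"}, "akco:Technique"),
    "akco:hasDimension": ({"akco:Approach"}, "akco:Dimension"),
}

# Hypothetical instances with their asserted class.
TYPES: dict[str, str] = {
    "ex:ApproachA": "akco:Approach",
    "ex:PathBased": "akco:Technique",
    "ex:SemanticError": "akco:Error",
    "ex:SomeGraph": "akco:KnowledgeGraph",
}


def conforms(triple: Triple, types: dict[str, str]) -> bool:
    """Check whether subject and object types fit the property's signature."""
    subject, prop, obj = triple
    if prop not in SIGNATURES:
        return False
    domains, expected_range = SIGNATURES[prop]
    return types.get(subject) in domains and types.get(obj) == expected_range


if __name__ == "__main__":
    print(conforms(("ex:ApproachA", "akco:usesTechnique", "ex:PathBased"), TYPES))
    print(conforms(("ex:PathBased", "akco:targetsError", "ex:SemanticError"), TYPES))
```

A check of this kind could help catch modeling slips when new approach instances are added by hand, for example a technique accidentally used as the subject of akco:targetsError.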
Defined instances
The AKCO also contains over one hundred instances of various real-world knowledge cleaning concepts, such as eleven cleaning techniques, seven types of background knowledge, multiple different dimensions and errors, and close to one hundred different cleaning approaches. Each instance is also taken from the insights provided by the Anatomy of Knowledge Cleaning (see Section 2).
Support for FAIR Principles
The AKCO also adopts the insights from another ontology specification proposal. It follows the suggestions of [Garijo et al., 2020], who propose best practices for implementing Findable, Accessible, Interoperable, Reusable (FAIR) ontologies. As part of implementing these practices, the ontology documentation now includes the first update to the ontology, version 1.1.0. This version added various metadata semantics to the AKCO. It integrated properties from VANN, a vocabulary for annotating vocabulary descriptions, which the AKCO uses for its namespace definitions by adopting vann:preferredNamespaceURI and vann:preferredNamespacePrefix. Another update was support for versioning the ontology, which included recording changes and providing a mechanism to obtain previous versions. For this purpose, the AKCO adopted the properties owl:priorVersion and owl:versionIRI, both of which are provided by the Web Ontology Language (OWL). The update also improved the dcterms:description property and changed the format of the values of dcterms:creator and dcterms:contributor to IRI instances rather than literal values. The AKCO now contains all recommended metadata for supporting the FAIR principles.
Documentation features
Furthermore, this documentation aligns with the principles. It contains multiple types of visualizations, one using WebVOWL and the other the Chowlk Visual Notation, which provide different views of the ontology. It provides the ontology in multiple formats.
It contains schema.org annotated metadata. The AKCO is also registered under the permanent URL https://purl.archive.org/akco and with Prefix.cc.
Structure of the Documentation
The following sections of this documentation first present the Anatomy of Knowledge Cleaning, the theoretical foundation of the AKCO, followed by an overview of classes, properties, and instances, including the visualizations of the ontology. Next come a description of the AKCO and an application section. The application section shows how the ontology can be applied, for example, to answer four of the questions from the introduction. The final section shows the cross-references for all classes, properties, and instances.