# OntoDT - Ontology of Datatypes

While developing an ontology of data mining general enough to represent the mining of structured data, we developed a separate ontology module, named OntoDT, for representing knowledge about datatypes. In the preliminary OntoDT development phase, the classes used to represent datatypes were integrated in OntoDM-core. For generality and reuse, however, we later exported the datatype-specific OntoDM classes into a separate ontology module, OntoDT. This ontology can now be reused independently by any other ontology that requires the representation of, and reasoning about, general-purpose datatypes.

## Background

The content of the OntoDT ontology module is based on an ISO standard for the representation of datatypes in computer systems and programming languages. The first edition of the standard, named "Language independent datatypes", was published in 1996. The revised version, named "General-Purpose Datatypes", was published in 2007. The standard specifies both primitive datatypes, defined without reference to other datatypes, and non-primitive datatypes, defined in terms of other datatypes, that occur commonly in programming languages. The definitions of datatypes are independent of any particular programming language or implementation. Furthermore, Meek (1994) discusses a proposal for a taxonomy of datatypes based on the first version of the ISO standard. His taxonomy follows the practice of starting with a number of primitive datatypes and using these to construct others. The proposed taxonomy is given only as an overview and a discussion of how things are done, without a formal representation in a machine-processable language that could be reused further.

## Content

The OntoDT ontology defines:

- datatype characterizing operation and a taxonomy of datatype characterizing operations,
- datatype quality and a taxonomy of datatype qualities, and
- a datatype taxonomy comprising classes and instances of
  - primitive datatypes,
  - generated datatypes (non-aggregate and aggregate datatypes),
  - subtypes, and
  - defined datatypes.

### Datatype and value space

In the OntoDT ontology, the *datatype* class is modeled as a subclass of the *OBI: data representational model* class. It defines the type of data, with the set of distinct values that the data can take, the properties of those values, and the operations on those values. The *datatype* class is connected to the *value space specification* class through the *has-member* relation and to the *characterizing operation* class through the *has-operation* relation. In addition, OntoDT models *datatype properties* as subclasses of the *quality* class and connects them to the datatype using the *has-quality* relation.

The *value space specification* class is modeled in OntoDT as a subclass of the *OntoDM: specification entity* class. It specifies the collection of values for a given datatype. The value space of a given datatype can be defined in different ways: by enumerating the values; with axioms using a set of fundamental notions; as a subset of values defined in another value space with a given set of properties; or as a combination of arbitrary values from some other defined value space by specifying a construction procedure.
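The relations described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the OWL representation used in OntoDT: the names `ValueSpace`, `Datatype`, `enumerated`, and `subset_of` are hypothetical, and a value space is approximated by a membership test. It shows three of the ways of defining a value space mentioned above (enumeration, axioms, and a subset of another value space).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, FrozenSet


@dataclass
class ValueSpace:
    """Hypothetical stand-in for a value space specification."""
    defined_by: str                    # how the value space is specified
    contains: Callable[[Any], bool]    # membership test for a candidate value


@dataclass
class Datatype:
    """Hypothetical stand-in for the datatype class and its three relations."""
    name: str
    value_space: ValueSpace                                                   # has-member
    operations: Dict[str, Callable[..., Any]] = field(default_factory=dict)  # has-operation
    qualities: FrozenSet[str] = frozenset()                                   # has-quality

    def has_value(self, value: Any) -> bool:
        return self.value_space.contains(value)


def enumerated(*values: Any) -> ValueSpace:
    """Value space defined by enumerating its values."""
    def contains(candidate: Any) -> bool:
        # Compare type as well as value, so that 1 is not mistaken for True.
        return any(type(candidate) is type(v) and candidate == v for v in values)
    return ValueSpace("enumeration", contains)


def subset_of(base: ValueSpace, predicate: Callable[[Any], bool]) -> ValueSpace:
    """Value space defined as the subset of `base` satisfying `predicate`."""
    def contains(candidate: Any) -> bool:
        return base.contains(candidate) and predicate(candidate)
    return ValueSpace("subset", contains)


# Assumption: Python's int stands in for an axiomatically defined integer space.
integer_space = ValueSpace(
    "axioms", lambda v: isinstance(v, int) and not isinstance(v, bool)
)

boolean = Datatype(
    "boolean",
    enumerated(True, False),
    operations={"Equal": lambda a, b: a == b},
)
integer = Datatype(
    "integer",
    integer_space,
    operations={"Add": lambda a, b: a + b},
    qualities=frozenset({"ordered", "numeric", "exact"}),
)
even_integer = Datatype("even integer", subset_of(integer_space, lambda v: v % 2 == 0))
```

Here `boolean.has_value(1)` is false even though `1 == True` in Python, because the enumeration check also compares types.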

### Characterizing operations

A *characterizing operation* is defined as an *IAO: directive information entity* that specifies those operations on the datatype that distinguish it from other datatypes having identical value spaces. The characterizing operation of a datatype can be *niliadic*, *monadic*, *dyadic*, or *n-adic*.

- A *niliadic operation* specifies an operation that yields values of a given datatype.
- A *monadic operation* specifies an operation that maps a value of a given datatype into a value of the given datatype, or into a value of the *boolean* datatype.
- A *dyadic operation* specifies an operation that maps a pair of values of a given datatype into a value of the given datatype, or into a value of the *boolean* datatype.
- An *n-adic operation* specifies an operation that maps an ordered n-tuple of values (n>2), each of which is of a specific datatype, into values of a given datatype.

Finally, all characterizing operation classes have defined subclasses, which represent datatype-specific operations.
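The four kinds of characterizing operation differ only in the number of arguments they take, which a short Python sketch can make concrete. The class `CharacterizingOperation` and the operation names below (`Zero`, `Negate`, `NonNegative`, `Add`, `Equal`, `Clamp`) are illustrative assumptions for an integer-like datatype, not identifiers taken from OntoDT.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CharacterizingOperation:
    """Hypothetical wrapper that classifies an operation by its arity."""
    name: str
    arity: int
    function: Callable[..., Any]

    @property
    def kind(self) -> str:
        if self.arity == 0:
            return "niliadic"
        if self.arity == 1:
            return "monadic"
        if self.arity == 2:
            return "dyadic"
        return "n-adic"  # more than two arguments

    def __call__(self, *args: Any) -> Any:
        if len(args) != self.arity:
            raise TypeError(
                f"{self.name} expects {self.arity} argument(s), got {len(args)}"
            )
        return self.function(*args)


# Illustrative operations on an integer-like datatype.
zero = CharacterizingOperation("Zero", 0, lambda: 0)                       # yields a value
negate = CharacterizingOperation("Negate", 1, lambda x: -x)                # value -> value
non_negative = CharacterizingOperation("NonNegative", 1, lambda x: x >= 0) # value -> boolean
add = CharacterizingOperation("Add", 2, lambda x, y: x + y)                # pair -> value
equal = CharacterizingOperation("Equal", 2, lambda x, y: x == y)           # pair -> boolean
clamp = CharacterizingOperation(
    "Clamp", 3, lambda x, low, high: max(low, min(x, high))                # 3-tuple -> value
)
```

Calling an operation with the wrong number of arguments, for example `add(1)`, raises a `TypeError` in this sketch.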

### Datatype properties

A *datatype property* is defined as a *quality* that specifies the intrinsic properties of the data units represented by the datatype, regardless of the properties of their representations in computer systems. Each datatype has a set of unique datatype properties. These include property classes such as: *order*, *numericalness*, *cardinality*, *exactness*, *equality*, and *boundedness*.

*Order* is a datatype property that denotes whether there exists an order relation defined on its value space. *Numericalness* denotes whether the values in the value space are quantities expressed in a mathematical numbering system. *Cardinality* denotes the notion of cardinality of the value space. *Exactness* denotes whether every value from the value space is distinguishable from every other value in the value space. *Boundedness* is a property that denotes the boundaries of the value space.

All datatype property classes have defined subclasses. For example, the *boundedness* class has the following subclasses: *bounded* (*bounded below, bounded above*) and *unbounded* (*unbounded below, unbounded above*).
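As a rough illustration of how such qualities might be attached to a datatype, the sketch below records a few of them in a small Python class and derives the boundedness subclasses from optional bounds. The class `DatatypeQualities`, its field names, and the `percentage` example are hypothetical and not part of OntoDT.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class DatatypeQualities:
    """Hypothetical record of some datatype properties."""
    ordered: bool                      # order
    numeric: bool                      # numericalness
    exact: bool                        # exactness
    lower_bound: Optional[int] = None  # None means no lower bound
    upper_bound: Optional[int] = None  # None means no upper bound

    def boundedness_classes(self) -> Set[str]:
        """Return the boundedness subclasses that apply to this datatype."""
        below = "bounded below" if self.lower_bound is not None else "unbounded below"
        above = "bounded above" if self.upper_bound is not None else "unbounded above"
        return {below, above}


# An integer-like datatype: ordered, numeric, exact, with no bounds.
integer = DatatypeQualities(ordered=True, numeric=True, exact=True)

# A hypothetical percentage datatype restricted to 0..100.
percentage = DatatypeQualities(
    ordered=True, numeric=True, exact=True, lower_bound=0, upper_bound=100
)
```

In this sketch a datatype with only a lower bound would fall under both *bounded below* and *unbounded above*.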

### Extended datatype

In OntoDT, an *extended datatype* (named *'subtype'* in the ISO standard) is defined as an *IAO: data representational model* that is derived from an existing datatype by restricting the value space to a subset of the base datatype, while maintaining all operations. The base type denotes the role of a datatype as a *parametric datatype* on which a generator operates to produce a new datatype. An extended datatype is defined by a *subtype generator* that represents the relationship between the value spaces of the base type and the extended datatype.

In OntoDT, we define the following classes of subtype generators: *range generator*, *selection generator*, *exclusion generator*, *size generator*, *extension generator*, and *explicit subtype generator*. Subtype generators can change the set of datatype properties valid for the base datatype, which is why we do not represent extended datatypes simply as subclasses of the datatype class. For example, applying the range generator to an unbounded datatype will make it bounded.

Using these notions, we can represent an extended datatype of any previously defined type. For example, by using a range subtype generator we can place a new upper and/or lower bound on the value space of a chosen base datatype. The *positive integer* datatype is an extended datatype of the *integer* datatype obtained by limiting the value space with a lower bound of zero.
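A range generator can be sketched as a function that takes a base datatype and returns a new one with a restricted value space, the same operations, and updated bounds. This is a minimal illustration under stated assumptions: `Datatype`, `range_generator`, and the `small integer` example are hypothetical names, bounds are treated as inclusive, and the value space is approximated by a membership test.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class Datatype:
    """Hypothetical, simplified datatype: membership test, operations, bounds."""
    name: str
    contains: Callable[[Any], bool]
    operations: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    lower_bound: Optional[int] = None
    upper_bound: Optional[int] = None

    @property
    def is_bounded(self) -> bool:
        return self.lower_bound is not None and self.upper_bound is not None


def range_generator(
    base: Datatype,
    name: str,
    lower: Optional[int] = None,
    upper: Optional[int] = None,
) -> Datatype:
    """Derive an extended datatype by bounding the base value space (inclusive)."""
    def contains(value: Any) -> bool:
        if not base.contains(value):          # must already belong to the base type
            return False
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
        return True

    return Datatype(
        name=name,
        contains=contains,
        operations=dict(base.operations),     # all operations are maintained
        lower_bound=base.lower_bound if lower is None else lower,
        upper_bound=base.upper_bound if upper is None else upper,
    )


integer = Datatype(
    "integer",
    lambda v: isinstance(v, int) and not isinstance(v, bool),
    {"Add": lambda a, b: a + b},
)

# Hypothetical extended datatype: integers restricted to 0..255.
small_integer = range_generator(integer, "small integer", lower=0, upper=255)
```

The unbounded `integer` becomes bounded once both limits are supplied, which mirrors the change in datatype properties that a subtype generator can cause.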

## Versions and Download

### Release version 1

- The main ontology file OntoDT.owl
- Extension for representing BioXSD datatypes OntoDT-BioTypes.owl
- Annotation and querying machine learning datasets use case
  - annotations of datasets OntoDTinstances.xls
  - OWL representation of the dataset annotations OntoDTinstances.owl
  - extension of OntoDT and OntoDM-core for annotating datasets clus-instances.owl

## Publications

- Panče Panov, Larisa Soldatova, Sašo Džeroski. Generic Ontology of Datatypes. Information Sciences 329 (2016) 900–920
- Panče Panov. A Modular Ontology of Data Mining. Doctoral Thesis. Jožef Stefan International Postgraduate School. 2012 (Chapter 5 OntoDT: Ontology Module for Datatypes)