Data mining task

The task of data mining is to produce a generalization from given data. In OntoDM-core, we use the term generalization to denote the outcome of a data mining task. A data mining task is defined as a subclass of the IAO class objective specification: it specifies the objective that a data mining algorithm needs to achieve when executed on a dataset in order to produce a generalization as output.

Taxonomy of data mining tasks

The definition of a data mining task depends directly on the data specification, and indirectly on the datatype of the data at hand. This allows us to form a taxonomy of data mining tasks based on the type of data. Dzeroski (2006) proposes four basic classes of data mining tasks based on the generalizations that are produced as output: clustering, pattern discovery, probability distribution estimation, and predictive modeling. These classes of tasks form the first level of the OntoDM-core data mining task taxonomy. They are fundamental and can be defined on an arbitrary type of data. The exception is the predictive modeling task, which is defined on a pair of datatypes (one for the descriptive data and one for the output data). At the next levels, the taxonomy of data mining tasks depends on the datatype of the descriptive data (and, in the case of predictive modeling, also on the datatype of the output data).

Taxonomy of predictive modeling tasks

If we focus only on the predictive modeling task and use the output data specification as a criterion, we distinguish between the primitive output prediction task and the structured output prediction task. In the first case, the output datatype is primitive (e.g., discrete, boolean or real); in the second case, it is some structured datatype (such as a tuple, set, sequence or graph).
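As an illustration only, the distinction above can be sketched as a lookup on the output datatype. The names `Datatype`, `PRIMITIVE` and `predictive_task_kind` are hypothetical and are not part of OntoDM-core; the sketch also assumes that the datatypes listed in the text are the only ones of interest.

```python
from enum import Enum


class Datatype(Enum):
    """Datatypes named in the text (hypothetical encoding, not OntoDM-core)."""
    # Primitive datatypes
    BOOLEAN = "boolean"
    DISCRETE = "discrete"
    REAL = "real"
    # Structured datatypes
    TUPLE = "tuple"
    SET = "set"
    SEQUENCE = "sequence"
    GRAPH = "graph"


# Assumption: every datatype not listed here is treated as structured.
PRIMITIVE = {Datatype.BOOLEAN, Datatype.DISCRETE, Datatype.REAL}


def predictive_task_kind(output_datatype: Datatype) -> str:
    """Name the predictive modeling task class implied by the output datatype."""
    if output_datatype in PRIMITIVE:
        return "primitive output prediction task"
    return "structured output prediction task"
```

For example, `predictive_task_kind(Datatype.SEQUENCE)` falls into the structured output branch, while `predictive_task_kind(Datatype.REAL)` falls into the primitive one.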

Primitive output prediction tasks

Primitive output prediction tasks can be feature-based or structure-based, depending on the datatype of the descriptive part. Feature-based primitive output prediction tasks have a tuple of primitives (a set of primitive features) on the description side and a primitive datatype on the output side. This is the most commonly addressed task in traditional single-table data mining and is described in all major data mining textbooks. If we specify the output datatype in more detail, we obtain the binary classification task, the multi-class classification task and the regression task, where the output datatype is boolean, discrete or real, respectively. Structure-based primitive output prediction tasks operate on data that have some structured datatype (other than a tuple of primitives) on the description side and a primitive datatype on the output side.
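The correspondence between the primitive output datatype and the resulting feature-based task can be written down as a small table. This is a minimal sketch; the dictionary `PRIMITIVE_OUTPUT_TASKS` and the function `feature_based_primitive_task` are hypothetical names, and the datatypes are encoded as plain strings purely for illustration.

```python
# Hypothetical lookup table: primitive output datatype -> task name.
PRIMITIVE_OUTPUT_TASKS = {
    "boolean": "binary classification",
    "discrete": "multi-class classification",
    "real": "regression",
}


def feature_based_primitive_task(output_datatype: str) -> str:
    """Return the feature-based primitive output task for a primitive datatype.

    Raises ValueError for datatypes that the text does not list as primitive.
    """
    try:
        return PRIMITIVE_OUTPUT_TASKS[output_datatype]
    except KeyError:
        raise ValueError(
            f"not a primitive output datatype: {output_datatype!r}"
        ) from None
```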

Structured output prediction tasks

In a similar way, structured output prediction tasks can be feature-based or structure-based. Feature-based structured output prediction tasks operate on data that have a tuple of primitives on the description side and a structured datatype on the output side. Structure-based structured output prediction tasks operate on data that have structured datatypes both on the description side and the output side.
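Taken together, the two criteria (datatype of the description side and datatype of the output side) give four combinations. The sketch below enumerates them; the class `PredictiveTask` and its field names are hypothetical and only mirror the wording of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictiveTask:
    """Hypothetical two-flag encoding of a predictive modeling task."""
    # True if the description side is a tuple of primitives.
    descriptive_is_tuple_of_primitives: bool
    # True if the output side is a primitive datatype.
    output_is_primitive: bool

    def name(self) -> str:
        """Compose the task name from the two criteria."""
        basis = (
            "feature-based"
            if self.descriptive_is_tuple_of_primitives
            else "structure-based"
        )
        output = "primitive" if self.output_is_primitive else "structured"
        return f"{basis} {output} output prediction task"


# List all four combinations of the two criteria.
ALL_TASK_NAMES = [
    PredictiveTask(d, o).name()
    for d in (True, False)
    for o in (True, False)
]
```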

If we focus just on feature-based structured output prediction tasks and further specify the structured output datatype, we can represent a variety of tasks. Examples include multi-target prediction (whose output datatype is a tuple of primitives), multi-label classification (a set of discrete values), time-series prediction (a sequence of real values) and hierarchical classification (a labeled graph with boolean edges and discrete nodes). Multi-target prediction can be further divided into multi-target binary classification, multi-target multi-class classification, and multi-target regression.
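The examples above can be summarized as a lookup from output datatype to task. In this sketch the string keys and all function names are hypothetical. The subdivision of multi-target prediction by the primitive datatype of the tuple components is an assumption made by analogy with the primitive output case; the text lists the three subtasks but does not state the criterion explicitly.

```python
# Hypothetical lookup table: structured output datatype -> task name.
STRUCTURED_OUTPUT_TASKS = {
    "tuple of primitives": "multi-target prediction",
    "set of discrete": "multi-label classification",
    "sequence of real": "time-series prediction",
    "labeled graph with boolean edges and discrete nodes":
        "hierarchical classification",
}

# Assumption: multi-target subtasks follow the datatype of the tuple components.
MULTI_TARGET_SUBTASKS = {
    "boolean": "multi-target binary classification",
    "discrete": "multi-target multi-class classification",
    "real": "multi-target regression",
}


def structured_output_task(output_datatype: str) -> str:
    """Return the feature-based structured output task for an output datatype."""
    try:
        return STRUCTURED_OUTPUT_TASKS[output_datatype]
    except KeyError:
        raise ValueError(
            f"no task listed for output datatype: {output_datatype!r}"
        ) from None


def multi_target_subtask(component_datatype: str) -> str:
    """Return the multi-target subtask for a given component datatype."""
    try:
        return MULTI_TARGET_SUBTASKS[component_datatype]
    except KeyError:
        raise ValueError(
            f"not a primitive component datatype: {component_datatype!r}"
        ) from None
```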

