## Constraints and constraint-based data mining tasks and algorithms

Constraints play a central role in data mining and constraint-based data mining (CBDM) is now growing in importance. A general statement of the problem involves the specification of a language of generalization and a set of constraints that a generalization needs to satisfy. In CBDM, constraints are propositions or statements about generalizations. They can be classified along three dimensions:

- primitive and composite constraints;
- language and evaluation constraints; and
- hard (Boolean) constraints, soft constraints and optimization constraints.

### Taxonomy of constraints

A *constraint specification* is defined in OntoDM-core as a sub-class of OBI *data representational model* and is the top-level class of a taxonomy of constraints that we propose. At the first level of the taxonomy, we have the *primitive* and *complex constraints*. Primitive constraints are based on atomic and complex constraints on non-atomic propositions. *Complex constraints* have as parts *primitive constraints* and a *combination function specification* that defines how the primitive constraints are combined to form a complex constraint.

At the second level, if we focus on the primitive constraints, we have *primitive language constraints* and *primitive evaluation constraints*. *Language constraints* concern the representation of a generalization and only refer to its form. Commonly used types of language constraints are *subsumption constraints* (e.g., all itemsets must contain the item 'bread`) and language cost constraints (e.g., itemsets should contain at most three items). *Evaluation constraints* concern the semantics of a generalization when applied to a dataset. They usually include evaluation functions, where the evaluation functions measure the validity of a generalization on a given dataset (e.g., classification accuracy).

At the last level the *primitive language cost-function constraint* is extended with three sub-classes that include: *primitive hard language cost-function constraint*, *primitive soft language cost-function constraint*, and *primitive optimization language cost-function constraint*. *Hard constraints* represent boolean functions on generalizations and the constraint can be either satisfied or not satisfied. Soft constraints do not dismiss a generalization that violates a constraint, but rather penalize it for violating a constraint. *Optimization constraints* ask for a fixed-size set of generalizations that have some extreme values for a given cost or evaluation function. In a similar way, we define the sub-classes of the *primitive evaluation constraint* class.

### Constraint-based data mining task

The task of CBDM is to find a set of generalizations that satisfy a set of constraints, given a dataset that consists of examples of a specific datatype, a data mining task, a generalization specification and a specifications of the set of constraints. In the OntoDM-core ontology, we represent a CBDM task as a sub-class of the *objective specification* class (reused from IAO). It has as parts a *data mining task* and a set of c*onstraint specifications*. We further define a *CBDM algorithm* as an algorithm that solves a CBDM task . Finally, this structure allows us to form a taxonomy of CBDM tasks, where at the first level of the taxonomy the basic CBDM task classes that are aligned with the fundamental data mining tasks, and then at the next levels depend on the data specification and the type of constraints.