
CEUR Workshop Proc. Author manuscript; available in PMC 2016 June 3.

Published in final edited form as: CEUR Workshop Proc. 2015 May; 1378: http://ceur-ws.org/Vol-1378/AMW_2015_paper_1.pdf. Published online 2015 June 11.

PMCID: PMC4891810; NIHMSID: NIHMS771655

In the design of machine-learning solutions, a critical and often the most resource-intensive task is that of feature engineering [7, 4], for which recipes and tooling have been developed [3, 7]. In this vision paper we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification, in the context of a relational database, towards investigating the application of database and knowledge management to assist with the task of feature engineering. We demonstrate the usefulness of this framework by formally defining two key algorithmic challenges within it: (1) *separability* is the task of determining the existence of feature queries that agree with the given training examples, and (2) *identifiability* is the task of testing for the property of independence among features (given as queries). Moreover, we give preliminary results on these challenges in the context of conjunctive queries. We focus here on Boolean features that are represented as ordinary database queries, and view this work as the basis of various future extensions such as numerical features and more general regression tasks.

We first present our formal framework for classification with binary features within a relational database.

In this work, a *classifier* is a function of the form

$$\gamma :{\{-1,1\}}^{n}\to \{-1,1\}$$

where *n* is a natural number that we call the *arity* of *γ*. A *classifier class* is a (possibly infinite) family *Γ* of classifiers. We denote by *Γ _{n}* the restriction of *Γ* to its *n*-ary classifiers. A *training collection* of arity *n* is a finite collection *T* of pairs ⟨**x**, *y*⟩, where **x** ∈ {−1, 1}^{*n*} and *y* ∈ {−1, 1}; we denote by **T** _{n} the set of all training collections of arity *n*. A *cost function* is a function of the form

$$c:({\cup}_{n}({\Gamma}_{n}\times {\mathbf{T}}_{n}))\to {\mathbb{R}}_{\ge 0}$$

where ℝ_{≥0} is the set of nonnegative real numbers. In the context of a classifier class *Γ* and a cost function *c*, *learning* a classifier is the task of finding, given a training collection *T* ∈ **T** _{n}, a classifier *γ* ∈ *Γ _{n}* that minimizes *c*(*γ*, *T*).

We illustrate the above definitions on the important class of *linear classifiers*. An *n*-ary linear classifier is parameterized by a vector **w** ∈ ℝ^{*n*}, is denoted by *λ*_{**w**}, and is defined by

$${\mathrm{\lambda}}_{\mathbf{w}}(\mathbf{a})\stackrel{\mathrm{def}}{=}\{\begin{array}{ll}1& \text{if}\phantom{\rule{0.2em}{0ex}}\mathbf{a}\cdot \mathbf{w}\ge 0;\\ -1& \text{otherwise}.\end{array}$$

where “·” denotes the dot product. By Lin we denote the class of linear classifiers. An example of a cost function is the *least-square* cost *lsq*, given by

$$\mathit{lsq}({\mathrm{\lambda}}_{\mathbf{w}},T)\stackrel{\mathrm{def}}{=}{\displaystyle \mathit{\sum}_{\langle \mathbf{x},y\rangle \in T}{(\mathbf{x}\cdot \mathbf{w}-y)}^{2}}$$

for the arguments *λ*_{**w**} ∈ Lin and *T* ∈ **T** _{n}.
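To make these definitions concrete, here is a minimal Python sketch, on a hypothetical toy training collection, of an *n*-ary linear classifier *λ*_{**w**} and the least-square cost *lsq* (the encoding of vectors as tuples is our own illustration, not part of the formal framework).

```python
def linear_classifier(w):
    # lambda_w(a) = 1 if a . w >= 0, and -1 otherwise
    def classify(a):
        return 1 if sum(ai * wi for ai, wi in zip(a, w)) >= 0 else -1
    return classify

def lsq(w, T):
    # Least-square cost: sum over <x, y> in T of (x . w - y)^2
    return sum((sum(xi * wi for xi, wi in zip(x, w)) - y) ** 2
               for x, y in T)

# Hypothetical training collection of arity 2: pairs <x, y>
T = [((1, 1), 1), ((1, -1), 1), ((-1, 1), -1), ((-1, -1), -1)]
clf = linear_classifier((1.0, 0.0))  # classifies by the sign of the first feature

print(all(clf(x) == y for x, y in T))  # True: lambda_w agrees with every label
print(lsq((1.0, 0.0), T))              # 0.0: the cost of a perfect fit
```

Here the vector **w** = (1, 0) attains zero cost, so it also minimizes *lsq* over this collection.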

Our relational terminology is as follows. A *schema* is a pair
$(\mathcal{A},\Sigma)$, where
$\mathcal{A}$ is a *signature* that consists of *relation symbols*, and *Σ* is a set of logical integrity constraints over
$\mathcal{A}$. Each relation symbol *R* has an associated arity. We assume an infinite set Const of *constants*. An *instance I* over a schema
$\mathbf{S}=(\mathcal{A},\Sigma)$ associates with every *k*-ary relation symbol
$R\in \mathcal{A}$ a finite subset *R ^{I}* of Const^{*k*}.

Let **S** be a schema. A *query* (*over*
**S**) is a function *Q* that is associated with an arity *k*, and that maps every instance *I* over **S** into a finite subset *Q*(*I*) of Const^{*k*}. A query *Q* is *additive* if for all instances *I*_{1} and *I*_{2} we have

$$Q({I}_{1}\cup {I}_{2})=Q({I}_{1})\cup Q({I}_{2}).$$

A *query class* is a mapping that associates with every schema **S** a class of queries over **S**. An example of a query class is that of the *conjunctive queries*. A conjunctive query (CQ) is represented by the logical formula *q*(**x**) that has the form

$$\exists \mathbf{y}[{\varphi}_{1}(\mathbf{x},\mathbf{y},\mathbf{d})\wedge \cdots \wedge {\varphi}_{m}(\mathbf{x},\mathbf{y},\mathbf{d})]$$

where **x** and **y** are disjoint sequences of variables, **d** is a sequence of constants, and each *ϕ _{i}* is an atomic query (i.e., a relational atom) over the signature of the underlying schema.

We now present our formal framework. An *entity schema* is a triple
$(\mathcal{A},\Sigma,E)$, where
$(\mathcal{A},\Sigma)$ is a schema and *E* is a relation symbol in
$\mathcal{A}$ that represents the *entities* (as tuples). An *instance I* over an entity schema
$(\mathcal{A},\Sigma,E)$ is simply an instance over
$(\mathcal{A},\Sigma)$. Intuitively, an instance *I* over an entity schema
$(\mathcal{A},\Sigma,E)$ represents a set of entities, namely the set of tuples in *E ^{I}*.

Let **S** be an entity schema, and let *I* be an instance over **S**. A *feature query* (*over*
**S**) is a query *π* over the schema **S** such that *π* ⊆ *E*_{**S**}, where *E*_{**S**} is viewed as the query that, given an instance *I*, returns the relation
${E}_{\mathbf{S}}^{I}$. In other words, a feature query is a query that selects entities. For example, if *E* is Persons(*ssn*, *name*), then a feature query can be the following CQ *q*(*s*, *n*), selecting the persons working in New York City:

$$\exists c[\text{Persons}(s,n)\wedge \text{PersonCompany}(s,c)\wedge \text{CompanyCity}(c,\text{'NYC'})]$$

If *π* is a feature query, then *π ^{I}* denotes the function
$f:{E}_{\mathbf{S}}^{I}\to \{-1,1\}$ where

$$f(e)=\{\begin{array}{ll}1& \text{if}\phantom{\rule{0.2em}{0ex}}e\in \pi (I);\\ -1& \text{otherwise}.\end{array}$$
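As a concrete illustration, the following Python sketch evaluates the NYC feature query above on a hypothetical toy instance (all relation contents are invented for the example) and derives the ±1-valued function *f*.

```python
# Hypothetical instance I over the example schema
persons = {("123", "Alice"), ("456", "Bob")}
person_company = {("123", "Acme"), ("456", "Initech")}
company_city = {("Acme", "NYC"), ("Initech", "Boston")}

def works_in_nyc(instance_persons, pc, cc):
    # CQ: exists c [Persons(s,n) & PersonCompany(s,c) & CompanyCity(c,'NYC')]
    return {(s, n) for (s, n) in instance_persons
            for (s2, c) in pc if s2 == s
            for (c2, city) in cc if c2 == c and city == "NYC"}

selected = works_in_nyc(persons, person_company, company_city)

def f(e):
    # pi^I maps each entity to +1 if selected by the feature query, else -1
    return 1 if e in selected else -1

print(f(("123", "Alice")))  # 1: Alice works at a company located in NYC
print(f(("456", "Bob")))    # -1: Bob does not
```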

A *statistic* (over **S**) is a sequence *Π* = (*π*_{1}, …, *π _{n}*) of feature queries. We denote by *Π ^{I}* the function that maps every entity
$e\in {E}_{\mathbf{S}}^{I}$ to the vector *Π ^{I}*(*e*) = (*π*_{1}^{*I*}(*e*), …, *π _{n}^{I}*(*e*)) ∈ {−1, 1}^{*n*}.

A *feature schema* is a pair (**S**, *Π*), where **S** is an entity schema and *Π* is a statistic over **S** that produces a vector of features for every entity of a given input instance. We say that *Π* and (**S**, *Π*) are in a query class **Q** if every query in *Π* belongs to **Q**. A *training instance* over **S** is a pair (*I*, *o*), where *I* is an instance over **S** and *o*:
${E}_{\mathbf{S}}^{I}\to \{-1,1\}$ is a function that partitions the entities into positive and negative examples. Given a feature schema (**S**, *Π*) and a classifier class *Γ*, the training instance (*I*, *o*) defines the training collection that consists of the pairs ⟨*Π ^{I}*(*e*), *o*(*e*)⟩ for all entities $e\in {E}_{\mathbf{S}}^{I}$.
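Continuing the sketch, the training collection induced by a training instance and a statistic of two feature queries can be assembled as follows (the entities, selections, and labels here are hypothetical).

```python
# Hypothetical two-feature statistic over a toy entity relation
entities = ["e1", "e2", "e3"]
pi1 = {"e1", "e2"}               # entities selected by feature query pi_1
pi2 = {"e2"}                     # entities selected by feature query pi_2
o = {"e1": 1, "e2": 1, "e3": -1}  # training labels from o

def feat(selected, e):
    # The +-1 feature value of entity e for one feature query
    return 1 if e in selected else -1

# Training collection: pairs <Pi^I(e), o(e)> for every entity e
T = [((feat(pi1, e), feat(pi2, e)), o[e]) for e in entities]
print(T)  # [((1, -1), 1), ((1, 1), 1), ((-1, -1), -1)]
```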

In *feature engineering*, one devises feature queries for a classification task. Next, we discuss two computational challenges that naturally arise in feature engineering.

Let (**S**, *Π*) be a feature schema, and let *Γ* be a classifier class. A training instance (*I*, *o*) is said to be *Γ-separable* with respect to (w.r.t.) *Π* if there exists a classifier *γ* ∈ *Γ* that fully agrees with *o*; that is, *γ* and *Π* have the same arity, and *γ*(*Π ^{I}*(*e*)) = *o*(*e*) for every
$e\in {E}_{\mathbf{S}}^{I}$. We define the following core computational problem.

Let **S** be an entity schema, let **Q** be a query class over **S**, and let *Γ* be a classifier class. The *separability problem* is the following. Given a training instance (*I*, *o*) over **S**, determine whether there exists a statistic *Π* in **Q** such that (*I*, *o*) is *Γ*-separable w.r.t. *Π*.

The separability problem, as defined, can be extended in various practical directions. The input can include a bound *N* on the length *n* of the statistic *Π* (hence limiting the *model complexity*, which results in classifiers that are more efficient and less prone to overfitting). One can also allow for an *approximate* agreement with *o* (e.g., requiring the classifier to agree with *o* on at least a (1 − *ε*) fraction of the entities or, equivalently, to disagree with *o* on at most an *ε* fraction).
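For a *fixed* statistic *Π* and *Γ* = Lin, testing *Γ*-separability of (*I*, *o*) w.r.t. *Π* amounts to testing linear separability of the labeled feature vectors. A minimal sketch using the classical perceptron rule (the data and the epoch budget are our own illustration; the rule converges whenever a separating **w** exists, so a `False` answer from this naive version only means none was found within the budget):

```python
def perceptron_separates(T, max_epochs=100):
    # Searches for w with sign(x . w) agreeing with y for all <x, y> in T.
    n = len(T[0][0])
    w = [0.0] * n
    for _ in range(max_epochs):
        ok = True
        for x, y in T:
            dot = sum(xi * wi for xi, wi in zip(x, w))
            pred = 1 if dot >= 0 else -1
            if pred != y:
                ok = False
                # Perceptron update: move w toward the misclassified example
                w = [wi + y * xi for wi, xi in zip(w, x)]
        if ok:
            return True, w
    return False, w

# Hypothetical feature vectors labeled by the sign of the first coordinate
T = [((1, 1), 1), ((1, -1), 1), ((-1, 1), -1), ((-1, -1), -1)]
sep, w = perceptron_separates(T)
print(sep)  # True: a separating w was found
```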

**Theorem 1.** Let **Q** be the class CQ^{nc} of CQs without constants,^{3} and let *Γ* be the class Lin. For every entity schema **S**, separability is in NP. Moreover, there exists an entity schema **S** for which separability is NP-complete.

The proof of membership in NP uses the concept of a *canonical database* [1], and the proof of NP-hardness is by a reduction from the maximum-clique problem.

We note that a problem similar to separability has been studied in a recent paper by Cohen and Weiss [2], where data are labeled graphs and features are tree patterns.

We denote by **0**^{*m*} the vector of *m* zeros.

Let **Q** be a query class. *Identifiability* is the problem of testing, given a feature schema (**S**, *Π*) in **Q** with *Π* = (*π*_{1}, …, *π _{n}*), whether there exists an instance *I* over **S** on which the feature queries are linearly independent; that is, viewing each *π _{i}^{I}* as a vector in {−1, 1}^{*m*} for *m* = |*E*_{**S**}^{*I*}|, no nontrivial linear combination of these vectors equals **0**^{*m*}.

Identifiability is an important property in the design of machine-learning solutions [5]. Particularly, in the case of the classifier class Lin and the cost function *lsq*, this property implies that there is a single optimal classifier, whereas its absence implies that the space of optimal solutions is unbounded.
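The uniqueness claim can be checked on a concrete instance: for Lin with *lsq*, the optimum is unique exactly when the matrix whose rows are the feature vectors *Π ^{I}*(*e*) has full column rank. A stdlib-only Gaussian-elimination sketch (the feature vectors are hypothetical; columns 1 and 2 coincide, reflecting a redundant statistic, so the rank falls short of *n* = 3):

```python
def column_rank(rows, tol=1e-9):
    # Rank of the matrix whose rows are the given feature vectors,
    # computed by Gaussian elimination (no pivot normalization needed).
    m = [list(map(float, r)) for r in rows]
    cols = len(m[0]) if m else 0
    rank = row = 0
    for col in range(cols):
        pivot = next((r for r in range(row, len(m)) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue  # no pivot in this column; rank does not grow
        m[row], m[pivot] = m[pivot], m[row]
        for r in range(len(m)):
            if r != row and abs(m[r][col]) > tol:
                factor = m[r][col] / m[row][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[row])]
        row += 1
        rank += 1
    return rank

# Hypothetical feature matrix: columns 1 and 2 are identical (rank 2 < 3),
# so lsq has infinitely many optima on this instance
rows = [(1, 1, -1), (-1, -1, 1), (1, 1, 1)]
print(column_rank(rows))  # 2
```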

Next, we show that in the case of CQs, identifiability amounts to query equivalence. A statistic *Π* is said to have *redundancy* if it contains two distinct feature queries that are equivalent.

**Theorem 2.** Let **Q** be the query class of additive CQs, and let (**S**, *Π*) be a feature schema such that *Π* is in **Q**. Then *Π* is identifiable if and only if *Π* has no redundancy.

We conclude with comments on Theorem 2. First, this theorem is again proved by applying the concept of a canonical database of a CQ. Second, we can extend the theorem to the class of all CQs, but the condition that characterizes identifiability is significantly more complicated (and will be given in the extended version of this paper). Third, the theorem generalizes to *affine independence*, which is important for other cost functions such as *maximum entropy* [6]. Finally, by an immediate application of the NP-completeness of CQ containment [1], we get that identifiability is NP-complete in the case of additive CQs.
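The redundancy test behind Theorem 2 reduces to CQ equivalence, i.e., containment in both directions, which by the canonical-database (homomorphism) characterization of Chandra and Merlin [1] can be decided by a brute-force search for a homomorphism. A minimal sketch for constant-free CQs (the query encoding as head variables plus a set of atoms is our own illustration):

```python
from itertools import product

def contained_in(q1, q2):
    # q_i = (head_vars, atoms); atoms are (relation, (terms...)) with
    # variables as strings. Chandra-Merlin: q1 is contained in q2 iff
    # there is a homomorphism h from q2's atoms to q1's atoms that is
    # the identity on the head variables (equivalently, q2 returns the
    # head tuple on the canonical database of q1).
    head1, atoms1 = q1
    head2, atoms2 = q2
    vars2 = sorted({t for _, terms in atoms2 for t in terms})
    terms1 = sorted({t for _, terms in atoms1 for t in terms})
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if any(h.get(v) != u for v, u in zip(head2, head1)):
            continue  # h must fix the distinguished (head) variables
        if all((rel, tuple(h[t] for t in terms)) in atoms1
               for rel, terms in atoms2):
            return True
    return False

# q1(x): R(x,y), R(y,z)   and   q2(x): R(x,y)
q1 = (("x",), {("R", ("x", "y")), ("R", ("y", "z"))})
q2 = (("x",), {("R", ("x", "y"))})
print(contained_in(q1, q2))  # True:  q1 is contained in q2
print(contained_in(q2, q1))  # False: the converse fails
```

Two feature queries are equivalent, and hence make a statistic redundant, precisely when `contained_in` holds in both directions.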

Benny Kimelfeld is a Taub Fellow, supported by the Taub Foundation. The research of Christopher Ré is supported by DARPA’s projects XDATA (FA8750-12-2-0335), DEFT (FA8750-13-2-0039), MEMEX, and SIMPLEX. His research is also supported by NSF Career Award IIS-1353606, ONR Awards N000141210041 and N000141310129, NIH Grant U54EB020405 (awarded by NIBIB through funds provided by the trans-NIH BD2K initiative), the Sloan Research Fellowship, the Moore Foundation, American Family Insurance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, ONR, NIH, or the U.S. government.

^{3}For CQs with constants, the problem is trivial and not interesting, since the positive examples can be hardcoded into the statistic.

1. Chandra AK, Merlin PM. Optimal implementation of conjunctive queries in relational data bases. In: Hopcroft JE, Friedman EP, Harrison MA, editors. STOC. ACM; 1977. pp. 77–90.

2. Cohen S, Weiss YY. Learning tree patterns from example graphs. In: Arenas M, Ugarte M, editors. 18th International Conference on Database Theory (ICDT 2015), volume 31 of Leibniz International Proceedings in Informatics (LIPIcs). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik; Dagstuhl, Germany: 2015. pp. 127–143.

3. Guyon I, Gunn S, Nikravesh M, Zadeh LA. Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing) Springer-Verlag New York, Inc.; Secaucus, NJ, USA: 2006.

4. Kandel S, Paepcke A, Hellerstein JM, Heer J. Enterprise data analysis and visualization: An interview study. IEEE Trans Vis Comput Graph. 2012;18(12):2917–2926.

5. Lehmann EL, Casella G. Theory of point estimation. Vol. 31. Springer; 1998.

6. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 2008;1(1–2):1–305.

7. Zhang C, Kumar A, Ré C. Materialization optimizations for feature selection workloads. SIGMOD Conference. 2014:265–276.
