This is the fourth article in the series (see also the first, second, and third ones) describing a Machine Learning system based on Lattice Theory named the 'VKF-system'. The program uses Markov chain algorithms to generate causes of the target property by computing a random subset of similarities between some subsets of training objects. This article describes bitset representations of objects that allow computing these similarities as bit-wise multiplications of the corresponding encodings. Objects with discrete attributes require a technique from Formal Concept Analysis. The case of objects with continuous attributes calls for logistic regression, entropy-based separation of their ranges into subintervals, and a representation corresponding to the convex envelope of the subintervals whose similarity is computed.
1 Discrete attributes
To encode an object with only discrete attributes we need to compute auxiliary bitset representations of the values of each attribute. We assume that an expert is able to relate these values with respect to some 'general/special' partial order. The ordering must form a lower semilattice after adding the special value 'null' (with shorthand '_' in some cases) to denote a trivial (absent) similarity between values of the given attribute of compared objects.
The representation of the whole object is a concatenation of the encodings of the values of its attributes in some fixed order. Then bit-wise multiplication over long bitset strings reduces to multiplications over the values of each attribute. Hence the encoding must turn similarity between values into bit-wise multiplication.
Since any lower semilattice easily converts into a lattice (by adding a top element, if it is absent), Formal Concept Analysis (FCA) provides all essential tools.
The modern formulation of the Fundamental Theorem of FCA asserts that

For every finite lattice $\langle L, \wedge, \vee\rangle$ let $M \subseteq L$ be a (super)set of all $\wedge$-irreducible elements and $G \subseteq L$ be a (super)set of all $\vee$-irreducible elements. For $I = \{\langle g, m\rangle \in G \times M : g \leq m\}$ the sample $\langle G, M, I\rangle$ generates the lattice of all candidates that is isomorphic to the original lattice $\langle L, \wedge, \vee\rangle$.
An element $x$ of lattice $\langle L, \wedge, \vee\rangle$ is called $\wedge$-irreducible, if $x \neq \top$ and for all $y, z \in L$ the relations $y > x$ and $z > x$ imply $y \wedge z > x$.
An element $x$ of lattice $\langle L, \wedge, \vee\rangle$ is called $\vee$-irreducible, if $x \neq \bot$ and for all $y, z \in L$ the relations $y < x$ and $z < x$ imply $y \vee z < x$.
The lattice below contains red vertices as $\wedge$-irreducible elements and blue vertices as $\vee$-irreducible ones.
The Fundamental Theorem (initially proved by Prof. Rudolf Wille through the sample $\langle L, L, \leq\rangle$) implies the minimal sample of the form
| G\M | h | i | j | k |
|-----|---|---|---|---|
| a | 1 | 1 | 1 | 0 |
| b | 0 | 1 | 1 | 1 |
| c | 1 | 1 | 0 | 0 |
| d | 1 | 0 | 1 | 0 |
| f | 0 | 1 | 0 | 1 |
| g | 0 | 0 | 1 | 1 |
to generate the lattice of all candidates that is isomorphic to the original lattice.
Note that Wille's sample uses 121 bits, whereas the new one needs only 24 bits!
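For illustration, the rows of this sample can be stored as plain Python integers (one bit per column h, i, j, k) and compared by bit-wise multiplication; the snippet below only demonstrates the idea and is not code from the 'vkf' library.

```python
# Rows of the reduced sample stored as 4-bit integers (columns h, i, j, k).
codes = {
    "a": 0b1110,
    "b": 0b0111,
    "c": 0b1100,
    "d": 0b1010,
    "f": 0b0101,
    "g": 0b0011,
}

# Bit-wise multiplication of encodings yields the similarity of two objects:
similarity = codes["a"] & codes["b"]            # 0b1110 & 0b0111 == 0b0110
print(f"similarity(a, b) = {similarity:04b}")   # only columns i and j remain
```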
The author proposed the following algorithm to encode values by bitsets (a Python sketch follows the list):
- Topologically sort the elements of the semilattice.
- In the matrix of the order relation look for columns that coincide with the bit-wise multiplication of previous ones (every such column corresponds to a $\wedge$-reducible element).
- Remove all found ($\wedge$-reducible) columns.
- The rows of the remaining matrix form the codes of the corresponding values.
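The following Python sketch imitates the algorithm above under the assumption that the order matrix lists greater elements first; the function name and the matrix orientation are illustrative and do not reproduce the actual vkfencoder.XMLImport code.

```python
def encode_values(order):
    """Bitset encoding of semilattice values (illustrative sketch).

    `order` is an n x n 0/1 matrix over topologically sorted values,
    greater elements listed first: order[x][m] == 1 iff value x is less
    than or equal to value m.  A column that coincides with the bit-wise
    multiplication of previous columns corresponds to a reducible element
    and is dropped; rows of the remaining matrix are the codes.
    """
    n = len(order)
    kept = []                                    # irreducible columns
    for m in range(n):
        column = [order[x][m] for x in range(n)]
        # bit-wise product of all previous columns dominating this one
        product = [1] * n
        for j in range(m):
            previous = [order[x][j] for x in range(n)]
            if all(p >= c for p, c in zip(previous, column)):
                product = [p & q for p, q in zip(product, previous)]
        if product != column:                    # not a product of previous columns
            kept.append(m)
    return [[order[x][m] for m in kept] for x in range(n)]
```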
This algorithm is a part of both the 'vkfencoder' (as the vkfencoder.XMLImport class constructor) and the 'vkf' (as the vkf.FCA class constructor) CPython libraries. The difference is in the sources: vkf.FCA reads a MariaDB database table, whereas vkfencoder.XMLImport reads an XML file.
2 Continuous attributes
We discuss the steps of encoding continuous attributes in the order of their invention. First, we apply an idea from the C4.5 decision tree learning system to separate a variable's domain into subintervals through entropy considerations.
After that we encode the occurrence of a value in some subinterval by a bitset in such a way that bit-wise multiplication corresponds to the convex envelope of the compared subintervals.
Finally, we consider how to combine several attributes to obtain their disjunctions and implications. The key is to compute a logistic regression between them.
2.1 Entropy approach
When we have a continuous attribute, its range must be separated into several subintervals of possibly different influence on the target property. To choose correct thresholds we relate this attribute to the target property through entropy.
Let $E = E^{+} \cup E^{-}$ be a disjoint union of training examples $E^{+}$ and counter-examples $E^{-}$. An interval $[a, b)$ of values of continuous attribute $V$ generates three subsets $E[a,b) = \{e \in E : a \leq V(e) < b\}$, $E^{+}[a,b) = E[a,b) \cap E^{+}$, and $E^{-}[a,b) = E[a,b) \cap E^{-}$.

The entropy of interval $[a, b)$ of values of continuous attribute $V$ is

$$\mathrm{ent}[a,b) = -\frac{\lvert E^{+}[a,b)\rvert}{\lvert E[a,b)\rvert}\cdot\log_{2}\frac{\lvert E^{+}[a,b)\rvert}{\lvert E[a,b)\rvert} - \frac{\lvert E^{-}[a,b)\rvert}{\lvert E[a,b)\rvert}\cdot\log_{2}\frac{\lvert E^{-}[a,b)\rvert}{\lvert E[a,b)\rvert}.$$

The mean information for a partition $a < r < b$ of interval $[a, b)$ of values of continuous attribute $V$ is

$$\mathrm{inf}[a, r, b) = \frac{\lvert E[a,r)\rvert}{\lvert E[a,b)\rvert}\cdot\mathrm{ent}[a,r) + \frac{\lvert E[r,b)\rvert}{\lvert E[a,b)\rvert}\cdot\mathrm{ent}[r,b).$$

A threshold is a value $r$ with minimal mean information $\mathrm{inf}[a, r, b)$.

For continuous attribute $V$ denote $\min\{V(e) : e \in E\}$ by $v_{0}$ and let $v_{l+1}$ be an arbitrary number greater than $\max\{V(e) : e \in E\}$ (here $l$ denotes the number of thresholds computed below).
Thresholds $v_{1} < \ldots < v_{l}$ are computed sequentially by splitting the subinterval with maximal entropy, starting from the whole interval $[v_{0}, v_{l+1})$.
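A pure-Python sketch of these computations is given below; the helper names (`entropy`, `best_threshold`) are hypothetical, and the candidate thresholds are taken midway between consecutive distinct values, which is one common convention.

```python
from math import log2

def entropy(pos, neg):
    """Entropy of an interval with `pos` positive and `neg` negative examples."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p, q = pos / total, neg / total
    return -p * log2(p) - q * log2(q)

def best_threshold(values, labels):
    """Return the threshold r with minimal mean information.

    `values` are attribute values V(e); `labels` are 1 for examples
    and 0 for counter-examples.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_r, best_info = None, float("inf")
    for i in range(1, n):                       # candidate split positions
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        r = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [k for _, k in pairs[:i]]
        right = [k for _, k in pairs[i:]]
        info = (len(left) / n) * entropy(sum(left), len(left) - sum(left))
        info += (len(right) / n) * entropy(sum(right), len(right) - sum(right))
        if info < best_info:
            best_r, best_info = r, info
    return best_r
```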
2.2 Bitset encodings for convex envelope
We represent a value of continuous attribute $V$ by a bitset string of length $2l$, where $l$ is the number of thresholds. The bitset may be considered as a string of indicator (Boolean) variables $\sigma_{1}\ldots\sigma_{l}\tau_{1}\ldots\tau_{l}$, where $\sigma_{i} = 1 \Leftrightarrow v_{i} \leq V(e)$ and $\tau_{i} = 1 \Leftrightarrow V(e) < v_{i}$.

Then the string $\sigma_{1}\ldots\sigma_{l}\tau_{1}\ldots\tau_{l}$ is the bitset representation of continuous attribute $V$ on element $e$.
The next Lemma asserts that the result of the bit-wise multiplication of bitset representations is the convex envelope of its arguments' intervals.

Let $\sigma_{1}\ldots\sigma_{l}\tau_{1}\ldots\tau_{l}$ represent the interval $[a, b)$ and $\sigma'_{1}\ldots\sigma'_{l}\tau'_{1}\ldots\tau'_{l}$ represent the interval $[c, d)$. Then

$$(\sigma_{1}\cdot\sigma'_{1})\ldots(\sigma_{l}\cdot\sigma'_{l})(\tau_{1}\cdot\tau'_{1})\ldots(\tau_{l}\cdot\tau'_{l})$$

corresponds to the convex envelope $[\min\{a,c\}, \max\{b,d\})$.

Note that the trivial similarity corresponds to the trivial condition $v_{0} \leq V(e) < v_{l+1}$.
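The sketch below illustrates this encoding and the Lemma for made-up thresholds; the helper names are illustrative and are not part of the 'vkf' library.

```python
def encode_value(value, thresholds):
    """Bitset for a value of a continuous attribute with given thresholds.

    First half:  sigma_i = 1  iff  thresholds[i] <= value
    Second half: tau_i   = 1  iff  value < thresholds[i]
    """
    sigma = [int(t <= value) for t in thresholds]
    tau = [int(value < t) for t in thresholds]
    return sigma + tau

def multiply(bits1, bits2):
    """Bit-wise multiplication of two encodings."""
    return [b1 & b2 for b1, b2 in zip(bits1, bits2)]

thresholds = [10.0, 11.5, 13.0]                 # assumed thresholds, for illustration
a = encode_value(10.4, thresholds)              # falls into [10.0, 11.5)
b = encode_value(12.7, thresholds)              # falls into [11.5, 13.0)
print(multiply(a, b))                           # describes the convex envelope [10.0, 13.0)
```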
2.3 Relations between continuous attributes
The FCA-based approach to Machine Learning naturally considers a conjunction of several binary attributes as a possible cause of the target property. In the case of a discrete attribute an expert has the opportunity to express a disjunction of values through additional values (see the lattice structures in paragraph 1). The case of continuous attributes is different, so we need some technique to cover it too.
The key was the following Lemma:

The disjunction $p_{1} \vee \ldots \vee p_{n}$ of propositional variables $p_{1}, \ldots, p_{n}$ is equivalent to the satisfaction of the inequality $p_{1} + \ldots + p_{n} > \varepsilon$ for any $0 < \varepsilon < 1$.
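A brute-force check of this Lemma for three variables and one admissible threshold (the choice ε = 0.5 below is arbitrary):

```python
from itertools import product

epsilon = 0.5                       # any value strictly between 0 and 1 works
for p in product((0, 1), repeat=3):
    disjunction = any(p)            # p1 or p2 or p3
    inequality = sum(p) > epsilon   # p1 + p2 + p3 > epsilon
    assert disjunction == inequality
print("disjunction <=> inequality holds for all 8 assignments")
```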
Since we restrict ourselves to two target classes, we look for a classifier.

A classifier is a map $c: \mathbb{R}^{d} \to \{0, 1\}$, where $\mathbb{R}^{d}$ is a domain of objects to classify (described by $d$ continuous attributes) and $\{0, 1\}$ are the target labels.
As usual, we assume the existence of some probability distribution of $\langle X, K\rangle \in \mathbb{R}^{d} \times \{0, 1\}$ which can be decomposed as

$$p(X, K) = p(X)\cdot P(K \mid X),$$

where $p(X)$ is a marginal distribution of objects and $P(K \mid X)$ is a conditional distribution of labels on a given object, i.e. for every $x \in \mathbb{R}^{d}$ the decomposition

$$p(x, k) = p(x)\cdot P(K = k \mid X = x)$$

holds.
The error probability of classifier $c$ is

$$R(c) = \mathbf{P}\{c(X) \neq K\}.$$
The Bayes classifier with respect to $P(K = 1 \mid X = x)$ corresponds to

$$b(x) = \begin{cases}1, & P(K = 1 \mid X = x) > \tfrac{1}{2},\\ 0, & \text{otherwise.}\end{cases}$$
We remind the well-known Theorem on the optimality of the Bayes classifier:

The Bayes classifier has the minimal error probability:

$$R(b) = \mathbf{P}\{b(X) \neq K\} = \min_{c}\mathbf{P}\{c(X) \neq K\}.$$
Bayes' Theorem implies

$$P(K = 1 \mid X = x) = \frac{p(x \mid K = 1)\cdot P(K = 1)}{p(x \mid K = 1)\cdot P(K = 1) + p(x \mid K = 0)\cdot P(K = 0)} = \sigma(a(x)),$$

where $a(x) = \log\dfrac{p(x \mid K = 1)\cdot P(K = 1)}{p(x \mid K = 0)\cdot P(K = 0)}$ and $\sigma(y) = \dfrac{1}{1 + e^{-y}}$ is the well-known logistic function.
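A numerical illustration of this identity with made-up Gaussian class-conditional densities and priors (all concrete numbers below are assumptions for the example only):

```python
from math import exp, log, pi, sqrt

def gauss(x, mu, sigma):
    """Density of a normal distribution (assumed class-conditional density)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def logistic(y):
    return 1.0 / (1.0 + exp(-y))

prior1, prior0 = 0.3, 0.7                      # assumed prior probabilities
x = 1.2                                        # an arbitrary test point
p1, p0 = gauss(x, 2.0, 1.0), gauss(x, 0.0, 1.0)

posterior = p1 * prior1 / (p1 * prior1 + p0 * prior0)    # Bayes theorem
a = log((p1 * prior1) / (p0 * prior0))                   # log-odds a(x)
print(posterior, logistic(a))                            # coincide up to rounding
```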
2.4 Logistic regression between attributes
Let us approximate the unknown $a(x)$ by a linear combination $\vec{w}^{T}\vec{\varphi}(x) = w_{1}\varphi_{1}(x) + \ldots + w_{m}\varphi_{m}(x)$ of basis functions $\varphi_{1}, \ldots, \varphi_{m}$ ($\varphi_{j}: \mathbb{R}^{d} \to \mathbb{R}$) with respect to unknown weights $\vec{w} = \langle w_{1}, \ldots, w_{m}\rangle$.

For a training sample $\langle \vec{x}_{1}, k_{1}\rangle, \ldots, \langle \vec{x}_{n}, k_{n}\rangle$ with labels $k_{i} \in \{0, 1\}$ introduce signs $t_{i} = 2k_{i} - 1 \in \{-1, +1\}$. Then

$$P(t_{i} \mid \vec{x}_{i}) = \sigma(t_{i}\cdot\vec{w}^{T}\vec{\varphi}(\vec{x}_{i})).$$

Hence the logarithm of the likelihood

$$l(\vec{w}) = \sum_{i=1}^{n}\log\sigma(t_{i}\cdot\vec{w}^{T}\vec{\varphi}(\vec{x}_{i})) = -\sum_{i=1}^{n}\log\left(1 + \exp\{-t_{i}\cdot\vec{w}^{T}\vec{\varphi}(\vec{x}_{i})\}\right)$$

is concave.
The Newton-Raphson method leads to the iterative procedure

$$\vec{w}^{new} = \vec{w}^{old} - \left(\nabla\nabla l(\vec{w}^{old})\right)^{-1}\cdot\nabla l(\vec{w}^{old}).$$

With the help of the design matrix $\Phi$ (with entries $\Phi_{ij} = \varphi_{j}(\vec{x}_{i})$) we obtain

$$\vec{w}^{new} = \vec{w}^{old} + \left(\Phi^{T}R\Phi\right)^{-1}\cdot\Phi^{T}\cdot(\vec{k} - \vec{p}),$$

where $R$ is the diagonal matrix with elements $R_{ii} = p_{i}\cdot(1 - p_{i})$ and $\vec{p}$ is the vector with coordinates $p_{i} = \sigma(\vec{w}^{T}\vec{\varphi}(\vec{x}_{i}))$, computed at the iteratively calculated weights $\vec{w} = \vec{w}^{old}$; here $\vec{k} = \langle k_{1}, \ldots, k_{n}\rangle$ is the vector of target labels.
As usual, ridge regression helps to avoid an ill-conditioned situation: replace $\Phi^{T}R\Phi$ by $\Phi^{T}R\Phi + \lambda\cdot I$ for some small $\lambda > 0$.
In the computer program 'VKF system' we use the standard basis: the constant 1 and the attributes themselves.
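Below is a minimal NumPy sketch of the ridge-regularized Newton-Raphson (IRLS) iterations with this basis; the ridge parameter `lam` and the fixed number of iterations are illustrative choices, not the settings of the 'VKF system'.

```python
import numpy as np

def fit_logistic(X, k, lam=1e-3, iterations=25):
    """Ridge-regularized IRLS (Newton-Raphson) for logistic regression.

    X   : (n, d) matrix of continuous attributes
    k   : (n,)  vector of 0/1 target labels
    lam : ridge parameter that keeps the matrix well-conditioned
    """
    n, d = X.shape
    phi = np.hstack([np.ones((n, 1)), X])        # basis: constant 1 and attributes
    w = np.zeros(d + 1)
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-phi @ w))       # logistic function of the linear form
        R = np.diag(p * (1.0 - p))               # diagonal weight matrix
        grad = phi.T @ (k - p) - lam * w         # gradient of the penalized likelihood
        hess = phi.T @ R @ phi + lam * np.eye(d + 1)
        w = w + np.linalg.solve(hess, grad)      # Newton-Raphson step
    return w
```

A call like `fit_logistic(X, k)` returns the weight of the constant term followed by the weights of the attributes.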
Finally, we need a criterion of significance of the regression. For logistic regression two types of criteria were applied:
The Cox-Snell criterion declares attribute significant, if the pseudo-coefficient of determination

$$R^{2}_{CS} = 1 - \left(\frac{L_{0}}{L_{1}}\right)^{2/n}$$

exceeds a prescribed threshold, where $L_{0}$ is the likelihood of the model with the constant term only and $L_{1}$ is the likelihood of the full model.

The McFadden criterion declares attribute significant, if

$$R^{2}_{McF} = 1 - \frac{\log L_{1}}{\log L_{0}}$$

exceeds a prescribed threshold.
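Both statistics can be computed from the log-likelihoods of the fitted model and of the constant-only model, as in the NumPy sketch below (the choice of significance thresholds is left open):

```python
import numpy as np

def log_likelihood(w, phi, k):
    """Logistic log-likelihood of weights w on design matrix phi with 0/1 labels k."""
    p = 1.0 / (1.0 + np.exp(-phi @ w))
    return float(np.sum(k * np.log(p) + (1 - k) * np.log(1 - p)))

def pseudo_r2(ll_full, ll_null, n):
    """Cox-Snell and McFadden pseudo coefficients of determination."""
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_full) / n)
    mcfadden = 1.0 - ll_full / ll_null
    return cox_snell, mcfadden
```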
Conclusion
The 'VKF-system' was applied to the Wine Quality dataset from the Machine Learning repository (University of California, Irvine). The experiments demonstrated the prospects of the proposed approach. For high-quality red wines (with rating >7), all examples were classified correctly.
The disjunction situation (from paragraph 2.3) arose in the relationship between 'alcohol' and 'sulphates'. The positive (although slightly different) weights correspond to the different scales of measurement of the attributes, and the threshold was strictly between 0 and 1. The situation with 'citric acid' and 'alcohol' was similar.
The situation with the pair ('pH', 'alcohol') was radically different. The weight of 'alcohol' was positive, whereas the weight of 'pH' was negative. But with an obvious logical transformation we obtain the implication ('pH' $\Rightarrow$ 'alcohol').
The author would like to thank his colleagues and students for their support and stimulus.