Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DC verification algorithm #444

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

xJoskiy
Copy link
Contributor

@xJoskiy xJoskiy commented Aug 1, 2024

Add DC verification algorithm and corresponding tests as well as python bindings

@xJoskiy xJoskiy force-pushed the DC-verification branch 13 times, most recently from f1eef70 to 1ea345b Compare August 15, 2024 12:41
@xJoskiy xJoskiy force-pushed the DC-verification branch 16 times, most recently from d6af1f4 to c3c62e3 Compare August 20, 2024 11:16
ol-imorozko and others added 11 commits October 29, 2024 20:14
This class represents an operator used in predicates for
Denial Constrains (DC) representation. Predicates used there
are less, greater, eq, neq, geq and leq.

This C++ implementation is 100% bad, there is whole bunch of
objects being created, but conceptually all of them are the same.
Object "Operator '+'" and another object "Operator '+'" represents
the same thing.

----------------------------------------------------------------------------
This is just a copy of a java code from https://github.com/RangerShaw/FastADC.

I will refactor and think about better implementation later.
I'll be just copying java code to get working algorithm ASAP,
and after that I'll start thinking about good implementation.
This commit adds test_dc_structures.cpp file, which will be used
to test different data structures which are required for
DC representation (there are a lot).
This class represents a column operand within a predicate for FastADC.

FastADC processes Denial Constraints (DCs) that involve comparisons between
pairs of rows within a dataset. A typical DC example, derived from a Functional
Dependency (FD) such as A -> B, is expressed as: ∀𝑡, 𝑠 ∈ 𝑟, ¬(𝑡.𝐴 = 𝑠.𝐴 ∧ 𝑡.𝐵 ≠ 𝑠.𝐵).
This denotes that for any pair of rows in the relation, it should not be the case
that while the values in column "A" are equal, the values in column "B" are unequal.

A predicate in this context (e.g., 𝑡.𝐴 = 𝑠.𝐴) comprises three elements to be fully
represented: the column operand from the first tuple ("t.A"), the comparison operator
("="), and the column operand from the second tuple ("s.A"). The `ColumnOperand` class
encapsulates the column operand part of a predicate, such as "t.A" or "s.A".
First step in FastADC algorithm is to build so-called "Predicate Space".
This is a long process during which many places in the code wants to get
a Predicate. But each predicate is stored in a global storage -- map.
In Java code this class (and other similar "provider" classes) are
singletons.

BaseProvider class is the class, from which a *Provider class should be
derived. It ensures that only a PredicateBuilder class can initialize
and free these singletons.

I'm sure there exists a better approach, where we will store Provider
classes in some fields to bind their lifetime more explicitly, but
this is how it's done in Java, and I don't have much time to devise
perfect architecture.
This class acts as a centralized storage to manage
and provide access to Predicate objects.

A Predicate is defined as "t1.A_i op t2.A_j", where t1 and t2 represent
different rows, and A_i and A_j are columns (which may be the same or different)

The FastADC algorithm first will build a so-called "Predicate Space",
which is a set of all predicates that are allowed on R (set of rows,
basically a table). In order to create and store predicates, this commit
implements a singleton class with a hashmap storage.
FastADC processes Denial Constraints (DCs) that involve comparisons between
pairs of rows within a dataset. A typical DC example, derived from a Functional
Dependency such as A -> B, is expressed as:
`forall t, s in r, not (t.A = s.A and t.B != s.B)`
This denotes that for any pair of rows in the relation, it should not be the case
that while the values in column "A" are equal, the values in column "B" are unequal.

A predicate in this context (e.g., t.A == s.A) comprises three elements to be fully
represented: the column operand from the first tuple ("t.A"), the comparison operator
("="), and the column operand from the second tuple ("s.A").
This simple test creates two predicates on a 2x2 table and evaluates them.
We're checking for mo::GetPredicate function ability to correctly
create a predicate
In the original FastADC pull request this class manages creation of
predicates, so it initializes PredicateProvider. But in this pr this
class is not required for DC verification. Hence adding a temorary
class just to make the tests work
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/tests/test_dc_verifier.cpp Outdated Show resolved Hide resolved
src/tests/test_dc_verifier.cpp Outdated Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@xJoskiy xJoskiy force-pushed the DC-verification branch 7 times, most recently from 88a8cde to 44e8329 Compare October 29, 2024 22:03
src/core/algorithms/dc/verifier/dc_verifier.cpp Outdated Show resolved Hide resolved
Comment on lines +372 to +383
auto left_op = ColumnOperand(left.GetColumn(), left.GetTuple());
auto right_op = ColumnOperand(right.GetColumn(), right.GetTuple());

auto less_equal = Operator(OperatorType::kLessEqual);
auto greater_equal = Operator(OperatorType::kGreaterEqual);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto left_op = ColumnOperand(left.GetColumn(), left.GetTuple());
auto right_op = ColumnOperand(right.GetColumn(), right.GetTuple());
auto less_equal = Operator(OperatorType::kLessEqual);
auto greater_equal = Operator(OperatorType::kGreaterEqual);
ColumnOperand left_op{left.GetColumn(), left.GetTuple()};
ColumnOperand right_op{right.GetColumn(), right.GetTuple()};
Operator less_equal{OperatorType::kLessEqual};
Operator greater_equal{OperatorType::kGreaterEqual};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for me, it is just a matter of choice. I find copy initialization here more readable and informative rather than direct list initialization

Comment on lines +409 to +417
auto less_pred = Predicate(Operator(OperatorType::kLess), left, right);
auto greater_pred = Predicate(Operator(OperatorType::kGreater), left, right);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto less_pred = Predicate(Operator(OperatorType::kLess), left, right);
auto greater_pred = Predicate(Operator(OperatorType::kGreater), left, right);
Predicate less_pred{Operator(OperatorType::kLess), left, right};
Predicate greater_pred{Operator(OperatorType::kGreater), left, right};

size_t ind, start = 0;
std::string token, sep = " and ";
std::vector<Predicate> predicates;
while (true) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to use

Suggested change
while (true) {
std::vector<std::string> tokens;
boost::split(tokens, dc_string, boost::is_any_of(" and "));
for (auto& token : tokens) {

Copy link
Contributor Author

@xJoskiy xJoskiy Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It splits the string be each of the specified characters, not by a whole word

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I haven't found any intelligible way to split the string

Comment on lines 457 to 460
std::vector<std::byte const*> res;
res.reserve(data_.size());
for (auto const& col : data_) {
res.push_back(col.GetValue(row));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<std::byte const*> res;
res.reserve(data_.size());
for (auto const& col : data_) {
res.push_back(col.GetValue(row));
}
std::vector<std::byte const*> res;
std::transform(data_.begin(), data_.end(), std::back_inserter(res), [&row](model::TypedColumnData& col) { return col.GetValue(row); });

or, if you want to pre-allocate memory,

Suggested change
std::vector<std::byte const*> res;
res.reserve(data_.size());
for (auto const& col : data_) {
res.push_back(col.GetValue(row));
}
std::vector<std::byte const*> res(data_.size());
std::transform(data_.begin(), data_.end(), res.begin(), [&row](model::TypedColumnData& col) { return col.GetValue(row); });

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions


if (!search_res.empty() or !inv_search_res.empty()) res = false;

AddHighlights(search_res, i + index_offset_);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'AddHighlights' [clang-diagnostic-error]

        AddHighlights(search_res, i + index_offset_);
        ^

if (!search_res.empty() or !inv_search_res.empty()) res = false;

AddHighlights(search_res, i + index_offset_);
AddHighlights(inv_search_res, i + index_offset_);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'AddHighlights' [clang-diagnostic-error]

        AddHighlights(inv_search_res, i + index_offset_);
        ^

if (Eval(tuple, dc.GetPredicates())){
res = false;
size_t ind = i + index_offset_;
violations_.push_back({ind, ind});

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'violations_' [clang-diagnostic-error]

            violations_.push_back({ind, ind});
            ^

std::vector<point> search_res = hash[key].QuerySearch(box);
std::vector<point> inv_search_res = hash[key].QuerySearch(inv_box);
if (!search_res.empty() or !inv_search_res.empty()) {
AddHighlights(search_res, i + index_offset_);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'AddHighlights' [clang-diagnostic-error]

            AddHighlights(search_res, i + index_offset_);
            ^

std::vector<point> inv_search_res = hash[key].QuerySearch(inv_box);
if (!search_res.empty() or !inv_search_res.empty()) {
AddHighlights(search_res, i + index_offset_);
AddHighlights(inv_search_res, i + index_offset_);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'AddHighlights' [clang-diagnostic-error]

            AddHighlights(inv_search_res, i + index_offset_);
            ^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants