Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft] first version of DynamicAlgorithm class + demo algorithm to demonstrate #395

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

rpuwa
Copy link
Contributor

@rpuwa rpuwa commented Apr 21, 2024

No description provided.

Copy link
Collaborator

@BUYT-1 BUYT-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't what I meant. The Algorithm class is designed to have its LoadData method be called once, and then for Execute to be called multiple times. Thus, for dynamic algorithms I suggest executing the algorithm for the first time in LoadData, then processing updates in Execute. Because the names of methods no longer make sense, I then suggested that they should be renamed: LoadData -> Initialize, Execute -> Process, along with their Python bindings (.def("execute", ...) -> .def("process", ...)).

The workflow with the Python constructor doing the initialization is for the future. That's what Algorithm's bindings should be doing. It would be nice for it to be implemented here, but if not, then that's fine.

So, again, here is what I think should be done for now: execute the dynamic algorithm in LoadData, process updates in Execute. Rename the methods and their bindings as above. Leave everything else as is.

@BUYT-1
Copy link
Collaborator

BUYT-1 commented May 3, 2024

With the changes above, the DynamicAlgorithm class will not be needed. The methods for getting results should be moved to a class analogous to FdAlgorithm (DynamicFdAlgorithm?) and named accordingly. From that class, the other dynamic FD mining algorithms can be inherited. Bindings should then look similar to FD algos' bindings, where DynamicFdAlgorithm is bound by itself, with result methods, and concrete algorithms are bound as its inheritors.

Comment on lines +34 to +35
insert_statements_.Clear();
delete_statements_.clear();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be achieved if we just provide empty values as defaults for these options. The user has to set all options before executing the algorithm, so they would inevitably become empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields will not be empty, because they are not options

Copy link
Collaborator

@BUYT-1 BUYT-1 May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. They store the inserted and deleted lines, right? Isn't it wasteful to store them entirely in memory, though? I think these things should be managed by the algorithm.

Comment on lines +55 to +59
if (data->GetNumberOfColumns() != input_table_->GetNumberOfColumns()) {
throw config::ConfigurationError(
"Invalid data received: the number of columns in the \
modification statements is different from the table.");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Option class allows validation and normalization of values, default values, and it is also possible to have options that are only shown if some condition, which is expected to depend on other option values, holds. The basic checks can be done with those methods. MetricVerifier uses all the Option's methods, so you can look there for an example. But option management shouldn't be in the base class anyway. In particular, some algorithms may not support deletion, yet the option would be visible here.

Comment on lines +20 to +23
static int CreateId() {
static int id = 1;
return id++;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes row IDs unpredictable for the user. They could have several algorithms running, and this will make row numbers inconsistent.
Don't make row IDs static. Row IDs should be indices of the rows in the table that a particular algorithm object remembers rows of (perhaps without actually storing it in memory in its entirety). So I doubt the row itself needs to remember its own ID.
Also, using IDs like that will make some checks easier. An algorithm will only need to store the total number of rows to check if a row can be deleted, for example. And it will be possible to store which rows have been deleted in a bitset.

// configure update statements
ValidateDeleteStatements(update_old_batch_);
ValidateInsertStatements(update_new_batch_);
if (insert_statements_.Size() != delete_statements_.size()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make more sense to have an object that yields (id, new_data) pairs instead as a single update option in a manner similar to IDatasetStream or just a vector of pairs. This option is one of the cases where checking its value (for example, checking that all the rows have the correct length) completely might not be feasible.

Comment on lines +103 to +109
bool algos::DynamicAlgorithm::HasBatch() {
bool result = false;
for (const std::string_view& option_name : kCrudOptions) {
result |= IsOptionSet(option_name);
}
return result;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already taken care of by the configuration system. Default values can be set by Option to be empty, and we'd just have no rows/IDs/pairs in the relevant field.

}

void algos::DynamicAlgorithm::MakeExecuteOptsAvailable() {
if (is_initialized_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check does nothing, existing configuration only calls this method after LoadData has been called.

Comment on lines +46 to +48
for (const std::string_view& option_name : kCrudOptions) {
previous_options.erase(option_name);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes these options invisible in a GetNeededOptions call, but they should be visible.

Comment on lines +58 to +60
for (const std::string_view& option_name : kCrudOptions) {
SetOptionByName(algo, option_name, kwargs);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just bypassing the existing configuration infrastructure.

unsigned long long algos::DynamicAlgorithm::ExecuteInternal() {
if (HasBatch()) {
ConfigureBatch();
unsigned long long time_ms = ProcessBatch();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProcessBatch should just be ExecuteInternal for every algorithm (unless there are a lot of common things happening, but that's for the future).

Comment on lines +50 to +55
.def(py::init([](py::kwargs const& kwargs) {
auto algo = std::make_unique<DynamicAlgorithmDemo>();
ConfigureAlgo(*algo, kwargs);
algo->LoadData();
return algo;
}))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave the constructor as is for now. Use RegisterAlgorithm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants