Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft] [R] Add QuantileDMatrix creation from dense matrices #9864

Closed
wants to merge 4 commits into from

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Dec 10, 2023

ref #9810

This PR adds a wrapper over QuantileDMatrix for dense matrix inputs.

I am not very sure that I'm implementing all of this correctly so would be ideal if a maintainer could take a deep look.

I have some doubts about how this class (QuantileDMatrix) is meant to be created:

  • Should feature types / names / etc. be added to the proxy dmatrix as part of the iterations?
  • Should labels and/or bounds be set as part of the iterations? If not, are categorical features supported for this class?
  • If the call to XGQuantileDMatrixCreateFromCallback fails, returning a non-zero code, does the proxy dmatrix still need to be freed manually through XGDMatrixFree?
  • Is there something that would currently work with a regular DMatrix but not with a QuantileDMatrix, and which would need to be checked on the R side instead of passing things to the C functions?

Also a small note: there's a PR which hasn't yet been merged as it's currently failing MSVC jobs, which introduces some refactorings around the handling of 'missing' parameter in the JSON configs. After such refactoring is introduced, it should also be used as part of the code introduced here, as otherwise missing values in integer matrices will not be handled correctly.

@trivialfis
Copy link
Member

Let me briefly answer some the of questions first, then try to come up with improved documents.

Should feature types / names / etc. be added to the proxy dmatrix as part of the iterations?

yes.

Should labels and/or bounds be set as part of the iterations? If not, are categorical features supported for this class?

yes, it should be part of the iterations. Yes, categorical features are supported.

If the call to XGQuantileDMatrixCreateFromCallback fails, returning a non-zero code, does the proxy dmatrix still need to be freed manually through XGDMatrixFree?

Yes, everything created successfully should be freed accordingly. If the proxy is created successfully, it should be freed, regardless of how it's used.

Is there something that would currently work with a regular DMatrix but not with a QuantileDMatrix,

The QDM works with the hist tree method (default) for training and normal prediction. It's used to conserve memory for training.

and which would need to be checked on the R side instead of passing things to the C functions?

I usually just let the C functions emit errors.

Comment on lines 80 to 83
feature_weights = NULL,
as_quantile_dmatrix = FALSE,
ref = NULL,
max_bin = NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better to split it into a sub class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since the code paths will be entirely different, better to have a dedicated function xgb.QuantileDMatrix like in the python interface.

@david-cortes
Copy link
Contributor Author

Now I'm confused.

If I understand it correctly, creating a QuantileDMatrix from a dense input involves setting the proxy dmatrix 4 times.

From some quick tests, it seems to be possible to pass label by calling XGDMatrixSetInfoFromInterface after the QuantileDMatrix has been created with argument data only (at least for cases without categorical data).

If the label were to be set on the proxy dmatrix at every iterator call, wouldn't that result in more work being done compared to setting it after? Aren't there conversions and memory copies happening inside those calls?

Also, just to be sure - if one is supposed to set the feature names and types on the proxy dmatrix at every iterator call, does that mean that one can pass different subsets of columns in different iterations?


By the way, what would happen in a case in which XGQuantileDMatrixCreateFromCallback returns an error code, and then XGDMatrixFree after that also returns an error code? Can that happen? How/what should be reported to the user then?

@trivialfis
Copy link
Member

trivialfis commented Dec 12, 2023

Let me write some documents for the QDM internal, it's a special case for the iterator interface, which is also used by external memory support. As a result, some of the things seem redundant, but we kept it to simplify the code so that it can work with both use cases without any special handling. So, to answer the question of whether it can accept a subset of samples, yes, but it's for external memory support and #9864 (comment).

By the way, what would happen in a case in which XGQuantileDMatrixCreateFromCallback returns an error code, and then XGDMatrixFree after that also returns an error code? Can that happen?

If you are trying to free the proxy, then should succeed assuming the proxy was successfully created. If you are trying to free the QDM, then it should fail.

How/what should be reported to the user then?

If FromCallback returns -1, then it's likely that something went wrong during iteration, which means an error in the next or reset function. In this case:

  • Report the errors from callbacks. No need to free the unsuccessfully created QDM, (no harm in doing it either since it will deterministically fail anyway). See:
  • If no error from reset/next callbacks, then report the error from FromCallback as usual like the original DMatrix class.

@trivialfis
Copy link
Member

it can accept a subset of samples, yes, but it's for external memory support.

And yes, the QDM indeed can take on batches on initialization. See quantile_data_iterator.py in demo. It's lesser known but can be helpful when a user doesn't have the memory capacity to construct the QDM while holding the original data in memory.

@trivialfis
Copy link
Member

Ref: #9734 (comment)

Would it be possible to have C functions to create these QuantileDMatrices from array_interface strings for dense/csr data sources + required parameters? Is there some limitation that would make such an approach not work?

it can no longer take on iterators as described in #9864 (comment).

R-package/R/xgb.DMatrix.R Outdated Show resolved Hide resolved
Comment on lines +559 to +611
XGB_DLL SEXP XGQuantileDMatrixFromMat_R(SEXP R_mat, SEXP missing, SEXP n_threads,
SEXP max_bin, SEXP ref_dmat) {
SEXP ret = PROTECT(R_MakeExternalPtr(nullptr, R_NilValue, R_NilValue));
R_API_BEGIN();
DMatrixHandle proxy_dmat_handle;
CHECK_CALL(XGProxyDMatrixCreate(&proxy_dmat_handle));
DMatrixHandle out_dmat;
int res_code1, res_code2;

try {
xgboost::Json jconfig{xgboost::Object{}};
/* FIXME: this 'missing' field should have R_NaInt when the input is an integer matrix. */
jconfig["missing"] = Rf_asReal(missing);
if (!Rf_isNull(n_threads)) {
jconfig["nthread"] = Rf_asInteger(n_threads);
}
if (!Rf_isNull(max_bin)) {
jconfig["max_bin"] = Rf_asInteger(max_bin);
}
std::string json_str = xgboost::Json::Dump(jconfig);

DMatrixHandle ref_dmat_handle = nullptr;
if (!Rf_isNull(ref_dmat)) {
ref_dmat_handle = R_ExternalPtrAddr(ref_dmat);
}

std::string array_str = MakeArrayInterfaceFromRMat(R_mat);
_RMatrixSingleIterator single_iterator(proxy_dmat_handle, array_str.c_str());

res_code1 = XGQuantileDMatrixCreateFromCallback(
&single_iterator,
proxy_dmat_handle,
ref_dmat_handle,
_reset_RMatrixSingleIterator,
_next_RMatrixSingleIterator,
json_str.c_str(),
&out_dmat);
res_code2 = XGDMatrixFree(proxy_dmat_handle);
} catch(IteratorError &err) {
XGDMatrixFree(proxy_dmat_handle);
Rf_error(XGBGetLastError());
}

CHECK_CALL(res_code2);
CHECK_CALL(res_code1);

R_SetExternalPtrAddr(ret, out_dmat);
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
R_API_END();

UNPROTECT(1);
return ret;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on your roadmap, it might be desirable to implement this in R instead of C++, the data iterator is the common interface to QDM and external memory, I will share some old unpublished documents soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better to use a dedicated C++-only route for single-iteration QuantileDMatrix, and then later on implement a customizable DataIterator in R.

@trivialfis
Copy link
Member

Is this still draft?

@david-cortes
Copy link
Contributor Author

Is this still draft?

Yes, and I'm going to create a new MR as I see this kind of functionality would need to be coded differently.

@david-cortes
Copy link
Contributor Author

Closing in favor of bigger PR that follows the same approach as in the python interface: #9913

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants