Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Add data iterator, quantile dmatrix, and missing feature_types #9913

Merged
merged 18 commits into from
Jan 30, 2024

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Dec 20, 2023

ref #9810
closes #9864

(note that it has some overlap with #9829 - whichever is merged first will create conflicts in the other)

This PR adds R wrappers over the following:

  • Data iterators, for building DMatrix objects from external memory.
  • QuantileDMatrix, built using the same data iterator from the R side.
  • Parameter feature_types in DMatrix construction.

Additionally, it updates the docs for feature_types in the python interface in order to reflect what the current code does.

@@ -32,6 +32,10 @@ export(slice)
export(xgb.DMatrix)
export(xgb.DMatrix.hasinfo)
export(xgb.DMatrix.save)
export(xgb.DataIter)
export(xgb.ExternalDMatrix)
export(xgb.ProxyDMatrix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need to export the proxy DMatrix, users are unlikely to interact with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the function that needs to be called inside the data iterator. The class constructor is not exported.

@@ -32,6 +32,10 @@ export(slice)
export(xgb.DMatrix)
export(xgb.DMatrix.hasinfo)
export(xgb.DMatrix.save)
export(xgb.DataIter)
export(xgb.ExternalDMatrix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not expose this as an independent class for now. One day we might support external storage with QuantileDMatrix.

What do you think if we make the data parameter of DMatrix to be compatible with the iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If QuantileDMatrix is a different class from from DMatrix, I think it does make sense to have an ExternalDMatrix as a different class, since just like QDM, it behaves differently from a regular DMatrix.

If implemented through xgb.DMatrix by passing an iterator as first argument, I think the function signature would be quite unintuitive, since for example label would be an argument to both the iterator's ProxyDMatrix and to DMatrix.

Would also leave the question about what happens if e.g. the user passes some features names in the iterator, and then different names in DMatrix, and so on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I have been struggling with this since the first review and have been stalling the decision (my bad). Will look closer in the new class.

R-package/R/xgb.DMatrix.R Outdated Show resolved Hide resolved
R-package/R/xgb.DMatrix.R Show resolved Hide resolved
R-package/R/xgb.DMatrix.R Show resolved Hide resolved
R-package/R/xgb.DMatrix.R Show resolved Hide resolved
Comment on lines 423 to 424
#' with a value of zero when the stream of data ends (all batches in the iterator have been consumed),
#' or an integer with a positive value when there are more batches in the line to be consumed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is one of the paper cuts I made to XGBoost, I think it's better to use boolean value instead of 0 and 1. At the time I was dealing with the C callbacks and the need to create appropriate wrappers skipped my mind.

Comment on lines 597 to 604
#' \item If created as a regular 'DMatrix', the data is accessed on-demand as needed, multiple
#' times, without being concatenated (but note that fields like 'label' \bold{will} be concatenated
#' from multiple calls to the data iterator).
#' \item If created as a 'QuantileDMatrix', the quantized representation of the full data will
#' be created in memory, concatenated from multiple calls to the data iterator. The quantized
#' version is typically lighter than the original data, so there might be cases in which this
#' representation could potentially fit in memory even if the full data doesn't.
#' }
Copy link
Member

@trivialfis trivialfis Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remain wary about the interface here. We currently have:

  • DMatrix
  • QuantileDMatrix
  • DMatrix + Iter => External Memory training
  • QuantileDMatrix + iter => Instance construction != External memory training.

As you are already aware, the last option doesn't fetch data from external memory during training, it's more an option for distributed frameworks that manage data by chunks. In XGBoost, when we talk about external memory, we usually specifically mean training, which is fetching data on demand.

  • (Potential future) Quantile + iter => External memory training.

I think there's a need to clarify these types (and their names). I agree with the preference that external memory DMatrix can have its type, but mixing it with QDM seems a bit confusing. From a Python/C++ programmer's perspective, maybe we don't need a new type for it, but a new constructor instead? Alternatively, we can defer the decision by not supporting iterator with QDM at the moment.

Discussions are welcomed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does sound quite confusing, but I don't think the python interface is the best reference here.

If I understand it correctly, the issue here is that there's two routes in this PR which give the same object type:

  • xgb.QuantileDMatrix + in-memory data.
  • xgb.ExternalDMatrix + xgb.DataIter + as_quantile_dmatrix=TRUE.

But if this were to be implemented the same way as in python (accepting xgb.DataIter as data instead of having a dedicated class for it), then there would be:

  • A function xgb.DMatrix which could still produce different output types depending on its inputs (regular DMatrix vs. the 'ExternalDMatrix' introduced here).
  • A parameter missing which is auto-modified for integer inputs in some cases but not others.
  • Multiple parameters that can contradict each other (e.g. feature names).

If the idea is to avoid inconsistencies, then I think the best option would be to have only two functions:

  • xgb.DMatrix, taking only in-memory data, and accepting a parameter as_quantile_dmatrix in order to produce class xgb.QuantileDMatrix as output.
  • xgb.ExternalDMatrix, taking only xgb.DataIter as input, and also accepting a parameter as_quantile_dmatrix.

Either that, or 4 functions after an 'ExternalQuantileDMatrix' gets implemented.

But I can also guess that:

  • There will be much fewer people using xgb.ExternalDMatrix than the others.
  • People using xgb.ExternalDMatrix will likely be the more advanced users.
  • People using xgb.ExternalDMatrix will probably be more worried about getting the data iterators right and wondering about their usage than about other DMatrix properties.

Which is why I chose this kind of separation.

If there was an ExternalQuantileDMatrix type implemented, perhaps it'd make more sense to have 4 possible constructors, but I'm not sure about what's the best way to define these constructors and classes with the current behavior.

Copy link
Member

@trivialfis trivialfis Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I don't think the python interface is the best reference here

I agree. For background, the interface of Python is mostly caused by the use of text file input like data="path/to/csv", which has the same problem with the iterator with confusing parameters.

A parameter missing which is auto-modified for integer inputs in some cases but not others.

Not entirely sure what this means. I think as long as a missing value is accepted, integer will be considered as float.

Thank you for listing all the possible options, and apologies for the slow replies lately, will be more active starting tomorrow.

In general, I would still like to differentiate DMatrix and QuantileDMatrix since they have different functionalities and accept a different set of parameters, only QDM accepts max_bin. Lastly, they have different data inside, one stores the raw data, while the other stores the histogram index. A single parameter as_quantile_dmatrix might not be sufficient to reflect and clarify the difference, and my preference is still to have two different classes for them.

On the other hand, ExternalMemDMatrix is a lot more similar to DMatrix compared to QDM. They have the same set of parameters, and can be used "almost" all the places where DMatrix can be used, share the same underlying data content, it's just the former stores it on disk while the latter stores it in memory. They are different for sure, and I agree with you that it should have a different class. My point is just that if ExternalMemory were to have a different class, we should definitely have a different class for QDM.

The question is how to integrate the concept of external memory into the mix. I propose the following functions and classes, ExternalMemoryDMatrix as implemented in this PR, but without the support for being a QDM. In addition, a QDM constructor that accepts only the iterator xgb.QuantileDMatrix.from_iterator(iter) -> xgb.QuantileDMatrix. So, one new class, and one new method to an existing class.

The reasoning behind this is that the xgb.QuantileDMatrix and xgb.QuantileDMatrix.from_iterator return the same type with the exact same underlying data (in c++), it's just how it's constructed in the first place and so there's no need for a new class in R. Then we use different constructor to represent the difference of construction process. As for the external memory version of DMatrix, as you have suggested, they behave differently, we will continue with your approach of creating a new class for it.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

So that'd leave the following functions with the following output classes:

  • xgb.DMatrix -> xgb.DMatrix
  • xgb.QuantileDMatrix -> xgb.QuantileDMatrix
  • xgb.QuantileDMatrix.from_iterator -> xgb.QuantileDMatrix
  • xgb.ExternalDMatrix -> xgb.ExternalDMatrix

Is that a correct understanding?

I think as long as a missing value is accepted, integer will be considered as float.

Not quite - integer matrices are also accepted, and are passed as such to the array_interface parser. In such case, the missing value is auto-adjusted in the C++ code according to the type of the input, but if the user passes a data iterator, the missing value is passed before the iterator to the first C function, and thus cannot be converted later on once the data is seen and the dtype is determined (with integer missing value in R being -INT_MAX).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a correct understanding?

Yes.

and thus cannot be converted later on once the data is seen and the dtype is determined

That's interesting, did not think of that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

#' # Function 'xgb.ProxyDMatrix' must be called manually
#' # at each batch with all the appropriate attributes,
#' # such as feature names and feature types.
#' xgb.ProxyDMatrix(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that we hide this function and pass a closure to the user like the Python code? I think it's easier to use if the user doesn't need to worry about an additional function and an opaque handle. R is a high level language and it would great if we can hide these C artifacts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a closure is the best option here.

Another option could be to have the user return the result of this xgb.ProxyDMatrix (or NULL to signal end of batch), but let this be just a simple list and defer processing for later instead of processing things just after xgb.ProxyDMatrix is called.

It would however forego the possibility of doing a manual clean up of whatever environment generated the data inside the iterator if the user wishes to do so immediately before the next batch, but perhaps it's a more logical design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a closure is the best option here.

Could you please elaborate on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a closure is the best option here.

Could you please elaborate on this?

They are akward to use, hard to document, and hard to examine as a user. Functions, environments and lists are more common and easier to use. Or do you think that a closure could bring some advantage here that would not be achieved by e.g. letting the user return the result of xgb.ProxyDMatrix ? (assuming that the handle is left out of the signature)

Copy link
Member

@trivialfis trivialfis Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have preference on whether a closure or a function or a class should be used (they are all the same), it's just I would like to hide/abstract many things as possible, and sometimes closure is the way to do it. If there are other approaches, I'm happy to take it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored out the proxy_handle parameter. Now the user needs to call a function xgb.ProxyDMatrix inside of the callback (which doesn't take any handle) and return its result.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! I'm excited to have this merged for the R package.

@trivialfis trivialfis merged commit 3abbbe4 into dmlc:master Jan 30, 2024
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants