[RFC] Add FreeForm2 Into LightGBM #4742
Comments
@StrikerRUS @jameslamb @guolinke @btrotta Let's have a discussion about adding the FreeForm2 feature here. Feel free to express any concerns and suggestions. Thanks! |
Thanks for this additional information. This seems like a very interesting and powerful project! But I am concerned about the proposal to add it directly to LightGBM. Could you please explain why adding this code to the LightGBM repo is preferable to releasing the project under its own repo? I don't agree with the statement that this change "will have no interference with our current code base".
|
Hi @jameslamb, after some discussion with @chjinche, we agree that we can make the large PR #4733 a separate repo, and use it in LightGBM as an external library, just like compute, eigen and fmt. Do you think that's acceptable? |
hi @jameslamb Thanks for your insights! And thanks @shiyu1994 for proposing the discussion. I'd like to share more comments from my perspective. Feel free to share your views, and hope we could solve the issue asap.
|
Thanks for inviting me to this discussion! I absolutely agree with every point @jameslamb has already made above. I just want to highlight a few things I'm personally concerned about. As was fairly noted, the main purpose of adding FreeForm2 to the LightGBM repo is to help developers use it inside different Microsoft departments.
I agree that this project is quite interesting, and it's great that Microsoft decided to open-source it. For sure, this project can gain some popularity in the ML community. But it's debatable whether it should be part of LightGBM itself, which is a general-purpose tree-based gradient boosting framework. Adding FreeForm2 could shift the development direction towards the AutoML field. I strongly believe that a separate project with LightGBM as a dependency is the best option. Thanks to git and GitHub, this can be done in various ways. Going this way, maintainers' attention won't be diverted by issues and PRs for FreeForm2. I'm kind of OK with a lightweight PR adding the core interface enhancements without which FreeForm2 can't work. But it would be much better if they were done inside FreeForm2's own repository, as they are not used anywhere else. And I'm strongly against adding new CI jobs to the LightGBM repo. We already have a lot, really a lot, of CI jobs to cover as many combinations of different environments as we can. Unfortunately, the number of CI jobs is not unlimited, and we have put a lot of effort into working within that limit without significantly losing coverage. Another frustrating thing is that CI environments are not stable, and sometimes (to be honest, quite often) we spend several days simply repairing something in a CI environment. During that time PRs are blocked from merging and are not even reviewed because of the limited human resources here. Summing up all of the above, any problems with FreeForm2 CI will hurt the development speed of LightGBM. Sorry, but I really can't see why all of this should be part of this repo. |
ha thank you! sorry about that.
Oh wow! Until this comment I did not know that this "FreeForm2" project is intended to be LightGBM-specific. If that's true then yes, I think it makes sense to include
Thank you very much. I definitely think that is preferable. That makes it clearer that that other repo is the place where issues and pull requests related to this large library should go.
I don't agree with this reasoning. As maintainers of this project, we're responsible for supporting all configurations of LightGBM. And users should expect that if we provide an alternative configuration, it has the full support of the project's maintainers unless it is explicitly documented as not being well-supported.
I took a quick look, and see some immediate concerns:
No need to answer each of those specific questions here (to avoid this thread getting too large), but I just raise them as some tangible examples of the maintenance costs that would be introduced by moving forward with this proposal. |
@StrikerRUS @jameslamb, thanks for your suggestions. I think these PRs are only initial steps towards full support of FreeForm2, and we can put in effort in the future to make that support more portable. Actually, we have some ideas for creating a separate library for data handling in LightGBM, which could contain the current preprocessing logic in LightGBM (such as discretization), the target encoding in PR #3234, and FreeForm2. This would allow LightGBM to focus more on the GBDT algorithm, while the data library handles feature-engineering-related issues. |
@StrikerRUS @jameslamb, thanks for your comments. After some discussion, we figured out a way that may solve the concerns you raised 😉. Could we add a reflection mechanism for base class |
@chjinche Thank you very much for your response!
I believe this is the best way to go! And I think it's OK to add some API changes that are required by
Please don't forget to add some links in LightGBM docs to |
Thanks @StrikerRUS for approving the plan! We will prepare a new PR soon, as we discussed. |
I don't understand what "a reflection mechanism for base class |
hi @shiyu1994 @jameslamb @StrikerRUS hope you are doing well! 😄 |
Thanks @shiyu1994 for reviewing the PR! @jameslamb @StrikerRUS May I know if you have time to take a look? 😄 All CI jobs pass now. |
Sorry for the delay, I've been traveling this week. I just reviewed and I'm comfortable with those changes, just looks like there is a bit of extra work to do to ensure the library continues to meet the strict portability requirements for CRAN. (#4782 (review)) |
Hey @chjinche ! Can the branch |
Gently ping @chjinche |
Sorry for the late response, deleted the unused branch. |
Summary
FreeForm2 is a module that transforms original feature values according to the arithmetic and logic expressions specified by users. The transformed features will be used as inputs to the model for training. We plan to integrate FreeForm2 into LightGBM.
Motivation
Quoting the description from @chjinche in #4733:
FreeForm2 is a flexible type of feature transform, created by Microsoft Core Ranking and used widely in Microsoft production model training. As the name indicates, FreeForm2 lets users compose any combination of features they like. Transforms are expressed as formulas that are applied to the model inputs. The surface syntax is s-expressions, delimited by parentheses in a LISP-like fashion. FreeForm2 has an implicit type system and evaluates a single, nested expression that returns a floating-point number. Please see more info about FreeForm in doc PR #4735.
Integration of FreeForm2 allows LightGBM to be more widely used in our products.
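To make the idea of an s-expression feature transform concrete, here is a minimal Python sketch. It is not FreeForm2 and does not use FreeForm2's real syntax or operator set, which are not shown in this thread. The operator table, the `if` form, the feature names, and the `transform` helper are all hypothetical and chosen only for illustration. The one property it mirrors from the description above is that a single nested expression evaluates to a floating-point number.

```python
import operator

# Hypothetical operator table; the real FreeForm2 operator set is not shown in this thread.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    ">": lambda a, b: float(a > b),
    "<": lambda a, b: float(a < b),
}


def tokenize(text):
    """Split an s-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume tokens from the front and build a nested list."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return token


def evaluate(node, features):
    """Evaluate a parsed expression against a dict of feature values."""
    if isinstance(node, list):
        head, *args = node
        if head == "if":  # assumed conditional form: (if cond then else)
            cond, then_branch, else_branch = args
            chosen = then_branch if evaluate(cond, features) != 0.0 else else_branch
            return evaluate(chosen, features)
        left, right = (evaluate(arg, features) for arg in args)
        return float(OPS[head](left, right))
    if node in features:  # atom naming an original feature
        return float(features[node])
    return float(node)  # numeric literal


def transform(expression, features):
    """Return the transformed feature value as a float."""
    return evaluate(parse(tokenize(expression)), features)


# Hypothetical feature names and values for one data row.
row = {"clicks": 3, "impressions": 5}
print(transform("(+ clicks (* 2 impressions))", row))
print(transform("(if (> clicks 2) (/ clicks impressions) 0)", row))
```

The sketch interprets the expression directly for each row, which keeps it short. A production transform library would more likely compile each expression once and apply it to whole columns.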
Description
We plan to divide the integration into 3 parts:
transform_file and header_file. transform_file describes the transform expressions, and header_file is a separate file for the original feature names. This PR also adds some tests and enables checks for transform in both the GitHub Actions workflow and Azure Pipelines. (Integrate transform implementation with lightGBM, add separate header file support. #4734)