Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support using CountVectorizer & TfidVectorizer in cuml.pipeline.Pipeline #5034

Merged
merged 3 commits into from
Dec 17, 2022

Conversation

lasse-it
Copy link
Contributor

It's not possible to use the vectorizers with a pipeline, because the pipeline calls the method fit_transform with 3 arguments and only 2 are supported.

This fix is similar to the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/text.py#L2095
And has already been implemented for the HashingVectorizer:

def fit_transform(self, X, y=None):

@lasse-it lasse-it requested a review from a team as a code owner November 29, 2022 23:36
@GPUtester
Copy link
Contributor

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@rapids-bot
Copy link

rapids-bot bot commented Nov 29, 2022

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Nov 29, 2022
@cjnolet
Copy link
Member

cjnolet commented Nov 30, 2022

ok to test

@cjnolet cjnolet added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 30, 2022
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the contribution!

@dantegd
Copy link
Member

dantegd commented Dec 8, 2022

For some reason the Branch checker job is stuck in a couple of PRs, will work on solving that or admin merging later today

@dantegd dantegd changed the title Support using CountVectorizer & TfidVectorizer in cuml.pipeline.Pipeline Support using CountVectorizer & TfidVectorizer in cuml.pipeline.Pipeline Dec 17, 2022
@dantegd dantegd changed the base branch from branch-22.12 to branch-23.02 December 17, 2022 02:20
@dantegd
Copy link
Member

dantegd commented Dec 17, 2022

@gpucibot merge

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@07f0bc4). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.02    #5034   +/-   ##
===============================================
  Coverage                ?   81.83%           
===============================================
  Files                   ?      200           
  Lines                   ?    14892           
  Branches                ?        0           
===============================================
  Hits                    ?    12187           
  Misses                  ?     2705           
  Partials                ?        0           
Flag Coverage Δ
dask 48.93% <0.00%> (?)
non-dask 72.97% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dantegd
Copy link
Member

dantegd commented Dec 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d59da09 into rapidsai:branch-23.02 Dec 17, 2022
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
…Pipeline` (rapidsai#5034)

It's not possible to use the vectorizers with a pipeline, because the pipeline calls the method `fit_transform` with 3 arguments and only 2 are supported.

This fix is similar to the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/text.py#L2095
And has already been implemented for the `HashingVectorizer`: https://github.com/rapidsai/cuml/blob/eeb4d7c47803ea8d4fe5f8bcfbc39394fd1b9bee/python/cuml/feature_extraction/_vectorizers.py#L867

Authors:
  - Lasse Hyldahl Jensen (https://github.com/lasse-it)

Approvers:
  - Carl Simon Adorf (https://github.com/csadorf)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Victor Lafargue (https://github.com/viclafargue)

URL: rapidsai#5034
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants