[sklearn] OneHotEncoder doesn't work correctly #684
Comments
Hi @faterazer, thanks for reaching out! We use whatever the most current version of SKL is, so right now 1.2.1. Was your model trained on the same version of scikit-learn that you're trying to use Hummingbird with? Just trying to make sure it's not a simple fix. (Lots of times, users have issues if the model is trained with an older version of SKL and then they call Hummingbird on a saved model.) Can you post a little bit of your code so we can take a look? Maybe we need to add the new field.
Hi, thank you for your suggestions. I read your reply and double-checked my steps. Unfortunately, the problem still exists. More details should make it easier to locate, so I have put my code and test data in test.zip. Here is my workflow:
1. In test.zip, I constructed some test data: fifteen columns in total, all of them categorical features. I saved the data as test/test.csv.
2. For various reasons, I need to work across conda environments. First, I use a conda environment that includes Python 3.10 and scikit-learn 1.2.1 but not hummingbird-ml. I construct a sklearn OneHotEncoder and fit it on the test data. Finally, I save the encoder/pipeline to a binary file with pickle. The code is in test/A.py.
3. Then I use another conda environment, which includes Python 3.8, scikit-learn 1.2.1, and hummingbird-ml 0.4.7. I load my sklearn preprocessor from the binary file with pickle and use hummingbird-ml to convert it. Finally, I compare the outputs from sklearn and hummingbird-ml, and the shapes are different. The code is in test/B.py.
4. I found that if I change line 16 of test/A.py from OneHotEncoder(sparse_output=False, handle_unknown="infrequent_if_exist", min_frequency=0.005) to OneHotEncoder(sparse_output=False, handle_unknown="ignore"), then everything works. The sklearn changelog says that since version 1.1, handle_unknown accepts this new option, which I would like to use but which causes the problem.
Could you look into my steps and code? Did I make a mistake anywhere? Or is there a way to fix the problem? I appreciate your time and effort.
Thanks again for all your work on hummingbird-ml. It's an awesome project, and I hope I can keep using it.
Yours sincerely,
faterazer
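The effect described in step 4 can be illustrated without either library. The following is a minimal pure-Python sketch, assuming that a fractional min_frequency groups every category whose count falls below min_frequency * n_samples into one shared "infrequent" output column. The helper name one_hot_width and the sample data are hypothetical and are not taken from test.zip. A converter that ignored this grouping would emit one column per raw category, which is consistent with the wider converted output reported in this thread.

```python
from collections import Counter


def one_hot_width(column, min_frequency=None):
    """Count the one-hot output columns for a single categorical feature.

    Hypothetical helper: with min_frequency=None every distinct category
    gets its own column. With a fractional min_frequency, categories whose
    count is below min_frequency * n_samples are assumed to share a single
    "infrequent" column.
    """
    counts = Counter(column)
    if min_frequency is None:
        return len(counts)
    threshold = min_frequency * len(column)
    n_infrequent = sum(1 for count in counts.values() if count < threshold)
    n_frequent = len(counts) - n_infrequent
    # All infrequent categories collapse into at most one extra column.
    return n_frequent + (1 if n_infrequent else 0)


# Made-up feature: two common categories and three rare ones (1000 rows).
column = ["a"] * 500 + ["b"] * 497 + ["c", "d", "e"]

plain_width = one_hot_width(column)
grouped_width = one_hot_width(column, min_frequency=0.005)
print(plain_width, grouped_width)  # 5 3
```

Summing this width over all fifteen columns gives the total output dimension, so a mismatch in how rare categories are handled shows up directly as a shape difference.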
Hello! I think that the attachment (test.zip) got dropped. If it's easier, you could check the files into a fork on GitHub and post a link!
Thank you for your in-depth example with details! I was able to reproduce everything you said. Yes, it looks like we need to add this feature to the list of supported options (and we should at least be raising an error for the ones we don't support). We'll add that to the queue!
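As a rough illustration of "raising an error for the ones we don't support", here is a hypothetical guard. It is not Hummingbird's actual code: the function name, the set of supported values, and the SimpleNamespace stand-ins for a fitted encoder are all assumptions made for this sketch. Only the attribute names handle_unknown, min_frequency, and max_categories correspond to OneHotEncoder parameters.

```python
from types import SimpleNamespace

# Assumption for illustration only: the values a converter can handle.
SUPPORTED_HANDLE_UNKNOWN = {"error", "ignore"}


def check_one_hot_encoder_options(encoder):
    """Fail loudly instead of silently producing a different output shape."""
    handle_unknown = getattr(encoder, "handle_unknown", "error")
    if handle_unknown not in SUPPORTED_HANDLE_UNKNOWN:
        raise NotImplementedError(
            f"handle_unknown={handle_unknown!r} is not supported"
        )
    # Infrequent-category grouping changes the number of output columns.
    for name in ("min_frequency", "max_categories"):
        if getattr(encoder, name, None) is not None:
            raise NotImplementedError(f"{name} is not supported")


# Stand-in objects carrying only the attributes the check reads.
ok = SimpleNamespace(
    handle_unknown="ignore", min_frequency=None, max_categories=None
)
bad = SimpleNamespace(
    handle_unknown="infrequent_if_exist", min_frequency=0.005, max_categories=None
)

check_one_hot_encoder_options(ok)  # passes silently
try:
    check_one_hot_encoder_options(bad)
except NotImplementedError as exc:
    print(f"rejected: {exc}")
```

Failing at conversion time would make the mismatch visible immediately, rather than later as a shape difference between the two outputs.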
Hello, I found this project last week, and thanks for all of this work.
I installed
Hummingbird-ml==0.47
via pip, and I want to know which version of sklearn I should use. I want to use sklearn's one-hot encoder to preprocess my categorical features, but the output dimension from sklearn differs from that of the converted PyTorch model. With sklearn, 15 features -> 69 dimensions, but with the converted PyTorch model, 15 features -> 76 dimensions.
After checking, I'm sure the problem lies in the arguments of sklearn's OneHotEncoder:
Is there any way to solve this problem? Thanks for any solution!