How is your data handled? #7

Open
pursuit1994 opened this issue May 20, 2017 · 2 comments

Comments

@pursuit1994

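I hope it isn't rude to ask in Chinese.
I'm a first-year graduate student and a beginner, and I have a few questions:
1. How are featindex.txt and featindex.fm.txt related? I noticed that featindex.txt mostly contains codes for columns 8, 10, and 12. (A side question: were these codes assigned arbitrarily? They don't seem to follow any order.) What about the codes for the other columns?
2. The data contains far fewer samples labeled 1 than samples labeled 0. Does this need any special handling? Does the class imbalance affect the results?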

@tianmingdu
Collaborator

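Hi pursuit1994,
1. featindex.txt encodes every feature that is used. Each entry has the form (a:b c), where a is the feature's index, b is the corresponding value, and c is the assigned code. featindex.fm.txt encodes only a subset of the features. The codes are assigned randomly.
2. The imbalance does exist. During training we randomly drop some of the samples labeled 0.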
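
For readers who want to try the idea in point 2, here is a minimal sketch of random negative downsampling. It is not the repository's own code: the function name `downsample_negatives`, the `keep_rate` parameter, and the `(label, features)` tuple layout are all assumptions made for illustration, and the thread does not say what sampling rate the maintainers use.

```python
import random


def downsample_negatives(samples, keep_rate, seed=0):
    """Keep every positive sample and each negative with probability keep_rate.

    `samples` is assumed to be an iterable of (label, features) pairs with
    label 1 for positives and 0 for negatives (a hypothetical layout).
    """
    if not 0.0 <= keep_rate <= 1.0:
        raise ValueError("keep_rate must be between 0 and 1")
    rng = random.Random(seed)  # local generator so the result is reproducible
    kept = []
    for label, features in samples:
        # Positives are always kept; negatives survive a biased coin flip.
        if label == 1 or rng.random() < keep_rate:
            kept.append((label, features))
    return kept


if __name__ == "__main__":
    # Toy data: 10 positives and 1000 negatives with placeholder features.
    data = [(1, [i]) for i in range(10)] + [(0, [i]) for i in range(1000)]
    sampled = downsample_negatives(data, keep_rate=0.1)
    positives = sum(1 for label, _ in sampled if label == 1)
    print(f"kept {len(sampled)} samples, {positives} of them positive")
```

Downsampling changes the positive rate the model sees, so predicted probabilities will generally be higher than the true rate and may need recalibration if the absolute values matter.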

@pursuit1994
Author

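Hello,
Thank you very much for clearing that up! Your answer helped me a lot.
One more question: do ID features need to be encoded? It seems the data would become very large after encoding, and as far as I can tell the IDs serve no purpose other than linking to information in other fields. I'd like to hear your opinion.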
