How is your data handled? #7

pursuit1994 · 2017-05-20T13:54:44Z

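I hope it's not rude of me to ask in Chinese.
I'm a first-year graduate student and a beginner, and I'd like to ask you a few questions:
1. How are featindex.txt and featindex.fm.txt related? I noticed that most of the codes in featindex.txt are for columns 8, 10, and 12. (One more question here: were these codes assigned arbitrarily? They don't seem to follow any order.) What about the codes for the other columns?
2. The data contains far fewer samples labeled 1 than samples labeled 0. Does this need any special handling? Does the class imbalance affect the results?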

tianmingdu · 2017-05-21T07:47:46Z

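Hello pursuit1994,
1. featindex.txt contains codes for all the features that are used, in the form (a:b c), where a is the feature's index, b is the corresponding value, and c is the code. featindex.fm.txt only encodes a subset of the features. The codes are assigned randomly.
2. The imbalance does exist. During training we randomly drop some of the label-0 samples.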
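
For readers following along, randomly dropping label-0 samples could look roughly like the sketch below. This is not the repository's actual code: the function name `downsample_negatives`, the `keep_rate` and `seed` parameters, and the `(label, features)` tuple layout are all assumptions made for illustration.

```python
import random


def downsample_negatives(samples, keep_rate, seed=0):
    """Keep every label-1 sample and each label-0 sample with probability keep_rate.

    `samples` is assumed to be a list of (label, features) tuples; this
    layout is hypothetical and not taken from the repository.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for label, features in samples:
        # Positives are always kept; negatives survive a random draw.
        if label == 1 or rng.random() < keep_rate:
            kept.append((label, features))
    return kept


if __name__ == "__main__":
    data = [(1, "a"), (0, "b"), (0, "c"), (1, "d"), (0, "e")]
    print(downsample_negatives(data, keep_rate=0.5))
```

Note that dropping negatives changes the positive rate the model sees, so predicted probabilities may be shifted relative to the original data.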

pursuit1994 · 2017-05-21T11:38:16Z

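Hello,
Thank you very much for clearing that up! Your answer helped me a lot.
I'd also like to ask whether ID features need to be encoded. It seems the data would become very large after encoding, and as far as I can tell the IDs serve no purpose other than linking to information in other fields. I'd like to hear your opinion.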
