fastText has an upper limit on the number of labels #44

yuyoyth · 2022-05-18T01:19:23Z

This is the output printed when reading the train file:

Number of words:  2148
Number of labels: 185898
Max threshold count: 2
Number of wordHash2Id: 250728

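As shown, only 185898 labels were read, while the train file I supplied contains more than 1,300,000 labels. To rule out a data problem, I split the original train file into 9 files in chunks of 150,000 and read them one by one. Each file reported its label count correctly, so a problem in the data file itself can largely be ruled out.
Does fastText deliberately set this limit, or is there a cap on how much of a file can be read? The original train file is 480 MB; the largest file after splitting is 52 MB.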
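
One way to cross-check the reported number is to count labels independently of fastText. The sketch below counts distinct `__label__` tokens in a train file and how many of them occur at least a given number of times. It assumes whitespace-separated tokens and the `__label__` prefix shown in this thread; the class and method names are hypothetical. Whether the "Max threshold count: 2" line means that entries seen fewer than twice were dropped is only a guess and is not confirmed here, but comparing the two counts against 185898 would test it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper, not part of fastText: counts labels in a train file.
final class LabelStats {
    static final String LABEL_PREFIX = "__label__";

    /** Counts how often each distinct label token occurs in the given lines. */
    static Map<String, Integer> countLabels(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String token : line.trim().split("\\s+")) {
                if (token.startsWith(LABEL_PREFIX)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    /** Number of distinct labels that occur at least minCount times. */
    static long labelsWithAtLeast(Map<String, Integer> counts, int minCount) {
        return counts.values().stream().filter(c -> c >= minCount).count();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts;
        if (args.length > 0) {
            // Stream the file line by line so a 480 MB file is not loaded at once.
            try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
                counts = countLabels(lines);
            }
        } else {
            // Tiny in-memory sample so the sketch runs without a file.
            counts = countLabels(Stream.of("__label__a x y", "__label__b z", "__label__a w"));
        }
        System.out.println("distinct labels: " + counts.size());
        System.out.println("labels seen >= 2 times: " + labelsWithAtLeast(counts, 2));
    }
}
```

Running it with the path of the full train file as the only argument prints both counts. If the second number is close to 185898, the gap would point to frequency-based filtering and not to a file-size limit.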

yuyoyth · 2022-05-18T01:20:56Z

The training code is:

InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.ns);
inputArgs.setThread(15);
inputArgs.setEpoch(100);
inputArgs.setLr(0.5);
inputArgs.setDim(100);
FastText model = FastText.trainSupervised(trainFile, inputArgs);

jimichan · 2022-05-18T01:23:41Z

Are you sure this is a classification problem? That is a very large number of labels.

yuyoyth · 2022-05-18T01:26:43Z

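I want to map fuzzy text to a unique id, so that a match is still found as far as possible even when characters are missing or extra. For this I built a dedicated encoding for Chinese characters, hoping that similar characters can be matched as well.
Here is one line of the train file for reference: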

__label__00004e937c254cef906f24ae819ed540   78542508029 AE010320006 GG032906029 FC42168327G 4D012106046 F7022402279 AE010320006 F0012304046 K702C430145 FD442777327 FJ542102273 G401127754A 5A02137120C GE04184781F F803134117C FJ51130127C 3G041342107 6C018717144 E0042101002 5E031271128 7 2 9A042600275

jimichan · 2022-05-18T01:32:43Z

For this you should use word vectors or something like simhash, not text classification.
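
As a rough illustration of the simhash idea, the sketch below computes a 64-bit fingerprint over whitespace-separated tokens and compares two fingerprints by Hamming distance. It is a minimal sketch under assumptions: FNV-1a is used as the per-token hash purely for self-containment, all tokens carry equal weight, and the class and method names are hypothetical. Texts that share most of their tokens tend to produce fingerprints that differ in few bits, so a missing or extra token moves the fingerprint only slightly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of SimHash; not taken from any library in this thread.
final class SimHashSketch {
    /** 64-bit FNV-1a hash of the string's UTF-8 bytes (assumed token hash). */
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV prime
        }
        return h;
    }

    /** SimHash fingerprint: each token votes +1 or -1 on every bit position. */
    static long simHash(String text) {
        int[] votes = new int[64];
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    /** Number of bit positions in which the two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // Tokens borrowed from the sample train line; the second text drops one.
        String full = "78542508029 AE010320006 GG032906029 FC42168327G 4D012106046";
        String missingOne = "78542508029 AE010320006 GG032906029 4D012106046";
        System.out.println("hamming distance: "
                + hammingDistance(simHash(full), simHash(missingOne)));
    }
}
```

A lookup would then store one fingerprint per id and return the ids whose fingerprints fall within a small Hamming distance of the query. The distance cutoff has to be tuned on real data.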

yuyoyth · 2022-05-18T01:35:31Z

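> For this you should use word vectors or something like simhash, not text classification.

Thanks for the suggestion. I will try a different approach.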
