tianchi_OGeek

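Search products include a scenario called real-time search (Instant Search), in which query results are returned in real time as the user types. This competition task comes from a sub-scenario of OPPO's mobile search ranking optimization, simplified accordingly, and aims to solve query-title semantic matching. After simplification, the task is essentially CTR prediction for query-title pairs in a real-time search scenario.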

0 Scores

(1) Leaderboard A: 0.7347
(2) Leaderboard B: 0.7335
(3) Competition page: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.11409106.5678.1.2c547b6fmKviKy&raceId=231688
(4) Data download: https://pan.baidu.com/s/1NPUWzt7usUniogCJosWnzw (extraction code: 69xr)

1 Shared baselines

(1) Tianchi OGeek Algorithm Challenge baseline (0.7016): https://zhuanlan.zhihu.com/p/46482521
(2) OGeek Algorithm Challenge code sharing: https://zhuanlan.zhihu.com/p/46479794
(3) GrinAndBear/OGeek: https://github.com/GrinAndBear/OGeek
(4) flytoylf/OGeek, code with an LGB model and an RNN model: https://github.com/flytoylf/OGeek
(5) https://github.com/search?q=OGeek
(6) https://github.com/search?q=tianchi_oppo
(7) https://github.com/luoling1993/TianChi_OGeek/stargazers

2 CTR references

(1) Recommender systems meet deep learning: https://github.com/princewen/tensorflow_practice
(2) Designing the f(x) used for CTR ranking in recommender systems, DNN part: https://github.com/nzc/dnn_ctr
(3) CTR prediction algorithms FM, FFM, and DeepFM in practice: https://github.com/milkboylyf/CTR_Prediction
(4) MLR algorithm: https://wenku.baidu.com/view/b0e8976f2b160b4e767fcfdc.html

3 NLP references

(1) Solving large-scale text classification with deep learning (CNN, RNN, Attention), survey and practice: https://zhuanlan.zhihu.com/p/25928551
(2) Winning the Zhihu "Kanshan Cup": https://zhuanlan.zhihu.com/p/28923961
(3) 2017 Zhihu Kanshan Cup, from beginner to second place: https://zhuanlan.zhihu.com/p/29020616
(4) liuhuanyong: https://github.com/liuhuanyong
(5) Chinese Word Vectors: https://github.com/Embedding/Chinese-Word-Vectors Note: this link collects corpora.

4 Write-ups from other competitions

(1) ML theory & practice: https://zhuanlan.zhihu.com/c_152307828?tdsourcetag=s_pctim_aiomsg

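5 Unorganized ideas

(1) Main line: the CTR approach, built around user click-through rates (as in the open-source baselines: single-field click rate, combined-field click rate, and so on). FM and FFM models; see the Tencent Social Ads competition?
(2) Text matching approach (Kaggle Quora). Traditional features: extract text similarity features that quantify the distance between fields. https://www.kaggle.com/c/quora-question-pairs https://github.com/qqgeogor/kaggle-quora-solution-8th https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question
(3) Deep learning models (1D CNN, ESIM, Decomposable Attention, ELMo, etc.): https://www.kaggle.com/rethfro/1d-cnn-single-model-score-0-14-0-16-or-0-23/notebook https://www.kaggle.com/lamdang/dl-models/comments More text matching models are listed on the Stanford SNLI page: https://nlp.stanford.edu/projects/snli/
(4) Text classification approach: the main question is how to organize the input text. Also, how should the query_prediction weights be taken into account? Traditional features: tfidf, bow, ngram+tfidf, sent2vec, lsi, lda, etc.
(5) Deep learning models: see the Zhihu Kanshan Cup and the Kaggle Toxic competition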

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557
https://www.kaggle.com/larryfreeman/toxic-comments-code-for-alexander-s-9872-model/comments
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52702

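(6) Stacking is not viable (limit on the number of models); is simple blending of NN + LightGBM the more reliable option?
(7) PS1: word vectors can be trained with word2vec or taken from public word vector data: https://github.com/Embedding/Chinese-Word-Vectors PS2: word segmentation needs a custom dictionary; segmentation quality matters a lot for model training!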
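
As a rough illustration of idea (1), the sketch below computes smoothed click rates for a single field and for a field combination. It is only a minimal sketch, not the code from this repository: the field names `prefix`, `title`, `tag`, and `label`, the helper names, and the smoothing priors are all assumptions made for the example.

```python
from collections import defaultdict


def fit_click_rates(rows, fields, prior_clicks=1.0, prior_views=2.0):
    """Return {key_tuple: smoothed click rate} for one field combination.

    The priors are hypothetical; they pull rarely seen keys toward 0.5.
    """
    clicks = defaultdict(float)
    views = defaultdict(float)
    for row in rows:
        key = tuple(row[f] for f in fields)
        views[key] += 1.0
        clicks[key] += row["label"]
    return {k: (clicks[k] + prior_clicks) / (views[k] + prior_views) for k in views}


def click_rate_feature(row, fields, table, default=0.5):
    """Look up the smoothed click rate for a row; unseen keys get the default."""
    return table.get(tuple(row[f] for f in fields), default)


# Toy rows with assumed column names; the real data layout may differ.
train = [
    {"prefix": "app", "title": "app store", "tag": "app", "label": 1},
    {"prefix": "app", "title": "apple", "tag": "baike", "label": 0},
    {"prefix": "app", "title": "app store", "tag": "app", "label": 1},
    {"prefix": "wea", "title": "weather", "tag": "weather", "label": 0},
]

prefix_ctr = fit_click_rates(train, ["prefix"])                  # single field
prefix_title_ctr = fit_click_rates(train, ["prefix", "title"])   # combined fields

new_row = {"prefix": "app", "title": "app store", "tag": "app"}
features = [
    click_rate_feature(new_row, ["prefix"], prefix_ctr),
    click_rate_feature(new_row, ["prefix", "title"], prefix_title_ctr),
]
```

Computing these statistics on the same rows that are later used for training leaks the label, so in practice the tables would probably be fitted out-of-fold or on an earlier slice of the data.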

6 Basic considerations

(1) Which classifiers generalize well -> logistic regression; support vector machine; linear regression
(2) How to construct text features -> NLP analysis
(3) How to handle feature sparsity -> deep-fm
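
Point (3) can be made a little more concrete with the factorization machine (FM) part that DeepFM combines with a deep network. The sketch below scores one sparse sample with the second-order FM term, computed with the usual linear-time identity and checked against the explicit pairwise sum. The parameter values are random placeholders and the function names are made up for this example; it does not include the deep component or any training loop.

```python
import math
import random


def fm_score(x, w0, w, v):
    """FM score for a sparse input x given as {feature_index: value}.

    Pairwise term: 0.5 * sum_f [ (sum_i v[i][f] x_i)^2 - sum_i (v[i][f] x_i)^2 ].
    """
    linear = w0 + sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_score_naive(x, w0, w, v):
    """Same score via the explicit double loop over feature pairs (for checking)."""
    items = sorted(x.items())
    score = w0 + sum(w[i] * xi for i, xi in items)
    for a, (i, xi) in enumerate(items):
        for j, xj in items[a + 1:]:
            score += sum(p * q for p, q in zip(v[i], v[j])) * xi * xj
    return score


random.seed(0)
n_features, k = 10, 4  # hypothetical vocabulary size and latent dimension
w0 = 0.1
w = [random.uniform(-0.1, 0.1) for _ in range(n_features)]
v = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_features)]

# One-hot style sample: only three of the ten features are active.
x = {0: 1.0, 3: 1.0, 7: 1.0}
score = fm_score(x, w0, w, v)
ctr = 1.0 / (1.0 + math.exp(-score))  # squash the score into a click probability
```

Because every feature gets its own latent vector, a pair of features that never co-occurred in training still receives an interaction weight through the dot product of the two vectors, which is why this family of models is a common choice for sparse CTR data.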
