Winning Solution for Zalo AI Challenge 2022 - E2E Question Answering

Overview

The pipeline consists of the following main steps:

Split the Wikipedia dump into sliding windows of size 256.
Retrieve candidate contexts with BM25 (Recall@200 ~ 0.95).
Rerank the top 200 candidate contexts with a BERT sentence-pair model.
Extract candidate answers from the contexts, then select the final result by majority vote plus community detection with Louvain.
Retrieve the top 100 candidate articles for the answer with BM25, then rerank them with a second BERT sentence-pair model to pick the final article.
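
The first step above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `sliding_windows`, the use of whitespace-level tokens as the unit, and the stride of 128 are all assumptions, since the README only states a window size of 256.

```python
from typing import List


def sliding_windows(words: List[str], size: int = 256, stride: int = 128) -> List[List[str]]:
    """Split a token list into overlapping windows of at most `size` tokens.

    `stride` (how far each window advances) is a hypothetical parameter;
    the README does not say how much consecutive windows overlap.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
        start += stride
    return windows


# Toy article of 600 placeholder tokens.
tokens = [f"w{i}" for i in range(600)]
windows = sliding_windows(tokens)
print(len(windows), [len(w) for w in windows])
```

Overlapping windows reduce the chance that an answer span is cut in half at a window boundary, at the cost of a larger retrieval index.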

Requirements

transformers==4.24.0
git+https://github.com/witiko/gensim.git@feature/bm25

Inference example

Download the pretrained models and the required data from: link, then extract them into the ./data/ directory.

See the example notebook (example.ipynb).

question = "Công ty mẹ của Zalo là gì"

Retrieve the top 200 contexts with BM25

query = preprocess(question).lower()
top_n, bm25_scores = bm25_model_stage1.get_topk_stage1(query, topk=200)
titles = [preprocess(df_wiki_windows.title.values[i]) for i in top_n]
texts = [preprocess(df_wiki_windows.text.values[i]) for i in top_n]
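
To make the retrieval step more concrete, here is a small self-contained sketch of Okapi BM25 scoring in plain Python. It is not the code behind `get_topk_stage1` (the repository relies on a gensim branch listed under Requirements); the function names `bm25_scores` and `top_k`, the parameters `k1=1.5` and `b=0.75`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring documents, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy corpus (made up); queries are lowercased as in the snippet above.
docs = [
    "zalo is a messaging app developed by vng".split(),
    "hanoi is the capital of vietnam".split(),
    "vng is a technology company in vietnam".split(),
]
query = "zalo parent company".split()
scores = bm25_scores(query, docs)
print(top_k(scores, k=2))
```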

Rerank with the BERT sentence-pair model

question = preprocess(question)
ranking_preds = pairwise_model_stage1.stage1_ranking(question, texts)
ranking_scores = ranking_preds * bm25_scores
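
The last line combines the two signals by element-wise multiplication. The sketch below shows the effect on made-up numbers, using plain lists instead of arrays; `fuse_scores` is a hypothetical helper name and the values are not taken from the real models.

```python
from typing import List


def fuse_scores(ranking_preds: List[float], bm25_scores: List[float]) -> List[float]:
    """Multiply each reranker score by the BM25 score of the same candidate."""
    if len(ranking_preds) != len(bm25_scores):
        raise ValueError("both score lists must have the same length")
    return [p * s for p, s in zip(ranking_preds, bm25_scores)]


# Made-up scores for three candidate contexts.
bm25 = [12.0, 9.0, 7.5]    # candidate 0 is the best lexical match
preds = [0.10, 0.90, 0.40]  # candidate 1 is preferred by the reranker
fused = fuse_scores(preds, bm25)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # index of the candidate with the highest fused score
```

With these numbers the reranker's preference outweighs the lexical match, so a context that BM25 ranked second ends up first after fusion.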

Find the best answer with the QA model

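import numpy as np

# Keep the 10 contexts with the highest combined scores
best_idxs = np.argsort(ranking_scores)[-10:]
ranking_scores = np.array(ranking_scores)[best_idxs]
texts = np.array(texts)[best_idxs]
# Extract an answer from each context and aggregate them into one best answer
best_answer = qa_model(question, texts, ranking_scores)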
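
The Overview says the final answer is chosen by majority vote plus Louvain community detection, but the internals of `qa_model` are not shown here. As a rough illustration of the voting part only, the sketch below groups candidate answers that are identical after lowercasing and whitespace normalization and picks the group with the largest total score. The helper name `weighted_vote`, the normalization rule, and the sample data are assumptions; the Louvain step, which would presumably group similar but non-identical answers, is not reproduced.

```python
from collections import defaultdict
from typing import List


def weighted_vote(answers: List[str], scores: List[float]) -> str:
    """Pick the answer whose normalized form has the highest total score."""
    totals = defaultdict(float)
    surface = {}  # first surface form seen for each normalized answer
    for ans, score in zip(answers, scores):
        key = " ".join(ans.lower().split())
        totals[key] += score
        surface.setdefault(key, ans)
    best_key = max(totals, key=totals.get)
    return surface[best_key]


# Made-up candidate answers extracted from four contexts.
answers = ["VNG", "vng", "Tencent", "VNG "]
scores = [0.5, 0.3, 0.7, 0.1]
print(weighted_vote(answers, scores))
```

With equal scores for every candidate this reduces to a plain majority vote.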

Entity mapping to find the final answer

bm25_answer = preprocess(str(best_answer).lower(), max_length=128, remove_puncts=True)
bm25_question = preprocess(str(question).lower(), max_length=128, remove_puncts=True)
candidates, scores = bm25_model_stage2_title.get_topk_stage2(bm25_answer, raw_answer=best_answer)
titles = [df_wiki.title.values[i] for i in candidates]
texts = [df_wiki.text.values[i] for i in candidates]
ranking_preds = pairwise_model_stage2.stage2_ranking(question, best_answer, titles, texts)
final_answer = titles[ranking_preds.argmax()]

Telegram-Zalo/zac2022-e2e-qa
