FastText cc.zh.300.vec: trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
Tencent AILab Chinese Embedding: this corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale high-quality data.
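UD_Chinese-GSDSimp: the Simplified Chinese version of UD_Chinese-GSD, produced by conversion and correction.
CLUENER2020: the CLUENER2020 dataset takes a subset of THUCTC, the text classification dataset open-sourced by Tsinghua University, and annotates it with fine-grained named entities. The original data comes from Sina News RSS. The data has 10 label categories; the training set contains 10,748 examples and the validation set contains 1,343. Google download link; a copy is also included in this project.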
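Before initializing a spaCy model from a vectors file, it can help to check what the file actually contains. The sketch below assumes the plain-text word2vec layout that fastText `.vec` files use: a header line with the vocabulary size and the dimension, then one entry per line. That the Tencent release follows the same layout is an assumption here, and `read_vec` is a hypothetical helper, not part of this project.

```python
import io


def read_vec(stream, limit=None):
    """Parse word2vec-style text vectors from an iterable of lines.

    Returns (declared_count, dimension, {word: [floats]}).
    """
    lines = iter(stream)
    # Header line: "<vocabulary size> <dimension>"
    count, dim = (int(x) for x in next(lines).split()[:2])
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip malformed or empty lines
        # The last `dim` fields are the values; anything before them is the
        # entry itself, which may contain spaces if it is a phrase.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return count, dim, vectors


# Tiny in-memory sample standing in for a real vectors file.
sample = io.StringIO("2 3\n你好 0.1 0.2 0.3\n世界 1.0 2.0 3.0 \n")
count, dim, vectors = read_vec(sample)
print(count, dim, sorted(vectors))
```

For a real file, something like `read_vec(gzip.open("cc.zh.300.vec.gz", "rt", encoding="utf-8"), limit=5)` would peek at the first few entries without loading everything into memory.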
-
convert to spacy
- Initialize a spaCy model with the fastText or Tencent AILab Chinese Embedding vectors
python -m spacy init-model zh ./zh_vectors_init -v cc.zh.300.vec.gz
or
python -m spacy init-model zh ./zh_vectors_init -v Tencent_AILab_ChineseEmbedding.tar.gz
- Convert the UD treebank format
python -m spacy convert UD_Chinese-GSDSimp-master\zh_gsdsimp-ud-train.conllu ./ -t jsonl
python -m spacy convert UD_Chinese-GSDSimp-master\zh_gsdsimp-ud-dev.conllu ./ -t jsonl
- Convert the CLUE NER annotation format
python scripts/convert2spacy.py
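The NER conversion step can be pictured roughly as follows. This is only a sketch and not the contents of `scripts/convert2spacy.py`. It assumes the record layout of the public CLUENER2020 release (one JSON object per line, with a `label` field mapping each category to its mentions and their inclusive `[start, end]` character spans) and targets spaCy v2's simple training format, whose end offsets are exclusive. The function name `cluener_to_spacy` is hypothetical.

```python
import json


def cluener_to_spacy(line):
    """Turn one CLUENER-style JSON line into (text, {"entities": [...]})."""
    record = json.loads(line)
    entities = []
    for label, mentions in record.get("label", {}).items():
        for spans in mentions.values():
            for start, end in spans:
                # Assumed: CLUENER end offsets are inclusive, while spaCy
                # expects exclusive ends, hence the +1.
                entities.append((start, end + 1, label))
    entities.sort()
    return record["text"], {"entities": entities}


# Shortened illustrative record in the assumed layout.
sample = (
    '{"text": "浙商银行企业信贷部叶老桂博士", '
    '"label": {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}'
)
text, annotations = cluener_to_spacy(sample)
for start, end, label in annotations["entities"]:
    print(label, text[start:end])
```

Slicing the text with the converted offsets, as the loop does, is a quick way to confirm that the inclusive-to-exclusive adjustment matches the data at hand.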
-
train
python -m spacy train zh ./zh_vectors_web_ud_lg zh_gsdsimp-ud-train.json zh_gsdsimp-ud-dev.json --base-model ./zh_vectors_init
python scripts/train_ner.py
-
Note for Windows users: to train on a GPU with spaCy 2.2.3, thinc must be upgraded to 7.4.0.
pip install -U thinc
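To confirm that the upgrade took effect, the installed thinc version can be compared against 7.4.0. The helpers below are a hypothetical sketch: they compare only the leading numeric components and ignore pre-release suffixes, and the lookup uses `importlib.metadata`, which needs Python 3.8 or later.

```python
import re


def version_tuple(version):
    """Leading numeric components of a version string, padded to three."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, required="7.4.0"):
    """True if `installed` is at least `required` (suffixes ignored)."""
    return version_tuple(installed) >= version_tuple(required)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("thinc")
    except PackageNotFoundError:
        print("thinc is not installed")
    else:
        status = "ok" if meets_minimum(installed) else "needs upgrade"
        print("thinc", installed, status)
```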
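For reasons that younger readers may not be aware of, the pretrained models sometimes fail to download, so a download tool that supports resuming interrupted transfers is recommended.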
bert-base-chinese config, bert-base-chinese model bin, bert-base-chinese vocab
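If no dedicated download tool is at hand, resuming can be approximated with an HTTP `Range` request from the standard library. The sketch below is an illustration under assumptions: it only works when the server supports range requests, and `range_header` and `resume_download` are hypothetical helpers rather than part of this project.

```python
import os
import urllib.error
import urllib.request


def range_header(path):
    """Build a Range header asking for the bytes that `path` still lacks."""
    done = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": "bytes=%d-" % done} if done else {}


def resume_download(url, path, chunk_size=1 << 20):
    """Download `url` to `path`, continuing a partial file when possible."""
    request = urllib.request.Request(url, headers=range_header(path))
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as err:
        if err.code == 416:
            # Range not satisfiable: the local file most likely already
            # holds everything the server has.
            return
        raise
    with response:
        # 206 means the server honoured the range, so append to the partial
        # file; any other success status sends the whole body, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling `resume_download(url, "pytorch_model.bin")` again after an interruption picks up from the bytes already on disk, provided the server answers with status 206.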
To get only the PyTorch model download URLs without going through a proxy, use the script below; for the full list of models, see https://huggingface.co/models
python ./script/get_transformers_models_url.py bert-base-chinese -mk -local
⚠ ./trf_models/bert-base-chinese already exists
⚠ ================url================
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
⚠ ================url================
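✔ After downloading the files with a download tool, put the model files in the cache folder.
ValueError: no configuration for 't5-3b' was found in the local classes; try again without -local.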
Remove the bert-base-chinese- prefix from the name of every downloaded model file.
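The renaming can be done by hand or with a few lines of Python. The following is a minimal sketch; `strip_prefix` is a hypothetical helper, and the folder passed to it is whichever directory the downloaded files were placed in.

```python
from pathlib import Path


def strip_prefix(folder, prefix="bert-base-chinese-"):
    """Rename files in `folder` by dropping `prefix` from their names."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            target = path.with_name(path.name[len(prefix):])
            if target.exists():
                continue  # never overwrite a file that is already there
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

For example, `strip_prefix("./trf_models/bert-base-chinese")` would turn `bert-base-chinese-config.json` into `config.json`, assuming that folder (the one named in the script output) is where the files were saved.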
python ./spacy-transformers/init_model.py
ℹ Creating model for 'bert-base-chinese' (zh)
✔ Initialized the model pipeline
✔ Saved 'bert-base-chinese' (zh)
Pipeline: ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
Location: ./spacy_trf_zh
✔ Model loads!
python -m spacy train zh ./zh_bert_ud zh_gsdsimp-ud-train.json zh_gsdsimp-ud-dev.json --base-model ./spacy_trf_zh
- Add the Tencent AI Lab Embedding download link
- Train on the MSRA and OntoNotes 5 corpora
- spacy-transformers zh model
License: CC BY-SA 4.0