feat: support model download from modelscope #399

Open · wants to merge 3 commits into base: main
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,4 +1,5 @@
 ## 0.8.1
+* feat: add support for downloading models from modelscope
 * fix: fix list index out of range error caused by calling LayoutElements.from_list() with empty list
 
 ## 0.8.0
3 changes: 3 additions & 0 deletions README.md
@@ -61,6 +61,9 @@ The inference pipeline operates by finding text elements in a document page usin

 We offer several detection models including [Detectron2](https://github.com/facebookresearch/detectron2) and [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX).
 
+> [!NOTE]
+> By default, `unstructured_inference` downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://modelscope.cn/), set the environment variable `UNSTRUCTURED_USE_MODELSCOPE=true` before initialization.
+
 ### Using a non-default model
 
 When doing inference, an alternate model can be used by passing the model object to the ingestion method via the `model` parameter. The `get_model` function can be used to construct one of our out-of-the-box models from a keyword, e.g.:
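As a usage illustration of the new README note: the opt-in is just an environment variable that must be set before the first model load. A minimal sketch using only the standard library (the flag name comes from this PR; everything else is ordinary `os.environ` handling):

```python
import os

# Opt in to ModelScope downloads. The flag is read when a model (or the
# table agent) is first initialized, so it must be set before that point.
os.environ["UNSTRUCTURED_USE_MODELSCOPE"] = "true"

print(os.environ.get("UNSTRUCTURED_USE_MODELSCOPE", "false"))  # prints "true"
```

Setting it in the shell (`export UNSTRUCTURED_USE_MODELSCOPE=true`) works equally well; note that the diff compares against the exact string `"true"`, so any other value leaves the HuggingFace path active.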
10 changes: 9 additions & 1 deletion unstructured_inference/models/tables.py
@@ -1,5 +1,6 @@
 # https://github.com/microsoft/table-transformer/blob/main/src/inference.py
 # https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Table%20Transformer/Using_Table_Transformer_for_table_detection_and_table_structure_recognition.ipynb
+import os
 import xml.etree.ElementTree as ET
 from collections import defaultdict
 from pathlib import Path
@@ -139,7 +140,14 @@ def load_agent():

     if not hasattr(tables_agent, "model"):
         logger.info("Loading the Table agent ...")
-        tables_agent.initialize("microsoft/table-transformer-structure-recognition")
+        if os.environ.get("UNSTRUCTURED_USE_MODELSCOPE", "false") == "true":
+            from modelscope import snapshot_download
+            model_dir = snapshot_download(
+                "AI-ModelScope/table-transformer-structure-recognition-v1.1-all"
+            )
+            tables_agent.initialize(model_dir)
+        else:
+            tables_agent.initialize("microsoft/table-transformer-structure-recognition")
 
     return

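The branch added to `load_agent()` reduces to choosing a checkpoint by flag. A sketch of just that selection (the helper name is hypothetical; the checkpoint IDs are the ones in the diff; the real code additionally resolves the ModelScope ID to a local directory via `snapshot_download` before initializing):

```python
import os

def pick_table_checkpoint() -> str:
    """Choose the table-structure checkpoint the way load_agent() does.
    Hypothetical helper for illustration; not part of the library."""
    if os.environ.get("UNSTRUCTURED_USE_MODELSCOPE", "false") == "true":
        # ModelScope mirror used by this PR
        return "AI-ModelScope/table-transformer-structure-recognition-v1.1-all"
    # HuggingFace default
    return "microsoft/table-transformer-structure-recognition"

os.environ.pop("UNSTRUCTURED_USE_MODELSCOPE", None)
print(pick_table_checkpoint())  # prints "microsoft/table-transformer-structure-recognition"
```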
10 changes: 9 additions & 1 deletion unstructured_inference/utils.py
@@ -112,4 +112,12 @@ def download_if_needed_and_get_local_path(path_or_repo: str, filename: str, **kw
     if os.path.exists(full_path):
         return full_path
     else:
-        return hf_hub_download(path_or_repo, filename, **kwargs)
+        if os.environ.get("UNSTRUCTURED_USE_MODELSCOPE", "false") == "true":
+            from modelscope import snapshot_download
+            path_or_repo = path_or_repo.replace(
+                "unstructuredio/", "AI-ModelScope/")
+            model_dir = snapshot_download(
+                path_or_repo, allow_patterns=filename)
+            return os.path.join(model_dir, filename)
+        else:
+            return hf_hub_download(path_or_repo, filename, **kwargs)
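The only repo-name translation the fallback performs is the org substitution `unstructuredio/` → `AI-ModelScope/`. Isolated as a sketch (the helper name and example repo ID are hypothetical, for illustration only):

```python
import os

def resolve_repo(path_or_repo: str) -> str:
    """Map a HuggingFace repo ID to its ModelScope mirror when the
    opt-in flag is set. Hypothetical helper mirroring utils.py above."""
    if os.environ.get("UNSTRUCTURED_USE_MODELSCOPE", "false") == "true":
        return path_or_repo.replace("unstructuredio/", "AI-ModelScope/")
    return path_or_repo

os.environ["UNSTRUCTURED_USE_MODELSCOPE"] = "true"
print(resolve_repo("unstructuredio/some-model"))  # prints "AI-ModelScope/some-model"
```

Only IDs under the `unstructuredio/` org are remapped; any other ID passes through unchanged even with the flag set, and the Microsoft table model is handled separately in `tables.py`.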