
Comparing changes

base repository: dataelement/bisheng-unstructured
base: v0.0.3.7
head repository: dataelement/bisheng-unstructured
compare: main

Commits on Aug 6, 2024

  1. add xls

    yaojin3616 committed Aug 6, 2024

    Verified

    0a319ec
  2. update default rt mode

    yaojin3616 committed Aug 6, 2024
    1dc2b83

Commits on Aug 16, 2024

  1. feat: change the environment variable keys

    zgqgit committed Aug 16, 2024
    db6546b

Commits on Aug 20, 2024

  1. scan

    add scan
    刘志硕 committed Aug 20, 2024
    ff0ba9b
  2. ci: support dockerhub and cr.dataelem.com

    zgqgit committed Aug 20, 2024
    4e52cd3
  3. scan

    scan
    刘志硕 committed Aug 20, 2024
    4ad29e2
  4. ci: support multi os

    zgqgit committed Aug 20, 2024
    b800eee

Commits on Aug 21, 2024

  1. ci: bug fix

    zgqgit committed Aug 21, 2024
    833bb94
  2. ci: bug fix

    zgqgit committed Aug 21, 2024
    ad8283f

Commits on Aug 23, 2024

  1. ci: bug fix

    zgqgit committed Aug 23, 2024
    cc38ed1
  2. ci: bug fix

    zgqgit committed Aug 23, 2024
    f21d453

Commits on Aug 26, 2024

  1. feat: support topdf mode even without models

    zgqgit committed Aug 26, 2024
    54cf0c0
  2. Merge pull request #19 from dataelement/feat/v0.0.3.9

    Feat/v0.0.3.9
    zgqgit authored Aug 26, 2024
    6b6271d
  3. scan

    scan
    刘志硕 committed Aug 26, 2024
    b051819

Commits on Aug 27, 2024

  1. up

    up
    刘志硕 committed Aug 27, 2024
    0660315
  2. up

    up
    刘志硕 committed Aug 27, 2024
    d3d87a9

Commits on Aug 28, 2024

  1. up

    up
    刘志硕 committed Aug 28, 2024
    d8faaa1

Commits on Aug 30, 2024

  1. Merge pull request #20 from dataelement/feat/config_scan

    Feat/config scan
    zgqgit authored Aug 30, 2024
    8828f98
  2. support model agent for idp sdk

    hrfng committed Aug 30, 2024
    494ffa1
  3. Merge pull request #22 from dataelement/feat/add_model_agent_for_idp

    support model agent for idp sdk
    zgqgit authored Aug 30, 2024
    8886566
  4. Merge pull request #23 from dataelement/feat/v0.0.3.9

    Feat/v0.0.3.9
    zgqgit authored Aug 30, 2024
    d7b95a9

Commits on Sep 1, 2024

  1. update sdk

    yaojin3616 committed Sep 1, 2024
    aafe4be

Commits on Sep 2, 2024

  1. support sdk

    yaojin3616 committed Sep 2, 2024
    fc58910
  2. add requirements

    yaojin3616 committed Sep 2, 2024
    f753ce6
  3. debug log

    yaojin3616 committed Sep 2, 2024
    0723110
  4. update server_type enum

    yaojin3616 committed Sep 2, 2024
    df515c3
  5. update server_type enum

    yaojin3616 committed Sep 2, 2024
    1f15dcb
  6. add log

    yaojin3616 committed Sep 2, 2024
    f839e98
  7. thread error

    yaojin3616 committed Sep 2, 2024
    3020875
  8. Merge branch 'feat/v0.0.3.9' into feat/add-sdk-support

    yaojin3616 authored Sep 2, 2024
    eff67c2
  9. Merge pull request #24 from dataelement/feat/add-sdk-support

    Feat/add sdk support
    yaojin3616 authored Sep 2, 2024
    c400e8d
  10. Merge pull request #25 from dataelement/feat/v0.0.3.9

    Feat/v0.0.3.9
    zgqgit authored Sep 2, 2024
    88930ee

Commits on Sep 3, 2024

  1. feat: add is scan

    is scan
    刘志硕 committed Sep 3, 2024
    4e7252b

Commits on Sep 4, 2024

  1. Merge pull request #26 from dataelement/main

    sync main
    zgqgit authored Sep 4, 2024
    707aad7
  2. Merge pull request #27 from dataelement/feat/v0.0.3.10

    feat: add is scan
    zgqgit authored Sep 4, 2024
    1f69d2d

Commits on Sep 10, 2024

  1. fix: fix the client error under multithreading

    zgqgit committed Sep 10, 2024
    48e06de
  2. Merge pull request #28 from dataelement/feat/v0.0.3.10

    fix: fix the client error under multithreading
    zgqgit authored Sep 10, 2024
    c4dc384
  3. fix: fix the client error under multithreading

    zgqgit committed Sep 10, 2024
    213459a
  4. Merge pull request #29 from dataelement/feat/v0.0.3.10

    fix: fix the client error under multithreading
    zgqgit authored Sep 10, 2024
    367d375
  5. fix: fix the incorrect page count

    zgqgit committed Sep 10, 2024
    935c6ea
  6. Merge pull request #30 from dataelement/feat/v0.0.3.10

    fix: fix the incorrect page count
    zgqgit authored Sep 10, 2024
    b4858b4

Commits on Sep 11, 2024

  1. fix: fix the incorrect order of parsing results

    zgqgit committed Sep 11, 2024
    69d2d16
  2. Merge pull request #31 from dataelement/feat/v0.0.3.10

    fix: fix the incorrect order of parsing results
    zgqgit authored Sep 11, 2024
    fbd3342
  3. fix: git pull error

    zgqgit committed Sep 11, 2024
    5a24968
  4. Merge pull request #32 from dataelement/feat/v0.0.3.10

    fix: git pull error
    zgqgit authored Sep 11, 2024
    829336a

Commits on Sep 12, 2024

  1. fix: add a config option to the uns service for whether to run everything through OCR

    zgqgit committed Sep 12, 2024
    0039839
  2. Merge pull request #33 from dataelement/feat/v0.0.3.10

    fix: add a config option to the uns service for whether to run everything through OCR
    zgqgit authored Sep 12, 2024
    7f39b5a
  3. support rowcol cell table route

    hrfng committed Sep 12, 2024
    a938ae7
  4. disable debug log

    hrfng committed Sep 12, 2024
    5cee718
  5. Merge pull request #35 from dataelement/feat/support_multi_class_tab

    Feat/support multi class tab
    zgqgit authored Sep 12, 2024
    5cd089b
Showing with 1,054 additions and 334 deletions.
  1. +200 −21 .drone.yml
  2. 0 .github/workflows/{image-pub.yml → image-pub.yml.bak}
  3. 0 .github/workflows/{release.yml → release.yml.bak}
  4. +1 −0 .gitignore
  5. +1 −1 config/config.yaml
  6. +8 −1 docker/Dockerfile
  7. +27 −0 docker/Dockerfile-arm
  8. +8 −0 docker/entrypoint-arm.sh
  9. BIN examples/docs/table_test_001.jpg
  10. +5 −2 requirements.txt
  11. +5 −7 src/bisheng_unstructured/api/main.py
  12. +28 −30 src/bisheng_unstructured/api/pipeline.py
  13. +1 −0 src/bisheng_unstructured/api/types.py
  14. +2 −1 src/bisheng_unstructured/config/config.yaml
  15. +3 −2 src/bisheng_unstructured/config/settings.py
  16. +41 −13 src/bisheng_unstructured/documents/markdown.py
  17. +92 −10 src/bisheng_unstructured/documents/pdf_parser/idp/pdf.py
  18. +27 −16 src/bisheng_unstructured/documents/pdf_parser/image.py
  19. +178 −82 src/bisheng_unstructured/documents/pdf_parser/pdf.py
  20. +3 −3 src/bisheng_unstructured/documents/pdf_parser/test_pdf.py
  21. +42 −33 src/bisheng_unstructured/models/idp/dummy_ocr_agent.py
  22. +14 −11 src/bisheng_unstructured/models/idp/layout_agent.py
  23. +19 −32 src/bisheng_unstructured/models/idp/ocr_agent.py
  24. +55 −29 src/bisheng_unstructured/models/idp/table_agent.py
  25. +25 −1 src/bisheng_unstructured/models/layout_agent.py
  26. +16 −18 src/bisheng_unstructured/models/ocr_agent.py
  27. +8 −0 src/bisheng_unstructured/models/readme.md
  28. +83 −2 src/bisheng_unstructured/models/table_agent.py
  29. +3 −8 src/bisheng_unstructured/staging/prodigy.py
  30. +12 −1 src/bisheng_unstructured/topdf/docx2pdf.py
  31. +23 −3 src/bisheng_unstructured/topdf/excel2pdf.py
  32. +18 −0 src/bisheng_unstructured/topdf/text2pdf.py
  33. +10 −0 src/bisheng_unstructured/utils.py
  34. +56 −0 tests/test_idp_models_sdk.py
  35. +27 −2 tests/test_image.py
  36. +13 −5 tests/test_pdf_parser.py
221 changes: 200 additions & 21 deletions .drone.yml
@@ -10,15 +10,16 @@ steps: # Define the pipeline steps; they run sequentially
 image: alpine/git
 pull: if-not-exists
 environment:
-http_proxy:
+http_proxy:
 from_secret: PROXY
 https_proxy:
 from_secret: PROXY
 commands:
+- git config --global core.compression 0
 - git clone https://github.com/dataelement/bisheng-unstructured.git .
 - git checkout $DRONE_COMMIT

-- name: build_docker
+- name: build_docker_release
 pull: if-not-exists
 image: plugins/docker
 privileged: true
@@ -27,25 +28,203 @@ steps: # Define the pipeline steps; they run sequentially
 path: /var/cache/apt/archives # Mount out the packaged application Jar and run scripts
 - name: socket
 path: /var/run/docker.sock
-settings:
-registry: http://192.168.106.8:6082
-insecure: true
-purge: true
-repo: 192.168.106.8:6082/dataelement/bisheng-unstructured
-tags: [ release ]
-context: ./
-dockerfile: ./docker/Dockerfile
-username:
+environment:
+http_proxy:
+from_secret: PROXY
+https_proxy:
+from_secret: PROXY
+no_proxy: 192.168.106.8
+version: release
+docker_repo: 192.168.106.8:6082/dataelement/bisheng-unstructured
+docker_registry: http://192.168.106.8:6082
+docker_user:
+from_secret: NEXUS_USER
+docker_password:
+from_secret: NEXUS_PASSWORD
+commands:
+- docker login -u $docker_user -p $docker_password $docker_registry
+- docker build -t $docker_repo:$version -f ./docker/Dockerfile .
+- docker push $docker_repo:$version
+when:
+status:
+- success
+branch:
+- release
+event:
+- push
+
+- name: build_docker
+pull: if-not-exists
+image: docker:24.0.6
+privileged: true
+volumes: # Mount container directories onto the host; the repository must have the Trusted setting enabled
+- name: apt-cache
+path: /var/cache/apt/archives # Mount out the packaged application Jar and run scripts
+- name: socket
+path: /var/run/docker.sock
+environment:
+http_proxy:
+from_secret: PROXY
+https_proxy:
+from_secret: PROXY
+no_proxy: 192.168.106.8,192.168.106.8
+version: ${DRONE_TAG}
+docker_repo: dataelement/bisheng-unstructured
+docker_user:
+from_secret: DOCKER_USER
+docker_password:
+from_secret: DOCKER_PASSWORD
+cr_user:
+from_secret: CR_USER
+cr_password:
+from_secret: CR_PASSWORD
+cr_repo_host: cr.dataelem.com
+commands:
+- docker login -u $cr_user -p $cr_password $cr_repo_host # Log in to the official registry
+- docker login -u $docker_user -p $docker_password # Log in to the private registry
+# Push the amd image to the cr registry
+- docker build -t $docker_repo:$version -t $docker_repo:latest -t $cr_repo_host/$docker_repo:$version -t $cr_repo_host/$docker_repo:latest -f ./docker/Dockerfile .
+- docker push $docker_repo:$version
+- docker push $cr_repo_host/$docker_repo:$version
+- docker push $docker_repo:latest
+- docker push $cr_repo_host/$docker_repo:latest
+when:
+status:
+- success
+ref:
+- refs/tags/v*
+
+volumes:
+- name: bisheng-cache
+host:
+path: /opt/drone/data/bisheng/
+- name: apt-cache
+host:
+path: /opt/drone/data/bisheng/apt/
+- name: socket
+host:
+path: /var/run/docker.sock
+
+
+
+---
+kind: pipeline
+type: docker
+name: unstructured-arm
+
+clone:
+disable: true
+
+platform:
+os: linux
+arch: arm64
+
+steps:
+- name: clone
+image: alpine/git
+pull: if-not-exists
+environment:
+http_proxy:
+from_secret: PROXY
+https_proxy:
+from_secret: PROXY
+commands:
+- git config --global core.compression 0
+- git clone https://github.com/dataelement/bisheng-unstructured.git .
+- git checkout $DRONE_COMMIT
+
+
+
+- name: build_docker_release
+pull: if-not-exists
+image: docker:24.0.6
+privileged: true
+volumes: # Mount container directories onto the host; the repository must have the Trusted setting enabled
+- name: apt-cache
+path: /var/cache/apt/archives # Mount out the packaged application Jar and run scripts
+- name: apt-cache
+path: /root/.cache/pip/
+- name: socket
+path: /var/run/docker.sock
+environment:
+http_proxy:
+from_secret: PROXY
+https_proxy:
+from_secret: PROXY
+no_proxy: 192.168.106.8
+version: release
+docker_repo: 192.168.106.8:6082/dataelement/bisheng-unstructured-arm
+docker_registry: http://192.168.106.8:6082
+cr_user:
+from_secret: CR_USER
+cr_password:
+from_secret: CR_PASSWORD
+cr_repo_host: cr.dataelem.com
+docker_user:
 from_secret: NEXUS_USER
-password:
+docker_password:
 from_secret: NEXUS_PASSWORD
+commands:
+- docker login -u $docker_user -p $docker_password $docker_registry
+- docker login -u $cr_user -p $cr_password $cr_repo_host # Log in to the official registry
+- docker buildx build --push -t $cr_repo_host/dataelement/bisheng-unstructured-arm:$version -t $docker_repo:$version -f ./docker/Dockerfile-arm .
+when:
+status:
+- success
+branch:
+- release
+event:
+- push
+
+
+- name: build_docker
+pull: if-not-exists
+image: docker:24.0.6
+privileged: true
+volumes: # Mount container directories onto the host; the repository must have the Trusted setting enabled
+- name: apt-cache
+path: /var/cache/apt/archives # Mount out the packaged application Jar and run scripts
+- name: socket
+path: /var/run/docker.sock
+environment:
+http_proxy:
+from_secret: PROXY
+https_proxy:
+from_secret: PROXY
+no_proxy: 192.168.106.8,192.168.106.8
+version: ${DRONE_TAG}
+docker_repo: dataelement/bisheng-unstructured-arm
+docker_user:
+from_secret: DOCKER_USER
+docker_password:
+from_secret: DOCKER_PASSWORD
+cr_user:
+from_secret: CR_USER
+cr_password:
+from_secret: CR_PASSWORD
+cr_repo_host: cr.dataelem.com
+commands:
+- docker login -u $cr_user -p $cr_password $cr_repo_host # Log in to the official registry
+- docker login -u $docker_user -p $docker_password # Log in to the private registry
+# Push the amd image to the cr registry
+- docker buildx build --push -t $docker_repo:$version -t $docker_repo:latest -t $cr_repo_host/$docker_repo:$version -t $cr_repo_host/$docker_repo:latest -f ./docker/Dockerfile-arm .
+#- docker push $docker_repo:$version
+# - docker push $cr_repo_host/$docker_repo:$version
+# - docker push $docker_repo:latest
+# - docker push $cr_repo_host/$docker_repo:latest
+when:
+status:
+- success
+ref:
+- refs/tags/v*
+
 volumes:
-- name: bisheng-cache
-host:
-path: /opt/drone/data/bisheng/
-- name: apt-cache
-host:
-path: /opt/drone/data/bisheng/apt/
-- name: socket
-host:
-path: /var/run/docker.sock
+- name: bisheng-cache
+host:
+path: /opt/drone/data/bisheng/
+- name: apt-cache
+host:
+path: /opt/drone/data/bisheng/apt/
+- name: socket
+host:
+path: /var/run/docker.sock
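The tag-triggered `build_docker` steps in this file build a single image and publish it under four references: the version tag and `latest`, each on Docker Hub and on `cr.dataelem.com`. A minimal Python sketch of that naming scheme follows; `image_refs` is a hypothetical helper written for illustration and is not part of the repository or of Drone.

```python
from typing import List


def image_refs(repo: str, version: str, cr_host: str = "cr.dataelem.com") -> List[str]:
    """Return the image references in the order the pipeline pushes them.

    Hypothetical helper: it only mirrors the -t/push arguments seen in
    .drone.yml and does not talk to any registry.
    """
    refs = []
    for tag in (version, "latest"):
        refs.append(f"{repo}:{tag}")            # default registry (Docker Hub)
        refs.append(f"{cr_host}/{repo}:{tag}")  # cr.dataelem.com registry
    return refs


if __name__ == "__main__":
    for ref in image_refs("dataelement/bisheng-unstructured", "v0.0.3.10"):
        print(ref)
```

The ARM pipeline follows the same scheme with `dataelement/bisheng-unstructured-arm` as the repository and `docker buildx build --push` in place of separate `docker push` calls.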
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ sftp-config.json
 .isort.cfg
 .idea/
 .DS_Store
+__pycache__
2 changes: 1 addition & 1 deletion config/config.yaml
@@ -32,7 +32,7 @@ pdf_model_params:
 table_model_ep: "http://192.168.106.12:9001/v2.1/models/elem_table_detect_v1/infer"
 ocr_model_ep: "http://192.168.106.12:9001/v2.1/models/elem_ocr_collection_v3/infer"

-
+is_all_ocr: false
 # Configuration items required for OCR recognition
 ocr_conf:
 params:
9 changes: 8 additions & 1 deletion docker/Dockerfile
@@ -9,9 +9,16 @@ ENV https_proxy=
 ENV HTTP_PROXY=
 ENV HTTPS_PROXY=

+
+RUN sh -c 'echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse \n \
+deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse \n \
+deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse \n \
+deb http://security.ubuntu.com/ubuntu/ focal-security main restricted universe multiverse" > /etc/apt/sources.list'
+
+RUN cat /etc/apt/sources.list
 # Install Poetry
 RUN apt-get update && apt-get install gcc g++ curl build-essential postgresql-server-dev-all -y
-RUN apt-get update && apt-get install procps -y
+RUN apt-get update && apt-get install procps poppler-utils -y
 # opencv
 RUN apt-get install -y libglib2.0-0 libsm6 libxrender1 libxext6 libgl1
 RUN curl -sSL https://install.python-poetry.org | python3 -
27 changes: 27 additions & 0 deletions docker/Dockerfile-arm
@@ -0,0 +1,27 @@
+FROM uns-armv8-ubuntu-20-04:v3
+LABEL org.opencontainers.image.authors="Dataelem inc."
+
+ARG BISHENG_UNS_VER=0.0.2
+
+RUN cat /etc/apt/sources.list
+RUN apt update && apt-get install poppler-utils -y
+
+# Copy bins and configs
+RUN mkdir -p /opt/bisheng-unstructured/bin
+COPY ./docker/entrypoint-arm.sh /opt/bisheng-unstructured/bin/
+COPY config /opt/bisheng-unstructured/
+
+
+WORKDIR /opt/bisheng-unstructured
+
+# Copy source code
+COPY ./src/ /opt/bisheng-unstructured/
+COPY ./requirements.txt /opt/bisheng-unstructured/
+
+# install requirements
+RUN python3 -m pip install --upgrade pip
+RUN pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+
+RUN apt-get clean && rm -rf /var/lib/apt/lists/* && rm -rf /root/.cache/pip
+
+CMD ["bash", "bin/entrypoint-arm.sh"]
8 changes: 8 additions & 0 deletions docker/entrypoint-arm.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+
+export PATH=/usr/local/texlive/2024/bin/aarch64-linux:$PATH
+export MANPATH=/usr/local/texlive/2024/texmf-dist/doc/man:$MANPATH
+export INFOPATH=/usr/local/texlive/2024/texmf-dist/doc/info:$INFOPATH
+
+uvicorn --host 0.0.0.0 --port 10001 --workers 8 bisheng_unstructured.api.main:app
Binary file added examples/docs/table_test_001.jpg
7 changes: 5 additions & 2 deletions requirements.txt
@@ -45,9 +45,9 @@ pdf2image==1.16.3
 pdfminer-six==20221105
 pdfplumber==0.10.2
 wheel==0.41.0
-pypdfium2==4.23.1
+#pypdfium2==4.23.1
 pypdf==4.3.0
-PyMuPDF==1.23.2
+PyMuPDF==1.23.8
 opencv-python==4.8.0.76
 certifi==2023.7.22
 cffi==1.15.1
@@ -75,3 +75,6 @@ xlrd==2.0.1
 uvicorn
 fastapi
 orjson
+
+# client
+tritonclient[http]==2.41.0
12 changes: 5 additions & 7 deletions src/bisheng_unstructured/api/main.py
@@ -8,13 +8,12 @@
 from fastapi.responses import ORJSONResponse
 from loguru import logger

+from bisheng_unstructured.api.pipeline import Pipeline
+from bisheng_unstructured.api.types import ConfigInput, UnstructuredInput, UnstructuredOutput
 from bisheng_unstructured.common import Timer
-from bisheng_unstructured.config.settings import settings
-
 from bisheng_unstructured.common.logger import configure
+from bisheng_unstructured.config.settings import settings
 from bisheng_unstructured.middlewares.http_middleware import CustomMiddleware
-from bisheng_unstructured.api.pipeline import Pipeline
-from bisheng_unstructured.api.types import ConfigInput, UnstructuredInput, UnstructuredOutput

 # Fastapi App

@@ -100,8 +99,7 @@ async def etl4_llm(inp: UnstructuredInput):

 inp.file_path = file_path
 inp.file_type = file_type
-
-if pipeline.mode == "local":
+if pipeline.mode == "local" and inp.mode == "partition":
 # Local mode supports only text output, for a limited set of formats
 logger.info(f"local_pipeline mode=[{inp.mode}] filename=[{inp.filename}]")
 inp.mode = "text"
@@ -120,7 +118,7 @@ async def etl4_llm(inp: UnstructuredInput):

 timer.toc()
 outp = pipeline.predict(inp)
-if inp.mode == "partition":
+if inp.mode == "partition" and outp.status_code == 200:
 with open(file_path, "rb") as fin:
 outp.b64_pdf = base64.b64encode(fin.read()).decode("utf-8")
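The two condition changes in `etl4_llm` can be read as two small rules: a local pipeline now downgrades only `partition` requests to `text` (other modes, such as `topdf`, pass through), and the base64 PDF is attached only when prediction succeeded. The sketch below restates those rules in isolation; `effective_mode`, `attach_pdf`, and the `Result` dataclass are hypothetical stand-ins for illustration, not the service's actual types.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for UnstructuredOutput."""
    status_code: int = 200
    b64_pdf: Optional[str] = None


def effective_mode(pipeline_mode: str, requested_mode: str) -> str:
    """Only 'partition' is downgraded to 'text' when no models are available."""
    if pipeline_mode == "local" and requested_mode == "partition":
        return "text"
    return requested_mode


def attach_pdf(result: Result, mode: str, pdf_bytes: bytes) -> Result:
    """Attach the PDF as base64 only for successful 'partition' requests."""
    if mode == "partition" and result.status_code == 200:
        result.b64_pdf = base64.b64encode(pdf_bytes).decode("utf-8")
    return result


if __name__ == "__main__":
    print(effective_mode("local", "partition"))
    print(effective_mode("local", "topdf"))
    print(attach_pdf(Result(status_code=400), "partition", b"%PDF").b64_pdf)
```

Guarding on the status code avoids reading a file that a failed prediction may not have produced.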

58 changes: 28 additions & 30 deletions src/bisheng_unstructured/api/pipeline.py
@@ -3,11 +3,13 @@

 from loguru import logger

+from bisheng_unstructured.api.any2pdf import Any2PdfCreator
+from bisheng_unstructured.api.types import UnstructuredInput, UnstructuredOutput
+from bisheng_unstructured.documents.elements import ElementMetadata, NarrativeText
 from bisheng_unstructured.documents.html_utils import save_to_txt, visualize_html
+from bisheng_unstructured.documents.pdf_parser.blob import Blob
 from bisheng_unstructured.documents.pdf_parser.image import ImageDocument
 from bisheng_unstructured.documents.pdf_parser.pdf import PDFDocument
-from bisheng_unstructured.documents.pdf_parser.idp.pdf import PDFDocument as IDP_PDFDocument
-from bisheng_unstructured.documents.pdf_parser.idp.image import ImageDocument as IDP_ImageDocument
 from bisheng_unstructured.partition.csv import partition_csv
 from bisheng_unstructured.partition.doc import partition_doc
 from bisheng_unstructured.partition.docx import partition_docx
@@ -17,20 +19,16 @@
 from bisheng_unstructured.partition.pptx import partition_pptx
 from bisheng_unstructured.partition.text import partition_text
 from bisheng_unstructured.partition.tsv import partition_tsv
+from bisheng_unstructured.partition.xls import partition_xls
 from bisheng_unstructured.partition.xlsx import partition_xlsx
 from bisheng_unstructured.staging.base import convert_to_isd

-from bisheng_unstructured.api.any2pdf import Any2PdfCreator
-from bisheng_unstructured.api.types import UnstructuredInput, UnstructuredOutput
-from bisheng_unstructured.documents.elements import ElementMetadata, NarrativeText
-from bisheng_unstructured.documents.pdf_parser.blob import Blob
-from src.bisheng_unstructured.partition.xls import partition_xls
-

 def partition_pdf(filename, model_params, **kwargs):
 if kwargs.get("mode") == "local":
 # parse with pypdf
 import pypdf
+
 blob = Blob.from_path(filename)
 with blob.as_bytes_io() as pdf_file_obj:
 reader = pypdf.PdfReader(pdf_file_obj)
@@ -41,22 +39,22 @@ def partition_pdf(filename, model_params, **kwargs):
 for page_num, page in enumerate(reader.pages)
 ]
 else:
-rt_type = kwargs.get("rt_type", "sdk")
-if rt_type in {"sdk", "idp"}:
-doc = IDP_PDFDocument(file=filename, model_params=model_params, **kwargs)
-else:
-doc = PDFDocument(file=filename, model_params=model_params, **kwargs)
+# rt_type = kwargs.get("rt_type", "sdk")
+# if rt_type in {"ocr_sdk", "idp"}:
+# doc = IDP_PDFDocument(file=filename, model_params=model_params, **kwargs)
+# else:
+doc = PDFDocument(file=filename, model_params=model_params, **kwargs)

 _ = doc.pages
 return doc.elements


 def partition_image(filename, model_params, **kwargs):
 rt_type = kwargs.get("rt_type", "sdk")
-if rt_type in {"sdk", "idp"}:
-doc = IDP_ImageDocument(file=filename, model_params=model_params, **kwargs)
-else:
-doc = ImageDocument(file=filename, model_params=model_params, **kwargs)
+# if rt_type in {"ocr_sdk", "idp", "sdk"}:
+# doc = IDP_ImageDocument(file=filename, model_params=model_params, **kwargs)
+# else:
+doc = ImageDocument(file=filename, model_params=model_params, **kwargs)

 _ = doc.pages
 return doc.elements
@@ -89,24 +87,24 @@ class Pipeline(object):
 def __init__(self, settings: Dict):
 """On k8s, a ConfigMap (cm) creates the environment variables"""
 tmp_dict = settings
-rt_ep = os.getenv("rt_server")
-self.rt_type = os.getenv("rt_type", "sdk")
+rt_ep = os.getenv("server_address")
+self.rt_type = os.getenv("server_type", "rt")
 if rt_ep:
-if self.rt_type in {"sdk", "idp"}:
+if self.rt_type in {"ocr_sdk", "idp", "sdk"}:
 pdf_model_params_temp = {
-"layout_ep": f"http://{rt_ep}/v2/idp/idp_app/infer",
-"cell_model_ep": f"http://{rt_ep}/v2/idp/idp_app/infer",
-"rowcol_model_ep": f"http://{rt_ep}/v2/idp/idp_app/infer",
-"table_model_ep": f"http://{rt_ep}/v2/idp/idp_app/infer",
-"ocr_model_ep": f"http://{rt_ep}/v2/idp/idp_app/infer"
+"layout_ep": f"http://{rt_ep}/v2/idp/elem_layout_v1/infer",
+"cell_model_ep": f"http://{rt_ep}/v2/idp/elem_table_cell_detect_v1/infer",
+"rowcol_model_ep": f"http://{rt_ep}/v2/idp/elem_table_rowcol_detect_v1/infer",
+"table_model_ep": f"http://{rt_ep}/v2/idp/elem_table_multiclass_v1/infer",
+"ocr_model_ep": f"http://{rt_ep}/v2/idp/idp_app/infer",
 }
 else:
 pdf_model_params_temp = {
 "layout_ep": f"http://{rt_ep}/v2.1/models/elem_layout_v1/infer",
 "cell_model_ep": f"http://{rt_ep}/v2.1/models/elem_table_cell_detect_v1/infer",
 "rowcol_model_ep":
 f"http://{rt_ep}/v2.1/models/elem_table_rowcol_detect_v1/infer",
-"table_model_ep": f"http://{rt_ep}/v2.1/models/elem_table_detect_v1/infer",
+"table_model_ep": f"http://{rt_ep}/v2.1/models/elem_table_multiclass_v1/infer",
 "ocr_model_ep": f"http://{rt_ep}/v2.1/models/elem_ocr_collection_v3/infer",
 }
 self.mode = "sdk"
@@ -137,7 +135,6 @@ def to_pdf(self, inp: UnstructuredInput) -> UnstructuredOutput:
 return UnstructuredOutput(status_code=400, status_message=str(e))

 def predict(self, inp: UnstructuredInput) -> UnstructuredOutput:
-
 if inp.file_type not in PARTITION_MAP:
 raise Exception(f"file type[{inp.file_type}] not supported")

@@ -150,11 +147,12 @@ def predict(self, inp: UnstructuredInput) -> UnstructuredOutput:
 part_inp = {
 "filename": filename,
 "mode": self.mode,
-'rt_type': self.rt_type,
-**inp.parameters
+"rt_type": self.rt_type,
+"is_scan": inp.is_scan,
+**inp.parameters,
 }
 part_func = PARTITION_MAP.get(file_type)
-if part_func == partition_image and self.mode == 'local':
+if part_func == partition_image and self.mode == "local":
 return UnstructuredOutput(status_code=400, status_message="本地模型不支持图片")

 if part_func == partition_pdf or part_func == partition_image:
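The `Pipeline.__init__` change renames the environment variables to `server_address` and `server_type` (default `rt`) and gives each model its own route under both URL prefixes. The following standalone sketch reproduces that mapping so it can be checked without starting the service; `build_model_endpoints` and its `env` parameter are hypothetical names introduced here, while the variable names, prefixes, and model names are taken from the diff.

```python
import os
from typing import Dict, Mapping, Optional

# Server types that use the /v2/idp prefix, as listed in the diff.
IDP_TYPES = {"ocr_sdk", "idp", "sdk"}


def build_model_endpoints(env: Optional[Mapping[str, str]] = None) -> Optional[Dict[str, str]]:
    """Return the model endpoint map, or None when no server address is set."""
    env = os.environ if env is None else env
    host = env.get("server_address")
    if not host:
        return None  # caller keeps whatever the YAML settings provide

    server_type = env.get("server_type", "rt")
    if server_type in IDP_TYPES:
        base, ocr_model = f"http://{host}/v2/idp", "idp_app"
    else:
        base, ocr_model = f"http://{host}/v2.1/models", "elem_ocr_collection_v3"

    return {
        "layout_ep": f"{base}/elem_layout_v1/infer",
        "cell_model_ep": f"{base}/elem_table_cell_detect_v1/infer",
        "rowcol_model_ep": f"{base}/elem_table_rowcol_detect_v1/infer",
        "table_model_ep": f"{base}/elem_table_multiclass_v1/infer",
        "ocr_model_ep": f"{base}/{ocr_model}/infer",
    }


if __name__ == "__main__":
    for name, url in build_model_endpoints({"server_address": "127.0.0.1:9001"}).items():
        print(name, url)
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the mapping easy to test.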
1 change: 1 addition & 0 deletions src/bisheng_unstructured/api/types.py
@@ -11,6 +11,7 @@ class UnstructuredInput(BaseModel):
 mode: str = "text" # text, partition, vis, topdf
 file_path: Optional[str] = None
 file_type: Optional[str] = None
+is_scan: Optional[bool] = None


 class UnstructuredOutput(BaseModel):
3 changes: 2 additions & 1 deletion src/bisheng_unstructured/config/config.yaml
@@ -32,7 +32,8 @@ pdf_model_params:
 table_model_ep: "http://192.168.106.12:9001/v2.1/models/elem_table_detect_v1/infer"
 ocr_model_ep: "http://192.168.106.12:9001/v2.1/models/elem_ocr_collection_v3/infer"

-
+# Whether to run everything through OCR; if false, the code decides whether OCR is needed
+is_all_ocr: false
 # Configuration items required for OCR recognition
 ocr_conf:
 params:
5 changes: 3 additions & 2 deletions src/bisheng_unstructured/config/settings.py
@@ -4,7 +4,7 @@

 import yaml
 from loguru import logger
-from pydantic import BaseModel, BaseSettings, validator
+from pydantic import BaseModel, BaseSettings, Field, validator


 class LoggerConf(BaseModel):
@@ -51,6 +51,7 @@ class Settings(BaseSettings):
 logger_conf: LoggerConf = LoggerConf()
 pdf_model_params: PdfModelParams = PdfModelParams()
 ocr_conf: OcrConf = OcrConf()
+is_all_ocr: bool = Field(default=False)


 def load_settings_from_yaml(file_path: str) -> Settings:
@@ -66,7 +67,7 @@ def load_settings_from_yaml(file_path: str) -> Settings:
 for key in settings_dict:
 if key not in Settings.__fields__.keys():
 raise KeyError(f"Key {key} not found in settings")
-logger.debug(f"Loading {len(settings_dict[key])} {key} from {file_path}")
+logger.debug(f"Loading {key} from {file_path}")

 return Settings(**settings_dict)
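The debug message in `load_settings_from_yaml` drops its `len(...)` call in the same change that introduces the boolean `is_all_ocr` key. A likely reason, stated here as an assumption, is that `len()` raises `TypeError` on a `bool`, so the old message would fail as soon as the new key is present in the YAML. The sketch below shows both behaviours with plain dictionaries; `ALLOWED_KEYS`, `validate_keys`, and `old_style_message` are illustrative names, and the key set is only the subset of `Settings` fields visible in the diff.

```python
# Illustrative subset of the Settings fields visible in the diff.
ALLOWED_KEYS = {"logger_conf", "pdf_model_params", "ocr_conf", "is_all_ocr"}


def validate_keys(settings_dict, allowed=ALLOWED_KEYS):
    """Reject unknown keys and build one log message per key (new style)."""
    messages = []
    for key in settings_dict:
        if key not in allowed:
            raise KeyError(f"Key {key} not found in settings")
        messages.append(f"Loading {key}")
    return messages


def old_style_message(key, value):
    """Old-style message: works for dicts and lists, fails for scalars."""
    return f"Loading {len(value)} {key}"


if __name__ == "__main__":
    print(validate_keys({"ocr_conf": {"params": {}}, "is_all_ocr": False}))
    try:
        old_style_message("is_all_ocr", False)
    except TypeError as exc:
        print("old message fails:", exc)
```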

54 changes: 41 additions & 13 deletions src/bisheng_unstructured/documents/markdown.py
@@ -1,8 +1,8 @@
 import re

+from loguru import logger
 import lxml
 from lxml import etree
-from lxml.builder import E
 from lxml.html.clean import Cleaner

 RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(pattern=r"\s+", flags=re.DOTALL)
@@ -13,6 +13,7 @@ def norm_text(e):


 def markdown_table(rows):
+
 def _format_row(r):
 content = " | ".join(r)
 content = "| " + content + " |"
@@ -41,26 +42,53 @@ def _format_header(n):
 return "\n".join(content)


+def norm_texts(texts):
+return [t for t in map(norm_text, texts) if t]
+
+
 def transform_html_table_to_md(html_table_str, field_sep=" "):
-table_node = lxml.html.fromstring(html_table_str)
+try:
+table_node = lxml.html.fromstring(html_table_str)
+except Exception as e:
+logger.error("html table parse error: %s", e)
+return dict(text="", html="", error=str(e))
 rows = []
 for thead_node in table_node.xpath(".//thead"):
-row = []
-texts = tuple(thead_node.xpath(".//th//text()"))
-texts = list(map(norm_text, texts))
-row = texts
-
+row = norm_texts(thead_node.xpath(".//th//text()"))
 if row:
 rows.append(row)

-for tr in table_node.xpath(".//tr"):
+rowspan_data = {} # stores the data of cells that span multiple rows
+for index, tr in enumerate(table_node.xpath(".//tr")):
 row = []
-for e in tr.getchildren():
-texts = tuple(e.xpath(".//text()"))
-texts = map(norm_text, texts)
-texts = [t for t in texts if t]
+col_offset = 0 # column offset while iterating, used to handle merged cells
+for col_index, e in enumerate(tr.getchildren()):
+col_real_index = col_index + col_offset
+if index in rowspan_data:
+# this position is covered by a merged cell from above: fill it in first, then continue
+while col_real_index in rowspan_data[index]:
+row.extend(rowspan_data[index][col_real_index])
+col_offset += len(rowspan_data[index][col_real_index])
+col_real_index = col_index + col_offset
+
+texts = norm_texts(e.xpath(".//text()"))
 field_text = field_sep.join(texts)
-row.append(field_text)
+
+colspan = int(e.get('colspan', 1))
+rowspan = int(e.get('rowspan', 1))
+
+inner_col = []
+inner_col.extend([field_text] * colspan)
+col_offset += colspan - 1 # when cells are merged, advance the offset
+
+if rowspan > 1:
+for i in range(1, rowspan):
+if (index + i) not in rowspan_data:
+rowspan_data[index + i] = {}
+# row merge and column merge
+rowspan_data[index + i][col_real_index] = inner_col
+
+row.extend(inner_col)

 if row:
 rows.append(row)
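The new loop in `transform_html_table_to_md` expands `colspan` and `rowspan` so that every Markdown row ends up with one text per grid column. The sketch below restates that idea on plain tuples instead of lxml nodes, which makes the bookkeeping easier to follow. It is a simplified re-statement under stated assumptions, not the committed code: `expand_spans` and the `(text, colspan, rowspan)` cell format are hypothetical, and the final flush of carried cells that fall after the last real cell of a row is an addition of this sketch that the diff above does not appear to perform.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan), hypothetical input format


def expand_spans(table: List[List[Cell]]) -> List[List[str]]:
    """Repeat each cell's text across the columns and rows it spans."""
    # pending[row_index][start_column] -> texts carried down from a rowspan above
    pending: Dict[int, Dict[int, List[str]]] = {}
    rows: List[List[str]] = []
    for r, cells in enumerate(table):
        carried = pending.pop(r, {})
        row: List[str] = []
        for text, colspan, rowspan in cells:
            # Fill in columns occupied by cells spanning down from earlier rows.
            while len(row) in carried:
                row.extend(carried.pop(len(row)))
            start = len(row)
            block = [text] * colspan
            # Reserve the same columns in each following row the cell covers.
            for k in range(1, rowspan):
                pending.setdefault(r + k, {})[start] = block
            row.extend(block)
        # Carried cells located to the right of the last real cell.
        while len(row) in carried:
            row.extend(carried.pop(len(row)))
        rows.append(row)
    return rows


if __name__ == "__main__":
    table = [
        [("Name", 1, 2), ("Score", 2, 1)],
        [("Math", 1, 1), ("Art", 1, 1)],
    ]
    for row in expand_spans(table):
        print(" | ".join(row))
```

Using the current row length as the column index removes the need for a separate offset counter, at the cost of assuming that cells arrive in left-to-right order.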
102 changes: 92 additions & 10 deletions src/bisheng_unstructured/documents/pdf_parser/idp/pdf.py
@@ -10,7 +10,8 @@

 import fitz as pymupdf
 import numpy as np
-import pypdfium2
+#import pypdfium2
+from pdf2image import convert_from_path,convert_from_bytes
 from PIL import Image, ImageOps

 from bisheng_unstructured.common import Timer
@@ -160,6 +161,7 @@ def __init__(
 password: Optional[Union[str, bytes]] = None,
 is_join_table: bool = True,
 with_columns: bool = False,
+is_scan: Optional[bool] = None,
 support_rotate: bool = False,
 text_elem_sep: str = "\n",
 start: int = 0,
@@ -188,6 +190,7 @@ def __init__(
 self.support_formula = support_formula
 self.enable_isolated_formula = enable_isolated_formula
 self.n_parallel = n_parallel
+self.is_scan = is_scan
 super().__init__()

 def _get_image_blobs(self, fitz_doc, pdf_reader, n=None, start=0):
@@ -276,6 +279,63 @@ def _save_to_pages(self, groups, page_inds, lang):

 return pages

+def _extract_lines_v2(self, textpage):
+line_blocks = []
+line_words_info = []
+page_dict = textpage.extractRAWDICT()
+for block in page_dict["blocks"]:
+block_type = block["type"]
+block_no = block["number"]
+if block_type != 0:
+bbox = block["bbox"]
+block_text = ""
+block_info = BlockInfo(
+[bbox[0], bbox[1], bbox[2], bbox[3]], block_text, block_no, block_type
+)
+line_blocks.append(block_info)
+line_words_info.append((None, None))
+
+lines = block["lines"]
+
+for line in lines:
+bbox = line["bbox"]
+words = []
+words_bboxes = []
+for span in line["spans"]:
+cont_bboxes = []
+cont_text = []
+for char in span["chars"]:
+c = char["c"]
+if c == " ":
+if cont_bboxes:
+word_bbox = merge_rects(np.asarray(cont_bboxes))
+word = "".join(cont_text)
+words.append(word)
+words_bboxes.append(word_bbox)
+cont_bboxes = []
+cont_text = []
+else:
+cont_bboxes.append(char["bbox"])
+cont_text.append(c)
+
+if cont_bboxes:
+word_bbox = merge_rects(np.asarray(cont_bboxes))
+word = "".join(cont_text)
+words.append(word)
+words_bboxes.append(word_bbox)
+
+if not words_bboxes:
+continue
+
+line_words_info.append((words, words_bboxes))
+line_text = "".join([char["c"] for span in line["spans"] for char in span["chars"]])
+bb0, bb1, bb2, bb3 = merge_rects(np.asarray(words_bboxes))
+
+block_info = BlockInfo([bb0, bb1, bb2, bb3], line_text, block_no, block_type,rs=[bb0,bb1,bb2,bb3],ts=[line_text])
+line_blocks.append(block_info)
+
+return line_blocks, line_words_info
+
 def load(self) -> List[Page]:
 """Load given path as pages."""
 blob = Blob.from_path(self.file)
@@ -284,15 +344,19 @@ def load(self) -> List[Page]:
page_inds = []
lang = None

def _task(bytes_img, img, is_scan, lang, rot_matirx):
def _task(textpage_info,bytes_img, img, is_scan, lang, rot_matirx):
if not is_scan:
return textpage_info

b64_data = base64.b64encode(bytes_img).decode()
payload = {"b64_image": b64_data}
result = self.ocr_agent.predict(payload)
return result

with blob.as_bytes_io() as file_path:
fitz_doc = pymupdf.open(file_path)
pdf_doc = pypdfium2.PdfDocument(file_path, autoclose=True)
#pdf_doc = pypdfium2.PdfDocument(file_path, autoclose=True)
pdf_doc = convert_from_bytes(file_path.read(), dpi=72)
max_page = fitz_doc.page_count - start
n = self.n if self.n else max_page
n = min(n, max_page)
@@ -318,13 +382,20 @@ def _task(bytes_img, img, is_scan, lang, rot_matirx):
bytes_imgs = []
page_imgs = []
for idx in range(start, start + n):
page = pdf_doc.get_page(idx)
pil_image = page.render().to_pil()
page_imgs.append(pil_image)
#page = pdf_doc.get_page(idx)
#pil_image = page.render().to_pil()
#page_imgs.append(pil_image)
#img_byte_arr = io.BytesIO()
#pil_image.save(img_byte_arr, format="PNG")
#bytes_img = img_byte_arr.getvalue()
#bytes_imgs.append(bytes_img)
page = pdf_doc[idx]
img_byte_arr = io.BytesIO()
pil_image.save(img_byte_arr, format="PNG")
bytes_img = img_byte_arr.getvalue()
bytes_imgs.append(bytes_img)
page.save(img_byte_arr, format='PNG')
img_byte_arr = img_byte_arr.getvalue()
bytes_imgs.append(img_byte_arr)
page_imgs.append(page)


timer.toc()
print("pdfium render image", timer.get())
@@ -340,8 +411,19 @@ def _task(bytes_img, img, is_scan, lang, rot_matirx):
rot_matrix = None
bytes_img = bytes_imgs[idx - start]
img = page_imgs[idx - start]


if self.is_scan is not None:
is_scan = self.is_scan

if not is_scan:
textpage_info,_ = self._extract_lines_v2(textpage)
else:
textpage_info = []


futures.append(
executor.submit(_task, bytes_img, img, is_scan, lang, rot_matrix))
executor.submit(_task,textpage_info, bytes_img, img, is_scan, lang, rot_matrix))

idx = start
for future in futures:
43 changes: 27 additions & 16 deletions src/bisheng_unstructured/documents/pdf_parser/image.py
@@ -2,31 +2,39 @@
from typing import List

from bisheng_unstructured.documents.base import Page
from bisheng_unstructured.models import (LayoutAgent, OCRAgent, TableAgent, TableDetAgent,
RTLayoutAgent, RTOCRAgent, RTTableAgent, RTTableDetAgent)

from bisheng_unstructured.documents.pdf_parser.blob import Blob
from bisheng_unstructured.documents.pdf_parser.pdf import PDFDocument
from bisheng_unstructured.models import (
LayoutAgent,
OCRAgent,
RTLayoutAgent,
RTOCRAgent,
RTTableAgent,
RTTableDetAgent,
TableAgent,
TableDetAgent,
)

# from bisheng_unstructured.common import Timer


class ImageDocument(PDFDocument):

def __init__(self,
file: str,
model_params: dict,
with_columns: bool = False,
text_elem_sep: str = "\n",
enhance_table: bool = True,
keep_text_in_image: bool = True,
lang: str = "zh",
verbose: bool = False,
n_parallel: int = 10,
**kwargs) -> None:
def __init__(
self,
file: str,
model_params: dict,
with_columns: bool = False,
text_elem_sep: str = "\n",
enhance_table: bool = True,
keep_text_in_image: bool = True,
lang: str = "zh",
verbose: bool = False,
n_parallel: int = 10,
**kwargs
) -> None:
super(ImageDocument, self).__init__(file=file, model_params=model_params)
rt_type = kwargs.get("rt_type", "sdk")
if rt_type in {"sdk", "idp"}:
if rt_type in {"ocr_sdk", "idp", "sdk"}:
self.layout_agent = LayoutAgent(**model_params)
self.table_agent = TableAgent(**model_params)
self.ocr_agent = OCRAgent(**model_params)
@@ -67,6 +75,9 @@ def load(self) -> List[Page]:
# timer.toc()

if blocks:
for tmp_block in blocks:
tmp_block.pages = [1 for _ in tmp_block.rs]
tmp_block.bbox_text = None
if self.with_columns:
sub_groups = self._divide_blocks_into_groups(blocks)
groups.extend(sub_groups)
260 changes: 178 additions & 82 deletions src/bisheng_unstructured/documents/pdf_parser/pdf.py

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions src/bisheng_unstructured/documents/pdf_parser/test_pdf.py
@@ -4,7 +4,7 @@
import cv2
import fitz
import numpy as np
import pypdfium2
#import pypdfium2
from shapely import Polygon
from shapely import box as Rect

@@ -259,8 +259,8 @@ def test_vis():
with blob.as_bytes_io() as file_path:
pages = fitz.open(file_path)
print("pages", pages)
pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
image_blobs = get_image_blobs(pages, pdf_reader, n, start)
# pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
# image_blobs = get_image_blobs(pages, pdf_reader, n, start)

assert len(image_blobs) == n

75 changes: 42 additions & 33 deletions src/bisheng_unstructured/models/idp/dummy_ocr_agent.py
@@ -4,7 +4,6 @@
from typing import Any, List, Optional, Union

import numpy as np
import copy
import requests


@@ -21,6 +20,9 @@ class BlockInfo:
ord_ind: int = None
layout_type: int = None # 3: title, 4: paragraph, 5: table
html_text: str = None
pages: List[int] = None # page index of each item in rs
bbox_text: List[str] = None # text associated with each bbox


def find_xy(box):
xmin = box[0][0]
@@ -34,37 +36,43 @@ def find_xy(box):
ymin = min(ymin, y)
xmax = max(xmax, x)
ymax = max(ymax, y)
return [xmin,ymin,xmax,ymax]
return [xmin, ymin, xmax, ymax]


def random_merge_text(a, n):
size = len(a) // n
remainder = len(a) % n
result = [''.join(a[i * size + min(i, remainder):(i + 1) * size + min(i + 1, remainder)]) for i in range(n)]
result = [
"".join(a[i * size + min(i, remainder) : (i + 1) * size + min(i + 1, remainder)])
for i in range(n)
]
return result


def recalculate_xy(arr):
arr = np.array(arr)
min_col1 = np.min(arr[:, 0])
min_col2 = np.min(arr[:, 1])
max_col3 = np.max(arr[:, 2])
max_col4 = np.max(arr[:, 3])
return [min_col1,min_col2,max_col3,max_col4]
return [min_col1, min_col2, max_col3, max_col4]

def process_table(table_result):

max_row = max([max(item['rows']) for item in table_result])
max_col = max([max(item['cols']) for item in table_result])
layout = [["" for _ in range(max_col+1)] for _ in range(max_row+1)]
def process_table(table_result):
max_row = max([max(item["rows"]) for item in table_result])
max_col = max([max(item["cols"]) for item in table_result])
layout = [["" for _ in range(max_col + 1)] for _ in range(max_row + 1)]
rect_box = []
for item in table_result:
if not len(item["text_box"]): continue
bboxes = np.array(item["text_box"]).reshape((-1,4,2)).tolist()
if not len(item["text_box"]):
continue
bboxes = np.array(item["text_box"]).reshape((-1, 4, 2)).tolist()
new_bboxes = np.array([find_xy(box) for box in bboxes])
sub_rect_box = recalculate_xy(new_bboxes)
rect_box.append(sub_rect_box)
cols = item['cols']
rows = item['rows'] if len(cols) == len(item['rows']) else item['rows']*len(cols)

cols = item["cols"]
rows = item["rows"] if len(cols) == len(item["rows"]) else item["rows"] * len(cols)
if len(cols) == len(item["text"]):
text = item["text"]
else:
@@ -73,7 +81,7 @@ def process_table(table_result):
else:
text = random_merge_text(item["text"], len(cols))

for row, col, txt in zip(rows,cols, text):
for row, col, txt in zip(rows, cols, text):
layout[row][col] = txt

md_str = ""
@@ -85,32 +93,33 @@ def process_table(table_result):


def process_paragraph(bboxes, texts, rect_box):
up_table = {'boxs': [], 'texts':[]}
down_table = {'boxs': [], 'texts':[]}
up_table = {"boxs": [], "texts": []}
down_table = {"boxs": [], "texts": []}
if rect_box:
bboxes = np.array(bboxes).reshape((-1,4,2))
bboxes = np.array(bboxes).reshape((-1, 4, 2))
for idx, box in enumerate(bboxes):
a,ymi,b,ymx = find_xy(box)
a, ymi, b, ymx = find_xy(box)
if ymi < rect_box[1]:
up_table['boxs'].append([a,ymi,b,ymx])
up_table['texts'].append(texts[idx])
up_table["boxs"].append([a, ymi, b, ymx])
up_table["texts"].append(texts[idx])
elif ymx > rect_box[3]:
down_table['boxs'].append([a,ymi,b,ymx])
down_table['texts'].append(texts[idx])
down_table["boxs"].append([a, ymi, b, ymx])
down_table["texts"].append(texts[idx])
else:
bboxes = np.array(bboxes).reshape((-1,4,2))
bboxes = np.array(bboxes).reshape((-1, 4, 2))
for idx, box in enumerate(bboxes):
a,ymi,b,ymx = find_xy(box)
up_table['boxs'].append([a,ymi,b,ymx])
up_table['texts'].append(texts[idx])
a, ymi, b, ymx = find_xy(box)
up_table["boxs"].append([a, ymi, b, ymx])
up_table["texts"].append(texts[idx])

return up_table, down_table


def process_whole_paragraph(general_ocr_res):
boxes = general_ocr_res["bboxes"]
texts = general_ocr_res["texts"]
rowcol = general_ocr_res["row_col_info"]

max_row = max(row[0] for row in rowcol) + 1
max_col = max(row[1] for row in rowcol) + 1

@@ -140,19 +149,19 @@ def predict(self, inp) -> List[BlockInfo]:
req_data = {"param": params, "data": [b64_image]}
try:
r = self.client.post(url=self.ep, json=req_data, timeout=self.timeout)

except requests.exceptions.Timeout:
raise Exception(f"timeout in formula agent predict")
except Exception as e:
raise Exception(f"exception in formula agent predict: [{e}]")
layout_text, layout_boxs = process_whole_paragraph(r["data"]["json"]["general_ocr_res"])

layout_text, layout_boxs = process_whole_paragraph(r["data"]["json"]["general_ocr_res"])
b0 = BlockInfo(
block=[],
block_text=''.join([''.join(text) for text in layout_text]),
block_text="".join(["".join(text) for text in layout_text]),
block_no=0,
ts=[''.join(text) for text in layout_text],
rs=[''.join(text) for text in layout_boxs],
ts=["".join(text) for text in layout_text],
rs=["".join(text) for text in layout_boxs],
layout_type=0,
)
return [b0]
25 changes: 14 additions & 11 deletions src/bisheng_unstructured/models/idp/layout_agent.py
@@ -1,27 +1,30 @@
import copy
import json

import requests
import numpy as np
import tritonclient.http as httpclient


# Layout Agent Version 0.1, update at 2023.08.18
class LayoutAgent(object):

def __init__(self, *args, **kwargs):
self.ep = kwargs.get("layout_ep")
self.client = requests.Session()
ep_parts = self.ep.split("/")
self.model = ep_parts[-2]
self.server_url = ep_parts[2]
self.timeout = kwargs.get("timeout", 60)
self.params = {
"longer_edge_size": 0,
}

def predict(self, inp):
params = copy.deepcopy(self.params)
params.update(inp)
# print('params', params, self.ep)
input0_data = np.asarray([json.dumps(inp)], dtype=np.object_)
inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
client = httpclient.InferenceServerClient(url=self.server_url, verbose=False)
try:
r = self.client.post(url=self.ep, json=params, timeout=self.timeout)
return r.json()
except requests.exceptions.Timeout:
raise Exception(f"timeout in layout predict")
response = client.infer(self.model, inputs, request_id=str(1), outputs=outputs)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in layout predict: [{e}]")
return output_data
51 changes: 19 additions & 32 deletions src/bisheng_unstructured/models/idp/ocr_agent.py
@@ -1,10 +1,9 @@
import copy

import os

import requests

from bisheng_unstructured.models.common import (load_json)
from bisheng_unstructured.models.idp.dummy_ocr_agent import BlockInfo, process_whole_paragraph
from bisheng_unstructured.models.common import load_json

DEFAULT_CONFIG = {
"params": {
@@ -20,29 +19,27 @@
"recog": "general_text_reg_nb_v1.0_faster",
},
"hand": {
"det": "general_text_det_mrcnn_v2.0",
"recog": "transformer-hand-v1.16-faster",
"det": "general_text_det_v2.0",
"recog": "general_text_reg_nb_v1.0_faster",
},
"print_recog": {
"recog": "transformer-blank-v0.2-faster",
"recog": "general_text_reg_nb_v1.0_faster",
},
"hand_recog": {
"recog": "transformer-hand-v1.16-faster",
"recog": "general_text_reg_nb_v1.0_faster",
},
"det": {
"det": "general_text_det_mrcnn_v2.0",
"det": "general_text_det_v2.0",
},
}
},
}

ocr_predict_bak = {
'code': 200,
'message': 'ok',
'request_id': 21513655180,
'elapse': 595,
'result': {
'ocr_result': {}
}
"code": 200,
"message": "ok",
"request_id": 21513655180,
"elapse": 595,
"result": {"ocr_result": {}},
}


@@ -55,7 +52,6 @@ def convert_json(inp):
# OCR Agent Version 0.1, update at 2023.08.18
# - add predict_with_mask support recog with embedding formula, 2024.01.16
class OCRAgent(object):

def __init__(self, **kwargs):
self.ep = kwargs.get("ocr_model_ep")
self.client = requests.Session()
@@ -65,8 +61,7 @@ def __init__(self, **kwargs):
jsoncontent = load_json(mdoel_config_path)
else:
jsoncontent = None
if jsoncontent is not None and "params" in jsoncontent and \
"scene_mapping" in jsoncontent:
if jsoncontent is not None and "params" in jsoncontent and "scene_mapping" in jsoncontent:
self.params = jsoncontent["params"]
self.scene_mapping = jsoncontent["scene_mapping"]
else:
@@ -83,19 +78,11 @@ def predict(self, inp):
req_data = {"param": params, "data": [b64_image]}

try:
r = self.client.post(url=self.ep, json=req_data, timeout=self.timeout).json()
# ret = convert_json(r.json())

layout_text, layout_boxs = process_whole_paragraph(r["result"]['ocr_result'])
b0 = BlockInfo(
bbox=[],
block_text=''.join([''.join(text) for text in layout_text]),
block_no=0,
ts=[''.join(text) for text in layout_text],
rs=[text[0] for text in layout_boxs],
layout_type=0,
)
return [b0]
from loguru import logger

logger.info(f"ocr predict request: {params}")
r = self.client.post(url=self.ep, json=req_data, timeout=self.timeout)
return r.json()
# return r.json()
except requests.exceptions.Timeout:
raise Exception(f"timeout in ocr predict")
84 changes: 55 additions & 29 deletions src/bisheng_unstructured/models/idp/table_agent.py
@@ -1,59 +1,85 @@
import base64
import copy
import json

import requests
import numpy as np
import tritonclient.http as httpclient
from loguru import logger


# Table Agent Version 0.1, update at 2023.08.18
# Table Agent Version 0.1, update at 2023.08.31
class TableAgent(object):
def __init__(self, **kwargs):
cell_model_ep = kwargs.get("cell_model_ep")
rowcol_model_ep = kwargs.get("rowcol_model_ep")
ep_parts = kwargs.get("cell_model_ep").split("/")
self.cell_server_url = ep_parts[2]
self.cell_model = ep_parts[-2]

self.ep_map = {
"cell": cell_model_ep,
"rowcol": rowcol_model_ep,
}
ep_parts = kwargs.get("rowcol_model_ep").split("/")
self.rowcol_server_url = ep_parts[2]
self.rowcol_model = ep_parts[-2]

self.timeout = kwargs.get("timeout", 60)
self.params = {
"sep_char": " ",
"longer_edge_size": None,
"padding": False,
}

self.client = requests.Session()
self.timeout = kwargs.get("timeout", 60)

def predict(self, inp):
scene = inp.pop("scene", "rowcol")
ep = self.ep_map.get(scene)
params = copy.deepcopy(self.params)
params.update(inp)
if scene == "rowcol":
client, model = (
httpclient.InferenceServerClient(url=self.rowcol_server_url, verbose=False),
self.rowcol_model,
)
else:
client, model = (
httpclient.InferenceServerClient(url=self.cell_server_url, verbose=False),
self.cell_model,
)

payload = copy.deepcopy(self.params)
payload.update(inp)

# ocr_result = json.dumps(ocr_result)
# table_bbox = table_result["bboxes"]
# b64_image = base64.b64encode(open(image_file, 'rb').read()).decode('utf-8')
# payload = {'b64_image': b64_image, 'table_bboxes': table_bbox, 'ocr_result': ocr_result}

input0_data = np.asarray([json.dumps(payload)], dtype=np.object_)
# print(input0_data)
inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
try:
r = self.client.post(url=ep, json=params, timeout=self.timeout)
return r.json()
except requests.exceptions.Timeout:
raise Exception(f"timeout in table structure predict")
logger.info("table predict request, model: {}", model)
response = client.infer(model, inputs, request_id=str(1), outputs=outputs)
print("response", response)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in table structure predict: [{e}]")

return output_data


# TableDet Agent Version 0.1, update at 2023.08.31
class TableDetAgent(object):
def __init__(self, **kwargs):
self.ep = kwargs.get("table_model_ep")
self.client = requests.Session()
ep_parts = kwargs.get("table_model_ep").split("/")
self.server_url = ep_parts[2]
self.model = ep_parts[-2]
self.timeout = kwargs.get("timeout", 60)
self.params = {}

def predict(self, inp):
params = copy.deepcopy(self.params)
params.update(inp)
# b64data = base64.b64encode(open(image_file, 'rb').read()).decode('utf-8')
input0_data = np.asarray([json.dumps(inp)], dtype=np.object_)

inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
client = httpclient.InferenceServerClient(url=self.server_url, verbose=False)
try:
r = self.client.post(url=self.ep, json=params, timeout=self.timeout)
return r.json()
except requests.exceptions.Timeout:
raise Exception(f"timeout in table det predict")
response = client.infer(self.model, inputs, request_id=str(1), outputs=outputs)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in table det predict: [{e}]")

return output_data
26 changes: 25 additions & 1 deletion src/bisheng_unstructured/models/layout_agent.py
@@ -1,11 +1,14 @@
import base64
import copy
import json

import numpy as np
import requests
import tritonclient.http as httpclient


# Layout Agent Version 0.1, update at 2023.08.18
class LayoutAgent(object):
class LayoutAgentV0(object):
def __init__(self, *args, **kwargs):
self.ep = kwargs.get("layout_ep")
self.client = requests.Session()
@@ -25,3 +28,24 @@ def predict(self, inp):
raise Exception(f"timeout in layout predict")
except Exception as e:
raise Exception(f"exception in layout predict: [{e}]")


class LayoutAgent:
def __init__(self, *args, **kwargs):
ep_parts = kwargs.get("layout_ep").split("/")
self.model = ep_parts[-2]
self.server_url = ep_parts[2]

def predict(self, inp):
# b64_image = base64.b64encode(open(image_file, 'rb').read()).decode('utf-8')
input0_data = np.asarray([json.dumps(inp)], dtype=np.object_)
inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
client = httpclient.InferenceServerClient(url=self.server_url, verbose=False)
try:
response = client.infer(self.model, inputs, request_id=str(1), outputs=outputs)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in layout predict: [{e}]")
return output_data
34 changes: 16 additions & 18 deletions src/bisheng_unstructured/models/ocr_agent.py
@@ -6,7 +6,6 @@
from PIL import Image

from bisheng_unstructured.config.settings import settings

from bisheng_unstructured.models.common import (
bbox_overlap,
draw_polygon,
@@ -30,21 +29,21 @@
},
"scene_mapping": {
"print": {
"det": "general_text_det_mrcnn_v2.0",
"recog": "transformer-blank-v0.2-faster",
"det": "general_text_det_v2.0",
"recog": "general_text_reg_nb_v1.0_faster",
},
"hand": {
"det": "general_text_det_mrcnn_v2.0",
"recog": "transformer-hand-v1.16-faster",
"det": "general_text_det_v2.0",
"recog": "general_text_reg_nb_v1.0_faster",
},
"print_recog": {
"recog": "transformer-blank-v0.2-faster",
"recog": "general_text_reg_nb_v1.0_faster",
},
"hand_recog": {
"recog": "transformer-hand-v1.16-faster",
"recog": "general_text_reg_nb_v1.0_faster",
},
"det": {
"det": "general_text_det_mrcnn_v2.0",
"det": "general_text_det_v2.0",
},
},
}
@@ -53,7 +52,6 @@
# OCR Agent Version 0.1, update at 2023.08.18
# - add predict_with_mask support recog with embedding formula, 2024.01.16
class OCRAgent(object):

def __init__(self, **kwargs):
self.ep = kwargs.get("ocr_model_ep")
self.client = requests.Session()
@@ -118,10 +116,8 @@ def predict_with_mask(self, img0, mf_out, scene="print", **kwargs):

xmin, ymin = max(0, int(box[0][0]) - 1), max(0, int(box[0][1]) - 1)
xmax, ymax = (
min(img0.size[0],
int(box[2][0]) + 1),
min(img0.size[1],
int(box[2][1]) + 1),
min(img0.size[0], int(box[2][0]) + 1),
min(img0.size[1], int(box[2][1]) + 1),
)
img[ymin:ymax, xmin:xmax, :] = 255

@@ -151,11 +147,13 @@ def predict_with_mask(self, img0, mf_out, scene="print", **kwargs):
emb_bbox = [bb[0], bb[1], bb[4], bb[5]]
bbox_iou = bbox_overlap(hori_bbox, emb_bbox)
if bbox_iou > EMB_BBOX_THREHOLD:
embed_mfs.append({
"position": emb_bbox,
"text": box_info["text"],
"type": box_info["type"],
})
embed_mfs.append(
{
"position": emb_bbox,
"text": box_info["text"],
"type": box_info["type"],
}
)

ocr_boxes = split_line_image(hori_bbox, embed_mfs)
text_bboxes.extend(ocr_boxes)
8 changes: 8 additions & 0 deletions src/bisheng_unstructured/models/readme.md
@@ -0,0 +1,8 @@

## Model endpoints for the IDP SDK

- ocr model endpoint: http://{host:port}/v2/idp/idp_app/infer
- layout model endpoint: http://{host:port}/v2/models/elem_layout_v1/infer
- table det model endpoint: http://{host:port}/v2/models/elem_table_detect_v1/infer
- table rowcol model endpoint: http://{host:port}/v2/models/elem_table_rowcol_detect_v1/infer
- table cell model endpoint: http://{host:port}/v2/models/elem_table_cell_detect_v1/infer
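
The agents in this change derive the Triton server address and the model name from these endpoint URLs by splitting on `/` (index 2 for `host:port`, index -2 for the model name). The sketch below shows the same split using `urllib.parse`, assuming every endpoint ends in `/infer`. `split_endpoint` is a hypothetical helper for illustration and is not part of the package.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_endpoint(endpoint: str) -> Tuple[str, str]:
    """Return (host:port, model_name) for an endpoint shaped like the ones above.

    Hypothetical helper: mirrors ep.split("/")[2] and ep.split("/")[-2].
    """
    parts = urlsplit(endpoint)
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[-1] != "infer":
        raise ValueError(f"unexpected endpoint path: {endpoint!r}")
    return parts.netloc, segments[-2]


if __name__ == "__main__":
    # localhost:8000 is a placeholder address, not a real deployment.
    url = "http://localhost:8000/v2/models/elem_layout_v1/infer"
    print(split_endpoint(url))
```

Validating the path up front gives a clear error for a malformed endpoint, whereas plain index-based splitting would silently pick the wrong segment.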
85 changes: 83 additions & 2 deletions src/bisheng_unstructured/models/table_agent.py
@@ -1,11 +1,14 @@
import base64
import copy
import json

import numpy as np
import requests
import tritonclient.http as httpclient


# Table Agent Version 0.1, update at 2023.08.18
class TableAgent(object):
class TableAgentV0(object):
def __init__(self, **kwargs):
cell_model_ep = kwargs.get("cell_model_ep")
rowcol_model_ep = kwargs.get("rowcol_model_ep")
@@ -39,7 +42,7 @@ def predict(self, inp):


# TableDet Agent Version 0.1, update at 2023.08.31
class TableDetAgent(object):
class TableDetAgentV0(object):
def __init__(self, **kwargs):
self.ep = kwargs.get("table_model_ep")
self.client = requests.Session()
@@ -57,3 +60,81 @@ def predict(self, inp):
raise Exception(f"timeout in table det predict")
except Exception as e:
raise Exception(f"exception in table det predict: [{e}]")


class TableAgent(object):
def __init__(self, **kwargs):
ep_parts = kwargs.get("cell_model_ep").split("/")
self.cell_server_url = ep_parts[2]
self.cell_model = ep_parts[-2]

ep_parts = kwargs.get("rowcol_model_ep").split("/")
self.rowcol_server_url = ep_parts[2]
self.rowcol_model = ep_parts[-2]

self.timeout = kwargs.get("timeout", 60)
self.params = {
"sep_char": " ",
"longer_edge_size": None,
"padding": False,
}

def predict(self, inp):
scene = inp.pop("scene", "rowcol")
if scene == "rowcol":
# Pair the rowcol model with the rowcol server.
client, model = (
httpclient.InferenceServerClient(url=self.rowcol_server_url, verbose=False),
self.rowcol_model,
)
else:
# Pair the cell model with the cell server.
client, model = (
httpclient.InferenceServerClient(url=self.cell_server_url, verbose=False),
self.cell_model,
)

payload = copy.deepcopy(self.params)
payload.update(inp)

# ocr_result = json.dumps(ocr_result)
# table_bbox = table_result["bboxes"]
# b64_image = base64.b64encode(open(image_file, 'rb').read()).decode('utf-8')
# payload = {'b64_image': b64_image, 'table_bboxes': table_bbox, 'ocr_result': ocr_result}

input0_data = np.asarray([json.dumps(payload)], dtype=np.object_)
# print(input0_data)
inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
try:
response = client.infer(model, inputs, request_id=str(1), outputs=outputs)
print("response", response)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in table structure predict: [{e}]")

return output_data


class TableDetAgent(object):
def __init__(self, **kwargs):
ep_parts = kwargs.get("table_model_ep").split("/")
self.server_url = ep_parts[2]
self.model = ep_parts[-2]
self.timeout = kwargs.get("timeout", 60)

def predict(self, inp):
# b64data = base64.b64encode(open(image_file, 'rb').read()).decode('utf-8')
input0_data = np.asarray([json.dumps(inp)], dtype=np.object_)

inputs = [httpclient.InferInput("INPUT", [1], "BYTES")]
inputs[0].set_data_from_numpy(input0_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT")]
client = httpclient.InferenceServerClient(url=self.server_url, verbose=False)

try:
response = client.infer(self.model, inputs, request_id=str(1), outputs=outputs)
output_data = json.loads(response.as_numpy("OUTPUT")[0].decode("utf-8"))
except Exception as e:
raise Exception(f"exception in table det predict: [{e}]")

return output_data
11 changes: 3 additions & 8 deletions src/bisheng_unstructured/staging/prodigy.py
@@ -20,18 +20,15 @@ def _validate_prodigy_metadata(
if len(metadata) != len(elements):
raise ValueError(
"The length of the metadata parameter does not match with"
" the length of the elements parameter.",
)
" the length of the elements parameter.", )
id_error_index: Optional[int] = next(
(index for index, metadatum in enumerate(metadata) if "id" in metadatum),
None,
)
if isinstance(id_error_index, int):
raise ValueError(
'The key "id" is not allowed with metadata parameter at index: {index}'.format(
index=id_error_index,
),
)
index=id_error_index, ), )
validated_metadata = metadata
else:
validated_metadata = [{} for _ in elements]
@@ -72,9 +69,7 @@ def stage_csv_for_prodigy(
csv_fieldnames = ["text", "id"]
csv_fieldnames += list(
set().union(
*((key.lower() for key in metadata_item) for metadata_item in validated_metadata),
),
)
*((key.lower() for key in metadata_item) for metadata_item in validated_metadata), ), )

def _get_rows() -> Generator[Dict[str, str], None, None]:
for element, metadatum in zip(elements, validated_metadata):
13 changes: 12 additions & 1 deletion src/bisheng_unstructured/topdf/docx2pdf.py
@@ -3,6 +3,7 @@
import signal
import subprocess

from bisheng_unstructured import utils
from bisheng_unstructured.partition.common import convert_office_doc


@@ -18,6 +19,16 @@ def __init__(self, kwargs={}):
-V CJKmonofont="Cascadia Mono"
"""

if utils.get_architecture() == "ARM":
cmd_template = """
pandoc -o {1} --pdf-engine=xelatex {0}
-V mainfont="Alibaba PuHuiTi 3.0"
-V sansfont="Alibaba PuHuiTi 3.0"
-V monofont="Cascadia Mono"
-V CJKmainfont="Alibaba PuHuiTi 3.0"
-V CJKsansfont="Alibaba PuHuiTi 3.0"
-V CJKmonofont="Cascadia Mono"
"""
def _norm_cmd(cmd):
return " ".join([p.strip() for p in cmd.strip().split()])

@@ -74,7 +85,7 @@ def run(cmd):
stderr=subprocess.PIPE,
stdout=subprocess.PIPE,
)
exit_code = p.wait(timeout=30)
exit_code = p.wait(timeout=300)
if exit_code != 0:
stdout, stderr = p.communicate()
raise Exception(
26 changes: 23 additions & 3 deletions src/bisheng_unstructured/topdf/excel2pdf.py
@@ -10,22 +10,25 @@
class ExcelToPDF(object):
def __init__(self, kwargs={}):
cmd_template = """
soffice -env:SingleAppInstance=\"false\" -env:UserInstallation=\"file://{1}\" --convert-to
"pdf:calc_pdf_Export:{{\"SinglePageSheets\":{{\"type\":\"boolean\",\"value\":\"true\"}}}}"
--outdir \"{1}\" \"{0}\"
soffice -env:SingleAppInstance=\"false\" -env:UserInstallation=\"file://{1}\" --convert-to html --outdir \"{1}\" \"{0}\"
"""
cmd_template2 = """
soffice --headless -env:SingleAppInstance=\"false\" -env:UserInstallation=\"file://{1}\" --convert-to xlsx --outdir \"{1}\" \"{0}\"
"""

cmd_template3 = 'sed -e \'s/\t/,/g\' "{0}" > "{1}"'

cmd_template4 = """
wkhtmltopdf --disable-javascript --disable-local-file-access --disable-external-links --no-images "{0}" "{1}"
"""

def _norm_cmd(cmd):
return " ".join([p.strip() for p in cmd.strip().split()])

self.cmd_template = _norm_cmd(cmd_template)
self.cmd_template2 = _norm_cmd(cmd_template2)
self.cmd_template3 = cmd_template3
self.cmd_template4 = cmd_template4

@staticmethod
def run(cmd):
@@ -79,8 +82,25 @@ def render(self, input_file, output_file=None, to_bytes=False):
input_file = os.path.join(temp_dir, filename)
wb.save(input_file)

# First convert the Excel file to HTML
cmd = self.cmd_template.format(input_file, temp_dir)
ExcelToPDF.run(cmd)
html_file_path = os.path.join(temp_dir, filename.rsplit(".", 1)[0] + ".html")
with open(html_file_path, "r+", encoding="utf-8") as f:
html_content = f.readlines()
for index, one in enumerate(html_content):
if one.find("text/css") != -1:
html_content.insert(
index + 1,
"table {word-break: break-word;}\ntable td{word-break: break-all;border: 1px solid #000000; padding: 3px;}",
)
break
f.seek(0)
f.writelines(html_content)

# Then convert the HTML to PDF
cmd = self.cmd_template4.format(html_file_path, temp_output_file)
ExcelToPDF.run(cmd)

if output_file is not None:
shutil.move(temp_output_file, output_file)
18 changes: 18 additions & 0 deletions src/bisheng_unstructured/topdf/text2pdf.py
@@ -1,11 +1,13 @@
import os
from platform import platform
import shutil
import signal
import subprocess
import tempfile
from html import parser
from typing import Tuple

from bisheng_unstructured import utils
import lxml.html
import numpy as np
from lxml import etree
@@ -101,6 +103,20 @@ def __init__(self, kwargs={}):
-V CJKmonofont="Adobe Heiti Std"
"""

if utils.get_architecture() == "ARM":
cmd_template = """
pandoc -o {1} --pdf-engine=xelatex
--lua-filter=/opt/pandoc/unnested-table.lua
--template /opt/pandoc/pandoc-3.1.9/share/templates/default.latex
{0}
-V mainfont="Alibaba PuHuiTi 3.0"
-V sansfont="Alibaba PuHuiTi 3.0"
-V monofont="Adobe Heiti Std"
-V CJKmainfont="Alibaba PuHuiTi 3.0"
-V CJKsansfont="Alibaba PuHuiTi 3.0"
-V CJKmonofont="Adobe Heiti Std"
"""

cmd_template2 = """
soffice --headless -env:SingleAppInstance=\"false\" -env:UserInstallation=\"file://{1}\" --convert-to pdf --outdir \"{1}\" \"{0}\"
"""
@@ -128,6 +144,8 @@ def run(cmd: str, timeout: int = 30):
stderr=subprocess.PIPE,
stdout=subprocess.PIPE,
)
if utils.get_architecture() == "ARM":
timeout = 3000
exit_code = p.wait(timeout=timeout)
if exit_code != 0:
stdout, stderr = p.communicate()
10 changes: 10 additions & 0 deletions src/bisheng_unstructured/utils.py
@@ -2,6 +2,7 @@
import json
from datetime import datetime
from functools import wraps
import platform
from typing import Dict, List, Optional, Union

DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d+%H:%M:%S", "%Y-%m-%dT%H:%M:%S%z")
@@ -11,6 +12,15 @@ def save_as_jsonl(data: List[Dict], filename: str) -> None:
with open(filename, "w+") as output_file:
output_file.writelines(json.dumps(datum) + "\n" for datum in data)

def get_architecture():
machine = platform.machine()
if 'x86' in machine or 'i686' in machine or 'i386' in machine:
return "x86"
elif 'arm' in machine or 'aarch64' in machine:
return "ARM"
else:
return "x86"


def read_from_jsonl(filename: str) -> List[Dict]:
with open(filename) as input_file:
56 changes: 56 additions & 0 deletions tests/test_idp_models_sdk.py
@@ -0,0 +1,56 @@
# flake8: noqa
import base64
import hashlib
import json
import os

import pytest

from bisheng_unstructured.models.layout_agent import LayoutAgent
from bisheng_unstructured.models.ocr_agent import OCRAgent
from bisheng_unstructured.models.table_agent import TableAgent, TableDetAgent

configs = dict(
    layout_ep="http://192.168.106.20:10502/v2/models/elem_layout_v1/infer",
    cell_model_ep="http://192.168.106.20:10502/v2/models/elem_table_cell_detect_v1/infer",
    rowcol_model_ep="http://192.168.106.20:10502/v2/models/elem_table_rowcol_detect_v1/infer",
    table_model_ep="http://192.168.106.20:10502/v2/models/elem_table_detect_v1/infer",
    ocr_model_ep="http://192.168.106.20:10502/v2/idp/idp_app/infer",
)


# @pytest.mark.skip
def test_layout():
    layout_agent = LayoutAgent(**configs)

    image_file = "data/001.png"
    b64_image = base64.b64encode(open(image_file, "rb").read()).decode("utf-8")
    inp = {"b64_image": b64_image}
    result = layout_agent.predict(inp)
    print("result", result)


# @pytest.mark.skip
def test_ocr():
    ocr_agent = OCRAgent(**configs)

    image_file = "data/001.png"
    b64_image = base64.b64encode(open(image_file, "rb").read()).decode("utf-8")
    inp = {"b64_image": b64_image}
    result = ocr_agent.predict(inp)
    print("result", result)


def test_table_det():
    table_det_agent = TableDetAgent(**configs)
    table_agent = TableAgent(**configs)
    ocr_agent = OCRAgent(**configs)

    image_file = "data/001.png"
    b64_image = base64.b64encode(open(image_file, "rb").read()).decode("utf-8")
    inp = {"b64_image": b64_image}
    table_bboxes = table_det_agent.predict(inp)["bboxes"]
    ocr_result = json.dumps(ocr_agent.predict(inp)["result"]["ocr_result"])
    inp = {"b64_image": b64_image, "table_bboxes": table_bboxes, "ocr_result": ocr_result}
    table_result = table_agent.predict(inp)
    print("table_result", table_result)
29 changes: 27 additions & 2 deletions tests/test_image.py
@@ -91,7 +91,32 @@ def test_regress():
assert s1 == s2


test_image3()
def test_image4():
    url = "http://192.168.106.20:10502/v2/models/"
    layout_ep = url + "elem_layout_v1/infer"
    cell_model_ep = url + "elem_table_cell_detect_v1/infer"
    rowcol_model_ep = url + "elem_table_rowcol_detect_v1/infer"
    table_model_ep = url + "elem_table_multiclass_v1/infer"

    model_params = {
        "layout_ep": layout_ep,
        "cell_model_ep": cell_model_ep,
        "rowcol_model_ep": rowcol_model_ep,
        "table_model_ep": table_model_ep,
        "ocr_model_ep": "http://192.168.106.20:10502/v2/idp/idp_app/infer",
    }
    print("model_params", model_params)

    filename = "examples/docs/table_test_001.jpg"
    doc = ImageDocument(file=filename, model_params=model_params, rt_type="sdk")
    pages = doc.pages
    elements = doc.elements

    save_to_txt(elements, "data/table_test_001.txt")


test_image4()
# test_image3()
# test_image2()
# test_image()
test_regress()
# test_regress()
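
Review note on the endpoint setup in these tests: `test_image4`, `test_pdf_doc10`, and `configs` in tests/test_idp_models_sdk.py each assemble the same five endpoint keys by hand from a models base URL, with the OCR endpoint living under a separate `/v2/idp/` path. The sketch below shows how that assembly could be shared. `build_model_params` is a hypothetical helper that does not exist in the repository; the model names and key names are the ones that appear in the diff.

```python
def build_model_params(models_url, ocr_ep, table_model="elem_table_multiclass_v1"):
    """Hypothetical helper: assemble the model_params dict used by the tests.

    models_url is the ".../v2/models/" base; ocr_ep is passed in full because
    the diff serves OCR from a different path than the other models.
    """
    base = models_url.rstrip("/") + "/"
    return {
        "layout_ep": base + "elem_layout_v1/infer",
        "cell_model_ep": base + "elem_table_cell_detect_v1/infer",
        "rowcol_model_ep": base + "elem_table_rowcol_detect_v1/infer",
        "table_model_ep": base + table_model + "/infer",
        "ocr_model_ep": ocr_ep,
    }


if __name__ == "__main__":
    params = build_model_params(
        "http://192.168.106.20:10502/v2/models/",
        "http://192.168.106.20:10502/v2/idp/idp_app/infer",
    )
    print("model_params", params)
```

Passing `table_model="elem_table_detect_v1"` would reproduce the table endpoint used in tests/test_idp_models_sdk.py.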
18 changes: 13 additions & 5 deletions tests/test_pdf_parser.py
@@ -236,28 +236,36 @@ def test_regress():


 def test_pdf_doc10():
-    url = TEST_RT_URL
+    # url = TEST_RT_URL
+    url = "http://192.168.106.20:10502/v2/models/"
     layout_ep = url + "elem_layout_v1/infer"
     cell_model_ep = url + "elem_table_cell_detect_v1/infer"
     rowcol_model_ep = url + "elem_table_rowcol_detect_v1/infer"
-    table_model_ep = url + "elem_table_detect_v1/infer"
+    # table_model_ep = url + "elem_table_detect_v1/infer"
+    table_model_ep = url + "elem_table_multiclass_v1/infer"

     model_params = {
         "layout_ep": layout_ep,
         "cell_model_ep": cell_model_ep,
         "rowcol_model_ep": rowcol_model_ep,
         "table_model_ep": table_model_ep,
-        "ocr_model_ep": f"{TEST_RT_URL}elem_ocr_collection_v3/infer",
+        "ocr_model_ep": "http://192.168.106.125:10502/v2/idp/idp_app/infer",
     }
     print("model_params", model_params)

     filename = "examples/docs/南陵电子2022.pdf"
     pdf_doc = PDFDocument(
-        file=filename, model_params=model_params, start=0, verbose=True, n_parallel=10
+        file=filename,
+        model_params=model_params,
+        start=0,
+        verbose=True,
+        n_parallel=10,
+        n=50,
+        mode="server",
     )
     pages = pdf_doc.pages
     elements = pdf_doc.elements
-    visualize_html(elements, "data/南陵电子2022-2.html")
+    # visualize_html(elements, "data/南陵电子2022-2.html")
     save_to_txt(elements, "data/南陵电子2022-2.txt")