There is a huge difference in the search results of pg_cjk_parser and zhparser #10

Open
aaro-n opened this issue May 18, 2024 · 0 comments

Introduction

I'm not a computer professional; I just want the Miniflux instance I set up to be able to search more content.

Deployment method

For both pg_cjk_parser and zhparser, I build an image and run it through docker-compose.yml.

Search test

With Miniflux running against the pg_cjk_parser database and the zhparser database in turn, I imported the same RSS feeds and forced a refresh. After waiting a while, the pg_cjk_parser instance had 7829 unread items and the zhparser instance had 7824.
I then searched in both Chinese and English and found that pg_cjk_parser returned far fewer results than zhparser. For example, searching for Hangzhou returned 28 entries with pg_cjk_parser and 123 with zhparser. English searches also differ between the two. In general, pg_cjk_parser finds fewer entries.
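
To narrow down where the counts diverge, queries like the following could be run in pgAdmin against each database. This is only a sketch: the sample phrase is arbitrary, the configuration names are the ones created further down in this post, and the last query assumes that Miniflux keeps its search index in a `document_vectors` column on the `entries` table, which has not been verified for this version.

```sql
-- pg_cjk_parser database: tokens produced for a sample phrase.
SELECT to_tsvector('public.config_2_gram_cjk', '杭州 Hangzhou weather');

-- zhparser database: the same phrase with the zhparser configuration.
SELECT to_tsvector('chinese', '杭州 Hangzhou weather');

-- Either database: tokens produced when no configuration is named,
-- i.e. with whatever default_text_search_config the session has.
SELECT to_tsvector('杭州 Hangzhou weather');

-- Either database: count matches directly, bypassing the Miniflux UI.
-- ASSUMPTION: table "entries" and column "document_vectors" exist.
SELECT count(*)
FROM entries
WHERE document_vectors @@ plainto_tsquery('杭州');
```

If the first and third queries return different tokens in the pg_cjk_parser database, the stored index was probably not built with the CJK configuration.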

Questions

  1. I installed several additional extensions in the zhparser image. Could these affect the results?
  2. Is pg_cjk_parser enabled correctly?
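
Both questions can be partly checked from the PostgreSQL system catalogs. The queries below are a minimal sketch using standard catalogs (`pg_ts_parser`, `pg_ts_config`, `pg_extension`); they show what is registered, not whether Miniflux actually uses it.

```sql
-- Which text search parsers are registered in this database?
SELECT prsname FROM pg_ts_parser;

-- Which text search configurations exist, and which parser does each use?
SELECT c.cfgname, p.prsname
FROM pg_ts_config AS c
JOIN pg_ts_parser AS p ON p.oid = c.cfgparser;

-- Which configuration does the current session use by default?
SHOW default_text_search_config;

-- Which extensions are installed (relevant to question 1)?
SELECT extname, extversion FROM pg_extension;
```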

Appendix

pg_cjk_parser

docker-compose.yml

version: '3'
services:
   postgres:
     build:
       context: /home/www/miniflux/pg_cjk_parser
       dockerfile: Dockerfile
     image: pg_cjk_parser
     container_name: postgres
     environment:
       - POSTGRES_PASSWORD=vg9QUVgeyjsQq6jwi5PsHZSDKWFReFZzKzNaPZ8X
     volumes:
       - /home/www/miniflux/pg_cjk_parser/database:/var/lib/postgresql/data
     networks:
       - pg_cjk_parser

   pgAdmin:
     restart: always
     image: dpage/pgadmin4
     ports:
       - "8085:80"
     environment:
       PGADMIN_DEFAULT_EMAIL: [email protected]
       PGADMIN_DEFAULT_PASSWORD: 123
     networks:
       - pg_cjk_parser

   miniflux:
     restart: always
     image: miniflux/miniflux:latest
     ports:
       - "8086:8080"
     environment:
       - DATABASE_URL=postgres://postgres:vg9QUVgeyjsQq6jwi5PsHZSDKWFReFZzKzNaPZ8X@postgres:5432/postgres?sslmode=disable
       - RUN_MIGRATIONS=1
       - CREATE_ADMIN=1
       - ADMIN_USERNAME=admin
       - ADMIN_PASSWORD=test123
     networks:
       - pg_cjk_parser

networks:
   pg_cjk_parser:
     driver: bridge

Dockerfile

ARG POSTGRES_VERSION=16
FROM postgres:$POSTGRES_VERSION as build
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates git postgresql-server-dev-16 gcc make icu-devtools libicu-dev

RUN mkdir -p /root/parser
WORKDIR /root/parser
RUN git clone https://github.com/huangjimmy/pg_cjk_parser.git /tmp/pg_cjk_parser && \
     cp /tmp/pg_cjk_parser/pg_cjk_parser.c /tmp/pg_cjk_parser/pg_cjk_parser.control /tmp/pg_cjk_parser/Makefile /tmp/pg_cjk_parser/pg_cjk_parser--0.0.1.sql /tmp/pg_cjk_parser/zht2zhs.h /root/parser/ && \
     make clean && make install

FROM postgres:$POSTGRES_VERSION
ARG POSTGRES_VERSION=16
COPY --from=build /root/parser/pg_cjk_parser.bc /usr/lib/postgresql/$POSTGRES_VERSION/lib/bitcode
COPY --from=build /root/parser/pg_cjk_parser.so /usr/lib/postgresql/$POSTGRES_VERSION/lib
COPY --from=build /root/parser/pg_cjk_parser--0.0.1.sql /usr/share/postgresql/$POSTGRES_VERSION/extension
COPY --from=build /root/parser/pg_cjk_parser.control /usr/share/postgresql/$POSTGRES_VERSION/extension

  1. Build the image and run it, log in with pgAdmin, and configure the connection to the database.
  2. Select CREATE script on the postgres database; a new window will pop up.
  3. Clear the window content and enter:

CREATE TEXT SEARCH PARSER public.pg_cjk_parser (
     START = prsd2_cjk_start,
     GETTOKEN = prsd2_cjk_nexttoken,
     END = prsd2_cjk_end,
     LEXTYPES = prsd2_cjk_lextype,
     HEADLINE = prsd2_cjk_headline);

CREATE TEXT SEARCH CONFIGURATION public.config_2_gram_cjk (
     PARSER = pg_cjk_parser
);

SET default_text_search_config = 'public.config_2_gram_cjk';

Click Run, clear the input, then enter:

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR asciihword
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR cjk
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR email
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR asciiword
     WITH english_stem;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR entity
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR file
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR float
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR host
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR hword
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR hword_asciipart
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR hword_numpart
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR hword_part
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR int
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR numhword
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR numword
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR protocol
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR sfloat
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR tag
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
     ADD MAPPING FOR uint
     WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
    ADD MAPPING FOR url
    WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
    ADD MAPPING FOR url_path
    WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
    ADD MAPPING FOR version
    WITH simple;

ALTER TEXT SEARCH CONFIGURATION public.config_2_gram_cjk
    ADD MAPPING FOR word
    WITH simple;

Click Run.

  4. Log in to Miniflux through the normal installation process and import RSS feeds.
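
One possible gap in the steps above: `SET` changes a parameter only for the session in which it is run, so the pgAdmin query window gets the new default but other connections do not. If Miniflux relies on the database default rather than naming a configuration (an assumption, not verified here), the setting would need to be stored on the database itself. A sketch, using the database name `postgres` from the DATABASE_URL above:

```sql
-- Persist the default text search configuration for all new sessions
-- that connect to the "postgres" database.
ALTER DATABASE postgres
    SET default_text_search_config = 'public.config_2_gram_cjk';

-- Open a NEW query window (new session) and confirm the value.
SHOW default_text_search_config;
```

Entries indexed before such a change would still hold vectors built with the old configuration, so a comparison is only meaningful for entries fetched afterwards.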

zhparser

docker-compose.yml

version: '3'
services:
  postgres:
    build:
      context: /home/www/miniflux/zhparser
      dockerfile: Dockerfile
    image: zhparser
    container_name: zhparser
    environment:
      - POSTGRES_PASSWORD=vg9QUVgeyjsQq6jwi5PsHZSDKWFReFZzKzNaPZ8X
    volumes:
      - /home/www/miniflux/zhparser/database:/var/lib/postgresql/data
    networks:
      - zhparser

  pgAdmin:
    restart: always
    image: dpage/pgadmin4
    ports:
      - "8095:80"
    environment:
      PGADMIN_DEFAULT_EMAIL: [email protected]
      PGADMIN_DEFAULT_PASSWORD: 123
    networks:
      - zhparser

  miniflux:
    restart: always
    image: miniflux/miniflux:latest 
    ports:
      - "8096:8080"
    environment:
      - DATABASE_URL=postgres://postgres:vg9QUVgeyjsQq6jwi5PsHZSDKWFReFZzKzNaPZ8X@zhparser:5432/postgres?sslmode=disable
      - RUN_MIGRATIONS=1
      - CREATE_ADMIN=1
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=test123
    networks:
      - zhparser

networks:
  zhparser:
    driver: bridge

Dockerfile

FROM postgres:16-bullseye

LABEL maintainer="loyayz - https://loyayz.com"

ENV SCWS_URL http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
ENV ZHPARSER_URL https://github.com/amutu/zhparser/archive/refs/heads/master.tar.gz
ENV SAFEUPDATE_URL https://github.com/eradman/pg-safeupdate/archive/master.tar.gz
ENV PG_CRON_URL https://github.com/citusdata/pg_cron/archive/main.tar.gz

RUN apt-get update \
      && apt-get install -y --no-install-recommends \
           ca-certificates \
           wget \
           bzip2 \
           make \
           gcc \
           libc6-dev \
           postgresql-16-cron \
           postgresql-server-dev-$PG_MAJOR \
           \
      ## install scws
      && cd / \
      && wget -q -O - $SCWS_URL | tar xjf - \
      && SCWS_DIR=${SCWS_URL##*/} \
      && SCWS_DIR=${SCWS_DIR%%.tar*} \
      && cd $SCWS_DIR && ./configure && make install \
      ## install zhparser
      && cd / \
      && wget -q -O - $ZHPARSER_URL | tar xzf - \
      && cd zhparser-master && make install \
      ## install pg-safeupdate
      && cd / \
      && wget -q -O - $SAFEUPDATE_URL | tar xzf - \
      && cd pg-safeupdate-master && gmake && make install \
      # clean
      && apt-get purge -y \
            bzip2 \
            make \
            gcc \
            libc6-dev \
            postgresql-server-dev-$PG_MAJOR \
      && apt-get autoremove --purge -y \
      && rm -rf /$SCWS_DIR \
            /zhparser-master \
            /pg-safeupdate-master \
            /var/lib/apt/lists/*

RUN mkdir -p /docker-entrypoint-initdb.d \
    && wget https://raw.githubusercontent.com/aaro-n/miniflux-postgres/main/config/init_extension.sh -O /docker-entrypoint-initdb.d/init_extension.sh
RUN chmod 755 /docker-entrypoint-initdb.d/init_extension.sh

  1. Build the image and run it, log in with pgAdmin, and configure the connection to the database.
  2. Select CREATE script on the postgres database; a new window will pop up.
  3. Clear the window content and enter:

CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);

Click Run, clear the input, then enter:

ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l WITH simple;

Click Run again, clear the input, then enter:

CREATE EXTENSION pg_stat_statements;

Click Run.

  4. Log in to Miniflux through the normal installation process and import RSS feeds.
