
add batch func #2

Open · wants to merge 5 commits into main

Conversation

@zengzzzzz commented Mar 2, 2023

Please review it, thanks.
1. self.new_epub is not used, so delete it.
2. Add a batch function with multiprocessing.
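A minimal sketch of the multiprocessing idea, assuming a hypothetical translate_text helper in place of the real API call (an illustration, not the PR's actual make.py changes):

from multiprocessing import Pool

def translate_text(text):
    # Hypothetical stand-in for the real ChatGPT API call.
    return "translated: " + text

def translate_batch(paragraphs, max_workers=4):
    # A bounded pool: at most max_workers processes run at once.
    with Pool(processes=max_workers) as pool:
        return pool.map(translate_text, paragraphs)

if __name__ == "__main__":
    print(translate_batch(["first paragraph", "second paragraph"]))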

@yihong0618 (Owner)

@zengzzzzz thanks, will take a look tonight.

@yihong0618 (Owner) left a comment

Looks very good and clean for now, will test in my env tonight.

Thanks a lot.

@yihong0618 (Owner)

There is a small error in one place; I've fixed it on main. You can merge main in first.

@zengzzzzz (Author)

> There is a small error in one place; I've fixed it on main. You can merge main in first.

OK, sorry for the trouble. Thank you.

@zengzzzzz (Author)

I probably don't have permission to merge the branch. Could you please merge it for me?

@yihong0618 (Owner)

@zengzzzzz I've fixed it. Pull the code and take a look at the href issue I commented on.

@zengzzzzz (Author)

> @zengzzzzz I've fixed it. Pull the code and take a look at the href issue I commented on.

OK.

@zengzzzzz (Author)

@yihong0618 I fixed the href error, please check it again. Thanks.

@yihong0618 (Owner)

@zengzzzzz I feel that some restrictions should be added; right now, running it on my computer hangs everything.

@yihong0618 (Owner)

Tested with test_books/lemo.epub and found some problems:

  1. the whole terminal hangs (I am using kitty)
  2. it seems to run at the same speed as the non-batch version
  3. an error occurred: [screenshot of the error]

@yihong0618 (Owner)

IMO we'd better use the <p1> + <p2> + <p3> -> ChatGPT API -> <p1_t> + <p2_t> + <p3_t> batch approach, so that it works well for every user.

@zengzzzzz (Author)

> IMO we'd better use the <p1> + <p2> + <p3> -> ChatGPT API -> <p1_t> + <p2_t> + <p3_t> batch approach, so that it works well for every user.

I will try this approach.
1. I will add restrictions (e.g. max workers) to avoid hanging.
2. Tested on my machine, it is faster, but its speed depends on the API's response time.
3. This seems to be an error generated by OpenAI; I will verify it.

@zengzzzzz (Author)

> IMO we'd better use the <p1> + <p2> + <p3> -> ChatGPT API -> <p1_t> + <p2_t> + <p3_t> batch approach, so that it works well for every user.

You are right. If we want everyone to use it well, we should try the method you mentioned.

@zengzzzzz (Author)

> IMO we'd better use the <p1> + <p2> + <p3> -> ChatGPT API -> <p1_t> + <p2_t> + <p3_t> batch approach, so that it works well for every user.

I think the method you mentioned has these advantages:
1. fewer tokens, lower cost
2. less likely to get stuck
3. everyone can use it well
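A rough sketch of that concatenation approach, assuming a hypothetical translate_text helper that sends a single request; a real model reply would need more robust parsing than this marker-based split:

def translate_in_one_call(paragraphs, translate_text):
    # Join the batch with <pN> markers so one API request covers all
    # paragraphs, then split the reply back into per-paragraph pieces.
    prompt = "\n".join(f"<p{n}> {p}" for n, p in enumerate(paragraphs, 1))
    reply = translate_text(prompt)
    translated = {}
    for line in reply.splitlines():
        head, sep, text = line.partition("> ")
        if sep and head.startswith("<p"):
            translated[head + ">"] = text
    # Fall back to the original text if a marker is missing in the reply.
    return [translated.get(f"<p{n}>", p) for n, p in enumerate(paragraphs, 1)]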

@zengzzzzz commented Mar 4, 2023

I implemented the batch translate method, but there are still some flaws in the output format, so I decided to use it in my own project. If you are interested or need it, you can refer to the following implementation:

from bs4 import BeautifulSoup as bs
from ebooklib import ITEM_DOCUMENT, epub

# Method of the book-translating class; relies on self.origin_book,
# self.batch_size, self.translate_model, and self.epub_name.
def translate_book(self):
    new_book = epub.EpubBook()
    new_book.metadata = self.origin_book.metadata
    new_book.spine = self.origin_book.spine
    new_book.toc = self.origin_book.toc
    batch_p = []
    for i in self.origin_book.get_items():
        if i.get_type() == ITEM_DOCUMENT:
            soup = bs(i.content, "html.parser")
            p_list = soup.find_all("p")
            for p in p_list:
                if p.text and not p.text.isdigit():
                    batch_p.append(p)
                    if len(batch_p) == self.batch_size:
                        # Translate a full batch in one call and append
                        # each translation after its source paragraph.
                        translated_batch = self.translate_model.translate(
                            [p.text for p in batch_p]
                        )
                        for j, c_p in enumerate(batch_p):
                            c_p.string = c_p.text + translated_batch[j]
                        batch_p = []
            # Process any remaining paragraphs in the last batch.
            if batch_p:
                translated_batch = self.translate_model.translate(
                    [p.text for p in batch_p]
                )
                for j, c_p in enumerate(batch_p):
                    c_p.string = c_p.text + translated_batch[j]
                batch_p = []
            i.content = soup.prettify().encode()
        new_book.add_item(i)
    name = self.epub_name.split(".")[0]
    epub.write_epub(f"{name}_translated.epub", new_book, {})

@yihong0618 (Owner)

Thanks, I will keep this PR open; if I use your code, I will commit it here.
Thanks again.

@zengzzzzz (Author)

> Thanks, I will keep this PR open; if I use your code, I will commit it here. Thanks again.

Thank you, bro.

@DennySORA

I think parts of this could be switched to an async model, on top of what you have here.

After all, most of the current bottleneck is in the OpenAI API and the network.

@zengzzzzz commented Mar 4, 2023

> I think parts of this could be switched to an async model, on top of what you have here.
>
> After all, most of the current bottleneck is in the OpenAI API and the network.

You are probably right. One reason for not using async, multiprocessing, etc. is the QPS limitation of the OpenAI API (20 requests per minute), which makes the performance improvement less noticeable.
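For context, a tiny sketch of a client-side throttle for such a per-minute cap; the 20-requests-per-minute figure is taken from the comment above, and this is illustrative rather than code from the PR:

import time

class MinuteRateLimiter:
    # Allow at most `limit` calls to start in any rolling 60-second window.
    def __init__(self, limit=20):
        self.limit = limit
        self.starts = []  # timestamps of recent call starts

    def wait(self):
        now = time.monotonic()
        self.starts = [t for t in self.starts if now - t < 60]
        if len(self.starts) >= self.limit:
            # Sleep until the oldest call in the window is a minute old.
            time.sleep(60 - (now - self.starts[0]))
        self.starts.append(time.monotonic())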

@tianshanghong

> You are probably right. One reason for not using async, multiprocessing, etc. is the QPS limitation of the OpenAI API (20 requests per minute), which makes the performance improvement less noticeable.

It looks like the rate limit from OpenAI may not be a problem now. For pay-as-you-go users (after 48 hours), OpenAI allows 3,500 requests per minute and 90,000 tokens per minute.

I created a new PR #62 to support batch processing with asyncio.
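For reference, a minimal asyncio sketch in the same spirit, using a semaphore to cap in-flight requests; translate_async is a hypothetical stand-in, not the actual code from PR #62:

import asyncio

async def translate_async(text):
    # Hypothetical stand-in for an async ChatGPT API call.
    await asyncio.sleep(0.1)
    return "translated: " + text

async def translate_all(paragraphs, max_concurrency=10):
    # Cap concurrent requests so we stay under the per-minute budget.
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(text):
        async with sem:
            return await translate_async(text)

    return await asyncio.gather(*(worker(p) for p in paragraphs))

if __name__ == "__main__":
    print(asyncio.run(translate_all(["first paragraph", "second paragraph"])))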

wayhome pushed a commit to wayhome/bilingual_book_maker that referenced this pull request Aug 29, 2024

1. Supported customizing the translation text position and deleting original content. resolved yihong0618#7
2. Supported excluding original content by keyword and regular expression. resolved yihong0618#2
3. Added Baidu and Youdao translation engines. resolved yihong0618#3
4. Changed to save translated ebooks as a new book in the Calibre library.
5. Supported customizing the color of the translation text.
6. Supported customizing the ChatGPT prompt. resolved yihong0618#4
7. Ignored translating phonetic symbols (e.g. Japanese). fixed yihong0618#3
8. Added Spanish as a supported interface language. resolved yihong0618#5
9. Added diagnosis information to the log.
10. Added a "lang" attribute to the translation element.
11. Fixed plugin icon disappearance when changing the Calibre interface language.
12. Improved the functionality to extract original text.