add Uukanshu (www & tw subdomains) #2264

Merged
merged 2 commits into from
Feb 12, 2024

Conversation

camp00000

Closes #2263

There was already an existing parser, but it targeted the very different-looking sj subdomain, so I renamed that file and class and reused its existing custom text filter.
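
A minimal sketch of how a text filter exposed as a staticmethod can be reused by a second parser. The class names, the `clean_paragraph` helper, and the filter rules below are hypothetical stand-ins, not the actual code in this PR; only the idea of `format_text` being a staticmethod comes from the commit title.

```python
import re


class UukanshuSJ:
    """Hypothetical stand-in for the renamed sj-subdomain parser."""

    @staticmethod
    def format_text(text: str) -> str:
        # Illustrative filter only (assumption): drop site-name watermarks
        # and collapse runs of whitespace into single spaces.
        text = re.sub(r"(?i)uukanshu", "", text)
        return re.sub(r"\s+", " ", text).strip()


class Uukanshu:
    """Hypothetical stand-in for the new www/tw parser."""

    def clean_paragraph(self, text: str) -> str:
        # A staticmethod is callable on the class itself, so the new parser
        # can reuse the filter without instantiating the sj parser.
        return UukanshuSJ.format_text(text)
```

Because the filter needs no instance state, calling it through the class keeps the two parsers decoupled apart from that one shared function.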

I split the change into separate commits hoping to preserve the existing git history of the previous sj subdomain crawler, but judging by this PR's changelist that does not seem to have worked. Reviewing this PR in the per-commit view makes the most sense.

The www subdomain returns Simplified Chinese in GBK encoding; the tw subdomain returns Traditional Chinese in UTF-8 encoding.
I tried to explain this in the code comments wherever it might be confusing.
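
The encoding split described above could be handled roughly as follows. This is a sketch under assumptions, not the code in this PR: the function names, the `tw.uukanshu.net` hostname, and the UTF-8 fallback for unknown subdomains are all illustrative; only the www/GBK and tw/UTF-8 pairing comes from the description.

```python
from urllib.parse import urlparse

# Pairing taken from the PR description; the lookup mechanism is an assumption.
ENCODING_BY_SUBDOMAIN = {"www": "gbk", "tw": "utf-8"}


def pick_encoding(url: str, default: str = "utf-8") -> str:
    """Return the text encoding to use for a page, based on its subdomain."""
    host = urlparse(url).hostname or ""
    subdomain = host.split(".")[0]
    return ENCODING_BY_SUBDOMAIN.get(subdomain, default)


def decode_page(url: str, raw: bytes) -> str:
    """Decode raw response bytes with the encoding chosen for this URL."""
    # errors="replace" keeps one bad byte from aborting a whole chapter.
    return raw.decode(pick_encoding(url), errors="replace")
```

Decoding from raw bytes with an explicit encoding avoids relying on the HTTP client's charset guess, which can misread GBK pages.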

Tested with a fairly large novel, https://www.uukanshu.net/b/86005/#gsc.tab=0, which has over 6k chapters and is split into 9 novels. I tested both the tw and www subdomains, and the output matches the website content.

ACA added 2 commits February 9, 2024 21:54
…simplified cn)

uukanshu_sj: rename class & make format_text a staticmethod
@dipu-bd dipu-bd merged commit 3fe8460 into dipu-bd:dev Feb 12, 2024
4 checks passed
2 participants