Reader bugs out if Japanese characters are in archive folder name #145

Closed
Twilicious opened this issue Jul 28, 2019 · 25 comments

Comments


Twilicious commented Jul 28, 2019

I want to start by saying thanks for this amazing program. It's been invaluable for organizing and tagging my collection in the wake of sadpanda dying.
I saw a couple of older issues dealing with similar Unicode problems, but this seems different.

LRR Version and OS
0.6.0. beta 2
Latest Win 10 Pro
Whatever the powershell script installed

Bug Details
The program works great for any archive with an English name, but if the title contains any Unicode characters the reader bugs out. When you open an affected archive it says there is no thumbnail, and if you try to advance a page the loading gear comes up and freezes. I've individually opened the affected archives, and both the folder titles and the images inside seem fine and uncorrupted. The affected archives also have correct thumbnails on the main page.

Matching Logs
The only actual error message I saw was when I pressed "regenerate archive thumbnail":

[2019-07-27 20:27:41] [Hash Computation] [error] Error building hash for /home/koyomi/lanraragi/script/../public/temp/5236544cd1d197486aa09129866935ff71883bef/(C91) [������ (�����)] �森峰�辱 (����&�����)/01.png -- Open failed: No such file or directory at /home/koyomi/lanraragi/script/../lib/LANraragi/Utils/Generic.pm line 93.
[2019-07-27 20:27:41] [LANraragi] [debug] Thumbnail not found at /mnt/d/pron/Hentai/Doujins/thumb/5236544cd1d197486aa09129866935ff71883bef.jpg ! (force-thumb flag = 1)
[2019-07-27 20:27:41] [LANraragi] [debug] Regenerating from /home/koyomi/lanraragi/script/../public/temp/5236544cd1d197486aa09129866935ff71883bef/(C91) [������ (�����)] �森峰�辱 (����&�����)/01.png

Error log after opening Japanese titled archive

Screenshots
Reader page of an archive with Japanese characters in it
Trying to advance the reader
Looking at all pages
Logs of opening an English archive vs a Japanese one

EDIT:
Found one archive with a kanji title that would open, logs look different.

Twilicious changed the title from "Reader bugs out if Japanese characters are in archive title" to "Reader bugs out if Japanese characters are in archive folder name" on Jul 28, 2019
Owner

Difegue commented Jul 28, 2019

And here I thought I'd be done with unicode problems.

Looks like the reader is extracting your archive correctly, but doesn't have the correct path to the extracted folder, leading to it not finding anything.

When you open a unicode archive, does it extract in some form to your file system?
Since you're on WSL, the temp folder can be accessed by typing \\wsl$\lanraragi\home\koyomi\lanraragi\public\temp in your Windows Explorer.

Author

Twilicious commented Jul 28, 2019

Windows says it can't access that folder after I pasted the path exactly into File Explorer. What's the full path?

Owner

Difegue commented Jul 28, 2019

The \\wsl$ path uses WSL's built-in file access to reach linux files safely.

The alternative is to look up the files directly in the %APPDATA%\LANraragi\Distro folder. I recommend you don't modify files in that directory at all, but just looking around should be fine.

Author

Twilicious commented Jul 28, 2019

After navigating to %APPDATA%\LANraragi\Distro I find a temp folder, but it just contains a bunch of 0 KB files; the whole folder is only 213 bytes. There definitely is a temp folder somewhere, because when I open an English archive the path shows up in the logs and everything opens fine. I still can't access \\wsl$\ at all for some reason; it says network error. I tried microsoft/WSL#4027 (comment) but it didn't do anything. Next step is to reinstall everything, I guess. Let me know if there are any specific logs or anything else that would help.

Contributor

CirnoT commented Jul 28, 2019

Providing some useful logs. Can debug further if necessary.

/opt/lanraragi/public/temp/aa413b0930d2378f3c98a7c569a485a93bb400c5# ls
'(ゲームCG) [すたじおちゃれん] 放課後催眠倶樂部 ~餐~'
[2019-07-29 01:32:30] [LANraragi] [debug] Files found in archive: 
 $VAR1 = "/opt/lanraragi/script/../public/temp/aa413b0930d2378f3c98a7c569a485a93bb400c5/(\x{e3}\x{82}\x{b2}\x{e3}\x{83}\x{bc}\x{e3}\x{83}\x{a0}CG) [\x{e3}\x{81}\x{99}\x{e3}\x{81}\x{9f}\x{e3}\x{81}\x{98}\x{e3}\x{81}\x{8a}\x{e3}\x{81}\x{a1}\x{e3}\x{82}\x{83}\x{e3}\x{82}\x{8c}\x{e3}\x{82}\x{93}] \x{e6}\x{94}\x{be}\x{e8}\x{aa}\x{b2}\x{e5}\x{be}\x{8c}\x{e5}\x{82}\x{ac}\x{e7}\x{9c}\x{a0}\x{e5}\x{80}\x{b6}\x{e6}\x{a8}\x{82}\x{e9}\x{83}\x{a8} \x{ef}\x{bd}\x{9e}\x{e9}\x{a4}\x{90}\x{ef}\x{bd}\x{9e}/000.png";
[2019-07-29 01:33:40] [Hash Computation] [error] Error building hash for /opt/lanraragi/script/../public/temp/aa413b0930d2378f3c98a7c569a485a93bb400c5/(ã�²ã�¼ã� CG) [ã��ã��ã��ã��ã�¡ã��ã��ã��] æ�¾èª²å¾�å�¬ç� å�¶æ¨�é�¨ ï½�é¤�ï½�/000.png -- Open failed: No such file or directory at /opt/lanraragi/script/../lib/LANraragi/Utils/Generic.pm line 93.

[2019-07-29 01:33:40] [LANraragi] [debug] Thumbnail not found at /db/thumb/aa413b0930d2378f3c98a7c569a485a93bb400c5.jpg ! (force-thumb flag = 1)
[2019-07-29 01:33:40] [LANraragi] [debug] Regenerating from /opt/lanraragi/script/../public/temp/aa413b0930d2378f3c98a7c569a485a93bb400c5/(ã�²ã�¼ã� CG) [ã��ã��ã��ã��ã�¡ã��ã��ã��] æ�¾èª²å¾�å�¬ç� å�¶æ¨�é�¨ ï½�é¤�ï½�/000.png
[2019-07-29 01:33:42] [LANraragi] [debug] Files found in archive: 
<input class="stdinput" type="text" style="width:100%" readonly="" size="20" value="/db/(ã�²ã�¼ã�&nbsp;CG) [ã��ã��ã��ã��ã�¡ã��ã��ã�&#147;] æ�¾èª²å¾�å�¬ç�&nbsp;å�¶æ¨�é�¨ ï½�é¤�ï½�.zip" name="filename">

Author

Twilicious commented Jul 29, 2019

Turns out the mistake was mine: I wasn't using the most recent version of Windows. I have now updated to 1903 and \\wsl$\ works perfectly. To answer your original question: yes, both the English- and Japanese-titled folders are extracted there and are perfectly readable from Explorer. I found another archive with only English and kanji in its name and it opened just fine; the logs look just like the edit screenshot in the OP.

Author

Twilicious commented Jul 29, 2019

So I think I figured it out. Before I used LRR I had all my works stored in plain folders. When I switched, I used a batch script and 7zip to zip up every folder individually. I'm just now noticing that the images were placed in a folder inside the zip. When I upload a new archive with the images directly in the zip, it all works fine even with Japanese characters. I don't know what could be causing it, but it seems like a good clue.

Edit:
Ya, confirmed. I went in and manually fixed five or so broken archives, and as soon as I pressed "clean archive cache" after fixing the directory layout it worked perfectly. Anyone know how to batch convert 1500 zips to that correct format?

Contributor

CirnoT commented Jul 29, 2019

That's a good clue and temporary fix, however it shouldn't really be needed. LANraragi seems to handle images nested in directories just fine, but it seems to bug out with Unicode characters in directory names. Definitely a bug to fix.

Contributor

CirnoT commented Jul 29, 2019

Okay, I've found where the issue lies. The Archive utility always saves the extracted archive with Unicode names, but the Reader model then uses File::Find and forgets that it returns byte-encoded paths, not decoded Unicode strings. The solution would be:

diff --git a/lib/LANraragi/Model/Reader.pm b/lib/LANraragi/Model/Reader.pm
index 7ed5b42..3b1b142 100755
--- a/lib/LANraragi/Model/Reader.pm
+++ b/lib/LANraragi/Model/Reader.pm
@@ -106,7 +106,7 @@ sub build_reader_JSON {
         find(sub {
                 # Is it an image?
                 if ( LANraragi::Utils::Generic::is_image($_) ) {
-                    push @images, $File::Find::name;
+                    push @images, Encode::decode_utf8($File::Find::name);
                 }
             }, $path);
     };
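To illustrate the byte-versus-character mismatch that the patch above addresses, here is a minimal sketch. It does not use LANraragi's code; the variable names are made up for the example, and it only assumes that the path on disk is UTF-8, as reported in this thread.

```perl
use strict;
use warnings;
use utf8;                                   # string literals below are characters
use Encode qw(encode_utf8 decode_utf8);

# A folder name as decoded Perl characters (what the rest of the app compares against).
my $chars = "放課後";

# The same name as raw UTF-8 octets (the form File::Find returns, per the comment above).
my $bytes = encode_utf8($chars);

my $char_len = length $chars;                                   # counts characters
my $byte_len = length $bytes;                                   # counts octets
my $same_raw = ( $bytes eq $chars )              ? 1 : 0;       # compared without decoding
my $same_dec = ( decode_utf8($bytes) eq $chars ) ? 1 : 0;       # compared after decoding

print "chars=$char_len bytes=$byte_len raw_match=$same_raw decoded_match=$same_dec\n";
# prints: chars=3 bytes=9 raw_match=0 decoded_match=1
```

The two strings only compare equal once the octets have been decoded, which is what Encode::decode_utf8 does in the diff.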

Owner

Difegue commented Jul 29, 2019

Hmm, I thought I had fixed unicode in folder names for good with commit 030229d, but I guess it's worth taking another look at it.

Contributor

CirnoT commented Jul 31, 2019

Yes, your solution was indeed good; however, removing File::Find::utf8 was unnecessary. File::Find expects and returns byte-encoded strings, but the code further down treats its return value as a decoded string, which obviously fails. Either decode the path as UTF-8 (we are sure by now that it IS UTF-8 after renaming it) or use the File::Find::utf8 variant, which does that automatically.

Owner

Difegue commented Jul 31, 2019

You're right on that point, but the thing is that the extracted files should not be UTF-8 at all, but pure ASCII.

Commit 030229d uses Encode's coderef CHECK to automatically translate every non-ASCII character to its U+XXXX codepoint equivalent, and then moves the extracted files to their ASCII names.
I use this instead of File::Find::utf8 to prevent issues on operating systems whose filesystems don't support UTF-8 (mostly thinking about Japanese Windows and how Shift-JIS/CP932 ruins every day of my life).

However, you're still getting folders and files chock full of UTF-8 characters, which means the move operation is failing somehow and leaving the files as-is. I've added an explicit die throw to the move in order to debug further.
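For readers unfamiliar with the coderef CHECK mentioned above, a small sketch of the technique follows. The helper name ascii_name and the sample file name are hypothetical; only the encode() call with a code reference mirrors what the comment describes.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Hypothetical helper: replace every character ASCII cannot represent
# with its U+XXXX code point. The input must be a decoded character string.
sub ascii_name {
    my ($name) = @_;
    return encode( "ascii", $name, sub { sprintf "U+%04X", shift } );
}

my $safe = ascii_name("(C91) 放課後.zip");
print "$safe\n";
# prints: (C91) U+653EU+8AB2U+5F8C.zip
```

The callback receives the ordinal of each unmappable character, so ASCII characters pass through untouched and everything else is spelled out.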

Contributor

CirnoT commented Aug 1, 2019

I fail to see how a simple die statement would help here. Indeed, the conversion fails and the module dies, taking the whole application down with it: subsequent requests after the failure result in Mojolicious errors that can easily send you down a rabbit hole, such as 'public/themes' not found.

I flubbed it while trying to open the archive /db/(C90) [りとる☆はむれっと (きぃら~☆)] おもらし大好きさとりさん (東方Project).zip
It's likely the archive is corrupt.
Some more info below if available:

Can't open file "./log/lanraragi.log": No such file or directory at /usr/local/share/perl/5.26.1/Mojo/Log.pm line 17.

Owner

Difegue commented Aug 1, 2019

It was not really meant to help, just to see why the moving operation fails.
The log indeed doesn't seem to be telling much, though. 😐

It's weird that the program can't open a file in the log directory as well, however. Maybe some file permissions are wrong?

Contributor

CirnoT commented Aug 1, 2019

No, the logs work fine otherwise. It seems to be an issue caused by dying, a rabbit hole that's better left unexplored lest we start digging into Mojo source code.

Contributor

CirnoT commented Aug 1, 2019

^ The issue is because it dies inside finddepth, which internally does chdir.

Contributor

CirnoT commented Aug 1, 2019

Found it. finddepth returns a raw byte string, not a decoded Perl string, so encode fails. Either File::Find::utf8 must be used, or decode_utf8() must be applied to $_ before passing it to encode().

You can verify this by doing print Dumper($_) inside finddepth (do not use $logger: finddepth does chdir, so './' is invalid in that scope). To fix the logger problem, all paths supplied to Mojo should be absolute, not relative.

diff --git a/lib/LANraragi/Utils/Archive.pm b/lib/LANraragi/Utils/Archive.pm
index 1302973..c50f7bf 100644
--- a/lib/LANraragi/Utils/Archive.pm
+++ b/lib/LANraragi/Utils/Archive.pm
@@ -49,7 +49,7 @@ sub extract_archive {

     #Rename files and folders to an encoded version
     finddepth(sub {
-                move($_, encode("ascii", $_, sub{ sprintf "U+%04X", shift }));
+                move($_, encode("ascii", decode_utf8($_), sub{ sprintf "U+%04X", shift }));
             }, $ae->extract_path);

     # dir that was extracted to
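A rough sketch of why the decode step in the patch above matters: the same U+XXXX conversion is run once on raw octets and once on the decoded string. The variable names are invented for the example, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode_utf8);

# The UTF-8 octets for the katakana character U+30B2, as finddepth would hand them over.
my $octets = "\xE3\x82\xB2";

# Callback used as Encode's CHECK argument: spell out unmappable characters.
my $to_codepoint = sub { sprintf "U+%04X", shift };

# Without decoding, each octet is treated as a separate character.
my $from_bytes = encode( "ascii", $octets, $to_codepoint );

# With decoding first, the three octets become a single character.
my $from_chars = encode( "ascii", decode_utf8($octets), $to_codepoint );

print "without decode: $from_bytes\n";
print "with decode:    $from_chars\n";
```

Only the second form yields the single code point U+30B2, which is the name the rest of the application expects to find on disk.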

Owner

Difegue commented Aug 1, 2019

Re-read my docs on perl encoding, and indeed I was mistaken: My understanding was that encode could encode raw bytes to an encoding directly, which is obviously incorrect.

I still don't want to assume what we're receiving is utf-8 (I've gotten issues in the past from systems that weren't utf-8 at all), so I've added a layer of Encode::Guess with the major Japanese encodings.

Contributor

CirnoT commented Aug 1, 2019

Guess can be used, but you must consider that it will fail in the many cases where the encoding is ambiguous: https://perldoc.perl.org/Encode/Guess.html

[2019-08-02 01:30:21.68979] [4761] [error] shiftjis or utf8 at /opt/lanraragi/script/../lib/LANraragi/Controller/Edit.pm line 119.
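One way to detect the ambiguous case instead of crashing on it is to call guess_encoding directly and check what comes back. This is only a sketch: the helper name detect_encoding and the suspect list are assumptions for the example, not what Edit.pm does.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: name the encoding Encode::Guess settles on,
# or return undef when it cannot decide.
sub detect_encoding {
    my ($octets) = @_;
    my $guess = guess_encoding( $octets, qw(shiftjis utf8) );

    # guess_encoding returns an encoding object on success and a plain
    # error string (such as the "shiftjis or utf8" in the log above) otherwise.
    return ref $guess ? $guess->name : undef;
}

my $name = detect_encoding("plain_ascii.zip");
print defined $name ? "detected: $name\n" : "ambiguous or unknown\n";
```

A caller can then pick its own default when undef comes back, rather than letting decode("Guess", ...) croak.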

Owner

Difegue commented Aug 1, 2019

"will croak if Two or more suspects remain"

Didn't see that in the perldoc indeed; I probably have to add some extra error checking here. (And in the variant I used in Edit.pm)

Didn't think shift-jis and utf8 used the same symbols though - I'll probably default to utf8/ascii when they appear in the guesses.

Contributor

CirnoT commented Aug 2, 2019

Technically something like this should work

local $@;
eval {
    $file = decode ("Guess", $file);
};
$file = decode_utf8($file) if $@;

decode_utf8 is not just an alias for decode("UTF-8", $_); it also differs in that it's not strict, so it won't do any harm to ASCII-encoded input and won't croak.

There is still the open issue of finddepth not returning to the original directory if it dies midway. This really has to be handled, as it crashes the whole of LRR permanently until restarted: Mojo keeps trying to access and write files relative to the directory the archive was extracted to (assuming it even exists).

There is also an issue with your attempt to convert UTF-8 to U+ notation: the name may quickly get out of control and exceed the 255-byte filename length limit on ext4 and other filesystems, which would cause move() to fail. Possibly just cut it at 254 characters?

Here is example of archive (with only cover) with encoding issue as well as one that would trigger filename too long issue: https://u.gensokyo.re/d/C03V4o2v
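To put numbers on the filename length concern, here is a small sketch comparing the two representations. The 60-character name is made up for the example; the 255-byte limit is the one mentioned above.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode encode_utf8);

# A made-up 60-character folder name consisting of one repeated kanji.
my $name = "放" x 60;

# Length on disk when stored as UTF-8 (3 bytes per character here).
my $utf8_bytes = length encode_utf8($name);

# Length after the U+XXXX conversion (6 bytes per character here).
my $ascii_bytes = length encode( "ascii", $name, sub { sprintf "U+%04X", shift } );

print "UTF-8: $utf8_bytes bytes, U+XXXX form: $ascii_bytes bytes\n";
# prints: UTF-8: 180 bytes, U+XXXX form: 360 bytes
```

A name that fits comfortably as UTF-8 doubles in size after conversion and no longer fits in 255 bytes.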

@Twilicious
Author

I've been following this closely and I'm happy to report whatever you did in the latest nightly fixed my problem! All of my previously broken archives now work perfectly! Amazing work!

Owner

Difegue commented Aug 2, 2019

I was writing something more complex, but in hindsight a basic guess + fallback to utf8 is going to be good for 98% of setups. I'll go with that.

Saving the current working directory and chdiring back to it after the finddepth should do the trick for the second point.

I had the character limit for the U+ conversion in the back of my mind too: I thought about cutting the max string, but that could cause issues for archives that have multiple folders with slight differences in name at the end like japanese characters ch.1/2/3/etc. It's super minor and wouldn't crash the app however, so it's no big deal.
I'll drop the U+ part in order to reduce the length a bit as well.
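The save-and-restore idea for the working directory could look roughly like the sketch below. The wrapper name safe_finddepth is hypothetical and this is not the code that went into LANraragi; it only shows the pattern of returning to the starting directory even when the callback dies.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Find qw(finddepth);
use File::Temp qw(tempdir);

# Hypothetical wrapper: run finddepth, then go back to the starting
# directory whether or not the callback died partway through.
sub safe_finddepth {
    my ( $callback, $dir ) = @_;
    my $cwd = getcwd();
    my $ok  = eval { finddepth( $callback, $dir ); 1 };
    my $err = $@;
    chdir $cwd or die "Cannot return to $cwd: $!";
    die $err unless $ok;    # re-throw only after the directory is restored
    return;
}

# Simulate a rename that fails inside the callback.
my $dir    = tempdir( CLEANUP => 1 );
my $before = getcwd();
eval { safe_finddepth( sub { die "rename failed\n" }, $dir ) };
my $after = getcwd();

print $before eq $after ? "cwd restored\n" : "cwd changed\n";
```

With the directory restored before the error propagates, relative paths such as './log/lanraragi.log' keep resolving for later requests.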

Owner

Difegue commented Aug 2, 2019

hm, left an ascii up there from experimentations with encode::guess without the default encoding list.
Whatever, I'll remove it in some future commit.

Contributor

CirnoT commented Aug 2, 2019

It works! I'll report if I find any that cause issue but so far looks like we managed to finally solve it.

Difegue closed this as completed Aug 2, 2019