Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

German Umlaut in "genre" makes scan fail #846

Closed
pchesneau opened this issue Apr 17, 2021 · 17 comments
Closed

German Umlaut in "genre" makes scan fail #846

pchesneau opened this issue Apr 17, 2021 · 17 comments

Comments

@pchesneau
Copy link

pchesneau commented Apr 17, 2021

I recently added new tracks in my music library. Since then music scan fails.
When scan is performed from command line, the following error is reported:

$> /opt/RZphp74/bin/php-cli -d memory_limit=1G <redacted>/occ music:scan  <redacted>
Start scan for <redacted>
Found 201 music files to scan
Scanning /<redacted>/files/Musique/Antonio Vivaldi; I Musici, Severino Gazzelloni/6 Concerti, op. 10 for Flute, Strings and Continuo/02 Concerto No. 1 in F, RV 433 _La Tempesta Di Mare__ II. Largo.flac

In Mapper.php line 281:

  Did expect one result but found none when executing: query "SELECT
                                         `*PREFIX*music_genres`.`id`,
                                         `*PREFIX*music_genres`.`name`,
                                         `*PREFIX*music_genres`.`lower_name`,
                                         `*PREFIX*music_genres`.`created`,
                                         `*PREFIX*music_genres`.`updated`,
                                         COUNT(`track`.`id`) AS `trackCount`,
                                         COUNT(DISTINCT(`track`.`album_id`)) AS `albumCount`,
                                         COUNT(DISTINCT(`track`.`artist_id`)) AS `artistCount`
                                 FROM `*PREFIX*music_genres`
                                 LEFT JOIN `*PREFIX*music_tracks` `track`
                                 ON `track`.`genre_id` = `*PREFIX*music_genres`.`id`
                                 WHERE `*PREFIX*music_genres`.`user_id` = ? AND `lower_name` = ?
                                 GROUP BY
                                         `*PREFIX*music_genres`.`id`,
                                         `*PREFIX*music_genres`.`name`,
                                         `*PREFIX*music_genres`.`lower_name`,
                                         `*PREFIX*music_genres`.`created`,
                                         `*PREFIX*music_genres`.`updated`
                                 "; parameters Array
  (
      [0] => <redacted>
      [1] => fl�tenkonzert
  )
  ; limit ""; offset ""


music:scan [--all] [--group GROUP] [--debug] [--clean-obsolete] [--rescan] [--folder [FOLDER]] [--] [<user_id>...]

The actual value of genre for this track is "Flötenkonzert".

It looks like an issue around character encoding .

@pchesneau
Copy link
Author

While I'm at it, the issue is related to the lowercase feature.
In database, here is what's available:

Name lower_name
Flötenkonzert fl

@paulijar
Copy link
Collaborator

Thanks for the report. However, I couldn't reproduce the issue when I set genre of couple of tracks to "Flötenkonzert". Please specify the version of ownCloud or Nextcloud used, as well as the version of the Music app.

@pchesneau
Copy link
Author

pchesneau commented Apr 18, 2021

Sure : Nextcloud 20.0.8 with music app 1.1.0.
Just to be sure I tried to reupload the files in my nextcloud, Immediatly the issue appears.

May I send you by mail a share link to 2 affected files ? (flac files)

@paulijar
Copy link
Collaborator

May I send you by mail a share link to 2 affected files ? (flac files)

Please, do. Maybe the significant factor is actually something else than the "ö" character. You can find my email address from my github profile.

@pchesneau
Copy link
Author

Done, Have a look in your mailbox 🙂

@paulijar
Copy link
Collaborator

Got it, thanks. But still, I couldn't reproduce the problem: Both files were scanned without issues and show up under the genre "Flötenkonzert" on the UI.

@pchesneau
Copy link
Author

Could it be related to the default encoding of the database (utf8mb4) and php default charset (Iso-8859-1) ? Which charset are you using?

@paulijar
Copy link
Collaborator

The encoding in my database tables is "utf8" and the PHP default character set is "UTF-8". I now tested changing the PHP default encoding to "iso-8859-1" but that made no difference. That's expected, because I believe that the Nextcloud core anyway overrides the character set setting to UTF-8.

@pchesneau
Copy link
Author

pchesneau commented Apr 18, 2021

I confirm this is an issue with character encoding (I don't know why you do not reproduce on your setup though).
However, I found that if I add a call to mb_detect_encoding in GenreBusinessLayer (line 54):

$genre->setLowerName(\mb_strtolower($name ?? '',\mb_detect_encoding($name)));

(and also in Artist.php line 80 for a similar issue in subsonic API) it works perfectly. (maybe some other places, but 1AM here: I'll check once I slept)

Would you accept a PR with these modifications?

@paulijar
Copy link
Collaborator

Before making the code changes, I would like to understand the underlying concepts here if at all possible. I have been under the assumption that the internal character encoding used by PHP on all Nextcloud code should be UTF8, regardless of the PHP configuration. But if this does not hold, then there may potentially be other problems, too.

One more question: does your php.ini contain any values for the keys internal_encoding, iconv.internal_encoding, or mbstring.internal_encoding?

@pchesneau
Copy link
Author

pchesneau commented Apr 20, 2021

Before making the code changes, I would like to understand the underlying concepts here if at all possible. I have been under the assumption that the internal character encoding used by PHP on all Nextcloud code should be UTF8, regardless of the PHP configuration. But if this does not hold, then there may potentially be other problems, too.

I understand 🙂

One more question: does your php.ini contain any values for the keys internal_encoding, iconv.internal_encoding, or mbstring.internal_encoding?

I tried without and with these key with the same result. (However, there may be some hidden tricks with the php configuration, I'm on a mutualized hosting).

@paulijar
Copy link
Collaborator

paulijar commented May 8, 2021

Now I had some time to investigate this further, and I finally managed to reproduce the issue! I had two different test setups which acted a bit differently:

  • The setup no. 1 was NC20 + PHP7.4 + sqlite3.
  • The setup no. 2 was NC17 + PHP7.1 + MySQL

On setup 1, setting internal_encoding = "ISO-8859-1" in php.ini had the effect of changing the encoding used internally by PHP, as returned by \mb_internal_encoding(), to be "ISO-8859-1". With this, the genre lower_name stored to the DB got corrupted:

sqlite> select name, lower_name from oc_music_genres;
Suomi-pop|suomi-pop
Suomi-rock|suomi-rock
ÖöÄäÅå|▒▒▒▒▒▒
Flötenkonzert|fl▒tenkonzert

However, the scanning was working fine regardless, and the scanned tracks were shown correctly on the web UI.

On setup 2, settings internal_encoding = "ISO-8859-1" in php.ini did not change the internal encoding actually used when executing the Music app. The function \mb_internal_encoding() was still returning "UTF-8". However, setting mbstring.internal_encoding = "ISO-8859-1" in php.ini did the trick and the actually used encoding changed. Here, I could see lower_name to be truncated on the first accented character, and the scanning broke the way described on the OP.

I also found that the problem could be fixed by overriding the encoding within the code by calling mb_internal_encoding('UTF-8') . I added it to the constructor of the main application class in lib/App/Music.php:
image

@pchesneau Could you test if this fixes the problem for you, too?

@pchesneau
Copy link
Author

Sure ! I'll try as soon as I'm at my computer. I'll keep you updated.

paulijar added a commit that referenced this issue May 9, 2021
It is possible that the internal encoding used by PHP is something else than
UTF-8, depending on settings given in php.ini. Still, the OC/NC database
should always be in UTF-8 and the getID3 library returns UTF-8-encoded data
by default. Hence, using internal encoding other than UTF-8 may lead to
unexpected results in many places, especially when any of the PHP built-in
functions `mb_*` are used. To avoid this, we now override the internal
encoding to be UTF-8 in the constructor of the main application class.

refs #846
@paulijar
Copy link
Collaborator

@pchesneau So have you been able to verify this?

@pchesneau
Copy link
Author

Sorry for the late reply @paulijar, I've juste tested this fix on my NC instance. It works perfectly !

@paulijar
Copy link
Collaborator

Great, thanks for the confirmation! I'll try to make a new release including this fix soon; maybe tomorrow or latest on Sunday.

@paulijar
Copy link
Collaborator

The fix for this problem has now been released in Music v1.2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants