Special characters in filenames #7762

Open
loeffelpan opened this issue Jan 9, 2018 · 56 comments
Labels
1. to develop Accepted and waiting to be taken care of feature: files feature: filesystem hotspot: filename handling Filenames - invalid, portable, blacklisting, etc. technical debt

Comments

@loeffelpan

loeffelpan commented Jan 9, 2018

Steps to reproduce

  1. Put some files in any folder, some (but not all) with special characters such as German umlauts in their names
  2. Add this folder as external storage (type "local")

Expected behaviour

Every file in this folder should be scanned and shown in the Files app.

Actual behaviour

These files were downloaded to the hard disk of my home server. The folder containing the downloaded files is configured as "local" external storage in my Nextcloud.
Files and folders with German umlauts that were created by Nextcloud in the Files app appear in the file listings. Other files and folders (from the download) are ignored by the occ file scan.

During a file scan in debug mode, the following messages appear in nextcloud.log.
The paths should read Lügen instead of L\u00fcgen and Hölle instead of H\u00f6lle, for example.

{"reqId":"X7LIb2Ci8jdOOkqp3leZ","level":0,"time":"2017-12-26T17:25:29+01:00","remoteAddr":"","user":"--","app":"OC\Files\Cache\Scanner","method":"--","url":"--","message":"!!! Path 'Serien/Zoo/S02E06.Sex, L\u00fcgen und Quallen.mp4' is not accessible or present !!!","userAgent":"--","version":"12.0.4.3"}
{"reqId":"X7LIb2Ci8jdOOkqp3leZ","level":0,"time":"2017-12-26T17:25:29+01:00","remoteAddr":"","user":"--","app":"OC\Files\Cache\Scanner","method":"--","url":"--","message":"!!! Path 'Serien/Zoo/S02E10.H\u00f6lle in Helsinki.mp4' is not accessible or present !!!","userAgent":"--","version":"12.0.4.3"}

Server configuration

Operating system: Ubuntu Server 17.10

Web server: Apache 2.4.27

Database: MySQL

PHP version: PHP 7.1.11-0ubuntu0.17.10.1

Nextcloud version: 12.0.4

Updated from an older Nextcloud/ownCloud or fresh install: fresh install

Where did you install Nextcloud from: nextcloud.com

List of activated apps:

App list
Enabled:
  - dav: 1.3.0
  - federatedfilesharing: 1.2.0
  - files: 1.7.2
  - files_external: 1.3.0
  - files_sharing: 1.4.0
  - files_videoplayer: 1.1.0
  - lookup_server_connector: 1.0.0
  - notifications: 2.0.0
  - oauth2: 1.0.5
  - provisioning_api: 1.2.0
  - theming: 1.3.0
  - twofactor_backupcodes: 1.1.1
  - updatenotification: 1.2.0
  - workflowengine: 1.2.0
Disabled:
  - activity
  - admin_audit
  - comments
  - encryption
  - federation
  - files_pdfviewer
  - files_texteditor
  - files_trashbin
  - files_versions
  - firstrunwizard
  - gallery
  - logreader
  - nextcloud_announcements
  - password_policy
  - serverinfo
  - sharebymail
  - survey_client
  - systemtags
  - user_external
  - user_ldap

Nextcloud configuration:

Config report
{
    "system": {
        "instanceid": "oc65jgv8zf6o",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "toothless.goip.de",
            "toothless.fritz.box"
        ],
        "datadirectory": "\/var\/www\/nextcloud\/data",
        "overwrite.cli.url": "https:\/\/toothless.goip.de",
        "dbtype": "mysql",
        "version": "12.0.4.3",
        "dbname": "nextcloud",
        "dbhost": "localhost",
        "dbport": "",
        "dbtableprefix": "oc_",
        "mysql.utf8mb4": true,
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "skeletondirectory": "",
        "logtimezone": "Europe\/Berlin",
        "memcache.local": "\\OC\\Memcache\\APCu",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "localhost",
            "port": "6379"
        },
        "htaccess.RewriteBase": "\/",
        "mail_smtpmode": "smtp",
        "mail_smtpauthtype": "LOGIN",
        "mail_smtpauth": 1,
        "mail_from_address": "jan.noormann",
        "mail_domain": "gmail.com",
        "mail_smtphost": "smtp.gmail.com",
        "mail_smtpport": "587",
        "mail_smtpname": "***REMOVED SENSITIVE VALUE***",
        "mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpsecure": "tls"
    }
}

Are you using external storage, if yes which one: local

Are you using encryption: no

Are you using an external user-backend, if yes which one: no

Client configuration

Browser: Opera, Chrome, Firefox

Operating system: Windows 10

@rullzer
Member

rullzer commented Jan 9, 2018

@icewind1991 fs fun :)

@MorrisJobke MorrisJobke added bug feature: filesystem 0. Needs triage Pending check for reproducibility or if it fits our roadmap labels Jan 9, 2018
@icewind1991
Member

Are you perhaps using a non-standard filesystem such as FAT or NTFS?

Can you try creating a php file ls.php:

<?php
echo "Listing {$argv[1]}\n";
var_dump(scandir($argv[1]));

And run it using php ls.php /path/to/folder and see if you get the correct result

@loeffelpan
Author

The filesystem on that HDD is ext4.
I just figured out that there are other files with special characters on the same filesystem which are listed by Nextcloud's Files app. It seems to have something to do with exactly the files mentioned.

The result of your PHP looks fine:
Listing /mnt/Test
array(6) {
  [0]=>
  string(1) "."
  [1]=>
  string(2) ".."
  [2]=>
  string(35) "S02E06.Sex, Lügen und Quallen.mp4"
  [3]=>
  string(30) "S02E09.Das Knochenrätsel.mp4"
  [4]=>
  string(30) "S02E10.Hölle in Helsinki.mp4"
  [5]=>
  string(31) "S02E12.Die Säbelzahnkatze.mp4"
}
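
The string(N) values in this var_dump output are byte counts, and they hint at how the names are stored. A minimal Python sketch (the helper name utf8_lengths is made up for illustration) compares the UTF-8 byte length of one listed name in composed (NFC) and decomposed (NFD) form:

```python
import unicodedata

def utf8_lengths(name: str) -> dict:
    """Return the UTF-8 byte length of `name` in NFC and NFD form."""
    return {
        form: len(unicodedata.normalize(form, name).encode("utf-8"))
        for form in ("NFC", "NFD")
    }

# One of the names from the listing above; var_dump reported string(30).
lengths = utf8_lengths("S02E10.H\u00f6lle in Helsinki.mp4")
print(lengths)  # {'NFC': 29, 'NFD': 30}
```

Under that reading, the reported length of 30 matches the decomposed spelling, which is consistent with the Unicode normalization explanation given later in this thread.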

@teadur

teadur commented Nov 7, 2018

I'm having a similar issue on an ext4 filesystem.
For most of the files everything is okay, but some files with umlauts in their names cannot be accessed by the file scanner.

All affected files produce this error: "OC\Files\Cache\Scanner","method":"--","url":"--","message":"!!! Path 'ROOT/K\u00c4SKI/DIR/T\u00f6\u00f6teeb.pdf' is not accessible or present !!!","userAgent":"--","version":"13.0.2.1"}
It seems these files have non-UTF-8 filenames, for example ISO-8859-*.

It seems that the scanner expects all filenames to be in ASCII or UTF-8.

If I take one of the non-working files from the filesystem and upload it through the web UI, it is accessible (it seems something converts the filename encoding in that case).
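
One way to test the guess about non-UTF-8 names is to look at the raw bytes of a name, for example as returned by os.listdir() when it is given a bytes path. The sketch below is an illustration only: classify_name is a hypothetical helper, and falling back to ISO-8859-1 is an assumption, since the comment only says ISO-8859-*.

```python
def classify_name(raw: bytes) -> tuple:
    """Classify a raw filename as ASCII, UTF-8, or (assumed) ISO-8859-1."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8. ISO-8859-1 maps every byte to a character, so this
        # decode cannot fail; whether it is the *right* charset is a guess.
        return ("not-utf8", raw.decode("iso-8859-1"))
    return ("ascii" if raw.isascii() else "utf-8", text)

# "Tööteeb.pdf" encoded as ISO-8859-1: each ö is the single byte 0xF6.
print(classify_name(b"T\xf6\xf6teeb.pdf"))
# The same name encoded as UTF-8: each ö is the two bytes 0xC3 0xB6.
print(classify_name("T\u00f6\u00f6teeb.pdf".encode("utf-8")))
```

A name that falls into the "not-utf8" group would need a charset conversion, whereas a valid UTF-8 name that is still skipped points to a different cause.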

@nextcloud-bot nextcloud-bot removed the stale Ticket or PR with no recent activity label Nov 7, 2018
@teadur

teadur commented Nov 14, 2018

If someone hits this problem and needs a solution before the code gets fixed, one option is to use rclone / rsync to convert the filename charset.

@fhoner

fhoner commented Jan 10, 2019

Facing exactly the same problem. Any updates on this?

OS: Ubuntu Server 18.04
Webserver: Apache 2.4.37
Database: PostgreSQL
PHP version: 7.2.13-1+ubuntu18.04.1+deb.sury.org+1
Nextcloud version: 15.0.0
Filesystem of local storage added to NC: ext4

@Mr-Bart-Simpson

Just stumbled across a very similar issue: filenames containing a plus sign (+) cannot be uploaded, neither via the web frontend nor via the (Windows) client application.

@fhoner

fhoner commented Jan 16, 2019

Still present in v15.0.2

@kesselb
Contributor

kesselb commented Jan 16, 2019

I don't know how to reproduce 😞

[Screen recording: peek 2019-01-16 15-06]

@Mr-Bart-Simpson

Is it possible that the problem depends on the underlying OSes? I had the problem with the Plus-Sign when uploading a file from a Windows 10 client to a Nextcloud server hosted on Linux Mint

@fhoner

fhoner commented Jan 16, 2019

For me it has something to do with filename encodings, I guess.
Consider the following scenario:

I have a separate hard drive installed in the server that Nextcloud runs on. This drive is mounted as external storage of type local (ext4). Some people have access to this drive via SSH/SFTP. Folders copied onto this drive over SFTP whose names contain characters like ä, ö, ü are not shown in the Nextcloud web client. Renaming these folders manually in an SSH terminal makes them visible, though.
As there are terabytes of data, manual renaming is not an option. I will investigate further and report back.

@timor

timor commented Mar 3, 2019

cc @herrwiese

@loeffelpan
Author

I faced this again and again.
I will try renaming to solve this. For now uploading the files via web and deleting the invisible ones is my workaround.

@carowsolutions

I put a cron job in place to rename files containing umlauts:
*/30 * * * * find /etc/data/ -name "*[äöüÄÖÜß]*" -exec rename 's/ä/ae/g;s/ü/ue/g;s/ß/ss/g;s/Ä/Ae/g;s/Ü/Ue/g;s/Ö/Oe/g;s/ö/oe/g' {} \;

@daftmab

daftmab commented Apr 19, 2019

Solution:

I take no responsibility! Create a database backup first!

  1. Open phpMyAdmin, set the charset to ASCII and convert all tables.
  2. Set the charset back to UTF-8 and convert all tables again.
  3. Empty all file tables: oc_activity, oc_filecache, oc_files_trash.
     DELETE FROM oc_filecache
  4. Rescan all files with
     php -d memory_limit=1024M /var/www/cloud.nextloud.de/occ files:scan --all

I worked only on the database, not the filesystem. Worked for me.
Umlauts in oc_accounts and other tables such as groups must be changed manually.

/edit
Just deleting the file tables and running the occ command doesn't work.
The umlauts are still raw UTF-8 ä ö ü or \u00c4 \u00d6 \u00dc

@schwma

schwma commented Aug 7, 2019

I am experiencing a similar issue where some file paths containing special characters (specifically German umlauts) are not showing up. The folders in question are mounted as external storage via SFTP. I am running Nextcloud 16.0.3 as a docker container on Ubuntu Server 18.04.

What confused me was that some file paths containing umlauts were showing up while others were not. After poking around a bit I discovered that the paths that were not showing up contained "A", "O", or "U" followed by the Unicode character "COMBINING DIAERESIS" (U+0308), whereas file paths that showed up normally seemed to contain "Ä", "Ö", or "Ü" directly. When I rename a file so that the letter plus combining diaeresis is replaced by the respective umlaut, the file path shows up as expected.
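
A check like the one described above can be sketched in Python. The helper combining_marks is hypothetical and the file names are made up; it lists every combining character in a path so that decomposed names stand out.

```python
import unicodedata

def combining_marks(path: str) -> list:
    """Return (index, code point, Unicode name) for each combining character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(path)
        if unicodedata.combining(ch) != 0
    ]

decomposed = "U\u0308bung.txt"   # "U" followed by COMBINING DIAERESIS
precomposed = "\u00dcbung.txt"   # single code point "Ü"

print(combining_marks(decomposed))   # [(1, 'U+0308', 'COMBINING DIAERESIS')]
print(combining_marks(precomposed))  # []
```

Both names render identically in most terminals and file managers, which is why the difference is easy to miss without inspecting the code points.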

@OpenCoreCH

OpenCoreCH commented Nov 22, 2019

@schwma (and potentially others): I had the same issue (files with "COMBINING DIAERESIS" not showing up) and could resolve it by enabling the "NFD compatibility" option on the share. The problem is that Nextcloud normalizes Unicode by default (see

if (!$keepUnicode) {
    $path = \OC_Util::normalizeUnicode($path);
}

) and turns names like "Lo\xcc\x88sungen.pdf" into "L\xc3\xb6sungen.pdf", which are then not found on the external share (because they don't exist). With the option enabled, both encodings are checked for such files. See owncloud/core#21365 and owncloud/core#24349 for an extensive discussion of the issue.
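
The byte sequences quoted above can be reproduced with a short Python sketch. It only illustrates Unicode normalization and does not call any Nextcloud code.

```python
import unicodedata

# Name as stored on the share: "o" followed by COMBINING DIAERESIS (0xCC 0x88).
on_disk = b"Lo\xcc\x88sungen.pdf"

# NFC normalization composes the pair into the single code point "ö" (0xC3 0xB6).
normalized = unicodedata.normalize("NFC", on_disk.decode("utf-8")).encode("utf-8")

print(normalized)             # b'L\xc3\xb6sungen.pdf'
print(normalized == on_disk)  # False
```

A storage that compares names byte for byte treats these as two different files, so a lookup for the normalized name misses the file that is actually there.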

@endrift

endrift commented Jul 28, 2020

I have this problem and arrived at the conclusion that the issue involved Unicode normalization too; however, I'm running on ZFS and none of the Unicode normalization options on my filesystem seemed to resolve the issue, so I've resorted to...not storing files with non-ASCII filenames in Nextcloud :(

@n3storm

n3storm commented Aug 4, 2020

All of my Mac OS X users from different, unrelated organizations fail to see files and folders containing "combining tildes" symbols.

Looks like PHP has been able to handle this since PHP 7: https://wiki.php.net/rfc/unicode_escape

According to https://www.php.net/normalizer, normalizing to NFC should fix this (Mac OS X file and directory names are NFD normalized).

What worked for us is frequently running cron tasks with the following commands:

  • sudo -u www-data /usr/bin/convmv --notest --nfc -f utf8 -t utf8 -r data/ (better to use absolute paths)
  • sudo -u www-data /usr/bin/php occ files:scan --all
  • sudo -u www-data /usr/bin/php occ groupfolders:scan 1 (optional; you may have more than one group folder, which is a hassle)

The star here is the convmv command, and the following SO question gave us the final touch:

https://stackoverflow.com/questions/26516700/file-name-look-the-same-but-is-different-after-copying

We are now looking into using something like triggers to do the conversion, but we think this issue should be addressed by Nextcloud.
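
For readers without convmv, the idea behind the --nfc step can be sketched in Python. This is an illustration, not a drop-in replacement: the function name and its dry_run parameter are made up here, it skips any rename whose target already exists, and it has not been tested against a Nextcloud data directory. A files:scan would still be needed afterwards.

```python
import os
import unicodedata

def rename_to_nfc(root: str, dry_run: bool = True) -> list:
    """Rename entries below `root` whose names are not NFC; return (old, new) pairs."""
    renamed = []
    # Walk bottom-up so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, nfc)
            if os.path.exists(new):
                # Both spellings exist already; leave them for manual review.
                continue
            if not dry_run:
                os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

Calling rename_to_nfc("/path/to/data") with the default dry_run=True only reports what would change, which makes it possible to review the list before renaming anything.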

@n3storm

n3storm commented Aug 6, 2020

We are now testing the Nextcloud Workflow module: all Created and Copied files whose MIME type is not application/fuu (a condition chosen so that all files and folders pass through) are handed to this script:
/usr/bin/convmv --notest --nfc -f utf8 -t utf8 -r %f

Here we are using Spanish characters from Mac OS X keyboards.
If somebody else could test this, that would be awesome.

@johndoe7000

@szaimen
It was not me finding the solution... it was from another guy @benjelloun69 . Look here...

https://help.nextcloud.com/t/invalid-encoding-on-file-names-in-nc19/83835

He posted a solution on 1. Nov. 2020... but somehow it was not accepted... read my comment from 26. Apr. 2022 until here.
I would be very happy if his patch were accepted, or a different one that satisfies the Nextcloud devs.

Adding this small line to every Nextcloud release for more than two years is really annoying :(

@szaimen
Contributor

szaimen commented Jan 10, 2023

I guess we have not seen this forum post. Can you try to create the PR? I'll then help you move this forward :)

@PVince81
Member

see also troubleshooting NFD encoding issues with external storage: https://docs.nextcloud.com/server/latest/admin_manual/issues/general_troubleshooting.html#troubleshooting-file-encoding-on-external-storages

I'm not sure if the proposed patch will make everything work correctly. Maybe the scanner will find the file but when you'll try to overwrite it through the web UI or Webdav, it will create another instance of the file with the NFC normalized name. So you'll see two files on disk with seemingly the same name, but one is with NFC normalized and one with NFD (the original one).

For external storages, a special compatibility mode has been developed (see link above) which will always try both encodings to avoid such issues. However this approach makes everything slower as more FS accesses are required.
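
The trade-off described above can be illustrated with a small Python model. It is a sketch of the general idea only and not Nextcloud's implementation: a directory listing is modelled as a set of names, and find_entry is a hypothetical helper that tries the NFC spelling first and then the NFD spelling.

```python
import unicodedata

def find_entry(listing: set, requested: str):
    """Return (stored name, number of lookups), or (None, lookups) if absent."""
    lookups = 0
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, requested)
        lookups += 1
        if candidate in listing:
            return candidate, lookups
    return None, lookups

# The storage holds the NFD spelling; the request arrives NFC-normalized.
listing = {"Lo\u0308sungen.pdf"}
name, lookups = find_entry(listing, "L\u00f6sungen.pdf")
print(lookups)  # 2: the NFC spelling missed, the NFD spelling matched

# Writing the NFC spelling without such a check adds a second, look-alike entry.
listing.add("L\u00f6sungen.pdf")
print(len(listing))  # 2
```

The extra lookup per miss is the model's counterpart of the additional FS accesses, and the second entry corresponds to the two seemingly identical files on disk.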

@PVince81
Member

For those who already use compatibility mode and can confirm that they have NFD-encoded file names and it still doesn't work, this can be handled as a bug. Back then, this mode was mostly tested with SMB storages; other storages such as S3 may need further workarounds to work correctly.

@johndoe7000

@PVince81
Now this gets interesting....
I work for an employer where we use Windows, Linux and MacOSX.

I have an account on our Nextcloud 24.0.8 where I use a Samba4 (2:4.9.5+dfsg-5+deb10u3, Debian Buster) DFS-enabled share.
There I have a password-protected Excel sheet with German umlauts and a space in its name.
When I don't use the patch from @benjelloun69, I cannot successfully scan this file with occ or open it with Collabora Online Office (COOL). When I try to open it with COOL, I only see a spinning wheel from Nextcloud.
When I enable NFD for this share, COOL opens but fails to show the dialog for entering the password.

Next, I created a new Excel sheet with MS Office 2013 on my Windows system, with a German umlaut and a space in its name, and protected it with a password.
I then logged into Nextcloud, and this file can be scanned and opened with COOL without problems; the password dialog appears.

So my guess that this is a problem with password-protected files that have an umlaut in their name IS WRONG. Sorry for the inconvenience.

Summary... fact is...

  1. I can open the "mysterious" excel sheet without problems from Windows and Linux (I have no access to a MacOSX machine).
  2. I can open and scan this file in Nextcloud with the patch from @benjelloun69 without having NFD enabled on the share.
  3. Enabling NFD on the share and not patching OC_Util.php works half way for this special excel sheet.
  4. I cannot tell you, if this file was originally generated on a Mac or not.

And before you ask, this excel sheet has "very" sensitive data in it, so I cannot share.

When I have more time... I will try to remove the password from that excel sheet and test again. If not possible maybe changing the password from my Windows or Linux system helps.
This is of course not a solution to the problem, but may give more insight.

@PVince81
Member

in case it's useful, you can copy-paste a file name and pass it to this script and it will tell you what normalization it has and also show you both conversions:

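<?php
// Requires the intl extension, which provides the Normalizer class.
$s = $argv[1];

$isNfc = \Normalizer::isNormalized($s, \Normalizer::FORM_C);
$isNfd = \Normalizer::isNormalized($s, \Normalizer::FORM_D);

if ($isNfc && $isNfd) {
    // No character differs between the two forms, e.g. plain ASCII.
    print("Original string is the same in NFC and NFD\n");
} elseif ($isNfd) {
    print("Original string is using NFD normalization\n");
    $nfc = \Normalizer::normalize($s, \Normalizer::FORM_C);
    print("NFC: $nfc\n");
    print("NFD: $s\n");
} elseif ($isNfc) {
    print("Original string is using NFC normalization\n");
    $nfd = \Normalizer::normalize($s, \Normalizer::FORM_D);
    print("NFC: $s\n");
    print("NFD: $nfd\n");
} else {
    print("Unknown normalization\n");
}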

@johndoe7000

@PVince81
I did 3 tests...

First I made a copy of the special excel sheet (Zugänge ITS.xlsx) with MS Explorer in the same folder, opened it with MS Excel 2013 and removed the password, then saved it. Opened it with MS Excel 2013 again > worked without password.

1. Test - patch disabled, NFD disabled.
sudo -u www-data php /var/www/nextcloud/occ files:scan DariusS
Starting scan for user 1 out of 1 (DariusS)
Entry "files_versions/DV/09_DV/Zugänge ITS.xlsx.v1651741342" will not be accessible due to incompatible encoding
Entry "09_DV/Zugänge ITS.xlsx" will not be accessible due to incompatible encoding
Entry "09_DV/Zugänge ITS - Kopie.xlsx" will not be accessible due to incompatible encoding
Entry "20_Public/Lothar/Orga/Einführung_Plunet_Mitarbeiterinfo.docx" will not be accessible due to incompatible encoding
.... many other files... will not be accessible due to incompatible encoding
+---------+--------+--------------+
| Folders | Files  | Elapsed time |
+---------+--------+--------------+
| 113548  | 423462 | 00:27:23     |
+---------+--------+--------------+
Logged into Nextcloud 24.0.8 and NONE of the files from above could be opened with Collabora Online Office.

2. Test - patch disabled, NFD enabled.
sudo -u www-data php /var/www/nextcloud/occ files:scan DariusS
Starting scan for user 1 out of 1 (DariusS)
Entry "files_versions/DV/09_DV/Zugänge ITS.xlsx.v1651741342" will not be accessible due to incompatible encoding
+---------+--------+--------------+
| Folders | Files  | Elapsed time |
+---------+--------+--------------+
| 113548  | 423630 | 00:28:02     |
+---------+--------+--------------+
Logged into Nextcloud 24.0.8. Again NONE of the files from Test 1 could be opened with Collabora Online Office.

3. Test - patch enabled, NFD disabled.
sudo -u www-data php /var/www/nextcloud/occ files:scan DariusS
Starting scan for user 1 out of 1 (DariusS)
+---------+--------+--------------+
| Folders | Files  | Elapsed time |
+---------+--------+--------------+
| 113548  | 423631 | 00:26:59     |
+---------+--------+--------------+
Logged into Nextcloud 24.0.8. All files from Test 1 can be opened without problems with Collabora Online Office.

Enabling NFD on the share seems to help "occ files:scan" but does not help with Collabora Online Office, and it takes the most time, as expected.
I get the best results with the patch: it is the fastest, and every file opens in Collabora Online Office.

FYI... I have checked the files "Zugänge ITS.xlsx", "Zugänge ITS - Kopie.xlsx" and "Einführung_Plunet_Mitarbeiterinfo.docx" with "file" on our Samba 4 server.
09_DV/Zugänge ITS.xlsx: CDFV2 Encrypted
09_DV/Zugänge ITS - Kopie.xlsx: Microsoft Excel 2007+
20_Public/Lothar/Orga/Einführung_Plunet_Mitarbeiterinfo.docx: Microsoft Word 2007+

@benjelloun69

Hello guys,

I tried also many things before finding this solution that works for all. I think it is the only one which works for all scenarios!

@mudi0

mudi0 commented Mar 14, 2023

I am also affected: local external storage, files with German umlauts (ÄÜÖ) and some other special characters cannot be scanned and do not appear in the UI. But when I manually upload the files through the web UI, they are shown.

@jancborchardt
Member

@benjelloun69 do you mind opening a pull request with your patch? Thank you! ❤️

@jancborchardt jancborchardt added 1. to develop Accepted and waiting to be taken care of and removed 0. Needs triage Pending check for reproducibility or if it fits our roadmap labels Mar 27, 2023
@jancborchardt jancborchardt moved this to 🧭 Planning evaluation (don't pick) in 🖍 Design team Mar 27, 2023
@pgassmann

Just debugged a major issue with files on external storage on Nextcloud 27.0.1. The external folder was WebDAV from a Hetzner Storage Box.
I worked around the issue by manually renaming all files that contained a plus character (+).
These files were uploaded through Nextcloud to the external storage!

@brainrom

brainrom commented Oct 6, 2024

I also faced this bug, and the behavior is similar: a file with special characters on external storage (local).
According to my experiments, the bug is present when a letter is followed by the UTF-8 sequence cc 88 (COMBINING DIAERESIS). For letters that have no precomposed variant with a diaeresis (b̈ or f̈) everything works fine, but ö, ä, etc. break.
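
This observation matches how NFC composition works: a base letter plus COMBINING DIAERESIS is only replaced when Unicode defines a precomposed character for the pair. A minimal Python sketch (composes is a hypothetical helper) shows the difference:

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS, UTF-8 bytes cc 88

def composes(base: str, mark: str = DIAERESIS) -> bool:
    """True if NFC merges base + mark into a single precomposed code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

for letter in "oabf":
    print(letter, composes(letter))
# o True
# a True
# b False
# f False
```

For b̈ or f̈ the normalized name equals the stored name, so a lookup by the normalized name still succeeds; for ö or ä the normalized name differs from what is on disk.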
