tar archive encodes file-names wrongly #2255

Closed
individual-it opened this issue Nov 10, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@individual-it
Contributor

originally in owncloud/ocis#2743

Describe the bug

file-names with non-latin characters are not displayed correctly with some tar tools

Steps to reproduce

Steps to reproduce the behavior:
using ocis & owncloud web:

  1. create a folder
  2. upload a file with non-latin characters in its file-name into the folder
  3. download the complete folder using the archiver endpoint. Make sure Windows NT is not sent in the User-Agent header, so on Windows the user-agent needs to be faked, or use curl
  4. this should give a .tar file as download; if you are given a .zip file, the User-Agent header contains the string Windows NT
  5. inspect the tar archive on Windows or unpack the archive using php:
    <?php
    $archive = new PharData('file.tar');
    $archive->extractTo('/tmp/test');
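
A minimal sketch of the format selection that steps 3 and 4 describe, assuming the decision depends only on whether the User-Agent header contains the string Windows NT. The function name `archive_extension` is hypothetical and is not taken from ocis or reva; the real server logic may differ.

```python
def archive_extension(user_agent: str) -> str:
    """Return the archive type the report observes for a given User-Agent.

    Assumption: a plain substring check on "Windows NT", as described in
    the reproduction steps. This is an illustration, not the server code.
    """
    return ".zip" if "Windows NT" in user_agent else ".tar"
```

With this check, a browser on Windows would be offered a .zip, while curl with its default User-Agent would be offered a .tar, which is the case needed to reproduce the problem.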
    

Expected behavior

file-names should contain all non-latin characters

Actual behavior

This works fine with the Linux tar command, but not with PHP or any Windows tool I tried.
Instead, all non-latin characters are removed from the file-names and I get an additional folder called PaxHeaders.0. This folder contains text files whose content is the correct name of the actual file.

[Screenshot: utf-archive]

PaxHeaders.0: 27 path=my_data/öäü.txt
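
The leading 27 in that record is its length in bytes. A pax extended header record has the form `<length> <keyword>=<value>\n`, where the length counts the whole record, including its own digits, and the value is stored as UTF-8. The sketch below rebuilds the record shown above; the helper name `pax_record` is hypothetical.

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Build one pax extended header record: b"<length> <keyword>=<value>\\n".

    The length prefix counts every byte of the record, including the
    digits of the length itself, so it is computed iteratively.
    """
    body = f" {keyword}={value}\n".encode("utf-8")
    length = len(body)
    while True:
        # Total size if the prefix had as many digits as the current guess.
        candidate = len(body) + len(str(length))
        if candidate == length:
            break
        length = candidate
    return str(length).encode("ascii") + body
```

For the path in the report, each of ö, ä and ü takes two bytes in UTF-8, which is why a 15-character path ends up in a 27-byte record.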

Additional context

This is either a problem in how the tar files are constructed in ocis, or some widely used client library has an issue

@individual-it
Contributor Author

CC @gmgigi96

@gmgigi96
Member

Thanks @individual-it. I tried in my test environment (and also in our production) but I wasn't able to reproduce it (I tried with WinRAR 6.02). Could you give more details on your current reva setup (like the storage provider you used, ...)? Thanks

@individual-it
Contributor Author

@gmgigi96 I'm using ocis from the master branch https://github.com/owncloud/ocis/
Here is the file I got: https://jankaritech.ocloud.de/index.php/s/wTnLbHqs1Y07KPQ

it has a folder called my_data and a file in it called öäüfile.txt

when I unpack it with PHP

<?php
$archive = new PharData('tar-bug-reva.tar');         
$archive->extractTo('/tmp/test');

the file name is only file.txt

Can you try to open that file with WinRAR?

@individual-it
Contributor Author

AFAIK ocis uses the ocis storage by default

@gmgigi96
Member

@individual-it I can see the content of the tar you pointed me to without any problems, as you can see from the screenshot:
[Screenshot 2021-11-12 at 15 10 37]

I also tried with different storage providers (eos, localhome) and it works as expected: I get the correct filename (with non-latin characters) and no PaxHeaders.0 folder. Are you sure this is not a problem with your WinRAR version or the library you are using in PHP?

@individual-it
Contributor Author

@gmgigi96 yes that might be, I will have to dig deeper

@individual-it
Contributor Author

After reading more about tar, I think this is a general issue with tar: officially it does not support anything other than ASCII.
Only the pax format of tar supports UTF-8, and not all libraries seem to support that.
For more details see e.g. haskell/tar#6 & https://docs.python.org/2/library/tarfile.html#unicode-issues
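
To illustrate the difference between a reader with and without pax support, here is a sketch using Python 3's `tarfile` module. It writes a pax archive in memory, reads it back with `tarfile` (which understands pax), and then walks the raw 512-byte headers the way a reader without pax support would. The helper names are hypothetical, and the archive is produced by Python rather than by reva, so the ASCII fallback name may differ in detail from the one in this issue (where the characters are dropped), but the mechanism should be the same.

```python
import io
import tarfile

BLOCK = 512  # tar archives are a sequence of 512-byte blocks


def build_pax_archive(name: str, payload: bytes) -> bytes:
    """Write a single-file tar archive in pax format and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def pax_aware_names(archive: bytes) -> list[str]:
    """Member names as seen by a reader that applies pax 'path' records."""
    with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
        return tar.getnames()


def legacy_names(archive: bytes) -> list[str]:
    """Names as seen when every header, including the pax extended header,
    is treated as an ordinary member and only the 100-byte name field is read."""
    names = []
    offset = 0
    while offset + BLOCK <= len(archive):
        header = archive[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # all-zero block marks the end of the archive
            break
        raw_name = header[0:100].split(b"\0", 1)[0]
        names.append(raw_name.decode("ascii", errors="replace"))
        # The size field is an octal number; data is padded to whole blocks.
        size = int(header[124:136].split(b"\0", 1)[0].strip() or b"0", 8)
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return names
```

Calling `pax_aware_names` on an archive built for `my_data/öäüfile.txt` should return the full name, while `legacy_names` returns two ASCII-only entries: one for the pax extended header and one for the file under its fallback name. That matches the extra PaxHeaders entry and the shortened file name described above.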

I will close this issue and propose switching to ZIP on Linux systems as well.
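
For comparison, a small sketch of how ZIP records a non-ASCII name, using Python 3's `zipfile` module. ZIP has a general purpose flag (bit 11, 0x800) that marks a file name as UTF-8, and `zipfile` sets it when the name is not pure ASCII. The helper name `zip_roundtrip` is hypothetical, and this says nothing about how well individual Windows tools honour that flag.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is encoded as UTF-8


def zip_roundtrip(name: str, payload: bytes) -> zipfile.ZipInfo:
    """Write one file into an in-memory ZIP and return its entry as read back."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, payload)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return zf.infolist()[0]
```

Reading the entry back gives the original name in `filename`, and `flag_bits & UTF8_NAME_FLAG` is non-zero for a name such as `my_data/öäüfile.txt`.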
