Skip to content
This repository has been archived by the owner on Mar 18, 2022. It is now read-only.

Test for unicode path #45

Merged
merged 4 commits into from
May 22, 2018
Merged

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Mar 6, 2018

Test to expose fix for #27

This test was a little hard to figure out how to do it, since I wasn't able to use the name on the py file, so I had to add an actual file (I take the file from readthedocs/readthedocs.org#3732 (comment)) and I had to made it py2 and py3 compatible. Please let me know if there is a better way.

@stsewd
Copy link
Member Author

stsewd commented Mar 6, 2018

I forgot to make it py3 compatible

# When python2
assert isinstance(e, UnicodeDecodeError)

unicode_base_path = base_path.decode('utf-8')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the final question, should we force always to be str on the find_one method or we should document that anyone calling this function must pass a str object and not an unicode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably would be better to explicitly check the python version on the test, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to support UTF8 file names, so I think we just need to patch the walk method. We did this somewhere -- either readthedocs.org or sphinx-autoapi (i think)

Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also look to some of the tests that probably already mock out the find_one calls. You shouldn't need to depend on a local repo file for the test, though this isn't a huge problem.

assert path == ''
except Exception as e:
# When python2
assert isinstance(e, UnicodeDecodeError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a try/except here, I'd use six or basic python version_info checking to test for these two cases:

sys.version_info < (3,0,0)

@@ -75,3 +75,18 @@ def test_find_multiple_files(tmpdir):
str(tmpdir.join('first', 'readthedocs.yml')),
str(tmpdir.join('third', 'readthedocs.yml')),
]


def test_find_unicode_path(tmpdir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can also be written to fail, using the pytest.xfail decorator. Instead of writing a test that will fail once we have the bug fixed, we can write a passing case. Our tests will technically pass, but there will be an expected failure on the test run.

# When python2
assert isinstance(e, UnicodeDecodeError)

unicode_base_path = base_path.decode('utf-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to support UTF8 file names, so I think we just need to patch the walk method. We did this somewhere -- either readthedocs.org or sphinx-autoapi (i think)

- Run only on python2
- Mark as fail
@stsewd stsewd force-pushed the test-unicode-error branch from e9252e9 to 2d0b0e5 Compare March 6, 2018 20:49
@stsewd
Copy link
Member Author

stsewd commented Mar 6, 2018

You can also look to some of the tests that probably already mock out the find_one calls. You shouldn't need to depend on a local repo file for the test, though this isn't a huge problem.

I tried to use apply_fs to reproduce the bug, but I wasn't able to do it (even copying the same name of the file from this test). Even manually creating a file with the same name (copying it from the one that fails).

I think probably this is due a bad encoding or a corrupted file from the user? Even on my OS the file isn't show correctly.

bad_encode

@agjohnson
Copy link
Contributor

So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

That is bad, and in fact, I'm not sure why that's happening, as our locale is set:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

What you're testing might not be the right thing to test. Also, you are having trouble creating a file or path with this codepoint because your encoding is likely 'utf-8' already. With the encoding ascii, I have no problem doing this with py.path and apply_fs. If you try the following on a box with default utf8 encoding, you'll get a file with a codepoint that isn't 0xf0:

In [1]: import os

In [2]: os.mkdir('/tmp/foo/')

In [3]: os.mkdir('/tmp/foo/f\xf0\xf0')

In [4]: ls /tmp/foo
f??/

In [5]: os.listdir('/tmp/foo')
Out[5]: ['f\xf0\xf0']

To make this more confusing, however, even with this I'm not able to reproduce the problem with find_all:

In [18]: path
Out[18]: local('/tmp/foo')

In [19]: path.listdir()
Out[19]: [local('/tmp/foo/f\xf0\xf0'), local('/tmp/foo/bo\xf0')]

In [20]: list(find_all('/tmp/foo', ['readthedocs.yml']))
Out[20]: []

So, I'm pretty confused at this point.

@stsewd
Copy link
Member Author

stsewd commented Mar 8, 2018

So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

I got the same on my local instance

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

I asked to the user for more information readthedocs/readthedocs.org#3732 (comment), and I think probably that file was generated with some weird encoding that only Windows understand (I have seen some Windows files showing as invalid encode on my machine a couple of times). Also now that I remember, a couple of weeks ago I was helping to a German friend to setup his rtd instance and he was using Windows, and faced some similar problems with encoding (but on other part of the build).

@ericholscher
Copy link
Member

The fix for python encoding being crazy is "use python3". I've tried to fix the server encodings a million times, and it doesn't work. We just need to move to an all UTF-8 world.

@ericholscher
Copy link
Member

Looks like a good test to have, so going to merge this.

@ericholscher ericholscher merged commit b992332 into readthedocs:master May 22, 2018
@stsewd stsewd deleted the test-unicode-error branch May 22, 2018 18:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants