Added chardet to detect the encoding of the content #6

Merged
merged 7 commits into from
Dec 4, 2023

Nishantbhagat57
Contributor

When running urless on a txt file I encountered this error:
ERROR processInput 2 'utf-8' codec can't decode byte 0xac in position 5030: invalid start byte

The error message indicates that Python is decoding the file as UTF-8 and hits a byte (here 0xac) that cannot begin a UTF-8 character.
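
The failure can be reproduced in isolation. This is a minimal sketch with made-up sample bytes, not the actual content of the input file: 0xac is only valid as a UTF-8 continuation byte, so the decoder rejects it as a start byte.

```python
# Illustrative bytes only; not the actual content of the input file.
raw = b"https://example.com/\xacpath"

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # 0xac can only appear as a continuation byte in UTF-8, never as a start byte.
    message = str(exc)

print(message)

# A single-byte encoding such as Latin-1 accepts every byte value,
# so the same data decodes without raising.
print(raw.decode("latin-1"))
```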

To fix this issue I have modified the processInput() function to use chardet to handle encoding/decoding issues without modifying any content. chardet can help to detect the encoding of the content. We can use that encoding to open the file, rather than blindly assuming utf-8.
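
A rough sketch of the idea follows. It is not the code from this PR: `read_text_any_encoding` is a hypothetical helper name, and both the UTF-8 fallback and `errors="replace"` are assumptions added so the sketch never raises on bad bytes and still runs where chardet is not installed.

```python
from pathlib import Path

try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch runnable without chardet
    chardet = None


def read_text_any_encoding(path):
    """Hypothetical helper: read a file as text without assuming UTF-8."""
    raw = Path(path).read_bytes()
    encoding = None
    if chardet is not None:
        # chardet.detect returns a dict with 'encoding' and 'confidence' keys;
        # 'encoding' can be None when detection fails.
        encoding = chardet.detect(raw).get("encoding")
    try:
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return raw.decode(encoding or "utf-8", errors="replace")
    except LookupError:
        # The detected name is unknown to Python: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")
```

Reading the raw bytes first and decoding afterwards means the file is opened only once, and a wrong guess degrades to replacement characters rather than an exception.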

Before modifications:

Screenshot urless 2

After modifications:

Screenshot urless 1

@Nishantbhagat57
Contributor Author

ping @xnl-h4ck3r

@xnl-h4ck3r
Owner

Hey @Nishantbhagat57 , thanks for the pull request! This looks great! I just need to try it out locally first to make sure.
There seem to be 2 commits for setup.py though... the second having url="https://github.com/Nishantbhagat57/urless", so I guess that commit should be removed
Thanks. I'll get back to you after I've tested.

@Nishantbhagat57
Contributor Author

Hey there, thank you for reviewing this! My bad, the last commit regarding the URL change was intended for my personal fork and ended up in this PR inadvertently. Didn't think it would sneak into this PR :|

@xnl-h4ck3r
Owner

Can you remove that commit from the pull request?
Also, can you change the following as part of the request:

  • Change README.md to say v1.1 instead of v1.0
  • Change image/init.py to say v1.1 instead of v1.0
  • Add the v1.1 and description of the change to CHANGELOG.md

Thanks!

@xnl-h4ck3r
Owner

xnl-h4ck3r commented Oct 30, 2023

Also, I have a question... you mentioned that character 0xac was in the file. That appears to be the symbol ¬. Is that a valid character that was in a URL?
Can you give me the line from the file that causes that error so I can reproduce the issue?
Thanks

@Nishantbhagat57
Contributor Author

Hi @xnl-h4ck3r I have implemented the changes.

Now, about the symbol ¬ or character 0xac you asked about, it was indeed found in the URL as follows: https://app.example.com/icicleViews/icicleCustomerPortrait/​addwarterMark.js

Even though the endpoint isn't correct, which could be categorized as a false positive, I believe urless should still effectively manage such instances. It simply threw an error in this case without attempting to deduplicate or de-clutter the URLs. In my case, I'm using urless as a part of an automation workflow, and it's essential that the script operates efficiently in all cases, regardless of some data portions being invalid.

That's why I suggest continuing with the chardet modification despite the input error - it maintains our ability to handle encoding/decoding issues effectively :)

@xnl-h4ck3r
Owner

Hi @Nishantbhagat57 , I agree 100% it shouldn't crash with an error, even if it's not a valid url.
I'm having some problems testing this at the moment though. Even though chardet has been installed, I keep getting the error ModuleNotFoundError: No module named 'chardet' when I run it.
Also, if I add the line https://app.example.com/icicleViews/icicleCustomerPortrait/​addwarterMark.js to a test file and run it, I don't get any errors from the original version for some reason. Not sure why that is. That's not too important though; as long as the new version still works for me, that's fine. I just need to work out why it's erroring for chardet. Any ideas?
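
One possible explanation for the missing error, offered as a guess rather than a diagnosis: if the pasted character is saved by a UTF-8 editor, ¬ is stored as the two bytes C2 AC, which decode cleanly, whereas a file containing a lone AC byte does not. The sketch below writes both variants to hypothetical test files and checks whether each is valid UTF-8.

```python
import tempfile
from pathlib import Path

tmp_dir = Path(tempfile.mkdtemp())

# Hypothetical test files: the same URL with the character stored two ways.
url_prefix = b"https://app.example.com/icicleViews/icicleCustomerPortrait/"
url_suffix = b"addwarterMark.js\n"

utf8_file = tmp_dir / "pasted_utf8.txt"
utf8_file.write_bytes(url_prefix + "¬".encode("utf-8") + url_suffix)  # bytes C2 AC

raw_file = tmp_dir / "raw_byte.txt"
raw_file.write_bytes(url_prefix + b"\xac" + url_suffix)  # lone byte AC


def decodes_as_utf8(path):
    """Return True if the whole file is valid UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_file))
print(decodes_as_utf8(raw_file))
```

Under this assumption, only the second file would trigger the original error, so a reproduction needs the raw byte written directly rather than pasted text.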

@Nishantbhagat57
Contributor Author

@xnl-h4ck3r I don't know why it's saying ModuleNotFoundError: No module named 'chardet'

But you can try these things:

  1. If you are running the urless.py script with sudo, e.g. sudo python3 urless.py -i input.txt -o output.txt, then chardet may not be installed for the Python interpreter that sudo uses. Try the command without sudo.
  2. Reinstall Python3. I always use Homebrew to do this:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    (echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/$USER/.bashrc
    eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
    brew install python3
    pip3 install chardet argparse pyyaml termcolor urlparse3
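
Before reinstalling, it may help to check which interpreter is actually running and whether it can see chardet. This is a small diagnostic sketch, not part of urless; `module_location` is a hypothetical helper name.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Which interpreter is running, and can it see chardet?
print("interpreter:", sys.executable)
print("chardet:", module_location("chardet"))
```

Running it once with python3 and once with sudo python3 should show whether the two commands use different interpreters or different site-packages directories.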

I tested the modified script again and it works without issues in my environment.

@xnl-h4ck3r
Owner

Can you run these and let me know what versions you have?

python --version
chardet --version

@Nishantbhagat57
Contributor Author

Screenshot 2023-10-31 204552


@xnl-h4ck3r
Owner

Thank you. Sorry this is taking so long to sort out.
When I run pip3 show chardet I get the same output as you, and see version 5.2.0. However, if I run chardet --version I get chardet 3.0.4. What do you get for chardet --version?
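
A version reported by pip3 show that differs from the one printed by the chardet command may mean the two belong to different installations; that is a possibility, not something confirmed in this thread. The sketch below prints what the current interpreter sees and where the command on PATH lives. `installed_version` is a hypothetical helper name.

```python
import shutil
from importlib import metadata


def installed_version(dist_name):
    """Version of a distribution visible to this interpreter, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("chardet seen by this interpreter:", installed_version("chardet"))
# Path of the chardet command on PATH, if any; it may belong to another install.
print("chardet command on PATH:", shutil.which("chardet"))
```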

@Nishantbhagat57
Contributor Author

Screenshot 2023-11-01 191454

@xnl-h4ck3r
Owner

I get exactly the same, which, although very strange, doesn't help me figure out why I can't get mine to run with chardet :(

@xnl-h4ck3r
Owner

Hey @Nishantbhagat57. Sorry to go quiet on this... I am still trying to figure out what the problem is. It's def something on my setup rather than your changes obviously. I still need to fix it on mine so I can test properly first. Thanks for your patience

@xnl-h4ck3r xnl-h4ck3r merged commit a5e94ab into xnl-h4ck3r:main Dec 4, 2023