Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: input file encoding #596

Merged
merged 8 commits into from
Oct 19, 2023
Merged

fix: input file encoding #596

merged 8 commits into from
Oct 19, 2023

Conversation

JCHacking
Copy link
Contributor

@JCHacking JCHacking commented Oct 12, 2023

input files in lock-format are expected in a certain encoding,
other input file encodings are detected.

fixes #448

@JCHacking JCHacking requested a review from a team as a code owner October 12, 2023 19:07
Now the characters are read in bytes, what encoding they have is evaluated and converted to a string with it.

Signed-off-by: JCHacking <[email protected]>

Refs: #448
@jkowalleck
Copy link
Member

jkowalleck commented Oct 19, 2023

why did you use a niche library https://pypi.org/project/faust-cchardet
why not use a library that is widely used? like https://pypi.org/project/chardet ?

if it was for the py3.6 compatibility, chardet provides older compatible versions.

@JCHacking
Copy link
Contributor Author

why did you use a niche library https://pypi.org/project/faust-cchardet
why not use a library that is widely used? like https://pypi.org/project/chardet ?

if it was for the py3.6 compatibility, chardet provides older compatible versions.

That was one of the reasons, the other one is because some encodings like cp1252 are returned as Windows-1252 so they are not exactly the same to pass it to the decode method.

Also faust-cchardet is a maintained fork of cchardet which is written in C so it has better performance.
Traducido con DeepL https://www.deepl.com/app/?utm_source=android&utm_medium=app&utm_campaign=share-translation

@jkowalleck
Copy link
Member

re #596 (comment)

[...] it has better performance

Performance should not be a concern at this point.
Maintainability and support of used libraries is more important here. So better use a well-maintained library like chardet that is widely used.
If this choice is insufficient at any point, we can still report bugs to the maintainers team of that library.


see #448 (comment)

could you test the following on your system?

  • for poetry.lock and Pipfile.lock use: open(..., encoding="utf8") -- no mode, must path and encoding
  • for requirements.txt the char-detection is okay in general.

@JCHacking
Copy link
Contributor Author

re #596 (comentario)

[...] tiene mejor rendimiento

El rendimiento no debería ser una preocupación en este momento. La mantenibilidad y el soporte de las bibliotecas usadas son más importantes aquí. Así que es mejor utilizar una biblioteca bien mantenida como chardetla que se utiliza ampliamente. Si esta elección es insuficiente en algún momento, aún podemos informar errores al equipo de mantenimiento de esa biblioteca.

ver #448 (comentario)

¿Podrías probar lo siguiente en tu sistema?

  • para poetry.locky Pipfile.lockuso: open(..., encoding="utf8")-- sin modo, debe ruta y codificación
  • para requirements.txtla detección de caracteres está bien en general.

Perfect, then I change the library to chardet with a version that supports python 3.6 and then I try the other option that you mention of only validating the coding in requirements.txt

Changed to use the chardet library and now only the encoding in requirements.txt is inspected.

Signed-off-by: JCHacking <[email protected]>

Refs: #448
@JCHacking
Copy link
Contributor Author

The change is already done

Basically, if the open is done in byte mode the encoding will be inspected, otherwise it will be assumed that everything is OK.

And regarding chardet I have made a replace to make it work with windows, since it returns Windows-1252 but python only understands cp1252.

pyproject.toml Outdated Show resolved Hide resolved
JCHacking and others added 2 commits October 19, 2023 13:51
Changed to use the chardet library and now only the encoding in requirements.txt is inspected.

Signed-off-by: JCHacking <[email protected]>

Refs: #448
Signed-off-by: Jan Kowalleck <[email protected]>
@jkowalleck jkowalleck changed the title fix: Fix windows poetry charset error decode fix: input file encoding Oct 19, 2023
Signed-off-by: Jan Kowalleck <[email protected]>
@jkowalleck
Copy link
Member

I had to do minor version range adjustments and other chores.

I will add a regression test, and then, this fix is ready to go.
Thank you for your effort, @JCHacking

Signed-off-by: Jan Kowalleck <[email protected]>
@JCHacking
Copy link
Contributor Author

I had to do minor version range adjustments and other chores.

I will add a regression test, and then, this fix is ready to go.
Thank you for your effort, @JCHacking

Thanks to you for letting me collaborate in its solution.

@jkowalleck jkowalleck merged commit a9dda4b into CycloneDX:main Oct 19, 2023
22 checks passed
@jkowalleck jkowalleck mentioned this pull request Oct 19, 2023
11 tasks
@jkowalleck
Copy link
Member

fix is available as of v3.11.3

@JCHacking JCHacking deleted the char_encode branch November 18, 2023 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error getting BOM file from poetry
2 participants