Unicode everywhere #442

pombredanne · 2017-01-10T11:59:59Z

As a prep for #295 and for general sanity we should use unicode everywhere by default.

Today several pieces of code deal with a hodgepodge of bytes and unicode strings in a semi-organized way (and that is a euphemism).

We should instead have a simpler approach:

- Use unicode as the default everywhere (e.g. for now, use from __future__ import unicode_literals in every Python file, and six text types or similar to handle Python 2/3 compatibility where needed).
- Always decode to unicode at the boundaries. By boundaries I mean the points where content, files or paths are ingested or reported. From then on, only process unicode.
- Convert back to encoded bytes at the boundaries when reporting out, where needed (e.g. UTF-8, or the filesystem encoding for paths consumed externally).
- Deal with bytes rather than unicode explicitly, and only by exception, wherever low-level byte handling is required. Only a few places should need this.
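
The boundary approach listed above could look roughly like the following. This is only a minimal sketch for Python 3, not ScanCode code: the helper names `ingest_path`, `ingest_content` and `report_out` are hypothetical, and the choice of `errors="replace"` for undecodable bytes is an assumption.

```python
import os


def ingest_path(path):
    """Boundary in: accept a bytes or str path and always return str."""
    # os.fsdecode uses the filesystem encoding and leaves str untouched.
    return os.fsdecode(path)


def ingest_content(raw, encoding="utf-8"):
    """Boundary in: decode raw file bytes to str."""
    if isinstance(raw, str):
        return raw
    # Assumption: replace undecodable bytes instead of raising.
    return raw.decode(encoding, errors="replace")


def report_out(text, encoding="utf-8"):
    """Boundary out: encode str back to bytes for external consumers."""
    return text.encode(encoding)


# Everything between the two boundaries only ever sees str.
path = ingest_path(b"modules/@angular/core.js")
content = ingest_content("caf\u00e9".encode("utf-8"))
line = "{}: {}".format(path, content.upper())
payload = report_out(line)
```

Keeping the decode and encode steps in a handful of small functions means the few places that genuinely need raw bytes stand out as explicit exceptions.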

mjherzog · 2017-05-17T01:54:46Z

ScanCode TK v2.0.0rc2 incorrectly reports the "@" sign from a scan as "%40". This occurs with Angular sub-components from https://github.com/angular/angular/tree/2.0.x/modules/%40angular and may be related to the string "%40angular" in the URL for "https://github.com/angular/angular/tree/2.0.x/modules/@angular".

pombredanne · 2017-06-09T12:46:54Z

@mjherzog the problem where ScanCode "reports "@" sign from a scan as "%40"" is fixed in develop; it was tracked in #542.
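
For context, "%40" is the URL percent-encoding of "@". The snippet below only illustrates that round trip with the standard library; it is not the fix applied in #542, and whether ScanCode produced the value this way is an assumption.

```python
from urllib.parse import quote, unquote

name = "@angular"

# Percent-encoding a path segment escapes "@" as "%40".
encoded = quote(name)

# Decoding restores the original text.
decoded = unquote(encoded)

print(encoded, decoded)
```

A path segment that is quoted for use in a URL and then reported without being unquoted would show up with "%40" in place of "@".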

pombredanne · 2021-02-10T08:50:01Z

We have now dropped support for Python 2 and use unicode internally everywhere. At last!

pombredanne added core and api enhancement labels Jan 10, 2017

pombredanne mentioned this issue Jun 17, 2017

Dockerfile (alpine based) + some quick ideas/suggestions #636

Open

This was referenced Jan 7, 2020

Could not scan deps with @ on Windows oss-review-toolkit/ort#2090

Closed

Scanning a path with "%" fails on Windows #1876

Closed

pombredanne closed this as completed Feb 10, 2021
