add bom update #16

Merged
merged 1 commit into from
Apr 3, 2024
11 changes: 11 additions & 0 deletions articles/20220306_python_datetimes/20220306_python_datetimes.md
@@ -1,5 +1,16 @@
# 20220306 Python Datetimes

- [Standard Library vs. Packages](#standard-library-vs-packages)
- [Current Datetime](#current-datetime)
- [Current Datetime as UTC](#current-datetime-as-utc)
- [But what if my timezone isn't UTC?](#but-what-if-my-timezone-isnt-utc)
- [Using the computer's local timezone](#using-the-computers-local-timezone)
- [Bonus](#bonus)
- [Removing microseconds from the datetime](#removing-microseconds-from-the-datetime)


## Standard Library vs. Packages

Python `datetime` objects frequently trip me up when I use them, usually due to a combination of

1. dense documentation that is hard to read
@@ -1,34 +1,164 @@
# 20230919 Parsing BOMs in Python

- [Introduction](#introduction)
- [Show the BOM](#show-the-bom)
- [Create a UTF8 file](#create-a-utf8-file)
- [Reading the file in Python](#reading-the-file-in-python)
- [UTF-16](#utf-16)
- [The codecs package](#the-codecs-package)
- [BOM detection](#bom-detection)
- [Demo](#demo)

## Introduction

The "byte-order mark" or [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) is a special character that can appear at the very beginning of UTF-8 and UTF-16 files.
The marker is there to be our friend, but many languages and libraries don't handle it by default, and Python is no exception.

You'll usually encounter these files if you work with data that came from Windows programs; otherwise they're rare.

## Show the BOM

The BOM is the code point `U+FEFF`. In UTF-8 it is encoded as the three bytes `EF BB BF`.

On Linux, it's easy to create a test file to play around with.

I like to use `cat -A` to check for non-printing characters. Passing `-` as the file name makes `cat` read from standard input, so you can pipe anything into it, e.g.

```shell
☯ ~ echo -e '\xEF\xBB\xBF' | cat -A -
M-oM-;M-?$

☯ ~ printf '\ufeff\n' | cat -A -
M-oM-;M-?$
```
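
The same bytes can be checked from Python. This is a minimal sketch using only the standard library; the variable names are my own.

```python
import codecs

# U+FEFF is the code point; its UTF-8 encoding is the three bytes EF BB BF.
bom_char = '\ufeff'
utf8_bytes = bom_char.encode('utf-8')

print(utf8_bytes)                     # b'\xef\xbb\xbf'
print(utf8_bytes == codecs.BOM_UTF8)  # True
```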

## Create a UTF8 file

To create a UTF-8 file with a BOM, write the BOM character from above followed by some text, and save it to a file.

```shell
☯ ~ printf '\ufeffhello world' > test.csv

# check the file using the `file` command
☯ ~ file test.csv
test.csv: Unicode text, UTF-8 (with BOM) text, with no line terminators

# check the file using cat -A
☯ ~ cat -A test.csv
M-oM-;M-?hello world
```

### Reading the file in Python

When you open a file with the default encoding (UTF-8 on most Linux systems), Python does not strip the BOM; it shows up as the first character of the text.

```python
with open('test.csv') as istream:
    s = istream.read()

s
# '\ufeffhello world'
```

However, this can be easily fixed by using the `utf-8-sig` encoding!
The following info is buried within the [Python `codecs` documentation](https://docs.python.org/3/library/codecs.html):

> On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.


```python
with open('test.csv', encoding='utf-8-sig') as istream:
    s = istream.read()

s
# 'hello world'
```

Now you can see that the BOM character has been removed automatically! The same works for writing: the `utf-8-sig` encoding adds the BOM for you.

```python
with open('test.csv', 'w', encoding='utf-8-sig') as ostream:
    print('hello world', file=ostream)
```
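
To confirm that the BOM really lands on disk and is stripped again on read, here is a small round-trip sketch. It writes to a throwaway temporary path (`roundtrip.csv` is a made-up name) so that `test.csv` is left alone.

```python
import codecs
import os
import tempfile

# Throwaway location; the file name is only for illustration.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')

with open(path, 'w', encoding='utf-8-sig') as ostream:
    ostream.write('hello world\n')

# Raw bytes: the BOM is the first thing in the file.
with open(path, 'rb') as istream:
    raw = istream.read()

# Decoded text: utf-8-sig drops the BOM again.
with open(path, encoding='utf-8-sig') as istream:
    text = istream.read()

print(raw.startswith(codecs.BOM_UTF8))  # True
print(repr(text))                       # 'hello world\n'
```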

## UTF-16

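For UTF-16 files, the BOM comes in 2 flavors, big-endian and little-endian. It is the same `U+FEFF` code point in both; only the byte order on disk differs. Python has no `-sig` variant for UTF-16: the plain `utf-16` codec uses the BOM to pick the byte order and strips it, while `utf-16-be` and `utf-16-le` leave it in the decoded text.

- UTF-16 BE: bytes `FE FF`
- UTF-16 LE: bytes `FF FE`

To help out, let's write a file with a UTF-16 BOM and some text.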

```python
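import codecs

with open('test16.csv', 'wb') as ostream:
    ostream.write(codecs.BOM_UTF16)  # BOM in the machine's native byte order
    # raw ASCII bytes, not UTF-16 encoded: enough for testing BOM detection only
    ostream.write(b'hello world\n')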
```

```shell
☯ ~ file test16.csv
test16.csv: Unicode text, UTF-16, little-endian text, with no line terminators

☯ ~ cat -A test16.csv
M-^?M-~hello world$
```
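
For comparison, here is a sketch of how the text itself can be encoded as UTF-16, and how the different codecs treat the BOM. It works on in-memory bytes only and is an illustration, not part of the test file above.

```python
import codecs

text = 'hello world\n'

# The plain 'utf-16' codec prepends a BOM in the machine's native byte order.
encoded = text.encode('utf-16')
print(encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# Decoding with 'utf-16' uses the BOM to pick the byte order, then drops it.
print(repr(encoded.decode('utf-16')))  # 'hello world\n'

# The endian-specific codecs do neither: no BOM is written, none is stripped.
le_bytes = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
print(repr(le_bytes.decode('utf-16-le')))  # '\ufeffhello world\n'
```

This is why a detected BOM is best paired with the plain `utf-16` name rather than `utf-16-le` or `utf-16-be`.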

### The codecs package

The standard library has a `codecs` module that contains a few handy constants for the BOM bytes.

```python
import codecs

codecs.BOM_UTF16_LE
# b'\xff\xfe'
codecs.BOM_UTF16_BE
# b'\xfe\xff'
```
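
`codecs.BOM_UTF16` (used when writing `test16.csv`) is an alias for whichever of the two matches the machine's byte order, which would explain why `file` reported little-endian. A quick check, with `expected` being my own helper name:

```python
import codecs
import sys

# BOM_UTF16 is an alias for the variant matching this machine's byte order.
expected = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE
print(sys.byteorder, codecs.BOM_UTF16 == expected)

# BOM lengths in bytes: UTF-8 needs 3, UTF-16 needs 2.
print(len(codecs.BOM_UTF8), len(codecs.BOM_UTF16))  # 3 2
```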

### BOM detection

Using these constants, we can make a function that will detect a BOM character at the start of a file, and return the correct encoding.

```python
import codecs

CODECS = {
    "utf-8-sig": [codecs.BOM_UTF8],
    "utf-16": [
        codecs.BOM_UTF16,
        codecs.BOM_UTF16_BE,
        codecs.BOM_UTF16_LE,
    ]
}

def detect_encoding(fpath: str) -> str:
    # open the file in bytes mode
    with open(fpath, 'rb') as istream:
        # read the first 3 bytes (the UTF-8 BOM is 3 bytes, the UTF-16 BOM is 2)
        data = istream.read(3)
    # iterate over the codecs and return the encoding if the BOM is found
    for encoding, boms in CODECS.items():
        if any(data.startswith(bom) for bom in boms):
            return encoding
    # no BOM found: fall back to plain UTF-8
    return 'utf-8'

detect_encoding('test.csv')
# 'utf-8-sig'
detect_encoding('test16.csv')
# 'utf-16'
```
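
The detected encoding can be passed straight to anything that opens the file, for example the `csv` module. The sketch below is self-contained, so it carries a compact stand-in for `detect_encoding`; the names `sniff_bom_encoding` and `read_rows`, and the sample data, are made up for illustration.

```python
import codecs
import csv
import os
import tempfile

def sniff_bom_encoding(fpath: str) -> str:
    """Compact stand-in for detect_encoding above (hypothetical name)."""
    with open(fpath, 'rb') as istream:
        head = istream.read(3)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return 'utf-8'

def read_rows(fpath: str):
    """Yield CSV rows as dicts, decoding with the sniffed encoding."""
    with open(fpath, 'r', encoding=sniff_bom_encoding(fpath), newline='') as istream:
        yield from csv.DictReader(istream)

# Made-up sample data, written with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), 'people.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as ostream:
    ostream.write('name,city\r\nAda,London\r\n')

rows = list(read_rows(path))
print(rows)  # [{'name': 'Ada', 'city': 'London'}] on Python 3.8+
```

Without the BOM-aware encoding, the first header would come back as `'\ufeffname'`, and lookups by `'name'` would fail.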

### Demo

Finally, you can use this encoding detection inline when reading a file! For this test, I used a UTF-16 file that I found in this repo: https://github.com/stain/encoding-test-files

```python
# fpath: path to the UTF-16 test file from the repo above
with open(fpath, 'r', encoding=detect_encoding(fpath)) as istream:
    s = istream.read()

s
# 'première is first\npremière is slightly different\nКириллица is Cyrillic\n𐐀 am Deseret\n'
```