CSV recognized as ASCII text in Debian #208

Open
kyprifog opened this issue Apr 21, 2020 · 10 comments

Comments

@kyprifog
Contributor

kyprifog commented Apr 21, 2020

Output of cat /etc/*-release:

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

A file that is recognized as "CSV" on macOS (it is clearly a CSV file, with a .csv extension) is recognized as "ASCII text" on Debian. Reinstalling libmagic-dev didn't help.

@ahupp
Owner

ahupp commented Apr 21, 2020

What does the file command say about this file? If it says csv, can you give an exact code snippet you're using?

@kyprifog
Contributor Author

kyprifog commented Apr 29, 2020

root:# file csv_sample.csv
csv_sample.csv: ASCII text

Does this mean I need to install a different version of libmagic?

@ahupp
Owner

ahupp commented Apr 29, 2020

I'm not sure whether macOS uses the same file command as Debian. If it does, I'd compare the versions and see whether something changed between them. The difference could come from actual code changes or from the magic definition file (which usually ships along with the code).
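
One way to do that comparison is to run file --version on each machine and compare the reported releases. A minimal sketch, assuming the first line of the output has the usual file-X.YY form; parse_file_version and local_file_version are hypothetical helper names, not part of any library:

```python
import re
import subprocess


def parse_file_version(output: str) -> tuple:
    """Extract (major, minor) from `file --version` output such as 'file-5.38'."""
    match = re.search(r"file-(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version output: {output!r}")
    return int(match.group(1)), int(match.group(2))


def local_file_version() -> tuple:
    """Run `file --version` on this machine and parse the first match."""
    result = subprocess.run(
        ["file", "--version"], capture_output=True, text=True, check=False
    )
    return parse_file_version(result.stdout)
```

The tuples compare naturally, so (5, 11) < (5, 38) shows at a glance which machine has the older release.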

@kyprifog
Contributor Author

kyprifog commented May 1, 2020

I looked in the Debian image I was using (the Debian Docker image) and the magic database was empty. I copied the database from my Mac into the Docker image at /usr/share/misc/magic/ (not sure this would work anyway), but still got the same result. apt-get upgrade file didn't help either. I'll keep digging.

@ahupp
Owner

ahupp commented May 4, 2020

FWIW, on Debian bullseye (not a Docker image) I'm running file 5.38-4, and it does recognize a CSV file.

@kyprifog
Contributor Author

kyprifog commented Sep 1, 2020

related: #75

@harrystaley

At their core, .csv files are just ASCII text files and as such carry the same file signature.
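
Since there are no magic bytes to match, any CSV detection has to look at the shape of the content instead. A rough sketch of such a heuristic using the standard library's csv.Sniffer; looks_like_csv is a hypothetical helper, the candidate delimiter list is an assumption, and ordinary prose that happens to contain commas on every line can still be misclassified:

```python
import csv


def looks_like_csv(text: str, delimiters: str = ",;\t|") -> bool:
    """Heuristically decide whether text looks like delimiter-separated data."""
    try:
        # sniff() raises csv.Error when no consistent delimiter is found
        csv.Sniffer().sniff(text, delimiters=delimiters)
    except csv.Error:
        return False
    return True
```

This could serve as a fallback when the file command only reports "ASCII text", but it is a guess based on delimiter consistency, not a replacement for libmagic.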

@indiVar0508

Hi,

I think I have a similar issue: I am creating a pandas DataFrame and calling to_csv(), but I get different detection results on different systems.
I tried to create a minimal reproducible example below.

Ubuntu:
    Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import magic
    >>> import pandas as pd
    >>> df = pd.DataFrame({"a": [1,2,3], "b": [2,3,4]})
    >>> df.to_csv()
    ',a,b\n0,1,2\n1,2,3\n2,3,4\n'
    >>> magic.detect_from_content(df.to_csv().encode('utf-8'))
    FileMagic(mime_type='application/csv', encoding='us-ascii', name='CSV text')

Centos:
    Python 3.8.19 (default, May 27 2024, 05:59:07) 
    [GCC 10.2.1 20210130 (Red Hat 10.2.1-11)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import magic
    >>> import pandas as pd
    >>> df = pd.DataFrame({"a": [1,2,3], "b": [2,3,4]})
    >>> df.to_csv()
    ',a,b\n0,1,2\n1,2,3\n2,3,4\n'
    >>> magic.detect_from_content(df.to_csv().encode('utf-8'))
    FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text')

Could this be a locale issue?
For Ubuntu I have Ubuntu 20.04 LTS.
For CentOS I used the Docker image quay.io/pypa/manylinux2014_x86_64 with cp38-cp38.
python-magic 0.4.27 is used on both OSes.

@ahupp
Owner

ahupp commented Jul 18, 2024

@indiVar0508 Almost certainly this is because the CentOS image ships an old version of libmagic.
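
To check which libmagic a given environment actually loads, python-magic exposes the library's integer version (for example 538 for 5.38). A small sketch, assuming python-magic's magic.version() is available; the two helper names are hypothetical, and very old libmagic builds may not report a version at all, in which case this returns None:

```python
def format_libmagic_version(version: int) -> str:
    """Turn libmagic's integer version (e.g. 538) into dotted form ('5.38')."""
    major, minor = divmod(version, 100)
    return f"{major}.{minor:02d}"


def installed_libmagic_version():
    """Return the dotted libmagic version, or None if it cannot be determined."""
    try:
        import magic  # python-magic; the import fails if libmagic is missing

        return format_libmagic_version(magic.version())
    except (ImportError, NotImplementedError, AttributeError):
        # No python-magic/libmagic, or a build that does not expose its version
        return None
```

Running installed_libmagic_version() in both the Ubuntu and the CentOS environment should make the version gap visible.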

@indiVar0508

indiVar0508 commented Jul 30, 2024

I see, thanks for the help. Yes, this was the reason.
I collected the detection results across different OS/libmagic combinations in case someone else needs them or stumbles upon this.

python-magic : 0.4.27
Python       : 3.8
                    libmagic Detected content
Ubuntu 20.04 LTS :    538    CSV  -> FileMagic(mime_type='application/csv', encoding='us-ascii', name='CSV text')
                             JSON -> FileMagic(mime_type='application/json', encoding='us-ascii', name='JSON data')
                             XLSX -> FileMagic(mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', encoding='binary', name='Microsoft Excel 2007+')
                             ZIP  -> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract')
                             PDF  -> FileMagic(mime_type='application/pdf', encoding='binary', name='PDF document, version 1.4')

Ubuntu 22.04 LTS :    541    CSV  -> FileMagic(mime_type='text/csv', encoding='us-ascii', name='CSV text')
                             JSON -> FileMagic(mime_type='application/json', encoding='us-ascii', name='JSON data')
                             XLSX -> FileMagic(mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', encoding='binary', name='Microsoft Excel 2007+')
                             ZIP  -> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract, compression method=deflate')
                             PDF  -> FileMagic(mime_type='application/pdf', encoding='binary', name='PDF document, version 1.4, 1 pages')

Centos7          :    511    CSV  -> FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text')
                             JSON*-> FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text, with no line terminators')
                             XLSX*-> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract')
                             ZIP  -> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract')
                             PDF  -> FileMagic(mime_type='application/pdf', encoding='binary', name='PDF document, version 1.4')

UBI8             :    533    CSV  -> FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text')
                             JSON*-> FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text, with no line terminators')
                             XLSX*-> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract')
                             ZIP  -> FileMagic(mime_type='application/zip', encoding='binary', name='Zip archive data, at least v2.0 to extract')
                             PDF  -> FileMagic(mime_type='application/pdf', encoding='binary', name='PDF document, version 1.4')

Code used to generate the results in the table:

import io
import json
import zipfile

import magic
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# Print the libmagic version if the loaded library exposes it
if magic._has_version:
    print(magic.magic_version())

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

# CSV detection
print(magic.detect_from_content(df.to_csv().encode("utf-8")))

# JSON detection (encoded to bytes, like the CSV case)
print(magic.detect_from_content(json.dumps({"a": 1, "b": [2, 3]}).encode("utf-8")))

# Excel detection (to_excel needs an Excel writer such as openpyxl installed)
writer_io = io.BytesIO()
df.to_excel(writer_io)
writer_io.seek(0)
print(magic.detect_from_content(writer_io.read()))

# Zip detection
df.to_csv("file.csv")
with zipfile.ZipFile("file_compressed.zip", "w") as zpo:
    zpo.write("file.csv", compress_type=zipfile.ZIP_DEFLATED)
with open("file_compressed.zip", "rb") as f:
    print(magic.detect_from_content(f.read()))

# PDF detection: render a table with matplotlib and save it as a PDF
df = pd.DataFrame(np.random.random((10, 3)), columns=("col 1", "col 2", "col 3"))

# https://stackoverflow.com/questions/32137396/how-do-i-plot-only-a-table-in-matplotlib
fig, ax = plt.subplots(figsize=(12, 4))
ax.axis("tight")
ax.axis("off")
the_table = ax.table(cellText=df.values, colLabels=df.columns, loc="center")

# https://stackoverflow.com/questions/4042192/reduce-left-and-right-margins-in-matplotlib-plot
pp = PdfPages("foo.pdf")
pp.savefig(fig, bbox_inches="tight")
pp.close()
with open("foo.pdf", "rb") as f:
    print(magic.detect_from_content(f.read()))
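
One more thing the table shows: even the versions that do detect CSV disagree on the MIME type (libmagic 538 reports application/csv, 541 reports text/csv). Code that branches on the MIME type may want to accept both. A minimal sketch; CSV_MIME_TYPES and is_csv_mime are hypothetical names, and the set only covers the two values observed in the table:

```python
# MIME types reported for CSV content in the table (libmagic 538 and 541)
CSV_MIME_TYPES = {"text/csv", "application/csv"}


def is_csv_mime(mime_type: str) -> bool:
    """Return True if mime_type is one of the CSV types seen across versions."""
    return mime_type.strip().lower() in CSV_MIME_TYPES
```

For example, is_csv_mime(result.mime_type) could replace a direct comparison against a single string, where result is the FileMagic value returned by detect_from_content.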
