
File formats and encodings

Before we start: What is a file? What is a folder/directory?

Identifying a file format

  • How do you identify a file format?
  • How does your operating system do it?

    By filename extensions

  • Why is this naive?
    1. Download any image, e.g. this JPEG (right click on the link and choose "Save target element as..." to download instead of opening it in the browser)
    2. Open it by double clicking
    3. Change its name to arpanet.html
    4. Try to open it by double clicking
  • but also...
    1. Download any Ms Word file, e.g. this one
    2. Rename it to sample.zip
    3. Try to open it by double clicking
  • ...and... What's the format of the file with the .bin extension?

    Ask Google

  • Is there a better way to recognize the file type?

    Yes, we can try to actually analyze its content, e.g. with the file command

     1. In the CLI, go to the folder containing arpanet.html
     2. Execute: `file arpanet.html`
    
  • Even when looking into the file content the result might be surprising
    echo 'rot,blau,gelb' > farben.csv
    file --mime farben.csv
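
The idea behind content-based detection can be sketched in a few lines of Python. This is only an illustration and not how the `file` command is actually implemented: the `MAGIC_NUMBERS` table and the `sniff` and `sniff_file` functions are names made up for this sketch, and the table covers just a handful of well-known signatures.

```python
# Hypothetical mini version of content-based format detection.
# Each entry maps the first bytes of a file (its "magic number") to a format name.
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive (also used by .docx, .xlsx, .odt)",
}


def sniff(data: bytes) -> str:
    """Guess a format from raw bytes, ignoring any file name."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    # No known signature: check whether the bytes are at least valid UTF-8 text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary data"
    return "plain text (valid UTF-8)"


def sniff_file(path: str) -> str:
    """Apply sniff() to the content of the file at the given path."""
    with open(path, "rb") as handle:
        return sniff(handle.read())
```

With this sketch, `sniff_file("arpanet.html")` would still report a JPEG image for the renamed download, while the content of farben.csv is only recognised as plain text, because nothing in the bytes themselves says "CSV".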
    

File formats

There are two broad file format kinds: binary and text.

Binary formats

"Files you can't open in a plain text editor", e.g.

  • Images: jpg, png, etc.
  • Movies & sound: avi, mkv, mp4, mpeg, mp3, etc.
  • Old Ms Office files: .doc, .xls, .ppt
  • PDF (although not entirely...)

There are various reasons for using binary formats:

  • The data stored in a file have no good text representation (e.g. images, movies, sound)
  • Compression.
  • Performance.
  • Protecting intellectual property rights: only you know how to read (decode) the data stored in a file.

Text formats

"Files you can open in a plain text editor", e.g.

  • txt: plain text file
  • csv, tsv: delimited values "database" files
  • xml, json: general structured data file formats
  • html, md: files storing formatted texts
  • js, php, py, r, cpp, etc.: source code in various programming languages
  • and many, many more (e.g. IANA-registered list of text formats)

There are various reasons for using text formats:

  • All of them can be read and edited using just a plain text editor.
    This doesn't mean it's always the most convenient way.
  • They are easy to compare and version.
    (you should have seen it during the git lecture)

Images

Images, despite having distinct content, share similarities with other files. In fact, they can be either binary files, such as JPEG, PNG, GIF and BMP, or text files, such as Scalable Vector Graphics (SVG) and the plain variants of Portable Graymap (PGM) and Portable Bitmap (PBM).

Depending on users' needs, images can be stored as:

  • JPEG: lossy compression, small size (good for transmission)
  • PNG: lossless compression; files are usually larger than JPEG for photographs
  • Uncompressed (RAW) or LZW-compressed TIFF: lossless formats that represent the faithful digital version of the holding; they store correct image dimensions and colour profiles. This is our gold standard when it comes to archiving.
    Note: TIFF files are typically binary, and their content is not easily human-readable. Therefore, providing an example in a text representation is not practical.
  • SVG: Scalable Vector Graphics, an XML-based format for describing two-dimensional vector graphics. It is useful for simple geometric shapes, logos, diagrams, etc., and can be styled with CSS.
    Note: SVG is a text-based format, but it's not an image format in the traditional sense. It's more of a markup language for describing vector graphics.
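
To make the idea of text-based images concrete, here is a small Python sketch. The two images (`PBM_TEXT`, `SVG_TEXT`) and the helper `parse_pbm` are made up for this illustration; the helper assumes a plain ("P1") PBM without comment lines and is not a complete PBM reader.

```python
import xml.etree.ElementTree as ET

# A 5x5 black cross on a white background in plain PBM: 1 = black, 0 = white.
PBM_TEXT = """P1
5 5
0 0 1 0 0
0 0 1 0 0
1 1 1 1 1
0 0 1 0 0
0 0 1 0 0
"""

# A red circle described as SVG, which is XML and therefore also plain text.
SVG_TEXT = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="50" r="40" fill="red"/>'
    "</svg>"
)


def parse_pbm(text):
    """Return (width, height, rows) for a comment-free plain PBM string."""
    tokens = text.split()
    if tokens[0] != "P1":
        raise ValueError("not a plain PBM image")
    width, height = int(tokens[1]), int(tokens[2])
    pixels = [int(token) for token in tokens[3:]]
    rows = [pixels[i * width:(i + 1) * width] for i in range(height)]
    return width, height, rows


width, height, rows = parse_pbm(PBM_TEXT)

# Because SVG is XML, a generic XML parser can read the image description.
svg_root = ET.fromstring(SVG_TEXT)
circle = svg_root[0]
```

Saving the two strings as cross.pbm and circle.svg (file names chosen for this example) should give files that an image viewer or browser can display and that a plain text editor can still edit.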

When it comes to maintaining consistency among scanned images, there are three levels of detail:

  • Thumbnail
  • General Access (JPEG, PNG)
  • Master (RAW, TIFF)

File formats hierarchy

  • Can a file be in more than one format at the same time?

    Sure it can, e.g.

    1. Download a sample HTML file (right click on the link and choose "Save target element as..." to download instead of opening it in the browser).
    2. Open it in a browser (just double click on it)
    3. Make a copy of it and rename the copy to sample.xml
    4. Open sample.xml in a browser (just double click on it)
    5. Make a copy of it and rename the copy to sample.txt
    6. Open sample.txt in a plain text editor (e.g. Visual Studio Code)

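    As we can see, this HTML file is also an XML file and a plain text file at the same time. (Strictly speaking, this only holds for HTML written as well-formed XML, i.e. XHTML; browsers also accept HTML that is not valid XML.)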

  • File formats form a hierarchy, with more specialized ones being built on top of more generic ones, e.g.
    text file -+                      any file which can be read in a plain text editor
               +-> XML --+            a generic text format for storing structured data in text files
               |         +-> TEI      uses XML to store text data
               |         +-> (X)HTML  uses XML to store web pages
               |         +-> RDF/XML  uses XML to store RDF data
               |         +-> MARC/XML uses XML to store MARC data (library catalog data)
               |
               +-> JSON -+            another generic format for storing structured data in text files
                         +-> geoJSON  uses JSON to store spatial data
                         +-> JSON-LD  uses JSON to store RDF data
    ZIP -+        just a compressed set of arbitrary files
         +-> DOCX a compressed set of files representing an Ms Word document
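
The hierarchy can also be observed from code. The following Python sketch walks one branch from plain text to JSON to GeoJSON, and shows that a DOCX-like file is readable with generic ZIP tools. The sample point and the in-memory archive are made up for this illustration; the archive only mimics the internal layout of a Word document and is not a valid .docx file.

```python
import io
import json
import zipfile

# Level 1: plain text. This string could be typed in any text editor.
GEOJSON_TEXT = '{"type": "Point", "coordinates": [16.37, 48.21]}'

# Level 2: JSON. A generic JSON parser accepts the text.
parsed = json.loads(GEOJSON_TEXT)

# Level 3: GeoJSON. The parsed data follows the more specific conventions.
looks_like_geojson = parsed.get("type") == "Point" and "coordinates" in parsed

# The ZIP branch: build a tiny archive in memory as a stand-in for a .docx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")

# Generic ZIP tools can open it without knowing anything about Word.
buffer.seek(0)
is_zip = zipfile.is_zipfile(buffer)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
```

Each more specialized format adds rules on top of the generic one, which is why a tool for the generic format can still read the file but cannot tell whether the specialized rules are respected.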
    

Expected file contents: languages and associated files

  • Don’t confuse the type of a file with its contents, even though these are closely linked:

    • A simple txt file could contain Shakespeare's A Midsommer nights dreame, a Python script or even a mixture of both ;)
  • When we handle files, even if they are just special text files, we can't ignore their typical contents and the typical associated languages

    • E.g., whitespace in an XML file is often considered to be irrelevant for the contents of that file
      <element>content<child>child-element content</child></element>
      is broadly expected to be equivalent to:
      <element>content<child>child-element content</child>
      </element>
    • In contrast, whitespace matters a lot in Python script files.
      if (2+2)==4:
        print("everything still okay")
      While the above is valid Python (or at least looks like it), the following could never run, and an error message will appear:
      if (2+2)==4:
      print("everything still okay")
      (If you want to, try to run the "correct" example in a python shell by simply pasting it there. It will fail. Any idea why?)
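
Both claims about whitespace can be checked with a short Python sketch. The helper names `normalized` and `compiles` are made up for this example, and "equivalent" is taken here to mean "equal after stripping surrounding whitespace from all text", which is one common convention rather than a rule of XML itself.

```python
import xml.etree.ElementTree as ET

XML_ONE_LINE = "<element>content<child>child-element content</child></element>"
XML_TWO_LINES = "<element>content<child>child-element content</child>\n</element>"

PYTHON_INDENTED = 'if (2+2)==4:\n  print("everything still okay")\n'
PYTHON_NOT_INDENTED = 'if (2+2)==4:\nprint("everything still okay")\n'


def normalized(elem):
    """Describe an element as nested tuples, with surrounding whitespace stripped."""
    text = (elem.text or "").strip()
    tail = (elem.tail or "").strip()
    return (elem.tag, text, tail, tuple(normalized(child) for child in elem))


def compiles(source):
    """Return True if Python accepts the source code, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return False
    return True


# The extra line break changes the parsed XML only in whitespace-only text.
xml_equivalent = normalized(ET.fromstring(XML_ONE_LINE)) == normalized(
    ET.fromstring(XML_TWO_LINES)
)

# The missing indentation changes whether the Python code is accepted at all.
indented_ok = compiles(PYTHON_INDENTED)
not_indented_ok = compiles(PYTHON_NOT_INDENTED)
```

Note that the XML comparison only succeeds because the sketch deliberately ignores whitespace; a tool that preserves whitespace would treat the two documents as different.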
  • Knowing the typical content associated with a file can help you determine how to handle it. For example, it can prevent you from inadvertently modifying significant whitespace. It also helps you to know what you could do with a file; files could be:

    • executed: .sh, .exe, .js, .py
    • used to store data from a project: .xml, .docx, .jpg, .sql, .md
    • used to store graphical data, photos, shapes etc.: .tif, .jpg, .svg
    • used to configure a system or program: .xml, .json

      ! Note that some of these are binary files, while some are text files. These two categories, however, don’t align with the above mentioned purposes.

  • The purpose of a file often provides clues about the language it's written in and its associated data structures. You would be surprised, for example, if you discovered that the Word document you just received from your professor contained a Python script. This is because you typically expect such a document to contain text written in a natural language, e.g. German. Technical languages are not uniform but are organized into families, each tailored to fulfill specific technical tasks or objectives.

    Programming languages

    Generally speaking, we use them to define sets of instructions that can be executed by a computer.

    • Python
    • Java
    • JavaScript
    • C, C++, C#
    • Fortran

    Markup languages

    We use them to encode and structure text-like information.

    • XML
    • HTML
    • TeX
    • Markdown (in fact, this document was written in Markdown.)

    Query languages

    We use them to retrieve specific parts or sets of (structured) data, e.g. to search for something.

    • SPARQL (for RDF graphs)
    • SQL (for relational databases)
    • XQuery (for XML data sources)
    • XPath (for XML documents)

File format conversion

  • Many file formats can be converted between each other, e.g.
    • pandoc allows conversion between many different formatted text formats
    • csv2json can convert CSV to JSON and vice versa
    • Ms Office/Libre Office/Google Docs being able to save both in .docx and .odt, .xlsx and .ods, etc.
    • and there are plenty others - google it!

Example:

Let's imagine you want to create a list of your enemies. Let us further assume you use a CSV file. You would create a table containing 3 columns:

  1. name
  2. age
  3. actions
  • When you are done with the table, you suddenly realize: some of your enemies have children. Surely, their children are your enemies too (you are not easily forgiving), but they don't qualify for their own rows in your table, since they are only secondary enemies. Still, you want to store the same data for parents and children. Maybe storing the data in a .csv file was a bad idea from the beginning … ?
  • File-conversions can be tricky.
    When?
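
The enemies example can be sketched in Python to show where a conversion becomes tricky. All names and data below are invented for this illustration, and the `children` field is just one possible way to model the nesting.

```python
import csv
import io
import json

# Hypothetical content of the enemies table as CSV text.
CSV_TEXT = """name,age,actions
Alice,41,took my parking spot
Bob,38,spoiled the ending of a film
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts; every value is still a string."""
    return list(csv.DictReader(io.StringIO(text)))


records = csv_to_records(CSV_TEXT)
for record in records:
    record["age"] = int(record["age"])  # CSV has no number type, JSON does
    record["children"] = []             # nesting that a flat table cannot express

# Secondary enemies are stored inside their parent's record.
records[0]["children"].append(
    {"name": "Carol", "age": 9, "actions": "none so far", "children": []}
)

as_json = json.dumps(records, indent=2)
```

Going from CSV to JSON is straightforward because JSON can express everything the table contains. The opposite direction is where it gets tricky: to write `records` back to CSV you would have to decide what happens to the nested `children`, for example dropping them, flattening them into extra columns, or adding a separate table.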

Problems specific to plain text files

Character sets

  • The issue: there's no common agreement on how computers should internally store characters, and different parties do it in different ways, which leads to trouble.

  • A bit of history:

    • Historically, we tried to store characters as compactly as possible, which limited the number of characters that could be represented.
    • Initially (ASCII, 1963), there were only around 100 useful characters. This left no space for characters specific to non-English alphabets. (Just think about it: even 10 digits and 26 letters in lower and upper case are already 62 characters, and we also need the space, comma, period, brackets, etc.)
    • This was quickly (still in the '60s) extended by an additional 128 characters, which was enough to handle (almost) any single language.
      This led to the creation of hundreds (!) of encoding standards, which we call code pages.
    • It took until 1991 to come up with a standard that allows (hopefully) any character to be represented in a uniform way, Unicode, but it still has several encodings (UTF-8, UTF-16, UTF-32 and a few others).
  • Why are code pages troublesome?

    1. You have to know a file's code page to read it properly, but this information is not contained in the file
      • Download a sample Windows-1252-encoded file and open it in VS Code (right click on the link and choose "Save target element as..." to download instead of opening it in the browser).
        Change the encoding so the file is displayed properly: in the bottom bar of VS Code, you'll see the label UTF-8. Click it to open the action bar and select Reopen with Encoding.
      • Download a sample ISO-8859-1-encoded file and open it in VS Code.
        Change the encoding so the file is displayed properly.
      • Download a sample file in an unknown encoding and open it in VS Code.
        It contains the same text as iso_8859-1.txt, but can you guess the encoding so that it displays correctly?
    2. You can't mix characters from different code pages in one file, e.g. a Windows-1252 file can't contain Jürgen Żółtak (a mix of German and Polish characters)
  • Unfortunately, code pages are still widely used, e.g. in:

    • PDFs (!)
    • Filenames in ZIP files created by Windows (!)
    • Many apps working with plain text files under Windows (!)
    • Legacy data created before Unicode gained momentum
  • What can go wrong with Unicode?

    • There are many ways of encoding Unicode data: UTF-8, UTF-16, UTF-32.
    • To avoid problems with unknown file encodings, the BOM (byte order mark) was invented.
      For better or worse, the BOM has never been widely adopted. Still, if you have a BOM-aware app and a file containing the BOM, automated encoding recognition works.
  • UTF-8 without BOM is the most portable Unicode encoding.
    Use it in every new file you create.

    • If you're using Mac or Unix, it's the default.
    • If you're using Windows, make sure your app is set up to save files using UTF-8.
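
The problems described in this section can be reproduced with Python's built-in codecs. This is a minimal sketch; the variable names are made up for the example, and `cp1252` is Python's name for the Windows-1252 code page.

```python
import codecs

NAME = "Jürgen Żółtak"

# 1. Bytes read with the wrong code page turn into garbled text ("mojibake").
utf8_bytes = "Jürgen".encode("utf-8")
misread = utf8_bytes.decode("cp1252")

# 2. A single code page cannot hold every character: Windows-1252 covers the
#    German "ü" but not the Polish "Ż" and "ł".
try:
    NAME.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# 3. The Unicode encodings store the same text with different numbers of bytes.
sizes = {
    encoding: len(NAME.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

# 4. The "utf-8-sig" codec writes a BOM in front of the text and removes it
#    again when decoding; the plain "utf-8" codec keeps it as a character.
with_bom = NAME.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)
decoded_bom_aware = with_bom.decode("utf-8-sig")
decoded_bom_unaware = with_bom.decode("utf-8")
```

The last two lines hint at why a BOM can cause trouble: an application that is not BOM-aware ends up with an extra invisible character (U+FEFF) at the start of the text.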

Character sets conversion

There are various tools for converting files between character sets, e.g.

  • With VS Code: click on the label with the current encoding in the bottom action bar (e.g. UTF-8) and select Save with Encoding (Reopen with Encoding only changes how the file is read). Choose the target encoding from the list.
  • Alternatively use the iconv app in the cli, e.g.:
    iconv -f CP1252 -t UTF-8 fileInWindows1252Encoding.txt > fileConvertedToUTF8.txt
    (f = "from code", t = "to code")
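
The same conversion can be sketched in Python. The function names `convert_encoding` and `convert_file` are made up for this example, and the sketch assumes you already know the source encoding, just as with `iconv -f`.

```python
def convert_encoding(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from one character set to another."""
    # Decoding turns bytes into text; encoding turns the text back into bytes.
    return data.decode(source).encode(target)


def convert_file(src_path: str, dst_path: str, source: str, target: str = "utf-8") -> None:
    """Convert a whole file; binary mode leaves line endings untouched."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(convert_encoding(data, source, target))
```

For example, `convert_file("fileInWindows1252Encoding.txt", "fileConvertedToUTF8.txt", "cp1252")` mirrors the iconv call above. If the source encoding is wrong, the decoding step either raises an error or silently produces garbled text, so it is worth checking the result.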

Line endings

  • For historical reasons, there are two characters used to denote the end of a line in plain text files: a Carriage Return (\r) and a Line Feed (\n). (If you're wondering why, think of how typewriters used to work.)
  • Different operating systems use them in a different way:
    • Windows default is \r\n
    • Unix/current Mac default is \n
    • Legacy MacOS default is \r
  • Most apps just handle all conventions listed above but it does make a difference for file comparison (e.g. in git)
  • There are ways to convert line ending style:
    • In the bottom bar of VS Code, you'll see a label displaying the line endings of the current file. You can toggle between LF/CRLF.
    • On Windows: with Notepad++ or dos2unix and unix2dos
      • When you install git on Windows you can choose if the conversion should be performed automatically when you pull/push data from remote repositories.
    • On Mac: install dos2unix via Homebrew (brew install dos2unix)
    • On Unix: use dos2unix and unix2dos
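
What tools such as dos2unix and unix2dos do can be sketched in a few lines of Python. This is an illustration of the idea, not their actual implementation; the function names are made up, and the sketch works on bytes so that no automatic newline translation interferes.

```python
def to_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) and legacy Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first, so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert any of the three conventions to Windows (CRLF) line endings."""
    return to_lf(data).replace(b"\n", b"\r\n")


def detect_line_endings(data: bytes) -> set:
    """Return the set of line ending conventions found in the data."""
    found = set()
    crlf_count = data.count(b"\r\n")
    if crlf_count:
        found.add("CRLF")
    if data.count(b"\n") > crlf_count:  # LF not preceded by CR
        found.add("LF")
    if data.count(b"\r") > crlf_count:  # CR not followed by LF
        found.add("CR")
    return found
```

A file for which `detect_line_endings` returns more than one convention has mixed line endings, which is a typical source of noisy diffs in git.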

Problems related to file names and paths

  • Path separators:

    • \ (Windows),
    • / (Unix and Mac but generally works in Windows as well)
  • Characters allowed in file and folder names

    • Differ between operating systems or even file systems, e.g. windows:will:not:store:it.txt is a valid file name under Linux but not under Windows
    • Some characters are allowed but may require special handling in the cli, e.g. a space.
      1. Rename any file in a way its name contains a space
      2. Try to copy it in the cli now.
        How to do it properly?
    • To be on the safe side, avoid characters other than letters, digits, a dot, an underscore and a dash.
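
These rules can be sketched with Python's standard library. The helper `safe_name` is made up for this example and, as an assumption, restricts "letters" to unaccented ASCII letters; the file names are only illustrations.

```python
import re
import shlex
from pathlib import PurePosixPath, PureWindowsPath


def safe_name(name: str) -> str:
    """Replace every character other than A-Z, a-z, 0-9, dot, underscore and dash."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


# The same logical path, written with each system's separator.
windows_style = str(PureWindowsPath("data", "my file.txt"))
posix_style = str(PurePosixPath("data", "my file.txt"))

# Windows paths also accept the forward slash as a separator.
parts = PureWindowsPath("data/sub/report.txt").parts

# Quoting makes a name with a space safe to paste into a Unix shell command.
quoted = shlex.quote("my file.txt")

cleaned = safe_name("windows:will:not:store:it.txt")
```

Building paths with `pathlib` instead of joining strings by hand avoids hard-coding a separator, and quoting (or renaming with something like `safe_name`) avoids the shell splitting a file name at the space.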

Keep your data open - Open Data (FAIR)

This topic is far too broad to discuss in detail during an introductory course, but it's still worth mentioning that:

  • The same information can be stored in different ways.
  • The format you choose impacts how easy it will be to reuse the data (see e.g. 5 Star Open Data).
    • Use formats which can be processed with free tools.
      • Something to think about: "free" as in "free beer" or as in "freedom of speech"? See e.g. here.
    • Think about licensing.
      • Take a look at CreativeCommons.
      • Honor the licenses of data you are using. In academia, you are most likely to violate attribution and "share derived work under the same license" obligations.
      • The "free beer" vs "freedom of speech" question applies also here.
    • Separate data from presentation and keep your data structured (make it easy to process your data in an automated way).
    • Follow your scientific community standards.
    • Don't forget about the metadata.
    • Deposit outcomes of your work in public repositories (e.g. Zenodo) so others can find and access them.
  • Read about:
    • Open Data and Linked Data
    • Remember that any data you create during your studies or work can be useful for others...
      ...but sharing it in a reusable way admittedly involves quite some work.