Before we start: What is a file? What is a folder/directory?
- How do you identify a file format?
- How does your operating system do it?
By filename extensions
- Why is this naive?
- Download any image, e.g. this JPEG (right click on the link and choose "Save target element as..." to download instead of opening it in the browser)
- Open it by double clicking
- Change its name to
arpanet.html
- Try to open it by double clicking
- but also...
- Download any Ms Word file, e.g. this one
- Rename it to sample.zip
- Try to open it by double clicking
- ...and...
What's the format of the file with the
.bin
extension?
- Is there a better way to recognize the file type?
Yes, we can try to actually analyze its content, e.g. with the `file` command:
1. In the CLI, go to the folder containing `arpanet.html`
2. Execute: `file arpanet.html`
- Even when looking into the file content the result might be surprising
echo 'rot,blau,gelb' > farben.csv
file -i --mime farben.csv
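The idea of looking at the content instead of the extension can be sketched in a few lines of Python. This is only an illustration of the principle, not how the `file` command is actually implemented: the helper names `sniff_format` and `sniff_file` are made up, and the signature table covers just four well-known formats.

```python
# Hypothetical helper: guess a file format from its first bytes ("magic numbers").
# The signature table is deliberately tiny; real tools know thousands of formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also DOCX, XLSX, ...)"),
]


def sniff_format(data: bytes) -> str:
    """Return a format name based on the leading bytes, ignoring any file name."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


# A JPEG renamed to arpanet.html still starts with the JPEG signature.
print(sniff_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
# Plain text such as a CSV line has no signature at all.
print(sniff_format(b"rot,blau,gelb\n"))
```

Text formats usually have no signature, which is one reason why content-based detection of a short CSV file can give surprising results.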
There are two broad file format kinds: binary and text.
"Files you can't open in a plain text editor", e.g.
- Images: jpg, png, etc.
- Movies & sound: avi, mkv, mp4, mpeg, mp3, etc.
- Old Ms Office files: .doc, .xls, .ppt
- PDF (although not entirely...)
There are various reasons for using binary formats:
- The data stored in a file have no good text representation (e.g. images, movies, sound)
- Compression.
- Performance.
- Protecting intellectual rights - only you know how to read (decode) the data stored in a file.
"Files you can open in a plain text editor", e.g.
- txt: plain text file
- csv, tsv: delimited values "database" files
- xml, json: general structured data file formats
- html, md: files storing formatted texts
- js, php, py, r, cpp, etc.: source code in various programming languages
- and many, many more (e.g. IANA-registered list of text formats)
There are various reasons for using text formats:
- All of them can be read and edited using just a plain text editor.
This doesn't mean it's always the most convenient way.
- They are easy to compare and version.
(you should have seen it during the git lecture)
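The "can I open it in a plain text editor" test can be approximated in code. The sketch below is a rough heuristic and an assumption of this example, not a standard: it treats data as text if it contains no NUL bytes and decodes cleanly as UTF-8. The function name `looks_like_text` is made up.

```python
def looks_like_text(data: bytes) -> bool:
    """Rough heuristic (an assumption, not a standard): no NUL bytes and valid UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Content of a small CSV file: passes the check.
print(looks_like_text("name,age,actions\n".encode("utf-8")))
# The first bytes of a PNG image: fails the check.
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

The heuristic has known blind spots: UTF-16 text contains NUL bytes and would be classified as binary, and text in a legacy code page may fail the UTF-8 check.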
Images, despite having distinct content, share similarities with other files. In fact, they can be either binary files, such as JPEG, PNG, GIF, BMP, or text files, such as Scalable Vector Graphics (SVG), Portable Graymap (PGM), Portable Bitmap (PBM), etc.
Depending on users' needs, images can be stored as:
- JPEG: lossy compression, small size (good for transmission)
- PNG: lossless compression, usually larger than JPEG for photographs (good for graphics with sharp edges, e.g. screenshots and diagrams)
- Uncompressed (RAW) or LZW-compressed TIFF: lossless format that represents a faithful digital version of the holding and stores correct image dimensions and colour profiles; it's our gold standard when it comes to archiving
  Note: TIFF files are typically binary, and their content is not easily human-readable. Therefore, providing an example in a text representation is not practical.
- SVG: Scalable Vector Graphics, useful for simple geometric shapes, logos, diagrams etc. SVG uses XML to describe scalable two-dimensional vector graphics, which can be styled with CSS.
  Note: SVG is a text-based format, but it's not an image format in the traditional sense. It's more of a markup language for describing vector graphics.
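To make the point about SVG concrete, here is a minimal sketch in Python. The SVG content is hand-written for this example (a red circle); the only claim illustrated is that an SVG image is ordinary text that a generic XML parser can read.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written SVG image: plain text that happens to be XML.
SVG_TEXT = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="50" r="40" fill="red"/>'
    "</svg>"
)

root = ET.fromstring(SVG_TEXT)   # parses like any other XML document
circle = root[0]                 # first child element: the circle
print(root.tag)                  # the tag name includes the SVG namespace
print(circle.attrib["r"])        # attributes are ordinary text values
```

Saving `SVG_TEXT` to a file with the `.svg` extension and opening it in a browser should display the circle.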
When it comes to maintaining consistency among scanned images, there are three levels of detail:
- Thumbnail
- General Access (jpeg, png)
- Master (raw, TIFF)
- Can a file be in more than one format at the same time?
Sure it can, e.g.
- Download a sample HTML file (right click on the link and choose "Save target element as..." to download instead of opening it in the browser).
- Open it in a browser (just double click on it)
- Make a copy of it and rename the copy to sample.xml
- Open sample.xml in a browser (just double click on it)
- Make a copy of it and rename the copy to sample.txt
- Open sample.txt in a plain text editor (e.g. Visual Studio Code)
As we can see, an HTML file can be an XML file as well as a plain text file at the same time.
- File formats form a hierarchy, with more specialized ones being built on top of more generic ones, e.g.
  - text file: any file which can be read in a plain text editor
    - XML: a generic text format for storing structured data in text files
      - TEI uses XML to store text data
      - HTML uses XML to store WWW webpages
      - RDF/XML uses XML to store RDF data
      - MARC/XML uses XML to store MARC data (library catalog data)
    - JSON: another generic format for storing structured data in text files
      - geoJSON uses JSON to store spatial data
      - JSON-LD uses JSON to store RDF data
  - ZIP: just a compressed set of arbitrary files
    - DOCX: a compressed set of files representing an MS Word document
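The layering can be observed directly in code. The following sketch uses a made-up GeoJSON point (the coordinates are invented for illustration) and a hypothetical helper `is_geojson_point`; it shows that the generic JSON layer can be read by any JSON parser, while the meaning of the fields belongs to the more specialized format.

```python
import json

# A minimal GeoJSON document (the coordinates are made up for illustration).
GEOJSON_TEXT = '{"type": "Point", "coordinates": [16.37, 48.21]}'

# Level 1: it is plain text. Level 2: any generic JSON parser can read it.
data = json.loads(GEOJSON_TEXT)


# Level 3: only a GeoJSON-aware reader knows that "coordinates" holds the
# position of a geometry of type "Point".
def is_geojson_point(obj: dict) -> bool:
    """Hypothetical check covering just one GeoJSON geometry type."""
    coords = obj.get("coordinates")
    return (
        obj.get("type") == "Point"
        and isinstance(coords, list)
        and len(coords) >= 2
    )


print(type(data).__name__)
print(is_geojson_point(data))
```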
- Don’t confuse the type of a file with its contents, even though these are closely linked:
  - A simple txt file could contain Shakespeare's A Midsommer nights dreame, a Python script or even a mixture of both ;)
- When we handle files, even if they are just special text files, we can't ignore their typical contents and the typical associated languages
  - E.g., whitespace in an XML file is often considered to be irrelevant for the contents of that file
    <element>content<child>child-element content</child></element>
    is broadly expected to be equivalent to:
    <element>content<child>child-element content</child>
    </element>
  - In contrast, whitespace matters a lot in Python script files.
    if (2+2)==4:
        print("everything still okay")
    While the above is valid Python (or at least looks like it), the following could never run, and an error message will appear:
    if (2+2)==4:
    print("everything still okay")
    (If you want to, try to run the "correct" example in a Python shell by simply pasting it there. It will fail. Any idea why?)
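Both claims can be checked programmatically. The sketch below is one possible way to do it, under the assumption that "equivalent" means "equal after trimming leading and trailing whitespace from text nodes"; the helper names `same_ignoring_whitespace` and `compiles` are made up for this example.

```python
import xml.etree.ElementTree as ET

XML_ONE_LINE = "<element>content<child>child-element content</child></element>"
XML_TWO_LINES = "<element>content<child>child-element content</child>\n</element>"


def same_ignoring_whitespace(a: ET.Element, b: ET.Element) -> bool:
    """Compare two elements, treating leading/trailing whitespace as irrelevant."""
    def clean(value):
        return (value or "").strip()

    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and clean(a.text) == clean(b.text)
        and clean(a.tail) == clean(b.tail)
        and len(a) == len(b)
        and all(same_ignoring_whitespace(x, y) for x, y in zip(a, b))
    )


def compiles(source: str) -> bool:
    """Return True if Python accepts the source text, False on a syntax error."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


# The two XML variants differ only by a line break before </element>.
print(same_ignoring_whitespace(ET.fromstring(XML_ONE_LINE), ET.fromstring(XML_TWO_LINES)))
# The same Python statement with and without the indentation.
print(compiles('if (2+2)==4:\n    print("everything still okay")\n'))
print(compiles('if (2+2)==4:\nprint("everything still okay")\n'))
```

Note that the XML parser does keep the line break (it ends up in the `tail` of the child element); it is the comparison that decides to ignore it.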
- Knowing the typical content associated with a file can help you determine how to handle a file. For example, it can prevent you from inadvertently modifying relevant whitespace. It also helps you to know what you could do with a file; files could be:
  - executed: `.sh`, `.exe`, `.js`, `.py`
  - used to store data from a project: `.xml`, `.docx`, `.jpg`, `.sql`, `.md`
  - used to store graphical data, photos, shapes etc.: `.tif`, `.jpg`, `.svg`
  - used to configure a system or program: `.xml`, `.json`

  ! Note that some of these are binary files, while some are text files. These two categories, however, don’t align with the above-mentioned purposes.
- The purpose of a file often provides clues about the language it's written in and its associated data structures. You would be surprised, for example, if you discovered that the Word document you just received from your professor contained a Python script. This is because you typically expect such a document to contain text written in a natural language, e.g., German. Technical languages are not uniform but are organized into families, each tailored to fulfill specific technical tasks or objectives.
Generally speaking, we use them to define sets of instructions that can be executed by a computer.
- Python
- Java
- JavaScript
- C, C++, C#
- Fortran
We use them to encode and structure text-like information.
- XML
- HTML
- TeX
- Markdown (in fact, this document was written in Markdown.)
We use them if we want to get specific parts/sets of (structured) data, e.g. to search for something.
- SPARQL (for RDF graphs)
- SQL (for relational databases)
- XQuery (for XML data sources)
- XPath (for XML documents)
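As a small taste of a query language, the sketch below runs an XPath-style expression from Python. Two caveats: the sample catalogue is made up for this example, and Python's built-in `xml.etree.ElementTree` supports only a limited subset of XPath, so this is not a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Made-up sample data, for illustration only.
CATALOG = """
<library>
  <book year="1600"><title>A Midsommer nights dreame</title></book>
  <book year="1818"><title>Frankenstein</title></book>
</library>
"""

root = ET.fromstring(CATALOG)

# ElementTree understands a limited subset of XPath.
# This expression selects the title of every book whose year attribute is 1818.
titles = [t.text for t in root.findall("./book[@year='1818']/title")]
print(titles)
```

The query describes which part of the data is wanted, not how to walk through the document to find it.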
- Many file formats can be converted between each other, e.g.
Let's imagine you want to create a list of your enemies. Let us further assume you use a CSV file. You would create a table containing 3 columns: `name`, `age`, `actions`
- When you are done with the table, you suddenly realize: some of your enemies have children. Surely, their children are your enemies too (you are not easily forgiving), but they don't qualify for their own rows in your table, since they are only secondary enemies. But still, you want to store the same data for parents and children. Maybe storing the data in a .csv file was a bad idea from the beginning … ?
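The mismatch between a flat table and nested data can be sketched as follows. All names and values are invented, and choosing JSON as the nested alternative is an assumption of this example (XML would serve equally well).

```python
import csv
import io
import json

# Hypothetical enemies; all names and values are invented.
enemies = [
    {"name": "Alice", "age": 42, "actions": "stole my lunch",
     "children": [{"name": "Bob", "age": 7, "actions": "laughed"}]},
    {"name": "Carol", "age": 35, "actions": "parked badly", "children": []},
]

# CSV: one flat row per enemy; the nested children do not fit into the 3 columns,
# so they are silently dropped here (extrasaction="ignore").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "actions"],
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerows(enemies)
csv_text = buffer.getvalue()

# JSON: the same records, with children nested inside their parent.
json_text = json.dumps(enemies, indent=2)

print(csv_text)
print(json_text)
```

Converting from the nested form to the flat one loses information unless extra conventions are introduced (e.g. a separate table for children), which is one reason why file conversions can be tricky.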
- File-conversions can be tricky.
When?
- The issue: there's no common agreement on how computers should store characters internally, and different parties do it in different ways, which leads to trouble.
- A bit of history:
  - Historically, we tried to store characters as compactly as possible, which limited the number of characters that could be represented.
  - Initially (ASCII, 1963) there were only around 100 useful characters. This left no space for characters specific to non-English alphabets. (Just think about it: even 10 digits and 26 letters in lower and upper case are already 62 characters, and we also need the space, comma, period, brackets, etc.)
  - This was quickly (still in the '60s) extended by an additional 128 characters, which was enough to handle (almost) any single language. This led to the creation of hundreds (!) of encoding standards, which we call code pages.
  - It took until 1991 to come up with a standard allowing any character (hopefully) to be represented in a uniform way: Unicode. But it still has a few encodings (UTF-8, UTF-16, UTF-32 and a few others).
- Why are code pages troublesome?
  - You have to know a file's code page to read it properly, but this information is not contained in the file.
    - Download and open in VS Code a sample Windows-1252-encoded file (right click on the link and choose "Save target element as..." to download instead of opening it in the browser).
      Change the encoding so the file is displayed properly: in the bottom bar of VS Code, you'll see the label `UTF-8`. Click it to open the action bar and select `Reopen with Encoding`.
    - Download and open in VS Code a sample ISO-8859-1-encoded file.
      Change the encoding so the file is displayed properly.
    - Download and open in VS Code a sample file in an unknown encoding.
      It contains the same text as `iso_8859-1.txt`, but can you guess the encoding so that it displays correctly?
  - A single code page covers only a limited set of characters, so you can't freely mix characters from different code pages in one file, e.g. a Windows-1252-encoded file can't contain `Jürgen Żółtak` (a mix of German and Polish characters).
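Both problems can be reproduced in a few lines of Python. This is a minimal sketch; the choice of Windows-1252, ISO-8859-1 and ISO-8859-2 as example code pages is an assumption of this illustration.

```python
text = "Jürgen Żółtak"

# Encoding with a single code page fails if a character is not part of it.
# Windows-1252 has no slot for the Polish letters Ż and ł.
try:
    text.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# Bytes do not carry their code page: decoding with the wrong one "works",
# but silently produces different characters.
polish_bytes = "Żółtak".encode("iso-8859-2")
wrong = polish_bytes.decode("iso-8859-1")
right = polish_bytes.decode("iso-8859-2")

# Unicode encodings such as UTF-8 can hold all of these characters at once.
utf8_bytes = text.encode("utf-8")

print("fits into Windows-1252:", fits_cp1252)
print("wrong code page gives the same text:", wrong == right)
print("UTF-8 round trip is lossless:", utf8_bytes.decode("utf-8") == text)
```

Note that decoding with the wrong code page raises no error at all, which is exactly why such mistakes often go unnoticed until someone reads the text.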
- Unfortunately, code pages are still widely used, e.g. in:
- PDFs (!)
- Filenames in ZIP files created by Windows (!)
- Many apps working with plain text files under Windows (!)
- Legacy data created before Unicode gained momentum
- What can go wrong with Unicode?
  - There are many ways of encoding Unicode data: UTF-8, UTF-16, UTF-32.
  - To avoid problems with unknown file encoding, the BOM (byte order mark) was invented.
    Unfortunately or not, the BOM has never been widely adopted. Anyway, if you have a BOM-aware app and a file containing the BOM, automated encoding recognition works, e.g.:
    - Download and open a UTF-16-encoded file with BOM
    - Download and open a UTF-16le-encoded file without BOM
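What a BOM-aware app does can be sketched roughly like this. The function name `detect_bom` is made up; the BOM constants come from Python's standard `codecs` module. Real applications typically combine this check with further heuristics.

```python
import codecs

# Check longer signatures first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
]


def detect_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = codecs.BOM_UTF16_LE + "Grüße".encode("utf-16-le")
without_bom = "Grüße".encode("utf-16-le")

print(detect_bom(with_bom))      # the BOM announces the encoding
print(detect_bom(without_bom))   # nothing to go on: the app has to guess
```

Without the BOM, the two files contain exactly the same text bytes, yet only the first one tells the reader how to interpret them.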
- UTF-8 without BOM is the most portable Unicode encoding.
  Use it in every new file you create.
  - If you're using Mac or Unix, it's just the default.
  - If you're using Windows, make sure your app is set up to save files using UTF-8.
There are various tools for converting files between character sets, e.g.
- With VS Code: click on the label with the current encoding in the bottom action bar (e.g. `UTF-8`) and select `Save with Encoding`. Choose the target encoding from the list.
- Alternatively, use the `iconv` app in the CLI, e.g.:
  `iconv -f CP1252 -t UTF-8 fileInWindows1252Encoding.txt > fileConvertedToUTF8.txt`
  (`-f` = "from code", `-t` = "to code")
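The same conversion can be sketched in Python. This is not how `iconv` is implemented; it only illustrates the underlying idea of decoding with the source encoding and re-encoding with the target one. The function names are made up for this example.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes using the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


def convert_file(src_path: str, dst_path: str, source: str, target: str) -> None:
    """File-based variant, comparable to: iconv -f SOURCE -t TARGET src > dst."""
    with open(src_path, "rb") as src:
        converted = convert_encoding(src.read(), source, target)
    with open(dst_path, "wb") as dst:
        dst.write(converted)


windows_bytes = "Jürgen".encode("cp1252")   # ü is one byte in Windows-1252
utf8_bytes = convert_encoding(windows_bytes, "cp1252", "utf-8")
print(len(windows_bytes), len(utf8_bytes))  # ü takes two bytes in UTF-8
```

As with `iconv`, the conversion is only correct if the stated source encoding is the real one.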
- For historical reasons there are two characters used to denote the end of a line in plain text files: a `Carriage Return` (`\r`) and a `Line Feed` (`\n`). (If you're wondering why, think of how typewriters used to work.)
- Different operating systems use them in a different way:
  - Windows default is `\r\n`
  - Unix/current Mac default is `\n`
  - Legacy MacOS default is `\r`
- Most apps just handle all conventions listed above but it does make a difference for file comparison (e.g. in git)
- There are ways to convert line ending style:
  - In the bottom bar of VS Code, you'll see a label displaying the line endings of the current file. You can toggle between `LF`/`CRLF`.
  - On Windows: with Notepad++ or `dos2unix` and `unix2dos`
    - When you install git on Windows you can choose if the conversion should be performed automatically when you pull/push data from remote repositories.
  - On Mac: install `dos2unix` via Homebrew (`brew install dos2unix`)
  - On Unix: use `dos2unix` and `unix2dos`
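What these tools do can be sketched in a few lines of Python. The function names `to_lf` and `to_crlf` are made up, and the sketch assumes the data is in UTF-8 or a single-byte code page (it would corrupt UTF-16 files).

```python
def to_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) and legacy Mac (CR) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert any of the three conventions to Windows (CRLF) line endings."""
    return to_lf(data).replace(b"\n", b"\r\n")


mixed = b"windows\r\nunix\nold mac\rend"
print(to_lf(mixed))
print(to_crlf(mixed))
```

The order of the replacements in `to_lf` matters: handling `\r\n` first prevents a Windows line ending from being turned into two line breaks.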
- Path separators: `\` (Windows), `/` (Unix and Mac but generally works in Windows as well)
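In code, it is usually safer to let a library deal with the separators than to concatenate strings by hand. A minimal sketch using Python's standard `pathlib` module (the folder and file names are just examples):

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate strings; they never touch the file system,
# so both flavours can be tried on any operating system.
posix_path = PurePosixPath("data") / "raw" / "farben.csv"
windows_path = PureWindowsPath("data") / "raw" / "farben.csv"

print(posix_path)                # joined with /
print(windows_path)              # joined with \
print(windows_path.as_posix())   # the same Windows path written with /

# Windows paths accept both separators when parsed.
print(PureWindowsPath("data/raw/farben.csv") == windows_path)
```

In everyday scripts, `pathlib.Path` picks the right flavour for the operating system the script is running on.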
- Characters allowed in file and folder names
  - Differ between operating systems or even file systems, e.g. `windows:will:not:store:it.txt` is a valid file name under Linux but not under Windows
  - Some characters are allowed but may require special handling in the CLI, e.g. a space.
    - Rename any file so that its name contains a space
    - Try to copy it in the CLI now.
      How to do it properly?
  - To be on the safe side, avoid characters other than letters, digits, a dot, an underscore and a dash.
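The "safe side" rule is easy to check automatically. The sketch below is one possible reading of it: restricting "letters" to unaccented ASCII letters is an extra assumption of this example, and both function names are made up.

```python
import re

# Only letters, digits, dot, underscore and dash (ASCII letters are an
# extra assumption here; the text just says "letters").
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name: str) -> bool:
    """Return True if the name uses only the conservative character set."""
    return SAFE_NAME.fullmatch(name) is not None


def make_safe(name: str) -> str:
    """Hypothetical clean-up: replace every other character with an underscore."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


print(is_safe_name("farben.csv"))
print(is_safe_name("my file.txt"))
print(make_safe("windows:will:not:store:it.txt"))
```

A clean-up like `make_safe` can map different original names to the same result, so it should be used with care when renaming many files at once.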
This topic is far too broad to discuss in detail during an introductory course but it's still worth mentioning that:
- The same information can be stored in different ways, e.g.
- you can prepare a text as a .docx, LaTeX, markdown, TEI-XML or publish it as an HTML webpage,
- you can store your database as .xlsx, .csv, JSON, XML, RDF or using dedicated database software (e.g. a relational database or a triplestore), etc.
- The format you choose impacts how easy it will be to reuse the data (see e.g. 5 Star Open Data).
- Use formats which can be processed with free tools.
- Something to think about - "free" as in "free beer" or as in "freedom of speech"? See e.g. here.
- Think about licensing.
- Take a look at CreativeCommons.
- Honor the licenses of data you are using; in academia you are most likely to violate attribution and "share derived work under the same license" obligations.
- The "free beer" vs "freedom of speech" question applies also here.
- Separate data from presentation and keep your data structured (make it easy to process your data in an automated way).
- Follow your scientific community standards.
- Don't forget about the metadata.
- Deposit outcomes of your work in public repositories (e.g. Zenodo) so others can find and access them.
- Read about:
- Open Data and Linked Data
- Remember that any data you create during your studies or work can be useful for others...
...but sharing it in a reusable way admittedly involves quite some work.