Skip to content

Latest commit

 

History

History
479 lines (327 loc) · 15 KB

strings_and_binaries.livemd

File metadata and controls

479 lines (327 loc) · 15 KB

Strings And Binaries

Mix.install([
  {:jason, "~> 1.4"},
  {:kino, "~> 0.9", override: true},
  {:youtube, github: "brooklinjazz/youtube"},
  {:hidden_cell, github: "brooklinjazz/hidden_cell"}
])

Navigation

Review Questions

Upon completing this lesson, a student should be able to answer the following.

  • Explain codepoints, strings, binaries, and charlists.
  • Why do lists in Elixir sometimes print unnexpected values, and how do you get around it?

Strings And Binaries

Strings are a common data type in most programming languages.

To check if a value is a string, we use the is_binary/1 function. That's because strings in Elixir are represented internally as binaries.

is_binary("hello")

A binary is a sequence of bytes. bytes are a grouping of eight bits.

A bit is a single value represented as 0 or 1. 0 and 1 correspond to on and off for an electrical signal.

Here we see a visual representation of a byte in memory.

Bytes can represent integers with a base2 counting system where each bit represents the next placeholder.

This means the maximum integer a single byte can represent is 255.

128 + 64 + 32 + 16 + 8 + 4 + 2 + 1

By counting the placeholders in each byte we can determine the bytes value. These are binary numbers.

Integers are represented using a base 10 counting system compared to the base 2 counting system for binary numbers. As the value in the rightmost placeholder grows, it shifts to the left when it reaches 10 in base 10 or 2 in base 2.

Here's a table of binary numbers and their corresponding integer values.

Enum.map(1..300, fn integer ->
  binary = Integer.digits(integer, 2) |> Enum.join() |> String.to_integer()
  %{base10: integer, base2: binary}
end)
|> Kino.DataTable.new()

Each byte stores an integer between 1 and 255 in binary. These integers are used to represent characters. For example, the number 65 represents the "A" character. This number is called the codepoint. We can find the codepoint of any character using the ? symbol.

?A

We can also see that a string is actually a series of binary bytes representing a codepoint using IO.inspect/2 with the binaries: :as_binaries option.

IO.inspect("ABCDEFG", binaries: :as_binaries)

Unicode Code Points

To represent strings, each character is given a unique numerical index called a code point. The Unicode Standard contains the complete registry of all possible characters and their numerical index.

We've already seen we can find the code point of a character using ? in Elixir. For example the code point of h is 104.

?h

Here we can see a table of alphabetical characters and their code points. Notice that capital letters and lowercase letters have different code points.

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
|> String.codepoints()
|> Enum.map(fn char ->
  <<code_point::utf8>> = char
  %{character: char, code_point: code_point}
end)
|> Kino.DataTable.new()

To represent strings as binaries, each byte in the binary represents a code point.

For example, the code point for the the character h is 104 or 01101000 in a byte.

64 + 32 + 8

Using byte_size/1 and bit_size/1 We can see that (generally) each character in a string is a single byte or eight bits.

byte_size("h")
bit_size("h")

So a string with five characters would have five bytes or forty bits.

bit_size("hello")
byte_size("hello")

We can use IO.inspect/2 with the :as_binaries option to display a value as its binary representation.

Notice these are the code points for each character.

IO.inspect("hello", binaries: :as_binaries)

?h = 104
?e = 101
?l = 108
?l = 108
?o = 111

Using the codepoints for these characters, we can create a charlist to see the relationship between charlists and binaries.

[104, 101, 108, 108, 111]

We can also create a binary using <<>>. Don't worry about using this syntax. It's uncommon for most real-world applications.

<<104, 101, 108, 108, 111>>

UTF-8

Unicode version 14.0 has a total of 144,697 characters with code points. Since a single byte can only store an integer up to 255 some characters must use multiple bytes.

Elixir uses UTF-8 to encode it's strings, meaning that each code point is encoded as a series of 8-bit bytes.

UTF-8 uses between one and four bytes to store each code point. For example the é character uses two bytes.

byte_size("é")

UTF-8 also uses graphemes. Graphemes may consist of multiple characters perceived as one.

We can see the underlying characters of a grapheme with String.codepoints/1.

For example, the character can be represented as the "e with acute" code point or as the letter "e" followed by a "combining acute accent".

You can see below that while both have the same appearance, they have different code points.

String.codepoints("é")
String.codepoints("é")

When splitting a string into characters for the sake of enumeration, it's important to know if you want to enumerate through the graphemes or the code points.

If you want to enumerate through each grapheme, rather than the underlying code points, then you can use String.graphemes/1.

String.graphemes("é")

Your Turn

Use String.graphemes/1 to convert the woman fire fighter emoji 👩‍🚒 to a list of graphemes.

Use String.codepoints/1 to convert the emoji 👩‍🚒 into a list of codepoints. You'll notice it's actually a combination of the 👩 and 🚒 emojis.

Hexadecimal

Since binary is cumbersome for representing large numbers, we use hexadecimal for a more human friendly representation of binary values.

Hexadecimal is a base 16 counting system.

If base 10 uses ten symbols (0-9) and base 2 uses two symbols (0, 1), then Hexadecimal uses sixteen symbols (0-9 a-f).

Hexadecimal uses decimal symbols from 1 to 9. Then switches to using the letters a to f as you can see in the following table.

"0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f"
|> String.split(",")
|> Enum.with_index()
|> Enum.map(fn {symbol, index} -> %{integer: "#{index}", hexadecimal: symbol} end)
|> Enum.take(-10)
|> Kino.DataTable.new()

As the rightmost placeholder grows, it shifts to the left when it reaches 16.

You may already be familiar with hexadecimal for color codes. For example, the color white is ffffff. This is actually the hexadecimal code for 16777215. you can add the placeholder values to find the hexadecimal value like so:

1_048_576 * 15 + 65536 * 15 + 4096 * 15 + 256 * 15 + 16 * 15 + 15

In Elixir, we can represent Hexadecimal integers using 0x syntax. They are equivalent to their integer counterpart.

15 = 0xF
255 = 0xFF
4090 = 0xFFA

We can also represent the code point for a string using hexadecimal with \u syntax.

For example, "\u0065" is the character "e"

"\u0065"

The codepoint for e is 101

?e

And the hexadecimal value 0x65 represents 101. So, \u0065 corresponds to the code point for "e".

0x65 = ?e

Which is why "\u0065" is the character "e".

"\u0065" = "e"

Your Turn

Use hexadecimal 0x syntax to represent the number 15.

Example Solution
0xf

Bitstrings

The codepoints for characters are represented with a UTF-8 encoded series of bytes. These bytes are stored in a bitstring.

A bitstring is a contiguous sequence of bits in memory. A binary is a bitstring where the number of bits is divisible by 8.

The <<>> syntax defines a new bitstring.

Strings in Elixir are actually bitstrings. For example, when we create a bitstring of codepoints matching "hello" we create the "hello" string.

# H E L L O
<<104, 101, 108, 108, 111>>

It's fairly rare that you'll need to work directly with bitstrings for most applications. If you'd like to learn more, you can reference the Elixir Documentation.

Charlists

You may have noticed that occasionally lists of integers in Elixir will display unexpected characters.

[97, 98, 99]

That's because they are interpreted as a charlist. A charlist is a list of integers that are all valid code points. If any integer in a list is not a valid codepoint, the list will not be interpreted as a charlist.

# A B C Invalid
[97, 98, 99, 1]

You can define a charlist using single quotes ''.

'hello'

This charlist is equivalent to a list of the code points for a string.

[?h, ?e, ?l, ?l, ?o]

Charlists are a list of integers, so they can be enumerated through and will return a new charlist. Here's how we could shift each character up one code point.

Enum.map('abcde', fn each -> each + 1 end)

You can convert a charlist back into a string with List.to_string/1.

List.to_string('hello')

Or convert a string into a charlist with String.to_charlist/1.

String.to_charlist("hello")

Like bitstrings, you will likely not need to use charlists frequently, however they can be useful for certain string manipulation or when working with certain libraries.

If we want to inspect a list that contains codepoints, we can use IO.inspect/2 with the charlists: :as_lists option.

IO.inspect([65, 66, 67])
IO.inspect([65, 66, 67], charlists: :as_lists)

Your Turn

Convert the alphabet "abcdefghijklmnopqrstuvwxyz" to a charlist, then inspect it to see the codepoints for each letter.

Example Solution
"abcdefghijklmnopqrstuvwxyz"
|> String.to_charlist()
|> IO.inspect(charlists: :as_lists)

Further Reading

Consider the following resource(s) to deepen your understanding of the topic.

Commit Your Progress

DockYard Academy now recommends you use the latest Release rather than forking or cloning our repository.

Run git status to ensure there are no undesirable changes. Then run the following in your command line from the curriculum folder to commit your progress.

$ git add .
$ git commit -m "finish Strings And Binaries reading"
$ git push

We're proud to offer our open-source curriculum free of charge for anyone to learn from at their own pace.

We also offer a paid course where you can learn from an instructor alongside a cohort of your peers. We will accept applications for the June-August 2023 cohort soon.

Navigation