update
jgordini committed Aug 11, 2024
1 parent adb3f12 commit bf3bf3b
Showing 2 changed files with 92 additions and 13 deletions.
Binary file modified .DS_Store
Binary file not shown.
105 changes: 92 additions & 13 deletions intro.md
@@ -194,28 +194,107 @@ We use a complex parsing strategy:

This extracts the board numbers and parses the coordinate data separately.

## Transcript

Hi, I'd like to show you how to use Dyalog APL to parse the contents of text files that contain data in unusual formats. First off, we're going to have a regular comma-separated file. Luckily, Dyalog APL has quad CSV (`⎕CSV`), which can easily do this job. The first element of the argument is the file name or the content itself. Next is a specification of what type of content it is, but that can usually be inferred, so we can give it an empty vector. Finally, there's a code for conversion. This is because comma-separated files do not distinguish between numbers and text. However, code number four uses a heuristic: if something looks like a number, then we'll convert it to a number; otherwise, we'll just leave it as text. And here we go, here's our numeric matrix.
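
A minimal sketch of the call just described, assuming the data is held in memory rather than in a file. The sample rows are made up, and the `'N'` description (source is a nested vector of lines) is used here in place of the empty vector that the transcript passes for a file:

```apl
⍝ Hypothetical sample data: two CSV lines held in memory instead of a file
lines ← '1,2,3' '4,5,6'
⍝ Argument: source, description ('N' = nested vector of character vectors), conversion code 4
m ← ⎕CSV lines 'N' 4
⍴m        ⍝ shape of the resulting matrix: one row per line, one column per field
+/m       ⍝ row sums work because code 4 converted number-like fields to numbers
```

For a list of numbers with one value per line, the same call yields a one-column matrix, and `,m` ravels it into a vector.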

Next up is something that's a list of numbers, and you might not think of this as a comma-separated file, but you can actually parse it as such, and you will just have a single column. In order to get a vector that we can work with, we simply ravel it.
# Justifying Text in APL

Next, we have a bitmap, or a matrix of bits represented as the characters zero and one, and this doesn't look like a comma-separated file at all either. There are no separators. However, quad CSV has a variant option where, instead of using a separator, you can use widths. We have eight columns, each one has width one, and here's our boolean matrix.
APL provides powerful array-oriented capabilities that make text justification concise and efficient. Let's explore how to implement text justification using APL.

Of course, it's not much fun to have to count how many columns there are, so we can use quad NGET (`⎕NGET`) to read in the file first and do a little bit of analysis on it so we know how many columns there are. We ask for a vector of character vectors with this one flag. This gives us a nested result. Getting just the first one allows us to count how many there are. But since we want the widths as a vector of ones, we can actually just do a constant one for each one. And now we can take this and use it as our widths, and then we just take the data as before, wrap the entire thing in the dyad variant on the widths, and now it's all automatic.
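
The fixed-width bitmap parsing from the transcript can be sketched as follows. This assumes the lines are already in memory (as `⎕NGET` with the flag `1` would return them); the sample rows and the names `bits`, `widths`, and `b` are hypothetical:

```apl
⍝ Hypothetical sample: three rows of an 8-bit bitmap as character vectors
bits ← '01100110' '10011001' '11110000'
⍝ A constant 1 for each character of the first row gives the field widths
widths ← 1⊣¨⊃bits
⍝ Split every row into fields of those widths; code 4 converts '0' and '1' to numbers
b ← (⎕CSV⍠'Widths' widths) bits 'N' 4
⍴b        ⍝ one row per line, one column per bit
```

Deriving `widths` from the data means the same expression works for bitmaps of any width.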
## Basic Approach

Next, we've got something that decidedly doesn't look like a separator-separated file. However, we can notice that the labels on the left have a consistent width, and the single character on the right is, well, just a single character. So we can actually split this by widths as well. First, we have three characters for the label, then four characters for the space equal equal space, and then a single character. We don't actually need to specify the conversion to numbers because there are no numbers here. This gives us an extra column in the middle that we don't want, because we just want the labels and the characters so we can start processing them. We can use a horizontal compress to remove the middle column, and then we can take it from there.
The core idea is to create an integer matrix representing how many copies of each character we want in the final justified text. For example:

We can do something very similar to this file even though it looks like there are multiple levels of nesting. We can start off with the width: one column for the single digit, one column for the semicolon separator, one column for the second digit that's there, then four columns for space two space, and then we have the same kind of pattern again. This is seven columns, and we only want every other one, so we do a cyclic reshape of one zero for our mask and do the horizontal compress, and there we go. If you want to make pairs 2 and 2, then we can use partition for this: begin a new partition, continue the same partition, begin a new partition, continue the same partition. Now we've got two matrices: the beginning points and the end points. We could also apply this rank one in order to get a matrix of beginning pairs and end pairs and take it from there.
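
A small sketch of the masking and pairing steps, using only primitives. The `cells` matrix stands in for the seven-field result of the width-based split; its contents, including the `' -> '` separator, are assumptions made for illustration:

```apl
⍝ Hypothetical 2-row result of a fixed-width split: 7 fields per row
cells ← 2 7⍴1 ',' 2 ' -> ' 3 ',' 4 5 ',' 6 ' -> ' 7 ',' 8
mask ← 7⍴1 0                   ⍝ cyclic reshape: 1 0 1 0 1 0 1
nums ← mask/cells              ⍝ horizontal compress keeps every other column
(begin end) ← 1 0 1 0⊂nums     ⍝ new partition, continue, new partition, continue
pairs ← 1 0 1 0∘⊂⍤1⊢nums       ⍝ applied per row: a matrix of begin and end pairs
```

Here `begin` and `end` are two-column matrices, while `pairs` keeps one row per input line with two nested pairs in each.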
```apl
chars ← 'the question: whether ''tis    '
ints ← 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 0 0 0 0
```

Here, `3` and `2` indicate expanded spaces, while `0` removes trailing spaces.
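
As a quick check of this idea, replicate (`/`) applies the counts directly. This sketch assumes a line width of 30, so the text is padded with four trailing spaces to match the 30 counts:

```apl
chars ← 30↑'the question: whether ''tis'   ⍝ pad with spaces to the assumed width of 30
ints ← 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 0 0 0 0
(≢chars)=≢ints            ⍝ 1: one count per character
justified ← ints/chars    ⍝ replicate each character by its count
≢justified                ⍝ 30: same width, with the trailing spaces moved into the gaps
```

The counts sum to the line width, which is why the justified line is exactly as long as the padded original.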

## Key Steps

1. Identify spaces and trailing spaces:
```apl
spaces ← ' '=text
trailing ← ⌽∧\⌽spaces
```

2. Determine characters to keep:
```apl
keep ← ~trailing
```

3. Find inner spaces to expand:
```apl
inner ← keep∧spaces
```

4. Calculate space distribution:
```apl
trail ← +/~keep ⍝ Trailing spaces per line
spaces ← +/inner ⍝ Inner spaces per line
add ← inner(×⍤1 0)⌊trail÷spaces
```

This might not look like a data file at all, but if you count, you'll see that every row has exactly four words. And so, instead of being comma-separated, we can call this space-separated. So we set the separator to a space, and we got ourselves a matrix of words. Here are some keywords and values, again separated by a space. Because we have the four for the conversion code so that things that look like numbers are treated like numbers, this just works.
5. Handle extra spaces:
```apl
extra ← inner × (+\inner)(≤⍤1 0)spaces|trail
```

6. Combine and apply:
```apl
result ← (⍴text)⍴(,keep+add+extra)/,text
```
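
The steps above can be traced on a single line. This is a sketch assuming a one-row matrix of width 30; the count of inner spaces is named `gaps` here (a name not used in the original) so that the boolean mask `spaces` is not overwritten:

```apl
text ← 1 30⍴30↑'the question: whether ''tis'   ⍝ one-row matrix, assumed width 30
spaces ← ' '=text
trailing ← ⌽∧\⌽spaces
keep ← ~trailing
inner ← keep∧spaces
trail ← +/~keep                  ⍝ 4 trailing spaces to redistribute
gaps ← +/inner                   ⍝ 3 inner gaps
add ← inner(×⍤1 0)⌊trail÷gaps    ⍝ ⌊4÷3 is 1: one more space in every gap
extra ← inner×(+\inner)(≤⍤1 0)gaps|trail   ⍝ remainder 3|4 is 1: the first gap gets it
counts ← ,keep+add+extra         ⍝ 3 for the first gap, 2 for the others, 0 for trailing
result ← (⍴text)⍴counts/,text
```

The `counts` vector matches the `ints` example from the Basic Approach section.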

## Complete Function

Here's a concise APL function that justifies text:

```apl
Just ← {
    keep ← ~⌽∧\⌽' '=⍵
    inner ← keep∧' '=⍵
    trail ← +/~keep
    spaces ← +/inner
    add ← inner(×⍤1 0)⌊trail÷spaces
    extra ← inner×(+\inner)(≤⍤1 0)spaces|trail
    (⍴⍵)⍴(,keep+add+extra)/,⍵
}
```

## Handling Edge Cases

To handle blank lines and short lines, we can create a more robust function:

Okay, this one is a bit tougher because we want to separate out the tables, which are separated by empty lines. Let's start off by figuring out the width of the columns. We could look at this as the first column of numbers taking two characters and the following ones using three characters each: two followed by four that are three. Now we can split it on rows that begin with empty character vectors. So how do we find the empty character vectors? Let's take the left column and the tally of each: the numbers have tally one and the empty character vectors have tally zero. We can use this to partition the split; because partition likes to take a vector, we split the matrix into individual rows and then we partition like that. And here we have a vector of vectors of vectors. Mix that twice, and we've got a rank three array of numbers.
```apl
BetterJust ← {
    s ← ' '=⍵
    t ← ⌽∧\⌽s
    fewWs ← 0=+/s-t
    shortL ← (+/t)>0.25×⊃⌽⍴⍵
    use ← ~fewWs∨shortL
    result ← ⍵
    (use⌿result) ← Just use⌿⍵
    result
}
```

Another possibility is to compute the exact widths. Here we were lucky that it was easy to spot where the column splits were, but they might have had different widths. So how can we detect where the column separations are? Let's begin by getting in the data, and we give a 1 to quad NGET to say we want a vector of vectors. If we mix this, we get a character matrix. Now we can compare with space, and this gives us a boolean matrix marking where the spaces are. An and-reduction will tell us which columns are entirely made of spaces. These are the beginning points for every column when we're going to cut it up, except the first column doesn't begin with a space, so we'll change that into a 1 at position 1. These are our beginning marks. The only thing we want to know is how wide these columns are, so we can use this to self-partition and then get the length of each. There we go, and then we proceed like we did before to split it by the lines that begin with an empty vector. Oops, that should have been a split. And finally, the mix.
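
The width detection described in the transcript can be sketched with primitives only. The two sample rows and the names `starts` and `widths` are hypothetical, and the default index origin (`⎕IO←1`) is assumed:

```apl
lines ← ' 1  2  3' '10 20 30'    ⍝ hypothetical right-aligned table rows
m ← ↑lines                       ⍝ mix into a character matrix
starts ← (1@1)∧⌿m=' '            ⍝ all-space columns begin a field; force a 1 at position 1
widths ← ≢¨starts⊂starts         ⍝ self-partition, then the length of each part
```

For these rows `widths` is `2 3 3`, which could then be passed as the `Widths` variant option.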
This function leaves three kinds of lines unchanged: blank lines, lines with only one word, and lines that are too short (more than 25% of the width is trailing spaces, that is, less than 75% filled).
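
To see which rows `BetterJust` would pass on to `Just`, the selection mask can be computed on its own. This sketch repeats the first lines of the function body outside the dfn, using the sample text from the usage example in this document:

```apl
text ← ↑'HAMLET' '' 'To APL, or not to APL, that is' 'the question: whether ''tis' 'nobler in the mind to suffer' 'the slings and arrows of the' 'outrageous fortune...'
s ← ' '=text
t ← ⌽∧\⌽s                     ⍝ trailing spaces
fewWs ← 0=+/s-t               ⍝ rows with no inner spaces: blank or single-word
shortL ← (+/t)>0.25×⊃⌽⍴text   ⍝ rows with more than a quarter of the width trailing
use ← ~fewWs∨shortL           ⍝ 0 0 1 1 1 1 0
```

The title, the blank row, and the short last line are excluded; the third row is selected but is already full width, so justification leaves it as it is.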

Okay, this is a tough one because not only do we have the separations with empty lines, we also have these headings, and we want to extract the number from the headings. We're going to start by getting in the content of the file as a vector of vectors. Now we're going to split it on empties. We can get the length of each, and we can get whether or not the length is non-zero by taking the sign of that. We can use this to partition what we've got. So now we have each board with its heading in a separate element. This means we can write a little utility function and apply it to each to process it. For each board and its heading, we want to separate it into two elements: one is the heading itself, and one is the data from the board. So the heading is the first element, and the board is the remaining elements.
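
A rough sketch of this splitting, under stated assumptions: the file content is already a vector of character vectors, headings have the made-up form `Board n`, and the helper names `Split` and `Num` are hypothetical. The number extraction uses verify and fix input (`⎕VFI`) and the board parsing uses `⎕CSV`, as the transcript goes on to describe:

```apl
⍝ Hypothetical file content: a heading, CSV rows, and an empty line between boards
v ← 'Board 1' '1,2' '3,4' '' 'Board 2' '5,6' '7,8'
groups ← (×≢¨v)⊆v                ⍝ partition on the empty lines
Split ← {(⊃⍵)(1↓⍵)}              ⍝ heading first, remaining rows second
Num ← {⌈/⊃⌽⎕VFI ⍵}               ⍝ largest value from ⎕VFI; non-numeric tokens count as 0
parsed ← {(h b)←Split ⍵ ⋄ (Num h)(⎕CSV b 'N' 4)}¨groups
```

Each element of `parsed` pairs a board number with its numeric matrix. Taking the maximum in `Num` relies on there being exactly one non-negative number in the heading.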
## Example Usage

Now we just need to parse each part. The board content itself is exactly the content of a comma-separated file, so we can use quad CSV on that. Then we need to get the number out of the board heading, and for that we have verify and fix input (`⎕VFI`). It gives us a two-element result: the first element is a check, for each token in that character vector, of whether it formed a correctly formed number or not, and the second element is the values, with zeros put in at places where we could not evaluate to a number. All this is safe; we're not executing anything. Now we know there's only one number, so we don't need to actually check. We can just take the second element, and since we know that there will only be one number and the rest are zeros, we can just take the largest one from that, and we're done. If you want, we can stack these on top of each other.
```apl
text ← ↑'HAMLET' '' 'To APL, or not to APL, that is' 'the question: whether ''tis' 'nobler in the mind to suffer' 'the slings and arrows of the' 'outrageous fortune...'
BetterJust text
⍝ Result:
⍝ HAMLET
⍝
⍝ To APL, or not to APL, that is
⍝ the   question:  whether  'tis
⍝ nobler  in  the mind to suffer
⍝ the  slings  and arrows of the
⍝ outrageous fortune...
```

I'll take all the codes and put them into the description of this video. Thank you for watching.
This implementation shows how a handful of array primitives (scans, reductions, and replicate) can justify a whole block of text at once. The `BetterJust` function applies `Just` only to lines that can sensibly be justified and leaves blank, single-word, and short lines unchanged.
