User data support #911
Conversation
Haven't tested this yet, but will shortly! Some questions:
```rust
impl Instance {
    pub fn generate_cidata(&self) -> Result<Vec<u8>, Error> {
        // cloud-init meta-data is YAML, but YAML is a strict superset of JSON.
```
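(That superset trick means meta-data can be emitted with a JSON serializer and still parse as YAML. A minimal sketch of the idea, with key names from cloud-init's NoCloud datasource and a made-up function shape — this is not the actual `generate_cidata` body:)

```rust
use serde_json::json;

// JSON output is valid YAML 1.2, so cloud-init's YAML parser accepts it.
fn meta_data_sketch(instance_id: &str, hostname: &str) -> String {
    json!({
        "instance-id": instance_id,
        "local-hostname": hostname,
    })
    .to_string()
}
```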
🙏
nexus/src/cidata.rs (Outdated)

```rust
// this was reverse engineered by making the numbers go lower until the
// code failed (usually because of low disk space). this only works for
// FAT12 filesystems
let sectors = 42.max(file_sectors + 35 + ((file_sectors + 1) / 341 * 2));
```
what do the numbers in this calculation represent? my FAT*-filesystem-ignorant read is "the lowest number of sectors we support is 42, and otherwise we have to compute enough space for FAT* formatting/metadata plus our file data", but that implies that FAT* needs 35 sectors for something?
I actually have no idea, because the only reference I could easily find is the Wikipedia article "Design of the FAT file system", which I repeatedly tried to read and my eyes kept falling off the page. I wrote some code using fatfs to find out what the smallest numbers are that work with it.
This more or less spells out to:
- No device with fewer than 42 sectors can be formatted as FAT12
- Given the total number of sectors taken up by your files, 35 more are overhead
- ...plus an additional 2 sectors of overhead based on some linear relationship that has numbers I don't understand in it (why is it 341???)
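(Spelled out as an annotated helper — same expression as the diff; the constants' roles are the guesses above, which get pinned down later in this thread:)

```rust
/// Empirically-derived FAT12 image sizing (constants reverse engineered).
fn cidata_sectors(file_sectors: usize) -> usize {
    // - 42: the smallest sector count fatfs will format at all
    // - 35: fixed overhead when one sector per FAT suffices
    // - each FAT sector holds 341 12-bit entries, and there are 2 FATs,
    //   so every extra 341 clusters costs 2 more sectors
    42.max(file_sectors + 35 + ((file_sectors + 1) / 341 * 2))
}
```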
I recommend https://www.win.tue.nl/~aeb/linux/fs/fat/fatgen103.pdf as a reference - it is the FAT spec.
Page 14 has a section on FAT type determination:
> The FAT type—one of FAT12, FAT16, or FAT32—is determined by the count of clusters on the volume and nothing else.
Since it seems we're setting `bytes_per_cluster` to 512, and the default `bytes_per_sector` is also 512, the choice of FAT filesystem type seems dependent on "how many clusters we need", which seems dependent on the total input data size.
It also mentions the calculation:
```c
// Calculate the Root Directory Sectors
RootDirSectors = ((BPB_RootEntCnt * 32) + (BPB_BytsPerSec - 1)) / BPB_BytsPerSec;

// Calculate the sectors in the data region
If (BPB_FATSz16 != 0)
    FATSz = BPB_FATSz16;
Else
    FATSz = BPB_FATSz32;
If (BPB_TotSec16 != 0)
    TotSec = BPB_TotSec16;
Else
    TotSec = BPB_TotSec32;
DataSec = TotSec - (BPB_ResvdSecCnt + (BPB_NumFATs * FATSz) + RootDirSectors);

// Calculate the cluster count
CountofClusters = DataSec / BPB_SecPerClus;

// Figure out FAT type
If (CountofClusters < 4085) {
    /* Volume is FAT12 */
} else if (CountofClusters < 65525) {
    /* Volume is FAT16 */
} else {
    /* Volume is FAT32 */
}
```
I kinda despise this calculation, because to solve the equation for "am I using a FAT12 filesystem", you must first provide input that knows if you are using FAT12/FAT16/FAT32. For example, `BPB_RootEntCnt` is zero on FAT32, but is not on FAT12/FAT16.

(This is literally what the fatfs crate does - guess, then check.)
They even acknowledge this is kinda fucked up on page 19, under "FAT Volume Initialization":
> At this point, the careful reader should have one very interesting question. Given that the FAT type (FAT12, FAT16, or FAT32) is dependant on the number of clusters—and that the sectors available in the data area of a FAT volume is dependant on the size of the FAT—when handed an unformatted volume that does not yet have a BPB, how do you determine all this and compute the proper values to put in BPB_SecPerClus and either BPB_FATSz16 or BPB_FATSz32? The way Microsoft operating systems do this is with a fixed value, several tables, and a clever piece of arithmetic.
>
> Microsoft operating systems only do FAT12 on floppy disks. Because there is a limited number of floppy formats that all have a fixed size, this is done with a simple table: "If it is a floppy of this type, then the BPB looks like this."
>
> There is no dynamic computation for FAT12. For the FAT12 formats, all the computation for BPB_SecPerClus and BPB_FATSz16 was worked out by hand on a piece of paper and recorded in the table (being careful of course that the resultant cluster count was always less than 4085). If your media is larger than 4 MB, do not bother with FAT12. Use smaller BPB_SecPerClus values so that the volume will be FAT16.
So, anyway, "worked out by hand on a piece of paper" aside, we can make some assumptions and try calculating this for a limited size:
- If we assume we're using FAT12 (which we can validate later),
- and we assume sector and cluster size are both 512,
- and we assume `max_root_dir_entries` is 512,
- and we assume `BPB_ResvdSecCnt = 1`,
- and we assume "# of FATs = 2" (we could change this, but that's documented as the default of the Rust crate you're using):

```c
RootDirSectors = ((512 * 32) + (512 - 1)) / 512 = 32;
DataSec = TotSec - (1 + 2 * FATSz + 32)
        = TotSec - (33 + 2 * FATSz);
// With one sector per cluster, the filesystem stays FAT12 if DataSec < 4085
```
So, anyway, how do we calculate the size of the actual "FAT"s on the FS? With a "simple" calculation (see `determine_sectors_per_fat` in fatfs). According to page 16 of the FAT spec, this is "more complicated" for FAT12 because "there are 1.5 bytes (12-bits) per FAT entry".
So, recalling our calculation earlier, `DataSec = TotSec - (33 + 2 * FATSz)`:

- If FATSz = 1 => 512 bytes => 4096 bits => can represent 341 clusters (4096 / 12), or ~170 KiB worth of clusters
- If FATSz = 2 => 1024 bytes => 8192 bits => can represent 682 clusters, or ~341 KiB worth of clusters
So:

- `TotalSec = 35 + DataSec`, if we're storing < 170 KiB of clusters
- `TotalSec = 37 + DataSec`, if we're storing < 341 KiB of clusters
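As a quick check of this derivation against the reverse-engineered expression in the diff (example numbers mine): with `file_sectors = 400` (~200 KiB of files), one FAT sector only covers 341 clusters, so FATSz = 2 and the overhead is 37 sectors; the diff's `400 + 35 + ((400 + 1) / 341) * 2 = 400 + 35 + 2 = 437` agrees with `37 + 400`.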
TL;DR: If we think it's okay to put a bound on user_data + meta_data, my inclination would be:

- Calculate `file_sectors = user_data.len().div_ceil(512) + meta_data.len().div_ceil(512)`
- Return an error if `file_sectors > 512` (implying a combined sum of 256 KiB)
- Expect that 37 sectors will be reserved for overhead
- Use a disk size of `37 + file_sectors`
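(A sketch of that sizing rule as a hypothetical helper — not the code in this PR; `div_ceil` on unsigned integers is stable since Rust 1.73, or can be written out by hand:)

```rust
fn cidata_disk_sectors(user_data: &[u8], meta_data: &[u8]) -> Result<usize, String> {
    let file_sectors =
        user_data.len().div_ceil(512) + meta_data.len().div_ceil(512);
    if file_sectors > 512 {
        // 512 file sectors == 256 KiB of combined (unencoded) data
        return Err("user_data + meta_data exceeds 256 KiB".into());
    }
    // 37 sectors of overhead: 1 reserved + 2 FATs x 2 sectors + 32 root-dir
    Ok(37 + file_sectors)
}
```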
Though reading your comment earlier, it seems like you experimentally noticed the minimum FS size was 42, no? Any idea where that's coming from?
No idea where the 42 is coming from, no. But if you try to give fatfs fewer than 42 sectors, it gives you an "unfortunate disk size" error.
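(For the record, that failure is easy to reproduce with something like the following — my reconstruction, not the original probe code:)

```rust
use std::io::Cursor;
use fatfs::{format_volume, FormatVolumeOptions};

fn main() {
    // Per the observation above, 41 sectors should error and 42 succeed.
    for sectors in [41usize, 42] {
        let mut disk = Cursor::new(vec![0u8; sectors * 512]);
        let result = format_volume(&mut disk, FormatVolumeOptions::new());
        println!("{sectors} sectors: {result:?}");
    }
}
```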
beware, YAML carries a terrible curse, but we can use JSON in a FAT filesystem
FAT is also cursed but at least we can use it for cloud-init data
normalize_pubkey_data is also cursed
yooooo
the new curse is that YAML 1.2 is the version of the spec that made it a strict superset, and PyYAML (what cloud-init ultimately uses) does not yet have YAML 1.2 support. this could get interesting. edit: nvm, we're probably fine:
```
[ 10.766464] cloud-init[578]: Cloud-init v. 20.4.1 finished at Thu, 14 Apr 2022 22:09:45 +0000. Datasource DataSourceNoCloud [seed=/dev/vdb][dsmode=net]. Up 10.76 seconds
```

🎉
I've verified this works on a Debian 11 image with the following user data:

```yaml
#cloud-config
system_info:
  default_user:
    name: iliana
    plain_text_passwd: "this is my password"
    lock_passwd: false
```

which correctly set the username and password of the default account, allowing me to log in over the serial console.
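(In the API request itself, that blob travels base64-encoded, per the `user_data` schema quoted below. A sketch using the base64 crate's standard alphabet, which matches the schema's "RFC 4648 § 4, + and / with padding" wording; a 0.21-style API is assumed:)

```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};

// RFC 4648 § 4 alphabet (+ and /) with padding, as the schema requires.
fn encode_user_data(cloud_config: &str) -> String {
    STANDARD.encode(cloud_config.as_bytes())
}
```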
```json
},
"user_data": {
  "description": "User data for instance initialization systems (such as cloud-init). Must be a Base64-encoded string, as specified in RFC 4648 § 4 (+ and / characters with padding). Maximum 32 KiB unencoded data.",
  "type": "string"
```
Can we leave this field out of the request and everything still works the same? I see that it's not in the `required` list. I also don't see the logic that makes that work, but it might be serde magic or dropshot magic or who knows what.
Yep. There's a `#[serde(default)]` here, so it defaults to an empty string if no data is provided.

(With this change in place there will always be a cloud-init data device attached to an instance with at least the metadata, which is hostname/instance ID/SSH keys.)
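(A minimal sketch of that mechanism — field and struct shapes assumed, not the actual omicron definitions, which decode base64 into bytes:)

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct InstanceCreateSketch {
    // Absent from the request body => String::default(), i.e. ""
    #[serde(default)]
    user_data: String,
}

fn main() {
    let params: InstanceCreateSketch = serde_json::from_str("{}").unwrap();
    assert_eq!(params.user_data, "");
}
```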
```rust
let mut disk = Cursor::new(vec![0; sectors * 512]);
fatfs::format_volume(
    &mut disk,
    FormatVolumeOptions::new()
```
We can explicitly specify that this needs to be FAT12 using `.fat_type(FatType::Fat12)`, if that's necessary to assert our comment above. However, I'm not sure it's safe to assume we'll be using FAT12 - I think that's dependent on the size of userdata + metadata.
Edit: if we think it's okay to impose an upper size bound, then setting the FAT type would be totally reasonable here.
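(Something like this, say — a sketch against fatfs's FormatVolumeOptions; the label and error handling are placeholders:)

```rust
use std::io::Cursor;
use fatfs::{format_volume, FatType, FormatVolumeOptions};

fn format_cidata(sectors: usize) -> std::io::Result<Vec<u8>> {
    let mut disk = Cursor::new(vec![0u8; sectors * 512]);
    format_volume(
        &mut disk,
        FormatVolumeOptions::new()
            // Fail instead of silently becoming FAT16 if the volume
            // outgrows FAT12's cluster limit.
            .fat_type(FatType::Fat12)
            .volume_label(*b"CIDATA     "), // 11 bytes, space-padded
    )?;
    Ok(disk.into_inner())
}
```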
This is a draft because I haven't run this outside of integration tests yet, but in theory it works! (Edit: It works! Please go ahead and review.)

cloud-init is the usual instance initialization system for cloud images of common Linux distributions. It is a huge ball of Python code that runs at each boot and helps set things up in an expected way. There are two main pieces of data cloud-init looks for on boot: "meta-data", provided by the cloud provider, and "user-data", provided by whoever launched the actual instance.
This PR:

- adds `user_data` to the `InstanceCreate` struct, a base64-encoded string
- generates a `CIDATA`-labeled vfat volume, as specified in the cloud-init docs, with the meta-data and user-data

I considered whether to add a `user_data` field to the `Instance` struct in omicron_common, but chose against it. I think we want to come up with an alternate means of making user-data accessible from the API rather than add a potentially-long string to the response object for all instance-related actions.

Existing tests that launch instances with the simulated sled-agent seem to exercise these changes, but recommendations for additional tests are welcome; I suspect that the full end-to-end "the user-data matches on the actual instance" test will require having a CI system that runs actual VMs.