
Split units and values in type definitions #121

Open
romainmenke opened this issue Mar 25, 2022 · 33 comments

@romainmenke
Contributor

romainmenke commented Mar 25, 2022

https://design-tokens.github.io/community-group/format/#duration

This assumes that all consumers agree on the unit for a given value type, or it forces implementers to do error-prone parsing before passing on transformed values.

A better sub type might be a general Dimension type

{
  "Duration-100": {
    "$value": {
        "number": "100",
        "unit": "ms"
    },
    "$type": "duration"
  }
}

This reduces the number of micro syntaxes in the spec and makes it easier to implement.

$value: "100ms"

becomes

$value: { "number": "100", "unit": "ms" }
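
As a minimal sketch of what the split form buys a consumer (my own illustration, not part of the proposal; it keeps the quoted "100" from the example above, though a bare JSON number would work the same way):

function readDuration(token) {
  // No string parsing: the pieces are already separate JSON fields.
  const { number, unit } = token.$value;
  return { number, unit };
}

readDuration({
  "$value": { "number": "100", "unit": "ms" },
  "$type": "duration"
});
// { number: "100", unit: "ms" }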

@drwpow
Contributor

drwpow commented Mar 31, 2022

Parsing numbers and units from a string is trivial and performant in every programming language. I personally quite like the readability of having one string represent amount + unit, and it doesn’t impact tooling. So I think this hurts readability without adding any benefit (I don’t think this is easier to implement).

Further, for duration, there’s only one unit (considering ms and s are just divisions of the same unit), so expanding it just seems to be verbose.

A better sub type might be a general Dimension type

In a design system, time and space are distinct concepts. Numbers can quantify both, but since they have no direct relationship I don’t think there’s a point in tokenizing the number (e.g. 50ms and 50px have no relationship).

@TravisSpomer

Agreed. I think that what the spec defines is our best option right now, unless perhaps we could agree on something like "all durations shall be in milliseconds forever so let's specify durations as just a number." But since the spec already allows both px and rem for dimensions, I think that durations should probably be strings that specify units too.

@jonathantneal

I want to drop a little “it depends” to this conversation, but more for consideration in future issues.

Given the current character restrictions, I agree we should be able to reliably parse units from numbers in the JSON format.

Parsing numbers and units from a string is trivial and performant in every programming language.

I only want to highlight one exception, specific to this Design Tokens Format Module. These character restrictions do not pair well outside of quoted strings in non-JSON file formats like CSS.

Specifically, the reservation of U+002E — the FULL STOP or ‘period’ (.) character — may potentially introduce non-trivial parsing issues. This is because the period character also begins fractional numbers.

As niche or hypothetical as this sounds, I believe it has already surfaced in the wild. We can see the parsing conflict in this video for a new and very impressive library that implements the Design Tokens Format Module in CSS (at 00:43:31). In that clip, there is a red.6 token. In CSS, this is to be parsed as red and then .6. Extrapolating, a token like red.6a50 could represent a naming pattern for alpha transparency, or it could represent red, then .6a, and finally 50. This means DTFM-style tokens, if made declarable in CSS, will make parsing numbers and units non-trivial, unless quoted strings are also required. That would be my suggestion, anyway.
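
A quick sketch of why the period is troublesome (my own illustration, assuming JavaScript; not from the video): any numeric scanner that looks at the remainder after red treats the period as the start of a fraction.

// After a CSS-style tokenizer consumes the ident "red" from "red.6a50",
// the remainder is ".6a50", and the period begins a fractional number:
parseFloat('.6a50'); // 0.6
// So ".6" reads as a number and the trailing letters read like a unit,
// rather than ".6a50" being part of a token name.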

I think this is worth pointing out as folks experiment with DTFM outside of JSON. These are complex problems. And I have big time admiration for everyone I’ve just replied to or invoked. In fact, in the same clip (at 00:50:44), the same folks educate viewers on these same kinds of parsing issues. Y’all are good people who know your stuff. 🙂

@TravisSpomer

TravisSpomer commented Apr 1, 2022

Sorry, I don't follow how that really applies here. This spec isn't about defining tokens in CSS. If someone, like the people in that video, wants to invent their own method of specifying tokens in some CSS-like syntax, they're welcome to. But that would be a different format, not the one in this spec (albeit perhaps one meant to interoperate with this one 1:1)! And the author of that format could solve that problem however they wish*.

This issue is saying "hey, parsing strings to get numbers is dumb; we should just separate the data in the format itself instead of parsing a string." "50" is a number and "px" is a string for a unit. But there's no situation where there's any confusion or ambiguity between parsing a token name and a value.

If in the future there were some syntax where red.6a50 was a valid value, there still wouldn't be any ambiguity, because references to token names are in braces as in {red.6a50}.

(*If I were inventing such a format, I would probably follow the pattern that var() established: so prop: red.6a50 would just be the value "red.6a50", and prop: token(red.6a50) would be a reference to the value in the token named red.6a50.)

@romainmenke
Contributor Author

This issue is saying "hey, parsing strings to get numbers is dumb; we should just separate the data in the format itself instead of parsing a string."

No one said anything like this or used a tone like this.
Please do better!


The main issue I see with this specification proposal is that it is focussed on creating an interface between multiple programs but too many provisions are made for reading and writing by humans.

This does have a short term advantage which I fully understand.
But as more and better tools are created to interact with token files the need for this will largely go away.

Parsing numbers and units from a string is trivial and performant in every programming language.

This might be true in practice but this specification will still need to define exactly how values must be parsed. Any specification which has micro syntax does this to avoid interop issues.

This places extra burden on editors of the spec and implementers.
This can not be debated away. It is work that has to be done.

Avoiding micro syntaxes altogether and choosing to create a data format without any possible ambiguity is possible.

I am only advocating for making the choice in favour of something that works best as an interface between multiple programs.

@romainmenke changed the title from "Remove units from type definitions" to "Split units and values in type definitions" on Apr 1, 2022
@TravisSpomer

Please do better!

Apologies; I didn't mean to imply that you were being rude. I hope no one else took it that way.

I agree with your central points! Parsing strings to get numbers is d—to use nicer-sounding words, best avoided—for the reasons you list and more. But I think it makes the most sense for this spec only since, in my view, this format is already clearly leaning strongly toward human-editable rather than optimizing for machines. We agree on the "best" way, just not the best way for this particular spec. The format's design philosophy is perhaps something that needs to be more explicitly stated. If I've misinterpreted it and this is indeed a format primarily to be edited and read by code, then I will wholeheartedly agree with this issue and give it all my upvotes.

@romainmenke
Contributor Author

I've spent more time considering this and still think it is best to avoid micro syntaxes if possible.

My main concern is that regexp will be used or that we will see everyone writing custom parsers for these.

@WanaByte

@romainmenke said:

My main concern is that regexp will be used or that we will see everyone writing custom parsers for these.

I agree with this concern. As written, I think the spec will need to define the grammar for each string field.

For example, dimensions are currently limited to px and rem. I think this implies parsing with CSS logic to consume numeric tokens and to convert strings to numbers. The same applies to durations, where the spec chooses ms, but leaves open the ability to define other units later.

If we separate the number and units, we can fall back on JSON's number parsing. For example: "$value": {"number": -1.345e-2, "unit": "rem"} versus "$value": "-1.345e-2rem".
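
A sketch of the difference in practice (my own illustration): the split form rides on JSON's own number grammar, while the string form needs a grammar of its own.

// Split form: the number arrives already parsed, scientific notation included.
JSON.parse('{"number": -1.345e-2, "unit": "rem"}').number; // -0.01345

// String form: JSON only sees an opaque string; a separate grammar must
// define where the number ends and the unit begins.
JSON.parse('"-1.345e-2rem"'); // the string "-1.345e-2rem"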

@rdlopes

rdlopes commented Jul 21, 2022

Hi @romainmenke, there is actually an ISO format for expressing times, intervals, and durations: ISO 8601
Since we're talking standardization, it would be great to enforce ISO standards.

For durations, here is the Wikipedia description that we also use in tools like Camunda.

Is it set upfront that durations will always be expressed in milliseconds?

@romainmenke
Contributor Author

romainmenke commented Jul 21, 2022

Hi romainmenke, there is actually an ISO format for expressing times, intervals, and durations: ISO 8601
Since we're talking standardization, it would be great to enforce ISO standards.

I fully agree :)
But maybe it is better to open separate issues for specific value types which currently do not have an exact definition? (those that don't reference an existing standard and might be ambiguous)

This issue specifically is focussed on splitting numbers and units into multiple JSON fields. Not about the actual notation of the number or unit part.

@WanaByte

@rdlopes said:

Is it set upfront that durations will always be expressed in milliseconds?

From https://tr.designtokens.org/format/#duration:

Represents the length of time in milliseconds an animation or animation cycle takes to complete, such as 200 milliseconds. The $type property MUST be set to the string duration. The value MUST be a string containing a number (either integer or floating-point) followed by an "ms" unit. A millisecond is a unit of time equal to one thousandth of a second.

Probably forking the conversation, but if all durations are milliseconds, could the spec omit the ms, and just leave the number?

@rdlopes

rdlopes commented Jul 21, 2022

@romainmenke @KyleWpppd agreed, will fork, out of scope here.

@ilikescience

I haven't written a translator, so admittedly I'm not the best judge of computational complexity. That said ...

So far as I can tell in the current spec, for single-value (ie not composite) tokens, the $value of the token will always be:

  1. A hex code, which is a string that always starts with a # followed by 3, 4, 6, or 8 alphanumeric characters 0-9, a-f, A-F.
  2. A unitless value, which can be:
    a. A string that starts with a non-numeric character and contains any numeric- or non-numeric character afterwards (ie, a string like "bold")
    b. A string that starts with numeric characters and contains only numeric characters and/or a single decimal (ie a string that can be cast into a float or int, like "500")
  3. A value + unit, which is a string that starts with numeric characters and/or a single decimal, followed by any number of alphabetic characters. (ie "200ms")

In the case of 3, it seems pretty straightforward to split the string on the first non-numeric character, resulting in the value and unit.

The parser will need to use a regular expression to handle these cases, but it seems computationally cheap. To me, it's a good tradeoff for the better ergonomics of writing/editing.
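
A sketch of that split-on-first-non-numeric idea (my own, deliberately naive; the replies below probe exactly these gaps):

function naiveSplit(str) {
  // Index of the first character that is not a digit or a decimal point.
  const i = str.search(/[^0-9.]/);
  return { value: str.slice(0, i), unit: str.slice(i) };
}

naiveSplit('200ms');   // { value: '200', unit: 'ms' }    (fine)
naiveSplit('-10px');   // { value: '',    unit: '-10px' } (breaks on the sign)
naiveSplit('123e3px'); // { value: '123', unit: 'e3px' }  (breaks on exponents)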

@romainmenke
Contributor Author

romainmenke commented Dec 4, 2022

In the case of 3, it seems pretty straightforward to split the string on the first non-numeric character, resulting in the value and unit.

What about :

123e3px

Is that :

  • value : 123
  • unit : e3px

or

  • value : 123000
  • unit : px

The parser will need to use a regular expression to handle these cases, but it seems computationally cheap. To me, it's a good tradeoff for the better ergonomics of writing/editing.

A parser is not a regular expression.
This is also not about computational overhead.

This is about simplicity.
The stated goal is to be an interchange format between entire ecosystems of tools.
Tools which today do not work well together.

Inventing new micro syntaxes requires much more work to implement this.
Even borrowing existing micro syntaxes does not reduce the amount of work by much.


Having implemented multiple tokenizers and parsers I can tell you that the hand-wavy list of rules above is completely insufficient for the purpose of extracting values from strings.

A good reference to get a feel for the complexity involved is the CSS syntax specification.
Specifically the sections on tokenization and consuming numeric and string tokens :

https://www.w3.org/TR/css-syntax-3/#tokenizer-algorithms


All this complexity can be avoided by splitting units and numeric values in two fields in the JSON structure.

@ilikescience

What about :
123e3px

I think this could be avoided by disallowing scientific notation — I can't think of any cases where values would be so large or small to make the tradeoff worth the complexity.

@romainmenke
Contributor Author

romainmenke commented Dec 9, 2022

I think this could be avoided by disallowing scientific notation

That would mean that this specification needs to define its own number type completely.
Restricting values in this way makes it harder for token editors to copy/paste values.

|                                           | restrict to non-scientific | no restrictions | split unit and value |
| ----------------------------------------- | -------------------------- | --------------- | -------------------- |
| requires number specification             | yes                        | yes             | no                   |
| number specification is unique*           | yes                        | no              | no                   |
| requires custom tokenizer/parser          | no                         | yes             | no                   |
| restricts values                          | yes                        | no              | no                   |
| requires more typing when typing by hand  | no                         | no              | yes                  |

* specifying what a number is cannot be done by referring to an existing specification because the format is different.

@ilikescience

That would mean that this specification needs to define its own number type completely.

I don't see why that's the case, since we're discussing this in the context of spec-defined types like dimension, duration, etc.

I should have been more clear when I said "disallowing scientific notation." Let me rephrase:

According to the proposed definition of possible values for $value:

  1. A hex code, which is a string that always starts with a # followed by 3, 4, 6, or 8 alphanumeric characters 0-9, a-f, A-F.
  2. A unitless value, which can be:
    a. A string that starts with a non-numeric character and contains any numeric- or non-numeric character afterwards (ie, a string like "bold")
    b. A string that starts with numeric characters and contains only numeric characters and/or a single decimal (ie a string that can be cast into a float or int, like "500")
  3. A value + unit, which is a string that starts with numeric characters and/or a single decimal, followed by any number of alphabetic characters. (ie "200ms")

123e3px is not a valid value. 123000px is the valid equivalent.

Do you foresee this causing any problems?

@romainmenke
Contributor Author

romainmenke commented Dec 9, 2022

I don't see why that's the case, since we're discussing this in the context of spec-defined types like dimension, duration, etc.

A hex code, which is a string that always starts with a # followed by 3, 4, 6, or 8 alphanumeric characters 0-9, a-f, A-F.

What if you want to express a string value that starts with #?

I think this rule already is an oversimplification.
The hex color value parsing should only come into play once "$type": "color" has been determined.
In any other type it would just be a string literal and invalid (in the current specification).

A string that starts with a non-numeric character and contains any numeric- or non-numeric character afterwards (ie, a string like "bold")

Ok

A string that starts with numeric characters and contains only numeric characters and/or a single decimal (ie a string that can be cast into a float or int, like "500")

Negative numbers?
-10 ?

What if a string literal starts with a number?
"10 horses"
Is this a parse error or a string value?

How do I express string literals that contain only numbers?
"10" but must be processed in translation tools as a string value, not a number.

These examples mainly show that a dimension value can not be separated from its type field.
Parsing becomes much easier if you already know that something is a dimension and must not be something else.


Do you foresee this causing any problems?

I think the main issue is vagueness.
Limiting the number type to make parsing easier doesn't really make it easier in practice.

I can just pass 123e3 to something like parseFloat in JavaScript and get the intended result. But if the specification does not allow scientific notation then I must intercept and error on any values with scientific notation.

123e3px doesn't go away by not allowing it.
Implementations now need to intercept it and mark those tokens as invalid.
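
To make "intercept it" concrete, a sketch (mine, assuming JavaScript): the language's own number parsing already accepts the notation, so a ban means adding an explicit rejection step.

parseFloat('123e3'); // 123000, the notation is already understood

// Under a "no scientific notation" rule an implementation must now
// reject input the language would otherwise parse without complaint:
const input = '123e3';
if (/[eE]/.test(input)) {
  throw new Error('Scientific notation is not allowed in token values');
}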

There are only two options :

  • go for a structured format like I propose here and use the transform format fully
  • write a detailed specification for all possible values

A detailed specification includes error handling.

@romainmenke
Contributor Author

romainmenke commented Dec 9, 2022

Limiting the number type to make it easier to parse in implementations would also conflict with this resolution : #149 (comment)

Easy editing by humans is now an important aspect of this specification.

That number values from a different context can not be copy/pasted into a design token file doesn't help people edit these files manually.

If any program produces numbers with scientific notation it forces people to convert them first before using them in design token files.

I would prefer it if this specification did not enforce arbitrary limits like that.

@ilikescience

Before going deeper down the rabbit hole, I want to enthusiastically agree that splitting numbers and units would make the spec simpler and make it easier for parsers to rely on languages' built-in types (like numbers and strings).

On the other hand, I think we can agree that "$value": "100ms" is easier to type than "$value": { "number": "100", "unit": "ms" }.

So, given that, along with the stated goal of making token files easy for people to edit, the 2 questions I'm trying to answer are:

  1. Can we come up with a clear specification (I think that's what we're calling a microformat) for the possible value of $value in a way that allows someone to write a parser that correctly separates units and numbers?
  2. Is the complexity of implementing that spec/microformat worth the benefit gained by being easier to write?

Obviously my opinion on both of these is "yes," but as we go through @romainmenke's excellent counterpoints I'm having doubts. But I do think it's worth continuing to iterate to see if we can get to clarity around the spec's currently-implied microformat before moving to splitting units and numbers.

Ok, now down the rabbit hole.


What if you want to express a string value that starts with #?

Yes, good point. It might be productive to limit our discussion to types like dimension and duration, where separating units from numbers is important. The need for clarity around string values is a topic that has come up a lot, and warrants a separate topic.

Negative numbers?

Another good point, my microformat proposal didn't account for negative numbers. I'll iterate at the end of this comment.

What if a string literal starts with a number?
"10 horses"
Is this a parse error or a string value?

I think this hypothetical is a little too hypothetical. But I think it would be reasonable to understand "10 horses" as a number of 10 and a unit of horses (say, if you needed to convert this to another unit). The microformat doesn't account for spaces between the unit and value, but it should.

If you want "10 horses" to be a string (not a number + unit), then you would need something like a string type, which again we haven't fully addressed and warrants its own topic.

How do I express string literals that contain only numbers?

That would be something like a (currently undefined) string type, not a dimension or a duration.

Parsing becomes much easier if you already know that something is a dimension and must not be something else.

Agree. I think we're discussing a microformat for any tokens that have numbers+units, not for every single type/token.

If any program produces numbers with scientific notation it forces people to convert them first before using them in design token files.

I agree that changing numbers from scientific notation is an extra step. I think we have to consider tradeoffs here: how often is someone copying and pasting numbers into a token file from a tool that produces scientific notation? My estimation is that it is pretty rare. If we are making files harder to write and/or harder to parse to accommodate this workflow, is it worth it?


Here's an updated proposal for the format based on these ideas:

The $value of a token that has units will always take the following format:

  1. MIGHT include a - or +, then
  2. MUST include at least one numeric character
  3. MIGHT include a decimal
    a. if a decimal is present, MUST include at least one numeric character after the decimal
  4. MIGHT include a space
  5. MUST include at least one letter character

Here's a reference implementation:

function parse(str) {
  // optional sign, digits with an optional decimal part, an optional space, then unit letters
  const regex = /^([-+]?\d*\.?\d+)\s?([A-Za-z]+)$/;
  const matches = str.match(regex);

  if (!matches) {
    throw new Error('Input string does not follow the format');
  }

  const number = matches[1];
  const unit = matches[2];

  return { number, unit };
}

parse("100ms") results in { "number": "100", "unit": "ms" }.
parse("10 horses") results in { "number": "10", "unit": "horses" }
parse("-10.23px") results in { "number": "-10.23", "unit": "px" }.

parse("123e3px") results in an error.


@romainmenke I appreciate you continuing to push on this and having an exacting attention to detail and edge cases.

@romainmenke
Contributor Author

romainmenke commented Dec 11, 2022

I would strongly advise not to use regexp for this, not even for a reference implementation.
I can not stress this enough, a parser is not a regexp and a regexp is not a parser.

If I had a dime for each bug that existed because someone used regexp where they should have used a true tokenizer and parser :)

Doing this work both in the specification and in implementations from day one will spare end users so many issues and bugs.


The example with 10 horses was meant to illustrate a true string that kind of looks like a value and a unit but most definitely is not.

White space must not be allowed between a value and a unit.
Allowing this is highly unusual.


I agree that changing numbers from scientific notation is an extra step. I think we have to consider tradeoffs here: how often is someone copying-and pasting numbers into a token file from a tool that produces scientific notation? My estimation is that it is pretty rare. If we are making files harder to write and/or harder to parse to acommodate this workflow, is it worth it?

Here's an updated proposal for the format based on these ideas:

The $value of a token that has units will always take the following format:

I think it is confusing that there are arbitrary differences between a JSON number and a number that is part of a dimension.

Why can't we adopt the definition of a number from JSON?

https://www.json.org/json-en.html


A parsing algorithm can use look ahead because values are contained within JSON fields.

This algorithm assumes that the token type is dimension, duration or any of the other value + unit tuples.
It will not produce correct results if the token type can be anything else.

It requires that the allowed units are known.

  1. compare the end of the string value with the known units
  2. if the string value does not end in any of the known units
    2.a. this is a parsing error
  3. trim the unit from the string value
  4. parse the remaining string value as JSON
  5. if parsing as JSON fails
    5.a. this is a parsing error
  6. if the parsed value is not a number
    6.a. this is a parsing error
  7. return the parsed value as the value and the found unit as unit
function parseUnitAndValue(input, allowedUnits) {
	allowedUnits.sort((a, b) => b.length - a.length); // can be optimized by sorting outside this function

	// 1. compare the end of the string value with the known units
	const unit = allowedUnits.find((candidate) => {
		return input.slice(input.length-candidate.length) === candidate;
	});

	// 2. if the string value does not end in any of the known units
	if (!unit) {
		// 2.a. this is a parsing error
		throw new Error('Parse error');
	}

	// 3. trim the unit from the string value
	let inputWithoutUnit = input.slice(0, input.length - unit.length);

	let value;
	try {
		// 4. parse the remaining string value as JSON
		value = JSON.parse(inputWithoutUnit);
	} catch (err) { // 5. if parsing as JSON fails
		// for debugging maybe?
		console.log(err);
		// 5.a. this is a parsing error
		throw new Error('Parse error');
	}

	// 6. if the parsed value is not a number
	if (typeof value !== 'number') {
		// 6.a. this is a parsing error
		throw new Error('Parse error');
	}

	// 7. return the parsed value as the `value` and the found unit as `unit`
	return {
		value: value,
		unit: unit,
	};
}

console.log(parseUnitAndValue('-10rem', ['em', 'rem', 'px']));

Implementers might prefer to use a regexp to find and trim the unit from the end.
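
For example, a hypothetical regex-based variant of steps 1 to 3 (my sketch; only the unit trim changes, and the number part still goes through JSON.parse as in step 4):

// One alternation built from the known units, longest first,
// anchored to the end of the input.
const unitAtEnd = /(rem|em|px)$/;

function trimUnit(input) {
  const match = input.match(unitAtEnd);
  if (!match) {
    throw new Error('Parse error'); // no known unit at the end
  }
  return {
    unit: match[1],
    rest: input.slice(0, input.length - match[1].length),
  };
}

trimUnit('-10rem'); // { unit: 'rem', rest: '-10' }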


An algorithm of this kind has these benefits :

  • simple to implement
  • parity with regular numbers in JSON
  • implementations do not need to write complete tokenizers and parsers

The downside is that it can only be applied if you already know that something is a unit and value tuple and what the allowed units are.


If a field in a composite token allows multiple types then it requires a second algorithm.

For this theoretical composite type :

  • type composite-example
  • one field foo
  • foo allows string, dimension, number values

valid:

{
  "alpha": {
    "$type": "composite-example",
    "$value": {
      "foo": 10
    }
  },
  "beta": {
    "$type": "composite-example",
    "$value": {
      "foo": "10"
    }
  },
  "delta": {
    "$type": "composite-example",
    "$value": {
      "foo": "10px"
    }
  }
}
  1. if value is a number
    1.a. return this number
  2. if value is a string
    2.a. try to parse as a dimension
    2.b. if parsing succeeds
      2.b.i. return the value/unit tuple
    2.c. otherwise return the string value
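
A sketch of that fallback in code (mine, reusing the parseUnitAndValue function defined above):

function parseCompositeField(value, allowedUnits) {
  // 1. if value is a number, return this number
  if (typeof value === 'number') {
    return value;
  }

  // 2. if value is a string, try to parse it as a dimension
  try {
    // 2.b. if parsing succeeds, return the value/unit tuple
    return parseUnitAndValue(value, allowedUnits);
  } catch (err) {
    // 2.c. otherwise return the string value
    return value;
  }
}

parseCompositeField(10, ['px']);     // 10
parseCompositeField('10', ['px']);   // '10'
parseCompositeField('10px', ['px']); // { value: 10, unit: 'px' }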

This works well up until a new unit is added to dimension.
Any tools that haven't been updated will parse new tokens as strings.


This parsing algorithm doesn't have any issues for dimension or duration but it complicates error handling.

It fails to detect syntactically valid microsyntax with unknown units.

This affects the parsing of composite tokens.


A different approach would be to require implementers to write a custom version of the JSON number parsing algorithm. The JSON number grammar can be found here: https://www.json.org/json-en.html

  1. parse a number exactly like JSON
  2. if parsing a number failed
    2.a. this is a parsing error
  3. trim the parsed number from the input
  4. if the remainder is not one of the allowed units
    4.a. this is a parsing error
  5. return the parsed value as the value and the found unit as unit
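
A sketch of this second algorithm (mine; the pattern below is a transcription of the json.org number grammar, which is regular, so no full tokenizer is needed):

// Exactly a JSON number at the start of the input:
// optional minus, integer part, optional fraction, optional exponent.
const JSON_NUMBER = /^-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?/;

function parseDimensionStrict(input, knownUnits) {
  // 1. parse a number exactly like JSON
  const match = input.match(JSON_NUMBER);
  if (!match) {
    // 2.a. this is a parsing error: invalid microsyntax
    throw new Error('Parse error');
  }

  // 3. trim the parsed number from the input
  const unit = input.slice(match[0].length);

  // 4. the remainder must be one of the allowed units; this is the point
  //    where "valid microsyntax, unknown unit" becomes distinguishable
  if (!knownUnits.includes(unit)) {
    throw new Error('Unknown unit: ' + unit);
  }

  // 5. return the parsed value as the value and the found unit as unit
  return { value: JSON.parse(match[0]), unit: unit };
}

parseDimensionStrict('123e3px', ['px', 'rem', 'ms']); // { value: 123000, unit: 'px' }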

This places a much higher burden on implementers but it is able to distinguish these :

  • valid <number><unit> microsyntax and known unit
  • valid <number><unit> microsyntax but unknown unit
  • invalid <number><unit> microsyntax

Being able to make that distinction makes it much easier to extend composite types in the future. (in the specification, not in implementations)


I think it is important to constantly look ahead and balance these aspects :

  • future extension of the specification without it being breaking
  • burden on users
  • burden on implementers

Simply splitting the unit and value has none of these issues, challenges or drawbacks.

@romainmenke
Contributor Author

romainmenke commented Dec 11, 2022

A good way to get a sense of the complexity involved would be to explore how dimension tokens are parsed in CSS.

In our tokenizer this is done here: https://github.com/csstools/postcss-plugins/blob/postcss-preset-env--v8/packages/css-tokenizer/src/consume/numeric-token.ts#L19

number specifically: https://github.com/csstools/postcss-plugins/blob/postcss-preset-env--v8/packages/css-tokenizer/src/consume/number.ts

CSS is not JSON so the exact implementation is different.
But similar things apply.

oversimplified :

  1. consume a number byte by byte
  2. check if the remainder starts an ident
    2.a. consume an ident byte by byte
    2.b. return the tuple
  3. return a number

@ilikescience

Great points on the parsing algorithm. I think any recommendation by the spec here would be non-normative, but it's good to see how different strategies to parsing apply.

Our microformat definition could be as simple as "A string containing a number (defined by JSON's number type) followed by a unit (defined by a list of valid units). The number and unit can not be separated by a space."

Then, implementers can use whatever parsing algorithm they want.

To state the issues:

  1. You have to know whether or not a value is a number and unit tuple in order to correctly parse it
    The spec has to define which token types can and can't be a number+unit tuple, so a parser can use this information to avoid incorrectly parsing. In fact, it might be interesting to explore the possibility of defining dimension and duration as ALWAYS being a number+unit tuple, which removes a lot of ambiguity.

  2. You have to check the unit against known/defined units
    The spec currently takes the approach of defining valid units, so that list is normative. Even if we split units out, the parser still needs to check the unit against the list to validate the token.

  3. Composite tokens require additional algorithms
    This is something that I think necessitates a separate thread, as I'm seeing a lot of potential problems with our current approach to implicitly typing the values of a composite token. If the values of a composite token were explicitly typed, this issue is no longer a concern.


So, if the spec is written so that the first type of algorithm can work for all cases, and as you said it's "simple to implement," then it seems like it might be a viable way to enable authors to write number+unit strings instead of breaking them out.

@romainmenke
Contributor Author

I think any recommendation by the spec here would be non-normative, but it's good to see how different strategies to parsing apply.

I am now unsure if these are generally normative or not.
https://tc39.es/ecma262/#sec-string.prototype.codepointat

Can they be normative without requiring implementations to follow them exactly?
I think the specification must leave room for implementors to :

  • optimize algorithms for performance
  • use fewer or more steps in contexts that need them

To state the issues:

These are all correct :)
But they are all a side-effect of ambiguity.

There are multiple ways to remove that ambiguity.

  • splitting unit and value
  • explicit typing
  • defined parsing algorithms
  • ...

You have to check the unit against known/defined units
The spec currently takes the approach of defining valid units, so that list is normative. Even if we split units out, the parser still needs to check the unit against the list to validate the token.

This has the extra nuance that this list is not a constant.

  • today : px, rem, ms
  • tomorrow: px, rem, ms, new-unit

Tools do not all update simultaneously, and users do not update all of their tools at the same time.

This will easily create cases where new-unit is introduced by tool X but needs to be processed by tool Y which doesn't yet understand new-unit.

It must be defined what tool Y must do when encountering 10new-unit.
We can not assume this is a user error.

I will open a separate issue to define a general error handling strategy.

@o-t-w

o-t-w commented Jun 6, 2023

It may be of interest that the Typed OM JavaScript API separates units and values.

https://drafts.css-houdini.org/css-typed-om/

@romainmenke
Contributor Author

Typed OM JavaScript API separates units and values.

Yes, any sane object model separates these :)

@o-t-w

o-t-w commented Jun 6, 2023

On the topic of readability, the spec says “Groups are arbitrary and tools SHOULD NOT use them to infer the type or purpose of design tokens.”

Could the spec be changed to allow people to define the type at the group level? Would that address the issues raised here but keep it readable?

@drwpow
Contributor

drwpow commented Aug 16, 2024

Sorry for pinging an old thread, but I’ve just put up a proposal ✅ accepting this original proposal: #244.

This is NOT an official decision yet! This is only a way to source feedback, and push this forward. I tried to distill the thoughts & opinions expressed in this thread into a formal spec change (also, for the eagle-eyed, you may spot an earlier comment from me initially against this proposal. The arguments were compelling. I came around. People grow 🙂).

Any/all feedback would be welcome. Thanks all for pushing for this change 🙏

  • Were your concerns addressed?
  • Anything you’d like to see change (see the diff)?

@jorenbroekema

jorenbroekema commented Sep 9, 2024

I have some concerns regarding this proposal.

It's simpler?

For the tool to parse it, yes perhaps, but it's making the value more complex by making it of type object.
I disagree that it's simpler: making the type more complex to address "complexity" of the string interpolation feels like a bad trade-off, considering how easy it is to do the string interpolation in this case.

Objects in objects

If we make very simple token types object types (e.g. dimension, duration etc.), it means we get more instances where already composite type tokens will be allowed to contain nested objects:

{
  shadow_token: {
    $value: {
      "xOffset": {
        "number": 0,
        // is unit required? can it be left out if number is 0?
      },
      "yOffset": {
        "number": 5,
        "unit": "px"
      }
    }
  }
}

I'm not saying it's not technically feasible to deal with composite value nesting, there's already a case where this is in the spec: cubicBezier can be a composite token and can be a property (timingFunction) of transition, I think strokeStyle inside border is another example. But it does add to the level of complexity in my opinion, more than the rather simple string interpolation needed right now. I also wonder how deep we want to go with regards to nesting objects in objects inside values. E.g. I'm imagining a world where we have shadows that have colors that are gradients with many channels/color spaces, or as transitions/animations token types become more complex, I'd expect a lot of potential nesting there too. I guess what I'm saying is, I'd like to try to prevent that we add to this complexity unnecessarily.

Multiple values

I have a small concern related to #239 which is actually quite a similar issue to this issue, the summary is whether we want to allow array values e.g. for shadows, animations, gradients/colors, to be layered within a single design token, as this is often conceptually a single "design decision". That issue doesn't have much feedback at all yet but if it were to be received positively, it would maybe clash a bit with this proposal because both issues propose to complicate the token $value type for dimension tokens.


Summarizing, I feel like this proposal smells of over-engineering. I am not convinced that the benefits of splitting dimensions in syntax-tree like objects is worth it. The downsides are that tokens become more verbose/heavy, more complex (e.g. nested objects), less easy to author/read (if those are goals of the spec, perhaps debatable).

My suggestion is to keep this open but to not push this forward until there is more of a consensus on the actual need of this to land in the spec. I just don't see the need right now.

Edit: I don't mean to be harsh btw, I'm not absolutely/strongly against the proposal, just wanted to make sure I voice the thoughts I have against it (I think the arguments for it are clear already)

@jorenbroekema

Alternative idea (feel free to upvote/downvote, I'm impartial to this idea and won't be offended, just throwing it out there)

Have the spec publish a regex pattern recommendation for tools to use for interpreting dimension tokens, e.g.:
(?<value>^.+?)(?<unit>[a-zA-Z%]*$)
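
Checking that pattern against values raised earlier in the thread (my own quick test, using JavaScript named groups):

const pattern = /(?<value>^.+?)(?<unit>[a-zA-Z%]*$)/;

'-10.23px'.match(pattern).groups; // { value: '-10.23', unit: 'px' }
'123e3px'.match(pattern).groups;  // { value: '123e3', unit: 'px' }
'100%'.match(pattern).groups;     // { value: '100', unit: '%' }

// It assumes the input is already known to be a dimension: a plain
// string such as 'red' also matches, as value 'r' and unit 'ed'.
'red'.match(pattern).groups;      // { value: 'r', unit: 'ed' }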

@romainmenke
Contributor Author

romainmenke commented Sep 10, 2024

this proposal smells of over-engineering

no, it is the opposite :)

This proposal aims to use what is already there in JSON, what is freely and readily available in the underlying tech.
It doesn't require any further spec work or work on the side of implementations.

Not splitting units and values into separate fields however might very well lead to over-engineering:

  • need to define or reference a spec for numbers
  • need to define or reference a spec for units (is px2 a valid unit? It is in CSS, at the tokenizing/parsing level)
  • need to define general parsing guidelines

This proposal is to "under-engineer" the dimension type.


I can understand and sympathize that an extra object that is visibly present in a tokens file is perceived as extra complexity.

At the end of the day what I truly find important is that this specification is precise and exact and doesn't leave room for implementors to make the wrong assumptions.

If this is done with strict number, unit and parsing definitions, fine by me, but this is work that needs to be done.

It can't be assumed that it is obvious how to parse 10px into 10 and px because people will not consider all the edge cases and will create subtly different implementations. Each meaningfully different implementation will cause friction in the ecosystem and places future constraints on this spec.

Separating units and numbers adds perceived complexity but reduces very real specification and implementation complexity.

@jorenbroekema

jorenbroekema commented Sep 10, 2024

Not splitting units and values into separate fields however might very well lead to over-engineering:

  • need to define or reference a spec for numbers
  • need to define or reference a spec for units (is px2 a valid unit? It is in CSS, at the tokenizing/parsing level)
  • need to define general parsing guidelines

We need to do all of these things regardless of whether the value is a string or an object so I'm not sure why you're bringing this up. Especially the first two, what does it matter if the unit/number are already split from one another, you have to decide whether you want to restrict the units/numbers regardless?

this specification is precise and exact and doesn't leave room for implementors to make the wrong assumptions

And that's exactly the mindset/philosophy that imo leads to over-engineering the spec. You can't realistically do this - pragmatism is just as important as correctness, if not more important.

reduces very real specification and implementation complexity

As a specification implementor (style-dictionary, sd-transforms and various other tools that consume DTCG tokens), my experience is that I just don't agree with this statement. The current string value isn't complex 😅. I also disagree with @jonathantneal 's analysis, I don't see how their examples are relevant to the design token JSON format, which @TravisSpomer also pointed out.

Edit: Let me be helpful and mention what would change my mind on this matter: real world use cases, that are actually applicable to design tokens in JSON, where string parsing dimensions to split value from unit is actually error-prone when done by a simple regex.

@romainmenke
Contributor Author

romainmenke commented Sep 12, 2024

We need to do all of these things regardless of whether the value is a string or an object so I'm not sure why you're bringing this up. Especially the first two, what does it matter if the unit/number are already split from one another, you have to decide whether you want to restrict the units/numbers regardless?

This is untrue and a trivial example of that is the max value that can be expressed.
In string form you can write sequences of digits that, as numbers, would far exceed 32, 64, ... bits. By relying on the underlying format, JSON, you already inherit a bunch of properties of the JSON number type. You get this for free and JSON parsing/serializing is widely supported.

If however a microsyntax is chosen you need to at least define how to parse a string into a number. The current specification doesn't do that.
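
A concrete illustration (mine; the behavior shown is JavaScript's, where JSON numbers become IEEE 754 doubles):

// A JSON parser already defines what happens to extreme values:
JSON.parse('9007199254740993'); // 9007199254740992, silently rounded to a double
JSON.parse('1e999');            // Infinity

// The string "9007199254740993px" has no defined numeric behavior
// until the token specification writes one down.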

Keep in mind that not everything is JavaScript.
Different languages have different APIs to parse numbers and those have subtly different behaviors.

Even without defining it you might get good results in your context, but the goal is that anyone can get the correct outcome.


this specification is precise and exact and doesn't leave room for implementors to make the wrong assumptions

And that's exactly the mindset/philosophy that imo leads to over-engineering the spec. You can't realistically do this - pragmatism is just as important as correctness, if not more important.

I am not sure how to respond to that.

A specification, by its very nature, is a document that helps and guides multiple implementors realize different implementations that still behave in the same way. A primary purpose of specifications is to have interop between various tools.

This is not a mindset or philosophy... 😕

I also don't understand what isn't realistic about this.


To clarify, such a description wouldn't tell you how to do it.
It would be agnostic of regex, tokenizers/parsers, parseFloat, ... any API you prefer, ...

A specification typically only describes the high level steps in pseudo code.
Although it is often possible to naively write code that directly matches those steps, it is typical for implementors to search for and find more optimized algorithms.

It is up to you to actually write code that matches those high level steps.
If that is a regex and it works exactly as needed, then that is obviously fine.

No specification tells you to over-engineer your code.


I am not arguing against you or the use of a micro syntax, I am advocating for clarity, completeness, ease of implementation and interop.

I am absolutely sure that an object notation is the most direct path towards those attributes, but I will follow whatever is decided by the spec editors.
