RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

GuySartorelli · 2022-12-05T21:58:14Z

We've come across situations a few times where we want to verify if column names are "valid" (#10600 is the most recent) - and have tried to do so using regex. Unfortunately there's currently no validation inside Silverstripe code itself to determine what characters are or aren't valid for database columns.

We already use code like explode() etc in places, which make assumptions about what characters aren't valid in db field names - but the only validation for this is in the database itself. This can potentially lead to scenarios where a db column is technically valid and gets created in the database with no errors or warnings, but then our code makes incorrect assumptions and causes errors later down the line - after data has already been created.

I propose we set either a whitelist of allowed characters, or a blacklist of disallowed characters, and validate DB column names. Invalid column names would result in errors at least during dev/build, so that developers are never in a scenario where they have otherwise column names that are unusable with some functionality, and so that we can more confidently validate column name validity anywhere in the code without ambiguity.

@silverstripe/core-team thoughts?

The text was updated successfully, but these errors were encountered:

emteknetnz · 2022-12-05T23:14:06Z

I like this, though I'm concerned about scenarios where developers cannot change existing column names such as column-with-dashes and spaces due to a 2nd system that consumes the contents of the database and thus cannot easily be changed. It's the old "maintain backwards compatibility to facilitate easy upgrades" argument.

If we do this, then I'd much prefer a list of allowable characters as this method is far stricter, presumably it would be something like ^[a-zA-Z0-9_]+$ so that it can work in raw mysql queries without the use of backticks to escape anything (only a nice to have though, you could presumably still call your column name a mysql reserved word). We'd need to be careful to allow utf8 characters such as macrons though (probably?).

GuySartorelli · 2022-12-05T23:25:03Z

I think allowing utf8 characters is a must, especially for developers in places where English isn't the primary language, which may want to use characters from their language's script for the column names - though I'm interested to hear if anyone disagrees with that.

I'd suspect hyphens and underscores would be allowed, along with alphanumerical characters including utf8 "word" characters. So /^[\w_-]+$/u would perhaps be appropriate (where \w is functionally equivalent to a-z0-9 and the u modifier matches UTF16 characters), as it correctly matches gāй0-9_ but would not match if special characters such as !@#$%^&*()+=;:'" etc were included.

The upgrade path concern is valid though. I'm not sure how to reconcile that - or how much of an impact this change would have (if any impact at all).

michalkleiner · 2022-12-06T00:09:00Z

What are we doing for table names? Could/should that be somehow reflected here as well?

michalkleiner · 2022-12-06T00:10:30Z

Would we provide a migration script for data that suddenly live in a column with an unsupported name and a new column with a supported name is created during dev/build? Or use a fallback trying to access the column without the sanitisation applied?

GuySartorelli · 2022-12-06T00:18:28Z

What are we doing for table names? Could/should that be somehow reflected here as well?

Other than passing it through Convert::raw2sql() (which probably also happens to db column names, but I haven't checked) there is no validation of table names currently - it might make sense to treat those in a similar way as is being proposed here for column names.

Would we provide a migration script for data that suddenly live in a column with an unsupported name and a new column with a supported name is created during dev/build?

I think that's a good idea, if this RFC is approved. It won't help anyway with the situation that Steve described (i.e. they can't change the name without affecting other systems), but I suspect that is a very edge case scenario.

Or use a fallback trying to access the column without the sanitisation applied?

I'd like to avoid this, as it means we then have to support these "invalid" column names with all our functionality, which is what I'm trying to get away from with this RFC.

GuySartorelli added the rfc/draft label Dec 5, 2022

GuySartorelli mentioned this issue Dec 5, 2022

API Create orderBy() method to handle raw SQL #10600

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

GuySartorelli commented Dec 5, 2022 •

edited

Loading

emteknetnz commented Dec 5, 2022 •

edited

Loading

GuySartorelli commented Dec 5, 2022 •

edited

Loading

michalkleiner commented Dec 6, 2022

michalkleiner commented Dec 6, 2022

GuySartorelli commented Dec 6, 2022

RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

Comments

GuySartorelli commented Dec 5, 2022 • edited Loading

emteknetnz commented Dec 5, 2022 • edited Loading

GuySartorelli commented Dec 5, 2022 • edited Loading

michalkleiner commented Dec 6, 2022

michalkleiner commented Dec 6, 2022

GuySartorelli commented Dec 6, 2022

GuySartorelli commented Dec 5, 2022 •

edited

Loading

emteknetnz commented Dec 5, 2022 •

edited

Loading

GuySartorelli commented Dec 5, 2022 •

edited

Loading