Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Explicitly specify what characters are/aren't allowed in DataObject db field names #10605

Open
GuySartorelli opened this issue Dec 5, 2022 · 5 comments

Comments

@GuySartorelli
Copy link
Member

GuySartorelli commented Dec 5, 2022

We've come across situations a few times where we want to verify if column names are "valid" (#10600 is the most recent) - and have tried to do so using regex. Unfortunately there's currently no validation inside Silverstripe code itself to determine what characters are or aren't valid for database columns.

We already use code like explode() etc in places, which make assumptions about what characters aren't valid in db field names - but the only validation for this is in the database itself. This can potentially lead to scenarios where a db column is technically valid and gets created in the database with no errors or warnings, but then our code makes incorrect assumptions and causes errors later down the line - after data has already been created.

I propose we set either a whitelist of allowed characters, or a blacklist of disallowed characters, and validate DB column names. Invalid column names would result in errors at least during dev/build, so that developers are never in a scenario where they have otherwise column names that are unusable with some functionality, and so that we can more confidently validate column name validity anywhere in the code without ambiguity.

@silverstripe/core-team thoughts?

@emteknetnz
Copy link
Member

emteknetnz commented Dec 5, 2022

I like this, though I'm concerned about scenarios where developers cannot change existing column names such as column-with-dashes and spaces due to a 2nd system that consumes the contents of the database and thus cannot easily be changed. It's the old "maintain backwards compatibility to facilitate easy upgrades" argument.

If we do this, then I'd much prefer a list of allowable characters as this method is far stricter, presumably it would be something like ^[a-zA-Z0-9_]+$ so that it can work in raw mysql queries without the use of backticks to escape anything (only a nice to have though, you could presumably still call your column name a mysql reserved word). We'd need to be careful to allow utf8 characters such as macrons though (probably?).

@GuySartorelli
Copy link
Member Author

GuySartorelli commented Dec 5, 2022

I think allowing utf8 characters is a must, especially for developers in places where English isn't the primary language, which may want to use characters from their language's script for the column names - though I'm interested to hear if anyone disagrees with that.

I'd suspect hyphens and underscores would be allowed, along with alphanumerical characters including utf8 "word" characters. So /^[\w_-]+$/u would perhaps be appropriate (where \w is functionally equivalent to a-z0-9 and the u modifier matches UTF16 characters), as it correctly matches gāй0-9_ but would not match if special characters such as !@#$%^&*()+=;:'" etc were included.

The upgrade path concern is valid though. I'm not sure how to reconcile that - or how much of an impact this change would have (if any impact at all).

@michalkleiner
Copy link
Contributor

What are we doing for table names? Could/should that be somehow reflected here as well?

@michalkleiner
Copy link
Contributor

Would we provide a migration script for data that suddenly live in a column with an unsupported name and a new column with a supported name is created during dev/build? Or use a fallback trying to access the column without the sanitisation applied?

@GuySartorelli
Copy link
Member Author

What are we doing for table names? Could/should that be somehow reflected here as well?

Other than passing it through Convert::raw2sql() (which probably also happens to db column names, but I haven't checked) there is no validation of table names currently - it might make sense to treat those in a similar way as is being proposed here for column names.

Would we provide a migration script for data that suddenly live in a column with an unsupported name and a new column with a supported name is created during dev/build?

I think that's a good idea, if this RFC is approved. It won't help anyway with the situation that Steve described (i.e. they can't change the name without affecting other systems), but I suspect that is a very edge case scenario.

Or use a fallback trying to access the column without the sanitisation applied?

I'd like to avoid this, as it means we then have to support these "invalid" column names with all our functionality, which is what I'm trying to get away from with this RFC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants