Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to kaliop_migration table to utf8mb4? #176

Closed
omh opened this issue Nov 19, 2018 · 15 comments
Closed

Move to kaliop_migration table to utf8mb4? #176

omh opened this issue Nov 19, 2018 · 15 comments
Milestone

Comments

@omh
Copy link
Contributor

omh commented Nov 19, 2018

While not technically necessary it would be nice if the kaliop_migration table also used utf8mb4 to be consistent with the rest of eZ.

@gggeek
Copy link
Member

gggeek commented Nov 19, 2018

Indeed. How does ez deal with backwards compatibility about that? Is there any issue at all?

@omh
Copy link
Contributor Author

omh commented Nov 19, 2018

Don't think it deals with it at all. At least for platform v2.3 all database versions in the requirements (https://doc.ezplatform.com/en/2.3/getting_started/requirements/) support utfmb4. If you do have run it on an ancient version you'd have to convert it manually. utf8mb4 was added in MySQL 5.5 back in 2010: https://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-3.html

@gggeek gggeek added this to the 5.8 milestone Dec 15, 2018
@gggeek gggeek modified the milestones: 5.8, 5.9 Jan 18, 2019
@gggeek gggeek modified the milestones: 5.9, 5.8 Apr 2, 2019
@gggeek
Copy link
Member

gggeek commented Apr 2, 2019

Uhm... preliminary analysis tells me that the component I use to create kaliop migrations own tables (which is the Schema Manager from Doctrine) does not have explicit support for managing the character set :-(

I could maybe add a sql migration file to the mig. bundle, which would only affect mysql-based databases... at the same time I am a bit loath to do that, as it might break stuff for anyone who's doing joins on those tables.

Any suggestion on how to tackle this is welcome ;-)

@omh
Copy link
Contributor Author

omh commented Apr 2, 2019

Have to admin I'm not well versed in Doctrine and its migration system, but I did some googling...

From a bit of research it seems possible to pass a SchemaConfig via $table->setSchemaConfig() where $table is what is returned by $schema->createTable(). This SchemaConfig object has a setDefaultTableOptions method that takes an array, and that array, I believe, can take a charset key. Completely untested though. Not sure how it would work on other DBs like postgresql. Hopefully it's just ignored...

@gggeek
Copy link
Member

gggeek commented Apr 2, 2019

Btw: there is an interesting comparison of how performances (for selects) are affected by choice of character set, related to the version of MySql in use at: https://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact-on-mysql-performance/
In short: depending on the version in use, one 'collation' charset is to be preferred to others - there seems to be no 'universal good default'

@gggeek
Copy link
Member

gggeek commented Apr 2, 2019

@omh my main concerns would still be:
a) making sure that the code works with Postgres (and eventually other dbs)
b) making sure that the charset picked for the migration tables is exactly the same as the ones for the rest of the eZ schema

I would have hoped that point B was naturally the case - at least it happened on the databases which I use, which are set for a default of utf8_general_ci. Do you confirm that it did not happen for you? And if it did not, what are the default db settings that you have for collations?

@gggeek
Copy link
Member

gggeek commented Apr 2, 2019

About Doctrine not specifying a default charset/collation (or picking up the default one from the current schema): it seems that this is not the case: https://github.com/doctrine/dbal/blob/master/lib/Doctrine/DBAL/Platforms/MySqlPlatform.php#L486.

Otoh we should be able to retrieve it with a couple of queries:
select database();
show variables like "character_set_database";

@omh
Copy link
Contributor Author

omh commented Apr 3, 2019

Lots of MySQL installations are still utf8 by default even though utf8mb4 is recommended. E.g default MySQL on Ubuntu 18.04 will have utf8 as default. eZ Platform set all their tables as utf8mb4 in their schema directly: https://github.com/ezsystems/ezpublish-kernel/blob/master/data/mysql/schema.sql

This isn't hugely important as you're unlikely to have paths or error messages with emojis, but it's nice to be consistent. Not sure how to actually make it happen though...

@gggeek gggeek modified the milestones: 5.9, 5.10 Apr 11, 2019
@gggeek gggeek modified the milestones: 5.10, 5.11 Jul 15, 2019
@gggeek
Copy link
Member

gggeek commented Nov 1, 2020

A 'hackish' but 'safe' solution could be the following:

  1. add to the bundle a custom migration with a past-date (as done with v1-to-v2 mig)
  2. this migration would be naturally run on 1st execution of the 'migrate' command (after the mig tables have been created for new installs)
  3. in the mig, look up the charset of ezcontentobject table. If utf8mb4, make the mig tables the same

We could make step 3 go further and change the mig tables encoding to the same used by eZ even for non-utf8mb4.

Ideally we could scan for more/other tables than ezcontentobject of course, but in my experience I have seen many ez dbs have erratic/inconistent charsets in their table definitions...

@gggeek
Copy link
Member

gggeek commented Nov 2, 2020

Doing some experiments...

good news: it seems that we can simply call $table->addOption('charset', 'utf8mb4); for mysql to pick up the desired charset for the table creation sql

bad news: doing so makes mysql crap itself with error SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes

Sadly, working around this limitation is not trivial:

  • it seems that the 767 bytes limitation comes into play depending on at least two inndob server parameters: innodb_large_prefix and innodb_page_size. It might also be that unsetting STRICT_TRANS_TABLES downgrades the error to a warning... (see https://dev.mysql.com/doc/refman/5.6/en/innodb-limits.html and the detailed description at https://serversforhackers.com/c/mysql-utf8-and-indexing)
  • we could just enforce the unicity the pk to the first 191 bytes, but that would theoretically allow some clashes in migration names
  • a more correct thing to do would be to set the key length based on current db configuration, but that would mean lots of brittle code
  • I still have to figure out if the doctrine dbal allows us to do any of the above...

@gggeek
Copy link
Member

gggeek commented Nov 2, 2020

further bad news: Doctrine does support specifying Index Length for Mysql but:

  • only since version 2.9 (released 4/12/2018)
  • not for primary keys

gggeek pushed a commit that referenced this issue Nov 2, 2020
@gggeek
Copy link
Member

gggeek commented Nov 2, 2020

Update: managed to overcome the index-length limitation with some too-smart-for-its-own-good code - hopefully that will work for all mysql versions... (see 15cf8e8)
Now on to the custom migration to change charset for mig tables of existing installs

@omh
Copy link
Contributor Author

omh commented Nov 2, 2020

Just chiming in from the side line here: would switching the PK to the md5 field make more sense? That way you have a short and unique field that would hit the index length issue. I'm assuming md5 == md5 of the migration file.

@gggeek
Copy link
Member

gggeek commented Nov 2, 2020

'md5 == md5(file contents)', so unfortunately that won't do - we might have many migrations with the same definition.
We could of course introduce a new PK with values being 'md5(migration name)' or 'md5(migration full path)', however I'd still want to keep around a unique index on the 'migration name' column, and we would be facing the (almost) same issue as today...

gggeek pushed a commit that referenced this issue Nov 2, 2020
@gggeek
Copy link
Member

gggeek commented Nov 2, 2020

Travis test are good now.
The table definitions are unchanged; the code will attempt first to create an index using the full 255 chars and in case of error fall back to 191.
The default charset is now utf8_mb4 and the default collation is utf8mba_general_ci. There are two Sf parameters available to override those values.
I have decided not to force charset migration for existing installations as: 1) the code doing that would be super finicky, and 2) those installations surely do work as expected at the moment, possibly by having had a dba go and manually alter table charset/collation, so no need to force changes on them.
Postgresql support has not been tested - this gives me a good opportunity to introduce it in the test matrix :-)
An issue has been opened in the Doctrine DBAL tracker in the making of this: doctrine/dbal#4404

@gggeek gggeek closed this as completed Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants