Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to ensure character set is set for database connection? #133

Closed
seb303 opened this issue Jun 19, 2021 · 4 comments · Fixed by #135
Closed

How to ensure character set is set for database connection? #133

seb303 opened this issue Jun 19, 2021 · 4 comments · Fixed by #135

Comments

@seb303
Copy link

seb303 commented Jun 19, 2021

As I understand it, this library doesn't support setting the character set automatically, so it's necessary to send a "SET NAMES ..." statement to do this. But I think it won't work to do this just once because one can't be sure when the database connection will be re-made.
e.g.

$mysql_factory = new \React\MySQL\Factory($loop);
$db = $mysql_factory->createLazyConnection(....);
$db->query("SET NAMES ?", [$charset]);
....
....
// By this point we can't be sure that it's using the same connection.
$db->query(...);

So it seems necessary to send SET NAMES every time a query is executed...
e.g.

$mysql_factory = new \React\MySQL\Factory($loop);
$db = $mysql_factory->createLazyConnection(....);
....
....
$db->query("SET NAMES ?", [$charset]);
$db->query(...);
....
....
$db->query("SET NAMES ?", [$charset]);
$db->query(...);

This seems a bit inefficient, and there's still no guarantee that the database connection won't get dropped and re-made between the SET NAMES query and the next query.

Am I missing something here? Is there a better way (other than built-in support for character set, which would be nice of course).

@clue
Copy link
Contributor

clue commented Jun 24, 2021

@seb303 Excellent question, thanks for posting!

The charset can indeed be set using the SET NAMES utf8mb4 query on an existing connection.

However, as you've rightfully pointed out, the connection may be be recreated at a later time when using a lazy connection. In this case, the charset would only apply to the original connection and would not be applied to any subsequent connections that are created automatically.

This does not occur when using the createConnection() method, so this would be how this can be worked around at the moment. This one leaves control up to the user, though this might be a bit harder to use.

I agree that we should provide better support for this for lazy connections. Afaict there are multiple options to make this work:

  • Add a new charset attribute to the connection URL to automatically apply this query every time a connection is established (à la PDO or https://www.doctrine-project.org/projects/doctrine-dbal/en/latest/reference/configuration.html#connecting-using-a-url)
  • Add some generic hook for queries to be executed after creating a connection (not sure what this would look like and how many other options one might want to set?)
  • Add a new method to change the charset and then remember this setting for subsequent connections (à la mysqli_set_charset())
  • Do not add a new method but try to parse SQL queries to automatically remember this setting for subsequent connections (not a fan of this because this seems fragile)

What do you think?

@seb303
Copy link
Author

seb303 commented Jun 24, 2021

Hello @clue

I think the problem with createConnection() is that many use situations will end up constantly setting up new connections to the database, perhaps one for every SQL query. It also creates less elegant code with the nested thens. Or it can be wrapped in a function but then this is definitely making one connection per SQL query. e.g.

function doQuery($query, $params): PromiseInterface {
    return $this->factory->createConnection($this->db_uri)->then(
        function (ConnectionInterface $connection) use ($query, $params) {
            $promise = $connection->query($query, $params);
            $connection->quit();
            return $promise;
        }
    );
}

createLazyConnection() on the other hand makes for more elegant code and saves the overhead of constantly setting up new connections to the database. But as originally stated there is the uncertainty over whether the client character set is correct for the connection being used.

As for how the library should implement support for character sets:

I think it would make sense if createLazyConnection (and indeed createConnection) had a parameter for charset, either as part of the url, or as a second argument. It's pretty much an essential part of setting up any database connection, unless one is writing code only for a very specific server environment. I'm not familiar with the actual server protocol, but I don't believe it's necessary to use a "SET NAMES" statement - there is another way to indicate the client character set / collation on connection. At least, this is what is suggested here:
https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration

It would also work to have some kind of generic hook to execute a query on connection, or fire a connect event where a "SET NAMES" query could be executed. But this is more cumbersome and I can't think of any uses beyond setting the charset.

I think having a method to change the charset like mysqli_set_charset() doesn't make sense in an async environment. In the unusual situation where the client application requires multiple charsets it would be safer to set up multiple connections.

I also do not like the idea of trying to parse "SET NAMES" statements from SQL queries to remember / automatically repeat.

@clue
Copy link
Contributor

clue commented Jul 14, 2021

I agree and it looks like the best way forward would be to accept an optional ?charset=utf8 query parameter like this:

$factory->createConnection('user:secret@localhost/database?charset=utf8mb4');

Once this is added to the createConnection() method, the createLazyConnection() method will automatically take advantage of this for every connection that is created automatically.

At the moment, this project is hard-wired to default to utf8 (not utf8mb4, so it only supports of subset of actual UTF-8!) as per

private $charsetNumber = 0x21;
. We may use the given charset query parameter to automatically build a SET NAMES utf8mb4 query as per https://dev.mysql.com/doc/refman/5.6/en/charset-connection.html. In the future, we may want to map the given charset query parameter to a charset/collation ID as per https://dev.mysql.com/doc/internals/en/character-set.html#packet-Protocol::CharacterSet. On top of this, we should add some tests with some help of https://dev.mysql.com/doc/refman/5.6/en/charset-introducer.html, https://dev.mysql.com/doc/refman/8.0/en/charset-binary-set.html and https://dev.mysql.com/doc/refman/5.6/en/charset-literal.html.

PRs are very much welcome 👍

(As much as I'd love to work on this myself, there are currently no immediate plans to build this from my end (no demand at the moment and more important outstanding issues currently). If you need this for a commercial project and you want to help sponsor this feature, feel free to reach out and I'm happy to take a look.)

@seb303
Copy link
Author

seb303 commented Jul 15, 2021

Unfortunately my programming skills are not up to the job, otherwise I would have a go at this myself. (I'm very much a newcomer to OO and Async PHP.)

But I did a quick experiment of editing the value of private $charsetNumber in AuthenticateCommand.php and found that the charset ID is not consistent from database to database. In my database for example, utf8mb4 has ID 45 (not 255 like the MySQL docs):
utf8mb4 | utf8mb4_general_ci | 45

So it would be necessary to query INFORMATION_SCHEMA.COLLATIONS in order to map a given charset string to its ID. More straightforward therefore (and probably less overhead) to just send a SET NAMES query on connection.

Furthermore, it seems you cannot really say for sure that this project is hard-coded to default to utf8 since it's not guaranteed that charset id 0x21 will always be utf8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants