doltgres 0.12.0 #192458

Merged
merged 2 commits into master from bump-doltgres-0.12.0 on Oct 1, 2024

Conversation

BrewTestBot
Member

Created by brew bump


Created with brew bump-formula-pr.

release notes
# Merged PRs

doltgresql

  • 780: Don't count failure to send standby messages as terminal failures during replication
  • 770: Tests for dolt_schemas
    Dolt PR: Fix dolt_schemas for doltgres dolthub/dolt#8401
  • 768: Add workbench tests for docs
    Dolt PR: Add pointer for dolt_docs schema so it can be replaced by doltgres dolthub/dolt#8398
  • 766: More workbench tests for schema, tags, views
  • 763: Wrap column default expressions in ParenExpr to match MySQL's requirements
    PostgreSQL's syntax does not require column default expressions to be wrapped in parens, but MySQL's does, so when we translate the column default value expressions to the vitess AST, we need to wrap them in parens so that they execute in GMS without triggering an error.
    Fixes Column default expressions should not require parentheses dolthub/doltgresql#751
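    A minimal sketch of the translation described above (the table and column names are hypothetical):

    ```sql
    -- PostgreSQL accepts an expression default without parentheses:
    CREATE TABLE events (id INT, created_at TIMESTAMP DEFAULT now());
    -- MySQL (and therefore GMS) only accepts expression defaults wrapped in parens,
    -- so the translated default behaves as if it had been written:
    CREATE TABLE events (id INT, created_at TIMESTAMP DEFAULT (now()));
    ```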
  • 762: Add support for ALTER TABLE ADD FOREIGN KEY
    Helps with data imports, since it's common to add FKs at the end of an import script.
    The regression below seems to come from the FK being added, but we don't yet support removing an FK, so the drop table call now fails.
    Related to ALTER TABLE support dolthub/doltgresql#724
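    A sketch of the import pattern this helps with (the tables are hypothetical):

    ```sql
    -- Create tables and load data first, then add constraints at the end of the script:
    CREATE TABLE parent (id INT PRIMARY KEY);
    CREATE TABLE child (id INT PRIMARY KEY, parent_id INT);
    -- ... bulk INSERT / COPY statements ...
    ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent (id);
    ```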
  • 761: Implement information_schema views and tables, add as schema
    GMS: Support information_schema views/tables hooks for doltgres dolthub/go-mysql-server#2678
    Dolt: Return information_schema schema for doltgres dolthub/dolt#8391
  • 752: More unskipped tests
    Better test harness that allows unskipping about half of the dolt merge tests (after various bug fixes in Dolt, already in main)
  • 747: Smoke tests for dolt diff functions/tables
    Dolt PR: Fix diff related table functions for doltgres dolthub/dolt#8381
  • 742: Fixed panic for some index joins
    Fixes:
    • Simple index join broken dolthub/doltgresql#730
      For some index joins, the analyzer will create a specific type of plan that creates MySQL ranges rather than Doltgres ranges. It appears as though there may be divergent branches for the join logic, so I attempted to look for the source of the divergence; however, I came up short.
      For now, rather than chasing this down and delaying a PR (since Tim needs this fixed asap), we can pass the lookup to the internal Dolt table. This will return incorrect results in some situations, but it won't panic for now, so I'll follow up with a better search through GMS at a later date to merge the index join paths.
  • 741: Support automatic transaction management with COPY FROM STDIN
  • 740: Add workbench tests for COPY FROM
  • 739: Tests for unique indexes
  • 736: Expanding tests for the steps in the Doltgres Getting Started Guide
    Added more test coverage over the Doltgres Getting Started Guide and pulled them out into their own file.
  • 733: Initial Authentication Protocol
    This implements the initial portion of the authentication protocol.
    Postgres Reference Documentation:
    • https://www.postgresql.org/docs/15/protocol-flow.html
    • https://www.postgresql.org/docs/15/sasl-authentication.html
      Primarily, this implements SASL SCRAM-SHA-256, which appears to be the primary form of authentication used in modern Postgres. It has been built by following the RFC specification:
    • https://datatracker.ietf.org/doc/html/rfc5802
      There are no tests since the implementation is incomplete. It cannot truly be tested until we have passwords and such that it can verify against (as the results must be sent back to the client for verification, so it can't be faked); however, I have tested it up through what has been written, and what exists works as it should.
      Surprisingly, there aren't any libraries that we could really leverage for this. Most SASL libraries don't implement SCRAM. The closest was the following:
    • https://codeberg.org/mellium/sasl
      However, I couldn't really find a way to integrate it using raw messages and the eventual Doltgres user backend, so this is all custom-written using the RFC as a guideline (along with capturing packets using the regression capture tool to ensure that Postgres follows the RFC's implementation). For now, the logic is hidden behind a bogus parameter check so that the static analyzer is happy, and the next step is to make a mock in-memory database of users and passwords so I can fully test the entire workflow.
  • 732: Initial support for ALTER TABLE, starting with adding a primary key
    Adding initial support for converting ALTER TABLE statements. This first iteration only supports ALTER TABLE t ADD PRIMARY KEY (...);.
    Related to ALTER TABLE support dolthub/doltgresql#724
  • 731: fix return limited row issue
  • 727: Enabled use mydb/main without quoting and implemented the IF function
    Most of this PR is changes to the doltgres engine testing harness to make it pass more tests.
    • Correctly validates most DML statements
    • Matches commit hashes more reliably
    • Handles more types
      Also includes parser support for unquoted db identifiers with a / in a USE statement, and implements the IF function (apparently a CockroachDB extension).
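    A minimal sketch of the two additions (the database name is hypothetical):

    ```sql
    -- Unquoted database/branch identifier containing a slash:
    USE mydb/main;
    -- The IF function mentioned above:
    SELECT IF(2 > 1, 'yes', 'no');
    ```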
  • 722: Feature: COPY support for HEADER option
    Adds support for using the HEADER option in COPY statements.
    In this first iteration, we only support specifying HEADER or HEADER true. This form causes the tabular data loader and CSV data loader to skip over the initial header line in import data. In addition to this form, COPY also supports a HEADER MATCH option, where the server asserts that the columns in the import data exactly match the name and the order of the columns in the destination table.
    (Note: this PR is based off of Feature: COPY FROM STDIN support for CSV files dolthub/doltgresql#700 to help split up the changes to make them easier to review)
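    A sketch of the supported form, plus the HEADER MATCH variant described above (the table is hypothetical):

    ```sql
    -- Skip the first (header) line of the CSV input; the form supported in this iteration:
    COPY employees FROM STDIN WITH (FORMAT csv, HEADER true);
    -- Postgres additionally accepts HEADER MATCH, which asserts that the header's
    -- column names and order match the destination table:
    COPY employees FROM STDIN WITH (FORMAT csv, HEADER match);
    ```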
  • 717: Moved enginetests to their own testing CI workflow, don't use -race
    This works around the problem described here:
    race analysis fails for several tests on Amazon Linux dolthub/doltgresql#718
  • 711: Fix Regression Tests Workflow
  • 707: use pgproto3 for handling server connection messages
  • 705: Workbench tests for log, merge, table details
  • 700: Feature: COPY FROM STDIN support for CSV files
    Support for loading data via COPY FROM STDIN using CSV data.
  • 699: Unskip some engine tests
    dolt_reset now works correctly
  • 694: Tests for version control operations on new schemas / tables in new schemas
    Code changes are all in Dolt:
    support for schemas in various version control operations dolthub/dolt#8343
  • 691: Convert ANALYZE statements
    Adds support for converting Postgres' ANALYZE statement for a single table and running it through the GMS SQL engine. There are still lots of unsupported options in Postgres' ANALYZE statement, but this change allows Doltgres to process the simplest form – where a single table is being analyzed.
    Since it's common to run ANALYZE at the end of data load scripts (example), this change is intended to make it easier to load dumps into Doltgres.
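    The simplest form that this change handles (the table name is hypothetical):

    ```sql
    ANALYZE employees;
    -- Postgres' richer forms (e.g. ANALYZE VERBOSE, per-column lists) remain
    -- unsupported, per the notes above.
    ```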
  • 689: COPY FROM STDIN support
    Adds support for COPY ... FROM STDIN. When copying from STDIN, the COPY FROM statement starts a multi-message flow between the server and client – the client will send COPY DATA messages until all the data has been sent, and then send a COPY DONE (or COPY FAIL) message to finalize the transfer and let the server process more queries.
    This PR adds a new TabularDataLoader type, with the idea that we can create a DataLoader interface for that when we extend this to add support for loading data from CSV data files, too.
    This PR also depends on a GMS change to allow us to create a new sql.Context instance: Add sql.ContextProvider interface dolthub/go-mysql-server#2652
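    A sketch of the flow as seen from psql (the table is hypothetical): the rows are streamed to the server as COPY DATA messages, and the terminating \. line triggers COPY DONE.

    ```sql
    COPY employees (id, name) FROM STDIN;
    1	Alice
    2	Bob
    \.
    ```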
  • 686: Added COPY FROM
    This was pulled from:
  • 681: use ParameterObjectIDs in Parse message for binding variable types
    Note: the prepared tests don't utilize this field in the Parse message, so it needs to extract the binding value types from the analyzed plan of the query.
  • 680: Prevent premature decay for IN tuple
    This removes the premature decay of IN, since it's only necessary for index filters. To complement this, I've implemented SplitConjunction and SplitDisjunction so that they're aware of Doltgres expression types. The GMS versions will see *pgexprs.GMSCast and do nothing, since we actually care about the child, but GMS is unaware of that.
  • 677: use gms engine instead of mysql handler
  • 675: Replace AllDatabasesWithNames so that revision dbs show up in information_schema tables
    GMS PR: Fix revision databases not showing up for schema databases dolthub/go-mysql-server#2645
  • 672: Added regression tests
    This adds a ton of tests, taken from the Postgres Regression Test suite, and made into Go tests using the new tool in testing/go/regression/reader. There are errors in some of the tests, but it at least gives a very rough idea of what works and what doesn't. The plan is to put this in a nightly job that displays results alongside the correctness tests, and it will also be able to tell us which tests fail that previously passed.
    In addition, this also implements COPY FROM, since it's used by the regression tests for loading in the test data.
  • 667: Several bug fixes and test fixes
    Improvements to testing harnesses to allow more tests to pass.
    Relies on Fixed table resolution for reset dolthub/dolt#8313 and [no-release-notes] Testing harness changes dolthub/go-mysql-server#2647
  • 663: Fix dolt system variables
    Dolt PR: Export DoltSystemVariables var so that it can be used by doltgres dolthub/dolt#8306
  • 657: Start on workbench stability tests
  • 650: fix string value binding for date type and add more functions for DBeaver
  • 647: Fix node tests for dolt procedures
  • 644: Add tests for branch interactions with schema
  • 641: Release v0.11.1
    Created by the Release workflow to update DoltgreSQL's version
  • 634: support schema qualified create table with serial type
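    For example (the schema and table names are hypothetical):

    ```sql
    CREATE TABLE analytics.page_views (id SERIAL PRIMARY KEY, url TEXT);
    ```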
  • 561: Add support for Postgres-compatible indexes
    This implements a proof-of-concept for true Postgres indexes.

    Current Implementation

    The current index implementation relies on the methods used in GMS and Dolt, which are inherently MySQL-based. There are a lot of layers to make indexes as efficient as possible (while also allowing for different integrators to use their own implementations), but I'm going to focus on the inner "core" logic. There are two parts to the core: storage and iteration.

    Storage

    At a high level, we've hardcoded how values should be stored in an index by their observed behavior. NULL is treated as being smaller than the smallest possible value, and is always stored first. Integers are ordered from negative to positive. Strings are in the order defined by their collation, which is usually in alphabetical order with some casing differences. In Dolt the ordering is concrete, and in GMS this order is assumed for all indexable types.

    Iteration

    In GMS, we take an expression (filter or join) and create a range (or multiple ranges) that expresses the values that the index should return. That range contains the lower and upper bounds, and those are based on the value given. For example, the filter column >= 6 (where column is an integer) uses the value of 6 to construct a range of [6, ∞). This range is then passed to Dolt, which uses the inclusive 6 as its starting point, and knows to iterate over the remaining index data until the end. If given some value that uses a different type than the indexed column's type, then that value is automatically cast to the index's type.
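    In SQL terms, the example above corresponds to something like (the table and index names are hypothetical):

    ```sql
    CREATE INDEX idx_t_col ON t (col);
    -- The filter below is turned into the range [6, ∞) and handed to Dolt:
    SELECT * FROM t WHERE col >= 6;
    ```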

    Postgres vs. MySQL

    With an understanding of how storage and iteration work for MySQL (GMS and Dolt), let's now look at some key differences with indexes in Postgres.
    • Storage Control: The layout of values is defined by a function. For the default types, Postgres ships with the function pre-defined. It is assumed that this function lays out values such that distinct (not equal using the = operator) values are not equivalent.
    • Iteration via Operators: Specifically for b-trees, iterating over the storage is handled by the comparison operators. As long as the 5 main comparison operators exist (=, >, >=, <, <=), then any value may be used to iterate over storage. It is assumed that these operators map to some logical form of continuity, but that is not strictly required (the Postgres analyzer can actually catch some forms of discontinuity and apply additional filters, pretty cool actually). For example, it is possible that < and > could return true for the same input, but again it is assumed that this is not the case.
    • Null Ordering: Nulls can be viewed as either the smallest possible value or the largest possible value, changing where they're positioned in the index.
    • Null Distinction: For UNIQUE indexes, this controls whether we permit multiple NULL values. If NULLs are distinct, then multiple rows may use NULL. In MySQL, NULLs are always considered distinct.
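    The last two points correspond to Postgres-side syntax like the following (illustrative only, not a claim about current Doltgres support; NULLS NOT DISTINCT requires Postgres 15+):

    ```sql
    -- Null ordering is chosen per index column:
    CREATE INDEX idx_nulls_first ON t (a NULLS FIRST);
    CREATE INDEX idx_nulls_last ON t (a NULLS LAST);
    -- Null distinction for UNIQUE indexes: with NULLS NOT DISTINCT, at most one NULL is allowed.
    CREATE UNIQUE INDEX uq_a ON t (a) NULLS NOT DISTINCT;
    ```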

    Indexed Joins

    This is originally what kickstarted this small project. Now that I've covered how the current implementation works, and a few ways how Postgres differs, it should be much easier to show how proper indexed joins would not work in the current implementation. At their simplest, an index join has a form like SELECT * FROM t1 JOIN t2 ON t1.col = t2.col;. It may not be obvious, but this touches at least 2 of the 4 differences that I mentioned in the previous section.
    • Iteration via Operators: Postgres doesn't need to be able to convert between two types; it just needs the comparison operators. For some types, this would actually lead to different results. For example, SELECT 0.99999994::float4 = 0.99999997::float8; returns false, as there is a defined = operator for the two types. SELECT 0.99999994::float4 = 0.99999997::float8::float4; returns true, as casting from float8 to float4 loses some precision (see the example after this list). In this exact case, as long as we keep the filter expression then it's okay, but it can lead to data corruption otherwise. There are many more examples than this, so don't take this as the only case, but it's an easier one to understand (compared to a more "realistic" example using reg... types). If our index framework is built on casting (like the current GMS implementation), then we will always have cases where we are tracking down bugs due to invalid behavior.
    • Null Ordering: Differences in null ordering between indexes aren't handled at all. If NULL values are sorted differently between indexes, then that must be taken into account by some analyzer step. The current implementation does not do this, as it does not need to.
    • Null Distinction: I'm actually not sure how this is supposed to be handled in Postgres in this case (haven't done too much research on it), so this could be something else that the analyzer needs to take into account, but it hasn't been verified.
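    The float example from the first bullet, spelled out:

    ```sql
    -- Compared via the dedicated float4 = float8 operator, with no cast involved:
    SELECT 0.99999994::float4 = 0.99999997::float8;          -- false
    -- Casting the float8 down to float4 first loses precision, so the values collide:
    SELECT 0.99999994::float4 = 0.99999997::float8::float4;  -- true
    ```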

    The Simplest Solution

    Right now on main, we are implementing indexes by casting everything to the column type for filters. This "works" in that we are able to get some tests and performance metrics working, but that's only for items that have a cast. As mentioned earlier, this casting logic is not correct, but our limited testing at least works with it. Once we leave the small bubble, we start to see all of the changes that would have to be made in order to get Postgres indexes "working", such as special casing types and expressions to work under the assumptions made in GMS and Dolt, and still it would be incorrect. Some of the special casing would even be hard to figure out in the first place, like how the reg... types that were mentioned earlier should interact with other types.
    I propose, with this PR (and the attached Dolt PR), that the simplest solution is to just do what Postgres is doing. Postgres defines functions that control the layout of values, so we can implement those functions and simply pass them down to Dolt's storage layer, which uses those for ordering rather than the hardcoded versions. This PR doesn't yet implement this part, but it is what we are already doing with the introduction of sql.ExtendedType, which uses the comparisons defined on the type to control the layout. We just have to change which function is being used, which is relatively simple. This PR, instead, focuses on the filter and retrieval part (since it's the more involved portion).
    Postgres simply passes the relevant operator functions down to its storage layer, and runs a tree search (using those operators on its internal b-tree) to find where the storage iterator should start. It then iterates until those operators are no longer fulfilled. This completely sidesteps the casting part, and focuses strictly on the comparison operators, which is exactly what we want. And that's all this PR (and the Dolt one) does in essence. It's a bit more complicated as I'm still trying to take advantage of as much infrastructure as possible, but at its core it's passing the operators down to Dolt's storage layer to find a start (and stop) point. By passing down functions, this not only gives us full support for everything, but it even allows us to handle things like custom types without any additional code.
    There are still additional things that need to be done, such as covering vs. non-covering indexes, composite indexes, etc., but those are mostly Doltgres-side changes to match how Postgres behaves. Also, code layout is not final (everything is in one file), comments are missing, names are bad, things are stuffed into special structs rather than new fields, and there are no GMS changes yet. Look not at the code, but at the intention of the code, as none of this is final or production-quality.

    Dolt PR

    References

Closed Issues

  • 751: Column default expressions should not require parentheses
  • 730: Simple index join broken
  • 725: Panic creating unique index
  • 574: incorrect format for numeric type result
  • 554: Column alias from subquery cannot be referenced in join
  • 648: Cannot create table after checking out a new branch
  • 88: Doltgres version prints Dolt version output
  • 578: Cannot Connect To Doltgres with Slonik
  • 47: Connecting to doltgres from R Postgres client fails because of no "SET datestyle"

@github-actions github-actions bot added the go (Go use is a significant feature of the PR or issue) and bump-formula-pr (PR was created using `brew bump-formula-pr`) labels on Oct 1, 2024
Contributor

github-actions bot commented Oct 1, 2024

🤖 An automated task has requested bottles to be published to this PR.

@github-actions github-actions bot added the CI-published-bottle-commits (The commits for the built bottles have been pushed to the PR branch.) label on Oct 1, 2024
@BrewTestBot BrewTestBot added this pull request to the merge queue Oct 1, 2024
Merged via the queue into master with commit 86f5e70 Oct 1, 2024
15 checks passed
@BrewTestBot BrewTestBot deleted the bump-doltgres-0.12.0 branch October 1, 2024 04:05