doltgres 0.12.0 #192458

Merged
merged 2 commits into master from bump-doltgres-0.12.0 on Oct 1, 2024

Conversation

BrewTestBot
Member

Created by brew bump


Created with brew bump-formula-pr.

release notes
# Merged PRs

doltgresql

  • 780: Don't count failure to send standby messages as terminal failures during replication
  • 770: Tests for dolt_schemas
    Dolt PR: Fix dolt_schemas for doltgres dolthub/dolt#8401
  • 768: Add workbench tests for docs
    Dolt PR: Add pointer for dolt_docs schema so it can be replaced by doltgres dolthub/dolt#8398
  • 766: More workbench tests for schema, tags, views
  • 763: Wrap column default expressions in ParenExpr to match MySQL's requirements
    PostgreSQL's syntax does not require column default expressions to be wrapped in parens, but MySQL's does, so when we translate the column default value expressions to the vitess AST, we need to wrap them in parens so that they execute in GMS without triggering an error.
    Fixes Column default expressions should not require parentheses dolthub/doltgresql#751
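    A minimal sketch of the translation described above (the table and column names are hypothetical):

    ```sql
    -- PostgreSQL accepts an expression default without parentheses:
    CREATE TABLE events (id INT, created_at TIMESTAMP DEFAULT now());
    -- MySQL (and therefore GMS) only accepts expression defaults wrapped in parens,
    -- so the translated default behaves as if it had been written:
    CREATE TABLE events (id INT, created_at TIMESTAMP DEFAULT (now()));
    ```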
  • 762: Add support for ALTER TABLE ADD FOREIGN KEY
    Helps with data imports, since it's common to add FKs at the end of an import script.
    The regression below seems to come from the FK being added, but we don't yet support removing an FK, so the drop table call now fails.
    Related to ALTER TABLE support dolthub/doltgresql#724
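    A sketch of the import pattern this helps with (the tables are hypothetical):

    ```sql
    -- Create tables and load data first, then add constraints at the end of the script:
    CREATE TABLE parent (id INT PRIMARY KEY);
    CREATE TABLE child (id INT PRIMARY KEY, parent_id INT);
    -- ... bulk INSERT / COPY statements ...
    ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent (id);
    ```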
  • 761: Implement information_schema views and tables, add as schema
    GMS: Support information_schema views/tables hooks for doltgres dolthub/go-mysql-server#2678
    Dolt: Return information_schema schema for doltgres dolthub/dolt#8391
  • 752: More unskipped tests
    Better test harness that allows unskipping about half of the dolt merge tests (after various bug fixes in Dolt, already in main)
  • 747: Smoke tests for dolt diff functions/tables
    Dolt PR: Fix diff related table functions for doltgres dolthub/dolt#8381
  • 742: Fixed panic for some index joins
    Fixes:
    • Simple index join broken dolthub/doltgresql#730
      For some index joins, the analyzer will create a specific type of plan that creates MySQL ranges rather than Doltgres ranges. It appears as though there may be divergent branches for the join logic, so I attempted to look for the source of the divergence; however, I came up short.
      For now, rather than chasing this down and delaying a PR (since Tim needs this fixed asap), we can pass the lookup to the internal Dolt table. This will return incorrect results in some situations, but it won't panic for now, so I'll follow up with a better search through GMS at a later date to merge the index join paths.
  • 741: Support automatic transaction management with COPY FROM STDIN
  • 740: Add workbench tests for COPY FROM
  • 739: Tests for unique indexes
  • 736: Expanding tests for the steps in the Doltgres Getting Started Guide
    Added more test coverage over the Doltgres Getting Started Guide and pulled them out into their own file.
  • 733: Initial Authentication Protocol
    This implements the initial portion of the authentication protocol.
    Postgres Reference Documentation:
    • https://www.postgresql.org/docs/15/protocol-flow.html
    • https://www.postgresql.org/docs/15/sasl-authentication.html
      Primarily, this implements SASL SCRAM-SHA-256, which appears to be the primary form of authentication used in modern Postgres. It has been built by following the RFC specification:
    • https://datatracker.ietf.org/doc/html/rfc5802
      There are no tests since the implementation is incomplete. It cannot truly be tested until we have passwords and such that it can verify against (as the results must be sent back to the client for verification, so it can't be faked); however, I have tested it up through what has been written, and what exists works as it should.
      Surprisingly, there aren't any libraries that we could really leverage for this. Most SASL libraries don't implement SCRAM. The closest was the following:
    • https://codeberg.org/mellium/sasl
      However, I couldn't really find a way to integrate it using raw messages and the eventual Doltgres user backend, so this is all custom-written using the RFC as a guideline (along with capturing packets using the regression capture tool to ensure that Postgres follows the RFC's implementation). For now, the logic is hidden behind a bogus parameter check so that the static analyzer is happy, and the next step is to make a mock in-memory database of users and passwords so I can fully test the entire workflow.
  • 732: Initial support for ALTER TABLE, starting with adding a primary key
    Adding initial support for converting ALTER TABLE statements. This first iteration only supports ALTER TABLE t ADD PRIMARY KEY (...);.
    Related to ALTER TABLE support dolthub/doltgresql#724
  • 731: fix return limited row issue
  • 727: Enabled use mydb/main without quoting and implemented the IF function
    Most of this PR is changes to the doltgres engine testing harness to make it pass more tests.
    • Correctly validates most DML statements
    • Matches commit hashes more reliably
    • Handles more types
      Also includes parser support for unquoted db identifiers with a / in a USE statement, and implements the IF function (apparently a CockroachDB extension).
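    A minimal sketch of the two additions (the database name is hypothetical):

    ```sql
    -- Unquoted database/branch identifier containing a slash:
    USE mydb/main;
    -- The IF function mentioned above:
    SELECT IF(2 > 1, 'yes', 'no');
    ```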
  • 722: Feature: COPY support for HEADER option
    Adds support for using the HEADER option in COPY statements.
    In this first iteration, we only support specifying HEADER or HEADER true. This form causes the tabular data loader and CSV data loader to skip over the initial header line in import data. In addition to this form, COPY also supports a HEADER MATCH option, where the server asserts that the columns in the import data exactly match the name and the order of the columns in the destination table.
    (Note: this PR is based off of Feature: COPY FROM STDIN support for CSV files dolthub/doltgresql#700 to help split up the changes to make them easier to review)
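    A sketch of the supported form, plus the HEADER MATCH variant described above (the table is hypothetical):

    ```sql
    -- Skip the first (header) line of the CSV input; the form supported in this iteration:
    COPY employees FROM STDIN WITH (FORMAT csv, HEADER true);
    -- Postgres additionally accepts HEADER MATCH, which asserts that the header's
    -- column names and order match the destination table:
    COPY employees FROM STDIN WITH (FORMAT csv, HEADER match);
    ```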
  • 717: Moved enginetests to their own testing CI workflow, don't use -race
    This works around the problem described here:
    race analysis fails for several tests on Amazon Linux dolthub/doltgresql#718
  • 711: Fix Regression Tests Workflow
  • 707: use pgproto3 for handling server connection messages
  • 705: Workbench tests for log, merge, table details
  • 700: Feature: COPY FROM STDIN support for CSV files
    Support for loading data via COPY FROM STDIN using CSV data.
  • 699: Unskip some engine tests
    dolt_reset now works correctly
  • 694: Tests for version control operations on new schemas / tables in new schemas
    Code changes are all in Dolt:
    support for schemas in various version control operations dolthub/dolt#8343
  • 691: Convert ANALYZE statements
    Adds support for converting Postgres' ANALYZE statement for a single table and running it through the GMS SQL engine. There are still lots of unsupported options in Postgres' ANALYZE statement, but this change allows Doltgres to process the simplest form – where a single table is being analyzed.
    Since it's common to run ANALYZE at the end of data load scripts (example), this change is intended to make it easier to load dumps into Doltgres.
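    The simplest form that this change handles (the table name is hypothetical):

    ```sql
    ANALYZE employees;
    -- Postgres' richer forms (e.g. ANALYZE VERBOSE, per-column lists) remain
    -- unsupported, per the notes above.
    ```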
  • 689: COPY FROM STDIN support
    Adds support for COPY ... FROM STDIN. When copying from STDIN, the COPY FROM statement starts a multi-message flow between the server and client – the client will send COPY DATA messages until all the data has been sent, and then send a COPY DONE (or COPY FAIL) message to finalize the transfer and let the server process more queries.
    This PR adds a new TabularDataLoader type, with the idea that we can create a DataLoader interface for that when we extend this to add support for loading data from CSV data files, too.
    This PR also depends on a GMS change to allow us to create a new sql.Context instance: Add sql.ContextProvider interface dolthub/go-mysql-server#2652
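    A sketch of the flow as seen from psql (the table is hypothetical): the rows are streamed to the server as COPY DATA messages, and the terminating \. line triggers COPY DONE.

    ```sql
    COPY employees (id, name) FROM STDIN;
    1	Alice
    2	Bob
    \.
    ```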
  • 686: Added COPY FROM
    This was pulled from:
  • 681: use ParameterObjectIDs in Parse message for binding variable types
    Note: the prepared tests don't utilize this field in the Parse message, so it needs to extract the binding value types from the analyzed plan of the query.
  • 680: Prevent premature decay for IN tuple
    This removes the premature decay of IN, since it's only necessary for index filters. To complement this, I've implemented SplitConjunction and SplitDisjunction so that they're aware of Doltgres expression types. The GMS versions will see *pgexprs.GMSCast and do nothing, since we actually care about the child, but GMS is unaware of that.
  • 677: use gms engine instead of mysql handler
  • 675: Replace AllDatabasesWithNames so that revision dbs show up in information_schema tables
    GMS PR: Fix revision databases not showing up for schema databases dolthub/go-mysql-server#2645
  • 672: Added regression tests
    This adds a ton of tests, taken from the Postgres Regression Test suite, and made into Go tests using the new tool in testing/go/regression/reader. There are errors in some of the tests, but it at least gives a very rough idea of what works and what doesn't. The plan is to put this in a nightly job that displays results alongside the correctness tests, and it will also be able to tell us which tests fail that previously passed.
    In addition, this also implements COPY FROM, since it's used by the regression tests for loading in the test data.
  • 667: Several bug fixes and test fixes
    Improvements to testing harnesses to allow more tests to pass.
    Relies on Fixed table resolution for reset dolthub/dolt#8313 and [no-release-notes] Testing harness changes dolthub/go-mysql-server#2647
  • 663: Fix dolt system variables
    Dolt PR: Export DoltSystemVariables var so that it can be used by doltgres dolthub/dolt#8306
  • 657: Start on workbench stability tests
  • 650: fix string value binding for date type and add more functions for DBeaver
  • 647: Fix node tests for dolt procedures
  • 644: Add tests for branch interactions with schema
  • 641: Release v0.11.1
    Created by the Release workflow to update DoltgreSQL's version
  • 634: support schema qualified create table with serial type
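    For example (the schema and table names are hypothetical):

    ```sql
    CREATE TABLE analytics.page_views (id SERIAL PRIMARY KEY, url TEXT);
    ```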
  • 561: Add support for Postgres-compatible indexes
    This implements a proof-of-concept for true Postgres indexes.

    Current Implementation

    The current index implementation relies on the methods used in GMS and Dolt, which are inherently MySQL-based. There are a lot of layers to make indexes as efficient as possible (while also allowing for different integrators to use their own implementations), but I'm going to focus on the inner "core" logic. There are two parts to the core: storage and iteration.

    Storage

    At a high level, we've hardcoded how values should be stored in an index by their observed behavior. NULL is treated as being smaller than the smallest possible value, and is always stored first. Integers are ordered from negative to positive. Strings are in the order defined by their collation, which is usually in alphabetical order with some casing differences. In Dolt the ordering is concrete, and in GMS this order is assumed for all indexable types.

    Iteration

    In GMS, we take an expression (filter or join) and create a range (or multiple ranges) that expresses the values that the index should return. That range contains the lower and upper bounds, and those are based on the value given. For example, the filter column >= 6 (where column is an integer) uses the value of 6 to construct a range of [6, ∞). This range is then passed to Dolt, which uses the inclusive 6 as its starting point, and knows to iterate over the remaining index data until the end. If given some value that uses a different type than the indexed column's type, then that value is automatically cast to the index's type.
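    In SQL terms, the example above corresponds to something like (the table and index names are hypothetical):

    ```sql
    CREATE INDEX idx_t_col ON t (col);
    -- The filter below is turned into the range [6, ∞) and handed to Dolt:
    SELECT * FROM t WHERE col >= 6;
    ```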

    Postgres vs. MySQL

    With an understanding of how storage and iteration work for MySQL (GMS and Dolt), let's now look at some key differences with indexes in Postgres.
    • Storage Control: The layout of values is defined by a function. For the default types, Postgres ships with the function pre-defined. It is assumed that this function lays out values such that distinct (not equal using the = operator) values are not equivalent.
    • Iteration via Operators: Specifically for b-trees, iterating over the storage is handled by the comparison operators. As long as the 5 main comparison operators exist (=, >, >=, <, <=), then any value may be used to iterate over storage. It is assumed that these operators map to some logical form of continuity, but that is not strictly required (the Postgres analyzer can actually catch some forms of discontinuity and apply additional filters, pretty cool actually). For example, it is possible that < and > could return true for the same input, but again it is assumed that this is not the case.
    • Null Ordering: Nulls can be viewed as either the smallest possible value or the largest possible value, changing where they're positioned in the index.
    • Null Distinction: For UNIQUE indexes, this controls whether we permit multiple NULL values. If NULLs are distinct, then multiple rows may use NULL. In MySQL, NULLs are always considered distinct.
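    The last two points correspond to Postgres-side syntax like the following (illustrative only, not a claim about current Doltgres support; NULLS NOT DISTINCT requires Postgres 15+):

    ```sql
    -- Null ordering is chosen per index column:
    CREATE INDEX idx_nulls_first ON t (a NULLS FIRST);
    CREATE INDEX idx_nulls_last ON t (a NULLS LAST);
    -- Null distinction for UNIQUE indexes: with NULLS NOT DISTINCT, at most one NULL is allowed.
    CREATE UNIQUE INDEX uq_a ON t (a) NULLS NOT DISTINCT;
    ```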

    Indexed Joins

    This is originally what kickstarted this small project. Now that I've covered how the current implementation works, and a few ways how Postgres differs, it should be much easier to show how proper indexed joins would not work in the current implementation. At their simplest, an index join has a form like SELECT * FROM t1 JOIN t2 ON t1.col = t2.col;. It may not be obvious, but this touches at least 2 of the 4 differences that I mentioned in the previous section.
    • Iteration via Operators: Postgres doesn't need to be able to convert between two types; it just needs the comparison operators. For some types, this would actually lead to different results. For example, SELECT 0.99999994::float4 = 0.99999997::float8; returns false, as there is a defined = operator for the two types. SELECT 0.99999994::float4 = 0.99999997::float8::float4; returns true, as casting from float8 to float4 loses some precision (see the example after this list). In this exact case, as long as we keep the filter expression then it's okay, but it can lead to data corruption otherwise. There are many more examples than this, so don't take this as the only case, but it's an easier one to understand (compared to a more "realistic" example using reg... types). If our index framework is built on casting (like the current GMS implementation), then we will always have cases where we are tracking down bugs due to invalid behavior.
    • Null Ordering: Differences in null ordering between indexes aren't handled at all. If NULL values are sorted differently between indexes, then that must be taken into account by some analyzer step. The current implementation does not do this, as it does not need to.
    • Null Distinction: I'm actually not sure how this is supposed to be handled in Postgres in this case (haven't done too much research on it), so this could be something else that the analyzer needs to take into account, but it hasn't been verified.
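    The float example from the first bullet, spelled out:

    ```sql
    -- Compared via the dedicated float4 = float8 operator, with no cast involved:
    SELECT 0.99999994::float4 = 0.99999997::float8;          -- false
    -- Casting the float8 down to float4 first loses precision, so the values collide:
    SELECT 0.99999994::float4 = 0.99999997::float8::float4;  -- true
    ```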

    The Simplest Solution

    Right now on main, we are implementing indexes by casting everything to the column type for filters. This "works" in that we are able to get some tests and performance metrics working, but that's only for items that have a cast. As mentioned earlier, this casting logic is not correct, but our limited testing at least works with it. Once we leave the small bubble, we start to see all of the changes that would have to be made in order to get Postgres indexes "working", such as special casing types and expressions to work under the assumptions made in GMS and Dolt, and still it would be incorrect. Some of the special casing would even be hard to figure out in the first place, like how the reg... types that were mentioned earlier should interact with other types.
    I propose, with this PR (and the attached Dolt PR), that the simplest solution is to just do what Postgres is doing. Postgres defines functions that control the layout of values, so we can implement those functions and simply pass them down to Dolt's storage layer, which uses those for ordering rather than the hardcoded versions. This PR doesn't yet implement this part, but it is what we are already doing with the introduction of sql.ExtendedType, which uses the comparisons defined on the type to control the layout. We just have to change which function is being used, which is relatively simple. This PR, instead, focuses on the filter and retrieval part (since it's the more involved portion).
    Postgres simply passes the relevant operator functions down to its storage layer, and runs a tree search (using those operators on its internal b-tree) to find where the storage iterator should start. It then iterates until those operators are no longer fulfilled. This completely sidesteps the casting part, and focuses strictly on the comparison operators, which is exactly what we want. And that's all this PR (and the Dolt one) does in essence. It's a bit more complicated as I'm still trying to take advantage of as much infrastructure as possible, but at its core it's passing the operators down to Dolt's storage layer to find a start (and stop) point. By passing down functions, this not only gives us full support for everything, but it even allows us to handle things like custom types without any additional code.
    There are still additional things that need to be done, such as covering vs. non-covering indexes, composite indexes, etc., but those are mostly Doltgres-side changes to match how Postgres behaves. Also, code layout is not final (everything is in one file), comments are missing, names are bad, things are stuffed into special structs rather than new fields, and there are no GMS changes yet. Look not at the code, but at the intention of the code, as none of this is final or production-quality.

    Dolt PR

    References

Closed Issues

  • 751: Column default expressions should not require parentheses
  • 730: Simple index join broken
  • 725: Panic creating unique index
  • 574: incorrect format for numeric type result
  • 554: Column alias from subquery cannot be referenced in join
  • 648: Cannot create table after checking out a new branch
  • 88: Doltgres version prints Dolt version output
  • 578: Cannot Connect To Doltgres with Slonik
  • 47: Connecting to doltgres from R Postgres client fails because of no "SET datestyle"

@github-actions github-actions bot added the go (Go use is a significant feature of the PR or issue) and bump-formula-pr (PR was created using `brew bump-formula-pr`) labels on Oct 1, 2024
Contributor

github-actions bot commented Oct 1, 2024

🤖 An automated task has requested bottles to be published to this PR.

@github-actions github-actions bot added the CI-published-bottle-commits (The commits for the built bottles have been pushed to the PR branch.) label on Oct 1, 2024
@BrewTestBot BrewTestBot added this pull request to the merge queue Oct 1, 2024
Merged via the queue into master with commit 86f5e70 Oct 1, 2024
15 checks passed
@BrewTestBot BrewTestBot deleted the bump-doltgres-0.12.0 branch October 1, 2024 04:05