[#23490] YSQL: Tighten notion of equality for update optimizations

Summary:
### Background
Data undergoes multiple transformations during the lifetime of a query. The data that is input by a query may be type modified (`typmod`), padded and aligned (`typalign`), type cast, compressed (`typstorage`) and normalized before being stored in a Datum as part of a Postgres tuple. This Datum is then converted into a network format (`protobuf`) as it is sent over the network to a tserver. Finally, the Datum is unpacked and stored in persistent storage in a format that is supported by DocDB/RocksDB. The reverse process happens when the same piece of data needs to be output back to the user.

Each of these formats has its own notion of equality: Postgres has the notion of semantic/logical equality (`{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` are equivalent) and storage/binary equality (the binary representations of `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` may be different if alternative representations of a value are not normalized). Similarly, DocDB has its own notion of semantic and storage equality.

In most cases, these notions of equality can be used interchangeably and with good reason:
- There is a 1:1 correspondence between the data stored in different formats. That is, a datum’s representation in Postgres has exactly one corresponding representation in DocDB that allows the datum to be transformed seamlessly between the formats. This also implies that two datums whose binary representations are identical in Postgres will also have identical representations in DocDB. This property is ubiquitously exploited to push down Postgres operations to DocDB, and indeed to use DocDB as a storage engine for Postgres.
- Postgres also normalizes the representation of a datum’s value when it is packed into a Datum prior to storage.
  That is, if a given value of a given data type has multiple representations (the JSON from the example above), Postgres converts the value into a normalized representation, which allows semantic equality to be used interchangeably with storage equality (if multiple representations of a value are represented in-memory/on-disk identically, they will also be stored identically). For data types that are not normalized, Postgres does not define an equality operator (the `json` data type is not normalized and does not have an equality operator, while the `jsonb` data type is normalized and has an equality operator).

This leads to a couple of problems:
- There are occasions where we may want to know if two datums are stored identically when the data type of the datums does not have an equality operator. On such occasions, there is a distinction between semantic equality (not defined) and storage equality (defined).
- Postgres is a highly extensible database that allows users to define custom data types and equality operators. In user-defined scenarios, it is also possible to end up with a difference between semantic and storage equality.

### This revision
We perform the following optimizations on UPDATE queries that rely on *some* notion of equality:
1. If a BEFORE UPDATE FOR EACH ROW trigger is defined, we skip redundant index updates by comparing the old (pre-update) and new (post-update) values of a column.
2. With D34040, we also have a framework to skip index updates and constraint checking in cases where the value of a column remains unchanged by the update process.

Both of these optimizations rely on semantic equality today. However, they should rely on storage/binary equality to correctly handle the problems mentioned above:
- A given data type may not define an equality operator. In the absence of storage equality, for correctness in such cases, we must assume that the columns of such data types always change in value.
- A user-defined data type may have funky notions of semantic equality (and set membership). This can lead to correctness issues in cases such as partial indexes, when two representations of a given value are considered semantically equal, but are not stored identically (not normalized) and membership in the partial index relies on a membership function that is sensitive to the storage representation (e.g., `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` are not stored identically and a partial index is defined on `begins with '{"a": 1'`).

This revision switches to the use of storage equality for the above optimizations, with the caveat that the function used for the comparison (`datumIsEqual`) does not support TOASTed storage.

Jira: DB-12404

Test Plan:
```
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressUpdateOptimized#schedule'
```

Reviewers: amartsinchyk, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37384
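The gap between semantic and storage equality described in the Background can be reproduced outside of Postgres with IEEE 754 doubles in plain C. The sketch below is an illustration only: `semantic_equal`, `storage_equal` and `check_equality_notions` are hypothetical names, and C's `==` merely stands in for a data type's equality operator (it makes no claim about how Postgres compares `float8` values).

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <string.h>

/* Semantic equality: here, whatever C's == operator reports for doubles. */
bool
semantic_equal(double a, double b)
{
    return a == b;
}

/* Storage equality: byte-for-byte comparison of the two representations. */
bool
storage_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}

/* Exercises the two cases where the notions disagree (assumes IEEE 754). */
void
check_equality_notions(void)
{
    const double quiet_nan = NAN;

    /* +0.0 and -0.0 compare equal, but their sign bits differ. */
    assert(semantic_equal(0.0, -0.0));
    assert(!storage_equal(0.0, -0.0));

    /* A NaN never compares equal to itself, yet its bytes are identical. */
    assert(!semantic_equal(quiet_nan, quiet_nan));
    assert(storage_equal(quiet_nan, quiet_nan));
}
```

The first case is the one that matters for the partial index problem: two values that an equality operator treats as the same can still differ in their stored bytes.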
yugabyte · Sep 3, 2024 · b57e3c6
1 parent 1050ec4
Showing 3 changed files with 1,261 additions and 26 deletions.
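The size-bounded storage comparison that the commit message describes can be modelled without any Postgres headers. This is a minimal sketch under stated assumptions, not the code in this commit: `ToyDatum`, `toy_datum_size`, `toy_stored_identically` and `toy_column_modified` are hypothetical names, a NULL pointer stands in for a SQL NULL, and `max_size` plays the role of the `max_cols_size_to_compare` option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a Postgres Datum (not the real type): either an
 * inline machine word (byval) or a pointer to 'len' bytes (by reference).
 */
typedef struct ToyDatum
{
    bool                 byval;
    uintptr_t            word;   /* used when byval is true */
    const unsigned char *bytes;  /* used when byval is false */
    size_t               len;    /* used when byval is false */
} ToyDatum;

static size_t
toy_datum_size(const ToyDatum *d)
{
    return d->byval ? sizeof(uintptr_t) : d->len;
}

/* True only if both datums have the same size and the same bytes. */
bool
toy_stored_identically(const ToyDatum *lhs, const ToyDatum *rhs,
                       size_t max_size)
{
    /* Both datums are assumed to belong to the same data type. */
    assert(lhs->byval == rhs->byval);

    const size_t lhs_size = toy_datum_size(lhs);
    const size_t rhs_size = toy_datum_size(rhs);

    /* Oversized values are conservatively reported as "not identical". */
    if (lhs_size != rhs_size || lhs_size > max_size)
        return false;

    if (lhs->byval)
        return lhs->word == rhs->word;

    return lhs_size == 0 || memcmp(lhs->bytes, rhs->bytes, lhs_size) == 0;
}

/* A column is unmodified if both values are NULL or stored identically. */
bool
toy_column_modified(const ToyDatum *old_value, const ToyDatum *new_value,
                    size_t max_size)
{
    const bool old_is_null = (old_value == NULL);
    const bool new_is_null = (new_value == NULL);

    return (old_is_null != new_is_null) ||
           (!old_is_null &&
            !toy_stored_identically(old_value, new_value, max_size));
}
```

Returning false for values above the size limit errs on the safe side: the column is treated as modified, so the update is performed rather than skipped.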
diff --git a/src/postgres/src/backend/executor/ybOptimizeModifyTable.c b/src/postgres/src/backend/executor/ybOptimizeModifyTable.c
@@ -126,41 +126,74 @@ YbIsColumnComparisonAllowed(const Bitmapset *modified_cols,
 }
 
 /* ----------------------------------------------------------------
- * YBEqualDatums
+ * YBAreDatumsStoredIdentically
  *
- * Function compares values of lhs and rhs datums with respect to value type
- * and collation.
+ * Function determines if the underlying storage representation of the two
+ * datums is identical.
  *
- * Returns true in case value of lhs and rhs datums match.
+ * Notes:
+ * - This function assumes that the value held in both datums are of the same
+ *   data type ('attdesc->atttypid'). This function does not account for type
+ *   casting.
+ * - This function returns true if both datums are NULL. Semantically, NULL
+ *   values are always treated as distinct, unless supplied with a NULLS NOT
+ *   DISTINCT modifier. However, the storage representation of NULL values is
+ *   identical.
+ * - Data types may have multiple representations for a value. Examples include
+ *    - NaNs in floating point data types
+ *    - {"a": 1, "b": 2}, {"b": 2, "a": 1} in jsonb.
+ *   Postgres normalizes the representation of these values for primitive data
+ *   types. This function does not normalize the representation of datums before
+ *   comparing them. Beware if you are using user-defined data types!
+ * - Similarly, this function does not account for type modifiers and alignment
+ *   rules. Input datums are expected to be modified and aligned.
+ * - Additionally, this function assumes that both datums are either compressed
+ *   or uncompressed. It does not handle the case where one datum is compressed
+ *   and the other is not.
+ * - Furthermore, this function assumes that both data types have the same
+ *   collation.
  * ----------------------------------------------------------------
  */
 static bool
-YBEqualDatums(Datum lhs, Datum rhs, Oid atttypid, Oid collation)
+YBAreDatumsStoredIdentically(Datum lhs,
+							 Datum rhs,
+							 const FormData_pg_attribute *attdesc)
 {
-	TypeCacheEntry *typentry =
-		lookup_type_cache(atttypid, TYPECACHE_CMP_PROC_FINFO);
-	if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
-		ereport(ERROR, (errcode(ERRCODE_UNDEFINED_FUNCTION),
-						errmsg("could not identify a comparison function for "
-							   "type %s",
-							   format_type_be(typentry->type_id))));
-
-	/* To ensure that there is an upper bound on the size of data compared */
-	const int lhslength = datumGetSize(lhs, typentry->typbyval, typentry->typlen);
-	const int rhslength = datumGetSize(rhs, typentry->typbyval, typentry->typlen);
+	/*
+	 * Within a tuple, postgres may store a field's data in 3 different ways:
+	 * 1. Inline/byval within the tuple. This is used for fixed length types.
+	 * 2. Out-of-line/by reference but on the same page. Postgres uses fixed
+	 *    page sizes (usually 8 KB). If the data fits within the page, it is
+	 *    preferred to be stored on the same page, with a reference from the
+	 *    tuple to its location. This is used for variable length data types.
+	 * 3. Out-of-line and TOASTed. If the data is too large to fit within the
+	 *    page, it is stored in a TOAST table with a reference from the
+	 *    tuple to its location.
+	 *
+	 * For fields with inline data (1), it can be determined if the two datums
+	 * are identical by casting the datums to machine-word sized integers and
+	 * comparing their values.
+	 * For fields with out-of-line data (2 and 3), the datums are pointers to
+	 * variable length byte arrays which can be memcmp'd.
+	 * Yugabyte does not TOAST oversized values, nor does it use the concept of
+	 * storage pages, so we do not need to handle (3) in this function.
+	 */
+
+	Assert(attdesc->attbyval || (attdesc->attstorage != 'e'));
 
+	/*
+	 * To ensure that there is an upper bound on the size of data compared,
+	 * compute the size of the datums. The length computation is repeated in
+	 * datumIsEqual, but it does not accept an upper bound arg, and we would
+	 * like to keep that function unchanged.
+	 */
+	const int lhslength = datumGetSize(lhs, attdesc->attbyval, attdesc->attlen);
+	const int rhslength = datumGetSize(rhs, attdesc->attbyval, attdesc->attlen);
 	if (lhslength != rhslength ||
 		lhslength > yb_update_optimization_options.max_cols_size_to_compare)
 		return false;
 
-	FunctionCallInfoData locfcinfo;
-	InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2, collation,
-							 NULL, NULL);
-	locfcinfo.arg[0] = lhs;
-	locfcinfo.arg[1] = rhs;
-	locfcinfo.argnull[0] = false;
-	locfcinfo.argnull[1] = false;
-	return DatumGetInt32(FunctionCallInvoke(&locfcinfo)) == 0;
+	return datumIsEqual(lhs, rhs, attdesc->attbyval, attdesc->attlen);
 }
 
 /* ----------------------------------------------------------------------------
@@ -192,8 +225,7 @@ YBIsColumnModified(Relation rel, HeapTuple oldtuple, HeapTuple newtuple,
 
 	return (
 		(old_is_null != new_is_null) ||
-		(!old_is_null && !YBEqualDatums(old_value, new_value, attdesc->atttypid,
-										attdesc->attcollation)));
+		(!old_is_null && !YBAreDatumsStoredIdentically(old_value, new_value, attdesc)));
 }
 
 /* ----------------------------------------------------------------