bug: loading DATE columns with extreme values results in incorrect data (pyarrow.lib.ArrowInvalid casting error in v3) #45

Closed
tswast opened this issue Dec 1, 2021 · 5 comments · Fixed by #46
Assignees: tswast
Labels:
  • api: bigquery - Issues related to the googleapis/python-db-dtypes-pandas API.
  • priority: p1 - Important issue which blocks shipping the next release. Will be fixed prior to next release.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

tswast (Collaborator) commented Dec 1, 2021

Environment details

  • OS type and version: macOS (platform darwin; see the pytest session below)
  • Python version: 3.9.5
  • pip version: (not provided)
  • google-cloud-bigquery version (output of pip show google-cloud-bigquery):
Name: google-cloud-bigquery
Version: 3.0.0b1
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: [email protected]
License: Apache 2.0
Location: /Users/swast/src/github.com/googleapis/python-bigquery
Requires: grpcio, google-api-core, google-cloud-bigquery-storage, google-cloud-core, google-resumable-media, packaging, proto-plus, protobuf, pyarrow, requests
Required-by: pandas-gbq, sqlalchemy-bigquery, pybigquery, ibis-bigquery

Steps to reproduce

Try to load the following DataFrame:

import datetime

import pandas

input_df = pandas.DataFrame(
    {
        "row_num": [1, 2, 3],
        "date_col": [datetime.date(1, 1, 1), datetime.date(1970, 1, 1), datetime.date(9999, 12, 31)],
        "datetime_col": [datetime.datetime(1, 1, 1), datetime.datetime(1970, 1, 1), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)],
        "timestamp_col": [datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc), datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=datetime.timezone.utc)],
    },
    columns=["row_num", "date_col", "datetime_col", "timestamp_col"],
)
table_schema = [
    {"name": "row_num", "type": "INTEGER"},
    {"name": "date_col", "type": "DATE"},
    {"name": "datetime_col", "type": "DATETIME"},
    {"name": "timestamp_col", "type": "TIMESTAMP"},
]
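
For reference, the failing system test below exercises this DataFrame through pandas-gbq's parquet load path. A minimal sketch of that call (the project and table IDs here are placeholders, not the ones from the test):

import pandas_gbq

# Placeholder project and table; the failing test uses a randomly named
# table in a scratch dataset.
pandas_gbq.to_gbq(
    input_df,
    "my_dataset.extreme_dates_repro",
    project_id="my-project",
    table_schema=table_schema,
    api_method="load_parquet",
    if_exists="append",
)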

Stack trace

(dev-3.9) ➜  python-bigquery-pandas git:(issue365-datetime-out-of-bounds) ✗ pytest tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema'[load_parquet-issue365-extreme-datetimes]'
============================================================================= test session starts =============================================================================
platform darwin -- Python 3.9.5, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /Users/swast/src/github.com/googleapis/python-bigquery-pandas
plugins: cov-2.12.1, asyncio-0.15.1, anyio-3.3.0, requests-mock-1.9.3
collected 1 item                                                                                                                                                              

tests/system/test_to_gbq.py F                                                                                                                                           [100%]

================================================================================== FAILURES ===================================================================================
____________________________________________ test_dataframe_round_trip_with_table_schema[load_parquet-issue365-extreme-datetimes] _____________________________________________

client = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination_table_ref = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411')
location = None
schema = {'fields': [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]}

    def load_parquet(
        client: bigquery.Client,
        dataframe: pandas.DataFrame,
        destination_table_ref: bigquery.TableReference,
        location: Optional[str],
        schema: Optional[Dict[str, Any]],
    ):
        job_config = bigquery.LoadJobConfig()
        job_config.write_disposition = "WRITE_APPEND"
        job_config.source_format = "PARQUET"
    
        if schema is not None:
            schema = pandas_gbq.schema.remove_policy_tags(schema)
            job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
            dataframe = cast_dataframe_for_parquet(dataframe, schema)
    
        try:
>           client.load_table_from_dataframe(
                dataframe, destination_table_ref, job_config=job_config, location=location,
            ).result()

pandas_gbq/load.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411'), num_retries = 6
job_id = '2592298a-57af-4944-b7f4-a93f58ee2af1', job_id_prefix = None, location = None, project = None
job_config = <google.cloud.bigquery.job.load.LoadJobConfig object at 0x19b966d60>, parquet_compression = 'SNAPPY', timeout = None

    def load_table_from_dataframe(
        self,
        dataframe,
        destination: Union[Table, TableReference, str],
        num_retries: int = _DEFAULT_NUM_RETRIES,
        job_id: str = None,
        job_id_prefix: str = None,
        location: str = None,
        project: str = None,
        job_config: LoadJobConfig = None,
        parquet_compression: str = "snappy",
        timeout: TimeoutType = DEFAULT_TIMEOUT,
    ) -> job.LoadJob:
        """Upload the contents of a table from a pandas DataFrame.
    
        Similar to :meth:`load_table_from_uri`, this method creates, starts and
        returns a :class:`~google.cloud.bigquery.job.LoadJob`.
    
        .. note::
    
            REPEATED fields are NOT supported when using the CSV source format.
            They are supported when using the PARQUET source format, but
            due to the way they are encoded in the ``parquet`` file,
            a mismatch with the existing table schema can occur, so
            REPEATED fields are not properly supported when using ``pyarrow<4.0.0``
            using the parquet format.
    
            https://github.com/googleapis/python-bigquery/issues/19
    
        Args:
            dataframe (pandas.DataFrame):
                A :class:`~pandas.DataFrame` containing the data to load.
            destination (google.cloud.bigquery.table.TableReference):
                The destination table to use for loading the data. If it is an
                existing table, the schema of the :class:`~pandas.DataFrame`
                must match the schema of the destination table. If the table
                does not yet exist, the schema is inferred from the
                :class:`~pandas.DataFrame`.
    
                If a string is passed in, this method attempts to create a
                table reference from a string using
                :func:`google.cloud.bigquery.table.TableReference.from_string`.
    
        Keyword Arguments:
            num_retries (Optional[int]): Number of upload retries.
            job_id (Optional[str]): Name of the job.
            job_id_prefix (Optional[str]):
                The user-provided prefix for a randomly generated
                job ID. This parameter will be ignored if a ``job_id`` is
                also given.
            location (Optional[str]):
                Location where to run the job. Must match the location of the
                destination table.
            project (Optional[str]):
                Project ID of the project of where to run the job. Defaults
                to the client's project.
            job_config (Optional[google.cloud.bigquery.job.LoadJobConfig]):
                Extra configuration options for the job.
    
                To override the default pandas data type conversions, supply
                a value for
                :attr:`~google.cloud.bigquery.job.LoadJobConfig.schema` with
                column names matching those of the dataframe. The BigQuery
                schema is used to determine the correct data type conversion.
                Indexes are not loaded.
    
                By default, this method uses the parquet source format. To
                override this, supply a value for
                :attr:`~google.cloud.bigquery.job.LoadJobConfig.source_format`
                with the format name. Currently only
                :attr:`~google.cloud.bigquery.job.SourceFormat.CSV` and
                :attr:`~google.cloud.bigquery.job.SourceFormat.PARQUET` are
                supported.
            parquet_compression (Optional[str]):
                [Beta] The compression method to use if intermittently
                serializing ``dataframe`` to a parquet file.
    
                The argument is directly passed as the ``compression``
                argument to the underlying ``pyarrow.parquet.write_table()``
                method (the default value "snappy" gets converted to uppercase).
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
    
                If the job config schema is missing, the argument is directly
                passed as the ``compression`` argument to the underlying
                ``DataFrame.to_parquet()`` method.
                https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
            timeout (Optional[float]):
                The number of seconds to wait for the underlying HTTP transport
                before using ``retry``.
    
        Returns:
            google.cloud.bigquery.job.LoadJob: A new load job.
    
        Raises:
            TypeError:
                If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
                class.
        """
        job_id = _make_job_id(job_id, job_id_prefix)
    
        if job_config:
            _verify_job_config_type(job_config, google.cloud.bigquery.job.LoadJobConfig)
            # Make a copy so that the job config isn't modified in-place.
            job_config_properties = copy.deepcopy(job_config._properties)
            job_config = job.LoadJobConfig()
            job_config._properties = job_config_properties
    
        else:
            job_config = job.LoadJobConfig()
    
        supported_formats = {job.SourceFormat.CSV, job.SourceFormat.PARQUET}
        if job_config.source_format is None:
            # default value
            job_config.source_format = job.SourceFormat.PARQUET
    
        if (
            job_config.source_format == job.SourceFormat.PARQUET
            and job_config.parquet_options is None
        ):
            parquet_options = ParquetOptions()
            # default value
            parquet_options.enable_list_inference = True
            job_config.parquet_options = parquet_options
    
        if job_config.source_format not in supported_formats:
            raise ValueError(
                "Got unexpected source_format: '{}'. Currently, only PARQUET and CSV are supported".format(
                    job_config.source_format
                )
            )
    
        if location is None:
            location = self.location
    
        # If table schema is not provided, we try to fetch the existing table
        # schema, and check if dataframe schema is compatible with it - except
        # for WRITE_TRUNCATE jobs, the existing schema does not matter then.
        if (
            not job_config.schema
            and job_config.write_disposition != job.WriteDisposition.WRITE_TRUNCATE
        ):
            try:
                table = self.get_table(destination)
            except core_exceptions.NotFound:
                pass
            else:
                columns_and_indexes = frozenset(
                    name
                    for name, _ in _pandas_helpers.list_columns_and_indexes(dataframe)
                )
                job_config.schema = [
                    # Field description and policy tags are not needed to
                    # serialize a data frame.
                    SchemaField(
                        field.name,
                        field.field_type,
                        mode=field.mode,
                        fields=field.fields,
                    )
                    # schema fields not present in the dataframe are not needed
                    for field in table.schema
                    if field.name in columns_and_indexes
                ]
    
        job_config.schema = _pandas_helpers.dataframe_to_bq_schema(
            dataframe, job_config.schema
        )
    
        if not job_config.schema:
            # the schema could not be fully detected
            warnings.warn(
                "Schema could not be detected for all columns. Loading from a "
                "dataframe without a schema will be deprecated in the future, "
                "please provide a schema.",
                PendingDeprecationWarning,
                stacklevel=2,
            )
    
        tmpfd, tmppath = tempfile.mkstemp(
            suffix="_job_{}.{}".format(job_id[:8], job_config.source_format.lower())
        )
        os.close(tmpfd)
    
        try:
    
            if job_config.source_format == job.SourceFormat.PARQUET:
                if job_config.schema:
                    if parquet_compression == "snappy":  # adjust the default value
                        parquet_compression = parquet_compression.upper()
    
>                   _pandas_helpers.dataframe_to_parquet(
                        dataframe,
                        job_config.schema,
                        tmppath,
                        parquet_compression=parquet_compression,
                        parquet_use_compliant_nested_type=True,
                    )

../python-bigquery/google/cloud/bigquery/client.py:2616: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
bq_schema = [SchemaField('row_num', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('date_col', 'DATE', 'NULLABLE', None, (), ...e_col', 'DATETIME', 'NULLABLE', None, (), None), SchemaField('timestamp_col', 'TIMESTAMP', 'NULLABLE', None, (), None)]
filepath = '/var/folders/gg/ry6k1t_157xbkwtn2rnnfddc006gf0/T/tmpip3lgjt8_job_2592298a.parquet', parquet_compression = 'SNAPPY', parquet_use_compliant_nested_type = True

    def dataframe_to_parquet(
        dataframe,
        bq_schema,
        filepath,
        parquet_compression="SNAPPY",
        parquet_use_compliant_nested_type=True,
    ):
        """Write dataframe as a Parquet file, according to the desired BQ schema.
    
        This function requires the :mod:`pyarrow` package. Arrow is used as an
        intermediate format.
    
        Args:
            dataframe (pandas.DataFrame):
                DataFrame to convert to Parquet file.
            bq_schema (Sequence[Union[ \
                :class:`~google.cloud.bigquery.schema.SchemaField`, \
                Mapping[str, Any] \
            ]]):
                Desired BigQuery schema. Number of columns must match number of
                columns in the DataFrame.
            filepath (str):
                Path to write Parquet file to.
            parquet_compression (Optional[str]):
                The compression codec to use by the the ``pyarrow.parquet.write_table``
                serializing method. Defaults to "SNAPPY".
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
            parquet_use_compliant_nested_type (bool):
                Whether the ``pyarrow.parquet.write_table`` serializing method should write
                compliant Parquet nested type (lists). Defaults to ``True``.
            https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
    
                This argument is ignored for ``pyarrow`` versions earlier than ``4.0.0``.
        """
        import pyarrow.parquet  # type: ignore
    
        kwargs = (
            {"use_compliant_nested_type": parquet_use_compliant_nested_type}
            if _helpers.PYARROW_VERSIONS.use_compliant_nested_type
            else {}
        )
    
        bq_schema = schema._to_schema_fields(bq_schema)
>       arrow_table = dataframe_to_arrow(dataframe, bq_schema)

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:660: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
bq_schema = [SchemaField('row_num', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('date_col', 'DATE', 'NULLABLE', None, (), ...e_col', 'DATETIME', 'NULLABLE', None, (), None), SchemaField('timestamp_col', 'TIMESTAMP', 'NULLABLE', None, (), None)]

    def dataframe_to_arrow(dataframe, bq_schema):
        """Convert pandas dataframe to Arrow table, using BigQuery schema.
    
        Args:
            dataframe (pandas.DataFrame):
                DataFrame to convert to Arrow table.
            bq_schema (Sequence[Union[ \
                :class:`~google.cloud.bigquery.schema.SchemaField`, \
                Mapping[str, Any] \
            ]]):
                Desired BigQuery schema. The number of columns must match the
                number of columns in the DataFrame.
    
        Returns:
            pyarrow.Table:
                Table containing dataframe data, with schema derived from
                BigQuery schema.
        """
        column_names = set(dataframe.columns)
        column_and_index_names = set(
            name for name, _ in list_columns_and_indexes(dataframe)
        )
    
        bq_schema = schema._to_schema_fields(bq_schema)
        bq_field_names = set(field.name for field in bq_schema)
    
        extra_fields = bq_field_names - column_and_index_names
        if extra_fields:
            raise ValueError(
                "bq_schema contains fields not present in dataframe: {}".format(
                    extra_fields
                )
            )
    
        # It's okay for indexes to be missing from bq_schema, but it's not okay to
        # be missing columns.
        missing_fields = column_names - bq_field_names
        if missing_fields:
            raise ValueError(
                "bq_schema is missing fields from dataframe: {}".format(missing_fields)
            )
    
        arrow_arrays = []
        arrow_names = []
        arrow_fields = []
        for bq_field in bq_schema:
            arrow_names.append(bq_field.name)
            arrow_arrays.append(
>               bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
            )

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:605: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

series = 0    1754-08-30
1    1970-01-01
2    1816-03-29
Name: date_col, dtype: dbdate, bq_field = SchemaField('date_col', 'DATE', 'NULLABLE', None, (), None)

    def bq_to_arrow_array(series, bq_field):
        if bq_field.field_type.upper() == "GEOGRAPHY":
            arrow_type = None
            first = _first_valid(series)
            if first is not None:
                if series.dtype.name == "geometry" or isinstance(first, _BaseGeometry):
                    arrow_type = pyarrow.binary()
                    # Convert shapey geometry to WKB binary format:
                    series = series.apply(_to_wkb)
                elif isinstance(first, bytes):
                    arrow_type = pyarrow.binary()
            elif series.dtype.name == "geometry":
                # We have a GeoSeries containing all nulls, convert it to a pandas series
                series = pandas.Series(numpy.array(series))
    
            if arrow_type is None:
                arrow_type = bq_to_arrow_data_type(bq_field)
        else:
            arrow_type = bq_to_arrow_data_type(bq_field)
    
        field_type_upper = bq_field.field_type.upper() if bq_field.field_type else ""
    
        if bq_field.mode.upper() == "REPEATED":
            return pyarrow.ListArray.from_pandas(series, type=arrow_type)
        if field_type_upper in schema._STRUCT_TYPES:
            return pyarrow.StructArray.from_pandas(series, type=arrow_type)
>       return pyarrow.Array.from_pandas(series, type=arrow_type)

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:330: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:904: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:252: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <DateArray>
[datetime.date(1754, 8, 30),  datetime.date(1970, 1, 1),
 datetime.date(1816, 3, 29)]
Length: 3, dtype: dbdate, type = DataType(date32[day])

    def __arrow_array__(self, type=None):
        """Convert to an Arrow array from dbdate data.
    
        See:
        https://pandas.pydata.org/pandas-docs/stable/development/extending.html#compatibility-with-apache-arrow
        """
        array = pyarrow.array(self._ndarray, type=pyarrow.timestamp("ns"))
>       return pyarrow.compute.cast(
            array, type if type is not None else pyarrow.date32(),
        )

../python-db-dtypes-pandas/db_dtypes/__init__.py:277: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = <pyarrow.lib.TimestampArray object at 0x19bbd7c40>
[
  1754-08-30 22:43:41.128654848,
  1970-01-01 00:00:00.000000000,
  1816-03-29 05:56:08.066277376
]
target_type = DataType(date32[day]), safe = True

    def cast(arr, target_type, safe=True):
        """
        Cast array values to another data type. Can also be invoked as an array
        instance method.
    
        Parameters
        ----------
        arr : Array or ChunkedArray
        target_type : DataType or type string alias
            Type to cast to
        safe : bool, default True
            Check for overflows or other unsafe conversions
    
        Examples
        --------
        >>> from datetime import datetime
        >>> import pyarrow as pa
        >>> arr = pa.array([datetime(2010, 1, 1), datetime(2015, 1, 1)])
        >>> arr.type
        TimestampType(timestamp[us])
    
        You can use ``pyarrow.DataType`` objects to specify the target type:
    
        >>> cast(arr, pa.timestamp('ms'))
        <pyarrow.lib.TimestampArray object at 0x7fe93c0f6910>
        [
          2010-01-01 00:00:00.000,
          2015-01-01 00:00:00.000
        ]
    
        >>> cast(arr, pa.timestamp('ms')).type
        TimestampType(timestamp[ms])
    
        Alternatively, it is also supported to use the string aliases for these
        types:
    
        >>> arr.cast('timestamp[ms]')
        <pyarrow.lib.TimestampArray object at 0x10420eb88>
        [
          1262304000000,
          1420070400000
        ]
        >>> arr.cast('timestamp[ms]').type
        TimestampType(timestamp[ms])
    
        Returns
        -------
        casted : Array
        """
        if target_type is None:
            raise ValueError("Cast target type must not be None")
        if safe:
            options = CastOptions.safe(target_type)
        else:
            options = CastOptions.unsafe(target_type)
>       return call_function("cast", [arr], options)

/usr/local/Caskroom/miniconda/base/envs/dev-3.9/lib/python3.9/site-packages/pyarrow/compute.py:297: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/_compute.pyx:527: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/_compute.pyx:337: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/error.pxi:143: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to date32[day] would lose data: -6795364578871345152

pyarrow/error.pxi:99: ArrowInvalid

The above exception was the direct cause of the following exception:

method_under_test = functools.partial(<function to_gbq at 0x19ba4aee0>, project_id='swast-scratch', credentials=<google.oauth2.credentials.Credentials object at 0x19b997160>)
random_dataset_id = 'python_bigquery_pandas_tests_system_20211201161931_9106f8', bigquery_client = <google.cloud.bigquery.client.Client object at 0x19b97efd0>
input_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
expected_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
table_schema = [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]
api_method = 'load_parquet', api_methods = {'load_csv', 'load_parquet'}

    @pytest.mark.parametrize(
        ["input_df", "expected_df", "table_schema", "api_methods"], DATAFRAME_ROUND_TRIPS
    )
    def test_dataframe_round_trip_with_table_schema(
        method_under_test,
        random_dataset_id,
        bigquery_client,
        input_df,
        expected_df,
        table_schema,
        api_method,
        api_methods,
    ):
        if api_method not in api_methods:
            pytest.skip(f"{api_method} not supported.")
        if expected_df is None:
            expected_df = input_df
        table_id = f"{random_dataset_id}.round_trip_w_schema_{random.randrange(1_000_000)}"
>       method_under_test(
            input_df, table_id, table_schema=table_schema, api_method=api_method
        )

tests/system/test_to_gbq.py:232: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:1095: in to_gbq
    connector.load_data(
pandas_gbq/gbq.py:546: in load_data
    chunks = load.load_chunks(
pandas_gbq/load.py:218: in load_chunks
    load_parquet(client, dataframe, destination_table_ref, location, schema)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

client = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination_table_ref = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411')
location = None
schema = {'fields': [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]}

    def load_parquet(
        client: bigquery.Client,
        dataframe: pandas.DataFrame,
        destination_table_ref: bigquery.TableReference,
        location: Optional[str],
        schema: Optional[Dict[str, Any]],
    ):
        job_config = bigquery.LoadJobConfig()
        job_config.write_disposition = "WRITE_APPEND"
        job_config.source_format = "PARQUET"
    
        if schema is not None:
            schema = pandas_gbq.schema.remove_policy_tags(schema)
            job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
            dataframe = cast_dataframe_for_parquet(dataframe, schema)
    
        try:
            client.load_table_from_dataframe(
                dataframe, destination_table_ref, job_config=job_config, location=location,
            ).result()
        except pyarrow.lib.ArrowInvalid as exc:
>           raise exceptions.ConversionError(
                "Could not convert DataFrame to Parquet."
            ) from exc
E           pandas_gbq.exceptions.ConversionError: Could not convert DataFrame to Parquet.

pandas_gbq/load.py:131: ConversionError
=========================================================================== short test summary info ===========================================================================
FAILED tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_parquet-issue365-extreme-datetimes] - pandas_gbq.exceptions.ConversionError: Could not ...
============================================================================== 1 failed in 4.85s ==============================================================================
tswast self-assigned this Dec 1, 2021
tswast (Collaborator, Author) commented Dec 1, 2021

Possibly an issue with https://github.com/googleapis/python-db-dtypes-pandas too.

It shouldn't cast data it can't handle to nanosecond-precision.
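
For context (my read of the traceback, not something the library documents): the dbdate array appears to be backed by datetime64[ns], and a signed 64-bit nanosecond count only spans about 585 years, so dates like 0001-01-01 and 9999-12-31 cannot survive that representation. A quick numpy-only sketch of the representable window:

import numpy as np

# A signed 64-bit count of nanoseconds since the Unix epoch covers roughly
# 1677-09-21 through 2262-04-11 (int64 min itself is reserved for NaT).
ns_min = np.datetime64(np.iinfo(np.int64).min + 1, "ns")
ns_max = np.datetime64(np.iinfo(np.int64).max, "ns")
print(ns_min)  # 1677-09-21T00:12:43.145224193
print(ns_max)  # 2262-04-11T23:47:16.854775807

That window is consistent with the corrupted values in the traceback, where 0001-01-01 shows up as 1754-08-30 and 9999-12-31 as 1816-03-29 after wrapping.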

tswast (Collaborator, Author) commented Dec 1, 2021

It turns out this is also a bug with google-cloud-bigquery 2.x: the load succeeds, but the DATE values that come back are wrong.

(pandas-gbq-dev) ➜  python-bigquery-pandas git:(issue365-datetime-out-of-bounds) ✗ pip show google-cloud-bigquery
Name: google-cloud-bigquery
Version: 2.30.1
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/Caskroom/miniconda/base/envs/pandas-gbq-dev/lib/python3.9/site-packages
Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
Required-by: pandas-gbq
tests/system/test_to_gbq.py F                                                                                                                                   [100%]

============================================================================== FAILURES ===============================================================================
________________________________________ test_dataframe_round_trip_with_table_schema[load_parquet-issue365-extreme-datetimes] _________________________________________

method_under_test = functools.partial(<function to_gbq at 0x19f13d8b0>, project_id='swast-scratch', credentials=<google.oauth2.credentials.Credentials object at 0x19f064c40>)
random_dataset_id = 'python_bigquery_pandas_tests_system_20211201211335_8ed57e', bigquery_client = <google.cloud.bigquery.client.Client object at 0x19f064ac0>
input_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
expected_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
table_schema = [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]
api_method = 'load_parquet', api_methods = {'load_csv', 'load_parquet'}

    @pytest.mark.parametrize(
        ["input_df", "expected_df", "table_schema", "api_methods"], DATAFRAME_ROUND_TRIPS
    )
    def test_dataframe_round_trip_with_table_schema(
        method_under_test,
        random_dataset_id,
        bigquery_client,
        input_df,
        expected_df,
        table_schema,
        api_method,
        api_methods,
    ):
        if api_method not in api_methods:
            pytest.skip(f"{api_method} not supported.")
        if expected_df is None:
            expected_df = input_df
        table_id = f"{random_dataset_id}.round_trip_w_schema_{random.randrange(1_000_000)}"
        method_under_test(
            input_df, table_id, table_schema=table_schema, api_method=api_method
        )
        round_trip = bigquery_client.list_rows(table_id).to_dataframe(
            dtypes=dict(zip(expected_df.columns, expected_df.dtypes))
        )
        round_trip.sort_values("row_num", inplace=True)
>       pandas.testing.assert_frame_equal(expected_df, round_trip)

tests/system/test_to_gbq.py:239: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/_libs/testing.pyx:53: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   AssertionError: DataFrame.iloc[:, 1] (column name="date_col") are different
E   
E   DataFrame.iloc[:, 1] (column name="date_col") values are different (66.66667 %)
E   [index]: [0, 1, 2]
E   [left]:  [0001-01-01, 1970-01-01, 9999-12-31]
E   [right]: [1754-08-30, 1970-01-01, 1816-03-29]

pandas/_libs/testing.pyx:168: AssertionError

Not good!

tswast changed the title from "v3 bug: loading DATE columns with extreme values results in incorrect data & pyarrow.lib.ArrowInvalid casting error" to "bug: loading DATE columns with extreme values results in incorrect data (pyarrow.lib.ArrowInvalid casting error in v3)" on Dec 1, 2021
tswast (Collaborator, Author) commented Dec 1, 2021

On second look, I suspect this is more a problem with the db-dtypes package than with the google-cloud-bigquery library, but I'll be interested to see how the tests behave on the v3 branch.

tswast (Collaborator, Author) commented Dec 1, 2021

Just checked with pytest tests/system/test_pandas.py::test_load_table_from_dataframe_w_explicit_schema from googleapis/python-bigquery#1078 on the v3 branch and it passes too. I think this is a problem with the db-dtypes package and how pandas-gbq is using it.

tswast transferred this issue from googleapis/python-bigquery on Dec 1, 2021
product-auto-label bot added the api: bigquery label on Dec 1, 2021
tswast (Collaborator, Author) commented Dec 1, 2021

Thought: perhaps we add pyarrow-backed implementations, similar to how pandas offers both https://pandas.pydata.org/docs/reference/api/pandas.arrays.StringArray.html and https://pandas.pydata.org/docs/reference/api/pandas.arrays.ArrowStringArray.html?

The pyarrow-backed implementation could support a wider range of values. Or perhaps the numpy implementation could be parametrized by time unit?
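
For illustration only (not the current db-dtypes API): a date32-backed Arrow array stores whole days rather than nanoseconds, so it covers the full BigQuery DATE range without loss:

import datetime
import pyarrow

extreme_dates = [
    datetime.date(1, 1, 1),
    datetime.date(1970, 1, 1),
    datetime.date(9999, 12, 31),
]

# date32 stores int32 days since 1970-01-01, so the extreme DATE values fit.
arr = pyarrow.array(extreme_dates, type=pyarrow.date32())
print(arr.to_pylist())
# [datetime.date(1, 1, 1), datetime.date(1970, 1, 1), datetime.date(9999, 12, 31)]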

yoshi-automation added the triage me label on Dec 2, 2021
gcf-merge-on-green bot pushed a commit to googleapis/python-bigquery that referenced this issue Dec 2, 2021
…taFrame (#1078)

Towards #1076 🦕
(edit: moved to googleapis/python-db-dtypes-pandas#45 )
tswast added the priority: p1 and type: bug labels and removed the triage me label on Dec 3, 2021
tswast closed this as completed in #46 on Dec 4, 2021
tswast self-assigned this Jan 30, 2023
abdelmegahedgoogle pushed a commit to abdelmegahedgoogle/python-bigquery that referenced this issue Apr 17, 2023
…taFrame (googleapis#1078)

Towards #1076 🦕
(edit: moved to googleapis/python-db-dtypes-pandas#45 )