
DOC: Avoid requesting data from s3 buckets from our docs #56762

Conversation

JackCollins91
Contributor

@JackCollins91 commented Jan 7, 2024

An ipython code chunk in the docs makes a call to an S3 bucket URL. This is most often harmless (and therefore not easy to replicate), but some users reported an error when building the HTML documentation (see issue #56592) when there was an access issue with the S3 bucket URL. The decision was to change it to a code block to avoid the calls.

Make consistent with other s3 bucket URL examples and avoid doc build error when problem with s3 url.
Make example consistent with other code block examples
@phofl
Member

phofl commented Jan 7, 2024

/preview

@@ -206,6 +206,7 @@ Styler

Other
^^^^^
- Bug when building html documentation from ``doc\source\user_guide\io.rst`` no longer calls S3 bucket URL (:issue:`56592`)
Member

No need for a whatsnew

Contributor Author

Thanks. Will resubmit shortly.


df = pd.read_xml(
pd.read_xml(
Member

Can you please write down the output?

@mroeschke added the Docs label Jan 8, 2024
@datapythonista changed the title from "BUG DOC: Bug docs won't build (s3 bucket does not exist) issue #56592" to "DOC: Docs won't build (s3 bucket does not exist)" Jan 12, 2024
@datapythonista changed the title from "DOC: Docs won't build (s3 bucket does not exist)" to "DOC: Avoid requesting data from s3 buckets from our docs" Jan 12, 2024
@datapythonista
Member

Thanks @JackCollins91 for the help with this.

The bucket does actually exist; the error inside Docker seems to be related to connectivity.

The idea is that, regardless of whether it works, it'd still be nice not to have our docs fetching those files from S3 at every build. What you did to achieve that is correct: replacing the ipython block with a code-block. It'd be better if you don't edit the code at all (at least in this PR, but I think we're happy with how it is in general). The only change besides replacing the block type is to add the output. The ipython block executes the code and generates the output for us, and since this won't happen anymore, we need to add the output ourselves (which I think requires adding >>> before the code, so the block knows what's the code and what's the output).

I think we want to do this for every block that accesses S3 buckets, not just one single block. You can find the output you need to add to the examples in the rendered docs: https://pandas.pydata.org/docs/dev/user_guide/io.html

Let me know if you have any questions.
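
For illustration, a minimal sketch of the kind of change being discussed, using a hypothetical bucket path and hypothetical output (the real blocks live in doc/source/user_guide/io.rst):

Before, the ipython directive executes the code at every doc build and fetches the file from S3:

.. ipython:: python

    df = pd.read_json("s3://example-bucket/data.json")

After, the code-block directive is only rendered, never executed; the >>> prompts mark what is code, and the output is pasted in by hand:

.. code-block:: python

    >>> df = pd.read_json("s3://example-bucket/data.json")
    >>> df.head(1)
              A         B
    0  0.469112 -0.282863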

@JackCollins91
Contributor Author

Thanks @datapythonista, and thanks for the guidance. I'll update the PR shortly.

JackCollins91 and others added 3 commits January 14, 2024 11:15
For each S3 bucket code block, ideally we show what the output would be, but without making an actual call. Unfortunately, for several of the S3 buckets, there are issues with the code, which we must fix in another commit or PR.

For now, for the two S3 examples that do work, we edit the code blocks to show what the output would have been if the code had run successfully.

Find details on issues in conversation on PR pandas-dev#56592
Code still doesn't run, but at least unmatched } is no longer the issue.
@JackCollins91
Contributor Author

JackCollins91 commented Jan 14, 2024

Hi @phofl and @datapythonista,
I made a commit which, in addition to ensuring no requests are made to S3 buckets, also displays something about the output.

The original S3 example that was causing the issue for #56592 is now fixed and displays the intended output as requested, so I believe this PR now fixes the original issue. I have attached a PDF file (not sure of the etiquette here, but hopefully it avoids any dependency issues) with a preview of the output.

However, in trying to extend the solution to all the S3 example code in the doc, I ran into some issues, and I'd appreciate advice on how the pandas community would prefer these to be managed.

  1. One of the samples returns a large dataframe. I chose to print out the column names to exemplify the contents, since even displaying the first row felt too large to look OK (see the sketch after this list). Let me know if there's a preferred way; I couldn't see an obvious example elsewhere.

  2. For the other three S3 bucket code examples, there are various errors preventing the code from running. I have attached a PDF file of the page from when I converted all the S3 examples to ipython code blocks, ran the code, and printed the error messages.
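
Regarding point 1, a minimal sketch of the pattern used, with a hypothetical path and hypothetical column names, summarizing the wide dataframe in a static code-block instead of pasting the full frame:

.. code-block:: python

    >>> df = pd.read_csv("s3://example-bucket/wide-file.csv")
    >>> df.columns
    Index(['col_a', 'col_b', 'col_c'], dtype='object')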

For the errors, I can suggest two options: 1) leave this PR as is (at least it fixes the simple S3 issues), or 2) try to fix all the errors, though I might need some assistance to check that all the S3 buckets are still OK and what authorizations they need.

The error messages are as follows.

1/3

In [230]: df = pd.read_json("s3://pandas-test/adatafile.json")
---------------------------------------------------------------------------
NoSuchBucket                              Traceback (most recent call last)
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:408, in AioBaseClient._make_api_call(self, operation_name, api_params)
    407     error_class = self.exceptions.from_code(error_code)
--> 408     raise error_class(parsed_response, operation_name)
    409 else:

NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
Cell In[230], line 1
----> 1 df = pd.read_json("s3://pandas-test/adatafile.json")

File /home/pandas/pandas/io/json/_json.py:791, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
    788 if convert_axes is None and orient != "table":
    789     convert_axes = True
--> 791 json_reader = JsonReader(
    792     path_or_buf,
    793     orient=orient,
    794     typ=typ,
    795     dtype=dtype,
    796     convert_axes=convert_axes,
    797     convert_dates=convert_dates,
    798     keep_default_dates=keep_default_dates,
    799     precise_float=precise_float,
    800     date_unit=date_unit,
    801     encoding=encoding,
    802     lines=lines,
    803     chunksize=chunksize,
    804     compression=compression,
    805     nrows=nrows,
    806     storage_options=storage_options,
    807     encoding_errors=encoding_errors,
    808     dtype_backend=dtype_backend,
    809     engine=engine,
    810 )
    812 if chunksize:
    813     return json_reader

File /home/pandas/pandas/io/json/_json.py:904, in JsonReader.__init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options, encoding_errors, dtype_backend, engine)
    902     self.data = filepath_or_buffer
    903 elif self.engine == "ujson":
--> 904     data = self._get_data_from_filepath(filepath_or_buffer)
    905     self.data = self._preprocess_data(data)

File /home/pandas/pandas/io/json/_json.py:944, in JsonReader._get_data_from_filepath(self, filepath_or_buffer)
    937 filepath_or_buffer = stringify_path(filepath_or_buffer)
    938 if (
    939     not isinstance(filepath_or_buffer, str)
    940     or is_url(filepath_or_buffer)
    941     or is_fsspec_url(filepath_or_buffer)
    942     or file_exists(filepath_or_buffer)
    943 ):
--> 944     self.handles = get_handle(
    945         filepath_or_buffer,
    946         "r",
    947         encoding=self.encoding,
    948         compression=self.compression,
    949         storage_options=self.storage_options,
    950         errors=self.encoding_errors,
    951     )
    952     filepath_or_buffer = self.handles.handle
    953 elif (
    954     isinstance(filepath_or_buffer, str)
    955     and filepath_or_buffer.lower().endswith(
   (...)
    958     and not file_exists(filepath_or_buffer)
    959 ):

File /home/pandas/pandas/io/common.py:728, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    725     codecs.lookup_error(errors)
    727 # open URLs
--> 728 ioargs = _get_filepath_or_buffer(
    729     path_or_buf,
    730     encoding=encoding,
    731     compression=compression,
    732     mode=mode,
    733     storage_options=storage_options,
    734 )
    736 handle = ioargs.filepath_or_buffer
    737 handles: list[BaseBuffer]

File /home/pandas/pandas/io/common.py:443, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    439             storage_options = dict(storage_options)
    440             storage_options["anon"] = True
    441         file_obj = fsspec.open(
    442             filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
--> 443         ).open()
    445     return IOArgs(
    446         filepath_or_buffer=file_obj,
    447         encoding=encoding,
   (...)
    450         mode=fsspec_mode,
    451     )
    452 elif storage_options:

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:135, in OpenFile.open(self)
    128 def open(self):
    129     """Materialise this as a real open file without context
    130 
    131     The OpenFile object should be explicitly closed to avoid enclosed file
    132     instances persisting. You must, therefore, keep a reference to the OpenFile
    133     during the life of the file-like it generates.
    134     """
--> 135     return self.__enter__()

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:103, in OpenFile.__enter__(self)
    100 def __enter__(self):
    101     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 103     f = self.fs.open(self.path, mode=mode)
    105     self.fobjects = [f]
    107     if self.compression is not None:

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1295, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1293 else:
   1294     ac = kwargs.pop("autocommit", not self._intrans)
-> 1295     f = self._open(
   1296         path,
   1297         mode=mode,
   1298         block_size=block_size,
   1299         autocommit=ac,
   1300         cache_options=cache_options,
   1301         **kwargs,
   1302     )
   1303     if compression is not None:
   1304         from fsspec.compression import compr

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:671, in S3FileSystem._open(self, path, mode, block_size, acl, version_id, fill_cache, cache_type, autocommit, size, requester_pays, cache_options, **kwargs)
    668 if cache_type is None:
    669     cache_type = self.default_cache_type
--> 671 return S3File(
    672     self,
    673     path,
    674     mode,
    675     block_size=block_size,
    676     acl=acl,
    677     version_id=version_id,
    678     fill_cache=fill_cache,
    679     s3_additional_kwargs=kw,
    680     cache_type=cache_type,
    681     autocommit=autocommit,
    682     requester_pays=requester_pays,
    683     cache_options=cache_options,
    684     size=size,
    685 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:2110, in S3File.__init__(self, s3, path, mode, block_size, acl, version_id, fill_cache, s3_additional_kwargs, autocommit, cache_type, requester_pays, cache_options, size)
   2108         self.details = s3.info(path)
   2109         self.version_id = self.details.get("VersionId")
-> 2110 super().__init__(
   2111     s3,
   2112     path,
   2113     mode,
   2114     block_size,
   2115     autocommit=autocommit,
   2116     cache_type=cache_type,
   2117     cache_options=cache_options,
   2118     size=size,
   2119 )
   2120 self.s3 = self.fs  # compatibility
   2122 # when not using autocommit we want to have transactional state to manage

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1651, in AbstractBufferedFile.__init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
   1649         self.size = size
   1650     else:
-> 1651         self.size = self.details["size"]
   1652     self.cache = caches[cache_type](
   1653         self.blocksize, self._fetch_range, self.size, **cache_options
   1654     )
   1655 else:

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1664, in AbstractBufferedFile.details(self)
   1661 @property
   1662 def details(self):
   1663     if self._details is None:
-> 1664         self._details = self.fs.info(self.path)
   1665     return self._details

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:1328, in S3FileSystem._info(self, path, bucket, key, refresh, version_id)
   1323         raise translate_boto_error(e, set_cause=False)
   1325 try:
   1326     # We check to see if the path is a directory by attempting to list its
   1327     # contexts. If anything is found, it is indeed a directory
-> 1328     out = await self._call_s3(
   1329         "list_objects_v2",
   1330         self.kwargs,
   1331         Bucket=bucket,
   1332         Prefix=key.rstrip("/") + "/" if key else "",
   1333         Delimiter="/",
   1334         MaxKeys=1,
   1335         **self.req_kw,
   1336     )
   1337     if (
   1338         out.get("KeyCount", 0) > 0
   1339         or out.get("Contents", [])
   1340         or out.get("CommonPrefixes", [])
   1341     ):
   1342         return {
   1343             "name": "/".join([bucket, key]),
   1344             "type": "directory",
   1345             "size": 0,
   1346             "StorageClass": "DIRECTORY",
   1347         }

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:348, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    346 logger.debug("CALL: %s - %s - %s", method.__name__, akwarglist, kw2)
    347 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 348 return await _error_wrapper(
    349     method, kwargs=additional_kwargs, retries=self.retries
    350 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:140, in _error_wrapper(func, args, kwargs, retries)
    138         err = e
    139 err = translate_boto_error(err)
--> 140 raise err

FileNotFoundError: The specified bucket does not exist

In [231]: df.head(1)
Out[231]: 
          0         1         2         3
0 -1.294524  0.413738  0.276662 -0.472035

2/3

In [232]: storage_options = {"client_kwargs": {"endpoint_url": "http://127.0.0.1:5555"}}

In [233]: df = pd.read_json("s3://pandas-test/test-1", storage_options=storage_options)
---------------------------------------------------------------------------
NoCredentialsError                        Traceback (most recent call last)
File /home/pandas/pandas/io/common.py:432, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    429 try:
    430     file_obj = fsspec.open(
    431         filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
--> 432     ).open()
    433 # GH 34626 Reads from Public Buckets without Credentials needs anon=True

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:135, in OpenFile.open(self)
    129 """Materialise this as a real open file without context
    130 
    131 The OpenFile object should be explicitly closed to avoid enclosed file
    132 instances persisting. You must, therefore, keep a reference to the OpenFile
    133 during the life of the file-like it generates.
    134 """
--> 135 return self.__enter__()

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:103, in OpenFile.__enter__(self)
    101 mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 103 f = self.fs.open(self.path, mode=mode)
    105 self.fobjects = [f]

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1295, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1294 ac = kwargs.pop("autocommit", not self._intrans)
-> 1295 f = self._open(
   1296     path,
   1297     mode=mode,
   1298     block_size=block_size,
   1299     autocommit=ac,
   1300     cache_options=cache_options,
   1301     **kwargs,
   1302 )
   1303 if compression is not None:

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:671, in S3FileSystem._open(self, path, mode, block_size, acl, version_id, fill_cache, cache_type, autocommit, size, requester_pays, cache_options, **kwargs)
    669     cache_type = self.default_cache_type
--> 671 return S3File(
    672     self,
    673     path,
    674     mode,
    675     block_size=block_size,
    676     acl=acl,
    677     version_id=version_id,
    678     fill_cache=fill_cache,
    679     s3_additional_kwargs=kw,
    680     cache_type=cache_type,
    681     autocommit=autocommit,
    682     requester_pays=requester_pays,
    683     cache_options=cache_options,
    684     size=size,
    685 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:2110, in S3File.__init__(self, s3, path, mode, block_size, acl, version_id, fill_cache, s3_additional_kwargs, autocommit, cache_type, requester_pays, cache_options, size)
   2109         self.version_id = self.details.get("VersionId")
-> 2110 super().__init__(
   2111     s3,
   2112     path,
   2113     mode,
   2114     block_size,
   2115     autocommit=autocommit,
   2116     cache_type=cache_type,
   2117     cache_options=cache_options,
   2118     size=size,
   2119 )
   2120 self.s3 = self.fs  # compatibility

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1651, in AbstractBufferedFile.__init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
   1650 else:
-> 1651     self.size = self.details["size"]
   1652 self.cache = caches[cache_type](
   1653     self.blocksize, self._fetch_range, self.size, **cache_options
   1654 )

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1664, in AbstractBufferedFile.details(self)
   1663 if self._details is None:
-> 1664     self._details = self.fs.info(self.path)
   1665 return self._details

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    117 self = obj or args[0]
--> 118 return sync(self.loop, func, *args, **kwargs)

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:1302, in S3FileSystem._info(self, path, bucket, key, refresh, version_id)
   1301 try:
-> 1302     out = await self._call_s3(
   1303         "head_object",
   1304         self.kwargs,
   1305         Bucket=bucket,
   1306         Key=key,
   1307         **version_id_kw(version_id),
   1308         **self.req_kw,
   1309     )
   1310     return {
   1311         "ETag": out.get("ETag", ""),
   1312         "LastModified": out.get("LastModified", ""),
   (...)
   1318         "ContentType": out.get("ContentType"),
   1319     }

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:348, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    347 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 348 return await _error_wrapper(
    349     method, kwargs=additional_kwargs, retries=self.retries
    350 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:140, in _error_wrapper(func, args, kwargs, retries)
    139 err = translate_boto_error(err)
--> 140 raise err

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:388, in AioBaseClient._make_api_call(self, operation_name, api_params)
    387     apply_request_checksum(request_dict)
--> 388     http, parsed_response = await self._make_request(
    389         operation_model, request_dict, request_context
    390     )
    392 await self.meta.events.emit(
    393     'after-call.{service_id}.{operation_name}'.format(
    394         service_id=service_id, operation_name=operation_name
   (...)
    399     context=request_context,
    400 )

File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:416, in AioBaseClient._make_request(self, operation_model, request_dict, request_context)
    415 try:
--> 416     return await self._endpoint.make_request(
    417         operation_model, request_dict
    418     )
    419 except Exception as e:

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:96, in AioEndpoint._send_request(self, request_dict, operation_model)
     95 self._update_retries_context(context, attempts)
---> 96 request = await self.create_request(request_dict, operation_model)
     97 success_response, exception = await self._get_response(
     98     request, operation_model, context
     99 )

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:84, in AioEndpoint.create_request(self, params, operation_model)
     81     event_name = 'request-created.{service_id}.{op_name}'.format(
     82         service_id=service_id, op_name=operation_model.name
     83     )
---> 84     await self._event_emitter.emit(
     85         event_name,
     86         request=request,
     87         operation_name=operation_model.name,
     88     )
     89 prepared_request = self.prepare_request(request)

File /usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py:66, in AioHierarchicalEmitter._emit(self, event_name, kwargs, stop_on_response)
     65 # Await the handler if its a coroutine.
---> 66 response = await resolve_awaitable(handler(**kwargs))
     67 responses.append((handler, response))

File /usr/local/lib/python3.10/site-packages/aiobotocore/_helpers.py:15, in resolve_awaitable(obj)
     14 if inspect.isawaitable(obj):
---> 15     return await obj
     17 return obj

File /usr/local/lib/python3.10/site-packages/aiobotocore/signers.py:24, in AioRequestSigner.handler(self, operation_name, request, **kwargs)
     19 async def handler(self, operation_name=None, request=None, **kwargs):
     20     # This is typically hooked up to the "request-created" event
     21     # from a client's event emitter.  When a new request is created
     22     # this method is invoked to sign the request.
     23     # Don't call this method directly.
---> 24     return await self.sign(operation_name, request)

File /usr/local/lib/python3.10/site-packages/aiobotocore/signers.py:88, in AioRequestSigner.sign(self, operation_name, request, region_name, signing_type, expires_in, signing_name)
     86         raise e
---> 88 auth.add_auth(request)

File /usr/local/lib/python3.10/site-packages/botocore/auth.py:418, in SigV4Auth.add_auth(self, request)
    417 if self.credentials is None:
--> 418     raise NoCredentialsError()
    419 datetime_now = datetime.datetime.utcnow()

NoCredentialsError: Unable to locate credentials

During handling of the above exception, another exception occurred:

EndpointConnectionError                   Traceback (most recent call last)
Cell In[233], line 1
----> 1 df = pd.read_json("s3://pandas-test/test-1", storage_options=storage_options)

File /home/pandas/pandas/io/json/_json.py:791, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
    788 if convert_axes is None and orient != "table":
    789     convert_axes = True
--> 791 json_reader = JsonReader(
    792     path_or_buf,
    793     orient=orient,
    794     typ=typ,
    795     dtype=dtype,
    796     convert_axes=convert_axes,
    797     convert_dates=convert_dates,
    798     keep_default_dates=keep_default_dates,
    799     precise_float=precise_float,
    800     date_unit=date_unit,
    801     encoding=encoding,
    802     lines=lines,
    803     chunksize=chunksize,
    804     compression=compression,
    805     nrows=nrows,
    806     storage_options=storage_options,
    807     encoding_errors=encoding_errors,
    808     dtype_backend=dtype_backend,
    809     engine=engine,
    810 )
    812 if chunksize:
    813     return json_reader

File /home/pandas/pandas/io/json/_json.py:904, in JsonReader.__init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options, encoding_errors, dtype_backend, engine)
    902     self.data = filepath_or_buffer
    903 elif self.engine == "ujson":
--> 904     data = self._get_data_from_filepath(filepath_or_buffer)
    905     self.data = self._preprocess_data(data)

File /home/pandas/pandas/io/json/_json.py:944, in JsonReader._get_data_from_filepath(self, filepath_or_buffer)
    937 filepath_or_buffer = stringify_path(filepath_or_buffer)
    938 if (
    939     not isinstance(filepath_or_buffer, str)
    940     or is_url(filepath_or_buffer)
    941     or is_fsspec_url(filepath_or_buffer)
    942     or file_exists(filepath_or_buffer)
    943 ):
--> 944     self.handles = get_handle(
    945         filepath_or_buffer,
    946         "r",
    947         encoding=self.encoding,
    948         compression=self.compression,
    949         storage_options=self.storage_options,
    950         errors=self.encoding_errors,
    951     )
    952     filepath_or_buffer = self.handles.handle
    953 elif (
    954     isinstance(filepath_or_buffer, str)
    955     and filepath_or_buffer.lower().endswith(
   (...)
    958     and not file_exists(filepath_or_buffer)
    959 ):

File /home/pandas/pandas/io/common.py:728, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    725     codecs.lookup_error(errors)
    727 # open URLs
--> 728 ioargs = _get_filepath_or_buffer(
    729     path_or_buf,
    730     encoding=encoding,
    731     compression=compression,
    732     mode=mode,
    733     storage_options=storage_options,
    734 )
    736 handle = ioargs.filepath_or_buffer
    737 handles: list[BaseBuffer]

File /home/pandas/pandas/io/common.py:443, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    439             storage_options = dict(storage_options)
    440             storage_options["anon"] = True
    441         file_obj = fsspec.open(
    442             filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
--> 443         ).open()
    445     return IOArgs(
    446         filepath_or_buffer=file_obj,
    447         encoding=encoding,
   (...)
    450         mode=fsspec_mode,
    451     )
    452 elif storage_options:

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:135, in OpenFile.open(self)
    128 def open(self):
    129     """Materialise this as a real open file without context
    130 
    131     The OpenFile object should be explicitly closed to avoid enclosed file
    132     instances persisting. You must, therefore, keep a reference to the OpenFile
    133     during the life of the file-like it generates.
    134     """
--> 135     return self.__enter__()

File /usr/local/lib/python3.10/site-packages/fsspec/core.py:103, in OpenFile.__enter__(self)
    100 def __enter__(self):
    101     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 103     f = self.fs.open(self.path, mode=mode)
    105     self.fobjects = [f]
    107     if self.compression is not None:

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1295, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1293 else:
   1294     ac = kwargs.pop("autocommit", not self._intrans)
-> 1295     f = self._open(
   1296         path,
   1297         mode=mode,
   1298         block_size=block_size,
   1299         autocommit=ac,
   1300         cache_options=cache_options,
   1301         **kwargs,
   1302     )
   1303     if compression is not None:
   1304         from fsspec.compression import compr

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:671, in S3FileSystem._open(self, path, mode, block_size, acl, version_id, fill_cache, cache_type, autocommit, size, requester_pays, cache_options, **kwargs)
    668 if cache_type is None:
    669     cache_type = self.default_cache_type
--> 671 return S3File(
    672     self,
    673     path,
    674     mode,
    675     block_size=block_size,
    676     acl=acl,
    677     version_id=version_id,
    678     fill_cache=fill_cache,
    679     s3_additional_kwargs=kw,
    680     cache_type=cache_type,
    681     autocommit=autocommit,
    682     requester_pays=requester_pays,
    683     cache_options=cache_options,
    684     size=size,
    685 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:2110, in S3File.__init__(self, s3, path, mode, block_size, acl, version_id, fill_cache, s3_additional_kwargs, autocommit, cache_type, requester_pays, cache_options, size)
   2108         self.details = s3.info(path)
   2109         self.version_id = self.details.get("VersionId")
-> 2110 super().__init__(
   2111     s3,
   2112     path,
   2113     mode,
   2114     block_size,
   2115     autocommit=autocommit,
   2116     cache_type=cache_type,
   2117     cache_options=cache_options,
   2118     size=size,
   2119 )
   2120 self.s3 = self.fs  # compatibility
   2122 # when not using autocommit we want to have transactional state to manage

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1651, in AbstractBufferedFile.__init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
   1649         self.size = size
   1650     else:
-> 1651         self.size = self.details["size"]
   1652     self.cache = caches[cache_type](
   1653         self.blocksize, self._fetch_range, self.size, **cache_options
   1654     )
   1655 else:

File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1664, in AbstractBufferedFile.details(self)
   1661 @property
   1662 def details(self):
   1663     if self._details is None:
-> 1664         self._details = self.fs.info(self.path)
   1665     return self._details

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:1302, in S3FileSystem._info(self, path, bucket, key, refresh, version_id)
   1300 if key:
   1301     try:
-> 1302         out = await self._call_s3(
   1303             "head_object",
   1304             self.kwargs,
   1305             Bucket=bucket,
   1306             Key=key,
   1307             **version_id_kw(version_id),
   1308             **self.req_kw,
   1309         )
   1310         return {
   1311             "ETag": out.get("ETag", ""),
   1312             "LastModified": out.get("LastModified", ""),
   (...)
   1318             "ContentType": out.get("ContentType"),
   1319         }
   1320     except FileNotFoundError:

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:348, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    346 logger.debug("CALL: %s - %s - %s", method.__name__, akwarglist, kw2)
    347 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 348 return await _error_wrapper(
    349     method, kwargs=additional_kwargs, retries=self.retries
    350 )

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:140, in _error_wrapper(func, args, kwargs, retries)
    138         err = e
    139 err = translate_boto_error(err)
--> 140 raise err

File /usr/local/lib/python3.10/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    111 for i in range(retries):
    112     try:
--> 113         return await func(*args, **kwargs)
    114     except S3_RETRYABLE_ERRORS as e:
    115         err = e

File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:388, in AioBaseClient._make_api_call(self, operation_name, api_params)
    384     maybe_compress_request(
    385         self.meta.config, request_dict, operation_model
    386     )
    387     apply_request_checksum(request_dict)
--> 388     http, parsed_response = await self._make_request(
    389         operation_model, request_dict, request_context
    390     )
    392 await self.meta.events.emit(
    393     'after-call.{service_id}.{operation_name}'.format(
    394         service_id=service_id, operation_name=operation_name
   (...)
    399     context=request_context,
    400 )
    402 if http.status_code >= 300:

File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:416, in AioBaseClient._make_request(self, operation_model, request_dict, request_context)
    412 async def _make_request(
    413     self, operation_model, request_dict, request_context
    414 ):
    415     try:
--> 416         return await self._endpoint.make_request(
    417             operation_model, request_dict
    418         )
    419     except Exception as e:
    420         await self.meta.events.emit(
    421             'after-call-error.{service_id}.{operation_name}'.format(
    422                 service_id=self._service_model.service_id.hyphenize(),
   (...)
    426             context=request_context,
    427         )

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:100, in AioEndpoint._send_request(self, request_dict, operation_model)
     96 request = await self.create_request(request_dict, operation_model)
     97 success_response, exception = await self._get_response(
     98     request, operation_model, context
     99 )
--> 100 while await self._needs_retry(
    101     attempts,
    102     operation_model,
    103     request_dict,
    104     success_response,
    105     exception,
    106 ):
    107     attempts += 1
    108     self._update_retries_context(context, attempts, success_response)

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:262, in AioEndpoint._needs_retry(self, attempts, operation_model, request_dict, response, caught_exception)
    260 service_id = operation_model.service_model.service_id.hyphenize()
    261 event_name = f"needs-retry.{service_id}.{operation_model.name}"
--> 262 responses = await self._event_emitter.emit(
    263     event_name,
    264     response=response,
    265     endpoint=self,
    266     operation=operation_model,
    267     attempts=attempts,
    268     caught_exception=caught_exception,
    269     request_dict=request_dict,
    270 )
    271 handler_response = first_non_none_response(responses)
    272 if handler_response is None:

File /usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py:66, in AioHierarchicalEmitter._emit(self, event_name, kwargs, stop_on_response)
     63 logger.debug('Event %s: calling handler %s', event_name, handler)
     65 # Await the handler if its a coroutine.
---> 66 response = await resolve_awaitable(handler(**kwargs))
     67 responses.append((handler, response))
     68 if stop_on_response and response is not None:

File /usr/local/lib/python3.10/site-packages/aiobotocore/_helpers.py:15, in resolve_awaitable(obj)
     13 async def resolve_awaitable(obj):
     14     if inspect.isawaitable(obj):
---> 15         return await obj
     17     return obj

File /usr/local/lib/python3.10/site-packages/aiobotocore/retryhandler.py:107, in AioRetryHandler._call(self, attempts, response, caught_exception, **kwargs)
    104     retries_context = kwargs['request_dict']['context'].get('retries')
    105     checker_kwargs.update({'retries_context': retries_context})
--> 107 if await resolve_awaitable(self._checker(**checker_kwargs)):
    108     result = self._action(attempts=attempts)
    109     logger.debug("Retry needed, action of: %s", result)

File /usr/local/lib/python3.10/site-packages/aiobotocore/_helpers.py:15, in resolve_awaitable(obj)
     13 async def resolve_awaitable(obj):
     14     if inspect.isawaitable(obj):
---> 15         return await obj
     17     return obj

File /usr/local/lib/python3.10/site-packages/aiobotocore/retryhandler.py:126, in AioMaxAttemptsDecorator._call(self, attempt_number, response, caught_exception, retries_context)
    121 if retries_context:
    122     retries_context['max'] = max(
    123         retries_context.get('max', 0), self._max_attempts
    124     )
--> 126 should_retry = await self._should_retry(
    127     attempt_number, response, caught_exception
    128 )
    129 if should_retry:
    130     if attempt_number >= self._max_attempts:
    131         # explicitly set MaxAttemptsReached

File /usr/local/lib/python3.10/site-packages/aiobotocore/retryhandler.py:165, in AioMaxAttemptsDecorator._should_retry(self, attempt_number, response, caught_exception)
    161         return True
    162 else:
    163     # If we've exceeded the max attempts we just let the exception
    164     # propagate if one has occurred.
--> 165     return await resolve_awaitable(
    166         self._checker(attempt_number, response, caught_exception)
    167     )

File /usr/local/lib/python3.10/site-packages/aiobotocore/_helpers.py:15, in resolve_awaitable(obj)
     13 async def resolve_awaitable(obj):
     14     if inspect.isawaitable(obj):
---> 15         return await obj
     17     return obj

File /usr/local/lib/python3.10/site-packages/aiobotocore/retryhandler.py:174, in AioMultiChecker._call(self, attempt_number, response, caught_exception)
    171 async def _call(self, attempt_number, response, caught_exception):
    172     for checker in self._checkers:
    173         checker_response = await resolve_awaitable(
--> 174             checker(attempt_number, response, caught_exception)
    175         )
    176         if checker_response:
    177             return checker_response

File /usr/local/lib/python3.10/site-packages/botocore/retryhandler.py:247, in BaseChecker.__call__(self, attempt_number, response, caught_exception)
    245     return self._check_response(attempt_number, response)
    246 elif caught_exception is not None:
--> 247     return self._check_caught_exception(
    248         attempt_number, caught_exception
    249     )
    250 else:
    251     raise ValueError("Both response and caught_exception are None.")

File /usr/local/lib/python3.10/site-packages/botocore/retryhandler.py:416, in ExceptionRaiser._check_caught_exception(self, attempt_number, caught_exception)
    408 def _check_caught_exception(self, attempt_number, caught_exception):
    409     # This is implementation specific, but this class is useful by
    410     # coordinating with the MaxAttemptsDecorator.
   (...)
    414     # the MaxAttemptsDecorator is not interested in retrying the exception
    415     # then this exception just propagates out past the retry code.
--> 416     raise caught_exception

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:181, in AioEndpoint._do_get_response(self, request, operation_model, context)
    179     http_response = first_non_none_response(responses)
    180     if http_response is None:
--> 181         http_response = await self._send(request)
    182 except HTTPClientError as e:
    183     return (None, e)

File /usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py:285, in AioEndpoint._send(self, request)
    284 async def _send(self, request):
--> 285     return await self.http_session.send(request)

File /usr/local/lib/python3.10/site-packages/aiobotocore/httpsession.py:253, in AIOHTTPSession.send(self, request)
    247         raise ReadTimeoutError(endpoint_url=request.url, error=e)
    248 except (
    249     ClientConnectorError,
    250     ClientConnectionError,
    251     socket.gaierror,
    252 ) as e:
--> 253     raise EndpointConnectionError(endpoint_url=request.url, error=e)
    254 except asyncio.TimeoutError as e:
    255     raise ReadTimeoutError(endpoint_url=request.url, error=e)

EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5555/pandas-test/test-1"

In [234]: df.head(1)
Out[234]: 
          0         1         2         3
0 -1.294524  0.413738  0.276662 -0.472035

3/3

In [237]: df = pd.read_csv(
   .....:     "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
   .....:     "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
   .....:     storage_options={"s3": {"anon": True}},
   .....: )
   .....: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[237], line 1
----> 1 df = pd.read_csv(
      2     "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
      3     "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
      4     storage_options={"s3": {"anon": True}},
      5 )

File /home/pandas/pandas/io/parsers/readers.py:1024, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1011 kwds_defaults = _refine_defaults_read(
   1012     dialect,
   1013     delimiter,
   (...)
   1020     dtype_backend=dtype_backend,
   1021 )
   1022 kwds.update(kwds_defaults)
-> 1024 return _read(filepath_or_buffer, kwds)

File /home/pandas/pandas/io/parsers/readers.py:618, in _read(filepath_or_buffer, kwds)
    615 _validate_names(kwds.get("names", None))
    617 # Create the parser.
--> 618 parser = TextFileReader(filepath_or_buffer, **kwds)
    620 if chunksize or iterator:
    621     return parser

File /home/pandas/pandas/io/parsers/readers.py:1618, in TextFileReader.__init__(self, f, engine, **kwds)
   1615     self.options["has_index_names"] = kwds["has_index_names"]
   1617 self.handles: IOHandles | None = None
-> 1618 self._engine = self._make_engine(f, self.engine)

File /home/pandas/pandas/io/parsers/readers.py:1878, in TextFileReader._make_engine(self, f, engine)
   1876     if "b" not in mode:
   1877         mode += "b"
-> 1878 self.handles = get_handle(
   1879     f,
   1880     mode,
   1881     encoding=self.options.get("encoding", None),
   1882     compression=self.options.get("compression", None),
   1883     memory_map=self.options.get("memory_map", False),
   1884     is_text=is_text,
   1885     errors=self.options.get("encoding_errors", "strict"),
   1886     storage_options=self.options.get("storage_options", None),
   1887 )
   1888 assert self.handles is not None
   1889 f = self.handles.handle

File /home/pandas/pandas/io/common.py:728, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    725     codecs.lookup_error(errors)
    727 # open URLs
--> 728 ioargs = _get_filepath_or_buffer(
    729     path_or_buf,
    730     encoding=encoding,
    731     compression=compression,
    732     mode=mode,
    733     storage_options=storage_options,
    734 )
    736 handle = ioargs.filepath_or_buffer
    737 handles: list[BaseBuffer]

File /home/pandas/pandas/io/common.py:453, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    445     return IOArgs(
    446         filepath_or_buffer=file_obj,
    447         encoding=encoding,
   (...)
    450         mode=fsspec_mode,
    451     )
    452 elif storage_options:
--> 453     raise ValueError(
    454         "storage_options passed with file object or non-fsspec file path"
    455     )
    457 if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
    458     return IOArgs(
    459         filepath_or_buffer=_expand_user(filepath_or_buffer),
    460         encoding=encoding,
   (...)
    463         mode=mode,
    464     )

ValueError: storage_options passed with file object or non-fsspec file path

@datapythonista
Member

I think the output should be the same as in this page:
https://pandas.pydata.org/docs/dev/user_guide/io.html

Do you mind sending a link within that page to the example with the very long dataframe, and to one of the failing cases, so I can see how they are shown now, please?

@JackCollins91
Contributor Author

I think the output should be the same as in this page: https://pandas.pydata.org/docs/dev/user_guide/io.html

Do you mind sending a link within that page to the example with the very long dataframe, and to one of the failing cases, so I can see how they are shown now, please?

Hi @datapythonista thanks for looking this over. I'm sorry but I'm not sure I understand your request.

Do you want the PR to make the new io.html page look exactly the same as the existing one? I can do that, but only one of the S3 bucket examples would then show what the output looks like. If that's ok, I can do so.

I am not sure what kind of 'link' you are requesting above. If you mean you want to see what the io page will look like after this PR is merged, I have that here...
Preview_IO tools (text, CSV, HDF5, …) — pandas 3.0.0.dev0+101.ge705ee4507 documentation.pdf

And if you want to see a page preview that shows what the S3 related errors are in the page, that can be seen here
S3_Errors_IO tools (text, CSV, HDF5, …) — pandas 3.0.0.dev0+100.g5ea2948d2a.dirty documentation.pdf

Let me know if this is what you wanted or if I have misunderstood. Sorry for the inconvenience.

Cheers

@datapythonista
Member

Sorry, what I would like is to see how the example with the big dataframe looks now, before merging this PR (I don't know which example it is). Same for one of the failing ones.

My understanding is that what we want is to simply stop running code that accesses S3 buckets. If the dataframes aren't being displayed now, I would expect to keep it that way.

If I missed the point and we want to start showing output in examples that currently don't, I'd personally do this in two steps: in this PR, replace the ipython blocks with code-block blocks while keeping exactly the same output, and then display the content we want in a separate PR.

Let me know if I'm misunderstanding things. @mroeschke, I think you are the one who wanted to stop running code that accesses S3. Does what I'm saying make sense?

@JackCollins91
Contributor Author

JackCollins91 commented Jan 14, 2024

Sorry, what I would like is to see how the example with the big dataframe looks now, before merging this PR (I don't know which example it is). Same for one of the failing ones.

My understanding is that what we want is to simply stop running code that accesses S3 buckets. If the dataframes aren't being displayed now, I would expect to keep it that way.

If I missed the point and we want to start showing output in examples that currently don't, I'd personally do this in two steps: in this PR, replace the ipython blocks with code-block blocks while keeping exactly the same output, and then display the content we want in a separate PR.

Let me know if I'm misunderstanding things. @mroeschke, I think you are the one who wanted to stop running code that accesses S3. Does what I'm saying make sense?

Thanks for the clarification @datapythonista .

Regarding the large DataFrame (happy to revert if not desired)
Before:
[screenshot]

After:
[screenshot]

Regarding the failing code blocks, I have listed the failing output as code blocks in the comment above - is that sufficient?

Regarding the code block which caused the original issue #56592

Before (issue is it makes an S3 call):

[screenshot]

After (no S3 call):
[screenshot]

Hopefully I've provided the info you wanted in a convenient way, but let me know if there are any issues. Sorry, I'm new to open development in the pandas community.

avoids unnecessary file change in PR
@datapythonista
Member

Why do you prefer your version over what's in the original? I'm unsure what you are trying to achieve, sorry.

@JackCollins91
Contributor Author

Why do you prefer your version over what's in the original? I'm unsure what you are trying to achieve, sorry.

Hi @datapythonista,
The two altered S3 bucket examples no longer make any requests to the S3 buckets (thereby eliminating build failures caused by being unable to connect to the bucket), but they still display the output that the code would have produced had it run. This fixes #56592.

You requested this to be the case for every S3 bucket example, but right now it is only possible for the two listed above.

Is this a reasonable motivation for the PR?

@datapythonista
Member

Yes, absolutely, your changes are great.

I think there is only one thing on which we are not understanding each other. What I'm suggesting is to not change the output: if output is being displayed now, continue to display it; if it is not, continue to not display it. There was a comment earlier in the review about showing the output. What I understood is that the changes you made at that point stopped showing the output that previously existed. My understanding is that with your changes now, you start showing output in places where we currently don't show it.

The output that you are showing is, as you say, a bit tricky: the dataframes are huge, so you show the columns instead. My last question was: why do you think introducing this output is better than continuing to not show output? Is it just because you were asked to show the output in the review comment, or is it because you personally think showing the columns is better than not showing anything?

My suggestion is that in this PR you just focus on making the changes to stop running the examples that query S3. This should be non-controversial and we can get it merged quickly. Changing the output is way more opinionated and will take more discussion. I would do that in a follow-up PR, if you want to do it.

Does this make sense?

Rollback changes to one of the examples (out of scope)
@JackCollins91
Contributor Author

Yes, absolutely, your changes are great.

I think there is only one thing on which we are not understanding each other. What I'm suggesting is to not change the output: if output is being displayed now, continue to display it; if it is not, continue to not display it. There was a comment earlier in the review about showing the output. What I understood is that the changes you made at that point stopped showing the output that previously existed. My understanding is that with your changes now, you start showing output in places where we currently don't show it.

The output that you are showing is, as you say, a bit tricky: the dataframes are huge, so you show the columns instead. My last question was: why do you think introducing this output is better than continuing to not show output? Is it just because you were asked to show the output in the review comment, or is it because you personally think showing the columns is better than not showing anything?

My suggestion is that in this PR you just focus on making the changes to stop running the examples that query S3. This should be non-controversial and we can get it merged quickly. Changing the output is way more opinionated and will take more discussion. I would do that in a follow-up PR, if you want to do it.

Does this make sense?

Hi @datapythonista, this makes sense to me now. Thanks so much for taking the time to clarify. You have correctly assessed the PR, and yes, I only added the large DF outputs because I thought there was a desire for consistency among the S3 examples. However, I now understand the preference is to keep the document the same as before, with the sole change being that S3 calls are no longer made.

Code chunks that previously had no output should continue to have none, and likewise code chunks that display output should continue to display it.

I have now made a commit that does exactly this.

Let me know if any issues and thanks again :)

... "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
... xpath=".//journal-meta",
...)
>>> df.head(1)
Member

Doesn't the file contain only one row already? I see that now we simply show df and the output is the same. Am I missing something?

Suggested change:
- >>> df.head(1)
+ >>> df

Contributor Author

Agreed it's redundant. Will revise now.

Member

@datapythonista left a comment

Looks good to me. I'll let @mroeschke have a look before merging.

What is the plan regarding the other blocks that access s3 buckets?

@mroeschke added this to the 3.0 milestone Jan 15, 2024
Member

@mroeschke left a comment

LGTM. Yeah, if you're interested in following up, another PR turning all the examples in our docs that access S3 buckets into code-blocks would be appreciated.

@mroeschke merged commit 1af1030 into pandas-dev:main Jan 15, 2024
18 checks passed
@mroeschke
Member

Thanks @JackCollins91

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…56762)

* Update io.rst

Make consistent with other s3 bucket URL examples and avoid doc build error when problem with s3 url.

* Update io.rst

Make example consistent with other code block examples

* Update v2.3.0.rst

* imitating interactive mode

For each S3 bucket code block, ideally we show what the output would be, but without making an actual call. Unfortunately, for several of the S3 buckets, there are issues with the code, which we must fix in another commit or PR.

For now, for the two S3 examples that do work, we edit the code blocks to show what the output would have been if the code had run successfully.

Find details on issues in conversation on PR pandas-dev#56592

* Update io.rst

Code still doesn't run, but at least unmatched } is no longer the issue.

* Update v2.3.0.rst

avoids unnecessary file change in PR

* Update io.rst

Rollback changes to one of the examples (out of scope)

* Update io.rst

* Update io.rst

---------

Co-authored-by: JackCollins1991 <[email protected]>

Successfully merging this pull request may close these issues.

BUG: Docs won't build (S3 bucket does not exist)