
Adds AVRO_PARSER_NUM_MINIBATCH to override num_minibatches and logs the parsing time #1283

Merged · 47 commits · Mar 18, 2021

Conversation

i-ony (Contributor) commented Jan 22, 2021

Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request.
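For context on how the original, environment-variable form of this override would be consumed, here is a minimal sketch assuming the kernel reads the variable with std::getenv; the helper name and the "0 means unset" convention are illustrative, not the exact code in this PR.

#include <cstddef>
#include <cstdlib>

// Sketch: returns the minibatch override requested via AVRO_PARSER_NUM_MINIBATCH,
// or 0 if the variable is unset or invalid (0 meaning: keep the built-in heuristic).
std::size_t GetAvroParserNumMinibatchOverride() {
  const char* value = std::getenv("AVRO_PARSER_NUM_MINIBATCH");
  if (value == nullptr) return 0;
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (end == value || parsed <= 0) return 0;  // ignore non-numeric or non-positive values
  return static_cast<std::size_t>(parsed);
}

A job requesting, say, 8 vcores would then export AVRO_PARSER_NUM_MINIBATCH=8 before running; later commits in this PR replace the environment variable with an op attribute (see the discussion below).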

google-cla bot commented Jan 22, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

1 similar comment

google-cla bot commented Jan 22, 2021 (same CLA notice as above)

burgerkingeater (Member) left a comment

thanks. The change also includes logging the parsing time. Can you update the PR title/description?

i-ony changed the title from "Adds AVRO_PARSER_NUM_MINIBATCH to override num_minibatches" to "Adds AVRO_PARSER_NUM_MINIBATCH to override num_minibatches and logs the parsing time" on Jan 25, 2021
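On the parsing-time logging now reflected in the title: the sketch below shows one common way to time and log a parse inside a TensorFlow kernel, using Env::Default()->NowMicros() and LOG(INFO). It illustrates the technique only, not necessarily the exact code merged in this PR; ParseAvroRecords is a placeholder name.

#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/logging.h"

// Times the parse and logs the wall-clock duration in microseconds.
void ParseAndLogDuration(/* parser inputs elided */) {
  const tensorflow::uint64 start_us = tensorflow::Env::Default()->NowMicros();
  // ParseAvroRecords(...);  // placeholder for the actual Avro parsing work
  const tensorflow::uint64 end_us = tensorflow::Env::Default()->NowMicros();
  LOG(INFO) << "Avro parsing took " << (end_us - start_us) << " us";
}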
ashahab (Contributor) commented Jan 25, 2021

@googlebot I consent.

yongtang (Member) commented

@StanfordMCP @burgerkingeater Is there any additional information about AVRO_PARSER_NUM_MINIBATCH and its impact? From the PR, the relationship between AVRO_PARSER_NUM_MINIBATCH and the aforementioned vcore request is not very clear.

i-ony (Contributor, Author) commented Jan 26, 2021

@ashahab please comment

ashahab (Contributor) commented Jan 26, 2021

@yongtang AVRO_PARSER_NUM_MINIBATCH changes the number of minibatches into which the Avro parser groups batches for parallelism.
The Avro parser's parallelism is identical to tfrecord's: https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/core/util/example_proto_fast_parsing.cc#L1195
Pasting the comment below. Feel free to give us feedback on how best to communicate the intent of the environment variable.
The intent is to let users/running processes set it (for example, equal to the vcore request).

// TODO(lew): A big performance low-hanging fruit here is to improve
  //   num_minibatches calculation to take into account actual amount of work
  //   needed, as the size in bytes is not perfect. Linear combination of
  //   size in bytes and average number of features per example is promising.
  //   Even better: measure time instead of estimating, but this is too costly
  //   in small batches.
  //   Maybe accept outside parameter #num_minibatches?
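For readers who do not want to follow the link: the heuristic groups serialized records into minibatches by accumulated byte size and parallelizes across those groups, and num_minibatches is simply how many groups come out. A simplified sketch of that grouping follows (the byte threshold and function name are assumptions for illustration, not the Avro parser's exact values):

#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first record in each minibatch; the number of
// minibatches (the vector's size) is what AVRO_PARSER_NUM_MINIBATCH overrides.
std::vector<std::size_t> PlanMinibatches(const std::vector<std::string>& serialized,
                                         std::size_t target_bytes = 50000) {
  std::vector<std::size_t> first_example_of_minibatch;
  std::size_t minibatch_bytes = 0;
  for (std::size_t i = 0; i < serialized.size(); ++i) {
    if (minibatch_bytes == 0) {
      first_example_of_minibatch.push_back(i);  // start a new minibatch
    }
    minibatch_bytes += serialized[i].size() + 1;
    if (minibatch_bytes > target_bytes) {
      minibatch_bytes = 0;  // the next record starts a new minibatch
    }
  }
  return first_example_of_minibatch;
}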

yongtang (Member) commented

Thanks @ashahab @StanfordMCP for the background. In that case I think an attribute would be preferable to an environment variable, as we rarely use env vars for parameter fine-tuning of kernel ops.

Would you mind updating the PR to use an attribute instead? The attribute can be passed in a similar way as other attributes:

OP_REQUIRES_OK(ctx, ctx->GetAttr("sparse_types", &sparse_types_));
OP_REQUIRES_OK(ctx, ctx->GetAttr("dense_types", &dense_types_));
OP_REQUIRES_OK(ctx, ctx->GetAttr("dense_shapes", &dense_shapes_));
OP_REQUIRES_OK(ctx, ctx->GetAttr("sparse_keys", &sparse_keys_));
OP_REQUIRES_OK(ctx, ctx->GetAttr("dense_keys", &dense_keys_));

You can set the default value of the attribute, e.g.:

.Attr("container: string = ''")

so that the user does not need to pass the attribute explicitly.
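As a concrete sketch of this suggestion (the attribute name, default, and error text are assumptions; the merged code may differ), the attribute is registered with a default and read in the kernel constructor:

#include "tensorflow/core/framework/op_kernel.h"

// In the op registration, a defaulted attribute lets callers omit it entirely:
//   .Attr("avro_num_minibatches: int = 0")  // 0 = keep the built-in heuristic

class ExampleParseAvroOp : public tensorflow::OpKernel {
 public:
  explicit ExampleParseAvroOp(tensorflow::OpKernelConstruction* ctx)
      : tensorflow::OpKernel(ctx) {
    OP_REQUIRES_OK(ctx, ctx->GetAttr("avro_num_minibatches",
                                     &avro_num_minibatches_));
    OP_REQUIRES(ctx, avro_num_minibatches_ >= 0,
                tensorflow::errors::InvalidArgument(
                    "`avro_num_minibatches` should be >= 0, got ",
                    avro_num_minibatches_));
  }

 private:
  tensorflow::int64 avro_num_minibatches_ = 0;
};

The non-negative check mirrors the reviewer request further down; validation against the documented [1, 64] range would go in the same place.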

kvignesh1420 (Member) commented
@ashahab @StanfordMCP any update on this PR?

i-ony added a commit to i-ony/io that referenced this pull request Mar 5, 2021
-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)
burgerkingeater (Member) commented
@StanfordMCP can you fix the build issue?

yongtang (Member) commented Mar 8, 2021

I think the build issue is unrelated and has been fixed in #1320. Can you rebase to the latest master? I think that will fix the build issue.

i-ony added a commit to i-ony/io that referenced this pull request Mar 8, 2021
-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)
google-cla bot commented Mar 8, 2021 (same CLA notice as above)

ashahab (Contributor) commented Mar 8, 2021

@googlebot I consent.

google-cla bot commented Mar 8, 2021 (same CLA notice as above)

burgerkingeater (Member) left a comment

Thanks for making the change! Left a few comments.

@@ -204,8 +207,13 @@ Status ParseAvro(const AvroParserConfig& config,
minibatch_bytes = 0;
}
}
// 'special logic'
const size_t min_minibatches = std::min<size_t>(8, serialized.size());
if (avro_num_minibatches_) {
Member

if avro_num_minibatches_ is not set, what's the default value? can you comment it in the code?

Contributor Author

Updated

}
// This is to ensure users can control the num minibatches all the way down
// to a size of 1 (no parallelism).
const size_t min_minibatches = std::min<size_t>(1, serialized.size());
Member

can you also comment that avro_num_minibatches_ is [1, 64] in the op interface?

Contributor Author

Updated

@@ -401,6 +421,11 @@ class ParseAvroOp : public OpKernel {
dense_shapes_[d].dims() > 1 && dense_shapes_[d].dim_size(0) == -1;
}

// Check that avro_num_minibatches is positive
Member

nit: Check that avro_num_minibatches is non-negative

Contributor Author

Updated
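Putting these review points together, the minibatch-count selection implied by the diff hunks and comments above looks roughly like the sketch below: a non-zero attribute overrides the byte-size heuristic, the count can go all the way down to 1 (no parallelism), and it stays within the documented [1, 64] range. The helper name and exact clamping are assumptions; only the variable names and the 64 cap come from the snippets and comments above.

#include <algorithm>
#include <cstddef>

// Sketch: choose how many minibatches to parse in parallel.
//   heuristic_count - count derived from accumulated byte sizes (see earlier sketch)
//   num_records     - serialized.size()
//   attr_override   - the avro_num_minibatches attribute; 0 means "not set"
std::size_t ChooseNumMinibatches(std::size_t heuristic_count,
                                 std::size_t num_records,
                                 std::size_t attr_override) {
  constexpr std::size_t kMaxMinibatches = 64;  // documented range is [1, 64]
  // Users can force the count all the way down to 1 (no parallelism).
  const std::size_t min_minibatches = std::min<std::size_t>(1, num_records);
  const std::size_t wanted = attr_override > 0 ? attr_override : heuristic_count;
  return std::max(min_minibatches,
                  std::min({wanted, num_records, kMaxMinibatches}));
}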

ashahab and others added 9 commits March 15, 2021 11:37
-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues
-Adds test method for _require() function
-This update adds a test to check if ValueErrors
are raised when given an invalid input for num_parallel_calls
* Bump Apache Arrow to 2.0.0

Also bumps Apache Thrift to 0.13.0

Signed-off-by: Yong Tang <[email protected]>

* Update code to match Arrow

Signed-off-by: Yong Tang <[email protected]>

* Bump pyarrow to 2.0.0

Signed-off-by: Yong Tang <[email protected]>

* Stay with version=1 for write_feather to pass tests

Signed-off-by: Yong Tang <[email protected]>

* Bump flatbuffers to 1.12.0

Signed-off-by: Yong Tang <[email protected]>

* Fix Windows issue

Signed-off-by: Yong Tang <[email protected]>

* Fix tests

Signed-off-by: Yong Tang <[email protected]>

* Fix Windows

Signed-off-by: Yong Tang <[email protected]>

* Remove -std=c++11 and leave default -std=c++14 for arrow build

Signed-off-by: Yong Tang <[email protected]>

* Update sha256 of libapr1

As the hash was changed by the repo.

Signed-off-by: Yong Tang <[email protected]>
* Bump com_github_googleapis_google_cloud_cpp to `1.21.0`

* Add gcs testbench

* Bump `libcurl` to `7.69.1`
Building shared libraries on CentOS 8 is pretty much the same as
on Ubuntu 20.04, except that `apt` should be changed to `yum`. For that reason
our CentOS 8 CI test is not adding a lot of value.

Furthermore, with the upcoming CentOS 8 change:
https://www.phoronix.com/scan.php?page=news_item&px=CentOS-8-Ending-For-Stream

CentOS 8 is effectively EOLed in 2021.

For that reason we may want to drop the CentOS 8 build (and only leave a comment in README.md).

Note that we keep the CentOS 7 build for now as there are still many users on
CentOS 7, and CentOS 7 will only be EOLed in 2024. We might drop the CentOS 7 build in
the future as well if there are similar changes to CentOS 7 as with CentOS 8.

Signed-off-by: Yong Tang <[email protected]>
* [s3] add support for testing on macOS

* modify docker-compose cmd
* Refactor README.md content

* bump to run ci jobs
yongtang and others added 16 commits March 15, 2021 11:37
…low#1299)

When GitHub Actions runs, it looks like there is an implicit limit on
concurrent jobs. As such, the CentOS 7 test is normally scheduled after
other jobs complete. However, the CentOS 7 test often hangs
(e.g., https://github.com/tensorflow/io/runs/1825943449). This is likely
because the CentOS 7 test sits on the GitHub Actions queue for too long.

This PR moves CentOS 7 to run after the Ubuntu 20.04 test completes, to try to
avoid hangs.

Signed-off-by: Yong Tang <[email protected]>
We renamed the tests to remove the "_eager" parts. This PR updates the api test to use the correct filenames.

Signed-off-by: Yong Tang <[email protected]>
Fixes wrong benchmark test names caused by the last commit.

Signed-off-by: Yong Tang <[email protected]>
…1304)

This PR patches arrow to temporarily resolve the ARROW-11518 issue.

See 1281 for details

Credit to diggerk.

We will update arrow after the upstream PR is merged.

Signed-off-by: Yong Tang <[email protected]>
tensorflow#1241)

* Remove external headers from tensorflow, and use third_party headers instead

This PR removes external headers from tensorflow, and
uses third_party headers instead.

Signed-off-by: Yong Tang <[email protected]>

* Address review comment

Signed-off-by: Yong Tang <[email protected]>
* Switch to modular file system for hdfs

This PR is part of the effort to switch to modular file system for hdfs.
When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will
be preserved.

Signed-off-by: Yong Tang <[email protected]>

* Build against tf-nightly

Signed-off-by: Yong Tang <[email protected]>

* Update tests

Signed-off-by: Yong Tang <[email protected]>

* Adjust the if else logic, follow review comment

Signed-off-by: Yong Tang <[email protected]>
With the tensorflow upgrade to tf-nightly, the test_write_kafka test
is failing, and that is blocking the plan for the modular file system migration.

This PR disables the test temporarily so that CI can continue
to push tensorflow-io-nightly image (needed for modular file system migration)

Signed-off-by: Yong Tang <[email protected]>
This PR is part of the effort to switch to modular file system for s3.
When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will
be preserved.

Signed-off-by: Yong Tang <[email protected]>
-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)
Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request.
-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)
google-cla bot commented Mar 15, 2021 (same CLA notice as above)

ashahab (Contributor) commented Mar 16, 2021

@googlebot I consent

google-cla bot commented Mar 16, 2021 (same CLA notice as above)

burgerkingeater (Member) left a comment

Thanks, LGTM.

i-ony (Contributor, Author) commented Mar 17, 2021

@googlebot I consent.

google-cla bot commented Mar 17, 2021 (same CLA notice as above)

burgerkingeater (Member) commented

@yongtang the failure "GitHub CI / Test 3.9 Linux (pull_request), failing after 22m" is not related to this PR, but it cannot be merged until all checks pass. Do you have any idea how we can proceed with this?

ashahab (Contributor) commented Mar 17, 2021

@googlebot I consent

google-cla bot commented Mar 17, 2021 (same CLA notice as above)

yongtang (Member) commented

@burgerkingeater I can override the merge. We will see if the issue still happens after the merge.

yongtang merged commit e5fac57 into tensorflow:master on Mar 18, 2021
michaelbanfield pushed a commit to michaelbanfield/io that referenced this pull request Mar 30, 2021
…he parsing time (tensorflow#1283)

* Exposes num_parallel_reads and num_parallel_calls

-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues
-Adds test method for _require() function
-This update adds a test to check if ValueErrors
are raised when given an invalid input for num_parallel_calls

* Bump Apache Arrow to 2.0.0 (tensorflow#1231)

* Bump Apache Arrow to 2.0.0

Also bumps Apache Thrift to 0.13.0

Signed-off-by: Yong Tang <[email protected]>

* Update code to match Arrow

Signed-off-by: Yong Tang <[email protected]>

* Bump pyarrow to 2.0.0

Signed-off-by: Yong Tang <[email protected]>

* Stay with version=1 for write_feather to pass tests

Signed-off-by: Yong Tang <[email protected]>

* Bump flatbuffers to 1.12.0

Signed-off-by: Yong Tang <[email protected]>

* Fix Windows issue

Signed-off-by: Yong Tang <[email protected]>

* Fix tests

Signed-off-by: Yong Tang <[email protected]>

* Fix Windows

Signed-off-by: Yong Tang <[email protected]>

* Remove -std=c++11 and leave default -std=c++14 for arrow build

Signed-off-by: Yong Tang <[email protected]>

* Update sha256 of libapr1

As the hash was changed by the repo.

Signed-off-by: Yong Tang <[email protected]>

* Add emulator for gcs (tensorflow#1234)

* Bump com_github_googleapis_google_cloud_cpp to `1.21.0`

* Add gcs testbench

* Bump `libcurl` to `7.69.1`

* Remove the CI build for CentOS 8 (tensorflow#1237)

Building shared libraries on CentOS 8 is pretty much the same as
on Ubuntu 20.04, except that `apt` should be changed to `yum`. For that reason
our CentOS 8 CI test is not adding a lot of value.

Furthermore, with the upcoming CentOS 8 change:
https://www.phoronix.com/scan.php?page=news_item&px=CentOS-8-Ending-For-Stream

CentOS 8 is effectively EOLed in 2021.

For that reason we may want to drop the CentOS 8 build (and only leave a comment in README.md).

Note that we keep the CentOS 7 build for now as there are still many users on
CentOS 7, and CentOS 7 will only be EOLed in 2024. We might drop the CentOS 7 build in
the future as well if there are similar changes to CentOS 7 as with CentOS 8.

Signed-off-by: Yong Tang <[email protected]>

* add tf-c-header rule (tensorflow#1244)

* Skip  tf-nightly:tensorflow-io==0.17.0 on API compatibility test (tensorflow#1247)

Signed-off-by: Yong Tang <[email protected]>

* [s3] add support for testing on macOS (tensorflow#1253)

* [s3] add support for testing on macOS

* modify docker-compose cmd

* add notebook formatting instruction in README (tensorflow#1256)

* [docs] Restructure README.md content (tensorflow#1257)

* Refactor README.md content

* bump to run ci jobs

* Update libtiff/libgeotiff dependency (tensorflow#1258)

This PR updates libtiff/libgeotiff to the latest version.

Signed-off-by: Yong Tang <[email protected]>

* remove unstable elasticsearch test setup on macOS (tensorflow#1263)

* Exposes num_parallel_reads and num_parallel_calls (tensorflow#1232)

-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues
- Adds test method for _require() function
-This update adds a test to check if ValueErrors
are raised when given an invalid input for num_parallel_calls

Co-authored-by: Abin Shahab <[email protected]>

* Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches

Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request.

* Exposes num_parallel_reads and num_parallel_calls (tensorflow#1232)

* Exposes num_parallel_reads and num_parallel_calls

-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues

* Exposes num_parallel_reads and num_parallel_calls

-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues

* Exposes num_parallel_reads and num_parallel_calls

-Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset`
-Adds parameter constraints
-Fixes lint issues

* Fixes Lint Issues

* Removes Optional typing for method parameter

-

* Adds test method for _require() function

-This update adds a test to check if ValueErrors
are raised when given an invalid input for num_parallel_calls

* Uncomments skip for macOS pytests

* Fixes Lint issues

Co-authored-by: Abin Shahab <[email protected]>

* add avro tutorial testing data (tensorflow#1267)

Co-authored-by: Cheng Ren <[email protected]>

* Update Kafka tutorial to work with Apache Kafka (tensorflow#1266)

* Update Kafka tutorial to work with Apache Kafka

Minor update to the Kafka tutorial to remove the dependency on
Confluent's distribution of Kafka, and instead work with vanilla
Apache Kafka.

Signed-off-by: Dale Lane <[email protected]>

* Address review comments

Remove redundant pip install commands

Signed-off-by: Dale Lane <[email protected]>

* add github workflow for performance benchmarking (tensorflow#1269)

* add github workflow for performance benchmarking

* add github-action-benchmark step

* handle missing dependencies while benchmarking (tensorflow#1271)

* handle missing dependencies while benchmarking

* setup test_sql

* job name change

* set auto-push to true

* remove auto-push

* add personal access token

* use alternate method to push to gh-pages

* add name to the action

* use different id

* modify creds

* use github_token

* change repo name

* set auto-push

* set origin and push results

* set env

* use PERSONAL_GITHUB_TOKEN

* use push changes action

* use github.head_ref to push the changes

* try using fetch-depth

* modify branch name

* use alternative push approach

* git switch -

* test by merging with forked master

* Disable s3 macOS for now as docker is not working on GitHub Actions for macOS (tensorflow#1277)

* Revert "[s3] add support for testing on macOS (tensorflow#1253)"

This reverts commit 81789bd.

Signed-off-by: Yong Tang <[email protected]>

* Update

Signed-off-by: Yong Tang <[email protected]>

* rename testing data files (tensorflow#1278)

* Add tutorial for avro dataset API (tensorflow#1250)

* remove docker based mongodb tests in macos (tensorflow#1279)

* trigger benchmarks workflow only on commits (tensorflow#1282)

* Bump Apache Arrow to 3.0.0 (tensorflow#1285)

Signed-off-by: Yong Tang <[email protected]>

* Add bazel cache (tensorflow#1287)

Signed-off-by: Yong Tang <[email protected]>

* Add initial bigtable stub test (tensorflow#1286)

* Add initial bigtable stub test

Signed-off-by: Yong Tang <[email protected]>

* Fix kokoro test

Signed-off-by: Yong Tang <[email protected]>

* Add reference to github-pages benchmarks in README (tensorflow#1289)

* add reference to github-pages benchmarks

* minor grammar change

* Update README.md

Co-authored-by: Yuan Tang <[email protected]>

Co-authored-by: Yuan Tang <[email protected]>

* Clear outputs (tensorflow#1292)

* fix kafka online-learning section in tutorial notebook (tensorflow#1274)

* kafka notebook fix for colab env

* change timeout from 30 to 20 seconds

* reduce stream_timeout

* Only enable bazel caching writes for tensorflow/io github actions (tensorflow#1293)

This PR updates the workflows so that only GitHub Actions runs on the
tensorflow/io repo are enabled with bazel cache writes.

Without the update, a forked repo's actions would cause an error.

Note that once bazel cache read permissions are enabled from gcs,
forked repos will be able to access the bazel cache (read-only).

Signed-off-by: Yong Tang <[email protected]>

* Enable read-only bazel cache (tensorflow#1294)

This PR enables read-only bazel cache

Signed-off-by: Yong Tang <[email protected]>

* Rename tests (tensorflow#1297)

* Combine Ubuntu 20.04 and CentOS 7 tests into one GitHub jobs (tensorflow#1299)

When GitHub Actions runs, it looks like there is an implicit limit on
concurrent jobs. As such, the CentOS 7 test is normally scheduled after
other jobs complete. However, the CentOS 7 test often hangs
(e.g., https://github.com/tensorflow/io/runs/1825943449). This is likely
because the CentOS 7 test sits on the GitHub Actions queue for too long.

This PR moves CentOS 7 to run after the Ubuntu 20.04 test completes, to try to
avoid hangs.

Signed-off-by: Yong Tang <[email protected]>

* Update names of api tests (tensorflow#1300)

We renamed the tests to remove the "_eager" parts. This PR updates the api test to use the correct filenames.

Signed-off-by: Yong Tang <[email protected]>

* Fix wrong benchmark tests names (tensorflow#1301)

Fixes wrong benchmark test names caused by the last commit.

Signed-off-by: Yong Tang <[email protected]>

* Patch arrow to temporarily resolve the ARROW-11518 issue (tensorflow#1304)

This PR patches arrow to temporarily resolve the ARROW-11518 issue.

See 1281 for details

Credit to diggerk.

We will update arrow after the upstream PR is merged.

Signed-off-by: Yong Tang <[email protected]>

* Remove AWS headers from tensorflow, and use headers from third_party … (tensorflow#1241)

* Remove external headers from tensorflow, and use third_party headers instead

This PR removes external headers from tensorflow, and
uses third_party headers instead.

Signed-off-by: Yong Tang <[email protected]>

* Address review comment

Signed-off-by: Yong Tang <[email protected]>

* Switch to use github to download libgeotiff (tensorflow#1307)

Signed-off-by: Yong Tang <[email protected]>

* Add @com_google_absl//absl/strings:cord (tensorflow#1308)

Fix read/STDIN_FILENO

Signed-off-by: Yong Tang <[email protected]>

* Switch to modular file system for hdfs (tensorflow#1309)

* Switch to modular file system for hdfs

This PR is part of the effort to switch to modular file system for hdfs.
When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will
be preserved.

Signed-off-by: Yong Tang <[email protected]>

* Build against tf-nightly

Signed-off-by: Yong Tang <[email protected]>

* Update tests

Signed-off-by: Yong Tang <[email protected]>

* Adjust the if else logic, follow review comment

Signed-off-by: Yong Tang <[email protected]>

* Disable test_write_kafka test for now. (tensorflow#1310)

With the tensorflow upgrade to tf-nightly, the test_write_kafka test
is failing, and that is blocking the plan for the modular file system migration.

This PR disables the test temporarily so that CI can continue
to push tensorflow-io-nightly image (needed for modular file system migration)

Signed-off-by: Yong Tang <[email protected]>

* Switch to modular file system for s3 (tensorflow#1312)

This PR is part of the effort to switch to modular file system for s3.
When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will
be preserved.

Signed-off-by: Yong Tang <[email protected]>

* Add python 3.9 on Windows (tensorflow#1316)

* Updates the PR to use attribute instead of Env Variable

-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)

* Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches

Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request.

* Updates the PR to use attribute instead of Env Variable

-Originally AVRO_PARSER_NUM_MINIBATCH was set as an environment
variable. Because tensorflow-io rarely uses env vars to fine-tune
kernel ops, this was changed to an attribute. See comment here:
tensorflow#1283 (comment)

* Adds additional comments in source code for understandability

Co-authored-by: Abin Shahab <[email protected]>
Co-authored-by: Yong Tang <[email protected]>
Co-authored-by: Vo Van Nghia <[email protected]>
Co-authored-by: Vignesh Kothapalli <[email protected]>
Co-authored-by: Cheng Ren <[email protected]>
Co-authored-by: Cheng Ren <[email protected]>
Co-authored-by: Dale Lane <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Mark Daoust <[email protected]>
teskobif7 added a commit to teskobif7/io that referenced this pull request Aug 14, 2024
…he parsing time (#1283)
