Cache object code in memory instead of entire module. #4

augustoasilva · 2021-05-27T21:01:19Z

No description provided.


For serial CSV readers, track the absolute row number and report it in errors encountered during parsing or conversion. I tried to get row numbers for the parallel reader as well, but the only approach I could think of was adding delimiter counting to the Chunker, which seemed to add more complexity than I wanted.

Closes apache#10321 from n3world/ARROW-12675-report_rows Authored-by: Nate Clark <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
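The row tracking described above can be sketched as follows. This is a minimal illustration and not Arrow's CSV reader: `CheckBlocks`, the block representation, and the error text are all hypothetical. It assumes blocks are processed strictly in order, which is what makes a running row count possible in the serial case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a serial reader's bookkeeping; not Arrow's API.
// Returns an empty string on success, or an error message naming the absolute
// (1-based) row whose column count differs from `expected_columns`.
std::string CheckBlocks(const std::vector<std::vector<std::string>>& blocks,
                        int expected_columns) {
  assert(expected_columns > 0);
  std::int64_t rows_before_block = 0;  // rows consumed by earlier blocks
  for (const auto& block : blocks) {
    for (std::size_t i = 0; i < block.size(); ++i) {
      // Count fields in this row: one more than the number of commas.
      int columns = 1;
      for (char c : block[i]) columns += (c == ',');
      if (columns != expected_columns) {
        std::ostringstream msg;
        msg << "Row #" << (rows_before_block + static_cast<std::int64_t>(i) + 1)
            << ": expected " << expected_columns << " columns, got " << columns;
        return msg.str();
      }
    }
    // Carry the count forward so the next block reports absolute row numbers.
    rows_before_block += static_cast<std::int64_t>(block.size());
  }
  return "";
}
```

A parallel reader cannot keep this running count as cheaply, because a block may be parsed before the number of rows in earlier blocks is known.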

…_id's

Questions:

- This is my first PR in the parquet namespace, and I'm not sure of all the special rules.
- The field ID generation doesn't happen in the `parquet::schema` -> `arrow::schema` phase but in the `parquet::format::schema` -> `parquet::schema` phase. So in order to test it I had to add `#include "generated/parquet_types.h"` to `arrow_schema_test.cc`, and I wasn't sure if I was allowed to reference the `generated/*` files like that.
- This PR simply allows user-specified field IDs to be persisted. Is that sufficient for PARQUET-1798 (the title is rather general), or should I open a dedicated JIRA?

Closes apache#10289 from westonpace/feature/PARQUET-1798-field-id-assignment Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>


This runs reverse dependency checks using {revdepcheck}. It installs a release version of arrow and the current development version (i.e. from the git checkout), then runs checks on each reverse dependency, first with the release ("old" in {revdepcheck}'s terms) and then with the development version ("new" in {revdepcheck}'s terms). It then compares the outputs and fails only if the new check has a failure that the old check does not. I've customized the output a bit so that it prints any errors that come up in either check (in the revdepcheck problems step), which makes diagnosis easier, but the job still fails only on new errors.

One thing I tried and was unable to do is cache packages and info across runs. The GitHub cache action will create a cache, but because of how the jobs are run on crossbow (i.e. on different branches), the caches are never accessible from other runs. I've kept the caching step in for now. If we could find a way to (manually?) run this on the main branch, like https://github.com/ursacomputing/crossbow/blob/master/.github/workflows/cache_vcpkg.yml, before we use this heavily (i.e. likely only around a release), that would create a cache that could speed up some of the jobs.

Closes apache#10345 from jonkeane/ARROW-12569-revdepcheck Authored-by: Jonathan Keane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>

Adjust the R version used so that binary arrow packages can be installed from RSPM. Also adjust the tests so that they no longer require a fixed order of attributes (the order changed slightly in version 3.0.0).

Closes apache#10409 from jonkeane/ARROW-12883-version-compatibility Authored-by: Jonathan Keane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>


…ory' into feature/cache-object-code-in-memory # Conflicts: # cpp/src/gandiva/base_object_cache.h # cpp/src/gandiva/cache.h # cpp/src/gandiva/engine.h # cpp/src/gandiva/lru_cache.h # cpp/src/gandiva/projector.cc # cpp/src/gandiva/projector.h


Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci Authored-by: Weston Pace <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

Ying Zhou and others added 30 commits May 27, 2021 17:52

ARROW-9299: [C++][Python] Expose ORC metadata

822a5a2

Closes apache#10157 from mathyingzhou/ARROW-9299 Lead-authored-by: Ying Zhou <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

Add to string function

f3a3d31

Add toString, GetObjectCode and PutObjectCode functions

18a7532

Add SetProjectorObjectCache to the engine

c10d914

Add a Build() overload to receive the ProjectorObjectCache

ac8563b

Adds the projector_object_cache class files

c49b936

Modify the Make() func to use the ProjectorObjectCache class

4677361

Remove the include to a not used base_object_cache file

e1d6808

ARROW-12898: [Release][C#] Fix package upload

809606d

* Download URL is wrong * Downloaded packages aren't removed Closes apache#10418 from kou/release-csharp Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

Add ProjectorCacheKey private param to Projector

9f482c9

Add logs that assert that both keys at setter and getter are the same

bca8332

Modified only the TestProjectCache for testing the object cache using ptrs

048ec81

ARROW-12758: [R] Add examples to more function documentation

fe2d940

Closes apache#10343 from thisisnic/ARROW-12758_examples Lead-authored-by: Nic Crane <[email protected]> Co-authored-by: Jonathan Keane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>

Fixed TestProjectCache projector's ptrs

9ea1e26

ARROW-12841: [R] Add examples to more function documentation - part 2

aa80860

Closes apache#10368 from thisisnic/ARROW-12841_examples_part_2 Authored-by: Nic Crane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>

Changed projectCacheKey ptr from unique to shared

2ae9e16

Adds flag to know if the ObjectCode was cached and fixed the broken tests

3cc0836

ARROW-12777: [R] Convert all inputs to Arrow objects in match_arrow and is_in

bf0f6aa

Closes apache#10383 from thisisnic/ARROW-12777_match_arrow_is_in Authored-by: Nic Crane <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>

Fix all the broken tests

733f055

ARROW-12900: [Python][Doc] Add missing numpy import

de0bb96

Closes apache#10419 from raybellwaves/docs-np-import Authored-by: Ray Bell <[email protected]> Signed-off-by: David Li <[email protected]>

ARROW-12894: [R] Bump R version

406af5e

Closes apache#10413 from jonkeane/ARROW-12894 Authored-by: Jonathan Keane <[email protected]> Signed-off-by: Neal Richardson <[email protected]>

Refactor ProjectorObjectCache to BaseObjectCache with template

c16a368

Remove unnecessary private variable

fc1b81f

Clean the logs and add descriptive comments

7dabd62

Change filter to use the ObjectCache log system too

4b99dc5

Comment out unnecessary cache variable

5ef4714

augustoasilva added 26 commits June 7, 2021 18:07

Remove unnecessary private variable

969014c

Clean the logs and add descriptive comments

54d0d8a

Change filter to use the ObjectCache log system too

0829ac1

Comment out unnecessary cache variable

b99943a

Remove unnecessary log

e1fe2c5

Remove unnecessary log

7a00493

Remove unnecessary commented out code

7d50f8f

Add support to LRU cache to track its size in bytes

3b129a9

Refactor code removing unnecessary code and logs

146dad3

Refactor filter code to implement logging of the cache size

74825eb

Adds to the projector a func to get how much mem cache is used by it

a37276f

Add to the cache the capability of tracking how much memory it uses

ea73fd2

Remove unused functions of the base object cache

abadb27

Comment out logs of the gandiva::engine::SetLLVMObjectCache()

65f610f

Add first tests of gandiva base object cache

16765dd

Add a safely evict func so cache does not pass the cache size limit

b044ae4
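The commits above give the LRU cache a size in bytes and an eviction step that keeps it under its limit. A minimal sketch of that idea follows. `ByteBoundedLru` is a hypothetical class written for illustration, not the Gandiva implementation: it uses `std::string` for both keys and values (standing in for cache keys and object code buffers), and refusing values larger than the whole capacity is an assumed policy.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical byte-bounded LRU cache; not Gandiva's actual class.
class ByteBoundedLru {
 public:
  explicit ByteBoundedLru(std::size_t capacity_bytes) : capacity_(capacity_bytes) {}

  // Inserts or replaces an entry, evicting least recently used entries until
  // the new value fits. Returns false if the value alone exceeds the capacity.
  bool Put(const std::string& key, std::string value) {
    if (value.size() > capacity_) return false;
    Erase(key);  // drop any previous value stored under the same key
    while (used_ + value.size() > capacity_) EvictOldest();
    used_ += value.size();
    order_.emplace_front(key, std::move(value));
    index_[key] = order_.begin();
    return true;
  }

  // Returns a pointer to the cached value and marks it most recently used,
  // or returns nullptr on a miss.
  const std::string* Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);
    return &it->second->second;
  }

  std::size_t used_bytes() const { return used_; }
  std::size_t entry_count() const { return index_.size(); }

 private:
  using Entry = std::pair<std::string, std::string>;

  void Erase(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return;
    used_ -= it->second->second.size();
    order_.erase(it->second);
    index_.erase(it);
  }

  void EvictOldest() {
    assert(!order_.empty());
    used_ -= order_.back().second.size();
    index_.erase(order_.back().first);
    order_.pop_back();
  }

  std::size_t capacity_;
  std::size_t used_ = 0;
  std::list<Entry> order_;  // front is most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Evicting before inserting, rather than after, means the tracked size never exceeds the limit, even briefly.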

Add filter to object cache test file

3debf08

Change the default cache size from 5 KiB to 50 MiB

6563e10

Change TestObjectCache::Setup() to set the GANDIVA_CACHE_SIZE to 5120 bytes or 5 KiB

d2520aa
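The commit titles above suggest that the cache capacity comes from a `GANDIVA_CACHE_SIZE` environment variable with a 50 MiB default. The sketch below shows one way such a lookup could work. Only the variable name and the default come from the commit titles: the function names and the fallback rules (missing, non-numeric, or zero values all yield the default) are assumptions for illustration, not the PR's code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>

// Default taken from the commit title (50 MiB).
constexpr std::size_t kDefaultCacheBytes = 50 * 1024 * 1024;

// Hypothetical helper: parses a decimal byte count and falls back to the
// default when the text is missing, non-numeric, or zero.
std::size_t ParseCacheBytes(const char* text) {
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
    return kDefaultCacheBytes;
  }
  char* end = nullptr;
  const unsigned long long value = std::strtoull(text, &end, 10);
  assert(end != nullptr);  // strtoull sets `end` when given a non-null pointer
  if (*end != '\0' || value == 0) return kDefaultCacheBytes;
  return static_cast<std::size_t>(value);
}

// Hypothetical helper: reads the capacity from the environment.
std::size_t CacheBytesFromEnv() {
  return ParseCacheBytes(std::getenv("GANDIVA_CACHE_SIZE"));
}
```

Keeping the parsing separate from the `std::getenv` call lets a test exercise the fallback rules without touching the process environment.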

Enable caching for each projector's exprs and enable caching to disk.

f691b34

Add clearCacheDirectory function with cross-platform boost::filesystem

14b313d

Change arrow log level to debug for the cache logging

d477084

Add disk usage tracking

75f7acb

Add uuid support for cache key

6cbc4a2

Add cache on-disk tracking persistence between runs

3eacbff

augustoasilva added 3 commits June 16, 2021 18:28

Add upper bound limit to the on-disk cache size with min()

4a26031
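Clamping a requested size with `min()`, as the commit title above describes, can be sketched like this. The ceiling value and the names `kMaxDiskCacheBytes` and `EffectiveDiskCacheBytes` are hypothetical; the source does not state the actual limit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical ceiling; the real limit used by the PR is not given here.
constexpr std::uint64_t kMaxDiskCacheBytes = 10ULL * 1024 * 1024 * 1024;  // 10 GiB

// Returns the requested size, capped at the ceiling.
constexpr std::uint64_t EffectiveDiskCacheBytes(std::uint64_t requested_bytes) {
  return std::min(requested_bytes, kMaxDiskCacheBytes);
}

// Compile-time checks: small requests pass through, large ones are capped.
static_assert(EffectiveDiskCacheBytes(1024) == 1024, "small request unchanged");
static_assert(EffectiveDiskCacheBytes(kMaxDiskCacheBytes + 1) == kMaxDiskCacheBytes,
              "large request capped");
```

Because `std::min` is `constexpr` since C++14, the bound can be verified at compile time.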

Add flatbuffer to gandiva cache file

44f378f

Removed expression array from gandiva cache fb schema

33eebec
