Prevent OOMs during heap snapshot: Change to streaming out the snapshot data. #51518

Closed
wants to merge 13 commits into from

NHDaly (Member) commented Sep 29, 2023

Fixes #51381.

The solution we came up with here is to stream out the heap snapshot, to avoid OOMing while recording it, and then do the downsampling via post-processing, to satisfy the Chrome devtools.

This allows you to record a heap snapshot from a running Julia process, even (or especially) when its current memory usage is close to the limit, without the snapshotter pushing it over the edge.

Unfortunately, this currently represents a change in the API: we now need to write out four files instead of one, and we can no longer support the function that takes an IOBuffer.

Linked here is the current version of our reassembly code, which could probably stand to be cleaned up a bit (thanks @Drvi):


I'd like to solicit opinions on whether this kind of breaking change is okay or not for a debugging tool like this.

If we don't want to break this API, I think we can add an option, like streaming=false. Then, to support the legacy non-streaming mode, I think we have two options:

  1. We could include the reassembly code in the Profile stdlib, which maybe we'd want to do anyway, so that people don't have to install another package like HeapSnapshotTools.jl just to use the heap snapshot. Then for the legacy mode, we simply reconstruct the file and write it to the destination.
  2. We could keep both the new and the old C++ code, and toggle between them. This seems annoyingly wasteful and messy though.

This PR currently takes approach 1.

Co-Authored-By: @Drvi

This should prevent the engine from OOMing while recording the snapshot!

Now we just need to sample the files, either online, before downloading,
or offline after downloading :)

If we're gonna do it offline, we'll want to gzip the files before
downloading them.
NHDaly (Member, Author) commented Sep 29, 2023

@apaz-cli and @gbaraldi: Can I get your review?

On reflection, I think I much prefer option 1, so I'm going to push up a commit with that for now.

NHDaly (Member, Author) commented Sep 29, 2023

Okay I have pushed up another commit to support approach 1. The API options are now:

julia> let io = IOBuffer()
           Profile.take_heap_snapshot(io)   # maybe we want to consider this one deprecated though?
           String(take!(io)[1:100])
       end
"{\"snapshot\":{\"meta\":{\"node_fields\":[\"type\",\"name\",\"id\",\"self_size\",\"edge_count\",\"trace_node_id\",\"det"

julia> Profile.take_heap_snapshot("/tmp/2.heapsnapshot") # streaming=false by default
Recorded heap snapshot: /tmp/2.heapsnapshot
"/tmp/2.heapsnapshot"

julia> Profile.take_heap_snapshot("/tmp/2.heapsnapshot", streaming=true) # the new API
Finished streaming heap snapshot parts to prefix: /tmp/2.heapsnapshot
"/tmp/2.heapsnapshot"

src/gc-heap-snapshot.cpp (outdated, resolved)
vilterp (Contributor) commented Sep 29, 2023

What format are the streamed-out files in?


_digits_buf = zeros(UInt8, ndigits(typemax(UInt)))
println(io, @view(preamble[1:end-2]), ",") # remove trailing "}\n", we don't end the snapshot here
println(io, "\"nodes\":[")
Review comment (Member Author):

The nodes and edges files currently aren't valid json, since we're not writing out the leading and trailing [], and we aren't writing a trailing comma after each line.

I figured it was easier to process this way, which seems to bear out with the code you wrote, @Drvi. But now I'm kind of thinking we may as well have every file we output be valid JSON...?

On the other hand, that's an extra character per node and per edge, of which there can be billions, so this could add a whole GiB right there?

Review comment (Member):

If this is adding a GB, then the whole file is probably tens of GB, since that one character will only be a small fraction of each line. But maybe we want to consider outputting these as BSON instead, to get faster encoding performance for them.

Review comment (Member Author):

In retrospect, I think these are actually just CSVs! I'm going to rename the output files to .csv. I think that at least makes their format self-documenting.


Regarding the streaming time and file size... I agree, writing them out as binary files would be even better! 🤔 They're literally just 2 giant matrices of numbers like:

$ cat 58178_136122584059625.heapsnapshot.edges  | head -n 3
0,2,0,1
0,3,0,2
0,2,0,3

$ cat 58178_136122584059625.heapsnapshot.nodes  | head -n 3
0,0,0,0,0,0,0
1,1,4482646032,384,0,0,0
1,1,15281513072,384,0,0,0

So we should definitely consider some kind of binary format instead. 👍

Review comment (Member Author):

Is there something even simpler than BSON for this? Like, I think we could literally just output an array of int64 binary data, and then read it back in the same way, no?

Review comment (Member Author):

Heh, so it turns out that writing these as binary data does make it faster, but the files are actually slightly bigger: I guess a lot of the indexes were smallish numbers (so only a few bytes as text), whereas the binary format always uses the full 8 bytes.

But still, the speed probably makes this worth it 👍
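
To illustrate the trade-off described above, here is a minimal sketch that writes the same four-column row both as comma-separated text and as raw 64-bit values. The names `Row`, `write_text`, `write_binary` and `read_binary` are hypothetical and not taken from the PR, and the sketch assumes host byte order is acceptable because the files are read back on the same kind of machine.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// One row of the edges file: four integer columns, as in the sample above.
// (Hypothetical type; the PR's actual record layout may differ.)
using Row = std::array<uint64_t, 4>;
constexpr std::streamsize kRowBytes = sizeof(uint64_t) * 4;

// Text form: comma-separated decimal values, one row per line.
void write_text(std::ostream &out, const Row &r) {
    for (std::size_t i = 0; i < r.size(); ++i) {
        if (i != 0) out << ',';
        out << r[i];
    }
    out << '\n';
}

// Binary form: each value as 8 raw bytes in host byte order (an assumption).
void write_binary(std::ostream &out, const Row &r) {
    out.write(reinterpret_cast<const char *>(r.data()), kRowBytes);
}

// Reads one binary row back; returns false once the stream is exhausted.
bool read_binary(std::istream &in, Row &r) {
    in.read(reinterpret_cast<char *>(r.data()), kRowBytes);
    return in.gcount() == kRowBytes;
}
```

For the sample row `0,2,0,1`, the text form takes 8 bytes (including the newline) while the binary form takes 32, which is consistent with the observation that small indexes are more compact as text even though the binary form needs no number formatting or parsing.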

Review comment (Contributor):

A couple of thoughts on the binary format. Since the schema is known, it wouldn't be hard to handle each column separately in the snapshot assembler.

For nodes we know that type is an index into a small array of node_types so that could be a single byte (or less), name and self_size won't need 8 bytes either, and would benefit from varint (aka vbyte) encoding like protobuf does (this eliminates the leading zeros and is relatively easy to implement). edge_count, trace_node_id and detachedness are always zero so they can be omitted altogether. id is an interesting case as currently it's the pointer to the object, but once written to a file, we can just enumerate the nodes to get a unique id so we don't need to write the pointer out (but maybe having all the pointers in your program could be useful for some analysis?).

For edges, again, type could be a single byte (or less). name_or_index (the index into strings for some edge types) is interesting: we could use varint encoding, but I think we are using typemax(UInt64) for edge types that don't have a corresponding string, which wouldn't compress under varint (it would in fact get bigger). So for these edge types maybe we shouldn't encode any name_or_index value at all, or we should use a special value for them. to_node and from_node are bounded by the number of nodes, so varint should help again.

One issue with the current approach is that in order to reassemble the snapshot, one needs to update the edge_count for each node, which means that you need to have all nodes in memory and you need to iterate the edges twice (once to update nodes, and then to write them to the assembled snapshot). How about we produce another file, edge_counts, which would basically be an array of edge counts for each node that we accumulate and write at the end (for 10M nodes that would be 40MB)?
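
The varint suggestion above can be sketched as follows. This is a generic unsigned LEB128-style encoder and decoder (7 payload bits per byte, high bit set when more bytes follow), not code from the PR; `write_varint` and `read_varint` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends v to out using 7 bits per byte; the high bit marks "more bytes follow".
// Returns the number of bytes written.
std::size_t write_varint(std::vector<uint8_t> &out, uint64_t v) {
    std::size_t n = 1;
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
        ++n;
    }
    out.push_back(static_cast<uint8_t>(v));
    return n;
}

// Decodes one value starting at pos and advances pos past it.
// Throws std::out_of_range (via at()) if the buffer is truncated.
uint64_t read_varint(const std::vector<uint8_t> &in, std::size_t &pos) {
    uint64_t v = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        uint8_t b = in.at(pos++);
        v |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;
    }
    return v;
}
```

With this encoding a small value such as a self_size of 384 takes 2 bytes instead of 8, while the maximum 64-bit value takes 10 bytes, which is why a sentinel like typemax(UInt64) gets bigger rather than smaller.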

Review comment (Member Author):

You don't just need the edge_counts; the edges need to be grouped and ordered by the nodes. So you read the file by iterating the nodes, seeing how many edges it has, and then those first N edges are coming out of the first node, then you move to the second node, and the next M edges are coming from that node, etc.

I dunno if you can build that file without having all the nodes in memory while we iterate the edges? Maybe I'm not following what you were explaining?
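
The grouping requirement described above can be sketched as a counting sort over the edges. This is only an illustration under the assumption that all edges fit in memory, which is exactly what a streaming design tries to avoid; `Edge` and `group_edges_by_source` are hypothetical names, and `from_node` is assumed to be a plain node index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory edge record; from_node is a 0-based node index.
struct Edge {
    uint64_t type;
    uint64_t name_or_index;
    uint64_t from_node;
    uint64_t to_node;
};

// Reorders edges so that all edges of node 0 come first, then node 1, and so on
// (stable within each node), and returns the per-node edge counts.
std::vector<uint32_t> group_edges_by_source(std::vector<Edge> &edges,
                                            std::size_t node_count) {
    // Pass 1: count the outgoing edges of every node.
    std::vector<uint32_t> counts(node_count, 0);
    for (const Edge &e : edges) counts.at(e.from_node)++;

    // Prefix sums: offsets[i] is where node i's first edge will be placed.
    std::vector<std::size_t> offsets(node_count, 0);
    std::size_t running = 0;
    for (std::size_t i = 0; i < node_count; ++i) {
        offsets[i] = running;
        running += counts[i];
    }

    // Pass 2: drop every edge into its node's slot range.
    std::vector<Edge> grouped(edges.size());
    for (const Edge &e : edges) grouped[offsets[e.from_node]++] = e;
    edges.swap(grouped);
    return counts;
}
```

The two passes over the edges and the per-node counts array are the costs the thread is discussing: the counts alone do not give the output order, so the edges still have to be moved.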

Review comment (Contributor):

Ah, my bad: I thought the edges were already written out in order. Yeah, then the edge_count idea doesn't apply.

Review comment (Member Author):

Mmm yeah makes sense. Yeah it's too bad. :(

Accounting for that was the last bug that I fixed while you were out on vacation. The file format is really gnarly and I think we didn't get to this detail when we talked through it way back at the start. 😊

NHDaly (Member, Author) commented Sep 30, 2023

> What format are the streamed-out files in?

@vilterp Oh, I missed this message.

I commented on it here: #51518 (comment)

Right now, the .strings and .json files are valid JSON, and the .nodes and .edges files are essentially CSVs (newline-separated rows of comma-separated values).

I'm very open to changing that; this was just the fastest possible change I could make, since we were rushing for the OOM investigation at work. Open to any suggestions!

vilterp (Contributor) commented Sep 30, 2023 via email

@@ -106,15 +106,22 @@ struct StringTable {
};

struct HeapSnapshot {
vector<Node> nodes;
// edges are stored on each from_node

StringTable names;
StringTable node_types;
StringTable edge_types;
DenseMap<void *, size_t> node_ptr_to_index_map;
Review comment (Member):

I think you can also avoid creating this table in the serializer process by making it an identity map instead (when used for edges) and also streaming out the value of the original pointer on each node. Then, in the later process, you just need to build this map from pointer->number to convert it into the JavaScript format. This dict is also used to serialize the representation/name of the object exactly once, but that can instead be satisfied by knowing the GC will visit each object exactly once as a "from" node when marking all of the out-refs.
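
A minimal sketch of the post-processing step suggested above, assuming each streamed node carries its original pointer value and each streamed edge refers to its endpoints by pointer. All names here (`index_nodes`, `resolve_edges`, `PtrEdge`, `IdxEdge`) are hypothetical, and duplicate pointers are handled in the simplest possible way (first occurrence wins), which sidesteps the merging logic a real implementation would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Edge as streamed by the serializer: endpoints identified by pointer value.
struct PtrEdge {
    uint64_t from_ptr;
    uint64_t to_ptr;
};

// Edge after post-processing: endpoints identified by node number.
struct IdxEdge {
    std::size_t from;
    std::size_t to;
};

// Assigns each distinct pointer a dense index in order of first appearance.
std::unordered_map<uint64_t, std::size_t>
index_nodes(const std::vector<uint64_t> &node_ptrs) {
    std::unordered_map<uint64_t, std::size_t> index;
    index.reserve(node_ptrs.size());
    for (uint64_t p : node_ptrs) index.emplace(p, index.size());
    return index;
}

// Rewrites pointer-based edges into index-based edges.
// Throws std::out_of_range if an edge mentions a pointer with no node record.
std::vector<IdxEdge>
resolve_edges(const std::vector<PtrEdge> &raw,
              const std::unordered_map<uint64_t, std::size_t> &index) {
    std::vector<IdxEdge> out;
    out.reserve(raw.size());
    for (const PtrEdge &e : raw)
        out.push_back(IdxEdge{index.at(e.from_ptr), index.at(e.to_ptr)});
    return out;
}
```

The design point is that the hash map moves out of the process being profiled and into the offline tool, where memory pressure matters much less.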

Review comment (Member Author):

Yeah, great points!
You're right, i think it should be a pattern of N edges from node 1 then M edges from node 2, etc. 👍 So we should be able to account for that in the second half of your comment.

I hadn't noticed this map yet in my rush job to stream the data here. I think you're right that we should fix this too 👍 👍

Thanks!

Review comment (Contributor):

We considered this but didn't address it in the new PR #52854, since it requires reprocessing the nodes/edges and some complex logic to merge duplicate nodes and outgoing edges, and it could only be an issue for a huge snapshot. We'd like to leave it for future work.

@@ -106,15 +106,22 @@ struct StringTable {
};

struct HeapSnapshot {
vector<Node> nodes;
// edges are stored on each from_node

StringTable names;
Review comment (Member):

You also might want to consider streaming this field out into the file per-node also, at least in some or all cases, since I think this is often unique? The JSON can probably deal with being a Union{Int,String} field, depending on whether the content was probably unique (or long, like a String) or probably common like a name.

Review comment (Member Author):

Yeah, I think a nice option here would be to keep something like a Bloom filter, and in the nodes file write out either an index into this table or the string itself.

But thinking more about it, I think it may even be fine to just duplicate the strings over and over. The priorities here are:
A) be fast
B) don't OOM
I don't think that file size reduction is nearly as important as those two.

So if we can intern some of the strings, like you suggested, and write out the rest, that's probably good enough, yeah. Keep common and big strings, and write them out uniquely otherwise.

This is the last major cleanup that I think we should do, otherwise this PR looks good to go.
It seems to be working in its current state in order to avoid OOMs in our production setup.

Review comment (Contributor):

This is addressed in the new PR #52854 by streaming out the string table directly. At the same time, we keep a small, limited set of known strings in memory to reduce duplicates in the string table as much as possible.
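
As an illustration of the idea described above (not the code in #52854), here is a sketch of a string table that writes each new string to the output as soon as it is seen and remembers only a bounded number of strings for deduplication. `StreamingStringTable`, `intern` and `max_cached` are hypothetical names, and the one-string-per-line output assumes strings contain no newlines; a real implementation would need proper escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Streams strings to `out` as they are first seen, keeping at most
// `max_cached` of them in memory for deduplication.
class StreamingStringTable {
public:
    StreamingStringTable(std::ostream &out, std::size_t max_cached)
        : out_(out), max_cached_(max_cached) {}

    // Returns the index of s in the streamed table. Strings that are still
    // cached are reused; anything else is written out again as a new entry.
    std::size_t intern(const std::string &s) {
        auto it = cache_.find(s);
        if (it != cache_.end()) return it->second;
        std::size_t idx = next_index_++;
        out_ << s << '\n';  // assumption: s contains no newline
        if (cache_.size() < max_cached_) cache_.emplace(s, idx);
        return idx;
    }

    // Number of entries written so far, duplicates included.
    std::size_t size() const { return next_index_; }

private:
    std::ostream &out_;
    std::size_t max_cached_;
    std::size_t next_index_ = 0;
    std::unordered_map<std::string, std::size_t> cache_;
};
```

Memory use is bounded by the cache size, at the cost of some duplicate entries in the output once the cache is full, which matches the stated priorities of being fast and not OOMing over minimizing file size.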

brenhinkeller added the feature and tooling labels Oct 3, 2023
gbaraldi (Member) commented Oct 3, 2023

Btw, when this code was developed it was kind of reverse engineered, but Microsoft has released some documentation on it if you need a proper reference: https://learn.microsoft.com/en-us/microsoft-edge/devtools-guide-chromium/memory-problems/heap-snapshot-schema

Change to always save the parts in the same directory

This way you can always recover from an OOM

Fix correctness mistake: The edges have to be reordered according to the node order. That's the whole reason this is tricky.

But I'm not sure now whether the SoAs approach is actually an optimization. It seems like we should probably prefer to inline the Edges right into the vector, rather than having to do another random lookup into the edges table?
JianFangAtRai (Contributor):

Created a new PR, #52854, to continue the work on this PR.

NHDaly (Member, Author) commented Jan 11, 2024

Closing in favor of #52854.

NHDaly closed this Jan 11, 2024
NHDaly deleted the nhd-snapshot-streaming branch January 11, 2024 17:01
JianFangAtRai (Contributor):

Hi, the new PR (#52854) is ready for review now. It fixes multiple minor issues in the original PR, such as docs, safepoints, alloc type, and so on. The main improvement is to stream out the string table during the snapshotting process instead of holding the strings in memory and writing them to a file at the end. To reduce duplicate strings and keep the string table small, we hold some known strings in memory for deduplication.

We didn't address DenseMap<void *, size_t> node_ptr_to_index_map in the new PR, since doing so would require handling duplicate nodes, with logic to merge them and their outgoing edges, which could be very involved. We leave that for future work, since it could only be an issue for a huge heap snapshot. We tested the PR with a heap snapshot of about 40 GB and it didn't crash the process.

d-netto pushed a commit that referenced this pull request Feb 1, 2024
This PR is to continue the work on the following PR: 

Prevent OOMs during heap snapshot: Change to streaming out the snapshot
data (#51518 )

Here are the commit history:

```
* Streaming the heap snapshot!

This should prevent the engine from OOMing while recording the snapshot!

Now we just need to sample the files, either online, before downloading, or offline after downloading :)

If we're gonna do it offline, we'll want to gzip the files before downloading them.

* Allow custom filename; use original API

* Support legacy heap snapshot interface. Add reassembly function.

* Add tests

* Apply suggestions from code review

* Update src/gc-heap-snapshot.cpp

* Change to always save the parts in the same directory

This way you can always recover from an OOM

* Fix bug in reassembler: from_node and to_node were in the wrong order

* Fix correctness mistake: The edges have to be reordered according to the node order. That's the whole reason this is tricky.

But i'm not sure now whether the SoAs approach is actually an optimization.... It seems like we should probably prefer to inline the Edges right into the vector, rather than having to do another random lookup into the edges table?

* Debugging messed up edge array idxs

* Disable log message

* Write the .nodes and .edges as binary data

* Remove unnecessary logging

* fix merge issues

* attempt to add back the orphan node checking logic
```

---------

Co-authored-by: Nathan Daly <[email protected]>
Co-authored-by: Nathan Daly <[email protected]>
Drvi added a commit to RelationalAI/julia that referenced this pull request Feb 1, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Feb 6, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Feb 7, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Feb 14, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Feb 21, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Feb 22, 2024
Drvi added a commit to RelationalAI/julia that referenced this pull request Feb 28, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Mar 1, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Mar 13, 2024
Drvi added a commit to RelationalAI/julia that referenced this pull request Apr 3, 2024
d-netto pushed a commit to RelationalAI/julia that referenced this pull request Apr 16, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Apr 23, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Apr 24, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Apr 30, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request Apr 30, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 2, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 9, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 19, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 26, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 28, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 29, 2024
Drvi added a commit to RelationalAI/julia that referenced this pull request Jun 7, 2024
Labels
feature, tooling
Development

Successfully merging this pull request may close these issues.

Add ability to sample the heapsnapshot
7 participants