
high memory utilization in go-graphsync #7077

Closed
6 tasks done
raulk opened this issue Aug 15, 2021 · 4 comments
Assignees
Labels
area/data-transfer kind/bug Kind: Bug P1 P1: Must be resolved team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs
Milestone

Comments

@raulk
Member

raulk commented Aug 15, 2021

Checklist

  • This is not a question or a support request. If you have any Lotus-related questions, please ask in the Lotus forum.
  • I am reporting a bug w.r.t. one of the M1 tags. If not, choose another issue option here.
  • I am reporting a bug around deal making. If not, create an M1 Bug Report For Non Deal Making Issue.
  • I have my log level set as instructed here and have logs available for troubleshooting.
  • The deal is coming from one of the M1 clients (communicated in the coordination Slack channel).
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.

Lotus Component

lotus miner market subsystem - storage deal

^^ Selected "lotus miner market subsystem - storage deal" only because the form requires a single choice. We should allow multiple options here, because it's not yet clear whether this affects storage only, retrieval only, or both.

Lotus Tag and Version

m1.3.5.

Describe the Bug

High memory utilization leading to an eventual process crash. The heap profile below was taken by the memory watchdog when usage surpassed emergency levels; it shows extreme retention at the graphsync layer. We suspect this is an artefact of the logging we introduced in recent releases to aid debugging.

(heap profile screenshot)
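For context, a minimal sketch of how a memory watchdog can dump a heap profile with Go's runtime/pprof once heap usage crosses a threshold. The threshold, polling interval, and output path here are illustrative assumptions, not Lotus's actual watchdog configuration.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// emergencyHeapBytes is an illustrative threshold, not the value used by
// the Lotus memory watchdog.
const emergencyHeapBytes = 8 << 30 // 8 GiB

func watchHeap() {
	var ms runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc < emergencyHeapBytes {
			continue
		}
		// Force a GC so the profile reflects currently live memory,
		// e.g. blocks retained by graphsync.
		runtime.GC()
		f, err := os.Create("heap.pprof")
		if err != nil {
			log.Printf("watchdog: creating heap profile: %v", err)
			continue
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Printf("watchdog: writing heap profile: %v", err)
		}
		f.Close()
	}
}

func main() {
	go watchHeap()
	select {} // block forever; a real node would run its services here
}
```

A profile captured this way can be inspected with `go tool pprof heap.pprof` to see which call sites retain the most memory.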

Deal Status

N/A

Data Transfer Status

N/A

Logging Information

N/A

Repo Steps (optional)

No response

@jennijuju jennijuju added area/data-transfer kind/enhancement Kind: Enhancement P1 P1: Must be resolved and removed need/triage kind/bug Kind: Bug labels Aug 16, 2021
@jennijuju jennijuju added this to the v1.11.3 milestone Aug 16, 2021
@jacobheun jacobheun added the team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs label Aug 17, 2021
@jacobheun jacobheun modified the milestones: v1.11.3, v1.11.2 Aug 17, 2021
@raulk raulk added kind/bug Kind: Bug and removed kind/enhancement Kind: Enhancement labels Aug 17, 2021
@aarshkshah1992
Contributor

aarshkshah1992 commented Aug 18, 2021

@raulk @hannahhoward Attaching some more heap profiles here. They all look the same.

(two further heap profile screenshots)

@aarshkshah1992
Contributor

  • The heap profile leads me to suspect the blocks we store in memory in the unverified blockstore. Received Graphsync messages are retained for as long as we keep the block data in memory. Note that the consumer of those in-memory blocks, i.e. the IPLD traverser, submits consume messages via the same event loop that is also used to add incoming blocks to the in-memory blockstore, which can lead to contention here.
  • Also, Estuary has a problem where its datastore is missing some blocks for storage deals. This can exacerbate the problem, because we buffer the other blocks until the receiver notices that a block is missing (i.e. IsKnownMissingLink=true) and sends a Graphsync Cancel to the responder, i.e. Estuary. My reading of the code is that we do send a Cancel to the responder when we notice a block is missing. Can you please confirm this? If not, that's a big problem, because the Graphsync responder traversal has been coded to keep sending the other blocks even when some blocks are missing (by signalling continuation using the traverse.SkipMe error). A toy model of that responder behaviour appears in the first sketch after this list.
  • To confirm the hypothesis that the in-memory unverified blockstore is the culprit here, I've raised Log unverified blockstore memory consumption ipfs/go-graphsync#201 with some logging around the size of the data being kept in the unverified blockstore (the second sketch after this list illustrates that kind of size accounting). We can draw conclusions once we've had a chance to run that PR on some miners and gather stats.
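First, a minimal, self-contained Go model of the responder behaviour described above: walking a traversal order, reporting a missing link, and continuing to send the remaining blocks rather than aborting. All names here (getBlock, errNotFound, the in-memory store) are illustrative assumptions, not go-graphsync's actual responder code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a datastore "block not found" error; this whole
// responder loop is an illustrative model, not go-graphsync code.
var errNotFound = errors.New("block not found")

type block struct {
	cid  string
	data []byte
}

// getBlock simulates a datastore lookup that is missing one block,
// as described for Estuary above.
func getBlock(store map[string][]byte, cid string) (block, error) {
	data, ok := store[cid]
	if !ok {
		return block{}, errNotFound
	}
	return block{cid, data}, nil
}

func main() {
	store := map[string][]byte{"a": []byte("1"), "c": []byte("3")}
	traversalOrder := []string{"a", "b", "c"}

	for _, cid := range traversalOrder {
		blk, err := getBlock(store, cid)
		if errors.Is(err, errNotFound) {
			// The responder reports the link as missing and keeps walking,
			// analogous to continuing the traversal with a SkipMe-style
			// signal instead of aborting the request. The requester must
			// therefore decide whether to Cancel, or it will keep buffering
			// the blocks that do arrive.
			fmt.Printf("metadata: link %s not present, continuing\n", cid)
			continue
		}
		fmt.Printf("sending block %s (%d bytes)\n", blk.cid, len(blk.data))
	}
}
```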
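Second, a minimal sketch of the kind of size accounting the logging PR adds around the unverified blockstore: tracking the total bytes buffered and logging on every add and consume. The type and method names (unverifiedBlockstore, addBlock, consumeBlock) are assumptions for illustration, not the actual go-graphsync API.

```go
package main

import (
	"log"
	"sync"
)

// unverifiedBlockstore is an illustrative stand-in for go-graphsync's
// in-memory buffer of blocks that have arrived but have not yet been
// consumed by the IPLD traversal; names and fields are assumptions.
type unverifiedBlockstore struct {
	mu         sync.Mutex
	blocks     map[string][]byte // keyed by link/CID string
	totalBytes int64
}

func newUnverifiedBlockstore() *unverifiedBlockstore {
	return &unverifiedBlockstore{blocks: make(map[string][]byte)}
}

// addBlock buffers an incoming block and logs how much data is retained.
func (bs *unverifiedBlockstore) addBlock(link string, data []byte) {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	bs.blocks[link] = data
	bs.totalBytes += int64(len(data))
	log.Printf("unverified blockstore: %d blocks, %.2f MiB retained",
		len(bs.blocks), float64(bs.totalBytes)/(1<<20))
}

// consumeBlock releases a block once the traversal has verified it.
func (bs *unverifiedBlockstore) consumeBlock(link string) ([]byte, bool) {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	data, ok := bs.blocks[link]
	if !ok {
		return nil, false
	}
	delete(bs.blocks, link)
	bs.totalBytes -= int64(len(data))
	return data, true
}

func main() {
	bs := newUnverifiedBlockstore()
	bs.addBlock("bafy...a", make([]byte, 1<<20)) // 1 MiB block
	if _, ok := bs.consumeBlock("bafy...a"); ok {
		log.Println("block consumed by traversal")
	}
}
```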

@hannahhoward
Contributor

WIP Solution: ipfs/go-graphsync#204

@aarshkshah1992
Contributor

Should be solved by ipfs/go-graphsync#204.
