[Refactor][core]Data inflation when using postgres #8142

Closed
narrowizard opened this issue Oct 11, 2024 · 1 comment · Fixed by #8152
Labels: improvement, type/refactor (This issue is to refactor existing code)

narrowizard (Collaborator) commented Oct 11, 2024

What and why to refactor

As a software engineer, I have been using DevLake to collect developer data for a long time. Recently we found that the _devlake_subtasks table has become bloated: one week after we upgraded to DevLake v1.0 it occupied 750MB, even though it contained only 1000 records.

Describe the solution you'd like

Solutions from @klesh

  1. Reduce the rate of updates to the _devlake_subtasks table while collecting data.
  2. Keep progress info in memory first and write it to the db at a fixed rate.
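
The second suggestion can be sketched roughly as follows. In PostgreSQL every UPDATE writes a new row version and leaves the old one behind as a dead tuple until vacuum reclaims it, which is presumably why very frequent progress updates inflate a table with only 1000 rows. This is only a minimal sketch under assumptions: `ProgressStore`, `ThrottledProgress` and `countingStore` are hypothetical names, not DevLake APIs, and the real write to _devlake_subtasks is replaced by a counter.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ProgressStore is a hypothetical stand-in for whatever code updates a row
// in _devlake_subtasks. It is not a DevLake interface.
type ProgressStore interface {
	SaveProgress(current, total int)
}

// ThrottledProgress keeps the latest progress in memory and writes it to
// the store at most once per interval.
type ThrottledProgress struct {
	mu        sync.Mutex
	store     ProgressStore
	interval  time.Duration
	lastFlush time.Time
	current   int
	total     int
	dirty     bool
}

func NewThrottledProgress(store ProgressStore, interval time.Duration) *ThrottledProgress {
	return &ThrottledProgress{store: store, interval: interval}
}

// Update records progress in memory and only writes to the store when the
// interval has elapsed since the previous write.
func (p *ThrottledProgress) Update(current, total int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.current, p.total, p.dirty = current, total, true
	if time.Since(p.lastFlush) >= p.interval {
		p.flushLocked()
	}
}

// Flush forces a write of any pending progress, for example when the
// subtask finishes, so the final value is never lost.
func (p *ThrottledProgress) Flush() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.flushLocked()
}

// flushLocked writes pending progress; the caller must hold p.mu.
func (p *ThrottledProgress) flushLocked() {
	if !p.dirty {
		return
	}
	p.store.SaveProgress(p.current, p.total)
	p.lastFlush = time.Now()
	p.dirty = false
}

// countingStore counts writes instead of touching a database.
type countingStore struct {
	writes      int
	lastCurrent int
}

func (s *countingStore) SaveProgress(current, total int) {
	s.writes++
	s.lastCurrent = current
}

func main() {
	store := &countingStore{}
	p := NewThrottledProgress(store, time.Hour)
	for i := 1; i <= 10000; i++ {
		p.Update(i, 10000)
	}
	p.Flush()
	fmt.Println(store.writes, store.lastCurrent) // prints: 2 10000
}
```

With a one-hour interval, the 10000 updates collapse into two writes: the very first update and the final flush. A real interval would be much shorter, and its value is a tuning choice this issue does not specify.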

Related issues

No

Additional context

  • The issue was introduced in v1.0.
  • We are using postgres.
@narrowizard narrowizard added the type/refactor This issue is to refactor existing code label Oct 11, 2024
@dosubot dosubot bot added the improvement label Oct 11, 2024
@narrowizard narrowizard self-assigned this Oct 21, 2024
narrowizard (Collaborator, Author) commented Oct 21, 2024

After reading some code, I put together a table showing when we update the progress of sub tasks.

Trigger timing: Sub task started
Times: Once per sub task
Code:

    if progress != nil {
        progress <- plugin.RunningProgress{
            Type:          plugin.SetCurrentSubTask,
            SubTaskName:   subtaskMeta.Name,
            SubTaskNumber: subtaskNumber,
        }
    }

Trigger timing: Manually called by the sub task context
Times: Unknown (decided by the plugin implementation)
Code:

    func (c *defaultExecContext) SetProgress(progressType plugin.ProgressType, current int, total int) {
        c.current = int64(current)
        c.total = total
        if c.progress != nil {
            c.progress <- plugin.RunningProgress{
                Type:    progressType,
                Current: current,
                Total:   total,
            }
        }
    }

    func (c *defaultExecContext) IncProgress(progressType plugin.ProgressType, quantity int) {
        atomic.AddInt64(&c.current, int64(quantity))
        current := c.current
        if c.progress != nil {
            c.progress <- plugin.RunningProgress{
                Type:    progressType,
                Current: int(current),
                Total:   c.total,
            }
            // subtask progress may go too fast, remove old messages because they don't matter any more
            if progressType == plugin.SubTaskSetProgress {
                for len(c.progress) > 1 {
                    <-c.progress
                }
            }
        }
    }

    func (c *DefaultSubTaskContext) SetProgress(current int, total int) {
        c.defaultExecContext.SetProgress(plugin.SubTaskSetProgress, current, total)
        if total > -1 {
            c.BasicRes.GetLogger().Info("total jobs: %d", c.total)
        }
    }

    // IncProgress FIXME ...
    func (c *DefaultSubTaskContext) IncProgress(quantity int) {
        c.defaultExecContext.IncProgress(plugin.SubTaskIncProgress, quantity)
        if c.LastProgressTime.IsZero() || c.LastProgressTime.Add(3*time.Second).Before(time.Now()) || c.current%1000 == 0 {
            c.LastProgressTime = time.Now()
            c.BasicRes.GetLogger().Info("finished records: %d(not exactly)", c.current)
        } else {
            c.BasicRes.GetLogger().Debug("finished records: %d", c.current)
        }
    }

Trigger timing: Manually called by the task context
Times: Unknown
Code:

    func (c *DefaultTaskContext) SetProgress(current int, total int) {
        c.defaultExecContext.SetProgress(plugin.TaskSetProgress, current, total)
        c.BasicRes.GetLogger().Info("total step: %d", c.total)
    }

    // IncProgress FIXME ...
    func (c *DefaultTaskContext) IncProgress(quantity int) {
        c.defaultExecContext.IncProgress(plugin.TaskIncProgress, quantity)
        c.BasicRes.GetLogger().Info("finished step: %d / %d", c.current, c.total)
    }

It seems that the calls that update progress are not always under our control, so we need to reduce db operations automatically.
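
One way to reduce db operations regardless of how often a plugin calls IncProgress is to persist only when progress has moved by a minimum share of the total, in the spirit of the "less than 1 pct" rule named in the commit title for #8152. The sketch below illustrates that idea only and is not the code from that pull request; `progressGate` and `Allow` are hypothetical names.

```go
package main

import "fmt"

// progressGate remembers the last persisted value and lets a new value
// through only when it has advanced by at least 1% of the total.
// Hypothetical helper, not part of DevLake.
type progressGate struct {
	lastSaved int
}

// Allow reports whether the given progress should be written to the db.
func (g *progressGate) Allow(current, total int) bool {
	if total <= 0 {
		// Total is unknown, so a percentage cannot be computed. The caller
		// would need another rule here, such as a time-based throttle.
		return false
	}
	// Integer arithmetic avoids floating-point rounding at the boundary.
	advanced := (current-g.lastSaved)*100 >= total
	// Always let the final value through once, so the row ends up complete.
	finished := current >= total && g.lastSaved < total
	if advanced || finished {
		g.lastSaved = current
		return true
	}
	return false
}

func main() {
	gate := &progressGate{}
	total := 100000
	writes := 0
	for i := 1; i <= total; i++ {
		if gate.Allow(i, total) {
			writes++ // a db update would happen here
		}
	}
	fmt.Println(writes) // prints: 100
}
```

In this example 100000 progress calls result in 100 writes. When the total is unknown the gate never fires, so it would have to be combined with some other rule.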

narrowizard added a commit to narrowizard/incubator-devlake that referenced this issue Oct 22, 2024
[Refactor][core]Data inflation when using postgres apache#8142
narrowizard added a commit to narrowizard/incubator-devlake that referenced this issue Oct 23, 2024
[Refactor][core]Data inflation when using postgres apache#8142
klesh pushed a commit that referenced this issue Oct 23, 2024
[Refactor][core]Data inflation when using postgres #8142
narrowizard added a commit to narrowizard/incubator-devlake that referenced this issue Oct 23, 2024
klesh pushed a commit that referenced this issue Oct 23, 2024
* fix(framework): fix finished_record count in _devlake_subtasks (#8054)

* feat: not update sub task progress if progress less than 1 pct (#8152)

[Refactor][core]Data inflation when using postgres #8142

---------

Co-authored-by: Lynwee <[email protected]>
narrowizard added a commit to narrowizard/incubator-devlake that referenced this issue Oct 23, 2024
- add env SKIP_SUBTASK_PROGRESS to decide whether to skip writing subtask progress updates to the db

apache#8142
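
A switch like the one described in this commit message could look roughly like the sketch below. Only the name SKIP_SUBTASK_PROGRESS comes from the commit message. Treating it as a boolean read from the process environment, defaulting to false when unset or invalid, and the helpers `parseSkipFlag`, `skipSubtaskProgress` and `saveProgress` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseSkipFlag interprets the raw value of the switch. Reading it as a
// boolean, with unset or invalid values meaning false, is an assumption.
func parseSkipFlag(raw string) bool {
	skip, err := strconv.ParseBool(raw)
	return err == nil && skip
}

// skipSubtaskProgress reads SKIP_SUBTASK_PROGRESS from the environment.
func skipSubtaskProgress() bool {
	return parseSkipFlag(os.Getenv("SKIP_SUBTASK_PROGRESS"))
}

// saveProgress is a hypothetical persistence hook; write stands in for the
// db update and is not called when skipping is enabled.
func saveProgress(skip bool, write func()) {
	if skip {
		return
	}
	write()
}

func main() {
	writes := 0
	saveProgress(parseSkipFlag("true"), func() { writes++ }) // skipped
	saveProgress(parseSkipFlag(""), func() { writes++ })     // written
	fmt.Println("writes:", writes)
	fmt.Println("skip from environment:", skipSubtaskProgress())
}
```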