
[FEA] Processing of Large Scale Event Logs #1249

Open

parthosa opened this issue Aug 1, 2024 · 5 comments
Labels
core_tools (Scope the core module (scala)), feature request (New feature or request)

Comments

@parthosa (Collaborator) commented Aug 1, 2024

Currently, we run the Tool (python+jar) on a single machine, where it is limited by the memory and compute of the host machine. However, the Tools should be able to process large-scale event logs.

Although we do support running the Tools as a Spark Listener, that is not useful for apps that have already completed.

Some of the ideas are:

  1. Distributed Processing:
    • Allow the JAR to be submitted as a Spark App.
  2. Batch Processing on a Single Machine:
    • Have the Tool batch the event logs and write the JAR output to multiple directories.
    • The Python Tool could then process the multiple rapids_4_spark_qualification_output directories.
    • Batching can be done based on the size of the event logs or a config setting (see the sketch after this list).
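To make idea 2 concrete, here is a minimal sketch of size-based batching. It is an illustration only: the function name, the `max_batch_bytes` threshold, and the per-batch output directory naming are assumptions, not part of the Tools today.

```python
import os

def batch_event_logs(log_paths, max_batch_bytes=8 * 1024**3):
    """Group event logs into batches whose total size stays under a threshold.

    Both the name and the 8 GiB default are hypothetical; a real config
    knob would control the limit.
    """
    batches, current, current_size = [], [], 0
    # Largest logs first, so an oversized log ends up alone in its own batch.
    for path in sorted(log_paths, key=os.path.getsize, reverse=True):
        size = os.path.getsize(path)
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Each batch would then be fed to the JAR separately, writing to e.g.
# rapids_4_spark_qualification_output_0, _1, ... for the Python Tool
# to aggregate afterwards (the directory naming here is an assumption).
```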

cc: @viadea @kuhushukla

parthosa added the feature request, ? - Needs Triage, and core_tools labels on Aug 1, 2024
@amahussein (Collaborator) commented Aug 2, 2024

> Currently, we run the Tool (python+jar) on a single machine, where it is limited by the memory and compute of the host machine. However, the Tools should be able to process large-scale event logs.

I am not sure I understand the problem. Is it about processing apps at runtime, or about the Tools' resource requirements?

Processing event logs requires large resources. For instance, the Spark History Server is known to require a lot of memory and resources to process event logs.
We have open issues for performance optimizations that mainly target the possibility of OOMEs while processing large event logs.

@amahussein (Collaborator) commented

Previously, the Python CLI had an option to submit the Tools JAR as a Spark job. This was mainly a way to handle large event logs, since the CLI could spin up distributed Spark jobs.
Based on feature requests, the Python CLI was converted to run on a single dev machine, despite knowing that large-scale processing would be a problem.

@tgravescs (Collaborator) commented

Note that scaling can also be achieved by making a single machine run more efficiently, e.g., by storing the data in a database such as RocksDB instead of in memory. This issue should likely be split up into multiple issues for the various improvements being made.
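As a rough illustration of that idea, the sketch below spills parsed events to an on-disk store instead of an in-memory map. SQLite (from the Python standard library) stands in for RocksDB purely for illustration; the schema and function names are assumptions.

```python
import json
import sqlite3

# Sketch: keep parsed Spark events on disk instead of holding them all in
# memory. SQLite stands in for RocksDB purely for illustration.
conn = sqlite3.connect("parsed_events.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events (app_id TEXT, event_type TEXT, body TEXT)"
)

def store_event(app_id, event):
    # 'event' is one parsed JSON line from an event log (Spark event logs
    # are JSON-lines files whose records carry an "Event" field).
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (app_id, event.get("Event", "unknown"), json.dumps(event)),
    )
    conn.commit()

def events_for_app(app_id):
    # Stream events back without materializing them all at once.
    cur = conn.execute("SELECT body FROM events WHERE app_id = ?", (app_id,))
    for (body,) in cur:
        yield json.loads(body)
```

Trading memory for disk this way bounds the resident footprint at the cost of I/O, which is the same trade-off a RocksDB-backed store would make.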

@tgravescs (Collaborator) commented Oct 7, 2024

Linking #1377 to this for handling a very large number of event logs.
Also linking #1378 to this for processing huge event logs.

@amahussein (Collaborator) commented

> Note that scaling can also be achieved by making a single machine run more efficiently, e.g., by storing the data in a database such as RocksDB instead of in memory. This issue should likely be split up into multiple issues for the various improvements being made.

@tgravescs, yes, I agree. We had a previous issue, #815, to track that.
I am somewhat confused about how each of these issues connects to the others.
For example, what is the expected outcome of this issue (#1249) vs. what's in #1378?
IMHO, we should close #1249 and then file something specific to distributed Tools execution.
