-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfix: skip write index.json if no data is wrote. #19439
Conversation
Hey @cauyxy Thanks for your contribution! Would you mind adding a test after verifying the test breaks using master? Especially one emulated multiple nodes. If you get blocked, I can help there. BTW, what's your interest in Lightning Data? I am quite curious as you are the first community contributor ;) |
Sorry, I'm not very familiar with adding tests, and it might take me some time to do it myself. If you could help with that, it would be great! I've been experimenting with training some language models and have used the Lightning framework. For the data part, I primarily utilized StreamingDataset, which I found very convenient and it solved many of my problems. Thank you for your great work. The code looks very neat, which is extremely pleasing for someone like me who has a strong preference for clean code!🥰 |
Hey @cauyxy, this is kind. Thank you. For LLM, we are a couple of Studios I built to display how to best use Lightning Data:
Studios are fully reproducible cloud environments with data, code, dependencies, etc... Give it a spin ;)
This is entirely fine. Do you think you could provide a reproducible dummy python script that triggers the error? This is usually the first step to build a test. Same for the other error ;) |
@tchaton Sure, I can give it a try. I don't have access to Studio right now, but I can set it up locally. |
this dummy script can trigger pdb. Run import os
import time
import json
import torch
from pathlib import Path
from lightning.pytorch.utilities.imports import RequirementCache
from lightning.data.streaming import DataChunkRecipe, DataProcessor
assert os.cpu_count() > 1, "cpu count should be greater than 1"
assert RequirementCache("lightning==2.2.0"), "Please use lightning==2.2.0"
assert RequirementCache("lightning_cloud==0.5.64")
base_dir = Path(os.path.dirname(__file__))
def prepare_data():
data_dir = base_dir / "dummy_data"
data_dir.mkdir(parents=True, exist_ok=True)
with open(data_dir / "dummy.json", "w") as f:
f.write(json.dumps({"text": "hello world"}))
prepare_data()
class CommonDataRecipe(DataChunkRecipe):
def prepare_structure(self, input_dir):
files = Path(input_dir).rglob("*.json")
return [str(file) for file in files]
def prepare_item(self, filepath):
with open(filepath, "r", encoding="utf-8") as f:
for _ in f:
yield torch.tensor([0, 1, 2, 3, 4, 5, 6])
def prepare(
input_dir: str = "dummy_data",
output_dir: str = "dummy_processed",
chunk_size: int = 4,
) -> None:
data_recipe = CommonDataRecipe(chunk_size=chunk_size)
data_processor = DataProcessor(
input_dir=str(input_dir),
output_dir=str(output_dir),
num_workers=os.cpu_count(),
num_downloaders=1,
)
start_time = time.time()
data_processor.run(data_recipe)
elapsed_time = time.time() - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")
if __name__ == "__main__":
import fire
fire.Fire(prepare) |
Hey @cauyxy This is great. Now, you can convert this reproducible script into a test ;) You are almost there. |
Done! You are really a good teacher! haha...😂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test! The breakpoint wasn't meant to be there. So you didn't need to patch builtin. However, if you need to do patching, use monkeypatch, so the patching exists only in the context of the test.
Get it! |
Thanks @cauyxy. Great job! Both your PRs got merged ;) Feel free to continue contributing, you did great! |
(cherry picked from commit 47c8f4c)
(cherry picked from commit 47c8f4c)
What does this PR do?
Skip creating index.json for processes that do not write data.
Fixes #19438
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--19439.org.readthedocs.build/en/19439/