Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: skip write index.json if no data is wrote. #19439

Merged
merged 7 commits into from
Feb 9, 2024
Merged

bugfix: skip write index.json if no data is wrote. #19439

merged 7 commits into from
Feb 9, 2024

Conversation

cauyxy
Copy link
Contributor

@cauyxy cauyxy commented Feb 8, 2024

What does this PR do?

Skip creating index.json for processes that do not write data.

Fixes #19438

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • [ yes ] Is this pull request ready for review? (if not, please submit in draft mode)
  • [ yes ] Check that all items from Before submitting are resolved
  • [ yes ] Make sure the title is self-explanatory and the description concisely explains the PR
  • [ yes ] Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19439.org.readthedocs.build/en/19439/

@cauyxy cauyxy requested a review from tchaton as a code owner February 8, 2024 20:05
@github-actions github-actions bot added the data (external) litdata package label Feb 8, 2024
@tchaton
Copy link
Contributor

tchaton commented Feb 9, 2024

Hey @cauyxy Thanks for your contribution!

Would you mind adding a test after verifying the test breaks using master? Especially one emulated multiple nodes. If you get blocked, I can help there.

BTW, what's your interest in Lightning Data? I am quite curious as you are the first community contributor ;)

@cauyxy
Copy link
Contributor Author

cauyxy commented Feb 9, 2024

Hey @cauyxy Thanks for your contribution!

Would you mind adding a test after verifying the test breaks using master? Especially one emulated multiple nodes. If you get blocked, I can help there.

BTW, what's your interest in Lightning Data? I am quite curious as you are the first community contributor ;)

Sorry, I'm not very familiar with adding tests, and it might take me some time to do it myself. If you could help with that, it would be great!

I've been experimenting with training some language models and have used the Lightning framework. For the data part, I primarily utilized StreamingDataset, which I found very convenient and it solved many of my problems. Thank you for your great work.

The code looks very neat, which is extremely pleasing for someone like me who has a strong preference for clean code!🥰

@tchaton
Copy link
Contributor

tchaton commented Feb 9, 2024

Hey @cauyxy, this is kind. Thank you.

For LLM, we are a couple of Studios I built to display how to best use Lightning Data:

Studios are fully reproducible cloud environments with data, code, dependencies, etc... Give it a spin ;)

Sorry, I'm not very familiar with adding tests

This is entirely fine. Do you think you could provide a reproducible dummy python script that triggers the error? This is usually the first step to build a test. Same for the other error ;)

@cauyxy
Copy link
Contributor Author

cauyxy commented Feb 9, 2024

@tchaton Sure, I can give it a try. I don't have access to Studio right now, but I can set it up locally.

@cauyxy
Copy link
Contributor Author

cauyxy commented Feb 9, 2024

Hey @cauyxy, this is kind. Thank you.

For LLM, we are a couple of Studios I built to display how to best use Lightning Data:

Studios are fully reproducible cloud environments with data, code, dependencies, etc... Give it a spin ;)

Sorry, I'm not very familiar with adding tests

This is entirely fine. Do you think you could provide a reproducible dummy python script that triggers the error? This is usually the first step to build a test. Same for the other error ;)

this dummy script can trigger pdb.
Env: Linux python3.9

Run python3 script.py

import os
import time
import json
import torch
from pathlib import Path

from lightning.pytorch.utilities.imports import RequirementCache
from lightning.data.streaming import DataChunkRecipe, DataProcessor

assert os.cpu_count() > 1, "cpu count should be greater than 1"
assert RequirementCache("lightning==2.2.0"), "Please use lightning==2.2.0"
assert RequirementCache("lightning_cloud==0.5.64")


base_dir = Path(os.path.dirname(__file__))


def prepare_data():
    data_dir = base_dir / "dummy_data"
    data_dir.mkdir(parents=True, exist_ok=True)

    with open(data_dir / "dummy.json", "w") as f:
        f.write(json.dumps({"text": "hello world"}))


prepare_data()


class CommonDataRecipe(DataChunkRecipe):
    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.json")
        return [str(file) for file in files]

    def prepare_item(self, filepath):
        with open(filepath, "r", encoding="utf-8") as f:
            for _ in f:
                yield torch.tensor([0, 1, 2, 3, 4, 5, 6])


def prepare(
    input_dir: str = "dummy_data",
    output_dir: str = "dummy_processed",
    chunk_size: int = 4,
) -> None:
    data_recipe = CommonDataRecipe(chunk_size=chunk_size)
    data_processor = DataProcessor(
        input_dir=str(input_dir),
        output_dir=str(output_dir),
        num_workers=os.cpu_count(),
        num_downloaders=1,
    )

    start_time = time.time()
    data_processor.run(data_recipe)
    elapsed_time = time.time() - start_time
    print(f"Time taken: {elapsed_time:.2f} seconds")


if __name__ == "__main__":
    import fire

    fire.Fire(prepare)

@tchaton
Copy link
Contributor

tchaton commented Feb 9, 2024

Hey @cauyxy This is great. Now, you can convert this reproducible script into a test ;) You are almost there.

@cauyxy
Copy link
Contributor Author

cauyxy commented Feb 9, 2024

Hey @cauyxy This is great. Now, you can convert this reproducible script into a test ;) You are almost there.

Done!

You are really a good teacher! haha...😂

Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test! The breakpoint wasn't meant to be there. So you didn't need to patch builtin. However, if you need to do patching, use monkeypatch, so the patching exists only in the context of the test.

@cauyxy
Copy link
Contributor Author

cauyxy commented Feb 9, 2024

Thanks for the test! The breakpoint wasn't meant to be there. So you didn't need to patch builtin. However, if you need to do patching, use monkeypatch, so the patching exists only in the context of the test.

Get it!

@mergify mergify bot added the ready PRs ready to be merged label Feb 9, 2024
@tchaton tchaton merged commit 47c8f4c into Lightning-AI:master Feb 9, 2024
55 checks passed
@tchaton
Copy link
Contributor

tchaton commented Feb 9, 2024

Thanks @cauyxy. Great job! Both your PRs got merged ;) Feel free to continue contributing, you did great!

Borda pushed a commit that referenced this pull request Feb 12, 2024
lexierule pushed a commit that referenced this pull request Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data (external) litdata package ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: SteamingDataset Processor data config mismatch
3 participants