bugfix: skip write index.json if no data is wrote. #19439

cauyxy · 2024-02-08T20:05:25Z

What does this PR do?

Skip creating index.json for processes that do not write data.

Fixes #19438

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

[ yes ] Is this pull request ready for review? (if not, please submit in draft mode)
[ yes ] Check that all items from Before submitting are resolved
[ yes ] Make sure the title is self-explanatory and the description concisely explains the PR
[ yes ] Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19439.org.readthedocs.build/en/19439/

tchaton · 2024-02-09T08:34:02Z

Hey @cauyxy Thanks for your contribution!

Would you mind adding a test after verifying the test breaks using master? Especially one emulated multiple nodes. If you get blocked, I can help there.

BTW, what's your interest in Lightning Data? I am quite curious as you are the first community contributor ;)

cauyxy · 2024-02-09T11:07:33Z

Hey @cauyxy Thanks for your contribution!

Would you mind adding a test after verifying the test breaks using master? Especially one emulated multiple nodes. If you get blocked, I can help there.

BTW, what's your interest in Lightning Data? I am quite curious as you are the first community contributor ;)

Sorry, I'm not very familiar with adding tests, and it might take me some time to do it myself. If you could help with that, it would be great!

I've been experimenting with training some language models and have used the Lightning framework. For the data part, I primarily utilized StreamingDataset, which I found very convenient and it solved many of my problems. Thank you for your great work.

The code looks very neat, which is extremely pleasing for someone like me who has a strong preference for clean code!🥰

tchaton · 2024-02-09T12:15:34Z

Hey @cauyxy, this is kind. Thank you.

For LLM, we are a couple of Studios I built to display how to best use Lightning Data:

Studios are fully reproducible cloud environments with data, code, dependencies, etc... Give it a spin ;)

Sorry, I'm not very familiar with adding tests

This is entirely fine. Do you think you could provide a reproducible dummy python script that triggers the error? This is usually the first step to build a test. Same for the other error ;)

cauyxy · 2024-02-09T13:00:00Z

@tchaton Sure, I can give it a try. I don't have access to Studio right now, but I can set it up locally.

cauyxy · 2024-02-09T13:51:35Z

Hey @cauyxy, this is kind. Thank you.

For LLM, we are a couple of Studios I built to display how to best use Lightning Data:

Benchmark cloud data-loading libraries

Prepare the TinyLlama 1T token dataset

Pretrain LLMs - TinyLlama 1.1B

Studios are fully reproducible cloud environments with data, code, dependencies, etc... Give it a spin ;)

Sorry, I'm not very familiar with adding tests

This is entirely fine. Do you think you could provide a reproducible dummy python script that triggers the error? This is usually the first step to build a test. Same for the other error ;)

this dummy script can trigger pdb.
Env: Linux python3.9

Run python3 script.py

import os
import time
import json
import torch
from pathlib import Path

from lightning.pytorch.utilities.imports import RequirementCache
from lightning.data.streaming import DataChunkRecipe, DataProcessor

assert os.cpu_count() > 1, "cpu count should be greater than 1"
assert RequirementCache("lightning==2.2.0"), "Please use lightning==2.2.0"
assert RequirementCache("lightning_cloud==0.5.64")


base_dir = Path(os.path.dirname(__file__))


def prepare_data():
    data_dir = base_dir / "dummy_data"
    data_dir.mkdir(parents=True, exist_ok=True)

    with open(data_dir / "dummy.json", "w") as f:
        f.write(json.dumps({"text": "hello world"}))


prepare_data()


class CommonDataRecipe(DataChunkRecipe):
    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.json")
        return [str(file) for file in files]

    def prepare_item(self, filepath):
        with open(filepath, "r", encoding="utf-8") as f:
            for _ in f:
                yield torch.tensor([0, 1, 2, 3, 4, 5, 6])


def prepare(
    input_dir: str = "dummy_data",
    output_dir: str = "dummy_processed",
    chunk_size: int = 4,
) -> None:
    data_recipe = CommonDataRecipe(chunk_size=chunk_size)
    data_processor = DataProcessor(
        input_dir=str(input_dir),
        output_dir=str(output_dir),
        num_workers=os.cpu_count(),
        num_downloaders=1,
    )

    start_time = time.time()
    data_processor.run(data_recipe)
    elapsed_time = time.time() - start_time
    print(f"Time taken: {elapsed_time:.2f} seconds")


if __name__ == "__main__":
    import fire

    fire.Fire(prepare)

tchaton · 2024-02-09T14:39:12Z

Hey @cauyxy This is great. Now, you can convert this reproducible script into a test ;) You are almost there.

cauyxy · 2024-02-09T15:28:01Z

Hey @cauyxy This is great. Now, you can convert this reproducible script into a test ;) You are almost there.

Done!

You are really a good teacher! haha...😂

tchaton

Thanks for the test! The breakpoint wasn't meant to be there. So you didn't need to patch builtin. However, if you need to do patching, use monkeypatch, so the patching exists only in the context of the test.

cauyxy · 2024-02-09T15:54:21Z

Thanks for the test! The breakpoint wasn't meant to be there. So you didn't need to patch builtin. However, if you need to do patching, use monkeypatch, so the patching exists only in the context of the test.

Get it!

tchaton · 2024-02-09T17:08:50Z

Thanks @cauyxy. Great job! Both your PRs got merged ;) Feel free to continue contributing, you did great!

(cherry picked from commit 47c8f4c)

fix empty config

bceb4be

cauyxy requested a review from tchaton as a code owner February 8, 2024 20:05

github-actions bot added the data (external) litdata package label Feb 8, 2024

fix mypy typing error

5a761d9

add test for redundant woker in data processing

742f2ed

cauyxy and others added 3 commits February 9, 2024 23:35

Merge branch 'master' into master-fixw

8495ea2

fix over indent

9e2d83c

update

633a33e

tchaton approved these changes Feb 9, 2024

View reviewed changes

rusenask approved these changes Feb 9, 2024

View reviewed changes

mergify bot added the ready PRs ready to be merged label Feb 9, 2024

Merge branch 'master' into master-fixw

1109e51

tchaton merged commit 47c8f4c into Lightning-AI:master Feb 9, 2024
55 checks passed

Borda pushed a commit that referenced this pull request Feb 12, 2024

bugfix: skip write index.json if no data is wrote. (#19439)

f9c06b9

(cherry picked from commit 47c8f4c)

lexierule pushed a commit that referenced this pull request Feb 12, 2024

bugfix: skip write index.json if no data is wrote. (#19439)

91d12a0

(cherry picked from commit 47c8f4c)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bugfix: skip write index.json if no data is wrote. #19439

bugfix: skip write index.json if no data is wrote. #19439

cauyxy commented Feb 8, 2024 •

edited by github-actions bot

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

cauyxy commented Feb 9, 2024 •

edited

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

cauyxy commented Feb 9, 2024

cauyxy commented Feb 9, 2024 •

edited

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

cauyxy commented Feb 9, 2024

tchaton left a comment

cauyxy commented Feb 9, 2024

tchaton commented Feb 9, 2024 •

edited

Loading

bugfix: skip write index.json if no data is wrote. #19439

bugfix: skip write index.json if no data is wrote. #19439

Conversation

cauyxy commented Feb 8, 2024 • edited by github-actions bot Loading

What does this PR do?

PR review

tchaton commented Feb 9, 2024 • edited Loading

cauyxy commented Feb 9, 2024 • edited Loading

tchaton commented Feb 9, 2024 • edited Loading

cauyxy commented Feb 9, 2024

cauyxy commented Feb 9, 2024 • edited Loading

tchaton commented Feb 9, 2024 • edited Loading

cauyxy commented Feb 9, 2024

tchaton left a comment

Choose a reason for hiding this comment

cauyxy commented Feb 9, 2024

tchaton commented Feb 9, 2024 • edited Loading

cauyxy commented Feb 8, 2024 •

edited by github-actions bot

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

cauyxy commented Feb 9, 2024 •

edited

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

cauyxy commented Feb 9, 2024 •

edited

Loading

tchaton commented Feb 9, 2024 •

edited

Loading

tchaton commented Feb 9, 2024 •

edited

Loading