Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md #126

Merged
merged 1 commit into from
Jun 28, 2024
Merged

Update README.md #126

merged 1 commit into from
Jun 28, 2024

Conversation

Maghoumi
Copy link
Collaborator

@Maghoumi Maghoumi commented Jun 24, 2024

Update README.md:

  • Improve and shorten.
  • Include links to blogs and tutorials.
  • Remove incorrect info about non-existent branches.
  • Add a header and a diagram.
  • Add a note about incremental deduplication.

Description

Make minor revisions to the landing page. Include a link to the TinyStories blog post.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@Maghoumi Maghoumi requested a review from ryantwolf June 24, 2024 20:27
@Maghoumi Maghoumi force-pushed the mmaghoumi/update-readme branch 3 times, most recently from d913c7e to e1d5b18 Compare June 25, 2024 00:07
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of changes I'd like to see, but overall nice job!

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated

## Key Features
# NeMo Curator
🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can we capitalize this like a title?
    Suggested change
    🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀
    🚀 **The GPU-Accelerated Open Source Framework for Efficient Large Language Models Data Curation** 🚀
  2. Not sure how much of this is for SEO, but I personally would prefer to omit the "Open Source" part. It feels self-evident given that we are on GitHub.
  3. I feel like it should be "Large Language Model Data Curation", not "Large Language Models Data Curation". Am I wrong?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")


NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
<p align="center">
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally not a fan of having a diagram, but after talking with others they have found it useful. I still have a couple of critiques of this diagram in particular.

  1. Too much detail for the first thing the user sees. I think a smaller, simpler diagram would serve us better.
  2. Slightly confusing information. There are two "Language Detections" that might confuse the user. We would need to go more in detail to explain why each of them are needed than what is good for a diagram like this.
  3. The text is too small.
  4. The diagram aspect ratio is too wide.

I'm not sure what the best way to proceed is. We should perhaps have a separate meeting to discuss how to improve it. Do we know of any graphic designers we can reach out to to help refine this? I can make suggestions all day, but actually producing a better version is beyond my skillset.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.

In the meantime, let me know if you want me to remove this diagram from the PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the diagram in the PR for now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, kept.

README.md Show resolved Hide resolved
Copy link
Collaborator Author

@Maghoumi Maghoumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryantwolf Thanks for reviewing. I made all the changes and pushed. Please take a look when you get a chance.

README.md Show resolved Hide resolved

NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
<p align="center">
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.

In the meantime, let me know if you want me to remove this diagram from the PR.

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a couple of nits this time. Thanks again!

README.md Outdated
- `peft-curation` which focuses on data curation for parameter-efficient fine-tuning use-cases.
- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch.
- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There is no data blending that occurs in this tutorial.

Suggested change
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, the tutorial's overview says it's mean to "help" with data annotation and blending, which is why I thought to include that here.

I changed the wording so it only mentions annotation.

README.md Outdated
ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
# Discard records from irrelevant tasks
Copy link
Collaborator

@ryantwolf ryantwolf Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit inaccurate. What this is actually doing is ensuring that evaluation metrics don't leak into the training data. Think of it like this: The model shouldn't see/memorize the answers to the final exam. I think one of these would be better

  • "Discard records from the evaluation metrics"
  • "Prevent test set leakage"
  • "Prevent test set contamination"

I'll let you decide on what you think is the best description is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, thanks for correcting my understanding. Revised.


NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
<p align="center">
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the diagram in the PR for now

README.md Outdated

## Key Features
# NeMo Curator
🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")

@Maghoumi
Copy link
Collaborator Author

@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")

@ryantwolf Oops, I missed your original comment. I changed it to "Large Language Model Data Curation". Yes the redundant open source emphasis is for SEO as there is no other mention of it in the readme file.

* Improve and shorten.
* Include links to blogs and tutorials.
* Remove incorrect info about non-existent branches.
* Add a header and a diagram.
* Add a note about incremental deduplication.

Signed-off-by: Mehran Maghoumi <[email protected]>
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good with me. Thanks!

@ryantwolf ryantwolf merged commit 640546c into NVIDIA:main Jun 28, 2024
3 checks passed
@Maghoumi Maghoumi deleted the mmaghoumi/update-readme branch August 1, 2024 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants