Update README.md #126

Maghoumi · 2024-06-24T17:56:29Z

Update README.md:

* Improve and shorten.
* Include links to blogs and tutorials.
* Remove incorrect info about non-existent branches.
* Add a header and a diagram.
* Add a note about incremental deduplication.

Description

Make minor revisions to the landing page. Include a link to the TinyStories blog post.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

ryantwolf

I have a couple of changes I'd like to see, but overall nice job!


ryantwolf · 2024-06-25T21:11:26Z

README.md


-## Key Features
+# NeMo Curator
+🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀


1. Can we capitalize this like a title?

Suggested change

-🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀
+🚀 **The GPU-Accelerated Open Source Framework for Efficient Large Language Models Data Curation** 🚀

2. Not sure how much of this is for SEO, but I personally would prefer to omit the "Open Source" part. It feels self-evident given that we are on GitHub.

3. I feel like it should be "Large Language Model Data Curation", not "Large Language Models Data Curation". Am I wrong?

@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")

ryantwolf · 2024-06-25T21:24:53Z

README.md


-NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
+<p align="center">
+  <img src="./docs/user-guide/images/diagram.png" alt="diagram"/>


I am personally not a fan of having a diagram, but after talking with others they have found it useful. I still have a couple of critiques of this diagram in particular.

- Too much detail for the first thing the user sees. I think a smaller, simpler diagram would serve us better.
- Slightly confusing information. There are two "Language Detections" that might confuse the user. Explaining why each of them is needed would take more detail than suits a diagram like this.
- The text is too small.
- The diagram aspect ratio is too wide.

I'm not sure what the best way to proceed is. We should perhaps have a separate meeting to discuss how to improve it. Do we know of any graphic designers we can reach out to to help refine this? I can make suggestions all day, but actually producing a better version is beyond my skillset.

Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.

In the meantime, let me know if you want me to remove this diagram from the PR.

Let's keep the diagram in the PR for now

Okay, kept.


Maghoumi

@ryantwolf Thanks for reviewing. I made all the changes and pushed. Please take a look when you get a chance.


Maghoumi · 2024-06-26T21:41:10Z

README.md


-NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
+<p align="center">
+  <img src="./docs/user-guide/images/diagram.png" alt="diagram"/>


Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.

In the meantime, let me know if you want me to remove this diagram from the PR.


ryantwolf

Only a couple of nits this time. Thanks again!

ryantwolf · 2024-06-27T17:11:18Z

README.md

- `peft-curation` which focuses on data curation for parameter-efficient fine-tuning use-cases.
+- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch.
+- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
+- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending.


Nit: There is no data blending that occurs in this tutorial.

Suggested change

-- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending.
+- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation.

Correct, the tutorial's overview says it's meant to "help" with data annotation and blending, which is why I thought to include that here.

I changed the wording so it only mentions annotation.

ryantwolf · 2024-06-27T17:17:25Z

README.md

  ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
+  # Discard records from irrelevant tasks


This is a bit inaccurate. What this is actually doing is ensuring that evaluation metrics don't leak into the training data. Think of it like this: the model shouldn't see/memorize the answers to the final exam. I think one of these would be better:

- "Discard records from the evaluation metrics"
- "Prevent test set leakage"
- "Prevent test set contamination"

I'll let you decide which one you think is the best description.

Ah I see, thanks for correcting my understanding. Revised.
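For readers following the thread, the "final exam" point above can be illustrated with a minimal sketch of test-set decontamination by n-gram overlap. This is not NeMo Curator's implementation; the helper names (`ngrams`, `decontaminate`), the whitespace tokenization, and the n-gram size of 5 are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(train_docs: Iterable[str], eval_docs: Iterable[str], n: int = 5) -> List[str]:
    """Keep only training documents that share no n-gram with any evaluation document."""
    eval_ngrams: Set[Tuple[str, ...]] = set()
    for doc in eval_docs:
        eval_ngrams |= ngrams(doc, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(eval_ngrams)]


# Hypothetical data: the first training document repeats part of an evaluation question.
train = [
    "the quick brown fox jumps over the lazy dog near the river",
    "curating data for language models requires careful filtering",
]
evaluation = ["Question: the quick brown fox jumps over what?"]

clean = decontaminate(train, evaluation, n=5)
print(clean)
```

Here the first training document is dropped because it shares a 5-gram with the evaluation question, which is the "answers to the final exam" case; the second document is kept.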

ryantwolf · 2024-06-27T17:20:53Z

README.md


-NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
+<p align="center">
+  <img src="./docs/user-guide/images/diagram.png" alt="diagram"/>


Let's keep the diagram in the PR for now

ryantwolf · 2024-06-27T17:22:25Z

README.md


-## Key Features
+# NeMo Curator
+🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀


@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")

Maghoumi · 2024-06-27T20:50:23Z

> @Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")

@ryantwolf Oops, I missed your original comment. I changed it to "Large Language Model Data Curation". Yes, the redundant open source emphasis is for SEO, as there is no other mention of it in the README file.


ryantwolf

Good with me. Thanks!

Maghoumi requested a review from ryantwolf June 24, 2024 20:27

Maghoumi force-pushed the mmaghoumi/update-readme branch 3 times, most recently from d913c7e to e1d5b18 Compare June 25, 2024 00:07

ryantwolf requested changes Jun 25, 2024


Maghoumi force-pushed the mmaghoumi/update-readme branch from 5f8d210 to e4c5a47 Compare June 26, 2024 21:54

Maghoumi commented Jun 26, 2024


ryantwolf requested changes Jun 27, 2024


Maghoumi force-pushed the mmaghoumi/update-readme branch from e4c5a47 to f5c7f9e Compare June 27, 2024 20:50

Update README.md

5b0214b

* Improve and shorten.
* Include links to blogs and tutorials.
* Remove incorrect info about non-existent branches.
* Add a header and a diagram.
* Add a note about incremental deduplication.

Signed-off-by: Mehran Maghoumi <[email protected]>

Maghoumi force-pushed the mmaghoumi/update-readme branch from f5c7f9e to 5b0214b Compare June 27, 2024 20:51

ryantwolf approved these changes Jun 28, 2024


ryantwolf merged commit 640546c into NVIDIA:main Jun 28, 2024
3 checks passed

Maghoumi deleted the mmaghoumi/update-readme branch August 1, 2024 22:50
