[LLM pipeline] Language filter component #232

mrchtr · 2023-06-23T12:01:51Z

This PR adds the first component for the LLM dataset creation pipeline. The component is a language filter that removes rows from a provided dataframe when their text does not match the specified language.
FastText is used for language detection.

Changes

- add component
- add unit test to test the filter logic inside of the component

Note: Did not create a pipeline that uses this component yet.
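
For readers skimming the thread, here is a minimal sketch of the filter logic described above. It is not the code from this PR. It assumes the detector is passed in as a callable (`predict_label`, a hypothetical name) that returns a fastText-style label such as `__label__en`. Rows are plain dicts to keep the example free of dependencies, whereas the component itself works on a dataframe. The stub detector is purely illustrative.

```python
from typing import Callable, Dict, List

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of predicted labels


def label_to_language(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def filter_rows_by_language(
    rows: List[Dict[str, str]],
    text_key: str,
    language: str,
    predict_label: Callable[[str], str],  # hypothetical hook for the detector
) -> List[Dict[str, str]]:
    """Keep only the rows whose text is detected as `language`."""
    return [
        row
        for row in rows
        if label_to_language(predict_label(row[text_key])) == language
    ]


def stub_predict_label(text: str) -> str:
    """Toy stand-in for a real model: a few German function words mean 'de'."""
    words = set(text.lower().split())
    return "__label__de" if words & {"der", "das", "ist"} else "__label__en"


rows = [
    {"text": "Das ist ein Satz"},
    {"text": "This is a sentence"},
]
english_rows = filter_rows_by_language(rows, "text", "en", stub_predict_label)
```

With a loaded fastText model, something like `lambda text: model.predict(text)[0][0]` could play the role of `predict_label`, since `predict` returns a tuple of labels and probabilities. The exact label format depends on the model file used.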

components/language_filter/fondant_component.yaml

components/language_filter/src/main.py

Co-authored-by: NielsRogge <[email protected]>

NielsRogge · 2023-06-29T13:13:53Z

components/language_filter/src/main.py

+logger = logging.getLogger(__name__)
+
+
+class LanguageIdentification:


rather than including the ftz file, can we load from the hub since FastText is now hosted there?

just:

    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
    model = fasttext.load_model(model_path)

mrchtr · Jun 29, 2023

What speaks against including the ftz file in the repository? Alternatively, we could download the file during the image build. I just want to avoid the situation where the component fails at execution time because an external dependency cannot be reached.
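
One way to reconcile the two positions in this exchange is to prefer a model file that is already in the image and only fall back to a download when it is missing. The sketch below illustrates that idea under assumptions: `resolve_model_path` and the `download` callable are hypothetical names, not part of this PR or of any library. In practice `download` could wrap the `hf_hub_download` call quoted above.

```python
import os
import tempfile
from typing import Callable


def resolve_model_path(local_path: str, download: Callable[[], str]) -> str:
    """Return a usable model path, preferring the file baked into the image."""
    if os.path.isfile(local_path):
        return local_path
    try:
        # Only reached when the bundled file is missing.
        return download()
    except Exception as exc:
        raise RuntimeError(
            f"Model not found at {local_path!r} and the download failed"
        ) from exc


def failing_download() -> str:
    """Simulates an unreachable external dependency."""
    raise ConnectionError("hub unreachable")


# Case 1: the bundled file exists, so the download is never attempted.
with tempfile.TemporaryDirectory() as tmp:
    bundled = os.path.join(tmp, "model.ftz")
    with open(bundled, "wb") as fh:
        fh.write(b"placeholder")
    bundled_was_used = resolve_model_path(bundled, failing_download) == bundled

# Case 2: no bundled file, so the (here: fake) download supplies the path.
path_when_missing = resolve_model_path(
    os.path.join("nonexistent-dir", "model.ftz"),
    lambda: "downloaded/model.bin",
)
```

Downloading during the image build, as suggested in the reply, gives the same runtime behavior as case 1: the file is present before the component starts.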

components/language_filter/README.md

RobbeSneyders

Thanks @mrchtr! Some small comments.

components/language_filter/src/main.py

RobbeSneyders · 2023-07-03T08:57:33Z

components/language_filter/tests/language_filter_component_test.py

Interesting to see these tests.

These tests would probably be easier to write if we split the general component behavior from the user implementation into separate classes, as discussed in chat. We could then test the user implementation without having to provide dummy variables for all the general component behavior.
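
To make the suggested split concrete, here is a rough sketch. All class and parameter names (`LanguageFilter`, `ComponentRunner`, `detect`, `load`, `write`) are hypothetical and do not reflect the Fondant API or the design that was eventually adopted. The point is only that the filtering class can be tested without any I/O setup.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


class LanguageFilter:
    """User implementation: pure filtering logic, no I/O or framework arguments."""

    def __init__(self, language: str, detect: Callable[[str], str]):
        self.language = language
        self.detect = detect  # any callable mapping text to a language code

    def transform(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if self.detect(row["text"]) == self.language]


class ComponentRunner:
    """General behavior: loads input, calls the implementation, writes output."""

    def __init__(
        self,
        implementation: LanguageFilter,
        load: Callable[[], List[Row]],
        write: Callable[[List[Row]], None],
    ):
        self.implementation = implementation
        self.load = load
        self.write = write

    def run(self) -> None:
        self.write(self.implementation.transform(self.load()))


def fake_detect(text: str) -> str:
    """Test double standing in for a real language detector."""
    return "de" if text.startswith("Das") else "en"


# The user implementation can be tested on its own, with no dummy I/O arguments.
language_filter = LanguageFilter("en", fake_detect)
kept = language_filter.transform([{"text": "Das ist gut"}, {"text": "All good"}])

# The runner is exercised separately with in-memory load and write functions.
written: List[Row] = []
ComponentRunner(
    language_filter,
    load=lambda: [{"text": "Das ist gut"}, {"text": "All good"}],
    write=written.extend,
).run()
```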

components/language_filter/requirements.txt

Co-authored-by: Robbe Sneyders <[email protected]>

mrchtr added 3 commits June 23, 2023 10:43

Draft language detection

a9555bb

Add language filter component unit test

498f103

Fix typings

65d4e28

NielsRogge reviewed Jun 23, 2023

components/language_filter/fondant_component.yaml Outdated

NielsRogge reviewed Jun 23, 2023

components/language_filter/src/main.py Outdated

NielsRogge reviewed Jun 23, 2023

components/language_filter/src/main.py Outdated

mrchtr added 4 commits June 26, 2023 13:27

Use PandasTransformComponent instead of DaskTransformComponent

0f9159d

Update component spec and test cases

141c1f8

Apply ruff fix

b9caff5

Fix fasttext dependencies

eb348ed

NielsRogge reviewed Jun 27, 2023

components/language_filter/fondant_component.yaml Outdated

NielsRogge reviewed Jun 27, 2023

components/language_filter/src/main.py Outdated

NielsRogge reviewed Jun 27, 2023

components/language_filter/src/main.py Outdated

NielsRogge reviewed Jun 27, 2023

components/language_filter/src/main.py Outdated

NielsRogge reviewed Jun 27, 2023

components/language_filter/src/main.py Outdated

mrchtr and others added 5 commits June 29, 2023 13:18

Update components/language_filter/fondant_component.yaml

6eeea74

Co-authored-by: NielsRogge <[email protected]>

Update components/language_filter/src/main.py

3ec6e5f

Co-authored-by: NielsRogge <[email protected]>

Addressing comments

25979df

Merge branch 'ml6team:main' into main

915b53c

Fixing ruff after merging main into feature branch

86881b8

NielsRogge reviewed Jun 29, 2023

components/language_filter/README.md

Remove init.py

06bc711

RobbeSneyders reviewed Jul 3, 2023

components/language_filter/requirements.txt Outdated

RobbeSneyders mentioned this pull request Jul 3, 2023

Split general component behavior from user implementation #257

Closed

mrchtr and others added 2 commits July 3, 2023 16:29

Update components/language_filter/requirements.txt

d563401

Co-authored-by: Robbe Sneyders <[email protected]>

Addressing comments

27fb0cb

PhilippeMoussalli added the Components Implementation of components label Jul 3, 2023

PhilippeMoussalli self-assigned this Jul 3, 2023

PhilippeMoussalli added this to the 0.2.0 milestone Jul 3, 2023

PhilippeMoussalli linked an issue Jul 3, 2023 that may be closed by this pull request

Run Controlnet use case at scale with custom LAION backend #261

Closed

PhilippeMoussalli removed this from the 0.2.0 milestone Jul 3, 2023

PhilippeMoussalli removed their assignment Jul 3, 2023

PhilippeMoussalli removed the Components Implementation of components label Jul 3, 2023

RobbeSneyders approved these changes Jul 5, 2023

RobbeSneyders merged commit d06b9e0 into ml6team:main Jul 5, 2023
