[LLM pipeline] Language filter component #232

Merged
merged 15 commits into from
Jul 5, 2023
Changes from 13 commits
18 changes: 18 additions & 0 deletions components/language_filter/Dockerfile
@@ -0,0 +1,18 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
7 changes: 7 additions & 0 deletions components/language_filter/README.md
@@ -0,0 +1,7 @@
# Language filter

## Description
This component is based on the `PandasTransformComponent` and filters a dataframe by language.
It removes rows whose text does not match the provided language, so you can focus
on specific languages within your data.

14 changes: 14 additions & 0 deletions components/language_filter/fondant_component.yaml
@@ -0,0 +1,14 @@
name: Filter languages
description: A component that filters text based on the language.
image: ghcr.io/ml6team/filter_language:latest

consumes:
  text:
    fields:
      data:
        type: string

args:
  language:
    description: A valid language code or identifier (e.g., "en", "fr", "de").
    type: str
4 changes: 4 additions & 0 deletions components/language_filter/requirements.txt
@@ -0,0 +1,4 @@
fondant
pyarrow>=7.0
gcsfs==2023.4.00
fasttext-wheel==0.9.2
Binary file added components/language_filter/src/lid.176.ftz
Binary file not shown.
72 changes: 72 additions & 0 deletions components/language_filter/src/main.py
@@ -0,0 +1,72 @@
"""A component that filters text based on the language."""
import logging

import fasttext
import pandas as pd

from fondant.component import PandasTransformComponent
from fondant.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


class LanguageIdentification:
Contributor


Rather than including the ftz file, can we load the model from the hub, since FastText is now hosted there?

just:

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
model = fasttext.load_model(model_path)
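
One thing to check before swapping models: as far as I know, the hub model emits labels such as `__label__eng_Latn`, while `lid.176.ftz` emits two-letter labels such as `__label__en`, so a `language` argument like "en" would no longer match. A minimal sketch of normalizing the labels, where `label_to_code` is a hypothetical helper that is not part of this PR:

```python
LABEL_PREFIX = "__label__"


def label_to_code(label: str) -> str:
    """Strip the fastText label prefix and any script suffix (hypothetical helper)."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    # "eng_Latn" -> "eng"; a plain "en" is returned unchanged.
    return code.split("_", 1)[0]


print(label_to_code("__label__en"))        # en
print(label_to_code("__label__eng_Latn"))  # eng
```

Even after normalizing, the two models would use different code schemes ("en" versus "eng"), so the `language` argument description would need to say which one applies.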

Contributor Author


What speaks against including the ftz file in the repository? Alternatively, we could download the file during the image build process. I just want to avoid the situation where the component fails at execution time because an external dependency cannot be reached.
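
Both concerns could be covered by preferring a bundled file and only downloading when it is missing. A minimal sketch, assuming a hypothetical helper `resolve_model_path` (not part of this PR) and the `hf_hub_download` call from the suggestion above:

```python
from pathlib import Path
from typing import Callable


def resolve_model_path(local_path: str, download: Callable[[], str]) -> str:
    """Return the bundled model path if it exists, else the downloaded one."""
    if Path(local_path).is_file():
        return local_path
    # Only reached when the bundled file is missing, so a run with the
    # file baked into the image never touches the network.
    return download()


def download_from_hub() -> str:
    """Fetch the model from the Hugging Face Hub (requires huggingface_hub)."""
    # Imported lazily so the offline path needs no extra dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="facebook/fasttext-language-identification",
        filename="model.bin",
    )
```

The component could then call `fasttext.load_model(resolve_model_path("lid.176.ftz", download_from_hub))`. Note that the fallback would load a different model than the bundled one, so this only makes sense if both are acceptable for the filter.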

    """A class for language detection using FastText."""

    def __init__(self, language: str, model_path: str = "lid.176.ftz"):
        """
        Initializes the LanguageIdentification class.

        Args:
            language (str): language to filter on
            model_path (str): The path to the FastText language identification model.
        """
        self.language = language
        self.model = fasttext.load_model(model_path)

    def predict_lang(self, text: str) -> str:
        """
        Detects the language of a text sequence.

        Args:
            text (str): The text for language detection.

        Returns:
            str: The predicted language label, e.g. "__label__de".
        """
        # fastText's predict() rejects input that contains newlines.
        predictions = self.model.predict(text.replace("\n", " "), k=1)
        return predictions[0][0]

    def is_language(self, row) -> bool:
        """Predict if text of a row is written in the defined language."""
        # Compare the full label: a substring check would match every label
        # for codes such as "el", which occur inside "__label__" itself.
        return self.predict_lang(row["text"]) == f"__label__{self.language}"


class LanguageFilterComponent(PandasTransformComponent):
    """Component that filters rows based on the provided language."""

    def setup(self, *, language):
        """Set up the language filter component."""
        self.lang_detector = LanguageIdentification(language)

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """
        Keep only the rows whose text is written in the configured language.

        Args:
            dataframe: Pandas dataframe.

        Returns:
            Pandas dataframe
        """
        mask = dataframe.apply(
            lambda row: self.lang_detector.is_language(row), axis=1)

        return dataframe[mask]


if __name__ == "__main__":
    component = LanguageFilterComponent.from_args()
    component.run()
54 changes: 54 additions & 0 deletions components/language_filter/tests/language_filter_component_test.py
Member


Interesting to see these tests.

This could probably also be easier if we split the general component behavior from the user implementation into separate classes as discussed in chat. Since then we could test the user implementation without having to provide dummy variables for all the general component behavior.
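
A rough sketch of what that split could look like for this component, with the model injected so the language check can be tested without a component spec or manifest paths. `LanguageDetector` and `FakeModel` are hypothetical names used for illustration only; the fake merely mimics the `(labels, probabilities)` tuple that fastText's `predict` returns:

```python
class LanguageDetector:
    """User-side logic only: decides whether a row matches a language."""

    def __init__(self, language, model):
        self.language = language
        self.model = model  # anything with a fastText-like predict()

    def is_language(self, row) -> bool:
        # fastText's predict() rejects text that contains newlines.
        text = row["text"].replace("\n", " ")
        labels, _ = self.model.predict(text, k=1)
        # Compare the whole label; a substring test would let "el"
        # match every label because "__label__" itself contains "el".
        return labels[0] == f"__label__{self.language}"


class FakeModel:
    """Stand-in for a fastText model (test double, not a real API)."""

    def predict(self, text, k=1):
        label = "__label__de" if "Satz" in text else "__label__en"
        return (label,), (1.0,)


def test_is_language():
    detector = LanguageDetector("de", FakeModel())
    assert detector.is_language({"text": "Das hier ist ein Satz"})
    assert not detector.is_language({"text": "This is a sentence"})
    # "el" is a substring of every label but must not match.
    assert not LanguageDetector("el", FakeModel()).is_language({"text": "x"})


test_is_language()
```

With this shape, the component class would only wire the detector into `setup` and `transform`, and the tests for the detection logic would need no dummy manifests or metadata.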

@@ -0,0 +1,54 @@
"""Unit test for language filter component."""
import pandas as pd

from components.language_filter.src.main import LanguageFilterComponent
from fondant.component_spec import ComponentSpec


def test_run_component_test():
    """Test language filter component."""
    # Given: Dataframe with text in different languages
    data = [{"text": "Das hier ist ein Satz in deutscher Sprache"},
            {"text": "This is a sentence in English"},
            {"text": "Dit is een zin in het Nederlands"}]
    dataframe = pd.DataFrame(data)

    # When: The language filter component processes the dataframe
    # and filters out all entries which are not written in German
    spec = ComponentSpec.from_file("../fondant_component.yaml")

    component = LanguageFilterComponent(spec, input_manifest_path="./dummy_input_manifest.json",
                                        output_manifest_path="./dummy_input_manifest.json",
                                        metadata={},
                                        user_arguments={"language": "de"},
                                        )
    # setup() declares `language` as a required keyword-only argument.
    component.setup(language="de")
    dataframe = component.transform(dataframe=dataframe)

    # Then: dataframe only contains one German row
    assert len(dataframe) == 1
    assert dataframe.loc[0]["text"] == "Das hier ist ein Satz in deutscher Sprache"


def test_run_component_test_filter_out_all():
    """Test that the language filter component can filter out all rows."""
    # Given: Dataframe with text in different languages
    data = [{"text": "Das hier ist ein Satz in deutscher Sprache"},
            {"text": "This is a sentence in English"},
            {"text": "Dit is een zin in het Nederlands"}]
    dataframe = pd.DataFrame(data)

    # When: The language filter component processes the dataframe
    # and filters out all entries which are not written in French
    spec = ComponentSpec.from_file("../fondant_component.yaml")

    component = LanguageFilterComponent(spec, input_manifest_path="./dummy_input_manifest.json",
                                        output_manifest_path="./dummy_input_manifest.json",
                                        metadata={},
                                        user_arguments={"language": "fr"},
                                        )
    # setup() declares `language` as a required keyword-only argument.
    component.setup(language="fr")
    dataframe = component.transform(dataframe=dataframe)

    # Then: dataframe should contain no rows anymore
    assert len(dataframe) == 0