Suggestions how to clean a dataset #3

VVotan opened this issue Oct 31, 2024 · 1 comment

VVotan commented Oct 31, 2024
Cleaning your translation dataset is crucial for achieving good results with a transformer model. Here are some key steps to effectively clean your dataset:

1. Remove Duplicates:
   • Check for and eliminate duplicate sentence pairs to ensure the model doesn't overfit on repeated data (steps 1 and 2 are combined in the sketch after this list).
2. Normalize Text:
   • Convert text to a consistent format (e.g., lowercasing if applicable).
   • Remove unnecessary punctuation, special characters, and extra whitespace.
3. Tokenization:
   • Use a suitable tokenization method (such as BPE or WordPiece) to prepare your text for the model.
4. Filter Out Short or Long Sentences:
   • Remove sentence pairs that are too short (e.g., one word) or too long (beyond a chosen threshold) to maintain quality and consistency.
5. Check for Alignment:
   • Ensure that each source sentence aligns correctly with its target translation; mismatched pairs can confuse the model.
6. Handle Special Cases:
   • Address numbers, dates, and other special tokens to ensure they are represented consistently across languages.
   • For example, decide whether to keep numeric representations as-is or convert them to words.
7. Remove Non-Translatable Content:
   • Exclude sentences that contain untranslatable content, such as programming code or highly domain-specific jargon that may have no equivalent in the target language.
8. Language Consistency:
   • Make sure that the source and target sentences are in the intended languages, filtering out any entries that are mixed or incorrect (a language-identification sketch appears at the end of this comment).
9. Consider Context:
   • If applicable, check for context relevance, especially in datasets like subtitles where certain phrases only make sense in context.
10. Evaluate and Validate:
    • After cleaning, review a sample of the dataset to ensure that quality has improved and that the translations are accurate.
11. Consider Domain-Specific Cleaning:
    • If your dataset is focused on a specific domain (e.g., medical, technical), consider adding domain-specific cleaning steps, such as standardizing terminology.
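Steps 1 and 2 lend themselves to a single preprocessing pass. Here is a minimal sketch in plain Python; the helper names (`normalize_pair`, `deduplicate`) and the specific normalization choices (NFC Unicode form, whitespace collapsing, lowercasing) are illustrative assumptions to adapt to your language pair:

```python
import re
import unicodedata

def normalize_pair(source, target, lowercase=True):
    """Normalize a sentence pair: Unicode form, whitespace, optional lowercasing."""
    def norm(text):
        text = unicodedata.normalize("NFC", text)  # consistent Unicode representation
        text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
        return text.lower() if lowercase else text
    return norm(source), norm(target)

def deduplicate(pairs):
    """Drop exact duplicate (source, target) pairs while preserving order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique

# Example usage
pairs = [("Hello  world", "Hola  mundo"), ("Hello world", "Hola mundo")]
print(deduplicate([normalize_pair(s, t) for s, t in pairs]))
# [('hello world', 'hola mundo')]
```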

By following these steps, you can significantly improve the quality of your dataset, which in turn enhances the performance of your transformer model for language translation.
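For step 8, a lightweight language-identification filter can catch pairs that ended up in the wrong language. A minimal sketch, assuming the third-party langdetect package (`pip install langdetect`) and an English→Spanish dataset; detection is unreliable on very short strings, so treat it as a heuristic:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_if_languages_match(pairs, src_lang="en", tgt_lang="es"):
    """Keep only pairs whose detected languages match the expected codes."""
    kept = []
    for source, target in pairs:
        try:
            if detect(source) == src_lang and detect(target) == tgt_lang:
                kept.append((source, target))
        except LangDetectException:
            # Detection can fail on very short or ambiguous strings; drop such pairs.
            continue
    return kept
```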

VVotan commented Oct 31, 2024
Filtering out short or long sentences is important to maintain quality and consistency in your translation dataset. Here’s how you can do it effectively:

1. Define Length Criteria

  • Short Sentences: Decide on a minimum length threshold. For example, you might filter out sentences with fewer than 3-5 tokens (words).
  • Long Sentences: Set a maximum length threshold. Common practices range from 30 to 100 tokens, depending on the language and context.

2. Tokenization

  • Use a consistent tokenization method before applying length filters. This ensures that you’re counting tokens consistently across the dataset. For example, if using BPE or WordPiece, apply the same tokenizer to both source and target sentences.
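As an illustration, a subword token count can be computed with whatever tokenizer your model will use; the sketch below assumes the Hugging Face transformers library and a multilingual checkpoint, both example choices rather than anything prescribed in this issue:

```python
from transformers import AutoTokenizer  # third-party: pip install transformers

# Illustrative checkpoint; substitute the tokenizer you actually train or load for your model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def token_length(text):
    """Count subword tokens (tokenize() does not add special tokens like [CLS]/[SEP])."""
    return len(tokenizer.tokenize(text))

print(token_length("Hello, how are you?"))  # subword count, e.g. 6 for this checkpoint
```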

3. Implement Filtering

  • You can use simple programming logic to filter the dataset. Here’s an example in Python:

```python
def filter_sentences(dataset, min_length=3, max_length=100):
    """Keep only pairs whose source and target token counts fall within [min_length, max_length]."""
    filtered_dataset = []

    for source, target in dataset:
        source_tokens = source.split()  # Basic whitespace tokenization; replace with your tokenizer
        target_tokens = target.split()

        if (min_length <= len(source_tokens) <= max_length and
                min_length <= len(target_tokens) <= max_length):
            filtered_dataset.append((source, target))

    return filtered_dataset

# Example usage (max_length lowered to 10 so the long pair is actually excluded)
dataset = [
    ("Hi", "Hola"),  # Too short
    ("Hello, how are you?", "Hola, ¿cómo estás?"),  # Good
    ("This is a very long sentence that exceeds the maximum length we set.",
     "Esta es una oración muy larga que excede la longitud máxima que establecimos.")  # Too long
]

filtered = filter_sentences(dataset, min_length=3, max_length=10)
print(filtered)  # Only the "Good" pair is kept
```

4. Review and Adjust

  • After initial filtering, review the filtered dataset to ensure that the remaining sentences make sense and are appropriate for your task. You may need to adjust the thresholds based on your specific requirements.

5. Iterative Refinement

  • If you notice that many valid sentences are being filtered out, or that too many noisy pairs remain, iteratively refine your min/max thresholds based on the dataset’s characteristics (see the percentile sketch below).
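Rather than guessing thresholds, you can inspect the length distribution and clip at chosen percentiles. A small sketch using only the standard library; the 1st/99th percentile cut-offs are an illustrative assumption:

```python
import statistics

def length_percentiles(dataset, points=(1, 50, 99)):
    """Report source-side token-count percentiles to guide min/max length thresholds."""
    lengths = sorted(len(source.split()) for source, _ in dataset)
    cuts = statistics.quantiles(lengths, n=100)  # 99 cut points: 1st..99th percentile
    return {p: cuts[p - 1] for p in points}

# Example usage: clip roughly the shortest and longest 1% of the corpus.
# stats = length_percentiles(dataset)
# min_length, max_length = int(stats[1]), int(stats[99])
```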

6. Context Consideration

  • In some contexts (like dialogues or poetry), you might want to allow shorter sentences. Tailor your filtering to the specific dataset and intended use case.

By applying these steps, you can effectively filter out short or long sentences, enhancing the overall quality of your translation dataset.
