Suggestions how to clean a dataset #3
Filtering out short or long sentences is important for maintaining quality and consistency in your translation dataset. Here's how you can do it effectively:

1. Define Length Criteria: choose a minimum and maximum sentence length, in tokens, that suits your data and model.
2. Tokenization: measure length in tokens rather than characters. A whitespace split is a reasonable baseline, but a real tokenizer is more reliable (see the sketch after this list).
3. Implement Filtering: keep only the pairs where both sides fall within the bounds:
```python
def filter_sentences(dataset, min_length=3, max_length=100):
    filtered_dataset = []
    for source, target in dataset:
        source_tokens = source.split()  # Basic tokenization; replace with your tokenizer
        target_tokens = target.split()
        if min_length <= len(source_tokens) <= max_length and \
           min_length <= len(target_tokens) <= max_length:
            filtered_dataset.append((source, target))
    return filtered_dataset
```
```python
# Example usage
dataset = [
    ("Hi", "Hola"),  # Too short
    ("Hello, how are you?", "Hola, ¿cómo estás?"),  # Good
    ("This is a very long sentence that exceeds the maximum length we set.",
     "Esta es una oración muy larga que excede la longitud máxima que establecimos.")  # Too long
]

filtered = filter_sentences(dataset)
print(filtered)  # Only valid pairs will be included
```

4. Review and Adjust: inspect a sample of what was removed and tune the thresholds if good pairs are being discarded.
5. Iterative Refinement: re-run the filter after other cleaning steps, since normalization and deduplication can change sentence lengths.
6. Context Consideration: take care not to discard short sentences that are legitimate translations, such as greetings or confirmations.
By applying these steps, you can effectively filter out short or long sentences, enhancing the overall quality of your translation dataset.
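If you want token counts that match what the model will actually see, you can swap the whitespace split for a subword tokenizer. Here is a minimal sketch assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-es checkpoint; both are illustrative choices, not requirements of the approach above:

```python
from transformers import AutoTokenizer

# Illustrative model choice; use the tokenizer of whatever model you train.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def token_length(sentence):
    # Count subword tokens rather than whitespace-separated words.
    return len(tokenizer.tokenize(sentence))

# Plug this into filter_sentences by replacing the source.split() /
# target.split() counts with token_length(source) / token_length(target).
print(token_length("Hello, how are you?"))
```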
Cleaning your translation dataset is crucial for achieving good results with a transformer model. Here are the key steps:
1. Remove Duplicates: identical (source, target) pairs add no information and skew the training distribution; keep one copy of each (see the first sketch below).
2. Normalize Text: apply consistent Unicode normalization, whitespace handling, and punctuation conventions to both sides (also shown in the first sketch below).
3. Tokenization: tokenize both languages consistently, ideally with the same tokenizer your model will use.
4. Filter Out Short or Long Sentences: drop pairs outside your length bounds, as detailed in the other comment on this issue.
5. Check for Alignment: verify that each source sentence is actually paired with its translation; a length-ratio heuristic catches many misalignments (see the second sketch below).
6. Handle Special Cases: numbers, dates, URLs, and markup often need masking or special handling.
7. Remove Non-Translatable Content: strip boilerplate, code fragments, and text that was copied through untranslated.
8. Language Consistency: confirm each side is actually written in the expected language (see the third sketch below).
9. Consider Context: sentence-level splitting can break discourse; keep surrounding context where your task needs it.
10. Evaluate and Validate: spot-check a random sample after every cleaning pass.
11. Consider Domain-Specific Cleaning: terminology, casing, and formatting conventions vary by domain, so add rules specific to yours.
By following these steps, you can significantly improve the quality of your dataset, which in turn enhances the performance of your transformer model for language translation.
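For the first two steps, here is a minimal sketch using only the Python standard library; the normalization choices (NFC plus whitespace collapsing) are assumptions you should adjust to your corpus:

```python
import re
import unicodedata

def normalize(text):
    # NFC-normalize Unicode, collapse runs of whitespace, trim the ends.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(dataset):
    # Keep the first occurrence of each normalized (source, target) pair.
    seen = set()
    cleaned = []
    for source, target in dataset:
        pair = (normalize(source), normalize(target))
        if pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned

pairs = [
    ("Hello,  how are you?", "Hola, ¿cómo estás?"),
    ("Hello, how are you?", "Hola, ¿cómo estás?"),  # Duplicate after normalization
]
print(deduplicate(pairs))  # Only one pair remains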
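```

For the alignment check, one common heuristic is to flag pairs whose lengths differ wildly; the 2.0 ratio threshold below is an assumed starting point, not a standard value:

```python
def plausibly_aligned(source, target, max_ratio=2.0):
    # Misaligned pairs often have wildly different lengths.
    s_len, t_len = len(source.split()), len(target.split())
    if s_len == 0 or t_len == 0:
        return False
    return max(s_len, t_len) / min(s_len, t_len) <= max_ratio

print(plausibly_aligned("Hello, how are you?", "Hola, ¿cómo estás?"))           # True
print(plausibly_aligned("Hi", "Esta oración claramente no es su traducción."))  # False
```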
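For language consistency, here is a sketch assuming the third-party langdetect package (pip install langdetect). Note that detection is unreliable on very short strings, so it works best combined with the length filter above:

```python
from langdetect import detect, LangDetectException

def consistent_languages(source, target, src_lang="en", tgt_lang="es"):
    # Drop pairs where either side is not in the expected language.
    try:
        return detect(source) == src_lang and detect(target) == tgt_lang
    except LangDetectException:
        # Raised on empty or undecidable input; treat as inconsistent.
        return False
```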