- Huge performance improvements on large documents.
- Rename from `clean-html` to `clear-html` because of the PyPI name clash with `CleanHTML`.
- Make the project open-source.
- Fix and update type hints.
- These functions now accept optional callables:
  - `cleaned_node_to_text` has `text_extractor` to extract text.
  - `integrate_embeddings` has `preprocessor` to preprocess whitelisted nodes.
- `cleaned_node_to_html` never returns `None` anymore.
- Initial version.