feat: Add MarkdownToTextDocument (v2) (#6159)

* Add MarkdownToTextDocument * Add release notes * Update GitHub workflows * Update GitHub workflows * Refactor code with minimal dependencies * Update docstrings * Apply suggestions from code review Co-authored-by: Daria Fokina <[email protected]> * Update document with content and meta for backward compatibility * Refactor Document Class for Backward Compatibility Co-authored-by: Stefano Fiorucci <[email protected]> * Update tests * Improve test assertions --------- Co-authored-by: Daria Fokina <[email protected]> Co-authored-by: Stefano Fiorucci <[email protected]>
deepset-ai · Oct 31, 2023 · 6bf0b9d · 6bf0b9d
1 parent 5e12230
commit 6bf0b9d
Show file tree

Hide file tree

Showing 7 changed files with 266 additions and 5 deletions.
diff --git a/.github/workflows/linting_preview.yml b/.github/workflows/linting_preview.yml
@@ -74,7 +74,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf tika 'azure-ai-formrecognizer>=3.2.0b2'
+          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf markdown-it-py mdit_plain tika 'azure-ai-formrecognizer>=3.2.0b2'
           pip install --no-deps llvmlite numba 'openai-whisper>=20230918'  # prevent outdated version of tiktoken pinned by openai-whisper
           pip install ./haystack-linter
 

diff --git a/.github/workflows/tests_preview.yml b/.github/workflows/tests_preview.yml
@@ -117,7 +117,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf tika 'azure-ai-formrecognizer>=3.2.0b2'
+          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf markdown-it-py mdit_plain tika 'azure-ai-formrecognizer>=3.2.0b2'
           pip install --no-deps llvmlite numba 'openai-whisper>=20230918'  # prevent outdated version of tiktoken pinned by openai-whisper
 
       - name: Run
@@ -178,7 +178,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf tika 'azure-ai-formrecognizer>=3.2.0b2'
+          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf markdown-it-py mdit_plain tika 'azure-ai-formrecognizer>=3.2.0b2'
           pip install --no-deps llvmlite numba 'openai-whisper>=20230918'  # prevent outdated version of tiktoken pinned by openai-whisper
 
       - name: Run
@@ -237,7 +237,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf tika 'azure-ai-formrecognizer>=3.2.0b2'
+          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf markdown-it-py mdit_plain tika 'azure-ai-formrecognizer>=3.2.0b2'
           pip install --no-deps llvmlite numba 'openai-whisper>=20230918'  # prevent outdated version of tiktoken pinned by openai-whisper
 
       - name: Run Tika
@@ -291,7 +291,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf tika 'azure-ai-formrecognizer>=3.2.0b2'
+          pip install .[dev,preview] langdetect transformers[torch,sentencepiece]==4.34.1 'sentence-transformers>=2.2.0' pypdf markdown-it-py mdit_plain tika 'azure-ai-formrecognizer>=3.2.0b2'
           pip install --no-deps llvmlite numba 'openai-whisper>=20230918'  # prevent outdated version of tiktoken pinned by openai-whisper
 
       - name: Run

diff --git a/haystack/preview/components/file_converters/__init__.py b/haystack/preview/components/file_converters/__init__.py
@@ -3,11 +3,13 @@
 from haystack.preview.components.file_converters.azure import AzureOCRDocumentConverter
 from haystack.preview.components.file_converters.pypdf import PyPDFToDocument
 from haystack.preview.components.file_converters.html import HTMLToDocument
+from haystack.preview.components.file_converters.markdown import MarkdownToDocument
 
 __all__ = [
     "TextFileToDocument",
     "TikaDocumentConverter",
     "AzureOCRDocumentConverter",
     "PyPDFToDocument",
     "HTMLToDocument",
+    "MarkdownToDocument",
 ]
diff --git a/haystack/preview/components/file_converters/markdown.py b/haystack/preview/components/file_converters/markdown.py
@@ -0,0 +1,97 @@
+import logging
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Union
+
+from tqdm import tqdm
+
+from haystack.preview import Document, component
+from haystack.preview.dataclasses import ByteStream
+from haystack.preview.lazy_imports import LazyImport
+
+with LazyImport("Run 'pip install markdown-it-py mdit_plain'") as markdown_conversion_imports:
+    from markdown_it import MarkdownIt
+    from mdit_plain.renderer import RendererPlain
+
+
+logger = logging.getLogger(__name__)
+
+
+@component
+class MarkdownToDocument:
+    """
+    Converts a Markdown file into a text Document.
+
+    Usage example:
+    ```python
+    from haystack.preview.components.file_converters.markdown import MarkdownToDocument
+
+    converter = MarkdownToDocument()
+    results = converter.run(sources=["sample.md"])
+    documents = results["documents"]
+    print(documents[0].content)
+    # 'This is a text from the markdown file.'
+    ```
+    """
+
+    def __init__(self, table_to_single_line: bool = False, progress_bar: bool = True):
+        """
+        :param table_to_single_line: Convert contents of the table into a single line. Defaults to False.
+        :param progress_bar: Show a progress bar for the conversion. Defaults to True.
+        """
+        markdown_conversion_imports.check()
+
+        self.table_to_single_line = table_to_single_line
+        self.progress_bar = progress_bar
+
+    @component.output_types(documents=List[Document])
+    def run(self, sources: List[Union[str, Path, ByteStream]], meta: Optional[List[Dict[str, Any]]] = None):
+        """
+        Reads text from a markdown file and executes optional preprocessing steps.
+
+        :param sources: A list of markdown data sources (file paths or binary objects)
+        :param meta: Optional list of metadata to attach to the Documents.
+        The length of the list must match the number of paths. Defaults to `None`.
+        """
+        parser = MarkdownIt(renderer_cls=RendererPlain)
+        if self.table_to_single_line:
+            parser.enable("table")
+
+        documents = []
+        if meta is None:
+            meta = [{}] * len(sources)
+
+        for source, metadata in tqdm(
+            zip(sources, meta),
+            total=len(sources),
+            desc="Converting markdown files to Documents",
+            disable=not self.progress_bar,
+        ):
+            try:
+                file_content = self._extract_content(source)
+            except Exception as e:
+                logger.warning("Could not read %s. Skipping it. Error: %s", source, e)
+                continue
+            try:
+                text = parser.render(file_content)
+            except Exception as conversion_e:  # Consider specifying the expected exception type(s) here
+                logger.warning("Failed to extract text from %s. Skipping it. Error: %s", source, conversion_e)
+                continue
+
+            document = Document(content=text, meta=metadata)
+            documents.append(document)
+
+        return {"documents": documents}
+
+    def _extract_content(self, source: Union[str, Path, ByteStream]) -> str:
+        """
+        Extracts content from the given data source.
+        :param source: The data source to extract content from.
+        :return: The extracted content.
+        """
+        if isinstance(source, (str, Path)):
+            with open(source) as text_file:
+                return text_file.read()
+        if isinstance(source, ByteStream):
+            return source.data.decode("utf-8")
+
+        raise ValueError(f"Unsupported source type: {type(source)}")
diff --git a/releasenotes/notes/add-MarkdownToTextDocument-f97ec6c5fb35527d.yaml b/releasenotes/notes/add-MarkdownToTextDocument-f97ec6c5fb35527d.yaml
@@ -0,0 +1,4 @@
+---
+preview:
+  - |
+    Add MarkdownToTextDocument, a file converter that converts Markdown files into a text Documents.
diff --git a/test/preview/components/file_converters/test_markdown_to_document.py b/test/preview/components/file_converters/test_markdown_to_document.py
@@ -0,0 +1,93 @@
+import logging
+
+import pytest
+
+from haystack.preview.components.file_converters.markdown import MarkdownToDocument
+from haystack.preview.dataclasses import ByteStream
+
+
+class TestMarkdownToDocument:
+    @pytest.mark.unit
+    def test_init_params_default(self):
+        converter = MarkdownToDocument()
+        assert converter.table_to_single_line is False
+        assert converter.progress_bar is True
+
+    @pytest.mark.unit
+    def test_init_params_custom(self):
+        converter = MarkdownToDocument(table_to_single_line=True, progress_bar=False)
+        assert converter.table_to_single_line is True
+        assert converter.progress_bar is False
+
+    @pytest.mark.integration
+    def test_run(self, preview_samples_path):
+        converter = MarkdownToDocument()
+        sources = [preview_samples_path / "markdown" / "sample.md"]
+        results = converter.run(sources=sources)
+        docs = results["documents"]
+
+        assert len(docs) == 1
+        for doc in docs:
+            assert "What to build with Haystack" in doc.content
+            assert "# git clone https://github.com/deepset-ai/haystack.git" in doc.content
+
+    @pytest.mark.integration
+    def test_run_metadata(self, preview_samples_path):
+        converter = MarkdownToDocument()
+        sources = [preview_samples_path / "markdown" / "sample.md"]
+        metadata = [{"file_name": "sample.md"}]
+        results = converter.run(sources=sources, meta=metadata)
+        docs = results["documents"]
+
+        assert len(docs) == 1
+        for doc in docs:
+            assert "What to build with Haystack" in doc.content
+            assert "# git clone https://github.com/deepset-ai/haystack.git" in doc.content
+            assert doc.meta == {"file_name": "sample.md"}
+
+    @pytest.mark.integration
+    def test_run_wrong_file_type(self, preview_samples_path, caplog):
+        """
+        Test if the component runs correctly when an input file is not of the expected type.
+        """
+        sources = [preview_samples_path / "audio" / "answer.wav"]
+        converter = MarkdownToDocument()
+        with caplog.at_level(logging.WARNING):
+            output = converter.run(sources=sources)
+            assert "codec can't decode byte" in caplog.text
+
+        docs = output["documents"]
+        assert not docs
+
+    @pytest.mark.integration
+    def test_run_error_handling(self, caplog):
+        """
+        Test if the component correctly handles errors.
+        """
+        sources = ["non_existing_file.md"]
+        converter = MarkdownToDocument()
+        with caplog.at_level(logging.WARNING):
+            result = converter.run(sources=sources)
+            assert "Could not read non_existing_file.md" in caplog.text
+            assert not result["documents"]
+
+    @pytest.mark.unit
+    def test_mixed_sources_run(self, preview_samples_path):
+        """
+        Test if the component runs correctly if the input is a mix of strings, paths and ByteStreams.
+        """
+        sources = [
+            preview_samples_path / "markdown" / "sample.md",
+            str((preview_samples_path / "markdown" / "sample.md").absolute()),
+        ]
+        with open(preview_samples_path / "markdown" / "sample.md", "rb") as f:
+            byte_stream = f.read()
+            sources.append(ByteStream(byte_stream))
+
+        converter = MarkdownToDocument()
+        output = converter.run(sources=sources)
+        docs = output["documents"]
+        assert len(docs) == 3
+        for doc in docs:
+            assert "What to build with Haystack" in doc.content
+            assert "# git clone https://github.com/deepset-ai/haystack.git" in doc.content
diff --git a/test/preview/test_files/markdown/sample.md b/test/preview/test_files/markdown/sample.md
@@ -0,0 +1,65 @@
+---
+type: intro
+date: 1.1.2023
+---
+```bash
+pip install farm-haystack
+```
+## What to build with Haystack
+
+- **Ask questions in natural language** and find granular answers in your own documents.
+- Perform **semantic search** and retrieve documents according to meaning not keywords
+- Use **off-the-shelf models** or **fine-tune** them to your own domain.
+- Use **user feedback** to evaluate, benchmark and continuously improve your live models.
+- Leverage existing **knowledge bases** and better handle the long tail of queries that **chatbots** receive.
+- **Automate processes** by automatically applying a list of questions to new documents and using the extracted answers.
+
+![Logo](https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/logo.png)
+
+
+## Core Features
+
+-   **Latest models**: Utilize all latest transformer based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
+-   **Modular**: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
+-   **Open**: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers)
+-   **Scalable**: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a fastAPI REST API
+-   **End-to-End**: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
+-   **Developer friendly**: Easy to debug, extend and modify.
+-   **Customizable**: Fine-tune models to your own domain or implement your custom DocumentStore.
+-   **Continuous Learning**: Collect new training data via user feedback in production & improve your models continuously
+
+|  |  |
+|-|-|
+| :ledger: [Docs](https://haystack.deepset.ai/overview/intro) | Usage, Guides, API documentation ...|
+| :beginner: [Quick Demo](https://github.com/deepset-ai/haystack/#quick-demo) | Quickly see what Haystack offers |
+| :floppy_disk: [Installation](https://github.com/deepset-ai/haystack/#installation) | How to install Haystack |
+| :art: [Key Components](https://github.com/deepset-ai/haystack/#key-components) | Overview of core concepts |
+| :mortar_board: [Tutorials](https://github.com/deepset-ai/haystack/#tutorials) | Jupyter/Colab Notebooks & Scripts |
+| :eyes: [How to use Haystack](https://github.com/deepset-ai/haystack/#how-to-use-haystack) | Basic explanation of concepts, options and usage |
+| :heart: [Contributing](https://github.com/deepset-ai/haystack/#heart-contributing) | We welcome all contributions! |
+| :bar_chart: [Benchmarks](https://haystack.deepset.ai/benchmarks/v0.9.0) | Speed & Accuracy of Retriever, Readers and DocumentStores |
+| :telescope: [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack |
+| :pray: [Slack](https://haystack.deepset.ai/community/join) | Join our community on Slack |
+| :bird: [Twitter](https://twitter.com/deepset_ai) | Follow us on Twitter for news and updates |
+| :newspaper: [Blog](https://medium.com/deepset-ai) | Read our articles on Medium |
+
+
+## Quick Demo
+
+The quickest way to see what Haystack offers is to start a [Docker Compose](https://docs.docker.com/compose/) demo application:
+
+**1. Update/install Docker and Docker Compose, then launch Docker**
+
+```
+    # apt-get update && apt-get install docker && apt-get install docker-compose
+    # service docker start
+```
+
+**2. Clone Haystack repository**
+
+```
+    # git clone https://github.com/deepset-ai/haystack.git
+```
+
+### 2nd level headline for testing purposes
+#### 3rd level headline for testing purposes