"parent-child" mod don't works on txt documents #12819

Open
5 tasks done
Sunzhuoyi opened this issue Jan 17, 2025 · 4 comments
Labels
🐞 bug Something isn't working

Comments

@Sunzhuoyi

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Hi, when I use "parent-child" mode on my documents, it works for docx and pdf files but not for txt. My txt documents are uploaded through Dify's "/document/create-by-text" API:
`"""Create a document from text and upload it to Dify."""

    url = f"{self.DIFY_BASE_URL}/{self.DATASET_ID}/document/create-by-text"
    headers = {
        "Authorization": f"Bearer {self.DIFY_API_KEY}"
    }

    # Proceed with the upload
    payload = {
        "name": title,
        "text": content,
        "indexing_technique": "high_quality",
        "doc_form": "hierarchical_model",
        "process_rule": {
            "mode": "automatic"
        }
    }`

When I open a txt document, it looks like this:

Image

Added: the txt documents work well in general chunk mode.

✔️ Expected Behavior

The txt document should be displayed the same way as docx or pdf files:

Image

❌ Actual Behavior

I can't see anything in the txt document:
Image

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 17, 2025

dosubot bot commented Jan 17, 2025

The issue you're experiencing with the "parent-child" mode not working on txt documents in Dify might be related to how the txt files are being processed or segmented. Here are a few things to consider:

  1. Segmentation Rules: Ensure that the segmentation rules for txt files are correctly configured. By default, parent chunks are split by paragraphs (\n\n) and child chunks by sentences (\n). You can customize these delimiters using regex to better fit the structure of your txt documents [1][2].

  2. Chunk Length: Check the maximum chunk length settings for both parent and child chunks. The default is 500 tokens for parent chunks and 200 tokens for child chunks, with a maximum of 4000 tokens. If your txt documents are not being segmented properly, adjusting these values might help [1].

  3. Text Preprocessing: Make sure that any preprocessing rules, such as removing URLs, emails, or extra spaces, are not inadvertently stripping out important content from your txt files [1].

  4. Indexing Technique: Verify that the indexing technique and document form settings are appropriate for your use case. The "high_quality" mode and "hierarchical_model" form should work, but ensure they are correctly applied [3].

If these settings are correctly configured and the issue persists, it might be worth checking if there are any specific errors or logs generated during the upload and processing stages that could provide more insight into the problem.
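To make the segmentation rule in point 1 concrete, here is a minimal sketch of a two-level split using the delimiters mentioned above (`\n\n` for parent chunks, `\n` for child chunks). The function name `split_parent_child` is hypothetical, token limits are ignored, and this is an illustration only, not Dify's actual splitter.

```python
def split_parent_child(text, parent_sep="\n\n", child_sep="\n"):
    """Split text into parent chunks, each with its list of child chunks.

    Hypothetical helper for illustration; it ignores max token lengths.
    """
    chunks = []
    for parent in text.split(parent_sep):
        parent = parent.strip()
        if not parent:
            continue  # skip empty segments produced by repeated separators
        children = [c.strip() for c in parent.split(child_sep) if c.strip()]
        chunks.append({"parent": parent, "children": children})
    return chunks


sample = "First line.\nSecond line.\n\nThird line."
for chunk in split_parent_child(sample):
    print(len(chunk["children"]), "child chunk(s):", chunk["children"])
```

A txt file that contains no blank lines would come out of this sketch as a single parent chunk, so it can be worth checking how the uploaded text is actually delimited.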


@Sunzhuoyi
Author

@dosubot Yes, I checked all of these, and they are configured correctly.


dosubot bot commented Jan 17, 2025

If you've verified all the configurations and settings, and the issue persists, consider the following additional steps:

  1. Error Logs: Check the error logs for any specific messages that might indicate why the txt files are not being processed correctly. Look for errors related to segmentation or processing that might not be immediately obvious.

  2. File Content: Ensure that the content of the txt files is not causing issues. Sometimes, unexpected characters or formatting can interfere with processing. Try simplifying the content to see if it affects the outcome.

  3. API Usage: Double-check the API usage for uploading txt files. Ensure that the payload and headers are correctly formatted and that the API endpoint is functioning as expected.

  4. Version Compatibility: Make sure that the version of Dify you are using is compatible with the features you are trying to use. Occasionally, updates or patches might address specific issues with file processing.

If none of these steps resolve the issue, it might be helpful to isolate the problem by testing with different txt files or configurations to identify any patterns or specific conditions that trigger the problem.


@kurokobo
Contributor

@Sunzhuoyi
Hi, have you tried "mode": "hierarchical" instead of "mode": "automatic"?
See also: #12714 (comment)
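
For reference, a rough sketch of the original payload with the suggested mode change applied. Only `"mode": "hierarchical"` comes from this thread. The `rules` block (field names such as `parent_mode` and `subchunk_segmentation`, and the choice of pre-processing rule) is an assumption about what the API expects in this mode and should be checked against the API documentation for your Dify version. The separators and token limits reuse the defaults quoted earlier in the thread, and the function name and sample arguments are hypothetical. The request is built with the standard library but not sent.

```python
import json
import urllib.request


def build_create_by_text_request(base_url, dataset_id, api_key, title, content):
    """Build (but do not send) a create-by-text request in hierarchical mode."""
    payload = {
        "name": title,
        "text": content,
        "indexing_technique": "high_quality",
        "doc_form": "hierarchical_model",
        "process_rule": {
            # The change suggested above: "hierarchical" instead of "automatic".
            "mode": "hierarchical",
            # ASSUMPTION: everything under "rules" is illustrative and not
            # confirmed in this thread; verify the field names before use.
            "rules": {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": True},
                ],
                "parent_mode": "paragraph",
                "segmentation": {"separator": "\n\n", "max_tokens": 500},
                "subchunk_segmentation": {"separator": "\n", "max_tokens": 200},
            },
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/{dataset_id}/document/create-by-text",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with your own base URL, dataset ID and key.
req = build_create_by_text_request(
    "http://localhost/v1/datasets", "dataset-id", "api-key", "demo", "para 1\n\npara 2"
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request; the original snippet's `self.DIFY_BASE_URL`, `self.DATASET_ID` and `self.DIFY_API_KEY` map onto the first three arguments.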

2 participants