[DOC] Add html strip processor documentation #5984
Signed-off-by: Melissa Vagi <[email protected]>
@hdhalter We need a dev to take the first pass at drafting this content. I can support them in refining the content and getting it through the documentation process. The ingest processor template is provided with this PR, so it should help the dev get started. Thanks!
@gaobinlong This PR is ready for technical review. Thank you!
@vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!
# HTML strip processor
The `html_strip` processor removes HTML tags from string fields in incoming documents. The processor is useful when indexing data from web pages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
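To make the described behavior concrete, here is a rough local approximation in Python. It assumes that every tag is replaced by exactly one newline, as the paragraph above states; the helper name `strip_html_tags` and the sample string are hypothetical, and this is not the processor's actual implementation.

```python
import re

# Hypothetical helper, not part of OpenSearch: approximates the documented
# behavior by swapping each HTML tag for a single newline character.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_tags(text: str) -> str:
    """Return text with every HTML tag replaced by a newline."""
    return TAG_PATTERN.sub("\n", text)


sample = "<p>Fast <b>charging</b> cable</p>"
stripped = strip_html_tags(sample)
print(repr(stripped))
```

The regular expression is a simplification for illustration only; it does not handle comments, entities, or malformed markup.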
Can we be more precise than "clean"? What do we actually mean by this?
Deleted that sentence. It's not necessary info. "Clean" means readable text content without HTML tags.
#### Response
The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags, while storing the cleaned version in the `cleaned_description` field.
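For context, a pipeline matching this description might look like the following sketch, built here as a Python dictionary and printed as a request body. The pipeline name `strip-html-pipeline` and the description string are hypothetical; the `description` and `cleaned_description` field names come from the sentence above, and the `field` and `target_field` parameters are assumed to behave as they do in other ingest processors.

```python
import json

# Hypothetical pipeline name; it does not appear in this PR.
PIPELINE_NAME = "strip-html-pipeline"

# Sketch of a pipeline definition: read HTML from `description` and write
# the stripped text to `cleaned_description`, leaving the original intact.
pipeline_body = {
    "description": "Strip HTML tags from the description field",
    "processors": [
        {
            "html_strip": {
                "field": "description",
                "target_field": "cleaned_description",
            }
        }
    ],
}

request_line = f"PUT _ingest/pipeline/{PIPELINE_NAME}"
print(request_line)
print(json.dumps(pipeline_body, indent=2))
```

Writing to a separate target field keeps the original markup available, which appears to be what the response text describes.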
Same comment re: "clean"
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
LGTM
* Add html strip processor documentation Signed-off-by: Melissa Vagi <[email protected]>
* Add html strip processor documentation Signed-off-by: Melissa Vagi <[email protected]>
* Add examples Signed-off-by: Melissa Vagi <[email protected]>
* Copy edits Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _ingest-pipelines/processors/html-strip.md Signed-off-by: Melissa Vagi <[email protected]>

---------

Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 6a119e1)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Closes content gap
Issues Resolved
#4647
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.