uri_parts parse directory structure for extension #105612

Closed
jguay opened this issue Feb 19, 2024 · 1 comment · Fixed by #105689
Labels: >bug, :Data Management/Ingest Node, Team:Data Management


jguay commented Feb 19, 2024

Elasticsearch Version

8.12.1

Installed Plugins

No response

Java Version

bundled

OS Version

docker

Problem Description

The uri_parts ingest processor outputs a wrong extension when the URL has no file extension and its path contains one or more dot characters.

For example, the URL https://www.example.com/path.withdot/filenamewithoutextension yields "extension": "withdot/filenamewithoutextension".

Steps to Reproduce

  1. Create pipeline
PUT /_ingest/pipeline/test-uri-parts
{
    "processors": [
        {
            "uri_parts": {
                "field": "url.original",
                "target_field": "url.parsed"
            }
        }
    ]
}
  2. Simulate pipeline:
POST _ingest/pipeline/test-uri-parts/_simulate
{
  "docs" :
  [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "url": {
          "original": "https://www.example.com/path.withdot/filenamewithoutextension"
        }
      }
    }
    ]
}

The output contains a wrong value for extension:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_version": "-3",
        "_id": "id",
        "_source": {
          "url": {
            "parsed": {
              "path": "/path.withdot/folder/filenamewithoutextension",
              "extension": "withdot/folder/filenamewithoutextension",
              "original": "https://www.example.com/path.withdot/folder/filenamewithoutextension",
              "scheme": "https",
              "domain": "www.example.com"
            },
            "original": "https://www.example.com/path.withdot/folder/filenamewithoutextension"
          }
        },
        "_ingest": {
          "timestamp": "2024-02-19T09:47:21.38168605Z"
        }
      }
    }
  ]
}
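
The wrong value is consistent with the extension being taken from everything after the last dot in the whole path, instead of only from the last path segment. The Python sketch below illustrates that difference. It is not the processor's actual code, and the names naive_extension and segment_extension are hypothetical helpers introduced only for this illustration.

```python
from urllib.parse import urlsplit


def naive_extension(url):
    """Hypothetical: everything after the last '.' anywhere in the path."""
    path = urlsplit(url).path
    if "." not in path:
        return None
    return path.rsplit(".", 1)[1]


def segment_extension(url):
    """Hypothetical: only consider a '.' inside the last path segment."""
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return None
    return last_segment.rsplit(".", 1)[1]


if __name__ == "__main__":
    url = "https://www.example.com/path.withdot/filenamewithoutextension"
    print(naive_extension(url))    # the value reported in this issue
    print(segment_extension(url))  # no extension expected
```

Under this assumption, both helpers agree whenever the file name itself contains a dot, and differ only when the last dot sits in a directory name.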

Workaround

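  • The issue does not occur for URLs whose file name has a real extension, such as https://www.example.com/path.withdot/filenamewithextension.zip. In the failing case the computed extension contains a '/' character, so the following pipeline removes the extension field whenever it contains one: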
PUT /_ingest/pipeline/test-uri-parts
{
  "processors": [
    {
      "uri_parts": {
        "field": "url.original",
        "target_field": "url.parsed"
      }
    },
    {
      "remove": {
        "field": "url.parsed.extension",
        "if": "ctx?.url?.parsed?.extension != null && ctx?.url?.parsed?.extension.indexOf('/') != -1"
      }
    }
  ]
}
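
To see which documents the remove processor would act on, the condition can be mirrored outside Elasticsearch. The sketch below is a Python stand-in for the Painless expression in the pipeline, not Painless itself, and should_remove_extension is a hypothetical helper name.

```python
def should_remove_extension(ctx):
    """Python stand-in for the workaround's "if" condition.

    Returns True when url.parsed.extension exists and contains a '/'.
    """
    url = ctx.get("url") or {}
    parsed = url.get("parsed") or {}
    extension = parsed.get("extension")
    return extension is not None and "/" in extension


if __name__ == "__main__":
    wrong = {"url": {"parsed": {"extension": "withdot/filenamewithoutextension"}}}
    valid = {"url": {"parsed": {"extension": "zip"}}}
    print(should_remove_extension(wrong))
    print(should_remove_extension(valid))
```

A legitimate extension such as zip is left untouched, and documents without a parsed extension are skipped.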
@elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)
