Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy_to of mapper attachments metadata field isn't working #14946

Closed
dadoonet opened this issue Nov 23, 2015 · 19 comments
Closed

copy_to of mapper attachments metadata field isn't working #14946

dadoonet opened this issue Nov 23, 2015 · 19 comments
Assignees
Labels
>bug >regression :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@dadoonet
Copy link
Member

From @gpaul on November 17, 2015 10:43

The following set of steps against a fresh elasticsearch 2.0.0 instance with v3.0.2 of this plugin installed shows that copy_to isn't working for the name field. I doubt it is working for the other metadata fields, either.

You can copy/paste this in your shell.

# create a mapping
curl -XPOST 'http://localhost:9200/test_copyto' -d '{
  "mappings": {
    "person": {
      "properties": {
        "copy_dst": { "type": "string" },
        "doc": {
      "type": "attachment",
      "fields": {
        "name": { "copy_to": "copy_dst" }
          }
    }
      }
    }
  }
}'
## => {"acknowledged":true}


# index a document, specifying a document name
curl -XPOST 'http://localhost:9200/test_copyto/person/1' -d '{
  "doc": {
    "_content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "_name": "my-attachment-name.doc"
  }
}'
## => {"_index":"test_copyto","_type":"person","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}


# search for the document by its contents
curl -XPOST 'http://localhost:9200/test_copyto/person/_search' -d '
{
  "query": {
    "query_string": {
      "default_field": "doc.content",
      "fields": ["copy_dst", "doc.content"],
      "query": "ipsum"
    }
  }
}
'
## => {"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits": "total":1,"max_score":0.04119441,"hits":[{"_index":"test_copyto","_type":"person","_id":"1","_score":0.04119441,"_source":{
#  "doc": {
#    "_content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
#    "_name": "my-attachment-name.doc"
#  }
#}}]}}


# search for the document by its name
curl -XPOST 'http://localhost:9200/test_copyto/person/_search' -d '
{
  "query": {
    "query_string": {
      "default_field": "doc.name",
      "fields": ["doc.name"],
      "query": "my-test.doc"
    }
  }
}
'
## => {"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.02250402,"hits":[{"_index":"test_copyto","_type":"person","_id":"1","_score":0.02250402,"_source":{
#  "doc": {
#    "_content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
#    "_name": "my-attachment-name.doc"
#  }
#}}]}}

# search for the document by the copy_dst field
curl -XPOST 'http://localhost:9200/test_copyto/person/_search' -d '
{
  "query": {
    "query_string": {
      "default_field": "copy_dst",
      "fields": ["copy_dst"],
      "query": "my-test.doc"
    }
  }
}
'
## => {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

Copied from original issue: elastic/elasticsearch-mapper-attachments#190

@dadoonet dadoonet added the >bug label Nov 23, 2015
@dadoonet
Copy link
Member Author

From @gpaul on November 23, 2015 13:37

Ping. Should I open this issue against the main elasticsearch repository now that this plugin is moving there?

@dadoonet
Copy link
Member Author

Did you try the same script with elasticsearch 1.7? I'd like to know if it's a regression or if it has always been there.

I know that copy_to feature is supposed to work for the extracted text but I don't think it worked for metadata.

If I'm right (so it's not an issue but more a feature request), then you can open it in elasticsearch repo.
If I'm wrong (so it's a regression), then keep it here.

@dadoonet
Copy link
Member Author

From @gpaul on November 23, 2015 14:58

It seems like a regression:
elasticsearch 1.7.0 with mapper-attachments 2.7.1
yields

curl -XPOST 'http://localhost:9200/test_copyto/person/_search' -d '
 {
   "query": {
     "query_string": {
       "default_field": "copy_dst",
       "fields": ["copy_dst"],
       "query": "my-test.doc"
     }
   }
 }
 '
#{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.02250402,"hits":[{"_index":"test_copyto","_type":"person","_id":"1","_score":0.02250402,"_source":{
#  "doc": {
#    "_content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
#    "_name": "my-test.doc"
#  }
#}}]}}

@dadoonet
Copy link
Member Author

Thank you @gpaul

@dadoonet
Copy link
Member Author

From @hhoechtl on November 23, 2015 15:2

It's also not working with the .content field

@dadoonet dadoonet changed the title copy_to of metadata field isn't working (v3.0.2) copy_to of mapper attachments metadata field isn't working Nov 23, 2015
@dadoonet
Copy link
Member Author

I created a test for elasticsearch 1.7 and it is working well in 1.x series:

@Test
public void testCopyToMetaData() throws Exception {
    String mapping = copyToStringFromClasspath("/org/elasticsearch/index/mapper/attachment/test/integration/simple/copy-to-metadata.json");
    byte[] txt = copyToBytesFromClasspath("/org/elasticsearch/index/mapper/attachment/test/sample-files/text-in-english.txt");

    client().admin().indices().putMapping(putMappingRequest("test").type("person").source(mapping)).actionGet();

    index("test", "person", jsonBuilder().startObject()
            .startObject("file")
            .field("_content", txt)
            .field("_name", "name")
            .endObject()
            .endObject());
    refresh();

    CountResponse countResponse = client().prepareCount("test").setQuery(queryStringQuery("name").defaultField("file.name")).execute().get();
    assertThatWithError(countResponse.getCount(), equalTo(1l));

    countResponse = client().prepareCount("test").setQuery(queryStringQuery("name").defaultField("copy")).execute().get();
    assertThatWithError(countResponse.getCount(), equalTo(1l));
}

I created a test for 2.* branches which demonstrates the regression from 2.0.
It can be reused to fix this issue: elastic/elasticsearch-mapper-attachments@30aeda6:

"Copy To Feature":

    - do:
        indices.create:
            index: test
            body:
              mappings:
                doc:
                  properties:
                    copy_dst:
                      type: string
                    doc:
                      type: attachment
                      fields:
                        name:
                          copy_to: copy_dst
    - do:
        cluster.health:
          wait_for_status: yellow

    - do:
        index:
            index: test
            type: doc
            id: 1
            body:
              doc:
                _content: "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
                _name: "name"

    - do:
        indices.refresh: {}

    - do:
        search:
            index: test
            body:
                query:
                    match:
                        doc.content: "ipsum"

    - match: { hits.total: 1 }

    - do:
        search:
            index: test
            body:
                query:
                    match:
                        doc.name: "name"

    - match: { hits.total: 1 }

    - do:
        search:
            index: test
            body:
                query:
                    match:
                        copy_dst: "name"

    - match: { hits.total: 1 }

@rjernst Could you give a look please?

@brwe brwe assigned brwe and unassigned rjernst Nov 26, 2015
@rjernst
Copy link
Member

rjernst commented Nov 30, 2015

@dadoonet One odd thing I see is using a copy_to from within a multi field. Seems like we should disallow that? We removed support here:
https://github.com/elastic/elasticsearch/pull/10802/files#diff-ab54166cc098ea6f03350e68a8689a5bL432

The copy's are now handled outside of the mappers, while before there was a lot of spaghetti sharing between object mappers and field mappers that made document parsing complex. If we want to add it back, we will probably need a good refactoring in the way multi fields and copy_tos are handled. The problem is copy_to inside a multi field is essentially a nested copy_to since multi fields are conceptually just a copy to (see my notes in #10802).

@dadoonet
Copy link
Member Author

dadoonet commented Dec 1, 2015

@clintongormley WDYT?

Let me sum up the discussion.

Before 2.0, we were able to support:

PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "copy_to": "copy"
          },
          "name": {
            "type": "string",
            "copy_to": "copy"
          },
          "author": {
            "type": "string",
            "copy_to": "copy"
          },
          "title": {
            "type": "string",
            "copy_to": "copy"
          }
        }
      },
      "copy": {
        "type": "string"
      }
    }
  }
}

It means that extracted content, title, name (for filename) and author can be indexed in another field copy where the user can run a "fulltext" search.

From 2.0, this is not supported anymore for the reasons @rjernst described.

Should we document that mapper attachments does not support anymore copy_to feature on extracted/provided nested fields?
Note that we don't support a global copy_to on the BASE64 content itself.

Or should we try to implement such a thing only for mapper attachments plugin. I can't see another use case today. May be some other community plugins would like to have it but really unsure here.

IMO, users can always run search on multiple fields at the same time so instead of searching in copy in the above example, they could search on file.content, file.title, file.name and file.author. So not supporting this feature anymore does not sound as a big deal to me.

Thoughts?

@clintongormley
Copy link
Contributor

My feeling is that, long term, we should remove the attachment field type. Instead, we should move Tika to being a processor in the node ingest API which will read the binary attachment and add the contents and any meta fields to the _source field itself. The result is that attachments stop being magical and become just like any other field which can be configured in standard ways.

With this goal in mind, it doesn't make sense to add complicated (and likely buggy) hacks to fix this regression in 2.x. But, we should let the user know that copy-to on multi-fields is not supported: we should throw an exception at mapping time instead of silently ignoring the problem.

@brwe
Copy link
Contributor

brwe commented Dec 1, 2015

So...I took a look at the copy_to and multi_fields and tried to throw an exception when we encounter copy_to in multi_fields - we could do that I guess. But I have the suspicion that re-adding the copy_to to multi_fields is just a matter of shifting three lines. I made a pr here so you can see what I mean: #15152 Tests pass but I might be missing something.

@dadoonet
Copy link
Member Author

dadoonet commented Dec 1, 2015

That would be an awesome news @brwe. Was not expecting that. Are the tests I wrote for mapper attachments plugin work as well? Is that what you mean by Tests pass?

@brwe
Copy link
Contributor

brwe commented Dec 2, 2015

@dadoonet I mean the tests that were there already and the tests I added in the pr. I suspect that your test would pass as well but did not check.

However, we ( @rjernst @clintongormley and I ) had a chat yesterday about this in here is the outcome:
Doing this fix is not a good idea for several reasons:

  1. it would introduce a dependency between DocumentParser and FieldMapper again which was removed with much effort in Consolidate document parsing logic #10802
  2. long term we want to replace the implementation of multi fields with copy_to mechanism anyway
  3. chains of copy_to ala
"a" : {
"type": "string",
"copy_to": "b"
},
"b": {
"type": "string",
"copy_to": "a"
}

should not be possible because they add complexity in code and usage.

This reduces flexibility of mappings and how they can transform data. However, the consensus is that elasticsearch is not the right place to perform these kind of transformations anyway and these kind of operations should move to external tools such as the planned node ingest plugin #14049

Applying the fix I proposed would just delay the removal of the feature and therefore we think we should not do it.

@gpaul
Copy link

gpaul commented Dec 2, 2015

It seems like you have removed a feature without providing a better alternative. The use of copy_to for custom '_all' fields is well-documented and very useful.

@gpaul
Copy link

gpaul commented Dec 2, 2015

@brwe My 2c on your discussion

  1. it would introduce a dependency between DocumentParser and FieldMapper again which was removed with much effort in Consolidate document parsing logic #10802
    -> if such cleanup code broke something - then perhaps it was merged too soon. A temporary patch to fix a broken feature until such time as it can be properly reengineered is not unreasonable.
  2. long term we want to replace the implementation of multi fields with copy_to mechanism anyway
    -> perfect - the process I would like to see in that case is: a deprecation note in 2.1 followed by a better alternative and migration path users can follow in preparation for 2.2 - keeping in mind that mappings from older versions of elasticsearch need to be upgraded. There are people who used ES as their primary data store - I'm not one of those, but having to rebuilding indexes when new ES versions are released is unfortunate.
  3. chains of copy_to ... should not be possible because they add complexity in code and usage.
    -> by all means prohibit them. I didn't know this was possible in earlier versions as it seems too easy to define cycles.

@clintongormley
Copy link
Contributor

@gpaul If our resources were unlimited, then I would agree with you. However, in an effort to clean up a massive code base and to remove complexity, we have to do it incrementally and sometimes we have to remove things that worked before. The mapping cleanup was 5 long months of work, and there is still a good deal more to be done. It brought some huge improvements (just see how many issues were linked to #8870) but meant that we couldn't support everything that we supported before.

Every hack that we add into the code adds technical debt and increases the likelihood of introducing new bugs. We'd much rather focus our limited resources on making the system clean, stable, reliable, and maintainable.

This is why I don't want to make this change. The workaround for your case is to search across multiple fields.

@gpaul
Copy link

gpaul commented Dec 2, 2015

[woops, I was logged in as a friend of mine when I posted this comment a minute ago. I've removed it and this is a repost as myself ><]

That's fair. Thanks for all the hard work.

As I'll have to redesign my mappings anyway, should I avoid copy_to in its entirety going forward or is it just the multi-field case that was causing pain? I'd like to avoid features that are on their way out.

@rjernst
Copy link
Member

rjernst commented Dec 2, 2015

@gpaul It is just copy_to in a multi field.

@gpaul
Copy link

gpaul commented Dec 2, 2015

Got it, thanks.

brwe added a commit to brwe/elasticsearch that referenced this issue Dec 3, 2015
Copy to within multi field is ignored from 2.0 on, see elastic#10802.
Instead of just ignoring it, we should throw an exception if this
is found in the mapping when a mapping is added. For already
existing indices we should at least log a warning.

related to elastic#14946
brwe added a commit that referenced this issue Dec 8, 2015
throw exception if a copy_to is within a multi field

Copy to within multi field is ignored from 2.0 on, see #10802.
Instead of just ignoring it, we should throw an exception if this
is found in the mapping when a mapping is added. For already
existing indices we should at least log a warning.

related to #14946
brwe added a commit that referenced this issue Dec 8, 2015
throw exception if a copy_to is within a multi field

Copy to within multi field is ignored from 2.0 on, see #10802.
Instead of just ignoring it, we should throw an exception if this
is found in the mapping when a mapping is added. For already
existing indices we should at least log a warning.

related to #14946
@spiderpug
Copy link

Hi, I think I've this problem as well.

Following the documentation (?) here: https://github.com/elastic/elasticsearch-mapper-attachments#copy-to-feature the copyTo on the content should work, but I cannot manage to. Isn't that the correct documentation?

I want to make use of this feature to copy the extracted content into a custom _all field. Any hints how to solve this?

Is there a content extraction service/endpoint I could make use of to index prepared content, so that I don't have to rely on copyTo?

Edit: These docs mention the copyTo feature as well: https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments-copy-to.html

Thanks!

@clintongormley clintongormley added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Plugin Mapper Attachment labels Feb 13, 2018
@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug >regression :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch
Projects
None yet
Development

No branches or pull requests

7 participants