Change JSONLoader content_key behavior #8075

Closed
wants to merge 7 commits into from

@kzk-maeda kzk-maeda commented Jul 21, 2023

Description

This PR changes the value accepted by content_key in JSONLoader from a single key to a jq schema expression.

Why

For JSON data like the following, page_content should come from .data[].attributes.message, while metadata should come from fields such as .data[].id or .data[].attributes.tags. To support this, content_key must also be able to parse the JSON structure.

sample json data
{
  "data": [
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "source": "stderr",
          "service": "worker",
          "name": "app.services.predict_services",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407827.0
        },
        "message": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.827000+09:00"
      },
      "id": "AgAAAYi17UzTUeIczgAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRAAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    },
    {
      "attributes": {
        "attributes": {
          "dd": {
            "service": "worker",
            "env": "production",
            "version": "6b2c46a5883a9097aa2cb09907786b5c06ca3bd0"
          },
          "process": 42.0,
          "messages": "error while processing user #19084: 'numpy.dtype[bool_]' object is not callable",
          "levelname": "ERROR",
          "container_id": "f9f880b243cc41f1a7d9bae5bf922d60-1262558729",
          "timestamp": 1686679407831.0
        },
        "message": "{\"messages\": \"error while processing user #19084: 'numpy.dtype[bool_]' object is not callable\"}",
        "service": "worker",
        "status": "error",
        "tags": [
          "datadog.submission_auth:private_api_key",
          "env:production"
        ],
        "timestamp": "2023-06-14T03:03:27.831000+09:00"
      },
      "id": "AgAAAYi17UzXUeIczwAAAAAAAAAYAAAAAEFZaTE3VThXQUFEOF9sS2J4Z3psRmdBRQAAACQAAAAAMDE4OGI2MzYtMmZlNC00ZDEwLThjZDMtMzhkZTI0NmUyNWMz",
      "type": "log"
    }
  ],
  "meta": {
    "elapsed": 26,
    "request_id": "pddv1ChY0SmttaDA0a1REeXZRM01yNkFwYnd3Ii0KHWIANr8mghGpsMIX2cOarI6t4WyTVObXx3wrAuudEgzSbtmduLtPxFVkSo0",
    "status": "done"
  }
}
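To make the paths concrete, here is a minimal sketch of how the jq-style paths discussed here map onto plain Python indexing. It uses a trimmed stand-in for the sample (the ids and messages are shortened placeholders, not the real values) and only the standard library, so it says nothing about how JSONLoader itself evaluates jq.

```python
import json

# Trimmed stand-in for the sample data; values are shortened placeholders.
sample = {
    "data": [
        {
            "attributes": {
                "message": "error while processing user #19084",
                "tags": ["env:production"],
            },
            "id": "record-1",
            "type": "log",
        },
        {
            "attributes": {
                # As in the sample, this message is itself a JSON-encoded string.
                "message": json.dumps(
                    {"messages": "error while processing user #19084"}
                ),
                "tags": ["env:production"],
            },
            "id": "record-2",
            "type": "log",
        },
    ]
}

# Python equivalent of the jq path `.data[].attributes.message`
messages = [record["attributes"]["message"] for record in sample["data"]]

# Python equivalents of `.data[].id` and `.data[].attributes.tags`
ids = [record["id"] for record in sample["data"]]
tags = [record["attributes"]["tags"] for record in sample["data"]]

# The second message needs a second decoding step to reach the inner text.
inner = json.loads(messages[1])["messages"]
```

Note that the second record's message is a JSON string nested inside JSON, so a path expression alone returns the encoded string rather than the inner text.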
sample code
from langchain.document_loaders import JSONLoader

def metadata_func(record: dict, metadata: dict) -> dict:
    print(record)

    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")

    return metadata

sample_file = "sample.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema='.data[].attributes.message',
    content_key="data",
    metadata_func=metadata_func
)
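
For illustration only, the sketch below shows one way a record, a nested content path, and metadata_func could fit together. It assumes each record is one element of .data[] and that the content path is .attributes.message; the helpers resolve and build_documents are hypothetical, handle only plain dotted keys rather than real jq syntax, and do not reproduce JSONLoader's implementation.

```python
from typing import Any, Callable, Dict, List


def resolve(record: Any, path: str) -> Any:
    """Follow a dotted path such as '.attributes.message' through nested dicts.

    Hypothetical helper: supports plain keys only, not full jq syntax.
    """
    current = record
    for key in filter(None, path.split(".")):
        current = current[key]
    return current


def build_documents(
    records: List[dict],
    content_path: str,
    metadata_func: Callable[[dict, dict], dict],
) -> List[Dict[str, Any]]:
    """Pair each record's content with the metadata derived from that record."""
    docs = []
    for record in records:
        docs.append(
            {
                "page_content": str(resolve(record, content_path)),
                "metadata": metadata_func(record, {}),
            }
        )
    return docs


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")
    return metadata


# Placeholder records shaped like the elements of `.data[]` in the sample.
records = [
    {"id": "record-1", "attributes": {"message": "first error", "tags": ["env:production"]}},
    {"id": "record-2", "attributes": {"message": "second error", "tags": ["env:production"]}},
]

docs = build_documents(records, ".attributes.message", metadata_func)
```

The point of the sketch is that metadata_func needs the whole record, while page_content needs a path inside it, which is why a single top-level key is not enough for this data.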

Dependencies

none

Tag maintainer

@rlancemartin, @eyurtsev

Twitter handle

kzk_maeda


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jul 21, 2023
@kzk-maeda kzk-maeda marked this pull request as ready for review July 23, 2023 13:16
@kzk-maeda kzk-maeda changed the title Fix json loader Change JSONLoader content_key behavior Jul 23, 2023
@baskaryan
Collaborator

Is there any way to make this backward compatible?

@kzk-maeda
Contributor Author

I think one way to make it backward compatible would be to determine whether content_key matches the jq schema syntax and to branch at the point where the data is retrieved.

However, even if this method preserves backward compatibility for now, I think it would not avoid the specification becoming more complicated in the future, so I have submitted this pull request as a non-backward-compatible change.
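
As a rough sketch of the branching approach described in this comment, assuming a leading "." is what marks a jq-style path: the function names looks_like_jq_path and extract_content are hypothetical, the path walker handles only plain dotted keys, and none of this is taken from the PR's code.

```python
from typing import Any


def looks_like_jq_path(content_key: str) -> bool:
    """Hypothetical heuristic: treat keys starting with '.' as jq-style paths."""
    return content_key.startswith(".")


def extract_content(record: dict, content_key: str) -> Any:
    """Branch at retrieval time between the legacy and the path-based lookup."""
    if looks_like_jq_path(content_key):
        # Path-based lookup: walk each dotted segment through nested dicts.
        current: Any = record
        for key in filter(None, content_key.split(".")):
            current = current[key]
        return current
    # Legacy lookup: content_key names a single top-level key.
    return record[content_key]


record = {"message": "flat", "attributes": {"message": "nested"}}

legacy = extract_content(record, "message")
nested = extract_content(record, ".attributes.message")
```

The heuristic also shows the cost raised in the comment: a record whose key genuinely starts with "." would be misread as a path, so the two modes cannot be separated cleanly by inspecting the string alone.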

@baskaryan baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Aug 10, 2023
@leo-gan
Collaborator

leo-gan commented Sep 19, 2023

@kzk-maeda Hi, could you please resolve the merge issues? After that, ping me and I will push this PR for review. Thanks!

4 participants