RFC: Specify how header-level data are returned in ABCD #7

Closed
pombredanne opened this issue Oct 4, 2017 · 14 comments

@pombredanne
Member

pombredanne commented Oct 4, 2017

Today the header-level data in ScanCode output is a tad ad hoc. For instance:

{
  "scancode_notice": "Generated with ScanCode...",
  "scancode_version": "2.1.0.post55.ff2948e",
  "scancode_options": {
    "--info": true,
    "--format": "json-pp"
  },
  "files_count": 1,

  ..files, etc ... regular ABC Data.....
}

We should normalize this and let multiple tools each record a log entry showing that they touched the data.
Here is what I suggest: store these in a top-level "header" attribute. This attribute would contain a list, and each list item would be an object.
With this in mind, the new ScanCode output would look like this:

{ 
  "header" : [
    { 
      "tool": "scancode-toolkit",
      "tool_version": "2.1.0.post55.ff2948e",
      "date": "2017-09-12T12:23:12",

      "scancode_notice": "Generated with ScanCode...",
      "scancode_options": {
        "--info": true,
        "--format": "json-pp"
      },
      "files_count": 1,
      [.... any other attributes that a tool may want to add, such as a scanned path, etc] ,
    }
  ]
  ..files, etc ... regular ABC Data.....
}
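
As a rough sketch of how the current flat output could be mapped to this layout (not ScanCode's actual implementation), the function below moves the legacy top-level keys into a single "header" entry. The function name `move_legacy_header` and the `LEGACY_HEADER_KEYS` tuple are hypothetical; the key names come from the examples above.

```python
from datetime import datetime, timezone

# Legacy top-level keys, taken from the "today" example above.
LEGACY_HEADER_KEYS = (
    "scancode_notice",
    "scancode_version",
    "scancode_options",
    "files_count",
)


def move_legacy_header(scan, date=None):
    """Return a new dict where legacy top-level keys live in a "header" list.

    Hypothetical helper: `scan` is the parsed legacy JSON output and `date`
    is an optional timestamp string; the current UTC time is used if omitted.
    """
    entry = {
        "tool": "scancode-toolkit",
        "tool_version": scan.get("scancode_version"),
        "date": date or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
    }
    # Carry over the remaining tool-specific attributes unchanged.
    for key in LEGACY_HEADER_KEYS:
        if key != "scancode_version" and key in scan:
            entry[key] = scan[key]

    result = {"header": [entry]}
    # Everything else (files, etc.) stays as regular data.
    result.update({k: v for k, v in scan.items() if k not in LEGACY_HEADER_KEYS})
    return result
```

The input dict is left untouched, so the old and new layouts can be compared side by side during a transition.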

And with several tools having "touched" the data:

{ 
  "header" : [
    { 
      "tool": "scancode-toolkit",
      "tool_version": "2.1.0.post55.ff2948e",
      "date": "2017-09-12T12:23:12",

      "scancode_notice": "Generated with ScanCode...",
      "scancode_options": {
        "--info": true,
        "--format": "json-pp"
      },
      "files_count": 1,
      [.... any other attributes that a tool may want to add, such as a scanned path, etc] ,
    },
    { 
      "tool": "aboutcode-manager",
      "tool_version": "3.1.0",
      "date": "2017-09-13T15:23:12",
      [.... any other attributes that a tool may want to add, such as a scanned path, etc] ,
    },
    { 
      "tool": "vulnerablecode",
      "tool_version": "0.1.0",
      "date": "2017-09-13T16:23:12",
      [.... any other attributes that a tool may want to add, such as a scanned path, etc] ,
    }
  ]
  ..files, component, packages etc ... e.g. regular ABC Data.....
}

In this context these would be the only fields that are expected in a header item:

      "tool": "aboutcode-manager",
      "tool_version": "3.1.0",
      "date": "2017-09-13T15:23:12",

and the convention would be that each tool exporting ABCD data would:

  1. preserve the previous header
  2. add one "log" entry to the header

The benefits of all this are:

  1. clear header data, no longer mixed with other regular code-related data
  2. minimal trail/log of which tools touched and possibly transformed the data, which is useful for tracing and documentation
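
The convention above (preserve the previous header, add one entry) could be sketched as follows. This is only an illustration under the field names proposed in this issue; `add_header_entry` is a hypothetical helper, not an existing AboutCode API.

```python
import copy


def add_header_entry(data, tool, tool_version, date, **extra):
    """Return a copy of `data` with one new entry appended to its "header".

    Existing header entries are preserved in their original order. Any
    keyword arguments in `extra` are stored as tool-specific attributes.
    """
    result = copy.deepcopy(data)
    # Create the header list if no tool has written one yet.
    header = result.setdefault("header", [])
    header.append(
        {"tool": tool, "tool_version": tool_version, "date": date, **extra}
    )
    return result


scan = {
    "header": [
        {
            "tool": "scancode-toolkit",
            "tool_version": "2.1.0.post55.ff2948e",
            "date": "2017-09-12T12:23:12",
        }
    ],
    "files": [],
}
updated = add_header_entry(scan, "vulnerablecode", "0.1.0", "2017-09-13T16:23:12")
```

Working on a deep copy keeps the input unchanged, which makes it easy to check that a tool only appended to the trail and did not rewrite earlier entries.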
@pombredanne
Member Author

@mnonnenmacher @tdruez @jdaguil @mjherzog @DennisClark @sschuberth .... feedback welcome!
I would like to move ahead quickly on this!

@mjherzog
Member

mjherzog commented Oct 4, 2017

Should it just be tool_notice and tool_options under each tool rather than scancode_notice and scancode_options? ScanCode is primus inter pares, but I guess it might not always be the first AboutCode tool to create a file.

@pombredanne
Member Author

pombredanne commented Oct 5, 2017

@mjherzog good points.

So with this, these would be the fields that are expected in a header item:

      "tool": "aboutcode-manager",
      "tool_version": "3.1.0",
      "date": "2017-09-13T15:23:12",

with these extra, optional fields:

      "tool_notice": "Generated with ....",
      "tool_options": [a list of tool option strings],
      "date": "2017-09-13T15:23:12",

@jdaguil

jdaguil commented Oct 5, 2017

@pombredanne LGTM

@mjherzog
Member

mjherzog commented Oct 6, 2017

@pombredanne Presumably ScanCode Plugins would be among the tools writing to the header items (?)

@pombredanne
Member Author

@mjherzog sorry for the late reply: yes, plugins would contribute header items.
For an early example of such a header structure, see the JSON Lines ScanCode output: https://github.com/nexB/scancode-toolkit/blob/6d07756efcef9b6014f553dd1f084a16cfc1f474/src/formattedcode/format_jsonlines.py#L45

@JonoYang
Member

JonoYang commented Aug 3, 2018

@pombredanne 👍 The proposed header attribute looks good to me. I think it would be beneficial to have a small summary in the headers to note what each tool did to the data, though it could get unruly quickly.

@pombredanne
Member Author

so here is what I am working to include in ScanCode v3:

  "history_log": [
    {
      "tool": "scancode-toolkit",
      "tool_version": "2.9.7.post135.90333b4",
      "options": {
        "input": "samples/",
        "--classify": true,
        "--copyright": true,
        "--email": true,
        "--facet": [
          "dev=FOOBAR"
        ],
        "--generated": true,
        "--info": true,
        "--json-pp": "scan.json",
        "--license": true,
        "--license-diag": true,
        "--license-text": true,
        "--package": true,
        "--processes": "3",
        "--summary-with-details": true,
        "--url": true,
        "--verbose": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2018-11-14T134215.214510",
      "end_timestamp": "2018-11-14T134224.595734",
      "message": null,
      "errors": [],
      "extra_data": {}
    }
  ]
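
For consumers of such a log, here is a minimal sketch that reads the timing and error fields of an entry. The timestamp format string is an assumption inferred from the sample values above (no colons in the time part), and both function names are hypothetical.

```python
from datetime import datetime

# Assumption: format inferred from sample values such as
# "2018-11-14T134215.214510" in the example above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H%M%S.%f"


def entry_duration(entry):
    """Return the elapsed time of one history_log entry as a timedelta."""
    start = datetime.strptime(entry["start_timestamp"], TIMESTAMP_FORMAT)
    end = datetime.strptime(entry["end_timestamp"], TIMESTAMP_FORMAT)
    return end - start


def failed_entries(history_log):
    """Return the entries that recorded at least one error."""
    return [entry for entry in history_log if entry.get("errors")]


entry = {
    "tool": "scancode-toolkit",
    "start_timestamp": "2018-11-14T134215.214510",
    "end_timestamp": "2018-11-14T134224.595734",
    "errors": [],
}
duration = entry_duration(entry)
```

If a tool writes timestamps in another format, only `TIMESTAMP_FORMAT` would need to change.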

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this issue Nov 14, 2018
* This is a new data structure as designed in
  aboutcode-org/aboutcode#7
* For now, the old header-level data have been kept

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member Author

As @sschuberth pointed out, the newly proposed history_log top-level attribute name is a poor choice.
The original name here was header, and @sschuberth suggests info.
Let's find a better name.

@sschuberth

What I don't like about header is that it basically says nothing about what its contents are, but only that it appears "somewhere at the top". While info is a bit vague, it at least tells you that some (generic) information is contained.

Depending on what kind of data is to be added, I could also imagine summary or stats (for "statistics") as possible names, but I still like a simple info best.

@sschuberth

sschuberth commented Nov 15, 2018

One more thought: Now that we have tool_version, should we rename tool to tool_name? The idea is that tool_name and tool_version could be combined (internally / on demand) into a "tool" field that contains both the name and the version, like the SPDX "Creator: Tool:" field does. But I would not additionally serialize such a combined tool field.
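
To illustrate the idea, here is a small sketch in which the combined value is computed on demand and never serialized. The `ToolInfo` class is hypothetical, and joining name and version with a hyphen is an assumption modeled on how SPDX tool identifiers are commonly written.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolInfo:
    """Hypothetical header entry holding only the two serialized fields."""

    tool_name: str
    tool_version: str

    @property
    def tool(self):
        # Derived on demand; properties are not dataclass fields, so
        # asdict() leaves this value out of the serialized form.
        return f"{self.tool_name}-{self.tool_version}"


info = ToolInfo(tool_name="scancode-toolkit", tool_version="2.9.7")
serialized = asdict(info)
```

Here `serialized` contains only `tool_name` and `tool_version`, while `info.tool` is available to code that wants a single label.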

@pombredanne
Member Author

tool_name is indeed better

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this issue Nov 27, 2018
* This is a new data structure as designed in
  aboutcode-org/aboutcode#7
* For now, the old header-level data have been kept

Signed-off-by: Philippe Ombredanne <[email protected]>
pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this issue Nov 27, 2018
@mjherzog
Member

Closing since this is covered by aboutcode-org/scancode-toolkit#211
