validate does not correctly validate byte offsets to data objects #616
Comments
@msbentley Added to top of stack. Something weird is happening here with the logic.
The actual problem is here: if the file size is not defined, it is left as -1, which means the math later works out. The real problem is that the size of the table is assumed to be the file size. If there were only one table in a file, this would be valid; however, we cannot count on that.
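As a loose illustration of the flaw described in the comment above — hypothetical names and logic only, not the actual validate source — the failure mode amounts to treating the whole file as the table:

```java
public class FileSizeAssumptionSketch {

    // Returns true when a table assumed to span assumedTableSize bytes
    // fits between its starting offset and the declared file size.
    static boolean tableFitsInFile(long tableStart, long assumedTableSize, long declaredFileSize) {
        if (declaredFileSize < 0) {
            // file_size missing: the sentinel -1 effectively disables the check.
            return true;
        }
        return tableStart + assumedTableSize <= declaredFileSize;
    }

    public static void main(String[] args) {
        long fileSize = 901;          // file_size from the report further down this issue
        long tableSize = fileSize;    // the faulty assumption: table size == file size
        System.out.println(tableFitsInFile(0, tableSize, fileSize));   // true: fine if the file holds a single table at offset 0
        System.out.println(tableFitsInFile(300, tableSize, fileSize)); // false: a second object at offset 300 trips the check
    }
}
```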
@msbentley I may be wrong, but after looking at the fix proposed by @al-niessner on pds4-jparser, I don't actually think we can accurately calculate the expected size of a delimited table unless the size is explicitly specified. Is this an update we need to make to the standards? I feel like you should have to specify the size of a table in bytes so we know how much to read.
Personally, I like the simplicity of being able to label e.g. a CSV without having to declare the object length. From my side, validating the number of records and fields should be sufficient.
OK, thanks @msbentley
See NASA-PDS/pds4-jparser#89 for discussion of the solution.
OK, great. So validate will now skip any checks on table size for delimited tables, right? Since the records are by definition variable length, we cannot check the number of bytes in a given table.
It is more complicated than what is being suggested here. Any file that contained any delimited table would have to be ignored. The validation code looks through all objects in the file and keeps a running offset. If we ignore a delimited table, then all objects after it cannot be computed correctly either. It would be possible to say that once a delimited table has been found, the next non-delimited-table offset is taken as truth. However, if a delimited table is allowed to actually be variable in size, then that next offset is a lie because it would depend on the size of the variant table. If the table is always padded, then one must know the padded, fixed table size a priori. If it is variable row length with a padded max size, as this example is, then that is enough to correctly determine the table size. A truly variant table at the end of the file, or one that is the whole file, makes no difference here because no offset is used after it.

What I am saying is, the previous arguments in this issue rest on the implicit premises that we do not know, nor can we accurately calculate, the table size because it is variable, but that the next offset must be correct and invariant. Since the start of the table is invariant, I am not sure how both of these premises can be true, which makes the arguments unsound.

I am pushing back because not testing offsets after a delimited table is going to introduce a host of weird and misleading bugs: validation will say the XML offsets are good, but the file will not match when someone goes to use it. It will also allow for knowable overlap that should be easily detectable. Is there some other reason you want to turn this check off? Please advise soon.
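For reference, here is a minimal sketch of the running-offset walk described above — simplified Java with made-up names (DataObject, checkOffsets), not the actual validate implementation:

```java
import java.util.List;

public class OffsetWalkSketch {

    // Hypothetical holder for the values a label declares for one data object.
    record DataObject(String name, long offset, long sizeInBytes) {}

    // Walk the objects in label order, keeping a running offset; each object
    // must start at or after the end of the previous one.
    static void checkOffsets(List<DataObject> objects, long fileSize) {
        long runningOffset = 0;
        for (DataObject obj : objects) {
            if (obj.offset() < runningOffset) {
                System.out.printf("ERROR: %s starts at %d but the previous object ends at %d%n",
                        obj.name(), obj.offset(), runningOffset);
            }
            // If sizeInBytes cannot be computed (e.g. an unconstrained delimited
            // table), every offset after this point becomes unverifiable.
            runningOffset = obj.offset() + obj.sizeInBytes();
        }
        if (fileSize >= 0 && runningOffset > fileSize) {
            System.out.printf("ERROR: objects extend to byte %d, beyond the declared file size %d%n",
                    runningOffset, fileSize);
        }
    }
}
```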
@al-niessner I agree it is not straightforward, but unfortunately this is the route we have to take:
We can't keep the current check because it is inaccurate. We have to assume the person creating the data product knows how large the delimited table is and sets the next offset appropriately. If we later try to read the data and that offset is wrong, I agree it is going to throw all kinds of weird bugs. However, our only option for determining a delimited table length is by:
The current PR will throw invalid errors if the records are shorter than the max_record_length.
The current change will not throw errors if one or more records are smaller than the max record length. The size is the possible maximum, for moving around in the file, not a requirement that all of those bytes be used to decode the table. It simply says that if you have 10 records and the maximum record length is 50, then you have to reserve 500 bytes in the file for that table and the next object must come after those 500 bytes. Yes, if object_length and max_record_length are given, then max_record_length * number_of_records must be less than or equal to object_length. This accounts for all possible padding. It is also accurate.

If neither is defined and objects come after the delimited table, then an error should occur, and the user can then define either or both. If the table is truly variable in size, then it has to be moved to the end of the file; otherwise no one can skip over the table to the following objects anyway. If skipping ahead is not important, then why define all the offsets and sizes in the first place?

My point is, keep adding file size corrections as new examples come out rather than ditching the check. There will always be a way to define the block size for the table when objects follow it, and the writer of the XML should add it even if it is normally optional. If no objects come after it, then no harm, no foul.
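To make the bound in the comment above concrete, here is a hedged sketch (illustrative names only, not validate's real API): with records and maximum_record_length taken from the label, the table can occupy at most records * maximum_record_length bytes, so any declared object_length must be at least that large.

```java
public class DelimitedTableSizeSketch {

    // True when the declared object_length can hold the largest possible table:
    // records * maximumRecordLength bytes, with any remainder treated as padding.
    static boolean sizeIsConsistent(long records, long maximumRecordLength, long objectLength) {
        long maxTableSize = records * maximumRecordLength; // e.g. 10 records * 50 bytes = 500 bytes reserved
        return maxTableSize <= objectLength;
    }

    public static void main(String[] args) {
        System.out.println(sizeIsConsistent(10, 50, 500)); // true: the table fits exactly
        System.out.println(sizeIsConsistent(10, 50, 400)); // false: object_length is too small
    }
}
```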
@al-niessner We will go ahead with your code changes as committed and revisit if we come across a bug down the road.
If I read the above correctly, validate will now throw errors for products which are valid according to the standards, right? In my experience, incorrect byte offsets are anyway caught by data validation checks (incorrect data types, values outside of limits, etc.), which, whilst not as clear, at least give you a hint while not flagging a good product as invalid. I would be fine with a warning that there may be a problem with the table definition or similar, but we're currently suffering and having to roll back to older versions of validate because new checks are introduced that don't seem to reflect the standard and fail validation on what we believe are good products.
@msbentley Agreed that content validation will often catch any issues with byte offsets, but we are continuously updating the software. The current functionality should be OK as-is, but let us know if you come across an issue and we will address it ASAP.
Also, regarding the updates that are not in the standards: those are often unintended consequences of trying to address one enhancement/bug while not considering all of the nuances of the standard. We have made some pretty significant progress in bug fixes and enhancements in the last 6 months, so I imagine most work as expected, but there may be one or two that caused another issue. Three steps forward, one step back with this level of complexity. But anything that seems to be introducing an unwanted check, let us know and we will address it ASAP.
@jordanpadams @al-niessner The above test case is failing. I am getting:
I retested with v3.2.0 and I no longer see this error.
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When validating a data product with multiple tables of different types, I first got the error reported in #614 (since the tables were not defined in ascending byte offset order). I then fixed this, but now get a weird error:
where byte 901 is the size of the file, defined in the file_size attribute. In fact, if I remove this attribute, validation passes successfully.
🕵️ Expected behavior
I expect the attached product to pass validation
📜 To Reproduce
Attempt to validate the attached product.
🖥 Environment Info
📚 Version of Software Used
(base) mark bentley@ESAC01110146 202103 % validate --version
gov.nasa.pds:validate
Version 3.1.1
Release Date: 2023-01-03 23:08:19
🩺 Test Data / Additional context
mre_cal_sc_ttcp_delay_schulte_01s_2021069.zip
🦄 Related requirements
🦄 #xyz
⚙️ Engineering Details
No response