validate does not correctly validate byte offsets to data objects #616

Closed · msbentley opened this issue Mar 29, 2023 · 17 comments · Fixed by NASA-PDS/pds4-jparser#89
Labels: B13.1, bug (Something isn't working), i&t.done, s.high

Comments

@msbentley

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When validating a data product with multiple tables of different types, I first got the error reported in #614 (since the tables were not defined in ascending byte offset order). I then fixed this, but now get a weird error:

ERROR  [error.label.invalid_object_definition]   line 2: Invalid object definition for object #2.  The previously defined object ends at byte 901. Offset defined in label: 595

where byte 901 is the size of the file, defined in the file_size attribute. In fact, if I remove this attribute, validation passes successfully.

🕵️ Expected behavior

I expect the attached product to pass validation

📜 To Reproduce

Attempt to validate the attached product.

🖥 Environment Info

  • Version of this software: validate 3.1.1
  • Operating System: MacOS

📚 Version of Software Used

(base) mark bentley@ESAC01110146 202103 % validate --version
gov.nasa.pds:validate
Version 3.1.1
Release Date: 2023-01-03 23:08:19

🩺 Test Data / Additional context

mre_cal_sc_ttcp_delay_schulte_01s_2021069.zip

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

@jordanpadams
Member

@msbentley added to top of stack. something weird is happening here with the logic

@al-niessner
Contributor

The actual problem is here:

https://github.com/NASA-PDS/pds4-jparser/blob/7d0a01be763d503c19094d90a4398ecbfc9d2556/src/main/java/gov/nasa/pds/label/Label.java#L649-L652

If the file size is not defined, it is left as -1, which means the math later works out. The real problem is that the size of the table is assumed to be the file size. If there were only one table in a file, this would be valid; however, we cannot count on that.
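
A minimal, hypothetical sketch of that failure mode (the names and structure are invented, not copied from the actual pds4-jparser code):

```java
// Illustrative only: fileSize is -1 when the optional file_size attribute is absent.
final class TableEndSketch {
  static long tableEnd(long tableOffset, long tableLength, long fileSize) {
    if (fileSize >= 0) {
      // Buggy assumption: the table runs to the end of the file, which only
      // holds when it is the last (or only) object in the file.
      return fileSize;
    }
    return tableOffset + tableLength; // without file_size the math works out
  }

  public static void main(String[] args) {
    // Values from this issue: file_size = 901, object #2 offset = 595.
    System.out.println(tableEnd(0, 595, 901)); // 901 -> "ends at byte 901" error
    System.out.println(tableEnd(0, 595, -1));  // 595 -> validation passes
  }
}
```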

@jordanpadams
Member

@msbentley I may be wrong, but after looking at the fix proposed by @al-niessner on pds4-jparser, I don't actually think we can accurately calculate the expected size of a delimited table unless the object_length attribute is specified.

Is this an update we need to make to the standards? I feel like you should have to specify the size of a table in bytes so we know how much to read.

@msbentley
Author

Personally I like the simplicity of being able to label e.g. a CSV without having to declare the object length... From my side, validating the number of records and fields should be sufficient...

@jordanpadams
Member

ok. thanks @msbentley

@jordanpadams
Member

See NASA-PDS/pds4-jparser#89 for discussion on solution

@msbentley
Author

OK, great - so validate will now skip any checks on table size for delimited tables, right? Since the records are by definition variable-length, we cannot check the number of bytes in a given table.

@al-niessner
Contributor

@jordanpadams @msbentley

It is more complicated than what is being suggested here. Any file that contains a delimited table would have to be ignored. The validation code looks through all objects in the file and keeps a running offset; if we ignore a delimited table, then all objects after it cannot be checked correctly either. It would be possible to say that once a delimited table has been found, we take the next non-delimited-table offset as truth. However, if a delimited table is allowed to actually be variable in size, then that next offset is a lie, because it would depend on the size of the variant table. If the table is always padded, then one must know the padded, fixed table size a priori. If it is variable row length with a padded max size, as this example is, then that is enough to correctly determine the table size. A truly variant table at the end of the file, or one that is the whole file, makes no difference here because no offset is used after it.
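
For illustration, a rough sketch of the running-offset walk described above (invented names, not the actual validate code):

```java
import java.util.List;

// Illustrative only: each object's label offset must not precede the end of
// the previous object.
final class RunningOffsetSketch {
  record LabeledObject(String name, long offset, long length) {}

  static void check(List<LabeledObject> objects) {
    long previousEnd = 0;
    for (LabeledObject obj : objects) {
      if (obj.offset() < previousEnd) {
        System.out.printf(
            "ERROR: %s starts at byte %d but the previous object ends at byte %d%n",
            obj.name(), obj.offset(), previousEnd);
      }
      // For a variable-size delimited table the length is unknown, so
      // previousEnd, and therefore every later object's check, is unknown too.
      previousEnd = obj.offset() + obj.length();
    }
  }
}
```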

What I am saying is that the previous arguments in this issue rest on the implicit premises that we do not know, and cannot accurately calculate, the table size because it is variable, yet the next offset must be correct and invariant. Since the start of the table is invariant, I am not sure how both of these premises can be true, which makes the arguments unsound.

I am pushing back because making validate not test offsets after a delimited table is going to introduce a host of weird and misleading bugs: validation will say the XML offsets are good, but the file will not match when someone goes to use it. It will also allow knowable overlap that should be easily detectable.

Is there some other reason you want to turn this check off? Please advise soon.

@jordanpadams
Member

@al-niessner I agree it is not straightforward, but unfortunately this is the route we have to take:

It would be possible to say that once a delimited table has been found, we take the next non-delimited-table offset as truth.

We can't keep the current check because it is inaccurate. We have to assume the person creating the data product knows how large the delimited table is and sets the next offset appropriately. If we later try to read the data and that offset is wrong, I agree it is going to throw all kinds of weird bugs.

However, our only options for determining a delimited table length are:

  1. The optional object_length value is specified
  2. Assume the object length runs from the delimited table's offset to the offset of the next table. This may introduce errors if there is undefined padding, but I think this is the best we can do (see the sketch below).
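
A minimal sketch of how option 2 could look (an assumed approach, not the committed fix; the names are invented):

```java
// Illustrative only: objectLength / nextObjectOffset are null when the label
// omits them.
final class DelimitedLengthSketch {
  static long delimitedTableLength(Long objectLength, long tableOffset,
                                   Long nextObjectOffset, long fileSize) {
    if (objectLength != null) {
      return objectLength;                   // option 1: explicit object_length
    }
    if (nextObjectOffset != null) {
      return nextObjectOffset - tableOffset; // option 2: gap to the next object
    }
    return fileSize - tableOffset;           // last object: runs to end of file
  }
}
```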

@jordanpadams
Member

The current PR will throw invalid errors if the records are shorter than the max_record_length.

@al-niessner
Contributor

@jordanpadams

The current change will not throw errors if one or more records are smaller than the max record length. The size is the possible maximum for shifting around in the file, not a requirement that all of those bytes be used to decode the table. It simply says that if you have 10 records and the maximum record length is 50, then you have to reserve 500 bytes in the file for that table, and the next object must come after those 500 bytes.

Yes, if object_length and max_record_length are both given, then max_record_length * number_of_records must be less than or equal to object_length. This accounts for all possible padding. It is also accurate.
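
As a worked sketch of that rule, using the 10-record example above (invented names, illustrative only):

```java
final class ReservationSketch {
  // number_of_records * max_record_length bytes must fit within object_length
  // when both attributes are given in the label.
  static boolean reservationOk(long records, long maxRecordLength, Long objectLength) {
    long reserved = records * maxRecordLength;       // e.g. 10 * 50 = 500 bytes
    return objectLength == null || reserved <= objectLength;
  }

  public static void main(String[] args) {
    System.out.println(reservationOk(10, 50, 500L)); // true: exactly fits
    System.out.println(reservationOk(10, 50, 450L)); // false: table cannot fit
  }
}
```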

If neither is defined and objects come after the delimited table, then an error should occur, and the user can then define either or both. If the table is truly variable in size, then they have to move it to the end of the file; otherwise no one can skip over the table to the following objects anyway. And if skipping ahead is not important, then why define all the offsets and sizes in the first place?

My point is: keep adding file size corrections as new examples come out, rather than ditching the check. There will always be a way to define the block size for the table when objects follow it, and the writer of the XML should add it even if it is normally optional. If no objects come after it, then no harm, no foul.

@jordanpadams
Member

@al-niessner we will go ahead with your code changes as committed, and we will address any bugs if we come across them down the road.

@msbentley
Author

If I read the above correctly, validate will now throw errors for products which are valid according to the standards, right? In my experience, incorrect byte offsets are anyway found by data validation checks (incorrect data types, values outside of limits, etc.), which, whilst not so clear, at least give you a hint without flagging a good product as invalid.

I would be fine with a warning that there may be a problem with the table definition or similar, but we're currently suffering, and having to roll back to older versions of validate, because new checks are introduced that don't seem to reflect the standard and fail validation on what we believe are good products.

@jordanpadams
Member

@msbentley agreed that content validation will often catch any issues with byte offsets, but we are continuously updating the software. the current functionality should be OK as-is, but let us know if you come across an issue and we will address it ASAP.

@jordanpadams
Member

Also, regarding the updates that are not in the standards: those are often unintended consequences of trying to address one enhancement/bug while not considering all of the nuances of the standard. We have made some pretty significant progress in bug fixes and enhancements in the last 6 months, so I imagine most work as expected, but there may be 1 or 2 that caused another issue. 3 steps forward, 1 step back with this level of complexity. But anything that seems to be introducing an unwanted check, let us know and we will address it ASAP.

@miguelp1986

@jordanpadams @al-niessner The above test case is failing. I am getting ERROR [error.label.invalid_object_definition] line 2: Invalid object definition for object #2. The previously defined object ends at byte 901. Offset defined in label: 595

validate -t mre_cal_sc_ttcp_delay_schulte_01s_2021069 

PDS Validate Tool Report

Configuration:
   Version                       3.2.0-SNAPSHOT
   Date                          2023-04-27T18:13:52Z

Parameters:
   Targets                       [file:/Users/MPena/Documents/PDS/validate_test_files/616/mre_cal_sc_ttcp_delay_schulte_01s_2021069/]
   Severity Level                WARNING
   Recurse Directories           true
   File Filters Used             [*.xml, *.XML]
   Data Content Validation       on
   Product Level Validation      on
   Max Errors                    100000
   Registered Contexts File      /usr/local/validate-3.2.0-SNAPSHOT/resources/registered_context_products.json



Product Level Validation Results

  FAIL: file:/Users/MPena/Documents/PDS/validate_test_files/616/mre_cal_sc_ttcp_delay_schulte_01s_2021069/mre_cal_sc_ttcp_delay_schulte_01s_2021069.xml
      ERROR  [error.label.context_ref_not_found]   line 40: 'Context product not found: urn:esa:psa:context:investigation:mission.bc
      ERROR  [error.label.context_ref_not_found]   line 50: 'Context product not found: urn:esa:psa:context:instrument_host:spacecraft.mpo
      ERROR  [error.label.context_ref_not_found]   line 59: 'Context product not found: urn:esa:psa:context:instrument:mpo.more
      ERROR  [error.label.invalid_object_definition]   line 2: Invalid object definition for object #2.  The previously defined object ends at byte 901. Offset defined in label: 595
        1 product validation(s) completed

Summary:

  4 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    3            error.label.context_ref_not_found
    1            error.label.invalid_object_definition

End of Report
Completed execution in 19562 ms

@miguelp1986

I retested with v3.2.0 and I no longer see ERROR [error.label.invalid_object_definition]:

PDS Validate Tool Report

Configuration:
   Version                       3.2.0
   Date                          2023-05-02T18:12:20Z

Parameters:
   Targets                       [file:/Users/MPena/Documents/PDS/validate_test_files/616/mre_cal_sc_ttcp_delay_schulte_01s_2021069/mre_cal_sc_ttcp_delay_schulte_01s_2021069.xml]
   Severity Level                WARNING
   Recurse Directories           true
   File Filters Used             [*.xml, *.XML]
   Data Content Validation       on
   Product Level Validation      on
   Max Errors                    100000
   Registered Contexts File      /usr/local/validate-3.2.0/resources/registered_context_products.json



Product Level Validation Results

  FAIL: file:/Users/MPena/Documents/PDS/validate_test_files/616/mre_cal_sc_ttcp_delay_schulte_01s_2021069/mre_cal_sc_ttcp_delay_schulte_01s_2021069.xml
      ERROR  [error.label.context_ref_not_found]   line 40: 'Context product not found: urn:esa:psa:context:investigation:mission.bc
      ERROR  [error.label.context_ref_not_found]   line 50: 'Context product not found: urn:esa:psa:context:instrument_host:spacecraft.mpo
      ERROR  [error.label.context_ref_not_found]   line 59: 'Context product not found: urn:esa:psa:context:instrument:mpo.more
        1 product validation(s) completed

Summary:

  3 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    3            error.label.context_ref_not_found

End of Report
Completed execution in 20184 ms
