validate does not correctly validate byte offsets to data objects #616
Comments
@msbentley Added to top of stack. Something weird is happening here with the logic.
The actual problem is here: if the file size is not defined, it is left as -1, which means the math later works out. The real problem is that the size of the table is assumed to be the file size. If there were only one table in a file, this would be valid; however, we cannot count on that.
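As a loose illustration of the flaw described in the comment above — hypothetical names and logic only, not the actual validate source — the failure mode amounts to treating the whole file as the table:

```java
public class FileSizeAssumptionSketch {

    // Returns true when a table assumed to span assumedTableSize bytes
    // fits between its starting offset and the declared file size.
    static boolean tableFitsInFile(long tableStart, long assumedTableSize, long declaredFileSize) {
        if (declaredFileSize < 0) {
            // file_size missing: the sentinel -1 effectively disables the check.
            return true;
        }
        return tableStart + assumedTableSize <= declaredFileSize;
    }

    public static void main(String[] args) {
        long fileSize = 901;          // file_size from the report further down this issue
        long tableSize = fileSize;    // the faulty assumption: table size == file size
        System.out.println(tableFitsInFile(0, tableSize, fileSize));   // true: fine if the file holds a single table at offset 0
        System.out.println(tableFitsInFile(300, tableSize, fileSize)); // false: a second object at offset 300 trips the check
    }
}
```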
@msbentley I may be wrong, but after looking at the fix proposed by @al-niessner on pds4-jparser, I don't actually think we can accurately calculate the expected size of a delimited table unless the size is explicitly specified. Is this an update we need to make to the standards? I feel like you should have to specify the size of a table in bytes so we know how much to read.
Personally, I like the simplicity of being able to label e.g. a CSV without having to declare the object length. From my side, validating the number of records and fields should be sufficient.
OK, thanks @msbentley
See NASA-PDS/pds4-jparser#89 for discussion of the solution.
OK, great. So validate will now skip any checks on table size for delimited tables, right? Since the records are by definition variable length, we cannot check the number of bytes in a given table.
It is more complicated than what is being suggested here. Any file that contained any delimited table would have to be ignored. The validation code looks through all objects in the file and keeps a running offset. If we ignore a delimited table, then all objects after it cannot be computed correctly either. It would be possible to say that once a delimited table has been found, the next non-delimited-table offset is taken as truth. However, if a delimited table is allowed to actually be variable in size, then that next offset is a lie because it would depend on the size of the variant table. If the table is always padded, then one must know the padded, fixed table size a priori. If it is variable row length with a padded max size, as this example is, then that is enough to correctly determine the table size. A truly variant table at the end of the file, or one that is the whole file, makes no difference here because no offset is used after it.

What I am saying is, the previous arguments in this issue rest on the implicit premises that we do not know, nor can we accurately calculate, the table size because it is variable, but that the next offset must be correct and invariant. Since the start of the table is invariant, I am not sure how both of these premises can be true, which makes the arguments unsound.

I am pushing back because not testing offsets after a delimited table is going to introduce a host of weird and misleading bugs: validation will say the XML offsets are good, but the file will not match when someone goes to use it. It will also allow for knowable overlap that should be easily detectable. Is there some other reason you want to turn this check off? Please advise soon.
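For reference, here is a minimal sketch of the running-offset walk described above — simplified Java with made-up names (DataObject, checkOffsets), not the actual validate implementation:

```java
import java.util.List;

public class OffsetWalkSketch {

    // Hypothetical holder for the values a label declares for one data object.
    record DataObject(String name, long offset, long sizeInBytes) {}

    // Walk the objects in label order, keeping a running offset; each object
    // must start at or after the end of the previous one.
    static void checkOffsets(List<DataObject> objects, long fileSize) {
        long runningOffset = 0;
        for (DataObject obj : objects) {
            if (obj.offset() < runningOffset) {
                System.out.printf("ERROR: %s starts at %d but the previous object ends at %d%n",
                        obj.name(), obj.offset(), runningOffset);
            }
            // If sizeInBytes cannot be computed (e.g. an unconstrained delimited
            // table), every offset after this point becomes unverifiable.
            runningOffset = obj.offset() + obj.sizeInBytes();
        }
        if (fileSize >= 0 && runningOffset > fileSize) {
            System.out.printf("ERROR: objects extend to byte %d, beyond the declared file size %d%n",
                    runningOffset, fileSize);
        }
    }
}
```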
@al-niessner I agree it is not straightforward, but unfortunately this is the route we have to take:
We can't keep the current check because it is inaccurate. We have to assume the person creating the data product knows how large the delimited table is and sets the next offset appropriately. If we later try to read the data and that offset is wrong, I agree it is going to throw all kinds of weird bugs. However, our only option for determining a delimited table length is by:
The current PR will throw invalid errors if the records are shorter than the max_record_length.
The current change will not throw errors if one or more records are smaller than the max record length. The size is the possible maximum, for moving around in the file, not a requirement that all of those bytes be used to decode the table. It simply says that if you have 10 records and the maximum record length is 50, then you have to reserve 500 bytes in the file for that table and the next object must come after those 500 bytes. Yes, if object_length and max_record_length are given, then max_record_length * number_of_records must be less than or equal to object_length. This accounts for all possible padding. It is also accurate.

If neither is defined and objects come after the delimited table, then an error should occur, and the user can then define either or both. If the table is truly variable in size, then it has to be moved to the end of the file; otherwise no one can skip over the table to the following objects anyway. If skipping ahead is not important, then why define all the offsets and sizes in the first place?

My point is, keep adding file size corrections as new examples come out rather than ditching the check. There will always be a way to define the block size for the table when objects follow it, and the writer of the XML should add it even if it is normally optional. If no objects come after it, then no harm, no foul.
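To make the bound in the comment above concrete, here is a hedged sketch (illustrative names only, not validate's real API): with records and maximum_record_length taken from the label, the table can occupy at most records * maximum_record_length bytes, so any declared object_length must be at least that large.

```java
public class DelimitedTableSizeSketch {

    // True when the declared object_length can hold the largest possible table:
    // records * maximumRecordLength bytes, with any remainder treated as padding.
    static boolean sizeIsConsistent(long records, long maximumRecordLength, long objectLength) {
        long maxTableSize = records * maximumRecordLength; // e.g. 10 records * 50 bytes = 500 bytes reserved
        return maxTableSize <= objectLength;
    }

    public static void main(String[] args) {
        System.out.println(sizeIsConsistent(10, 50, 500)); // true: the table fits exactly
        System.out.println(sizeIsConsistent(10, 50, 400)); // false: object_length is too small
    }
}
```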
@al-niessner We will go ahead with your code changes as committed and revisit if we come across a bug down the road.
If I read the above correctly, validate will now throw errors for products which are valid according to the standards, right? In my experience, incorrect byte offsets are anyway caught by data validation checks (incorrect data types, values outside of limits, etc.), which, whilst not as clear, at least give you a hint while not flagging a good product as invalid. I would be fine with a warning that there may be a problem with the table definition or similar, but we're currently suffering and having to roll back to older versions of validate because new checks are introduced that don't seem to reflect the standard and fail validation on what we believe are good products.
@msbentley Agreed that content validation will often catch any issues with byte offsets, but we are continuously updating the software. The current functionality should be OK as-is, but let us know if you come across an issue and we will address it ASAP.
Also, regarding the updates that are not in the standards: those are often unintended consequences of trying to address one enhancement/bug while not considering all of the nuances of the standard. We have made some pretty significant progress in bug fixes and enhancements in the last 6 months, so I imagine most work as expected, but there may be one or two that caused another issue. Three steps forward, one step back with this level of complexity. But anything that seems to be introducing an unwanted check, let us know and we will address it ASAP.
@jordanpadams @al-niessner The above test case is failing. I am getting:
I retested with v3.2.0 and I no longer see this error.
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When validating a data product with multiple tables of different types, I first got the error reported in #614 (since the tables were not defined in ascending byte offset order). I then fixed this, but now get a weird error:
where byte 901 is the size of the file, defined in the file_size attribute. In fact, if I remove this attribute, validation passes successfully.
🕵️ Expected behavior
I expect the attached product to pass validation
📜 To Reproduce
Attempt to validate the attached product.
🖥 Environment Info
📚 Version of Software Used
(base) mark bentley@ESAC01110146 202103 % validate --version
gov.nasa.pds:validate
Version 3.1.1
Release Date: 2023-01-03 23:08:19
🩺 Test Data / Additional context
mre_cal_sc_ttcp_delay_schulte_01s_2021069.zip
🦄 Related requirements
🦄 #xyz
⚙️ Engineering Details
No response