Multi part reverse lookup #59

Merged: 7 commits, Apr 18, 2024

NebularNerd
Contributor

Closes #57

While digging through files for PR #58, I discovered that we could do with a footer-style reverse lookup on the multi-part match to help boost confidence scores (especially if the footer is small as well). This PR modifies the multi-part match to allow reverse lookups.

To behave like the normal forward match, we aggregate matched.byte_match and magic_row.byte_match to provide a longer match to the _confidence function. One downside of this is that the match data will report the two matches smooshed together, e.g.:

  • .mlt is 0x3c6d6c74 / <mlt (0.4) and 0x3c2f6d6c743e0a / </mlt>\n (0.7), for a combined confidence of 0.8, but you'll see <mlt</mlt>\n in the data, which looks untidy (could there be a better way to address this?)

Also, some unexpected results:

  • .pt2 cannot score above 0.8 when combined, I imagine it's because both are long matches which score 0.8 individually
  • .ct gets 0.4 for CREM and 0.8 for DONE but the combined will not go above 0.8
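
The scores quoted above are consistent with a purely length-based rule: one tenth per matched byte, capped at 0.8 from nine bytes upward. The sketch below assumes that rule (the name `length_confidence` is hypothetical and is not the project's `_confidence` function) to illustrate why concatenating two matches would never exceed 0.8.

```python
def length_confidence(byte_match: bytes) -> float:
    """Hypothetical length-based score: 0.1 per matched byte, capped at 0.8."""
    # Assumption: nine or more matched bytes all receive the same 0.8 ceiling.
    if len(byte_match) >= 9:
        return 0.8
    return len(byte_match) / 10


header = b"<mlt"      # 0x3c6d6c74, 4 bytes
footer = b"</mlt>\n"  # 0x3c2f6d6c743e0a, 7 bytes

print(length_confidence(header))           # score for the 4-byte header
print(length_confidence(footer))           # score for the 7-byte footer
print(length_confidence(header + footer))  # 11 bytes, so the cap applies
```

Under this assumption, CREM plus DONE (8 bytes) already reaches 0.8, and any pair of long matches lands on the same ceiling, which would explain the .ct and .pt2 observations.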

Sample files.zip

Any improvements/comments are welcome, it works but there might be a better, nicer looking way to handle this.

New entries for magic_data.json:

  • Shotcut .mlt: open-source video editor
  • Creamtracker .ct: tracker format from the creator of BeRo Tracker (BeRo's Blog, under Audio software)
  • Picatune 2 .pt2: another music format from the creator of BeRo Tracker (Picatune 2)

Add three new formats with multiple headers using new reverse lookup feature
Allows `multi_part_header_dict` to perform a footer-style lookup. This is good for files with small primary headers that score low confidence, since they can get an aggregate score from a fixed footer (that may also be small). The combined score improves confidence.
@NebularNerd
Contributor Author

NebularNerd commented Mar 3, 2024

Not sure why checks are failing, still a bit of a GitHub noob when it comes to certain aspects.

EDIT:
Fixed it, ran the code block I modified through Black Playground and all is well. Will have to look more into that later for debugging/prettifying my own code.

I think I had a poorly placed/spaced ) that may have been causing the autobuild to fail; hopefully this will resolve it.
Second attempt to correct, ran it through Black so hopefully this will now clear the autobuild issues.
Comment on lines 154 to 171

    if "-" in str(magic_row.offset):
        start = magic_row.offset
        end = magic_row.offset + len(magic_row.byte_match)
        match_area = footer[start:end] if end != 0 else footer[start:]
        if match_area == magic_row.byte_match:
            new_matches.add(
                PureMagic(
                    byte_match=matched.byte_match + magic_row.byte_match,
                    offset=magic_row.offset,
                    extension=magic_row.extension,
                    mime_type=magic_row.mime_type,
                    name=magic_row.name,
                )
            )
    else:
        start = magic_row.offset
        end = magic_row.offset + len(magic_row.byte_match)
        if end > len(header):

Owner

Suggested change

    - if "-" in str(magic_row.offset):
    -     start = magic_row.offset
    -     end = magic_row.offset + len(magic_row.byte_match)
    -     match_area = footer[start:end] if end != 0 else footer[start:]
    -     if match_area == magic_row.byte_match:
    -         new_matches.add(
    -             PureMagic(
    -                 byte_match=matched.byte_match + magic_row.byte_match,
    -                 offset=magic_row.offset,
    -                 extension=magic_row.extension,
    -                 mime_type=magic_row.mime_type,
    -                 name=magic_row.name,
    -             )
    -         )
    - else:
    -     start = magic_row.offset
    -     end = magic_row.offset + len(magic_row.byte_match)
    -     if end > len(header):
    + start = magic_row.offset
    + end = magic_row.offset + len(magic_row.byte_match)
    + if magic_row.offset < 0:
    +     match_area = footer[start:end] if end != 0 else footer[start:]
    +     if match_area == magic_row.byte_match:
    +         new_matches.add(
    +             PureMagic(
    +                 byte_match=matched.byte_match + magic_row.byte_match,
    +                 offset=magic_row.offset,
    +                 extension=magic_row.extension,
    +                 mime_type=magic_row.mime_type,
    +                 name=magic_row.name,
    +             )
    +         )
    + else:
    +     if end > len(header):

Haven't verified the logic or tested it myself, but I wanted to provide a bit of Python-specific cleanup: move start and end outside the if statements, as they are the same in both branches, and check magic_row.offset < 0 instead of checking against a string (if that's a problem for some reason, let me know).
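
For anyone wanting to try the suggested logic in isolation, here is a minimal, self-contained sketch. It is not the project's actual function: `match_multi_part` is a hypothetical wrapper, the `PureMagic` namedtuple is rebuilt from the field names in the suggestion, the `.mlt` row values (offset -7, empty MIME type, name) are made up for illustration, and the body of the truncated `else` branch is an assumption.

```python
from collections import namedtuple

# Field names follow the PureMagic(...) call in the suggestion.
PureMagic = namedtuple("PureMagic", "byte_match offset extension mime_type name")


def match_multi_part(header, footer, matched, rows):
    """Return the rows that confirm `matched`; footer hits carry merged bytes."""
    new_matches = set()
    for magic_row in rows:
        start = magic_row.offset
        end = magic_row.offset + len(magic_row.byte_match)
        if magic_row.offset < 0:
            # A slice like footer[-7:0] is empty, so when the match runs to
            # the very end of the file (end == 0) slice without an upper bound.
            match_area = footer[start:end] if end != 0 else footer[start:]
            if match_area == magic_row.byte_match:
                new_matches.add(
                    PureMagic(
                        byte_match=matched.byte_match + magic_row.byte_match,
                        offset=magic_row.offset,
                        extension=magic_row.extension,
                        mime_type=magic_row.mime_type,
                        name=magic_row.name,
                    )
                )
        else:
            # Assumption: rows that would run past the header are skipped,
            # otherwise the header slice is compared directly.
            if end > len(header):
                continue
            if header[start:end] == magic_row.byte_match:
                new_matches.add(magic_row)
    return new_matches


# Illustrative data only: a tiny file that starts with <mlt and ends with </mlt>\n.
data = b"<mlt version='x'>\n</mlt>\n"
first = PureMagic(b"<mlt", 0, ".mlt", "", "Shotcut project")
second = PureMagic(b"</mlt>\n", -7, ".mlt", "", "Shotcut project")
result = match_multi_part(data, data, first, [second])
print(result)
```

The returned row carries the concatenated `byte_match`, which is the smooshed value mentioned in the PR description.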

Contributor Author

@NebularNerd NebularNerd Mar 12, 2024

@cdgriffith, Tested and no issues, I have merged it into the PR. Thanks for the suggestions, my coding skills are ok but I know there's always room for improvement. 🙂

I used a str() as I had brain fog and skipped past the correct < 0 method, both achieve the same goal, just mine was the long way round. 🤣
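
A quick sketch of that equivalence, assuming offsets are always plain integers (both function names are illustrative only):

```python
def is_negative_by_string(offset):
    # Original approach: look for a minus sign in the text form.
    return "-" in str(offset)


def is_negative(offset):
    # Reviewed approach: compare numerically.
    return offset < 0


# For integer offsets the two checks always agree.
for offset in (-7, -1, 0, 4):
    assert is_negative_by_string(offset) == is_negative(offset)

# They can disagree for other types: a small positive float prints with
# a minus sign in its exponent.
print(str(1e-05), is_negative_by_string(1e-05), is_negative(1e-05))
```

The numeric comparison also skips the string conversion on every row.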

Pythonic fixes suggested by @cdgriffith in PR comments. Checking in Black Playground so should merge first time. 🤞
Honestly need more coffee
@cdgriffith cdgriffith changed the base branch from master to develop April 7, 2024 18:51
@NebularNerd
Contributor Author

Conflicts resolved 🙂

@cdgriffith cdgriffith merged commit be1cc98 into cdgriffith:develop Apr 18, 2024
5 checks passed
Development

Successfully merging this pull request may close these issues.

Multi-part checks with negative offset for second match