Multi part reverse lookup #59
Add three new formats with multiple headers using new reverse lookup feature
Allows `multi_part_header_dict` to perform a footer-style lookup. This is good for files with small primary headers that score low confidence, as they can get an aggregate score from a fixed footer (which may also be small). The combined score improves confidence.
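As a rough illustration of the idea, here is a minimal sketch of a two-part match where the second part is looked up from the end of the file. `Part`, `part_matches` and `combined_match` are hypothetical names invented for this example, not puremagic's API; the sample bytes are the `.mlt` header and footer mentioned in this PR.

```python
from typing import NamedTuple, Optional


class Part(NamedTuple):
    """Hypothetical stand-in for one signature row (not puremagic's real type)."""

    byte_match: bytes
    offset: int  # >= 0 counts from the start of the file, < 0 from the end


def part_matches(part: Part, header: bytes, footer: bytes) -> bool:
    """Check one part against the header (forward) or footer (reverse)."""
    start = part.offset
    end = part.offset + len(part.byte_match)
    if part.offset < 0:
        # A slice ending at 0 would be empty, so read to the end instead.
        area = footer[start:end] if end != 0 else footer[start:]
    else:
        area = header[start:end]
    return area == part.byte_match


def combined_match(primary: Part, secondary: Part, header: bytes, footer: bytes) -> Optional[bytes]:
    """Return the concatenated bytes when both parts match, else None."""
    if part_matches(primary, header, footer) and part_matches(secondary, header, footer):
        return primary.byte_match + secondary.byte_match
    return None


data = b"<mlt version='x'>...</mlt>\n"
header, footer = data[:16], data[-16:]
result = combined_match(Part(b"<mlt", 0), Part(b"</mlt>\n", -7), header, footer)
miss = combined_match(Part(b"<mlt", 0), Part(b"</xml>\n", -7), header, footer)
```

Here `result` holds the longer, concatenated byte string that would then be scored, and `miss` is `None` because the footer part does not match.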
Not sure why checks are failing, still a bit of a GitHub noob when it comes to certain aspects.
EDIT: I think I had a poorly placed/spaced `)` that may have been causing the autobuild to fail; hopefully this will resolve it.
Second attempt to correct, ran it through Black so hopefully this will now clear the autobuild issues.
puremagic/main.py
    if "-" in str(magic_row.offset):
        start = magic_row.offset
        end = magic_row.offset + len(magic_row.byte_match)
        match_area = footer[start:end] if end != 0 else footer[start:]
        if match_area == magic_row.byte_match:
            new_matches.add(
                PureMagic(
                    byte_match=matched.byte_match + magic_row.byte_match,
                    offset=magic_row.offset,
                    extension=magic_row.extension,
                    mime_type=magic_row.mime_type,
                    name=magic_row.name,
                )
            )
    else:
        start = magic_row.offset
        end = magic_row.offset + len(magic_row.byte_match)
        if end > len(header):
    start = magic_row.offset
    end = magic_row.offset + len(magic_row.byte_match)
    if magic_row.offset < 0:
        match_area = footer[start:end] if end != 0 else footer[start:]
        if match_area == magic_row.byte_match:
            new_matches.add(
                PureMagic(
                    byte_match=matched.byte_match + magic_row.byte_match,
                    offset=magic_row.offset,
                    extension=magic_row.extension,
                    mime_type=magic_row.mime_type,
                    name=magic_row.name,
                )
            )
    else:
        if end > len(header):
Haven't verified the logic / tested myself, but just wanted to provide a bit of Python-specific cleanup: moving `start` and `end` outside the if statements as they are the same, and checking `magic_row.offset < 0` instead of checking against a string (if that's a problem for some reason, let me know).
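For anyone reading along, a small standalone sketch of the two details touched on here: the `end != 0` guard on the footer slice, and the numeric check versus the string check. The `footer` and `pattern` values are made up for illustration and assume the offset is stored as an integer.

```python
footer = b"...</mlt>\n"  # made-up last bytes of a file
pattern = b"</mlt>\n"    # footer signature that sits flush against the end

offset = -len(pattern)                      # negative offset: count from the end
start, end = offset, offset + len(pattern)  # end works out to 0 here

# Slicing up to index 0 gives an empty result, so the naive slice never matches.
naive = footer[start:end]

# The guard falls back to an open-ended slice when end is 0.
guarded = footer[start:end] if end != 0 else footer[start:]

# A pattern that stops short of the end needs no special handling.
inner = footer[-7:-3]

# For integer offsets the string check and the numeric check agree.
same_answer = all(("-" in str(n)) == (n < 0) for n in range(-10, 11))
```

With these values `naive` is empty while `guarded` equals `pattern`, which is why the conditional slice is needed for signatures that end exactly at the end of the file.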
@cdgriffith, Tested and no issues, I have merged it into the PR. Thanks for the suggestions, my coding skills are ok but I know there's always room for improvement. 🙂
I used a `str()` as I had brain fog and skipped past the correct `< 0` method; both achieve the same goal, just mine was the long way round. 🤣
Pythonic fixes suggested by @cdgriffith in PR comments. Checking in Black Playground so should merge first time. 🤞
Honestly need more coffee
Conflicts resolved 🙂
Closes #57
While digging through files for PR #58 I discovered that we could do with a `footer`-style reverse lookup on the multi part match to help boost confidence scores (especially if the footer is small as well). This modifies the multiple match to allow reverse lookups.

To behave like the normal forward match, we aggregate `matched.byte_match` and `magic_row.byte_match` to provide a longer match to the `_confidence` function. One downside of this is that the match data will report the two matches smooshed together, e.g.: `.mlt` is `0x3c6d6c74` / `<mlt` (0.4) and `3c2f6d6c743e0a` / `</mlt>\n` (0.7), with a combined confidence of 0.8, but you'll see `<mlt</mlt>\n` in the data, which looks untidy (could there be a better way to address this?).

Also, some unexpected results:
- `.pt2` cannot score above 0.8 when combined; I imagine it's because both are long matches which score 0.8 individually.
- `.ct` gets 0.4 for `CREM` and 0.8 for `DONE`, but the combined will not go above 0.8.

Sample files.zip
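The figures quoted above are consistent with a score that grows with the number of matched bytes and tops out at 0.8. The sketch below assumes exactly that rule; `length_confidence` is a hypothetical stand-in written for this comment, not the project's `_confidence` function, which may differ in detail.

```python
def length_confidence(byte_match: bytes) -> float:
    """Hypothetical scorer: 0.1 per matched byte, capped at 0.8 (an assumption)."""
    n = len(byte_match)
    return 0.8 if n >= 8 else n / 10


mlt_header = bytes.fromhex("3c6d6c74")        # <mlt
mlt_footer = bytes.fromhex("3c2f6d6c743e0a")  # </mlt>\n

header_score = length_confidence(mlt_header)
footer_score = length_confidence(mlt_footer)
combined_score = length_confidence(mlt_header + mlt_footer)

# Two parts that are already long enough to hit the cap gain nothing when joined.
long_part = b"x" * 12
capped_score = length_confidence(long_part + long_part)
```

If the real scoring behaves like this, concatenating the two matches can lift short signatures up to the cap but can never push a combination past 0.8, which might explain the `.pt2` and `.ct` results.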
Any improvements/comments are welcome, it works but there might be a better, nicer looking way to handle this.
New entries for magic_data.json:
- `.mlt`: opensource video editor
- `.ct`: Tracker format from the creator of BeRo Tracker (BeRo's Blog, under Audio software)
- `.pt2`: Another music format from the creator of BeRo Tracker (Picatune 2)