Multi part reverse lookup #59

NebularNerd · 2024-03-03T16:16:55Z

Closes #57

While digging through files for PR #58, I discovered that we could do with a footer-style reverse lookup on the multi-part match to help boost confidence scores (especially if the footer is small as well). This modifies the multi-part match to allow reverse lookups.

To behave like the normal forward match, we aggregate matched.byte_match and magic_row.byte_match to provide a longer match to the _confidence function. One downside of this is that the match data will report the two matches smooshed together, e.g.:

.mlt is 0x3c6d6c74 / <mlt (0.4) and 0x3c2f6d6c743e0a / </mlt>\n (0.7), for a combined confidence of 0.8, but you'll see <mlt</mlt>\n in the data, which looks untidy (could there be a better way to address this?)

Also, some unexpected results:

.pt2 cannot score above 0.8 when combined; I imagine it's because both are long matches which score 0.8 individually
.ct gets 0.4 for CREM and 0.8 for DONE but the combined will not go above 0.8
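
The 0.8 ceiling would be consistent with a score based purely on match length. A minimal sketch, assuming confidence is one tenth per matched byte and capped at 0.8 (the helper `length_confidence` is hypothetical and only stands in for `_confidence`; it is not the library's code), reproduces the .mlt numbers above:

```python
def length_confidence(byte_match: bytes) -> float:
    """Hypothetical stand-in for a length-based confidence score.

    Assumption: one tenth per matched byte, capped at 0.8.
    """
    return min(len(byte_match), 8) / 10


header = bytes.fromhex("3c6d6c74")        # b"<mlt"
footer = bytes.fromhex("3c2f6d6c743e0a")  # b"</mlt>\n"

scores = {
    "header": length_confidence(header),
    "footer": length_confidence(footer),
    "combined": length_confidence(header + footer),
}
print(scores)
```

Under that assumption, concatenating two matches that already sit at the cap cannot raise the score any further, which would explain the .pt2 and .ct results.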

Sample files.zip

Any improvements/comments are welcome; it works, but there might be a better, nicer-looking way to handle this.

New entries for magic_data.json:

Shotcut .mlt: open-source video editor
Creamtracker .ct: tracker format from the creator of BeRo Tracker (see BeRo's Blog, under Audio software)
Picatune 2 .pt2: another music format from the creator of BeRo Tracker (see Picatune 2)

Add three new formats with multiple headers using new reverse lookup feature

Allows `multi_part_header_dict` to perform a footer-style lookup. This is good for files with small primary headers that score low confidence, which can then get an aggregate score from a fixed footer (that may also be small). The combined score improves confidence.
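
As a rough illustration of the footer-style lookup, the sketch below isolates the slicing step used in the diff. The helper name `footer_slice` and the sample bytes are assumptions made for the example; only the slice expression itself comes from this PR.

```python
def footer_slice(footer: bytes, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at a negative `offset` from the end.

    Hypothetical helper mirroring the slicing used in this PR.
    """
    start = offset
    end = offset + length
    # When the match runs to the very end of the data, end == 0 and
    # footer[start:0] would be empty, so slice to the end instead.
    return footer[start:end] if end != 0 else footer[start:]


data = b"<mlt>...</mlt>\n"
tail = footer_slice(data, -7, 7)   # match reaches the last byte
inner = footer_slice(data, -7, 6)  # match stops one byte early
print(tail, inner)
```

The `end != 0` guard matters because a negative start combined with a stop of 0 yields an empty slice in Python rather than "everything up to the end".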

NebularNerd · 2024-03-03T16:26:23Z

Not sure why checks are failing, still a bit of a GitHub noob when it comes to certain aspects.

EDIT:
Fixed it, ran the code block I modified through Black Playground and all is well. Will have to look more into that later for debugging/prettifying my own code.

I think I had a poorly placed/spaced ) that may have been causing the autobuild to fail; hopefully this will resolve it.

Second attempt to correct, ran it through Black so hopefully this will now clear the autobuild issues.

cdgriffith · 2024-03-11T23:43:17Z

puremagic/main.py

+                if "-" in str(magic_row.offset):
+                    start = magic_row.offset
+                    end = magic_row.offset + len(magic_row.byte_match)
+                    match_area = footer[start:end] if end != 0 else footer[start:]
+                    if match_area == magic_row.byte_match:
+                        new_matches.add(
+                            PureMagic(
+                                byte_match=matched.byte_match + magic_row.byte_match,
+                                offset=magic_row.offset,
+                                extension=magic_row.extension,
+                                mime_type=magic_row.mime_type,
+                                name=magic_row.name,
+                            )
+                        )
+                else:
+                    start = magic_row.offset
+                    end = magic_row.offset + len(magic_row.byte_match)
+                    if end > len(header):


Suggested change

-                if "-" in str(magic_row.offset):
-                    start = magic_row.offset
-                    end = magic_row.offset + len(magic_row.byte_match)
-                    match_area = footer[start:end] if end != 0 else footer[start:]
-                    if match_area == magic_row.byte_match:
-                        new_matches.add(
-                            PureMagic(
-                                byte_match=matched.byte_match + magic_row.byte_match,
-                                offset=magic_row.offset,
-                                extension=magic_row.extension,
-                                mime_type=magic_row.mime_type,
-                                name=magic_row.name,
-                            )
-                        )
-                else:
-                    start = magic_row.offset
-                    end = magic_row.offset + len(magic_row.byte_match)
-                    if end > len(header):
+                start = magic_row.offset
+                end = magic_row.offset + len(magic_row.byte_match)
+                if magic_row.offset < 0:
+                    match_area = footer[start:end] if end != 0 else footer[start:]
+                    if match_area == magic_row.byte_match:
+                        new_matches.add(
+                            PureMagic(
+                                byte_match=matched.byte_match + magic_row.byte_match,
+                                offset=magic_row.offset,
+                                extension=magic_row.extension,
+                                mime_type=magic_row.mime_type,
+                                name=magic_row.name,
+                            )
+                        )
+                else:
+                    if end > len(header):

Haven't verified the logic or tested it myself, but just wanted to provide a bit of Python-specific cleanup: moving start and end outside the if statements, as they are the same in both branches, and checking magic_row.offset < 0 instead of checking against a string (if that's a problem for some reason, let me know).

@cdgriffith, Tested and no issues, I have merged it into the PR. Thanks for the suggestions, my coding skills are ok but I know there's always room for improvement. 🙂

I used a str() as I had brain fog and skipped past the correct < 0 method, both achieve the same goal, just mine was the long way round. 🤣
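
For integer offsets the two checks do agree, as this small sketch suggests (both function names are hypothetical, written only for the comparison). The float case is an illustrative assumption, not something the offsets in this PR are known to contain, but it shows why the numeric comparison is the safer habit:

```python
def is_footer_offset_str(offset) -> bool:
    # Original approach: look for a minus sign in the string form.
    return "-" in str(offset)


def is_footer_offset_num(offset) -> bool:
    # Suggested approach: compare numerically.
    return offset < 0


int_offsets = [-7, -1, 0, 4, 512]
agree_on_ints = all(
    is_footer_offset_str(o) == is_footer_offset_num(o) for o in int_offsets
)

# A small positive float prints in scientific notation ("1e-05"),
# so the string check misreads it as negative.
tiny = 1e-05
string_check_misfires = is_footer_offset_str(tiny) and not is_footer_offset_num(tiny)
print(agree_on_ints, string_check_misfires)
```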

Pythonic fixes suggested by @cdgriffith in PR comments. Checking in Black Playground so should merge first time. 🤞

Honestly need more coffee

NebularNerd · 2024-04-08T08:07:57Z

Conflicts resolved 🙂

NebularNerd added 2 commits March 3, 2024 15:39

Update magic_data.json with .ctm, .pt2, .mlt

65f1bbf

Add three new formats with multiple headers using new reverse lookup feature

NebularNerd added 2 commits March 7, 2024 10:21

Update main.py reformat to hopefully pass build tests

ed74e52

I think I had a poorly placed/spaced ) that may have been causing the autobuild to fail; hopefully this will resolve it.

Update main.py layout fixing using Black

284daa4

Second attempt to correct, ran it through Black so hopefully this will now clear the autobuild issues.

cdgriffith reviewed Mar 11, 2024

NebularNerd added 2 commits March 12, 2024 10:10

Update main.py pythonic fixes

1a98717

Pythonic fixes suggested by @cdgriffith in PR comments. Checking in Black Playground so should merge first time. 🤞

Update main.py again

463e013

Honestly need more coffee

cdgriffith changed the base branch from master to develop April 7, 2024 18:51

Merge branch 'develop' into Multi-part-reverse-lookup

1c7d0f8

cdgriffith merged commit be1cc98 into cdgriffith:develop Apr 18, 2024
5 checks passed
