Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply fuzzy match if url prefix and regex match #524

Merged
merged 1 commit into from
Dec 21, 2019

Conversation

nlevitt
Copy link
Contributor

@nlevitt nlevitt commented Dec 3, 2019

even if no groups are captured by the regex

As we discussed on slack.

nlevitt 3:12 PM
hey dude, i have a question about the fuzzy match rules

https://github.com/webrecorder/pywb/blob/5f938e68797/pywb/warcserver/index/fuzzymatcher.py#L70 (edited) 
if i’m reading the code right, if a url prefix matches and the regex matches, but there are no matching groups, then the fuzzy match rule does not apply
ikreymer 3:14 PM
hey, yeah, this could probably benefit from additional comments..
nlevitt 3:15 PM
i’m wondering, if the regex matches, why not consider that a match… and if there are no matching groups, then you just don’t have any urlkey: filters
this came up because some of our fuzzy match rules from java wayback have regexes with no capturing group parens
and i didn’t think about this case when i migrated them over
for example
 782     # <!-- for https://webarchive.jira.com/browse/AITFIVE-4477 -->
 783     # <bean class="org.archive.archiveit.url.ArchiveItExtractRule">             
 784     #   <property name="startsWith" value="com,tumblr,media"/>                  
 785     #   <property name="regex" value="[&amp;?].*r=[\d.]+.*"/>                   
 786     # </bean>                                                                   
 787     #                                                                           
 788     # <!-- for https://webarchive.jira.com/browse/AITFIVE-4477 -->              
 789     # <bean class="org.archive.archiveit.url.ArchiveItExtractRule">             
 790     #   <property name="startsWith" value="com,trainwreckmovie)/img/sounds"/>   
 791     #   <property name="regex" value="[&amp;?].*r=[\d.]+.*"/>                   
 792     # </bean>                                                                   
 793     - url_prefix:                                                               
 794       - com,tumblr,media                                                        
 795       - com,trainwreckmovie)/img/sounds                                         
 796       fuzzy_lookup: '[&?].*r=[\d.]+.*'                                          
i think i can make it work just by adding an empty capturing group () somewhere in that regex
ikreymer 3:20 PM
trying to remember how it all works.. hm
yes, i suppose that would work if if not groups did if groups is None
that might have been an oversight.., or maybe i specifically wanted it to have at least one capture group..
right, so w/o capture group the first prefix result will be selected
nlevitt 3:25 PM
“first prefix result will be selected” — not 100% sure what you mean by that
ikreymer 3:30 PM
oh i just mean the first result from the prefix query, w/o any filters.. i guess its the same with filters, so, yeah that makes sense :slightly_smiling_face:
nlevitt 3:30 PM
ah ok
so far you haven’t been able to think of anything that would break if you changed if not groups to if groups is None ? :slightly_smiling_face:
ikreymer 3:36 PM
right, yeah
yeah, i think that should be fine, since rx is still matched, just no groups
nlevitt 3:38 PM
cool
is that a change you’d like to make? i’d rather not deviate from pywb here
and can work around it with configuration
but making the change looks like the best solution from my perspective
ikreymer 3:41 PM
yeah, i can make that change in pywb

nlevitt added a commit to jkafader/outbackcdx that referenced this pull request Dec 3, 2019
even if no groups are captured by the regex

see webrecorder/pywb#524 for some discussion
@ikreymer ikreymer merged commit 523e35d into webrecorder:develop Dec 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants