De-dupe search results #31

Open
that1guy opened this issue Apr 10, 2015 · 3 comments

@that1guy
Member

I have code that will de-dupe records returned from mongo.find. This should probably run somewhere in the waterfall block before results are returned. Let me know how you'd like it implemented, and I'll do it on a separate branch and send a pull request.
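One way this could slot in, as a rough sketch only: a step that takes the previous step's postings and a Node-style callback, then hands the de-duped list on. It assumes the waterfall steps have the usual `(value, callback)` shape; `makeDedupeStep` and `keyOf` are hypothetical names, not anything in the codebase.

```javascript
// Hypothetical factory: builds a waterfall-style step that drops postings
// whose key (as returned by keyOf) has already been seen.
function makeDedupeStep(keyOf) {
    return function dedupeStep(postings, next) {
        var seen = Object.create(null); // no inherited keys to collide with

        var unique = postings.filter(function (posting) {
            var key = String(keyOf(posting));
            if (seen[key]) {
                return false; // already seen, drop it
            }
            seen[key] = true;
            return true;
        });

        // Node-style callback: error first, then the value for the next step.
        next(null, unique);
    };
}

// Example step keyed on external_url, as in the search results.
var dedupeByUrl = makeDedupeStep(function (posting) {
    return posting.external_url;
});
```

If the waterfall is an `async.waterfall` call, `dedupeByUrl` could go right after the step that produces the postings array, so later steps only ever see the unique records.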

@that1guy
Member Author

FYI, I can do this on the client side if that's helpful. Just say the word so we're not duplicating effort.

that1guy modified the milestone: 5/15 beta launch (Apr 21, 2015)
@that1guy
Member Author

Cleaning up code on my end. Before I delete my de-dupe code, here it is:

var HashTable = require('hashtable');

// De-dupes result.external.postings, first by external_url, then by heading.
// `promise` is a Node-style callback: promise(err, originals).
exports.dedupe = function(result, promise){

    var userLat = result.location.latitude;
    var userLong = result.location.longitude;
    var response = result.external;

    var deDupeExternalURL = new HashTable();
    var deDupeHeading = new HashTable();

    var duplicates = [];
    var originals = [];

    // Start at 0 so the first posting is checked and kept like any other
    // (and an empty postings array no longer throws).
    for (var i = 0; i < response.postings.length; i++) {

        var posting = response.postings[i];

        if(typeof deDupeExternalURL.get(posting.external_url) === 'undefined'){
            //console.log("URL is unique: " + posting.external_url);
            deDupeExternalURL.put(posting.external_url, i);

            if(typeof deDupeHeading.get(posting.heading) === 'undefined'){
                //console.log("Heading is unique: "+ posting.heading);
                deDupeHeading.put(posting.heading, i);

//              TODO: Clean up HTML
                // convertToHTSObjStructure is not shown in this snippet.
                originals.push(convertToHTSObjStructure(posting, userLat, userLong));

            } else {
                duplicates.push(posting);
                console.log("Duplicate Heading: " + posting.heading);
            }
        } else {
            duplicates.push(posting);
            console.log("Duplicate URL: " + posting.external_url);
        }
    }

    console.log("!!!!!!!!!~~~~ DONE WITH DEDUPE ~~~~!!!!!!!!!");
    console.log(duplicates.length + " Duplicates");
    console.log(originals.length + " Originals");
    console.log("!!!!!!!!!~~~~ DONE WITH DEDUPE ~~~~!!!!!!!!!");

    promise(null, originals);

};
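
For comparison, here is a dependency-free sketch of the same two-stage check (URL first, then heading), assuming a runtime with a native `Set`. `dedupePostings` is a hypothetical name, and the conversion to the HTS object structure is deliberately left out, so this only splits the raw postings.

```javascript
// Hypothetical helper: splits postings into originals and duplicates.
// A posting is a duplicate if its external_url, or failing that its
// heading, has already been seen on an earlier posting.
function dedupePostings(postings) {
    var seenUrls = new Set();
    var seenHeadings = new Set();
    var originals = [];
    var duplicates = [];

    postings.forEach(function (posting) {
        if (seenUrls.has(posting.external_url)) {
            duplicates.push(posting); // same URL as an earlier posting
            return;
        }
        seenUrls.add(posting.external_url);

        if (seenHeadings.has(posting.heading)) {
            duplicates.push(posting); // new URL, but a repeated heading
            return;
        }
        seenHeadings.add(posting.heading);

        originals.push(posting);
    });

    return { originals: originals, duplicates: duplicates };
}
```

Returning both lists instead of logging keeps the function easy to call from either the API or the client, and the caller can decide what to do with the duplicates.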

@that1guy
Member Author

This is becoming the only weak spot in search that I can see. Knock this down and I think we're golden.

https://staging-posting-api.hashtagsell.com/v1/postings/?start=0&count=35&filters[mandatory][contains][heading]=htc&filters[optional][exact][categoryCode]=SELE,SAPL&geo[lookup]=true&geo[min]=0&geo[max]=12890000
