Improve the URL and path extractor #22

Open
mzfr opened this issue Oct 8, 2020 · 2 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@mzfr
Owner

mzfr commented Oct 8, 2020

Not sure if it's the regex or something else, but we get loads of binary/Unicode-looking characters in the file.

Also, there are loads of URLs pointing to websites like googleapi or w3school. We should keep the output clean.

@mzfr mzfr added the bug Something isn't working label Oct 8, 2020
@mzfr mzfr added the help wanted Extra attention is needed label Oct 18, 2020
@mzfr
Owner Author

mzfr commented Oct 20, 2020

I think the way to do this is to make a blacklist and check whether the domain is in that list before writing it to the file here.

Something like:

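// BadURLs lists the entries that should not be written to the output file.
var BadURLs = map[string]bool{
	"URL HERE": true,
}

for _, d := range data {
	// Skip blacklisted entries; write everything else.
	if BadURLs[d] {
		continue
	}
	_, _ = datawriter.WriteString(d + "\n")
}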

But the thing is, we don't want to compare the exact URLs, just the root domains. We might have to use a regex for each value, or parse the URL and take out the root domain.
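A rough sketch of the parse-the-URL option, using only the standard library. Instead of extracting the root domain, it checks whether the host equals a blacklisted domain or ends with a dot plus that domain, which covers subdomains without needing a public-suffix list. The names badDomains and isBlacklisted, and the entries in the list, are placeholders made up for illustration; they are not part of the project.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// badDomains is a hypothetical blacklist of root domains; the entries are
// placeholders for illustration, not the project's actual list.
var badDomains = []string{"googleapis.com", "w3schools.com"}

// isBlacklisted reports whether raw's host equals a blacklisted domain or is
// a subdomain of one. Strings that do not parse as a URL with a host (for
// example plain paths) are never treated as blacklisted, so they are kept.
func isBlacklisted(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Hostname() == "" {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range badDomains {
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	data := []string{
		"https://fonts.googleapis.com/css?family=Roboto",
		"https://example.com/api/v1/users",
		"https://www.w3schools.com/html/",
	}
	for _, d := range data {
		if isBlacklisted(d) {
			continue
		}
		fmt.Println(d)
	}
}
```

One caveat with this approach: a URL extracted without a scheme (such as example.com/path) parses with an empty host, so it would pass through unfiltered unless a scheme is added before parsing.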

@mzfr
Copy link
Owner Author

mzfr commented May 23, 2021

I spent quite a lot of time trying to figure out a way but couldn't. Actually, there is no way; changing the regex would sometimes miss a lot of stuff (which could be important).

What should we do then?

Once URL.txt and path.txt are generated, run strings on each file. Simple.
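If running an external tool is not an option, roughly the same cleanup could be done in Go before writing the files. This is only a sketch of the idea, not what the project does: printableRuns is a made-up helper that keeps runs of printable ASCII of at least a minimum length, which approximates the default behaviour of strings (the real utility differs in details, such as how it treats tabs).

```go
package main

import (
	"fmt"
	"strings"
)

// printableRuns returns the runs of printable ASCII characters (0x20-0x7e)
// in s that are at least minLen bytes long. Shorter runs and all other
// bytes are dropped.
func printableRuns(s string, minLen int) []string {
	var runs []string
	start := -1 // start index of the current run, or -1 if none
	// Iterate one step past the end so a run that reaches the end is flushed.
	for i := 0; i <= len(s); i++ {
		printable := i < len(s) && s[i] >= 0x20 && s[i] <= 0x7e
		if printable {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 && i-start >= minLen {
			runs = append(runs, s[start:i])
		}
		start = -1
	}
	return runs
}

func main() {
	// Hypothetical extracted line with binary noise around two useful values.
	line := "\x00\x01https://example.com/login\xff\xfeab\x02/api/v1"
	fmt.Println(strings.Join(printableRuns(line, 4), "\n"))
}
```

The minimum length of 4 mirrors the default of strings; a lower value keeps more short fragments at the cost of more noise.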
