Improve the URL and path extractor #22

mzfr · 2020-10-08T20:06:47Z

Not sure if it's the regex or something else, but we get loads of binary/non-printable Unicode characters in the output files.

Also, there are loads of URLs pointing to websites like googleapi or w3school. We should filter those out to keep the output clean.

mzfr · 2020-10-20T08:11:10Z

I think the way to do this is to make a blacklist and check whether the domain is in that list before writing the URL to the file here.

Something like:

var BadURLs = map[string]bool{
	"URL HERE": true,
}

for _, d := range data {
	// Skip blacklisted entries; write everything else.
	if BadURLs[d] {
		continue
	}
	_, _ = datawriter.WriteString(d + "\n")
}

But the thing is, we don't want to compare the exact URLs, just the root domains. We might have to use a regex for each value, or parse the URL and take out the root domain.
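A rough sketch of the "parse the URL" option, using only the standard library. The names `blockedDomains` and `isBlocked` are made up for illustration, and the two entries are just example domains based on the sites mentioned above. Instead of computing the root domain exactly, it strips one leading label at a time from the host and checks each suffix against the list, so subdomains of a listed domain match too.

```go
package main

import (
	"net/url"
	"strings"
)

// blockedDomains is a hypothetical blacklist keyed by domain, not by full URL.
// The entries are examples only.
var blockedDomains = map[string]bool{
	"googleapis.com": true,
	"w3schools.com":  true,
}

// isBlocked reports whether the host of raw is a listed domain or a
// subdomain of one. Relative paths and unparseable values are never blocked.
func isBlocked(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Hostname() == "" {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for {
		if blockedDomains[host] {
			return true
		}
		// Drop the leftmost label: fonts.googleapis.com -> googleapis.com.
		i := strings.IndexByte(host, '.')
		if i < 0 {
			return false
		}
		host = host[i+1:]
	}
}
```

In the loop above, the check would then become `if isBlocked(d) { continue }`. If an exact root domain is needed, the `golang.org/x/net/publicsuffix` package has `EffectiveTLDPlusOne`, at the cost of an extra dependency.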

mzfr · 2021-05-23T18:17:02Z

I spent quite a lot of time trying to figure out a way but couldn't. Actually, I don't think there is one: changing the regex would sometimes miss a lot of stuff (which could be important).

What should we do then?

Once URL.txt and path.txt are generated, run strings on each file. Simple.
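If shelling out to strings is not wanted, the same effect could be approximated in Go. This is only a sketch: `printableRuns` is a made-up helper, it treats bytes 0x20 to 0x7e as printable (a simplification of what strings accepts), and the minimum length is a parameter; strings itself defaults to 4.

```go
package main

// printableRuns returns every run of printable ASCII bytes in s that is at
// least minLen bytes long, dropping the binary bytes in between.
func printableRuns(s string, minLen int) []string {
	var runs []string
	start := 0
	for i := 0; i <= len(s); i++ {
		if i < len(s) && s[i] >= 0x20 && s[i] <= 0x7e {
			continue // still inside a printable run
		}
		// Hit a non-printable byte or the end of input: close the run.
		if i-start >= minLen {
			runs = append(runs, s[start:i])
		}
		start = i + 1
	}
	return runs
}
```

Each extracted entry could be passed through `printableRuns(d, 4)` before being written, so that only the readable pieces end up in URL.txt and path.txt.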

mzfr added the "bug" label (Something isn't working) on Oct 8, 2020

mzfr added the "help wanted" label (Extra attention is needed) on Oct 18, 2020
