v2.2 - see CHANGELOG.md
xnl-h4ck3r committed Nov 21, 2024
1 parent c50b2bf commit ba5a8e7
Showing 4 changed files with 61 additions and 18 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
## Changelog

- v2.2

- New

- Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).
- Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).

- Changed

- The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.
- Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.
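
The wording fix for `-khw` matters at exactly one boundary: a path part with three dashes. The sketch below shows the difference between the old description ("3 or more") and the corrected one ("more than 3"). The helper name `looks_human_written` and its `inclusive` flag are hypothetical, since the actual check in urless is not part of this commit.

```python
def looks_human_written(path_part: str, inclusive: bool = False) -> bool:
    """Hypothetical helper, not the urless implementation.

    inclusive=False models the corrected wording: "more than 3 dashes".
    inclusive=True models the old wording: "3 or more dashes".
    """
    dashes = path_part.count("-")
    return dashes >= 3 if inclusive else dashes > 3


# A slug with exactly three dashes is where the two wordings disagree.
three_dashes = "my-first-blog-post"
four_dashes = "how-to-set-up-things"

print(looks_human_written(three_dashes))                  # corrected wording
print(looks_human_written(three_dashes, inclusive=True))  # old wording
print(looks_human_written(four_dashes))
```

Under the corrected wording, a slug such as `my-first-blog-post` would be kept even without `-khw`, assuming the tool's behaviour matches its documentation.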

- v2.1

- New
14 changes: 8 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
<center><img src="https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png"></center>

## About - v2.1
## About - v2.2

This is a tool used to de-clutter a list of URLs.
As a starting point, I took the amazing tool [uro](https://github.com/s0md3v/uro/) by Somdev Sangwan. But I wanted to change a few things, make some improvements (like deal with GUIDs) and make it more customizable.
@@ -45,12 +45,14 @@ pipx install git+https://github.com/xnl-h4ck3r/urless.git
| -fe | --filter-extensions | A comma separated list of file extensions to exclude. This will override the `FILTER_EXTENSIONS` list specified in `config.yml` |
| -rp | --remove-params | A comma separated list of **case sensitive** parameters to remove from ALL URLs. This will override the `REMOVE_PARAMS` list specified in `config.yml`. This can be useful to remove cache buster parameters for example.\*\* |
| -ks | --keep-slash | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash. |
| -khw | --keep-human-written | By default, any URL with a path part that contains 3 or more dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output. |
| -kym | --keep-yyyymm | By default, any URL with a path containing 3 /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
| -khw | --keep-human-written | By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output. |
| -kym | --keep-yyyymm | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
| -rcid | --regex-custom-id | **USE WITH CAUTION!** Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the section below for more details on this. |
| -iq | --ignore-querystring | Remove the query string (including URL fragments `#`) so output is unique paths only. |
| -fnp | --fragment-not-param | Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link. |
| -lang | --language | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` section of `config.yml`. |
| -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default config directory, e.g. `~/.config/urless/`. |
| -dp | --disregard-params | There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. |
| -nb | --no-banner | Hides the tool banner (it is hidden by default if you pipe input to urless) output. |
| | --version | Show current version number. |
| -v | --verbose | Verbose output |
@@ -71,14 +73,14 @@ Here's what happens:

- If a URL has port 80 or 443 explicitly given, then remove it from the URL (e.g. http://example.com:80/test -> http://example.com/test)
- If the URL has any **FILTER-EXTENSIONS**, it will be removed from the output.
- If the URL has NO parameters:
- If the URL has NO parameters **OR** the `-dp`/`--disregard-params` argument was passed:
- If the URL contains a **FILTER-KEYWORDS** or **UNWANTED-CONTENT**, it will be removed.
- If the URL query string contains unwanted parameters specified in config `REMOVE_PARAMS` (or overridden with argument `-rp`/`--remove-params`), they will be removed from all URLs before processing.
- If `-rcid`/`--regex-custom-id` is passed and the URL path contains a Custom ID, only one match to the Custom ID regex will be included if there are multiple URLs where that is the only difference.
- If the URL path contains a GUID, only one of the GUIDs will be included if there are multiple URLs where the GUID is the only difference.
- If the URL path contains an Integer ID, only one of the Integer IDs will be included if there are multiple URLs where the Integer ID is the only difference.
- If the `-lang` argument is passed and the URL contains a language code (e.g. `en-gb`), only one of the language codes will be included if there are multiple URLs where the language code is different.
- Else the URL has Parameters (or a fragment `#`):
- Else the URL has Parameters (or a fragment `#`) **AND** the `-dp`/`--disregard-params` argument was NOT passed:
- If there are multiple URLs with the same parameters, then only URLs with unique parameter values are included.
- If there are URL's with a Parameter, but no value (or a fragment), then this will be included.
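
The branching described in the list above can be sketched as a single predicate. This is a minimal illustration, assuming "has parameters" means a non-empty query string or fragment; the function name `keyword_filtering_applies` is hypothetical, and only the condition `(not params or args.disregard_params) and path != ""` comes from the change to `processUrl` in this commit.

```python
from urllib.parse import urlparse


def keyword_filtering_applies(url: str, disregard_params: bool = False) -> bool:
    """Hypothetical sketch of when the keyword/unwanted-content filters run.

    Assumption: a URL "has parameters" if it has a query string or a fragment.
    """
    parsed = urlparse(url)
    has_params = bool(parsed.query or parsed.fragment)
    # Mirrors the condition in the commit: filter when there are no
    # parameters, or when -dp/--disregard-params was passed, as long as
    # the path is not empty.
    return (not has_params or disregard_params) and parsed.path != ""


for candidate in (
    "https://example.com/blog/post",
    "https://example.com/blog/post?id=1",
    "https://example.com/blog#top",
):
    print(candidate, keyword_filtering_applies(candidate),
          keyword_filtering_applies(candidate, disregard_params=True))
```

With `disregard_params=False`, only the first URL is eligible for keyword filtering; with `disregard_params=True`, all three are.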

@@ -153,7 +155,7 @@ If you come across any problems at all, or have ideas for improvements, please f

## TODO

- Allow `-rcid`/`--regex-custom-id` argument to take multiple regex strings
None - feel free to raise a Github issue to suggest any enhancements.

## And finally...

2 changes: 1 addition & 1 deletion urless/__init__.py
@@ -1 +1 @@
__version__="2.1"
__version__="2.2"
51 changes: 40 additions & 11 deletions urless/urless.py
@@ -67,6 +67,7 @@
outFile = None
linesOrigCount = 0
linesFinalCount = 0
usingConfigDefaults = False

def verbose():
'''
@@ -115,7 +116,7 @@ def getConfig():
'''
Try to get the values from the config file, otherwise use the defaults
'''
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults
try:

# Try to get the config file values
@@ -129,10 +130,13 @@ def getConfig():
)

urlessPath.absolute
if urlessPath == '':
configPath = 'config.yml'
if args.config is None:
if urlessPath == '':
configPath = 'config.yml'
else:
configPath = Path(urlessPath / 'config.yml')
else:
configPath = Path(urlessPath / 'config.yml')
configPath = Path(args.config)
config = yaml.safe_load(open(configPath))

# If the user provided the --filter-extensions argument then it overrides the config value
@@ -193,8 +197,12 @@ def getConfig():
writerr(colored('Unable to read REMOVE_PARAMS from config.yml - default set', 'red'))
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS

except:
writerr(colored('WARNING: Cannot find config.yml, so using default values', 'yellow'))
except Exception as e:
if args.config is None:
writerr(colored('WARNING: Cannot find file "config.yml", so using default values', 'yellow'))
else:
writerr(colored('WARNING: Cannot find file "' + args.config + '", so using default values', 'yellow'))
usingConfigDefaults = True
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
LANGUAGE = DEFAULT_LANGUAGE
@@ -446,8 +454,8 @@ def processUrl(line):
if hasBadExtension(path):
return

# If there are no parameters and path isn't empty
if not params and path != "":
# If there are no parameters (or the --disregard-params argument was passed) and path isn't empty
if (not params or args.disregard_params) and path != "":

# If its unwanted content or has a keyword to be excluded, then just return to continue with the next line
if isUnwantedContent(path) or hasFilterKeyword(path):
@@ -598,15 +606,24 @@ def processOutput():
writerr(colored('ERROR processOutput 1: ' + str(e), 'red'))

def showOptionsAndConfig():
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults
try:
write(colored('Selected options and config:', 'cyan'))
write(colored('-i: ' + args.input, 'magenta')+colored(' The input file of URLs to de-clutter.','white'))
if args.output is not None:
write(colored('-o: ' + args.output, 'magenta')+colored(' The output file that the de-cluttered URL list will be written to.','white'))
else:
write(colored('-o: <STDOUT>', 'magenta')+colored(' An output file wasn\'t given, so output will be written to STDOUT.','white'))

if args.disregard_params:
write(colored('-dp: True', 'magenta')+colored(' When filtering the URLs, they will not be treated differently just because they have parameters.','white'))

if args.config:
if usingConfigDefaults:
write(colored('-config: ' + args.config, 'magenta')+colored(' The path of the YML config file.','white')+colored(' WARNING: Not found, so using default values.','yellow'))
else:
write(colored('-config: ' + args.config, 'magenta')+colored(' The path of the YML config file.','white'))

if args.filter_keywords:
write(colored('-fk (Keywords to Filter): ', 'magenta')+colored(args.filter_keywords,'white'))
else:
@@ -720,13 +737,13 @@ def main():
'-khw',
'--keep-human-written',
action='store_true',
help='By default, any URL with a path part that contains 3 or more dashes (-) are removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.',
help='By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.',
)
parser.add_argument(
'-kym',
'--keep-yyyymm',
action='store_true',
help='By default, any URL with a path containing 3 /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.',
help='By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.',
)
parser.add_argument(
'-rcid',
@@ -755,6 +772,18 @@ def main():
action='store_true',
help='If passed, and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the "LANGUAGE" section of "config.yml".',
)
parser.add_argument(
"-c",
"--config",
action="store",
help="Path to the YML config file. If not passed, it looks for file 'config.yml' in the default config directory, e.g. '~/.config/urless/'.",
)
parser.add_argument(
"-dp",
"--disregard-params",
action="store_true",
help="There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.",
)
parser.add_argument("-nb", "--no-banner", action="store_true", help="Hides the tool banner.")
parser.add_argument('--version', action='store_true', help="Show version number")
parser.add_argument('-v', '--verbose', action='store_true', help='Verbose output.')
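
For readers who want to try the two new arguments in isolation, here is a minimal, self-contained sketch. The `add_argument` calls follow the ones added in this commit (with shortened help text); `DEFAULT_CONFIG_DIR`, `build_parser` and `resolve_config_path` are hypothetical names, and the default directory is an assumption based on the `~/.config/urless/` example in the help text.

```python
import argparse
from pathlib import Path
from typing import Optional

# Assumed default location, taken from the example in the help text.
DEFAULT_CONFIG_DIR = Path.home() / ".config" / "urless"


def build_parser() -> argparse.ArgumentParser:
    """Parser containing only the two arguments added in v2.2."""
    parser = argparse.ArgumentParser(prog="urless-sketch")
    parser.add_argument("-c", "--config", action="store",
                        help="Path to the YML config file.")
    parser.add_argument("-dp", "--disregard-params", action="store_true",
                        help="Apply filtering even if a URL has parameters.")
    return parser


def resolve_config_path(config_arg: Optional[str],
                        default_dir: Path = DEFAULT_CONFIG_DIR) -> Path:
    """Use the -c/--config value if given, else config.yml in default_dir."""
    if config_arg is None:
        return default_dir / "config.yml"
    return Path(config_arg)


args = build_parser().parse_args(["-c", "custom/config.yml", "-dp"])
print(resolve_config_path(args.config), args.disregard_params)
```

As in the commit, a path that cannot be read is not treated as fatal by the tool itself: `getConfig()` prints a warning and falls back to the default values.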
