diff --git a/CHANGELOG.md b/CHANGELOG.md
index 82436d0..1a83fe0 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,17 @@
 ## Changelog
 
+- v2.2
+
+  - New
+
+    - Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).
+    - Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).
+
+  - Changed
+
+    - The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.
+    - Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.
+
 - v2.1
 
   - New
diff --git a/README.md b/README.md
index 7d9cba4..a5e39a5 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-## About - v2.1
+## About - v2.2
 
 This is a tool used to de-clutter a list of URLs. As a starting point, I took the amazing tool [uro](https://github.com/s0md3v/uro/) by Somdev Sangwan. But I wanted to change a few things, make some improvements (like deal with GUIDs) and make it more customizable.
@@ -45,12 +45,14 @@ pipx install git+https://github.com/xnl-h4ck3r/urless.git
 | -fe | --filter-extensions | A comma separated list of file extensions to exclude. This will override the `FILTER_EXTENSIONS` list specified in `config.yml` |
 | -rp | --remove-params | A comma separated list of **case senistive** parameters to remove from ALL URLs. This will override the `REMOVE_PARAMS` list specified in `config.yml`. This can be useful to remove cache buster parameters for example.\*\* |
 | -ks | --keep-slash | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash. |
-| -khw | --keep-human-written | By default, any URL with a path part that contains 3 or more dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output. |
-| -kym | --keep-yyyymm | By default, any URL with a path containing 3 /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
+| -khw | --keep-human-written | By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output. |
+| -kym | --keep-yyyymm | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
 | -rcid | --regex-custom-id | **USE WITH CAUTION!** Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the section below for more details on this. |
 | -iq | --ignore-querystring | Remove the query string (including URL fragments `#`) so output is unique paths only. |
 | -fnp | --fragment-not-param | Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link. |
 | -lang | --language | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` section of `config.yml`. |
+| -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default config directory, e.g. `~/.config/urless/`. |
+| -dp | --disregard-params | There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. |
 | -nb | --no-banner | Hides the tool banner (it is hidden by default if you pipe input to urless) output. |
 | | --version | Show current version number. |
 | -v | --verbose | Verbose output |
@@ -71,14 +73,14 @@ Here's what happens:
 - If a URL has port 80 or 443 explicitly given, then remove it from the URL (e.g. http://example.com:80/test -> http://example.com/test)
 - If the URL has any **FILTER-EXTENSIONS**, it will be removed from the output.
-- If the URL has NO parameters:
+- If the URL has NO parameters **OR** the `-dp`/`--disregard-params` argument was passed:
   - If the URL contains a **FILTER-KEYWORDS** or **UNWANTED-CONTENT**, it will be removed.
 - if the URL query string contains unwanted parameters specified in config `REMOVE_PARAMS` (or overridden wit argument `-rp`/`--remove-params`), they will be removed from all URLs before processing.
 - If `-rcid`/`--regex-custom-id` is passed and the URL path contains a Custom ID, only one match to the Custom ID regex will be included if there are multiple URLs where that is the only difference.
 - If the URL path contains a GUID, only one of the GUIDs will be included if there are multiple URLs where the GUID is the only difference.
 - If the URL path contains an Integer ID, only one of the Integer IDs will be included if there are multiple URLs where the Integer ID is the only difference.
 - If the `-lang` argument is passed and the URL contains a language code (e.g. `en-gb`), only one of the language codes will be included if there are multiple URLs where the language code is different.
-- Else the URL has Parameters (or a fragment `#`):
+- Else the URL has Parameters (or a fragment `#`) **AND** the `-dp`/`--disregard-params` argument was NOT passed:
   - If there are multiple URLs with the same parameters, then only URLs with unique parameter values are included.
   - If there are URL's with a Parameter, but no value (or a fragment), then this will be included.
@@ -153,7 +155,7 @@ If you come across any problems at all, or have ideas for improvements, please f
 ## TODO
 
-- Allow `-rcid`/`--regex-custom-id` argument to take multiple regex strings
+None - feel free to raise a Github issue to suggest any enhancements.
 
 ## And finally...
diff --git a/urless/__init__.py b/urless/__init__.py
index 27ebfa0..26c964d 100644
--- a/urless/__init__.py
+++ b/urless/__init__.py
@@ -1 +1 @@
-__version__="2.1"
+__version__="2.2"
diff --git a/urless/urless.py b/urless/urless.py
index e4a7e03..706d604 100644
--- a/urless/urless.py
+++ b/urless/urless.py
@@ -67,6 +67,7 @@ outFile = None
 linesOrigCount = 0
 linesFinalCount = 0
+usingConfigDefaults = False
 
 def verbose():
     '''
@@ -115,7 +116,7 @@ def getConfig():
     '''
     Try to get the values from the config file, otherwise use the defaults
     '''
-    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart
+    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults
     try:
         # Try to get the config file values
@@ -129,10 +130,13 @@ def getConfig():
         )
         urlessPath.absolute
-        if urlessPath == '':
-            configPath = 'config.yml'
+        if args.config is None:
+            if urlessPath == '':
+                configPath = 'config.yml'
+            else:
+                configPath = Path(urlessPath / 'config.yml')
         else:
-            configPath = Path(urlessPath / 'config.yml')
+            configPath = Path(args.config)
         config = yaml.safe_load(open(configPath))
 
         # If the user provided the --filter-extensions argument then it overrides the config value
@@ -193,8 +197,12 @@ def getConfig():
                 writerr(colored('Unable to read REMOVE_PARAMS from config.yml - default set', 'red'))
             REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
 
-    except:
-        writerr(colored('WARNING: Cannot find config.yml, so using default values', 'yellow'))
+    except Exception as e:
+        if args.config is None:
+            writerr(colored('WARNING: Cannot find file "config.yml", so using default values', 'yellow'))
+        else:
+            writerr(colored('WARNING: Cannot find file "' + args.config + '", so using default values', 'yellow'))
+        usingConfigDefaults = True
         FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
         FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
         LANGUAGE = DEFAULT_LANGUAGE
@@ -446,8 +454,8 @@ def processUrl(line):
         if hasBadExtension(path):
             return
 
-        # If there are no parameters and path isn't empty
-        if not params and path != "":
+        # If there are no parameters (or the --disregard-params argument was passed) and path isn't empty
+        if (not params or args.disregard_params) and path != "":
 
             # If its unwanted content or has a keyword to be excluded, then just return to continue with the next line
             if isUnwantedContent(path) or hasFilterKeyword(path):
@@ -598,7 +606,7 @@ def processOutput():
         writerr(colored('ERROR processOutput 1: ' + str(e), 'red'))
 
 def showOptionsAndConfig():
-    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS
+    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults
     try:
         write(colored('Selected options and config:', 'cyan'))
         write(colored('-i: ' + args.input, 'magenta')+colored(' The input file of URLs to de-clutter.','white'))
@@ -606,7 +614,16 @@ def showOptionsAndConfig():
             write(colored('-o: ' + args.output, 'magenta')+colored(' The output file that the de-cluttered URL list will be written to.','white'))
         else:
             write(colored('-o: ', 'magenta')+colored(' An output file wasn\'t given, so output will be written to STDOUT.','white'))
+
+        if args.disregard_params:
+            write(colored('-dp: True', 'magenta')+colored(' When filtering the URLs, they will not be treated differently just because they have parameters.','white'))
 
+        if args.config:
+            if usingConfigDefaults:
+                write(colored('-config: ' + args.config, 'magenta')+colored(' The path of the YML config file.','white')+colored(' WARNING: Not found, so using default values.','yellow'))
+            else:
+                write(colored('-config: ' + args.config, 'magenta')+colored(' The path of the YML config file.','white'))
+
         if args.filter_keywords:
             write(colored('-fk (Keywords to Filter): ', 'magenta')+colored(args.filter_keywords,'white'))
         else:
@@ -720,13 +737,13 @@ def main():
         '-khw',
         '--keep-human-written',
         action='store_true',
-        help='By default, any URL with a path part that contains 3 or more dashes (-) are removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.',
+        help='By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.',
     )
     parser.add_argument(
         '-kym',
         '--keep-yyyymm',
         action='store_true',
-        help='By default, any URL with a path containing 3 /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.',
+        help='By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.',
     )
     parser.add_argument(
         '-rcid',
@@ -755,6 +772,18 @@ def main():
         action='store_true',
         help='If passed, and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the "LANGUAGE" section of "config.yml".',
     )
+    parser.add_argument(
+        "-c",
+        "--config",
+        action="store",
+        help="Path to the YML config file. If not passed, it looks for file 'config.yml' in the default config directory, e.g. '~/.config/urless/'.",
+    )
+    parser.add_argument(
+        "-dp",
+        "--disregard-params",
+        action="store_true",
+        help="There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.",
+    )
     parser.add_argument("-nb", "--no-banner", action="store_true", help="Hides the tool banner.")
     parser.add_argument('--version', action='store_true', help="Show version number")
     parser.add_argument('-v', '--verbose', action='store_true', help='Verbose output.')
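
Review note: the two behavioural changes in this diff (the `-c`/`--config` fallback and the `-dp`/`--disregard-params` condition) can be sketched in isolation as below. This is a minimal sketch for reasoning about the logic, not code from `urless.py`: the function names `resolve_config_path` and `should_apply_path_filters` and their parameters are hypothetical, and the default directory is passed in rather than detected from the OS.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(cli_config: Optional[str], default_dir: Optional[Path]) -> Path:
    """Hypothetical helper mirroring the config lookup order in the diff."""
    # An explicit -c/--config value always wins.
    if cli_config is not None:
        return Path(cli_config)
    # No default directory known: fall back to config.yml in the working directory.
    if default_dir is None:
        return Path("config.yml")
    # Otherwise use config.yml inside the default config directory.
    return default_dir / "config.yml"


def should_apply_path_filters(has_params: bool, disregard_params: bool, path: str) -> bool:
    """Hypothetical helper mirroring the updated condition in processUrl."""
    # Path-based filtering runs when the URL has no parameters,
    # or when -dp/--disregard-params was passed, provided the path is not empty.
    return (not has_params or disregard_params) and path != ""


if __name__ == "__main__":
    print(resolve_config_path(None, Path.home() / ".config" / "urless"))
    print(should_apply_path_filters(has_params=True, disregard_params=True, path="/blog"))
```

With `-dp` the second helper returns `True` for a URL that has parameters, which is the case the previous condition `not params and path != ""` skipped.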