Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.airflowignore does not use 'glob' as .gitignore for example #21392

Closed
1 of 2 tasks
ltutar opened this issue Feb 7, 2022 · 9 comments · Fixed by #22051
Closed
1 of 2 tasks

.airflowignore does not use 'glob' as .gitignore for example #21392

ltutar opened this issue Feb 7, 2022 · 9 comments · Fixed by #22051
Assignees
Labels

Comments

@ltutar
Copy link

ltutar commented Feb 7, 2022

Description

Hi,

It would be nice of the file .airflowignore also used 'glob' as .gitignore for example. In my case when I used *_test.py instead of .*_test.py, I could not see the dags and also got errors when I used airflow dags list. The error message was not easy to understand. Without the help of @potiuk, I was lost.

An example is shown below:

Traceback (most recent call last): File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 370, in _run_processor_manager processor_manager.start() File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 610, in start return self._run_parsing_loop() File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 620, in _run_parsing_loop self._refresh_dag_dir() File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 755, in _refresh_dag_dir self._file_paths = list_py_file_paths(self._dag_directory) File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/file.py", line 169, in list_py_file_paths file_paths.extend(find_dag_file_paths(directory, safe_mode)) File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/file.py", line 187, in find_dag_file_paths for file_path in find_path_from_directory(str(directory), ".airflowignore"): File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/file.py", line 115, in find_path_from_directory patterns += [re.compile(line) for line in lines_no_comments if line] File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/file.py", line 115, in <listcomp> patterns += [re.compile(line) for line in lines_no_comments if line] File "/usr/local/lib/python3.8/re.py", line 252, in compile return _compile(pattern, flags) File "/usr/local/lib/python3.8/re.py", line 304, in _compile p = sre_compile.compile(pattern, flags) File "/usr/local/lib/python3.8/sre_compile.py", line 764, in compile p = sre_parse.parse(p, flags) File "/usr/local/lib/python3.8/sre_parse.py", line 948, in parse p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0) File "/usr/local/lib/python3.8/sre_parse.py", line 443, in _parse_sub itemsappend(_parse(source, state, verbose, nested + 1, File "/usr/local/lib/python3.8/sre_parse.py", line 668, in _parse raise source.error("nothing to repeat", re.error: nothing to repeat at position 0

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ltutar ltutar added the kind:feature Feature Requests label Feb 7, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Feb 7, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk
Copy link
Member

potiuk commented Feb 7, 2022

I thin changing to glob would not be feasible, as glob and regexp are not compatible. But better error message would be nice indeed.

@potiuk potiuk assigned potiuk and unassigned potiuk Feb 7, 2022
@potiuk
Copy link
Member

potiuk commented Feb 7, 2022

Another think to add could be to add the list of ignored files to the logs (With some reasonable frequency) with information that they were matched by the regexp (that would cover the case where regexp woudl not fail, but you would end up in not ignoring the files that you want).

I marked it as "good first issue" I think this one might be a good one for some "fresh" contribution.

@ashb
Copy link
Member

ashb commented Feb 7, 2022

Yeah, if we wanted to change the format of this file to more closely match what people expect (which I agree is confusing) -- i.e. .gitignore we'd either have to change the filename, or add a # airflow-ignore-v2 comment as the first line or something.

(Of the two I'd prefer a new name, but I don't know what I'd call the file)

@potiuk
Copy link
Member

potiuk commented Feb 7, 2022

Yeah, if we wanted to change the format of this file to more closely match what people expect (which I agree is confusing) -- i.e. .gitignore we'd either have to change the filename, or add a # airflow-ignore-v2 comment as the first line or something.

The comment would be also not good, because people would not know they have to add it :( . Same story as now.

I think we are a bit stuck with regexp until 2.3.0 3.0.0 (then we could change it and add migration check with smart detection and even conversion suggestions - that would be kinda doable I think.

@ianbuss
Copy link
Contributor

ianbuss commented Feb 14, 2022

One option might be to add a config option to specify the file syntax, which would default to regexp to maintain backwards compat. Something like:

[core]
dag_ignorefile_syntax = regexp|glob

May be something I haven't considered though.

@ianbuss
Copy link
Contributor

ianbuss commented Feb 15, 2022

@potiuk thoughts on the configuration suggestion? I think supporting the Unix glob-style is reasonable, since it would match user expectations from other tools much more closely:

These all support a similar (identical?) syntax, so we'd be in good company.

If we think it's worth pursuing it's probably something I can take on.

@potiuk
Copy link
Member

potiuk commented Feb 15, 2022

Yeah. That looks like a good idea to add configuration option. That's a clear path for deprecation as well - we coul add a deprecation for regexp now and swap the default in Airflow 3.

@ianbuss
Copy link
Contributor

ianbuss commented Feb 17, 2022

@ashb @potiuk can you put me down for this one please, thanks.

ianbuss added a commit to ianbuss/airflow that referenced this issue Apr 12, 2022
A new configuration parameter "CORE_IGNORE_FILE_SYNTAX" is added to
allow patterns in .airflowignore files to be interpreted as either
regular expressions (the default) or glob expressions as found in
.gitignore files. This allows users to use patterns they will be
familiar with from tools such as git, helm and docker.

Glob expressions support wildcard matches ("*", "?") within a directory
as well as character classes ("[0-9]"). In addition, zero or more
directories can be matched using "**". Patterns can be negated by
prefixing a "!" at the beginning of the pattern.

The "fnmatch" library in core Python does not produce patterns that are
fully compliant with the kind of patterns that users will be used to
from gitignore or dockerignore files, so the globs are parsed using
the pathspec package from PyPI.

To aid with debugging ignorefile patterns a more helpful error
message is emitted in the logs for invalid patterns, which are
now skipped rather than causing a hard-to-read scheduler stack trace.
kaxil pushed a commit that referenced this issue Apr 13, 2022
A new configuration parameter "CORE_IGNORE_FILE_SYNTAX" is added to
allow patterns in .airflowignore files to be interpreted as either
regular expressions (the default) or glob expressions as found in
.gitignore files. This allows users to use patterns they will be
familiar with from tools such as git, helm and docker.

Glob expressions support wildcard matches ("*", "?") within a directory
as well as character classes ("[0-9]"). In addition, zero or more
directories can be matched using "**". Patterns can be negated by
prefixing a "!" at the beginning of the pattern.

The "fnmatch" library in core Python does not produce patterns that are
fully compliant with the kind of patterns that users will be used to
from gitignore or dockerignore files, so the globs are parsed using
the pathspec package from PyPI.

To aid with debugging ignorefile patterns a more helpful error
message is emitted in the logs for invalid patterns, which are
now skipped rather than causing a hard-to-read scheduler stack trace.

closes: #21392
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants