Issue 23 Explanation #42

Closed · afarbos opened this issue Jun 17, 2015 · 8 comments

afarbos commented Jun 17, 2015

I am re-reading the last sentence of your answer in issue #23.

"If you're planning on doing some large distributed computation that uses lots of memory, Airflow might not be the right platform. Instead, you may want to use a distributed system, or write some sort of service, and invoke it from Airflow."

I am not sure I really understand it. Could you give me more information about it?

mistercrunch (Member) commented

Airflow dispatches work to other systems: Hadoop, Hive, MySQL, Redshift, ... Airflow can take on some workload, but its role is not to crunch huge amounts of data on its own. The unit of work (a task instance) should be fairly small (single-threaded, a few gigs of memory, not too much disk usage, ...).

Most of the work that Airflow does is orchestration; most tasks are assumed to be remote, waiting for the external system to say it's done with the work. Data transfer operators are a little more involved, getting a chunk of data from one system and moving it into another.

If you need to run huge jobs, I'd suggest doing it in a system that is made for that purpose and having Airflow call that system.
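
For illustration, a minimal sketch of that pattern, assuming the classic Airflow 1.x API; the DAG name, job, and commands below are hypothetical:

```python
# Hypothetical sketch: Airflow submits the heavy job to an external cluster
# instead of crunching the data itself. Import paths follow the Airflow 1.x
# layout; they moved in later releases.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("external_crunch", start_date=datetime(2015, 6, 1), schedule_interval="@daily")

# This task only *submits* the job; the cluster does the actual work.
submit = BashOperator(
    task_id="submit_spark_job",
    bash_command="spark-submit --master yarn my_big_job.py",  # hypothetical job
    dag=dag,
)

# A lightweight follow-up that fits in the same small worker slot.
notify = BashOperator(
    task_id="notify_done",
    bash_command="echo 'big job finished'",
    dag=dag,
)

submit.set_downstream(notify)
```

The worker slot is occupied only by the cheap submission command; the external system absorbs the heavy lifting.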

afarbos (Author) commented Jun 18, 2015

OK, so for example I could use Airflow, calling RabbitMQ, to execute jobs on EC2 instances?

If I am right, how do you do your dashboards/stats afterwards (CPU percentage, etc.)?

And do you have any simple way to deploy on specific systems? If yes, which systems?

mistercrunch (Member) commented

What are you trying to do? Are you looking for a system like Nagios to monitor your servers?

What do you mean by "specific systems"?

afarbos (Author) commented Jun 18, 2015

I am trying to use Airflow to model my workflow and execute it on EC2 instances.

I mean: is it easier to use EC2, Hadoop, or another specific system with Airflow?

mistercrunch (Member) commented

Well, Airflow itself runs on ordinary machines (it should work on EC2 or whatever Linux boxes you have, metal or virtual), and those machines can dispatch tasks to external systems (Hadoop, Hive, MySQL, HDFS, ...).

In a simple setup you can just have one Airflow box that runs the scheduler, web server and a worker process and even a MySQL or Postgres database on that same box.

In a more complex setup you can have an array of workers, a dedicated web server, a dedicated scheduler, and a dedicated database.

If you are interested in looking at the workload on these machines, use whatever your in-house ops tool is (Nagios, Datadog, ...).
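
For reference, a hedged sketch of the airflow.cfg entries that typically distinguish the two setups; the key names follow Airflow 1.x defaults and may vary by version, and the connection URIs are placeholders:

```ini
# Simple setup: one box runs scheduler, web server, worker, and the database.
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql://airflow:airflow@localhost/airflow  # placeholder URI

# More complex setup (commented out): an array of Celery workers sharing a broker.
# [core]
# executor = CeleryExecutor
# [celery]
# broker_url = amqp://guest:guest@rabbitmq-host:5672//  # placeholder URI
```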

afarbos (Author) commented Jun 19, 2015

OK, so that answers one of my questions: if I want a dashboard, I have to build it myself.

I will try to be clearer about what I want to do. I took a look at the CeleryExecutor, and my question is: if I use it with RabbitMQ as the broker, does it auto-scale the workers?

mistercrunch (Member) commented

The Celery executor sends task instances to be executed by workers. Workers have the number of execution slots you define, and when they receive a message to run a task instance they proceed to do so.

If you add new workers, they start listening to the queue and take on processing task instances as well. At Airbnb we currently have 6 workers with 128 slots each, meaning we can process up to 768 task instances in parallel. If you have deployment automation like we do, you can kick off more workers and get more slots by clicking a few buttons.

Since most of these tasks are just sensors of Hive jobs, the workers don't do much work; that's why we're able to provision 128 slots on machines that have 8 vCPUs without worrying about saturating them. Each worker could probably have twice as many slots and we'd be fine.
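
For illustration, a hedged sketch of one such lightweight sensor task, using the old-style Airflow 1.x import path; the table name is hypothetical:

```python
# Hypothetical sketch: a sensor task polls the Hive metastore and sleeps in
# between, so it barely uses its worker slot. Import path is the Airflow 1.x
# one; sensors moved to their own module in later releases.
from datetime import datetime

from airflow import DAG
from airflow.operators.sensors import HivePartitionSensor

dag = DAG("wait_for_hive", start_date=datetime(2015, 6, 1), schedule_interval="@daily")

wait = HivePartitionSensor(
    task_id="wait_for_events_partition",
    table="events",              # hypothetical Hive table
    partition="ds='{{ ds }}'",   # wait for the current day's partition
    poke_interval=300,           # poll every 5 minutes, idle otherwise
    dag=dag,
)
```

Because each of these tasks is mostly idle between polls, packing 128 of them onto an 8-vCPU machine works out fine.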

mistercrunch (Member) commented

I just threw in an endpoint to refresh the DagBag (81ab8fb): /admin/airflow/refresh_all

I'm not adding a link to it since it wouldn't work in a multi-threaded / multi-server environment, but you'll be able to use this hidden endpoint in the next release as a hack.
