An asynchronous scheduler for Adaptive
copied from cf-staging / adaptive-schedulerThe Adaptive scheduler solves the following problem, you need to run a few 100
learners and can use >1k cores. ipyparallel
and dask.distributed
provide
very powerful engines for interactive sessions. However, when you want to
connect to >1k cores it starts to struggle. Besides that, on a shared cluster
there is often the problem of starting an interactive session with ample space
available. Our approach is to schedule a different job for each
adaptive.Learner
. The creation and running of these jobs are managed by
adaptive-scheduler
. This means that your calculation will definitely run, even
though the cluster might be fully occupied at the moment. Because of this
approach, there is almost no limit to how many cores you want to use. You can
either use 10 nodes for 1 job (learner
) or 1 core for 1 job (learner
) while
scheduling hundreds of jobs. Everything is written such that the computation is
maximally local. This means that is one of the jobs crashes, there is no
problem and it will automatically schedule a new one and continue the
calculation where it left off (because of Adaptive's periodic saving
functionality). Even if the central "job manager" dies, the jobs will continue
to run (although no new jobs will be scheduled.)