I am trying to run a sequence of several ParallelRunStep instances in an AzureML pipeline. To do so, I create each step with the following helper:
    def create_step(name, script, inp, inp_ds):
        out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
        out_ds = out.as_dataset()
        out_ds_named = out_ds.as_named_input(f"{name}_out")
        config = cont_steps.ParallelRunConfig(
            source_directory="src",
            entry_script=script,
            mini_batch_size="1",
            error_threshold=0,
            output_action="summary_only",
            compute_target=compute_target,
            environment=component_env,
            node_count=2,
            logging_level="DEBUG"
        )
        step = cont_steps.ParallelRunStep(
            name=name,
            parallel_run_config=config,
            inputs=[inp_ds],
            output=out,
            arguments=[],
            allow_reuse=False,
        )
        return step, out, out_ds_named
As an example, I create two such steps:

    step1, out1, out1_ds_named = create_step("step1", "demo_s1.py", input_ds, named_input_ds)
    step2, out2, out2_ds_named = create_step("step2", "demo_s2.py", out1, out1_ds_named)
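Creating an experiment and submitting it to an existing workspace and an Azure ML compute cluster works. The first step, step1, also runs its script demo_s1.py on input_ds, produces its output file, and completes successfully.
The second step, step2, however, never starts, and the run ends with this exception: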
    The experiment failed. Finalizing run...
    Cleaning up all outstanding Run operations, waiting 300.0 seconds
    2 items cleaning up...
    Cleanup took 0.16968441009521484 seconds
    Starting the daemon thread to refresh tokens in background for process with pid = 394
    Traceback (most recent call last):
      File "driver/amlbi_main.py", line 52, in <module>
        main()
      File "driver/amlbi_main.py", line 44, in main
        JobStarter().start_job()
      File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job_starter.py", line 48, in start_job
        job.start()
      File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job.py", line 70, in start
        master.start()
      File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 174, in start
        self._start()
      File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 149, in _start
        self.wait_for_input_init()
      File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 124, in wait_for_input_init
        raise exc
    exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
    Load the datasource and read the first row locally to see how long it will take.
    Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
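My impression is that the second step is waiting for some data. However, the first step does create the provided output directory and a file in it: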
    import argparse
    import os

    def init():
        pass

    def run(parallel_input):
        print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")
        parser = argparse.ArgumentParser(description="Data Preparation")
        parser.add_argument('--output', type=str, required=True)
        args, unknown_args = parser.parse_known_args()
        out_path = os.path.join(args.output, "1.data")
        os.makedirs(args.output, exist_ok=True)
        open(out_path, "a").close()
        return [out_path]
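
A script like this can also be exercised outside AzureML to confirm that it writes what is expected. The following is only a rough local sketch: `check_locally`, the temporary directory, and the `dummy_input.txt` mini-batch are hypothetical stand-ins, and it assumes the step supplies `--output` on the command line, as the script above expects.

```python
import argparse
import os
import sys
import tempfile


def run(parallel_input):
    # Same logic as the entry script: create an empty "1.data" under --output.
    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument("--output", type=str, required=True)
    args, _unknown = parser.parse_known_args()
    os.makedirs(args.output, exist_ok=True)
    out_path = os.path.join(args.output, "1.data")
    open(out_path, "a").close()
    return [out_path]


def check_locally(mini_batch):
    """Call run() with a temporary --output directory and list what it wrote."""
    out_dir = tempfile.mkdtemp(prefix="prs_out_")
    old_argv = sys.argv
    # Hypothetical local stand-in for the arguments the step would receive.
    sys.argv = ["local_check", "--output", out_dir]
    try:
        result = run(mini_batch)
    finally:
        sys.argv = old_argv  # always restore the real command line
    return result, sorted(os.listdir(out_dir))


result, written = check_locally(["dummy_input.txt"])
print(result, written)
```

Note that the file created this way is empty, which matches what the entry script does on the cluster.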
I don't know how to debug this any further. Does anyone have an idea?
Best answer
You can check whether this notebook runs in parallel for you, and make sure you are using the same packages: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
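
Independently of the notebook comparison, the error message in the question itself suggests two things to try: reading the first input locally to see how long it takes, and raising `--first_task_creation_timeout`. The sketch below illustrates both in plain Python. The helper names `time_first_file` and `with_first_task_timeout`, the demo directory, and the value 1200 are all hypothetical, and it is an assumption that the flag and its value go into the step's `arguments` list as two separate entries.

```python
import os
import tempfile
import time


def time_first_file(input_dir):
    """Read the first file found under input_dir and report how long it took.

    Returns (relative_path, seconds), or (None, seconds) if no file exists.
    """
    start = time.perf_counter()
    for root, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                fh.read()
            return os.path.relpath(path, input_dir), time.perf_counter() - start
    return None, time.perf_counter() - start


def with_first_task_timeout(arguments, seconds):
    """Return a copy of an `arguments` list with the timeout flag appended."""
    return list(arguments) + ["--first_task_creation_timeout", str(seconds)]


# Demo on a throwaway directory standing in for a downloaded copy of step1's output.
demo_dir = tempfile.mkdtemp(prefix="prs_in_")
with open(os.path.join(demo_dir, "1.data"), "w") as fh:
    fh.write("example row\n")

first_path, elapsed = time_first_file(demo_dir)
print(first_path, f"{elapsed:.3f}s")

# 1200 is an arbitrary example value, twice the 600 seconds from the error message.
step_arguments = with_first_task_timeout([], 1200)
print(step_arguments)
```

The resulting list would take the place of `arguments=[]` in `create_step`. If `time_first_file` returns `None` for a local copy of the first step's output, the second step has no files to build a task from, and a longer timeout alone is unlikely to help.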
Regarding "python - Second `ParallelRunStep` in a pipeline times out at startup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61324600/