apache-spark - Using pyspark on AWS EMR with data from an S3 bucket

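I am using pyspark.ml to train a machine learning model on .json data in an S3 bucket, working in a JupyterLab notebook on AWS EMR. The bucket is not mine, but I believe access works, since data preprocessing, feature engineering, and so on all run fine. However, when I call cv.fit(training_data), training runs until it is almost complete (according to the status bar) and then throws an error: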

Exception in thread cell_monitor-64:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 6571

I have not been able to find any information about this error. What is going on?

Here is my pipeline:

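from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = clean_df.randomSplit([0.8, 0.2], seed=42)

# Column names are placeholders. inputCols must be a list, and the output
# column must not already exist, so va1 writes to the column the scaler reads.
va1 = VectorAssembler(inputCols=["vars"], outputCol="to_scale")
scaler = StandardScaler(inputCol="to_scale", outputCol="scaled_features")
va2 = VectorAssembler(inputCols=["more_vars", "scaled_features"], outputCol="features")
gbt = GBTClassifier()

pipeline = Pipeline(stages=[va1, scaler, va2, gbt])

# 2 x 2 grid over tree depth and number of boosting iterations
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 5])
             .addGrid(gbt.maxIter, [10, 100])
             .build())

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=3)

cvModel = crossval.fit(train)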
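
For a sense of how much work this fit involves: CrossValidator trains the pipeline once per parameter combination per fold, then refits the best combination on the full training set. The sketch below only reproduces that arithmetic in plain Python, using the grid values and fold count from the pipeline above; the variable names are illustrative and it does not call Spark. Each of these fits launches many Spark jobs and stages, which is at least consistent with stage IDs in the thousands such as the 6571 in the traceback.

```python
from itertools import product

# Values copied from the pipeline above
max_depth_values = [2, 5]
max_iter_values = [10, 100]
num_folds = 3

# One parameter map per combination, mirroring what the 2 x 2 grid expands to
param_maps = [
    {"maxDepth": depth, "maxIter": iters}
    for depth, iters in product(max_depth_values, max_iter_values)
]

# Every parameter map is trained once on each fold ...
cv_fits = len(param_maps) * num_folds
# ... and the best one is refit once on the whole training set.
total_fits = cv_fits + 1

print("parameter combinations:", len(param_maps))
print("fits during cross-validation:", cv_fits)
print("fits including final refit:", total_fits)
```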

Secondly, I have a hunch that this might be resolved in Python 3.8; can I install Python 3.8 on EMR?

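Best answer

We ran into the same problem. We are using Hyperopt, and we simply added a try/except to work around it. The error keeps appearing, but the job keeps running. The error seems to affect the bar that shows the internal progress of Spark jobs in the EMR notebook, but the pipeline itself completes.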

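from hyperopt import STATUS_OK, STATUS_FAIL

# Defining the hyperopt objective
def objective(params):
    try:
        # Pipeline here with VectorAssembler and GBT; it is expected to
        # produce metrics_val and output_dict (elided in this answer)
        return {'loss': -metrics_val.areaUnderPR,
                "status": STATUS_OK,
                "output_dict": output_dict}
    except Exception as e:
        # Report the failure and mark the trial as failed so the search continues
        print("## Exception", e)
        return {'loss': 0,
                "status": STATUS_FAIL,
                "except": e,
                "output_dict": {"params": params}}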

The exception we got was:

Exception in thread cell_monitor-18:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 2395

But we still got all of the hyperopt output:

100%|##########| 5/5 [34:36<00:00, 415.32s/trial, best loss: -0.3907675279893325]
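
The traceback header ("Exception in thread cell_monitor-...") indicates the KeyError is raised in a separate monitoring thread rather than in the code running the fit. In Python, an exception that is not handled inside a helper thread ends only that thread; the main thread carries on. The sketch below illustrates that behaviour with hypothetical stand-ins (`fake_cell_monitor`, `fake_training`); it is not the widget's actual code, and it assumes Python 3.8+ because it uses `threading.excepthook` to record the exception instead of printing it.

```python
import threading

captured = []  # exceptions that escaped from background threads


def record_thread_exception(args):
    # threading calls this hook for exceptions not handled inside a thread
    captured.append(args.exc_value)


def fake_cell_monitor(all_stages, stage_id):
    # Hypothetical stand-in for the monitor's lookup of a stage it has not seen
    return all_stages[stage_id]  # raises KeyError for an unknown stage id


def fake_training():
    # Stand-in for the long-running fit() call on the main thread
    return sum(range(1000))


previous_hook = threading.excepthook
threading.excepthook = record_thread_exception
try:
    monitor = threading.Thread(
        target=fake_cell_monitor,
        args=({1: "stage"}, 6571),
        name="cell_monitor-demo",
    )
    monitor.start()
    result = fake_training()  # unaffected by the failure in the monitor thread
    monitor.join()
finally:
    threading.excepthook = previous_hook  # restore the default hook

print("training result:", result)
print("monitor thread error:", repr(captured[0]))
```

If the same thing is happening in the notebook, the returned model or trial results are still usable and only the progress display is affected, which matches what we observed.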

Regarding apache-spark - KeyError when training a model with pyspark.ml on AWS EMR with data from an S3 bucket, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58910023/
