hadoop - Default number of reducers

In Hadoop, if we have not set the number of reducers, how many reducers will be created?

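The number of mappers depends on (total data size) / (input split size).
For example, if the data size is 1 TB and the input split size is 100 MB, then the number of mappers will be (1000 * 1000) / 100 = 10,000.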
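
The arithmetic above can be sketched in a few lines of Java. This is only an illustration: the class and method names (`MapSplitEstimate`, `estimateSplits`) are hypothetical, decimal units are assumed to match the numbers in the question, and the input is treated as one large file, whereas the real framework computes splits per input file.

```java
class MapSplitEstimate {

    // Hypothetical helper: ceiling division, because a final partial split
    // still needs its own map task.
    public static long estimateSplits(long totalBytes, long splitBytes) {
        if (totalBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // Assumption: decimal units, as in the question (1 TB = 1000 * 1000 MB).
        long oneTerabyte = 1000L * 1000L * 1000L * 1000L;
        long hundredMegabytes = 100L * 1000L * 1000L;

        // 1 TB / 100 MB gives 10000 splits, hence roughly 10000 map tasks.
        System.out.println(estimateSplits(oneTerabyte, hundredMegabytes));
    }
}
```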

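What factors does the number of reducers depend on? How many reducers are created for a job?

Best answer

How Many Reduces? (from the official documentation)

The right number of reduces seems to be 0.95 or 1.75 multiplied by
(<no. of nodes> * <no. of maximum containers per node>).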

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

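Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.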
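
As a rough illustration of the 0.95 and 1.75 rule of thumb, the sketch below computes both candidate reducer counts. The class and method names (`ReducerEstimate`, `estimateReducers`) and the cluster of 10 nodes with 8 containers each are assumptions made up for this example. The factor is passed as a whole percentage so that the result does not depend on floating-point rounding.

```java
class ReducerEstimate {

    // Hypothetical helper: factorPercent is 95 for the 0.95 rule
    // and 175 for the 1.75 rule. The result is rounded down.
    public static long estimateReducers(int nodes, int maxContainersPerNode, int factorPercent) {
        if (nodes <= 0 || maxContainersPerNode <= 0 || factorPercent <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        long totalContainers = (long) nodes * maxContainersPerNode;
        return totalContainers * factorPercent / 100;
    }

    public static void main(String[] args) {
        // Assumed example cluster: 10 nodes, at most 8 containers per node.
        int nodes = 10;
        int containersPerNode = 8;

        // 0.95 rule: all reduces fit in a single wave.
        System.out.println(estimateReducers(nodes, containersPerNode, 95));

        // 1.75 rule: faster nodes pick up a second wave of reduces.
        System.out.println(estimateReducers(nodes, containersPerNode, 175));
    }
}
```

For this assumed cluster the two rules give 76 and 140 reducers respectively. Either value would then be passed to the job as its reducer count.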

The same documentation also covers the number of mappers.

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

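The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.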
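
For reference, the 82,000 figure is a rounded value. Assuming binary units (1 TB = 1024 * 1024 MB), the arithmetic works out as:

```latex
\[
\frac{10\ \text{TB}}{128\ \text{MB}}
  = \frac{10 \times 1024 \times 1024\ \text{MB}}{128\ \text{MB}}
  = 81{,}920 \approx 82{,}000
\]
```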

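If you want to change the default value of 1 for the number of reducers, you can set the following property (from hadoop 2.x onwards) as a command line parameter

mapreduce.job.reduces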

OR

you can set it programmatically

job.setNumReduceTasks(integer_number);

Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?

The original question, hadoop - Default number of reducers, can be found on Stack Overflow: https://stackoverflow.com/questions/55200955/
