When I try to read a pipe-delimited file with Spark and Scala, like this:
1|Consumer Goods|101|
2|Marketing|102|
I am using this command:
val part = spark.read
.format("com.databricks.spark.csv")
.option("delimiter","|")
.load("file_name")
The result I get is:
+---+--------------+---+----+
|_c0| _c1|_c2| _c3|
+---+--------------+---+----+
| 1|Consumer Goods|101|null|
| 2| Marketing|102|null|
+---+--------------+---+----+
Spark reads an extra last column that does not exist in the source data, because each line ends with the delimiter (the pipe). Is there an alternative that would give me a result like this:
+---+--------------+---+
|_c0| _c1|_c2|
+---+--------------+---+
| 1|Consumer Goods|101|
| 2| Marketing|102|
+---+--------------+---+
Accepted answer
One solution is to simply drop the last column, like this:
part
.select(part.columns.dropRight(1).map(col) : _*)
.show(false)
+---+--------------+---+
|_c0|_c1 |_c2|
+---+--------------+---+
|1 |Consumer Goods|101|
|2 |Marketing |102|
+---+--------------+---+
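Equivalently (a sketch not taken from the original answer), the same trimming can be done by name with `DataFrame.drop`, which is a standard Spark API. The `file_name` path is the placeholder from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val part = spark.read
  .format("csv")                 // built-in CSV source, equivalent to com.databricks.spark.csv
  .option("delimiter", "|")
  .load("file_name")             // placeholder path from the question

// Drop the trailing all-null column ("_c3") by name instead of by position.
part.drop(part.columns.last).show(false)
```

`drop` silently ignores column names that do not exist, so this is safe even if a file happens not to produce the extra column.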
Another solution is to read the file as a text file and do the splitting yourself, like this:
val text = spark.read.text("file_name")
// Note that the split functions in java/scala/spark ignore a separator that
// ends a string, but not one that starts it
val size = text.head.getAs[String]("value").split("\\|").size
text
.withColumn("value", split('value, "\\|"))
.select((0 until size).map(i => 'value getItem i as s"_c$i") : _*)
.show(false)
+---+--------------+---+
|_c0|_c1 |_c2|
+---+--------------+---+
|1 |Consumer Goods|101|
|2 |Marketing |102|
+---+--------------+---+
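The split behavior mentioned in the comment can be checked in plain Scala, without Spark (a small sketch; `String.split` here is the standard JVM method):

```scala
// With the default limit of 0, String.split removes trailing empty strings
// produced by a separator at the end of the input...
val trailing = "1|Consumer Goods|101|".split("\\|")
println(trailing.mkString("[", ", ", "]"))  // [1, Consumer Goods, 101] -- length 3

// ...but keeps a leading empty string produced by a separator at the start.
val leading = "|1|Consumer Goods|101".split("\\|")
println(leading.mkString("[", ", ", "]"))   // [, 1, Consumer Goods, 101] -- length 4

// Passing a negative limit keeps trailing empty strings too.
val keepAll = "1|Consumer Goods|101|".split("\\|", -1)
println(keepAll.length)                     // 4
```

This is why the answer measures `size` from the first line: splitting each row already discards the phantom trailing column, and the measured size keeps the `select` to the real columns.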
https://stackoverflow.com/questions/63869125/