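Best answer
The "json_data" content was actually a string rather than JSON with a built-in schema of arrays, maps, and structs. My problem was the extra double quotes (") wrapped around the actual contents of "json_data", which caused problems when Spark tried to read it. Example (the first line has the extra quotes; the second has them removed and is the form read in the Spark session below):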
{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}"}
{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkContext available as sc, HiveContext available as sqlContext.
>>> filePath = "/user/no_quote_json.json"
>>> df = sqlContext.read.json(filePath)
>>> df.printSchema()
root
 |-- json_data: struct (nullable = true)
 |    |-- after: struct (nullable = true)
 |    |    |-- COL1: long (nullable = true)
 |    |    |-- COL10: string (nullable = true)
 |    |    |-- COL11: string (nullable = true)
 |    |    |-- COL12: string (nullable = true)
 |    |    |-- COL13: string (nullable = true)
 |    |    |-- COL14: string (nullable = true)
 |    |    |-- COL15: string (nullable = true)
 |    |    |-- COL16: string (nullable = true)
 |    |    |-- COL17: string (nullable = true)
 |    |    |-- COL18: string (nullable = true)
 |    |    |-- COL19: string (nullable = true)
 |    |    |-- COL2: long (nullable = true)
 |    |    |-- COL20: string (nullable = true)
 |    |    |-- COL21: long (nullable = true)
 |    |    |-- COL22: string (nullable = true)
 |    |    |-- COL23: string (nullable = true)
 |    |    |-- COL24: long (nullable = true)
 |    |    |-- COL25: long (nullable = true)
 |    |    |-- COL26: string (nullable = true)
 |    |    |-- COL27: string (nullable = true)
 |    |    |-- COL28: string (nullable = true)
 |    |    |-- COL29: string (nullable = true)
 |    |    |-- COL3: long (nullable = true)
 |    |    |-- COL30: string (nullable = true)
 |    |    |-- COL31: long (nullable = true)
 |    |    |-- COL32: string (nullable = true)
 |    |    |-- COL4: string (nullable = true)
 |    |    |-- COL5: long (nullable = true)
 |    |    |-- COL6: long (nullable = true)
 |    |    |-- COL7: long (nullable = true)
 |    |    |-- COL8: string (nullable = true)
 |    |    |-- COL9: string (nullable = true)
 |    |-- current_ts: string (nullable = true)
 |    |-- op_ts: string (nullable = true)
 |    |-- op_type: string (nullable = true)
 |    |-- pos: string (nullable = true)
 |    |-- table: string (nullable = true)
>>> df.select("json_data.after.col29").show()
+---------+
|    col29|
+---------+
|PBU67H   |
+---------+
This answer comes from the Stack Overflow question "json - How to parse a malformed JSON string containing whitespace, extra double quotes, and backslashes in Spark 1.6 using Python?": https://stackoverflow.com/questions/55230220/