Schema for parsing nested JSON from Kafka

I receive JSON strings from Kafka that need to be processed with PySpark. The strings look like this:

{"_id": {"$oid": "5eb56a371af2d82e242d24ae"},"Id": 7,"Timestamp": {"$date": 1582889068586},"TTNR": "R902170286","SNR": 92177446,"State": 0,"I_A1": "FALSE","I_B1": "FALSE","I1": 0.0037385,"Mabs": -20.3711051,"p_HD1": 30.9632005,"pG": 27.788934,"pT": 1.7267373,"pXZ": 3.4487671,"T3": 25.2357555,"nan": 202.1999969,"Q1": 0,"a_01X": [62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}

My plan is to split the string into its JSON fields. For this, I defined the following schema:

json_schema=StructType([StructField("_id",StructField("$oid",StringType())), \
    StructField("Id", DoubleType()), \
    StructType(StructField("Timestamp", StructField("$date",LongType())), \
    StructField("TTNR", StringType()), \
    StructField("SNR", LongType()), \
    StructField("State", LongType()), \
    StructField("I_A1", StringType()), \
    StructField("I_B1", StringType()), \
    StructField("I1", DoubleType()), \
    StructField("Mabs", DoubleType()), \
    StructField("p_HD1", DoubleType()), \
    StructField("pG", DoubleType()), \
    StructField("pT", DoubleType()), \
    StructField("pXZ", DoubleType()), \
    StructField("T3", DoubleType()), \
    StructField("nan", DoubleType()), \
    StructField("Q1", LongType()), \
    StructField("a_01X", ArrayType(DoubleType()))
    ])

However, using this schema leads to the following error:

pyspark.sql.utils.ParseException: u'\nmismatched input \'{\' expecting {\'SELECT\', \'FROM\', \'ADD\', \'AS\', \'ALL\', \'ANY\', \'DISTINCT\', \'WHERE\', \'GROUP\', \'BY\', \'GROUPING\', \'SETS\', \'CUBE\', \'ROLLUP\', \'ORDER\', \'HAVING\', \'LIMIT\', \'AT\', \'OR\', \'AND\', \'IN\', NOT, \'NO\', \'EXISTS\', \'BETWEEN\', \'LIKE\', RLIKE, \'IS\', \'NULL\', \'TRUE\', \'FALSE\', \'NULLS\', \'ASC\', \'DESC\', \'FOR\', \'INTERVAL\', \'CASE\', \'WHEN\', \'THEN\', \'ELSE\', \'END\', \'JOIN\', \'CROSS\', \'OUTER\', \'INNER\', \'LEFT\', \'SEMI\', \'RIGHT\', \'FULL\', \'NATURAL\', \'ON\', \'PIVOT\', \'LATERAL\', \'WINDOW\', \'OVER\', \'PARTITION\', \'RANGE\', \'ROWS\', \'UNBOUNDED\', \'PRECEDING\', \'FOLLOWING\', \'CURRENT\', \'FIRST\', \'AFTER\', \'LAST\', \'ROW\', \'WITH\', \'VALUES\', \'CREATE\', \'TABLE\', \'DIRECTORY\', \'VIEW\', \'REPLACE\', \'INSERT\', \'DELETE\', \'INTO\', \'DESCRIBE\', \'EXPLAIN\', \'FORMAT\', \'LOGICAL\', \'CODEGEN\', \'COST\', \'CAST\', \'SHOW\', \'TABLES\', \'COLUMNS\', \'COLUMN\', \'USE\', \'PARTITIONS\', \'FUNCTIONS\', \'DROP\', \'UNION\', \'EXCEPT\', \'MINUS\', \'INTERSECT\', \'TO\', \'TABLESAMPLE\', \'STRATIFY\', \'ALTER\', \'RENAME\', \'ARRAY\', \'MAP\', \'STRUCT\', \'COMMENT\', \'SET\', \'RESET\', \'DATA\', \'START\', \'TRANSACTION\', \'COMMIT\', \'ROLLBACK\', \'MACRO\', \'IGNORE\', \'BOTH\', \'LEADING\', \'TRAILING\', \'IF\', \'POSITION\', \'EXTRACT\', \'DIV\', \'PERCENT\', \'BUCKET\', \'OUT\', \'OF\', \'SORT\', \'CLUSTER\', \'DISTRIBUTE\', \'OVERWRITE\', \'TRANSFORM\', \'REDUCE\', \'SERDE\', \'SERDEPROPERTIES\', \'RECORDREADER\', \'RECORDWRITER\', \'DELIMITED\', \'FIELDS\', \'TERMINATED\', \'COLLECTION\', \'ITEMS\', \'KEYS\', \'ESCAPED\', \'LINES\', \'SEPARATED\', \'FUNCTION\', \'EXTENDED\', \'REFRESH\', \'CLEAR\', \'CACHE\', \'UNCACHE\', \'LAZY\', \'FORMATTED\', \'GLOBAL\', TEMPORARY, \'OPTIONS\', \'UNSET\', \'TBLPROPERTIES\', \'DBPROPERTIES\', \'BUCKETS\', \'SKEWED\', \'STORED\', \'DIRECTORIES\', \'LOCATION\', \'EXCHANGE\', \'ARCHIVE\', \'UNARCHIVE\', \'FILEFORMAT\', \'TOUCH\', \'COMPACT\', \'CONCATENATE\', \'CHANGE\', \'CASCADE\', \'RESTRICT\', \'CLUSTERED\', \'SORTED\', \'PURGE\', \'INPUTFORMAT\', \'OUTPUTFORMAT\', DATABASE, DATABASES, \'DFS\', \'TRUNCATE\', \'ANALYZE\', \'COMPUTE\', \'LIST\', \'STATISTICS\', \'PARTITIONED\', \'EXTERNAL\', \'DEFINED\', \'REVOKE\', \'GRANT\', \'LOCK\', \'UNLOCK\', \'MSCK\', \'REPAIR\', \'RECOVER\', \'EXPORT\', \'IMPORT\', \'LOAD\', \'ROLE\', \'ROLES\', \'COMPACTIONS\', \'PRINCIPALS\', \'TRANSACTIONS\', \'INDEX\', \'INDEXES\', \'LOCKS\', \'OPTION\', \'ANTI\', \'LOCAL\', \'INPATH\', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)\n\n== SQL 
==\n{"fields":[{"metadata":{},"name":"_id","nullable":true,"type":{"metadata":{},"name":"$oid","nullable":true,"type":"string"}},{"metadata":{},"name":"Id","nullable":true,"type":"double"},{"metadata":{},"name":"Timestamp","nullable":true,"type":"string"},{"metadata":{},"name":"TTNR","nullable":true,"type":"string"},{"metadata":{},"name":"SNR","nullable":true,"type":"long"},{"metadata":{},"name":"State","nullable":true,"type":"long"},{"metadata":{},"name":"I_A1","nullable":true,"type":"string"},{"metadata":{},"name":"I_B1","nullable":true,"type":"string"},{"metadata":{},"name":"I1","nullable":true,"type":"double"},{"metadata":{},"name":"Mabs","nullable":true,"type":"double"},{"metadata":{},"name":"p_HD1","nullable":true,"type":"double"},{"metadata":{},"name":"pG","nullable":true,"type":"double"},{"metadata":{},"name":"pT","nullable":true,"type":"double"},{"metadata":{},"name":"pXZ","nullable":true,"type":"double"},{"metadata":{},"name":"T3","nullable":true,"type":"double"},{"metadata":{},"name":"nan","nullable":true,"type":"double"},{"metadata":{},"name":"Q1","nullable":true,"type":"long"},{"metadata":{},"name":"a_01X","nullable":true,"type":{"containsNull":true,"elementType":"double","type":"array"}}],"type":"struct"}\n^^^\n'

The trace shows Spark handing the schema's JSON representation to its SQL/DDL parser, which fails on the leading '{'. However, if I use a schema without nested fields (shown below), the string parses fine:

after_schema=StructType([ StructField("_id", StringType()), \
    StructField("Id", DoubleType()), \
    StructField("Timestamp", StringType()),
    StructField("TTNR", StringType()), \
    StructField("SNR", LongType()), \
    StructField("State", LongType()), \
    StructField("I_A1", StringType()), \
    StructField("I_B1", StringType()), \
    StructField("I1", DoubleType()), \
    StructField("Mabs", DoubleType()), \
    StructField("p_HD1", DoubleType()), \
    StructField("pG", DoubleType()), \
    StructField("pT", DoubleType()), \
    StructField("pXZ", DoubleType()), \
    StructField("T3", DoubleType()), \
    StructField("nan", DoubleType()), \
    StructField("Q1", LongType()), \
    StructField("a_01X", ArrayType(DoubleType()))
    ])

My goal is to get output like this:

+---+---+---------+----+---+-----+----+----+---+----+-----+---+---+---+---+---+---+-----+
|_id| Id|Timestamp|TTNR|SNR|State|I_A1|I_B1| I1|Mabs|p_HD1| pG| pT|pXZ| T3|nan| Q1|a_01X|
+---+---+---------+----+---+-----+----+----+---+----+-----+---+---+---+---+---+---+-----+
+---+---+---------+----+---+-----+----+----+---+----+-----+---+---+---+---+---+---+-----+

I would appreciate some help. Right now I can extract every field except the nested structures.
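For reference, this is roughly how the schema gets applied to the Kafka value column (a sketch; the broker address and topic name below are placeholders, and `json_schema` stands for whichever schema I pass in):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.appName("kafka-json-parse").getOrCreate()

# Kafka delivers the payload in the binary `value` column
# (requires the spark-sql-kafka package on the classpath)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
       .option("subscribe", "sensor_topic")                  # placeholder
       .load())

# Cast the payload to a string, parse it with the schema, and expand the struct
parsed = (raw.selectExpr("CAST(value AS STRING) AS json_str")
          .select(from_json(col("json_str"), json_schema).alias("data"))
          .select("data.*"))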

Comments
  • kaut replied:

    You forgot to declare "Timestamp" as a StructType.

    This snippet works for me.

    from pyspark.sql.types import *
    import json
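    # assumes a live SparkSession `spark` and SparkContext `sc`, e.g. from a PySpark shell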
    
    schema = StructType(
        [
            StructField("_id",StringType(),True),
            StructField("Id",LongType(),True),
            StructField("Timestamp",StructType(
                [
                    StructField("$date",LongType(),True)
                ]
            ), True),
            StructField("TTNR",StringType(),True),
            StructField("SNR",LongType(),True),
            StructField("State",LongType(),True),
            StructField("I_A1",StringType(),True),
            StructField("I_B1",StringType(),True),
            StructField("I1",DoubleType(),True),
            StructField("Mabs",DoubleType(),True),
            StructField("p_HD1",DoubleType(),True),
            StructField("pG",DoubleType(),True),
            StructField("pT",DoubleType(),True),
            StructField("pXZ",DoubleType(),True),
            StructField("T3",DoubleType(),True),
            StructField("nan",DoubleType(),True),
            StructField("Q1",LongType(),True),
            StructField("a_01X",ArrayType(DoubleType()),True)
          ]
    )
    
    example = {"_id": '{"$oid": "5eb56a371af2d82e242d24ae"}',"Id": 7,"Timestamp": {"$date": 1582889068586},"TTNR": "R902170286","SNR": 92177446,"State": 0,"I_A1": "FALSE","I_B1": "FALSE","I1": 0.0037385,"Mabs": -20.3711051,"p_HD1": 30.9632005,"pG": 27.788934,"pT": 1.7267373,"pXZ": 3.4487671,"T3": 25.2357555,"nan": 202.1999969,"Q1": 0,"a_01X": [62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
    
    
    df = spark.read.json(sc.parallelize([json.dumps(example)]), schema=schema)
    df.printSchema()
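
    From there, the nested fields can be lifted to top-level columns to match the desired output, for example (a sketch; I use `getField` to address the `$`-prefixed inner names safely, and leave the epoch-millisecond value as-is):

    from pyspark.sql.functions import col

    # Pull the nested values up to top-level columns; the rest are already flat
    flat = df.select(
        col("_id"),
        col("Id"),
        col("Timestamp").getField("$date").alias("Timestamp"),  # epoch milliseconds
        col("TTNR"), col("SNR"), col("State"), col("I_A1"), col("I_B1"),
        col("I1"), col("Mabs"), col("p_HD1"), col("pG"), col("pT"),
        col("pXZ"), col("T3"), col("nan"), col("Q1"), col("a_01X"),
    )
    flat.show()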