Spark CSV: parsing a file delimited by the ASCII character æ (hex E6)

I have large data files delimited by the ASCII character æ (hexadecimal E6). My parsing code is below, but the parser does not seem to split the values correctly (I am using Spark 2.4.1).

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, path: String): DataFrame = {
    dataFrameReader
      .option("delimiter", "\u00E6")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .schema(schema)
      .csv(path)
  }
}
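For what it's worth, `"\u00E6"` in an ordinary (non-triple-quoted) Scala string literal is processed by the compiler into the single character æ, so the delimiter option value itself is legal. A quick sanity check:

```scala
object DelimiterCheck extends App {
  // In a regular double-quoted literal, the \u00E6 escape is
  // resolved at compile time to the single character æ.
  val delim = "\u00E6"
  assert(delim == "æ")
  assert(delim.length == 1)
  println(s"delimiter is '$delim', length ${delim.length}")
}
```

So the problem is unlikely to be this option value on its own; it is how the delimiter actually appears in the file bytes.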

Any hints on how to fix this?

(screenshot of sample data)

Comments
  • in_ut replied:

Based on the sample data in your screenshot, the delimiter is effectively multi-character, i.e. the quoted "æ", so Spark rejects it:

    java.lang.IllegalArgumentException: Delimiter cannot be more than one character: "æ"

A multi-character delimiter cannot be specified via option("delimiter", """"\u00E6"""") in Spark 2.x.

The two-step process below parses the data: first normalize each line so the separator is the bare single character æ, then parse the result as CSV with that one-character delimiter.

    import spark.implicits._  // needed for the Encoder used by .map on Dataset[String]

    // Step 1: read the file as plain text and collapse the quoted
    // separator ("æ", three characters) down to the bare character æ.
    val normalized = spark
      .read
      .textFile("path")
      .map(line => line.split("\"\u00E6\"").mkString("\u00E6"))

    // Step 2: parse the normalized lines as CSV with the
    // single-character delimiter (csv(Dataset[String]) exists since Spark 2.2).
    spark
      .read
      .option("delimiter", "\u00E6")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .csv(normalized)
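The normalization step (the split/mkString line) can be checked in isolation. The sample line below is hypothetical and assumes the file wraps the æ separator in double quotes, which is what the error message suggests:

```scala
object NormalizeLine extends App {
  // Hypothetical raw line where fields are separated by a quoted æ.
  val raw = "alpha\"\u00E6\"beta\"\u00E6\"gamma"

  // Collapse the three-character "æ" separator to the bare character æ.
  val normalized = raw.split("\"\u00E6\"").mkString("\u00E6")
  assert(normalized == "alpha\u00E6beta\u00E6gamma")

  // The normalized line now splits cleanly on the single-char delimiter.
  assert(normalized.split("\u00E6").toList == List("alpha", "beta", "gamma"))
  println(normalized)
}
```

Once every line is in this shape, the outer CSV reader with `option("delimiter", "\u00E6")` splits the values correctly.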