Spark CSV: parsing a file delimited by ASCII æ (hex E6)

I have large data files delimited by the ASCII character æ (hex E6). My snippet for parsing the file is below, but the parser does not seem to split the values correctly (I am using Spark 2.4.1):

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, path: String): DataFrame = {
    dataFrameReader.option("delimiter", "\u00E6")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .schema(schema)
      .csv(path)
  }
}

Any hints on how to fix this?


in_ut

Based on the sample data in your screenshot, the delimiter is being treated as more than one character, i.e. "æ", which triggers:

java.lang.IllegalArgumentException: Delimiter cannot be more than one character: "æ"

A multi-character delimiter cannot be passed to option("delimiter", ...) — the quoted literal option("delimiter", """"\u00E6"""") produces the three-character string "æ", not the single character æ.
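A quick way to see why Spark rejects it: the extra quote characters in the Scala literal become part of the delimiter string. A plain-Python check of the lengths involved (a minimal sketch, not Spark itself):

```python
# The bare escape yields one character; wrapping it in quote characters
# yields three, which Spark 2.4's CSV reader rejects as a delimiter.
bare = "\u00E6"        # æ   -> one character, a valid CSV delimiter
quoted = '"\u00E6"'    # "æ" -> three characters, rejected by Spark
print(len(bare), len(quoted))  # 1 3
```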

Please check the two-step process below to parse the data:

import spark.implicits._   // encoder needed for Dataset[String].map

spark.read
  .option("delimiter", "\u00E6")   // single character æ
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("encoding", "UTF-8")
  .csv(
    // Step 1: read raw lines, split on the quoted delimiter "æ",
    // then re-join on the bare single character æ before CSV parsing.
    spark.read
      .textFile("path")
      .map(line => line.split("\"\u00E6\"").mkString("\u00E6"))
  )
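The idea behind the two steps is to re-delimit each raw line yourself, then hand the CSV parser lines that use a single-character delimiter it accepts. A plain-Python stand-in for the textFile -> map -> csv pipeline (the sample row and the replacement delimiter '\u0001' are assumptions for illustration):

```python
import csv
import io

DELIM = "\u00E6"   # æ, the Teradata export delimiter
SAFE = "\u0001"    # a single-character delimiter the CSV parser accepts

# Hypothetical input row as it would appear in the raw text file.
raw_line = f"alice{DELIM}30{DELIM}NYC"

# Step 1: split on the original delimiter and re-join on the safe one.
normalized = SAFE.join(raw_line.split(DELIM))

# Step 2: parse the normalized line with a standard CSV parser.
row = next(csv.reader(io.StringIO(normalized), delimiter=SAFE))
print(row)  # ['alice', '30', 'NYC']
```

In Spark the same shape applies: the inner textFile/map pass rewrites the delimiter, and the outer csv() call parses the result.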
