Get column names from an old DataFrame in Spark Scala

Here is my code:

 val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()


    val data = spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/resources/student.csv")

My data looks like this:

Id   Name  City  
1    Ali   lhr
2    abc   khi
3    xyz   isb

Now I create a new DataFrame as follows:

 val someDF = Seq(
      (4,"Ahmad","swl")
    ).toDF("Id", "Name","City")

Here you can see I have created a new DataFrame someDF with the same column names as the old DataFrame data. But I assigned the names to someDF manually. My question is: is there any method that can take the column names from the old DataFrame and assign them to the new DataFrame programmatically?

Something like:

val featureCols = data.columns

Thanks in advance.

Comments
  • mest replied:

    .toDF accepts (colNames: String*), so we can unpack a Seq[String] into varargs with :_*

    Example:

    val featureCols = Seq("Id", "Name", "City")
    val someDF = Seq((4, "Ahmad", "swl")).toDF(featureCols: _*)
    
    Seq(("1","2","3")).toDF(featureCols:_*).show()
    //+---+----+----+
    //| Id|Name|City|
    //+---+----+----+
    //|  1|   2|   3|
    //+---+----+----+
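Building on this, the hardcoded Seq of names can be replaced by the old DataFrame's own columns array, which is exactly what the question asks for. A minimal, self-contained sketch (the one-row stand-in DataFrame substitutes for the student.csv read in the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReuseColumnNames {
  // Create a new DataFrame that reuses the column NAMES of `template`
  def withColumnsOf(template: DataFrame, rows: Seq[(Int, String, String)])
                   (implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    // template.columns is an Array[String]; splat it into toDF with :_*
    rows.toDF(template.columns: _*)
  }

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Stand-in for the old DataFrame read from student.csv in the question
    val data = Seq((1, "Ali", "lhr")).toDF("Id", "Name", "City")

    val someDF = withColumnsOf(data, Seq((4, "Ahmad", "swl")))
    someDF.show()
    spark.stop()
  }
}
```

Note this copies only the names; the new column types are still inferred from the tuple, not taken from the old DataFrame.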
    
  • 寂寞在掉泪 replied:

    Two ways to do this: varargs and union. Below is a complete example.

      val csv =
        """
          |Id,Name,City
          |1,Ali,lhr
          |2,abc,khi
          |3,xyz,isb
        """.stripMargin.lines.toSeq.toDS()
    // *** Option 1 ***
      val data: DataFrame = spark.read.option("header", true)
        .option("sep", ",")
        .option("inferSchema", true)
        .csv(csv)
      data.show
      val someDF: DataFrame = Seq(
        (4,"Ahmad","swl")
      ).toDF(data.columns:_*)
      someDF.show
    
      //***Option 2***
      val someDF1: DataFrame = Seq(
        (4,"Ahmad","swl")
      ).toDF
      data.limit(0).union(someDF1).show
    

    Result:

    +---+----+------+
    | Id|Name|  City|
    +---+----+------+
    |  1| Ali|   lhr|
    |  2| abc|   khi|
    |  3| xyz|   isb|
    +---+----+------+
    
    +---+-----+------+
    | Id| Name|  City|
    +---+-----+------+
    |  4|Ahmad|   swl|
    +---+-----+------+
    
    +---+-----+------+
    | Id| Name|  City|
    +---+-----+------+
    |  4|Ahmad|   swl|
    +---+-----+------+
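One caveat with Option 2: union matches columns by position, not by name, so rows built with a different column order would be silently (or noisily) mis-aligned. Since Spark 2.3, unionByName aligns on column names instead. A small sketch illustrating the difference (the stand-in DataFrames are assumptions, not from the thread):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameDemo {
  // Append rows to an empty copy of `template`, matching columns by NAME.
  // limit(0) keeps the schema of `template` but drops all of its rows.
  def appendByName(template: DataFrame, rows: DataFrame): DataFrame =
    template.limit(0).unionByName(rows)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq((1, "Ali", "lhr")).toDF("Id", "Name", "City")

    // Columns deliberately out of order relative to `data`:
    // a positional union would mis-align these values
    val extra = Seq(("swl", 4, "Ahmad")).toDF("City", "Id", "Name")

    appendByName(data, extra).show()
    spark.stop()
  }
}
```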
    
  • XXOO replied:
    import org.apache.spark.sql.SparkSession

    object DuplicateDataframe {

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        // limit(0) keeps the schema of df but drops all rows,
        // so the union carries over the original column names
        val df = List(PersonCity(1, "Ali", "lhr")).toDF()
        val someDF = df.limit(0).union(List(PersonCity(4, "Ahmad", "sw1")).toDF())
        someDF.show()
      }
    }

    case class PersonCity(Id: Int, Name: String, City: String)
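If the goal is to reuse not just the names but the full schema of the old DataFrame (names and the types inferred from the CSV), spark.createDataFrame accepts an explicit schema. A minimal sketch, independent of the answers above; the Row values must match the template's column types:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ReuseSchema {
  // Build a new DataFrame that reuses BOTH the column names and the
  // column types of `template` via its StructType schema
  def withSchemaOf(template: DataFrame, rows: Seq[Row])
                  (implicit spark: SparkSession): DataFrame =
    spark.createDataFrame(rows.asJava, template.schema)

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Stand-in for the old DataFrame; in the question it is read from student.csv
    val data = Seq((1, "Ali", "lhr")).toDF("Id", "Name", "City")

    // Row values must match data.schema: (Int, String, String)
    val someDF = withSchemaOf(data, Seq(Row(4, "Ahmad", "swl")))
    someDF.show()
    spark.stop()
  }
}
```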