Since an upcoming project will use Spark for distributed processing, I've started working with DataFrames.
First, install PySpark:
pip install pyspark
Then import SparkSession:
from pyspark.sql import SparkSession
Then create a session instance:
spark = SparkSession.builder.getOrCreate()
or, with an application name and configuration:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Next, read a JSON file:
df = spark.read.json("file:///root/hyq/people.json")
(A side note: I connect to the server through SecureCRT; how to upload files with SecureCRT is a separate topic.)
Then run a series of basic operations:
1. Show the contents

In [5]: df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
2. Print the schema

In [7]: df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
3. Select multiple columns

In [6]: df.select(df.name, df.age + 1).show()
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
4. Filter by condition

In [8]: df.filter(df.age > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
5. Group and aggregate

In [9]: df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
6. Sort

In [10]: df.sort(df.age.desc()).show()
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
7. Sort by multiple columns

In [12]: df.sort(df.age.desc(), df.name.asc()).show()
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
8. Rename a column

In [13]: df.select(df.name.alias("username"), df.age).show()
+--------+----+
|username| age|
+--------+----+
| Michael|null|
|    Andy|  30|
|  Justin|  19|
+--------+----+