Spark SQL Study Notes (JSON data sources)
Views: 7054
Published: 2019-06-28

This article is about 3294 characters; reading it takes roughly 10 minutes.

Preparation

Data file students.json (Spark's JSON reader expects one JSON object per line):

{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}

Stored at: hdfs://master:9000/student/2016113012/spark/students.json

Scala code

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/2/12.
  */
// Create a DataFrame by loading a JSON data source
object JsonOperation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonOperation")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Read a JSON file directly
    val df1 = sqlContext.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
    // Alternatively, read via load() with an explicit format;
    // without format(), load() defaults to Parquet files
    //sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/spark/students.json")
    df1.printSchema()
    df1.registerTempTable("t_students")
    val teenagers = sqlContext.sql("select name from t_students where age > 13 and age < 19")
    teenagers.write.parquet("hdfs://master:9000/student/2016113012/teenagers")
  }
}
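The same read-filter-write flow can be tried without a cluster. The sketch below is an assumption-laden adaptation, not the post's original code: it uses Spark 2.x+ in local mode (SparkSession replaces SQLContext, and createOrReplaceTempView replaces the now-removed registerTempTable), and it writes the sample data to a temporary local directory instead of the HDFS path above.

```scala
import org.apache.spark.sql.SparkSession

// Local-mode sketch (assumes Spark 2.x+ on the classpath); no HDFS needed
val spark = SparkSession.builder().master("local[1]").appName("JsonCheck").getOrCreate()
import spark.implicits._

// Write the sample records as JSON Lines to a temp directory
val dir = java.nio.file.Files.createTempDirectory("students").toString
java.nio.file.Files.write(
  java.nio.file.Paths.get(dir, "students.json"),
  Seq("""{"id":1, "name":"leo", "age":18}""",
      """{"id":2, "name":"jack", "age":19}""",
      """{"id":3, "name":"marry", "age":17}""").mkString("\n").getBytes("UTF-8"))

val df = spark.read.json(s"$dir/students.json")
df.printSchema()
// createOrReplaceTempView is the Spark 2.x+ name for registerTempTable
df.createOrReplaceTempView("t_students")
val teenagers = spark.sql("select name from t_students where age > 13 and age < 19")
// Only leo (18) and marry (17) satisfy 13 < age < 19
val names = teenagers.as[String].collect().sorted
```

Spark infers the schema (age and id as long, name as string) directly from the JSON, which is why the original program can query the file without declaring columns first.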

Submitting to the cluster

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.JsonOperation  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar

Output

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.JsonOperation  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/02/14 10:58:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/14 10:58:56 INFO Slf4jLogger: Slf4jLogger started
17/02/14 10:58:56 INFO Remoting: Starting remoting
17/02/14 10:58:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:58268]
17/02/14 10:58:59 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/02/14 10:59:05 INFO FileInputFormat: Total input paths to process : 1
17/02/14 10:59:11 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/14 10:59:11 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/14 10:59:11 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/14 10:59:11 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/14 10:59:11 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
17/02/14 10:59:18 INFO FileInputFormat: Total input paths to process : 1
17/02/14 10:59:18 INFO CodecPool: Got brand-new compressor [.gz]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/02/14 10:59:19 INFO FileOutputCommitter: Saved output of task 'attempt_201702141059_0001_m_000000_0' to hdfs://master:9000/studnet/2016113012/teenagers/_temporary/0/task_201702141059_0001_m_000000

Common errors

Exception in thread "main" java.io.IOException: No input paths specified in job

This is caused by failing to read the data source, for example because the data-source path is wrong.
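One way to fail fast with a clearer message is to check that the input path exists before reading it. This is a hedged sketch, not part of the original post: it uses Hadoop's FileSystem API (assumed to be on the classpath, as it is for any Spark application), and with a default Configuration a scheme-less path resolves to the local filesystem, while an hdfs:// URL like the one above resolves to HDFS.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: returns true if the input path exists, so a typo in
// the data-source path surfaces before Spark throws
// "No input paths specified in job".
def checkInput(pathStr: String): Boolean = {
  val path = new Path(pathStr)
  // getFileSystem picks the right filesystem from the path's scheme
  // (hdfs://, file://, or the configured default)
  val fs = path.getFileSystem(new Configuration())
  fs.exists(path)
}
```

In the original program this check would run against "hdfs://master:9000/student/2016113012/spark/students.json" before the sqlContext.read.json call.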

Reprinted from: https://www.cnblogs.com/wujiadong2014/p/6516588.html
