
阿弥陀佛

 

RDD: Filtering Data

2015-01-11 10:02:21 | Category: Spark

I would like some clarity on how the filter function of an RDD works.

1) Does the filter function scan every element stored in the RDD? If my RDD holds 10 million rows and I want to work on only 1000 of them, is there an efficient way to select that subset without scanning every element?
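filter applies its predicate to every element, so it does scan the whole RDD. If any 1000 rows will do, I think we can use sc.parallelize(rdd.take(1000)).
However, .take(1000) returns the first 1000 elements, so it may be a biased sample.
You may want to sample your RDD instead (with or without replacement), using a seed for reproducible randomization, via .takeSample(), e.g.
rdd.takeSample(false, 1000, 1)
This returns an Array of exactly 1000 elements, from which you could create another RDD.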
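
The take and takeSample options above can be sketched roughly as follows. This is only an illustration under stated assumptions: a local master, a made-up RDD of integers standing in for the 10 million rows, and a hypothetical helper named randomSubset that is not part of Spark. Note that the sample is collected on the driver before being turned back into an RDD, so it has to fit in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SubsetExample {
  // Hypothetical helper: exactly `n` randomly chosen elements as a new RDD.
  // takeSample(withReplacement, num, seed) returns an Array on the driver.
  def randomSubset(sc: SparkContext, rows: RDD[Int], n: Int, seed: Long): RDD[Int] = {
    val sampled: Array[Int] = rows.takeSample(false, n, seed)
    // Turn the driver-side Array back into a distributed RDD.
    sc.parallelize(sampled.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for illustration.
    val conf = new SparkConf().setAppName("subset-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data standing in for the 10 million rows.
      val rows = sc.parallelize(1 to 10000000)

      // take(1000): the first 1000 elements in partition order (possibly biased).
      val firstRows: Array[Int] = rows.take(1000)
      println(firstRows.length)

      // A seeded random subset of exactly 1000 elements.
      val subset = randomSubset(sc, rows, 1000, 1L)
      println(subset.count())
    } finally {
      sc.stop()
    }
  }
}
```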

Also available is .sample(), which randomly samples your RDD with or without replacement and returns an RDD.
.sample() takes a fraction rather than a count, so it does not return an exact number of elements, e.g.
rdd.sample(true, .0001, 1)
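
Since .sample() wants a fraction, one way to aim for roughly 1000 rows is to derive the fraction from the row count. The sketch below assumes a local master and made-up data; fractionFor is a hypothetical helper, not a Spark function. The resulting size is only approximately the target.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleFraction {
  // Hypothetical helper: fraction that yields roughly `target` elements out of `total`.
  def fractionFor(target: Long, total: Long): Double =
    if (total <= 0) 0.0 else math.min(1.0, target.toDouble / total)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for illustration.
    val conf = new SparkConf().setAppName("sample-fraction").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data standing in for the 10 million rows.
      val rows = sc.parallelize(1 to 10000000)

      // 1000 / 10,000,000 = 0.0001, the fraction used in the thread.
      val fraction = fractionFor(1000L, rows.count())

      // sample(withReplacement, fraction, seed) stays distributed and returns an RDD.
      val approx = rows.sample(withReplacement = false, fraction = fraction, seed = 1L)

      // The count is close to 1000 but is not guaranteed to be exact.
      println(approx.count())
    } finally {
      sc.stop()
    }
  }
}
```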


2) Suppose my RDD represents a key/value data set of 10 million rows. When I filter it, can I restrict the search to only the partitions that contain specific keys? If the data is partitioned by key, will Spark still run my filter operation on all partitions, regardless of whether the key exists in a partition?
A plain .filter() runs on every partition, even when the RDD is partitioned by key. You may want to use .lookup() instead of .filter():
def lookup(key: K): Seq[V]
Return the list of values in the RDD for the given key. This operation is done efficiently if the RDD has a known partitioner, by only searching the partition that the key maps to.
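
A small sketch of .lookup() on a partitioned pair RDD, assuming a local master, a handful of made-up string keys, and an arbitrary choice of 4 hash partitions. Older Spark releases may additionally need import org.apache.spark.SparkContext._ for the pair-RDD methods to be available.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LookupExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for illustration.
    val conf = new SparkConf().setAppName("lookup-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up key/value data.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

      // Give the RDD a known partitioner and cache it so the shuffle is paid once.
      val partitioned = pairs.partitionBy(new HashPartitioner(4)).cache()

      // With a known partitioner, lookup only searches the partition "a" maps to.
      val valuesForA: Seq[Int] = partitioned.lookup("a")

      // Sorted because the order of values within a partition is not guaranteed.
      println(valuesForA.sorted.mkString(",")) // prints 1,3
    } finally {
      sc.stop()
    }
  }
}
```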

You might want to partition your first batch of data with .partitionBy(), using your CustomTuple hash implementation, persist it, and avoid running any operations on it that remove its partitioner object.
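
To illustrate which operations keep the partitioner, here is a rough sketch. The CustomTuple case class below is a hypothetical stand-in (the original thread does not show its definition), and the HashPartitioner with 8 partitions is an arbitrary choice. In Spark, filter and mapValues preserve the partitioner because they cannot change keys, while map drops it.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical key type; a case class gets consistent equals/hashCode,
// which HashPartitioner relies on to place keys.
case class CustomTuple(region: String, id: Int)

object KeepPartitioner {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for illustration.
    val conf = new SparkConf().setAppName("keep-partitioner").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up first batch of data.
      val data = sc.parallelize(Seq(
        (CustomTuple("north", 1), 10.0),
        (CustomTuple("south", 2), 20.0),
        (CustomTuple("north", 1), 30.0)))

      // Partition by key once and persist the result.
      val byKey = data.partitionBy(new HashPartitioner(8)).persist()

      // filter and mapValues leave keys untouched, so the partitioner survives.
      val kept = byKey.filter { case (_, v) => v > 5.0 }.mapValues(_ * 2)
      println(kept.partitioner.isDefined) // prints true

      // map could change keys, so Spark drops the partitioner.
      val dropped = byKey.map { case (k, v) => (k, v * 2) }
      println(dropped.partitioner.isDefined) // prints false

      // lookup on `kept` can still go straight to the key's partition.
      println(kept.lookup(CustomTuple("north", 1)).sorted.mkString(",")) // prints 20.0,60.0
    } finally {
      sc.stop()
    }
  }
}
```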