Filtering Data in an RDD
2015-01-11 10:02:21 | Category: Spark
I wanted some clarity on how the RDD filter function works.
1) Does filter scan every element stored in the RDD? If my
RDD represents 10 million rows, and I want to work on only 1,000 of
them, is there an efficient way of filtering out that subset without
scanning every element?
One option is sc.parallelize(rdd.take(1000)), but note that
.take(1000) returns the first 1,000 elements, which may be a biased sample.
You may instead want to sample your RDD (with or without replacement), using a seed for randomization, via .takeSample().
e.g.
rdd.takeSample(false, 1000, 1)
This returns an Array, from which you could create another RDD.
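As a minimal sketch of that suggestion (hypothetical object and variable names; assumes a local-mode SparkContext and an RDD of integers standing in for the 10M rows):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeSampleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("takeSample-demo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 10000000) // stand-in for the 10M-row data set

    // takeSample is an action: it returns an Array on the driver.
    // Fixing the seed (here 1) makes the random sample reproducible.
    val sampled = rdd.takeSample(withReplacement = false, num = 1000, seed = 1)

    // Turn the sampled Array back into an RDD to keep working distributed.
    val sampledRdd = sc.parallelize(sampled)
    println(sampledRdd.count()) // exactly 1000 elements

    sc.stop()
  }
}
```

Because takeSample collects the result to the driver, it is only appropriate when the requested sample comfortably fits in driver memory.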
Also available is .sample(), which randomly samples your RDD with or without replacement and returns an RDD.
.sample() takes a fraction, so it doesn't return an exact number of elements.
e.g.
rdd.sample(true, .0001, 1)
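A sketch of the fraction-based variant (again with hypothetical names and a local-mode SparkContext):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-demo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 10000000) // stand-in for the 10M-row data set

    // .sample() is a transformation: it returns an RDD rather than an Array.
    // Each element is kept with probability ~= fraction, so the exact count
    // varies from run to run (here roughly 10M * 0.0001, i.e. about 1000).
    val approx = rdd.sample(withReplacement = true, fraction = 0.0001, seed = 1)
    println(approx.count())

    sc.stop()
  }
}
```

Unlike takeSample, nothing is collected to the driver here, so sample scales to fractions that would still be too large to hold in driver memory.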
2) If my RDD represents a key/value data set, and I filter
this data set of 10 million rows, can I specify that the search should
be restricted to only the partitions which contain specific keys? Will
Spark run my filter operation on all partitions even when the data is
partitioned by key, regardless of whether the key exists in a given partition?
Also, you may want to use .lookup() instead of .filter():

def lookup(key: K): Seq[V]

Return the list of values in the RDD for key `key`. This operation is
done efficiently if the RDD has a known partitioner, by searching only
the partition that the key maps to.
You might want to partition your first batch of data with
.partitionBy() using your CustomTuple hash implementation, persist it,
and avoid running any operations on it that can remove its partitioner.
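Putting those pieces together, a sketch of the partition-then-lookup pattern (hypothetical names; assumes a local-mode SparkContext and a plain Int key in place of the CustomTuple):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LookupDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lookup-demo").setMaster("local[*]"))

    // A key/value data set, hash-partitioned by key and persisted so that
    // the partitioner is established once and reused across later actions.
    val pairs = sc.parallelize(1 to 1000000)
      .map(i => (i % 1000, i))            // key in [0, 999], value = i
      .partitionBy(new HashPartitioner(100))
      .cache()

    // Because pairs has a known partitioner, lookup() scans only the one
    // partition that key 42 hashes to, instead of filtering all partitions.
    val values: Seq[Int] = pairs.lookup(42)
    println(values.size)

    sc.stop()
  }
}
```

Note that transformations such as map() discard the partitioner (the keys could change); mapValues() preserves it, which is what "operations that can remove its partitioner" refers to above.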