Flow Analysis & Time-based Bloom Filters

Working with large streams of data is becoming increasingly widespread, be it for log, user behavior, or raw firehose analysis of user generated content. There is some very interesting academic literature on this type of data crunching, although much of it is focused on query or network packet analysis and is often not directly applicable to the type of data we have to deal with in the social web. For example, if you were tasked to build (a better) "Trending Topics" algorithm for Twitter, how would you do it?

Of course, the challenge is that it has to be practical - it needs to be "real-time" and be able to react to emerging trends in under a minute, all the while using a reasonable amount of CPU and memory. Now, we don't know how the actual system is implemented at Twitter, nor will we look at any specific solutions - I have some ideas, but I am more curious to hear how you would approach it. Instead, I want to revisit the concept of Bloom Filters, because as I am making my way through the literature, it is surprising how sparsely they are employed for these types of tasks. Specifically, a concept I have been thinking of prototyping for some time now: time-based, counting bloom filters!

Bloom Filters: What & Why

A Bloom Filter is a probabilistic data structure which can tell if an element is a member of a set. However, the reason it is interesting is because it accomplishes this task with an incredibly efficient use of memory: instead of storing a full hash map, it is simply a bit vector which guarantees that you may have some small fraction of false positives (the filter will report that a key is in the bloom filter when it is really not), but it will never report a false negative. File system and web caches frequently use bloom filters as the first query to avoid otherwise costly database or file system lookups. There is some math involved in determining the right parameters for your bloom filter, which you can read about in an earlier post.
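For reference, the standard sizing math works out as follows: for n expected elements and a target false positive rate p, the optimal bit vector size is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) * ln 2. A minimal Ruby sketch (the helper name is mine, not part of any library):

# Compute Bloom filter parameters from the expected number of
# elements (n) and the target false positive rate (p).
def bloom_parameters(n, p)
  m = -(n * Math.log(p)) / (Math.log(2) ** 2) # bits in the vector
  k = (m / n) * Math.log(2)                   # number of hash functions
  { :size => m.ceil, :hashes => k.round }
end

bloom_parameters(10_000, 0.01) # => {:size => 95851, :hashes => 7}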

Of course, as is, the Bloom Filter data structure is not very useful for analyzing continuous data streams - eventually we would fill up the filter and it would begin reporting false positives all the time. But, what if your bloom filter only remembered seen data for a fixed interval of time? Imagine adding a time-to-live (TTL) timestamp to each record. All of a sudden, if you knew the approximate number of messages for the interval of time you wanted to analyze, then a bloom filter is once again an incredibly fast and space-efficient (fixed memory footprint) data structure!
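To make that concrete, here is a toy, in-memory sketch of the idea - illustrative only, not the implementation discussed below: keep a timestamp per bucket, and treat any bucket older than the window as empty.

# Toy time-based Bloom filter: a bucket "expires" once its last
# insert falls outside the TTL window.
class ToyTimeBloomFilter
  def initialize(size, hashes, ttl)
    @size, @hashes, @ttl = size, hashes, ttl
    @stamps = Array.new(size) # last insert time per bucket, nil = never set
  end

  def insert(key)
    now = Time.now.to_f
    buckets(key).each { |i| @stamps[i] = now }
  end

  def include?(key)
    now = Time.now.to_f
    buckets(key).all? { |i| @stamps[i] && now - @stamps[i] <= @ttl }
  end

  private

  # Derive one bucket index per hash function by salting the key.
  def buckets(key)
    (0...@hashes).map { |i| "#{key}:#{i}".hash.abs % @size }
  end
end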

Time-based Bloom Filters

Arguably the key feature of bloom filters is their compact representation as a bit vector. By associating a timestamp with each record, the size of the filter immediately expands by an order of magnitude, but even with that, depending on the size of the time window you are analyzing, you could store the TTL's in just a few additional bits. Conversely, if counting bits is not mission critical, you could even use a backend such as Redis or Memcached to drive the filter as well. The direct benefit of such an approach is that the data can be shared by many distributed processes. On that note, I have added a prototype Redis backend to the bloomfilter gem which implements a time-based, counting Bloom Filter. Let's take a look at a simple example:

require 'bloomfilter'

options = {
  :size   => 100,        # size of bit vector
  :hashes => 4,          # number of hash functions
  :seed   => rand(100),  # seed value for the filter
  :bucket => 3           # number of bits for the counting filter
}

# Regular, in-memory counting bloom filter
bf = BloomFilter.new(options)
bf.insert("mykey")
bf.include?("mykey")  # => true
bf.include?("mykey1") # => false

## Redis-backed bloom filter, with optional time-based semantics
#
bf = BloomFilter.new(options.merge({:type => :redis, :ttl => 2, :server => {:host => 'localhost'}}))
bf.insert("mykey")
bf.include?("mykey")  # => true
sleep(3)
bf.include?("mykey")  # => false

# custom 5s TTL for a key
bf.insert("newkey", nil, 5)

bloomfilter.git (Ruby+Redis counting Bloom Filter)


Storing data in Redis or Memcached is roughly an order of magnitude less efficient, but it gives us an easy to use, distributed, and fixed memory filter for analyzing continuous data streams. In other words, a useful tool for applications such as duplicate detection, trends analysis, and many others.
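For instance, duplicate detection over a stream reduces to a membership check against the TTL'd filter. A hedged sketch using the gem API as shown above (the option values and the sample stream are illustrative, not a tuned configuration):

require 'bloomfilter'

# Drop any message already seen within the last 60 seconds. Size the
# filter with the parameter formulas from earlier.
bf = BloomFilter.new(:size => 1_000_000, :hashes => 4, :bucket => 3,
                     :seed => rand(100), :type => :redis, :ttl => 60,
                     :server => {:host => 'localhost'})

stream = ["msg-a", "msg-b", "msg-a"] # stand-in for a real firehose

stream.each do |msg|
  next if bf.include?(msg) # already seen in the window (or a false positive)
  bf.insert(msg)
  puts "processing #{msg}" # "msg-a" is only processed once
end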

Mechanics of Time-Based Bloom Filters

So how does it work? Given the settings above, we create a fixed memory vector of 100 buckets (or bits in the raw C implementation). Then, for each key, we hash it 4 times with different key offsets and increment the counts in those buckets - a non-zero value indicates that one of the hash functions for some key has used that bucket. Then, for a lookup, we reverse the operation: generate the 4 different hash keys and look them up; if all of them are non-zero, then either we have seen this key or there has been a collision (false positive). By optimizing the size of the bit vector we can control the false positive rate - you're always trading the amount of allocated memory against the collision rate. Finally, we also make use of the native expire functionality in Redis to guarantee that keys are only stored for a bounded amount of time.
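Stripped of the gem's plumbing, the Redis-side mechanics described above look roughly like this - a sketch with the plain redis-rb client, where the key naming and hashing scheme are illustrative rather than the gem's actual layout:

require 'redis'

SIZE, HASHES, TTL = 100, 4, 2 # mirror the options used above

redis = Redis.new(:host => 'localhost')

# Derive one bucket key per hash function by salting the key; the gem
# uses seeded hash offsets, String#hash here is just for illustration.
def buckets(key)
  (0...HASHES).map { |i| "bf:#{"#{key}:#{i}".hash.abs % SIZE}" }
end

def insert(redis, key)
  buckets(key).each do |bucket|
    redis.incr(bucket)        # counting filter: bump the bucket
    redis.expire(bucket, TTL) # native Redis expiry bounds the time window
  end
end

def seen?(redis, key)
  # all k buckets must still hold a non-zero count
  redis.mget(*buckets(key)).all? { |v| v.to_i > 0 }
end

insert(redis, "mykey")
seen?(redis, "mykey") # => true, until the TTL lapses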

Time-based bloom filters have seen a few rogue mentions in the academic literature but, to the best of my knowledge, have not seen wide application in the real world. However, this is an incredibly powerful data structure, and one that could benefit many modern, big-data applications. Gem install the bloomfilter gem and give it a try; perhaps it will help you build a better trends analysis tool. Speaking of which, what other tools, algorithms, or data structures would you use to build a "Trending Topics" algorithm for a high-velocity stream?