Count min sketch | Efficient algorithm for counting stream of data | system design components

Подписаться 158 тыс.

Просмотров 96 тыс.

50% 1

Count Min sketch is a simple technique to summarize large amounts of frequency data. which is widely used in many places where there is a streaming big data.
Donate/Patreon: / techdummies
CODE:
----------------------------------------------------------------------------
By Varun Vats: gist.github.com/VarunVats9/7f...
Applications of count min sketch:
----------------------------------------------------------------------------
theory.stanford.edu/~tim/s15/l...
highscalability.com/blog/2016/...
spark.apache.org/docs/2.0.1/a...
Applications using Count Tracking There are dozens of applications of count tracking and in particular, the Count-Min sketch datastructure that goes beyond the task of approximating data distributions. We give three examples.
1. A more general query is to identify the Heavy-Hitters, that is, the query HH(k) returns theset of items which have large frequency (say 1/k of the overall frequency). Count trackingcan be used to directly answer this query, by considering the frequency of each item. Whenthere are very many possible items, answering the query in this way can be quite slow. Theprocess can be sped up immensely by keeping additional information about the frequenciesof groups of items [6], at the expense of storing additional sketches. As well as being ofinterest in mining applications, finding heavy-hitters is also of interest in the context of signalprocessing. Here, viewing the signal as defining a data distribution, recovering the heavy-hitters is key to building the best approximation of the signal. As a result, the Count-Minsketch can be used in compressed sensing, a signal acquisition paradigm that has recentlyrevolutionized signal processing [7].
2. One application where very large data sets arise is in Natural Language Processing (NLP).Here, it is important to keep statistics on the frequency of word combinations, such as pairsor triplets of words that occur in sequence. In one experiment, researchers compacted a large6
Page 7
90GB corpus down to a (memory friendly) 8GB Count-Min sketch [8]. This proved to be justas effective for their word similarity tasks as using the exact data.
3. A third example is in designing a mechanism to help users pick a safe password. To makepassword guessing difficult, we can track the frequency of passwords online and disallowcurrently popular ones. This is precisely the count tracking problem. Recently, this wasput into practice using the Count-Min data structure to do count tracking (see • Simply Science Episode... ). A nice feature of this solution is that the impactof a false positive-erroneously declaring a rare password choice to be too popular and sodisallowing it-is only a mild inconvenience to the user

Опубликовано:

24 июл 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 131

@mattleahy3951 5 лет назад

I appreciate this; I was unaware of this algorithm until I saw your easy to understand explanation.

@rinkalagrawal 3 года назад

Thanks for this simplistic explanation of the algorithm. I've watched your other videos as well and it has helped a lot. Appreciate the work you are doing. Thank you!

@dataguy7013 4 года назад

Naren, I am officially your fan. Nice video. You simplify and explain really well. Thanks!

@satishvedpathak9972 4 года назад

you have excellent skills to simplify the complex concepts ; this was better than explain in my big data training course .

@anupamtamrakar3985 5 лет назад

Great job. Really like the way you explained

@nik15oc 5 лет назад

very nicely explained. This data structure is frequently used in the caches

@viputdBeast 4 года назад

No one ..!! Seriously no one does it better than you.. You are GOAT sir when it comes to system design....

@moulisiramdasu6753 2 года назад

Thank you. Before watching this I found it difficult to understand but by the end of the video, it's crystal clear.

@studentforlife5502 3 года назад

U SAVED MY LIFE!! please provide videos for all sketching algorithms

@apoorvasaxena4522 5 лет назад

Thank you for this video... You did a great job... :)

@ShabnamKhan-cj4zc 3 года назад

Thank you easy explanation.. It feels like good teacher explaning to students.

@karl6892 2 года назад

Great video Naren! Very clear explanation

@mohanishmedar5332 3 года назад

Perfect explanation! Good job

@sebastiandanielripari6457 5 лет назад

Nice explanation. Great Job !

@mahdiarghavani1804 5 лет назад

You describe it well. Thanks

@user-qj8hx5ds8v Год назад

very good!!!!! Thanks a lot over 1000 times!

@felipegutierrez7856 5 лет назад

very good explanation! thanks

@abhishek_sengupta 3 года назад

Very nice explanation! Thank You!!

@dpsky666 Год назад

Great explanation! Thank you!

@mayankrathore7558 5 лет назад

Awesome algo and ofcourse nicely explained thanks . :)

@chikudholu 5 лет назад

Great work!!

@mydodethailung395 Год назад

Great explanation !!

@ThoRicHeLLi 5 лет назад

Keep up the good work!

@anandkumar2058 5 лет назад

video suggestion : Design maps (google map for example ) and finding the shortest route from A to B with/without traffic

@rameshnraghavan 4 года назад

Very nice explanation.. Thank you..

@Tasteofgyan 5 лет назад

Great job!! Thank you!

@harshav3902 3 года назад

Great Explanation. Thank you.

@tassoskat8623 4 года назад

Great video! You are better than my professor

@satang500 5 лет назад

Thanks a lot of your videos. Suggestions: 1. your video can be categorized into two: 1) fundamentals and glossary ( which is building blocks to design other systems such as consistent hashing, caching, load balancing... ) 2) system design ( such as design uber, facebook, youtube, netflix... ). So how about creating two groups and put videos into these groups? 2. request of video lecture: proxy, load balancing, web crawler, server and slave architecture for availability

@barrymullan6363 2 года назад

Great explanation, thanks!

@FengZhu-ym6qu 9 месяцев назад

Thanks bro, that really helps!

@jhnflory 2 года назад

Thanks...nice explanation!

@saitrinathdubba 4 года назад

Concise !! Thank you

@janina542 5 лет назад

Nice explanation. keep it Up

@shreyasns1 2 года назад

Count Min Sketch explained with Min complexity, Great Job Naren. Good luck for your videos

@vitaliano01 3 года назад

Thank you for the good explanation

@fernandozago Год назад

Pretty coooll!!! Is pretty close algorithm like KTop of Redis.... But, the only difference between this and Ktop from redis is that KTop uses two Values on each array element (Like: Count and Hash) and when a collision happens, it does some calculation to "decay" (meaning Decrement the count) from the data on the sketch instead of always increase the count. Very gool explanation !! =)

@moshelevi5485 3 года назад

Amazing job!

@user-oy4kf5wr8l 4 года назад

Thanks bro! We love u!

@vincentmax4571 5 лет назад

Good tutorial.

@arvindaaswani1303 4 года назад

Nice explanation 👍

@PrateekMehtaABDFAN 2 года назад

Great videos sir . I came here from the caching for distributed system video where you mentioned this as a use-case for calculating the frequency of data in eviction policies . You kind of dint mention that in the applications . So, CACHING can also be an application for the same .

@prateekthakur4060 5 лет назад

This one is new for me,Thanks for sharing! Also for video suggestion: How to design multiplayer gaming system

@TechDummiesNarendraL 5 лет назад

Sure, I will work on it

@mayuresh247 5 лет назад

This is somewhat similar to bloom filters.. Thanks for the video.. Keep it up !! Suggestion: food delivery application like swiggy/uberEats

@TechDummiesNarendraL 5 лет назад

Thanks and Sure, I will work on it

@Beadyboyplayz 2 года назад

Good explanation 😎

@mgayatri 5 лет назад

Very similar to the Consistent Hash ring solution you explained.

@kumarc4853 2 года назад

thank you Sys Guru!

@ankitgarg4273 5 лет назад

For the failure case where output is probable, letter "B" frequency can be seen. It should be 1 but from the CSM it is coming as 2. So, that's the reason it is called probable.

@waylonbradley1345 3 года назад

pro trick : you can watch movies at flixzone. I've been using them for watching loads of movies recently.

@cadejose7770 3 года назад

@Waylon Bradley Yup, I've been watching on flixzone} for since december myself :D

@Templesify 5 месяцев назад

awesome 👌

@arayanamit123 5 лет назад

@Tech Dummies , In this I what should be the max value my Hash function should return, or should we consider using modulo operation to keep the max value in check. what would be the impact of this max value on the collision.

@user-pn8vw8rr3m 5 лет назад

Interesting, is there an improvement of such algo which dinamically increase the matrix size (horizontally or vertically or both) proportionally, say log(N) to keep collision probability near constant.. (I assume here the key space is also has infinite size.)

@AbhishekSharma-si8ui 4 года назад

AWESOME

@ghazalabd8890 Год назад

great!

@prashidell8980 5 лет назад

Hi, I really enjoy your videos. Count Min Sketch is using basic idea of bloom filter, right?

@awaissindhu 4 года назад

how would we use this for trending data? for example, where we want to see the data, say, for the last 24 hours? do we subtract entries from these values that are older than 24 hours?

@chitthiaayeehai 5 лет назад

Super

@AmolGautam 6 месяцев назад

Thanks

@hidekazushidara320 5 лет назад

video suggestion: microservice architecture vs traditional architecture

@TechDummiesNarendraL 5 лет назад

Sure

@rozawar 3 года назад

Thanks for the video explanation. How to query top K using this algorithm? it shows object to count mapping. What about reverse?

@pramodsingh4668 2 года назад

Fully clear explanation Narendra. One suggestion, you could have reduced the video to 6-7 minutes instead of 19 minute.

@felipegutierrez7856 5 лет назад

Hi @Tech Dummies, which kind of data stream applications will benefit from ising count-min sketches?

@gakhov 5 лет назад

Think you about the task of estimating heavy hitters or the most frequent events in an event stream. Events can come from the IoT devices or from airplane sensors, the source doesn't really matter, matters only the fact that there are a lot of such events of different types. If we need to find the most frequent events in a traditional way, we need to count all of them and maintain such counters for all of them, even for rare once. Instead, we can use Count-Min Sketch and maintain only a few counters with some small number of "promising" elements. If you are interested in such data structures, check out my recent book "Probabilistic Data Structures and Algorithms in Big Data Applications" (pdsa.gakhov.com).

@antrapurohit8010 Год назад

Thank you for the explanation. Just have one doubt. Can we use this data structures to implement top K URLs accessed in last 1hr?

@carlaludgate6597 2 года назад

Thanks for this, great video! One question: why do you call it sublinear? Because it looks to me like all your operations are constant except for getting the min. So is the time complexity dependent on the amount of hash functions?

@yogeshverma9267 2 года назад

Sublinear is the relationship between data and memory. As the data grows, the memory required to store it does not grow in a linear fashion since we are using a fixed set of hash functions and hash outputs.

@AnuragSingh7 3 года назад

If we use reservoir sampling with a fairly large sample size wouldn't we be guaranteed to find the true frequent elements? I mean the frequent elements present in the sample would also be the frequent elements in the stream.

@pixelizedworld 3 года назад

my first comment here... just THANK YOU.

@bhulakshmidevanamaina3001 4 года назад

Is it also called the kth min value sketch algorithm

@varunganesh2028 5 лет назад

Is it possible to evict entries from the countmin sketch? Isn't that an important criterion while handling streams? If I subtract for all the hashes on removal, will the "min" property be preserved?

@TechDummiesNarendraL 5 лет назад

I am not sure, but min property might be violated!! I will try to get more info and update you

@Cosciug1234 5 лет назад

video suggestion: file storage and synchronization service system design (dropbox, google drive)

@TechDummiesNarendraL 5 лет назад

@cos thanks alot, I will consider the topics

@TechDummiesNarendraL 5 лет назад

@Cos Sure, I will work on it

@meghadave9363 2 года назад

Can you please also post a video on concurrency management in database as well as code for situations like concurrent ticket reservation. Also what is the best strategy for handling millions of operation(update/insert) in a relational database scenario

@harsha.m4026 4 года назад

Quick question. How do we decide the sketch's size?

@chinna311983 4 года назад

What if we use a hash function which uses string each literal ascii multiplied by index of the letter where it appears, and add all together, doesn't that guarantee the unique hash value all the time?

@thedankest1974 2 года назад

No, because of the pigeon hole principle. Remember the purpose of this data structure is to support counting frequencies of elements in sub-linear time, by trading-off accuracy. If we have unbounded strings mapping to a finite amount of space, there will always be a collision.

@KrishnaList 4 года назад

Gr8

@DorinCiobanu007 5 лет назад

How is CMS different from a hash map with no collision verification and with smaller than the data output space size of the hash function? Example: hash(n) = n mod 17, int freq[17] freq[hash("your input")]++;

@AmanSingh-nu6ng 5 лет назад

I don't think it is different, conceptually. CMS with one hash function is just hash map with no collision verification.

@usamahashmi7274 5 лет назад

@@AmanSingh-nu6ng It is different because you are not storing against each object. The total size of your data structure is the size of the sketch in this case. With hashmap, your hashmap will store all distinct elements which can be much greater than the size of your sketch.

@AmanSingh-nu6ng 5 лет назад

@@usamahashmi7274 no hash map with no collision verification will not store all the elements distinctly. If n keys will have same hash they will be stored as 1 key value. CMS with 1 hash function is same as hashmap with no collision verification. The key part is no collision verification.

@ameynaik2743 3 года назад

Nice video, could you please elaborate on how you decided on the number of columns for count-min sketch?

@_ityadi 2 года назад

Good question. I was also wondering on that coz in reality the hash function might return a really huge value which we would have to fit into our window size. But how do we calibrate the window size, that is the question

@RandomShowerThoughts Год назад

@@_ityadi I don't think you can ever change them, but it sounded like you needed to take the max of all the hash functions

@GSSMoney 3 года назад

Given a count-Min sketch - how to get the top K? Would be nice to mention what else we need to store in along with the count?

@prakashmishra 5 лет назад

Hi Narendra, The video was quite informational. But I am yet unsure how this is taking less space. How are we deciding the size of hash here? And ain't the taking the 5-6 hashes will be equivalent to taking single hash for maintaining count?

@arthursimeon2620 2 года назад

it was maybe not as clear as it could be with the small example size, but the idea would be that maybe you have millions of videos and you are counting likes, then you have a set of hash functions and then columns that are much smaller than the total number of videos, whereas a normal hash table would be linear with the number of the videos.

@shortstories765 3 года назад

what is the advantage of count-min sketch over hashtable??? we can store the count in hashtable, same time and space complexity

@jonarmani8654 2 года назад

Replace the counters with bloom filters, and you've got yourself a RAMBO.

@rishabhgupta2513 5 лет назад

The frequency of B is coming wrong via this CSM.How can we get the most accurate solution?

@singaravelann3671 3 года назад

What should be the modules value for each hash function??

@flyingonsnow 5 лет назад

What kind of hash functions to use in count in sketch.

@TechDummiesNarendraL 5 лет назад

Here is the list: en.wikipedia.org/wiki/List_of_hash_functions But the catch is to select the hash functions which has property of equal distribution.

@donyayasir2681 4 года назад

Please answer me .. how about B ?? why it is 2 (when i calculate it ) and in the stream appears only one time ??

@NoWarForever 3 года назад

collision

@saurabhgohil2232 5 лет назад

This is little bit similar to bloom filter. but instead of Boolean value we keep count in our sketch.

@aprilfool302 2 года назад

You do awesome bro, but sometimes there is issue with sound, your voice becomes feeble. Keep it up

@sivamunnaluri2263 3 года назад

How you are concluded more hash function will have more accurate ?

@nareshm8291 5 лет назад

can anyone please tell me what would be the space and time complexity of count min sketch algorithm

@nzs1 5 лет назад

Kind of complex question. But a quick answer: According to wikipedia, the space is decided beforehand and is only based on how big of an array to allocate (where rows = # hashes, cols = # of hash values).

@satishvedpathak9972 4 года назад

while (inpStream) for hashFn = 1 to 4 HV = s1(hashFn(inpStream)) MAT(hashFn,HV) = MAT(hashFn,HV)++; end for end while findFrequncey(inputSteam) for hashFn = 1 to 4 freq = freq + MAT(hashFn,HV) end for return freq/hashFn

@tutsdriveonline2632 3 месяца назад

Won't it be better if increase number of possible columns i.e. from 0-6 to like 0-100 instead of increasing the hasfunctions ?

@zhongyuanzhang5611 3 года назад

if it is half of the space, it is still linear, sublinear means when the data grows, the difference will be bigger and bigger

@snr7131 2 года назад

space wont grow though, it will be a grid fixed width and height (height = no of hash functions, width= any arbitrary large number)

@yuenyiupang Год назад

that mean u need to know what you looking for, say i wanna know the freq of "A", can it apply to i wanna the top 20 freq item?

@leozhang7341 3 года назад

''' Assumption: input stream could be single numbers or lower letters: 36 possibilities 8 columns 4 hash functions: 1. md5 2. sha1 3. sha256 4. sha384 ''' import hashlib from typing import List class CountMinSketch: def __init__(self): self.cols = 8 self.rows = 4 self.h1 = lambda x: hashlib.md5(x).hexdigest() self.h2 = lambda x: hashlib.sha1(x).hexdigest() self.h3 = lambda x: hashlib.sha256(x).hexdigest() self.h4 = lambda x: hashlib.sha384(x).hexdigest() self.hash_funcs = [self.h1, self.h2, self.h3, self.h4] self.matrix = [[0] * self.cols for _ in range(self.rows)] def __get_hashes(self, c: str) -> List[int]: hash_results = [] for func in self.hash_funcs: # Generate has value from each hash function as string hash_result = func(c.encode()) # Convert it into number and divide by columns size hash_results.append(int(hash_result, 16) % 8) return hash_results def process_element(self, c: str) -> None: hash_results = self.__get_hashes(c) # Update count values in the matrix for i, val in enumerate(hash_results): self.matrix[i][val] += 1 def get_freq(self, c: str) -> int: hash_results = self.__get_hashes(c) freqs = [] # Find frequence value for each hash function for i, val in enumerate(hash_results): freqs.append(self.matrix[i][val]) return min(freqs) if __name__ == '__main__': # Use list for simplicity input = ['1', 'v', 'd', 'c', '2', '1', 'c', 'a', '1', '0', 'v', '1', '1', '0', '1', '6', 'd'] cms = CountMinSketch() # Assume the last element of the list in the start of the data stream while len(input) > 0: c = input.pop() cms.process_element(c) print(cms.get_freq('1')) Great video! Thanks for the explanation. Just a side note: if 'e' is the most tolerented error rate, the width of the table is: w = 2 / e , and the height of the table is: h = log(e) / log(0.5). (ref: ieeexplore.ieee.org/abstract/document/6042851?casa_token=r2yECvKp2esAAAAA:4V0CffEkxHBXXe1rvs928EebsoOg79dKrxfIjhsjHMroV2t_-fiQrnDt1qmU2dqvV0NIHCkn4A)

@satishvedpathak9972 4 года назад

Updated: for hashFn = 1 to 4 for hashValue = 1 to n MAT(hashFn,hashValue) = 0; end for end for while (inpStream) for hashFn = 1 to 4 HV = s1(hashFn(inpStream)) MAT(hashFn,HV) = MAT(hashFn,HV)++; end for end while findFrequncey(inputSteam) for hashFn = 1 to 3 if ( MAT(hashFn,HV) > MAT(hashFn++,HV) freq = MAT(hashFn++,HV) else freq = MAT(hashFn,HV) end if end for return freq

@olenaolenaolena 3 года назад

wow, smart and cute)

@satishvedpathak9972 4 года назад

for hashFn = 1 to 4 for hashValue = 1 to n MAT(hashFn,hashValue) = 0; end for end for while (inpStream) for hashFn = 1 to 4 HV = s1(hashFn(inpStream)) MAT(hashFn,HV) = MAT(hashFn,HV)++; end for end while findFrequncey(inputSteam) for hashFn = 1 to 3 if ( MAT(hashFn,HV) > MAT(hashFn++,HV) freq = MAT(hashFn++,HV) else freq = MAT(hashFn,HV) end if end for return freq

@SHASHALIBRA 2 года назад

Did not understand how executing 4 hash function can have lower time complexity than map. Hash functions will always be excited 4 times but a hash function can return the result in O(1). Somebody plz help me understand

@rajeshji7698 4 года назад

Can u do videos for how Facebook like button works ?

@sarthakuiit 2 года назад

I am not able to understand how does it take sub linear space if we have to maintain a reasonable amount of accuracy? For 6x4 matrix it can’t store count of more than 24 elements accurately. (that too when hash functions are working perfectly, leaving at least 1 count accurately in the matrix) Then how is it different from a regular hashMap. There could be collisions.. yes.. but then also search time is LOG(N) for few of the elements which underwent collision. But then, let’s consider cost for counting in CMS - We need to use k hash functions to both put and get values with a risk of some inaccuracy. So I fail to understand, what great CMS is achieving when we call it ‘sub linear’ data structure.

@bharath_v 3 года назад

As you pointed out, if the hash value is same for both 'A' and 'B', then it might overcount, in such case we should increase the number of hash function.

@NK-ju6ns 2 года назад

Thanks. Helpful. However the video explains how count min sketch works but not why it works.

@sumonmal009 3 года назад

THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- sublinear 1:00 Why we need count min sketch 2:30 Count sketch 5:50 8:30 9:40 calculate frequency13:20 to 15:06