Okay there are several gaps. When dealing with large scale data you'll likely need to partition your tries. If you partition. Your tries how do you partition them? If you partition your tries by hash of the full search queries they represent then it will be tricky because it will be hard for you to determine from a prefix what partition to go to? The prefix 's hash could send you to a totally different partition
great design! thanks a lot! just one question hope you could help. generating trie is kind of distributed right? for example, a-bi, bj-cid...,suppose pefix sorted. each sub-trie contributes to hashmap construction, it is possible that lower level sub-trie could overlap on higher level prefix key. and a heap could be used to filter out top k frequency. is my understanding right?
Hi! Unfortunately, I don't know too much about tries. It's been a while since I took Data Structures, haha. Hopefully someone else can help answer here.
So I have two questions 1. If we are going for the TRIE data structure then there is no need for the table 3:23 right, we just build the TRIE ds and store it in the DB. From there, we can either serve the data either from the DB or cache right ? 2. Every once a week, we can run a Cron job to execute a spark job which will take whatever there is in our logs and update our TRIE data structure and update it in our DB. Is my understanding correct ?
Hi! Sure. So as part of the optimization, for every Trie node, we are storing all the leaf nodes you could reach from that node. Let’s say we have a node with prefix “ca”. Some leaf nodes you could reach from this node are: canada, cat, can, cambodia, etc. So in the Trie node “ca” we can store all those leaf nodes. And then we can store that Trie node in Cassandra with the key being “ca”. Now, if you want to find all the leaf nodes for the node “ca”, instead of parsing the tree, you can just query Cassandra for primary key “ca”.
Thanks for sharing. However, I do feel there are more nuances than simply using a trie structure. For example, how can each user's personal interest be reflected? What if some users start with uncommon query which isn't stored in the database but the intent can be clearly predicted?