Indexing 3: sparseness and linear merge

Подписаться 59 тыс.

Просмотров 11 тыс.

50% 1

Inverted indices are sparse: most of the entries in the list are zero. We store only non-zero values as a set of (docid,frequency) tuples. We use a variant of the linear merge algorithm to find a set of documents matching the query. An integral part of the algorithm is a scoring function, which combines entries for the same document across multiple inverted lists.

Наука

Опубликовано:

1 июн 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 5

@aamirmirza8638 8 лет назад

Search concepts very well explained, thank for posting.

@xaviermt1 2 года назад

Hey there Victor. I am trying to solve a related problem to this for a personal project I am working on. Say I have a set of documents D in some natural language. Each document is a(n ordered multi-) set of words, Wi. The set of all words in this language is W. Say I have some subset of W, K. Then say I have a subset of K, S. I want to efficiently find all documents in D such that: 1) the document contains at least one word from S. 2) for all words in the document, that word is an element of K. My first extremely naive solution to this was simply putting this data into a normalized relational database, and finding the set of documents for a given K and S using a join. This very quickly becomes too slow on a dataset of day, 2 million documents (where typically a document is a sentence or two long). Two avenues I am considering exploring are building a word trie and building an inverted index. I studied undergrad computer science a few years ago but now no longerwork in tech, so I don't have any academic or industry people to ask (none of the industry people I discussed this problem with had a good suggestion. The problem is different to typical search engine search, which generally doesn't care about the set K, it is just looking for matches based on the smaller set S. After watching this video of yours, I am inspired to code up an adaption of the data structure you have described, but with modifications for my particular use case. But I was hoping that if you or anyone reading this can tell me 'the problem you are working on has been solved, just go check out algorithm X or data structure Y', that would be amazing. Thanks for any help! Some further context: for my problem, both D and W are fixed (D is several million, W is in the order of high 10s or low hundreds of thousands, ie a natural language). But I want to build a data structure (whether in postgres, neo4j, reddis, or even just in some application language like C or Python) that I can query with a variety of K and S values, and quickly find the subset of D that satisfies the above constraints.