We often want to find a specific record in the reams of unorganized data every organization produces: log files, CSVs, or Hadoop-style batch outputs. To organize that data for efficient access, we can slurp it into a hash map keyed by some field, at a high cost in memory. Alternatively, we can pour it into an indexed database, which adds complexity, network latency, and new failure domains. We’ll demonstrate how a generic, durable hash index, built on a variant of Clojure’s persistent maps, provides a cost-effective way to gain leverage over disparate, unorganized datasets.
May 10, 2023