Settrie is a Python library implemented in C++ to create, update and query SetTrie objects.
Settrie was born from the need for a better implementation of the algorithm for our recommender system. It applies directly to text indexing: searching, in logarithmic time, for the documents that contain a given set of words. Think of a collection of documents as a container of the sets of words that each document has. Finding all the documents that contain a set of words amounts to finding the supersets of the set of words in the query. A SetTrie gives you the answer (the list of document names) in logarithmic time. The data structure is also used in auto-complete applications.
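To make the superset idea concrete, here is a minimal pure-Python sketch of a set trie. It is an illustration only, not Settrie's C++ implementation or its API: the class `SetTrieSketch`, its methods and the document names are all hypothetical, and it assumes the elements are mutually comparable (here, plain strings).

```python
class SetTrieSketch:
    """Toy set trie (hypothetical, for illustration): each set is stored
    as a root-to-node path labeled with its elements in sorted order."""

    def __init__(self):
        self.root = {"children": {}, "ids": []}

    def insert(self, elements, set_id):
        """Store a set under the given identifier."""
        node = self.root
        for e in sorted(set(elements)):
            node = node["children"].setdefault(e, {"children": {}, "ids": []})
        node["ids"].append(set_id)

    def supersets(self, query):
        """Return the sorted ids of all stored sets that contain every query element."""
        found = []
        self._search(self.root, sorted(set(query)), 0, found)
        return sorted(found)

    def _search(self, node, query, i, found):
        if i == len(query):
            # All query elements matched: every set below this node qualifies.
            self._collect(node, found)
            return
        target = query[i]
        for label, child in node["children"].items():
            if label == target:
                self._search(child, query, i + 1, found)
            elif label < target:
                # Extra element of a stored set; the target may still appear deeper.
                self._search(child, query, i, found)
            # label > target: paths are sorted, so the target cannot appear below.

    def _collect(self, node, found):
        found.extend(node["ids"])
        for child in node["children"].values():
            self._collect(child, found)


# Hypothetical documents, each reduced to its set of words.
index = SetTrieSketch()
index.insert({"set", "trie", "python"}, "doc_a")
index.insert({"set", "trie"}, "doc_b")
index.insert({"python", "cpp"}, "doc_c")

print(index.supersets({"set", "trie"}))  # ['doc_a', 'doc_b']
print(index.supersets({"python"}))       # ['doc_a', 'doc_c']
```

Keeping every path sorted is what allows the search to prune a branch as soon as its label exceeds the element it is looking for.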
Settrie is a C++ implementation with a Python interface. It is single-threaded and can seamlessly operate over really large collections of sets. The main structure is a tree in which each node takes exactly 20 bytes, so a billion nodes take 20 GB, plus some auxiliary structures to store identifiers and the like. The tree compresses the documents by sharing their common parts, and the documents are already compressed by being treated as sets of words. An off-the-shelf computer can hold in RAM a representation of terabytes of documents and return query results much faster than a user can type.
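The effect of sharing common parts can be sketched with a small node count. The snippet below is a rough illustration, not a measurement of Settrie: it builds a toy trie from sorted sets, counts its nodes, and multiplies by the 20-byte node size stated above. The helper `count_trie_nodes` and the sample sets are hypothetical, and the estimate ignores the auxiliary identifier structures.

```python
NODE_BYTES = 20  # per-node size stated in the text above


def count_trie_nodes(sets):
    """Count the nodes of a toy trie whose paths are the sorted elements of each set."""
    root = {}
    nodes = 0
    for s in sets:
        node = root
        for e in sorted(s):
            if e not in node:
                node[e] = {}   # new node only when the prefix is not shared yet
                nodes += 1
            node = node[e]
    return nodes


# Hypothetical sets that share the prefix a, b.
docs = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "b"},
]

total_elements = sum(len(d) for d in docs)            # 8 elements across all sets
nodes = count_trie_nodes(docs)                        # 4 nodes: a, a-b, a-b-c, a-b-d
estimated_bytes = nodes * NODE_BYTES                  # 80 bytes for the tree alone
billion_nodes_gb = 1_000_000_000 * NODE_BYTES / 1e9   # 20.0 GB, as stated above

print(total_elements, nodes, estimated_bytes, billion_nodes_gb)
```

Eight elements collapse into four nodes here because the three sets share their first two elements; how much real collections compress depends on how much their sets overlap.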
It is about 200 times faster and 20 times more memory efficient than a pure Python implementation.
The API is very easy to use. You can see this benchmark notebook for reference.
pip install mercury-settrie
To work with Settrie from the command line or to develop Settrie, you can set up an environment with git, gcc, make and the following tools:
git clone https://github.com/BBVA/mercury-settrie.git
cd mercury-settrie/src
make
Running make without arguments prints help. Try all the options; everything should work, assuming the tools are installed.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Copyright 2021-23, Banco de Bilbao Vizcaya Argentaria, S.A.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0