# Document Reordering ## Document name & URL based ordering To reorder an inverted index based on document names or their URLs, use the `generate_sorted_docids_mapping.py` script and the `shuffle_docids` command. The first one allows to generate a file in which each line maps a `` with a ``. These identifiers are generated based on the lexicographical order of document names (or their URLs). Furthermore, this file serves as input to the `shuffle_docids` command, which materializes the reordering operation. ### 1. Mapping file creation To generate the mapping file, it is necessary to take into account that each line number of the `.documents` and `.urls` files (which are part of the forward index) correspond to its docid. So, the first document of `.documents` (line 0, and therefore docid = 0) is equivalent to the first URL in the `.urls` file, and so on. In this way, if you want to generate the mapping file to perform a reordering using either files, you should use the following script: ``` usage: script/generate_sorted_docids_mapping.py [-h] documents output Take a text file lexicon (e.g. '.documents' or '.urls' file) and sort it, generating a file mapping ' ' to use with the 'suffle_docids' script. positional arguments: documents File containing one document (or URL) per line and where each line number (starting from zero) represents its docid output Output file mapping ' ' optional arguments: -h, --help show this help message and exit ``` For example: ``` $ python script/generate_docids_sorted_map.py path/to/inverted.urls mapping.txt ``` Note that in the example the `.urls` file is used for reordering. If you want to generate the mapping based on document names, the `.documents` file should be used instead. ### 2. Index remapping Once the mapping file has been created, you must use the `shuffle_docids` command to generate the new inverted index: ``` Usage: ./bin/shuffle_docids [ordering file] Ordering file is of the form ``` For example: ``` $ mkdir -p path/to/ordered $ ./bin/shuffle_docids path/to/inverted path/to/ordered/inverted mapping.txt ``` ## Random Ordering If you want to perfom a random ordering, use the `shuffle_docids` command (described in the previous section), but without specifying the `[ordering file]` option. ## Recursive Graph Bisection Recursive graph bisection algorithm used for inverted indexed reordering. ### Description Implementation of the *Recursive Graph Bisection* (aka *BP*) algorithm which is currently the state-of-the-art for minimizing the compressed space used by an inverted index (or graph) through document reordering. The algorithm tries to minimize an objective function directly related to the number of bits needed to storea graph or an index using a delta-encoding scheme. > L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, andA. Shalita. Compressing graphs and indexes with recursive graph bisec-tion. InProc. SIGKDD, pages 1535–1544, 2016. ### Usage ``` Recursive graph bisection algorithm used for inverted indexed reordering. Usage: ./bin/recursive_graph_bisection [OPTIONS] Options: -h,--help Print this help message and exit -c,--collection TEXT REQUIRED Collection basename -o,--output TEXT Output basename --store-fwdidx TEXT Output basename (forward index) --fwdidx TEXT Use this forward index -m,--min-len UINT Minimum list threshold -d,--depth INT in [1 - 64] Excludes: --config Recursion depth -t,--threads UINT Thread count --prelim UINT Precomputing limit --config TEXT Excludes: --depth Node configuration file --nogb No VarIntGB compression in forward index -p,--print Print ordering to standard output ```