Document Reordering

PISA supports reassigning document IDs that were initially assigned in order of parsing. The point of doing it is usually to decrease the index size or speed up query processing. This part is done on an uncompressed inverted index. Depending on the method, you might also need access to some parts of the forward index. We support the following ways of reordering:

  • random,
  • by a feature (such as URL or document TREC ID),
  • with a custom-defined mapping, and
  • recursive graph bisection. All of the above are supported by a single command reorder-docids. Below, we explain each method and show some examples of running the command.

Reordering document lexicon

All methods can optionally take a path to a document lexicon and make a copy of it that reflects the produced reordering.

reorder-docids \
    --documents /path/to/original/doclex \
    --reordered-documents /path/to/reordered/doclex \
    ...

Typically, you will want to do that if you plan to evaluate queries, which will need access to a correct document lexicon.

NOTE: Because these options are common to all reordering methods, we ignore them below for brevity.

Random

Random document reordering, as the name suggests, randomly shuffles all document IDs. Additionally, it can take a random seed. Two executions of the command with the same seed will produce the same final ordering.

reorder-docids --random \
    --collection /path/to/inv \
    --output /path/to/inv.random \
    --seed 123456789 # optional

By feature (e.g., URL or TRECID)

An index can be reordered according to any single document feature, such as URL or TRECID, as long as it is stored in a text file line by line, where line n is the feature of document n in the original order.

In particular, our collection parsing command produces two such feature files:

  • *.documents, which is typically a list of TRECIDs,
  • *.urls, which is a list of document URLs.

To use either, you simply need to run:

reorder-docids \
    --collection /path/to/inv \
    --output /path/to/inv.random \
    --by-feature /path/to/feature/file

From custom mapping

You can also produce a mapping yourself and feed it to the command. Such mapping is a text file with two columns separated by a whitespace:

<original ID> <new ID>

Having that, reordering is as simple as running:

reorder-docids \
    --collection /path/to/inv \
    --output /path/to/inv.random \
    --from-mapping /path/to/custom/mapping

Recursive Graph Bisection

We provide an implementation of the Recursive Graph Bisection (aka BP) algorithm, which is currently the state-of-the-art for minimizing the compressed space used by an inverted index (or graph) through document reordering. The algorithm tries to minimize an objective function directly related to the number of bits needed to store a graph or an index using a delta-encoding scheme.

Learn more from the original paper:

L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proc. SIGKDD, pages 1535–1544, 2016.

In PISA, you simply need to pass --recursive-graph-bisection option (or its alias --bp) to the reorder-docids command.

reorder-docids --bp \
    --collection /path/to/inv \
    --output /path/to/inv.random

Note that --bp allows for some additional options. For example, the algorithm constructs a forward index in memory, which is in a special format separate from the PISA forward index that you obtain from the parse_collection tool. You can instruct reorder-docids to store that intermediate structure (--store-fwdidx), as well as provide a previously constructed one (--fwdidx), which can be useful if you want to reuse it for several runs with different algorithm parameters. To see all available parameters, run reorder-docids --help.