Sharding

We support partitioning a collection into a number of smaller subsets called shards. Right now, only a forward index can be partitioned by running partition_fwd_index command. For convenience, we provide shards command that supports certain bulk operations on all shards.

Partitioning collection

We support two methods of partitioning: random, and by a defined mapping. For example, one can partition collection randomly:

$ partition_fwd_index \
    -j 8                          # use up to 8 threads at a time
    -i full_index_prefix \
    -o shard_prefix \
    -r 123                          # partition randomly into 123 shards

Alternatively, a set of files can be provided. Let’s assume we have a folder shard-titles with a set of text files. Each file contains new-line-delimited document titles (e.g., TREC-IDs) for one partition. Then, one would call:

$ partition_fwd_index \
    -j 8                          # use up to 8 threads at a time
    -i full_index_prefix \
    -o shard_prefix \
    -s shard-titles/*

Note that the names of the files passed with -s will be ignored. Instead, each shard will be assigned a numerical ID from 0 to N - 1 in order in which they are passed in the command line. Then, each resulting forward index will have appended .ID to its name prefix: shard_prefix.000, shard_prefix.001, and so on.

Working with shards

The shards tool allows to perform some index operations in bulk on all shards at once. At the moment, the following subcommands are supported:

  • invert,
  • compress,
  • wand-data, and
  • reorder-docids.

All input and output paths passed to the subcommands will be expanded for each individual shards by extending it with .<shard-id> (e.g., .000) or, if substring {} is present, then the shard number will be substituted there. For example:

shards reorder-docids --by-url \
    -c inv \
    -o inv.url \
    --documents fwd.{}.doclex \
    --reordered-documents fwd.url.{}.doclex

is equivalent to running the following command for every shard XYZ:

reorder-docids --by-url \
    -c inv.XYZ \
    -o inv.url.XYZ \
    --documents fwd.XYZ.doclex \
    --reordered-documents fwd.url.XYZ.doclex