Sharding¶
We support partitioning a collection into a number of smaller subsets called shards.
Right now, only a forward index can be partitioned by running partition_fwd_index
command.
Then, the resulting shards must be inverted individually with invert
command.
For convenience, we provide script/invert-shards
that takes a file prefix
to shard forward indexes and inverts them all.
partition_fwd_index
¶
Partition a forward index
Usage: ./bin/partition_fwd_index [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--input TEXT REQUIRED Forward index filename
-o,--output TEXT REQUIRED Basename of partitioned shards
-j,--threads INT Thread count
-r,--random-shards INT Excludes: --shard-files
Number of random shards
-s,--shard-files TEXT ... Excludes: --random-shards
List of files with shard titles
--debug Print debug messages
For example, one can partition collection randomly:
$ partition_fwd_index \
-j 8 # use up to 8 threads at a time
-i full_index_prefix \
-o shard_prefix \
-r 123 # partition randomly into 123 shards
Alternatively, a set of files can be provided.
Let’s assume we have a folder shard-titles
with a set of text files.
Each file contains new-line-delimited document titles (e.g., TREC-IDs) for one partition.
Then, one would call:
$ partition_fwd_index \
-j 8 # use up to 8 threads at a time
-i full_index_prefix \
-o shard_prefix \
-s shard-titles/*
Note that the names of the files passed with -s
will be ignored.
Instead, each shard will be assigned a numerical ID from 0
to N - 1
in order
in which they are passed in the command line.
Then, each resulting forward index will have appended .ID
to its name prefix:
shard_prefix.000
, shard_prefix.001
, and so on.
invert-shards.sh
¶
This script inverts all shards with a common prefix.
USAGE:
invert-shards <PROGRAM> <INPUT_BASENAME> <OUTPUT_BASENAME> [program flags]
For example, if the following command was used to partition a collection:
$ partition_fwd_index \
-j 8 # use up to 8 threads at a time
-i full_index_prefix \
-o shard_prefix \
-r 123 # partition randomly into 123 shards
Then, one can invert the shards by executing the following script:
$ invert-shards.sh \
/path/to/invert # provide path to program
shard_prefix # basename to shard collections
shard_prefix_inverted # basename to shard inverted indexes
-j 8 -b 1000 # any arguments to be appended to each program execution
compress-shards.sh
¶
Next, you can compress the inverted shards with compress-shards.sh
:
USAGE:
compress-shards <PROGRAM> <INPUT_BASENAME> <OUTPUT_BASENAME> [program flags]
For example, following the above example:
$ compress-shards.sh \
/path/to/create_freq_index # provide path to program
shard_prefix_inverted # basename to shard inverted indexes
shard_prefix_inverted_simdbp # basename to shard compressed indexes
-t block_simdbp --check # any arguments to be appended to each program execution
Note that this script can be also used for creating WAND data by replacing the program:
$ compress-shards.sh \
/path/to/create_wand_data # provide path to program
shard_prefix_inverted # basename to shard inverted indexes
shard_prefix_inverted_wand # basename to shard compressed indexes