PISA: Regression Tests for Disks 4 & 5 (Robust04)
Indexing
First, we will create a directory where all the indexes are going to be stored:
mkdir robust04
Parsing
gzip -dc $(find /path/to/disk45/ -type f -name '*.*z' \
\( -path '*/disk4/fr94/[0-9]*/*' -o -path '*/disk4/ft/ft*' \
-o -path '*/disk5/fbis/fb*' -o -path '*/disk5/latimes/la*' \)) \
| /path/to/pisa/build/bin/parse_collection -f trectext -b 10000 --stemmer porter2 --content-parser html -o robust04/fwd
You can replace gzip -dc with zcat on Linux or gzcat on macOS.
The directory /path/to/disk45/ should be the root directory of TREC Disks 4 & 5.
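Before parsing, it can help to confirm that the root directory has the layout the find expression above expects. The following is a minimal sketch, not part of PISA: the function name check_disk45 is hypothetical, and it only tests for the four subdirectories named in the command.

```shell
# Hypothetical helper (not part of PISA): verify that the collection root
# contains the four subdirectories matched by the find expression.
check_disk45() {
    root="$1"
    missing=0
    for sub in disk4/fr94 disk4/ft disk5/fbis disk5/latimes; do
        if [ ! -d "$root/$sub" ]; then
            # Report each missing subdirectory on stderr.
            echo "missing: $root/$sub" >&2
            missing=1
        fi
    done
    # Return 0 only if every subdirectory was found.
    return "$missing"
}
```

For example, check_disk45 /path/to/disk45 && echo "layout looks fine" prints the message only when all four subdirectories exist.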
Inverting
/path/to/pisa/build/bin/invert \
-i robust04/fwd \
-o robust04/inv \
-b 400000
Reordering
We apply the Recursive Graph Bisection (BP) algorithm, which is currently the state of the art for reducing the compressed size of an inverted index (or graph) through document reordering.
/path/to/pisa/build/bin/recursive_graph_bisection \
-c robust04/inv \
-o robust04/inv.bp \
--documents robust04/fwd.doclex \
--reordered-documents robust04/fwd.bp.doclex
Metadata
To run BM25 queries, an additional file is needed that contains the information required to compute scores, such as document lengths. The following command builds a metadata file with a block-max structure, using fixed-size blocks of 64 postings:
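To make the block-max idea concrete, the sketch below splits a list of scores into fixed-size blocks and prints the maximum of each block. It is a toy illustration only, not what create_wand_data does internally: the function name block_max and the one-score-per-line input format are assumptions made for this example.

```shell
# Toy illustration (hypothetical helper, not PISA code): read one score per
# line from stdin and print the maximum of every block of N consecutive lines.
block_max() {
    awk -v n="$1" '
        # Start a new block, or update the running maximum of the current one.
        { s = $1 + 0; if ((NR - 1) % n == 0 || s > max) max = s }
        # The block is full: emit its maximum.
        NR % n == 0 { print max }
        # Emit the maximum of a trailing partial block, if any.
        END { if (NR % n != 0) print max }
    '
}
```

Running printf '1\n5\n3\n2\n4\n' | block_max 2 prints 5, 3, and 4 on separate lines. With -b 64, each block would cover 64 consecutive postings, and a retrieval algorithm such as block_max_wand can use the per-block maxima as score upper bounds to skip blocks that cannot contribute to the top-k results.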
/path/to/pisa/build/bin/create_wand_data \
-c robust04/inv.bp \
-b 64 \
-o robust04/inv.bm25.bmw \
-s bm25
Index Compression
/path/to/pisa/build/bin/create_freq_index \
-e block_simdbp \
-c robust04/inv.bp \
-o robust04/inv.block_simdbp \
--check
Retrieval
Queries can be downloaded from NIST: TREC 2004 Robust Track (Topics 301-450 & 601-700)
wget http://trec.nist.gov/data/robust/04.testset.gz
gunzip 04.testset.gz
/path/to/pisa/build/bin/extract_topics -i 04.testset -o topics.robust2004
The above commands download the topics from the NIST website, extract the archive, and parse the topics to write the title, desc, and narr fields to separate files.
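As a quick sanity check, the number of topics implied by the two ID ranges can be computed and compared against the extracted title file. This is a sketch under one explicit assumption: that the output file holds one topic per line. The helper name count_topics is hypothetical.

```shell
# Expected number of topics from the two ID ranges (301-450 and 601-700).
expected_topics=$(( (450 - 301 + 1) + (700 - 601 + 1) ))

# Hypothetical helper: count non-empty lines in a file. This equals the
# number of topics only under the assumption of one topic per line.
count_topics() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

Under that assumption, [ "$(count_topics topics.robust2004.title)" -eq "$expected_topics" ] succeeds when the file contains all 250 topics.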
/path/to/pisa/build/bin/evaluate_queries \
-e block_simdbp \
-a block_max_wand \
-i robust04/inv.block_simdbp \
-w robust04/inv.bm25.bmw \
--stemmer porter2 \
--documents robust04/fwd.bp.doclex \
--terms robust04/fwd.termlex \
-k 1000 \
--scorer bm25 \
-q topics.robust2004.title \
> run.robust2004.bm25.title.robust2004.txt
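Before evaluating, the run file can be summarized to catch obvious problems such as missing queries. The sketch below assumes the standard TREC run format, in which the first whitespace-separated column is the query ID. The helper name run_summary is hypothetical.

```shell
# Hypothetical helper: summarize a TREC-format run file by counting the
# distinct query IDs and the largest number of results for any one query.
run_summary() {
    awk '
        # Count result lines per query ID (first column).
        { count[$1]++ }
        END {
            for (q in count) {
                queries++
                if (count[q] > max) max = count[q]
            }
            printf "queries=%d max_results=%d\n", queries, max
        }
    ' "$1"
}
```

For the run above, run_summary run.robust2004.bm25.title.robust2004.txt should report a max_results value no larger than 1000, since the query was issued with -k 1000.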
Evaluation
Qrels can be downloaded from NIST: TREC 2004 Robust Track (Topics 301-450 & 601-700)
wget http://trec.nist.gov/data/robust/qrels.robust2004.txt
trec_eval is the standard tool used by the TREC community for evaluating an ad-hoc retrieval run, given the results file and a standard set of judged results (qrels). It must be compiled and installed before running the following command:
trec_eval -m map -m P.30 -m ndcg_cut.20 qrels.robust2004.txt run.robust2004.bm25.title.robust2004.txt
With the above commands, you should be able to replicate the following results:
map all 0.2543
P_30 all 0.3139
ndcg_cut_20 all 0.4250
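To turn this comparison into an automated regression check, the trec_eval output can be saved to a file and each measure compared against its expected value. The following is a minimal sketch, assuming the three-column output shown above (measure, query scope, value). The helper name check_metric and the default tolerance of 0.0001 are choices made for this example, not part of PISA or trec_eval.

```shell
# Hypothetical helper: compare one measure in saved trec_eval output with an
# expected value. Usage: check_metric NAME EXPECTED FILE [TOLERANCE]
check_metric() {
    name="$1"; expected="$2"; results="$3"; tol="${4:-0.0001}"
    awk -v name="$name" -v want="$expected" -v tol="$tol" '
        # Match the summary line for the requested measure.
        $1 == name && $2 == "all" {
            found = 1
            diff = $3 - want
            if (diff < 0) diff = -diff
            ok = (diff <= tol)
        }
        # Succeed only if the measure was present and within tolerance.
        END { if (found && ok) exit 0; exit 1 }
    ' "$results"
}
```

For example, after redirecting the trec_eval output to a file named eval.txt (a name chosen here for illustration), check_metric map 0.2543 eval.txt returns success when the reported MAP matches the expected value.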