# PISA: Regression Tests for [Disks 4 & 5](https://trec.nist.gov/data_disks.html) (Robust04)

## Indexing

First, we will create a directory where all the indexes are going to be stored:

```
mkdir robust04
```

### Parsing

```bash
gzip -dc $(find /path/to/disk45/ -type f -name '*.*z' \
    \( -path '*/disk4/fr94/[0-9]*/*' -o -path '*/disk4/ft/ft*' \
    -o -path '*/disk5/fbis/fb*' -o -path '*/disk5/latimes/la*' \)) \
    | bin/parse_collection -f trectext -b 10000 --stemmer porter2 --content-parser html -o robust04/fwd
```

You can replace `gzip -dc` with `zcat` on Linux or `gzcat` on MacOS.
The directory `/path/to/disk45/` should be the root directory of [TREC Disks 4 & 5](https://trec.nist.gov/data_disks.html).

### Inverting

```bash
/path/to/pisa/build/bin/invert \
    -i robust04/fwd \
    -o robust04/inv \
    -b 400000
```

### Reordering
We perform [Recursive Graph Bisection (aka BP) algorithm](https://dl.acm.org/doi/10.1145/2939672.2939862), which is currently the state-of-the-art for minimizing the compressed space used by an inverted index (or graph) through document reordering. 

```bash
/path/to/pisa/build/bin/recursive_graph_bisection \
    -c robust04/inv \
    -o robust04/inv.bp \
    --documents robust04/fwd.doclex \
    --reordered-documents \
    robust04/fwd.bp.doclex
```

### Meta data
To perform BM25 queries it is necessary to build an additional file containing the information needed to compute the score, such as the document lengths. The following command builds a metadata file with block-max structure with blocks of fixed size of 64 postings:

```bash
/path/to/pisa/build/bin/create_wand_data \
    -c robust04/inv.bp \
    -b 64 \
    -o robust04/inv.bm25.bmw \
    -s bm25
```

### Index Compression

```bash
/path/to/pisa/build/bin/create_freq_index \
    -e block_simdbp \
    -c robust04/inv.bp \
    -o robust04/inv.block_simdbp \
    --check
```
## Retrieval

Queries can be downloaded from NIST:
[TREC 2004 Robust Track (Topics 301-450 & 601-700)](http://trec.nist.gov/data/robust/04.testset.gz)

```bash
wget http://trec.nist.gov/data/robust/04.testset.gz
gunzip 04.testset.gz
/path/to/pisa/build/bin/extract_topics -i 04.testset -o topics.robust2004
```
The above command will download the topics from the NIST website, extract the archive and parse topics in order to get `title`, `desc` and `narr` fields in separate files.

```
/path/to/pisa/build/bin/evaluate_queries \
    -e block_simdbp \
    -a block_max_wand \
    -i robust04/inv.block_simdbp \
    -w robust04/inv.bm25.bmw \
    --stemmer porter2 \
    --documents robust04/fwd.bp.doclex \
    --terms robust04/fwd.termlex \
    -k 1000 \
    --scorer bm25 \
    -q topics.robust2004.title \
    > run.robust2004.bm25.title.robust2004.txt
```

## Evaluation

Qrels can be downloaded from NIST:
[TREC 2004 Robust Track (Topics 301-450 & 601-700)](http://trec.nist.gov/data/robust/qrels.robust2004.txt)
```
wget http://trec.nist.gov/data/robust/qrels.robust2004.txt
```

[trec_eval](https://github.com/usnistgov/trec_eval) is the standard tool used by the TREC community for
evaluating an ad-hoc retrieval run, given the results file and a standard set of judged results (qrels).
It needs to be compiled and installed in order to perform the following command:

```
trec_eval -m map -m P.30 -m ndcg_cut.20 qrels.robust2004.txt run.robust2004.bm25.title.robust2004.txt
```

With the above commands, you should be able to replicate the following results:

```
map                     all 0.2543
P_30                    all 0.3139
ndcg_cut_20             all 0.4250
```

## Replication Log

+ Results replicated by [@amallia](https://github.com/amallia) on 2020-04-03 (commit [b01073](https://github.com/pisa-engine/pisa/commit/2b010731e6ea1b45a5f4a7caa9135a76219ed487))