Parsing
=======

A _forward index_ is a data structure that stores the term identifiers associated with every document. Conversely, an _inverted index_ stores, for each unique term, the identifiers of the documents in which it appears (usually along with a numeric value used for ranking purposes, such as the raw frequency of the term within the document). The objective of the parsing process is to represent a given collection as a forward index.

To parse a collection, use the `parse_collection` command:

```
parse_collection - parse collection and store as forward index.
Usage: parse_collection [OPTIONS] [SUBCOMMAND]

Options:
  -h,--help                   Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info
                              Log level
  -j,--threads UINT           Number of threads
  --tokenizer TEXT:{english,whitespace}=english
                              Tokenizer
  -H,--html UINT=0            Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...
                              Token filters
  --stopwords TEXT            Path to file containing a list of stop words to filter out
  --config TEXT               Configuration .ini file
  -o,--output TEXT REQUIRED   Forward index filename
  -b,--batch-size INT=100000  Number of documents to process in one thread
  -f,--format TEXT=plaintext  Input format

Subcommands:
  merge                       Merge previously produced batch files. When parsing
                              process was killed during merging, use this command
                              to finish merging without having to restart building
                              batches.
```

For example:

```
$ mkdir -p path/to/forward
$ zcat ClueWeb09B/*/*.warc.gz | \    # pass unzipped stream in WARC format
    parse_collection \
        -j 8 \                       # use up to 8 threads at a time
        -b 10000 \                   # one thread builds up to 10k documents in memory
        -f warc \                    # use the WARC format
        -F lowercase porter2 \       # lowercase and stem every term (using the Porter2 algorithm)
        --html \                     # strip HTML markup before extracting tokens
        -o path/to/forward/cw09b
```

If you get the error `-bash: /bin/zcat: Argument list too long`, you can produce the unzipped stream with:

```
$ find ClueWeb09B -name '*.warc.gz' -exec zcat -q {} \;
```

The parsing process will write the following files:

* `cw09b`: the forward index in binary format.
* `cw09b.terms`: a newline-delimited list of sorted terms, where the term with ID N is on line N, with N starting from 0.
* `cw09b.termlex`: a binary representation (lexicon) of the `.terms` file, used to look up term identifiers at query time.
* `cw09b.documents`: a newline-delimited list of document titles (e.g., TREC IDs), where the document with ID N is on line N, with N starting from 0.
* `cw09b.doclex`: a binary representation of the `.documents` file, used to look up document identifiers at query time.
* `cw09b.urls`: a newline-delimited list of URLs, where the URL with ID N is on line N, with N starting from 0; each ID corresponds to an ID in the `cw09b.documents` file.

### Generating mapping files

Once the forward index has been generated, the binary document map and lexicon files are built automatically. However, they can also be built with the `lexicon` utility by providing the newline-delimited file as input. The `lexicon` utility also supports efficient look-ups into, and dumping of, these binary mapping files.
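Because the `.terms`, `.documents`, and `.urls` files are plain newline-delimited text, you can sanity-check them with standard shell tools before building or rebuilding the binary mappings. A minimal sketch, assuming the `cw09b` output name from the example above:

```
$ wc -l < path/to/forward/cw09b.terms        # number of unique terms in the collection
$ head -n 3 path/to/forward/cw09b.documents  # titles of the documents with IDs 0, 1, and 2
$ sed -n '100p' path/to/forward/cw09b.terms  # the term with ID 99 (IDs start at 0)
```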
Examples of the `lexicon` command are shown below:

```
Build, print, or query lexicon
Usage: lexicon [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info
                              Log level
  --config TEXT               Configuration .ini file

Subcommands:
  build                       Build a lexicon
  lookup                      Retrieve the payload at index
  rlookup                     Retrieve the index of payload
  print                       Print elements line by line
```

For example, assume we have the following plaintext, newline-delimited file, `example.terms`:

```
aaa
bbb
def
zzz
```

We can generate a lexicon as follows:

```
./bin/lexicon build example.terms example.lex
```

You can dump the binary lexicon back to a plaintext representation:

```
./bin/lexicon print example.lex
```

It should output:

```
aaa
bbb
def
zzz
```

You can retrieve the term with a given identifier:

```
./bin/lexicon lookup example.lex 2
```

which outputs:

```
def
```

Finally, you can retrieve the ID of a given term:

```
./bin/lexicon rlookup example.lex def
```

It outputs:

```
2
```

_NOTE_: This requires the input file to be lexicographically sorted, because `rlookup` uses binary search for reverse lookups.

### Supported stemmers

* [Porter2](https://snowballstem.org/algorithms/english/stemmer.html)
* [Krovetz](https://dl.acm.org/doi/abs/10.1145/160688.160718)

Both are English stemmers. Unfortunately, PISA does not support any other languages at the moment. Contributions are welcome.

### Supported formats

The following raw collection formats are supported:

* `plaintext`: every line contains the document's title first, then any number of whitespace characters, followed by the content, delimited by a newline character (see the example at the end of this section).
* `trectext`: TREC newswire collections.
* `trecweb`: TREC web collections.
* `warc`: Web ARChive format as defined in [the format specification](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/).
* `wapo`: TREC Washington Post Corpus.

If you want to parse a set of files where each file is a single document (for example, the [wiki-large](http://dg3rtljvitrle.cloudfront.net/wiki-large.tar.gz) collection), use the `files2trec.py` script to convert it to TREC format (note that each relative file path is used as the document ID). Once the file is generated, parse it with the `parse_collection` command, passing `trectext` to the `--format` option.
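As a minimal end-to-end sketch of the `plaintext` format, the following builds a forward index for a two-document toy collection; the file name `tiny.plaintext` and the output path `tiny-fwd/tiny` are illustrative, not prescribed by PISA:

```
$ cat > tiny.plaintext << 'EOF'
doc1 the quick brown fox
doc2 jumps over the lazy dog
EOF
$ mkdir -p tiny-fwd
$ parse_collection -f plaintext -F lowercase -o tiny-fwd/tiny < tiny.plaintext
```

If everything worked, `tiny-fwd/tiny.documents` should list the two titles `doc1` and `doc2`, one per line.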