LIR – Help

Help contents

Quick tips	–	What are a few quick tips?
Ranked query	–	What's a ranked query?
Algorithms	–	Which ranking/weighting algorithms are supported?
Regular expressions	–	What are regular expressions?
Options	–	What do these options mean?

Quick tips

The query is performed as a "ranked query" (or "best match query", as opposed to an "exact match query").
There's no support for (hierarchical) grouping by parentheses; phrases ("...") are supported though.
Regular expressions are supported!
Terms beginning with a + sign must occur.
Terms beginning with a - sign must not occur.
You can modify ("boost") a term's weight by appending "^" and a factor to the term; the term's initial weight is multiplied by that factor (negative values are allowed).
Sample query: +indexierung ranking.*^1.5 automatisch^-1.5 -thesaurus
This finds documents that contain "indexierung" and preferably "ranking" (the ".*" indicates end-truncation) but not "thesaurus"; any occurrence of "automatisch" lowers the document's rank.
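
The query syntax described above can be illustrated with a small parser. This is only a sketch in Python, not the code lir.pl actually uses; the names `QueryTerm`, `TOKEN` and `parse_query` are hypothetical, and the handling of boosts on phrases is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class QueryTerm:
    text: str                # the term or phrase, without prefix or boost
    required: bool = False   # term was prefixed with "+"
    excluded: bool = False   # term was prefixed with "-"
    boost: float = 1.0       # factor given after "^" (1.0 if absent)


# One token: optional +/- prefix, a "quoted phrase" or a bare term,
# and an optional ^boost, followed by whitespace or the end of the query.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+?))(?:\^(-?\d+(?:\.\d+)?))?(?=\s|$)')


def parse_query(query: str) -> list:
    """Split a query string into QueryTerm objects."""
    terms = []
    for sign, phrase, word, boost in TOKEN.findall(query):
        terms.append(QueryTerm(
            text=phrase if phrase else word,
            required=(sign == "+"),
            excluded=(sign == "-"),
            boost=float(boost) if boost else 1.0,
        ))
    return terms


for term in parse_query("+indexierung ranking.*^1.5 automatisch^-1.5 -thesaurus"):
    print(term)
```

For the sample query this yields four terms: a required "indexierung", "ranking.*" with boost 1.5, "automatisch" with boost -1.5, and an excluded "thesaurus".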

Ranked query

A ranked query (or best match query), as opposed to an exact match query or Boolean query, improves query results by assigning each result a weight according to its relevance to the query and listing the results in descending order of relevance.
Thus, the most relevant documents appear at the top of the result list, and less relevant documents are not omitted altogether, as they are with Boolean queries, where documents that do not contain all of the query terms are simply dropped. [1]
This increases recall considerably, while the relevance ranking still preserves a degree of precision.
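
The difference between the two query types can be sketched in a few lines of Python. The toy collection, the function names and the use of the raw term count as a stand-in weight are all assumptions made for illustration; lir.pl's real weights are described under "Algorithms".

```python
def boolean_and(docs, terms):
    """Exact match: keep only documents that contain every query term."""
    return [doc_id for doc_id, words in docs.items()
            if all(t in words for t in terms)]


def ranked(docs, terms):
    """Best match: keep documents containing any term, most relevant first."""
    scores = {}
    for doc_id, words in docs.items():
        # Placeholder weight: how often the query terms occur in the document.
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document number.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: "indexing ranking ranking".split(),
    2: "indexing thesaurus".split(),
    3: "ranking".split(),
    4: "cataloguing".split(),
}
query = ["indexing", "ranking"]

print(boolean_and(docs, query))  # [1]
print(ranked(docs, query))       # [1, 2, 3]
```

The Boolean AND query returns only document 1, while the ranked query also returns documents 2 and 3, placed below the best match.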

What remains problematic is the concept of the relevance [2] of a given document to a particular query. The usual approach is to assign weights to the index terms and to derive a document's weight (which serves as a measure of its relevance) from the weights of the matching terms.
lir.pl offers term weights calculated by several different algorithms [3], so that their effects and their suitability as relevance measures can be compared.

Algorithms

Formulas and annotations are based on Lepsky [3] and Lohmann [4].

Legend:

docNum	N	Number of documents in the collection
docNum(W)	df	Number of documents containing basic form `W` ("document frequency")
colLen		Number of detected word forms in the collection ("collection length")
docLen(D)		Number of detected word forms in document `D`
formLen(D)		Number of distinct basic forms in document `D`
freq(W, D)	tf	Frequency of basic form `W` in document `D` ("term frequency")
freq(W)		Frequency of basic form `W` in the collection
len(W)		Character length of basic form `W`

w(W, D)		Weight of document `D` resulting from basic form `W`. The overall weight of the document, which expresses its relevance to the query, is the sum of the individual weights resulting from each of the query's terms.
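
To make the legend concrete, the following Python sketch derives these quantities from a toy collection and sums per-term weights into a document weight. It assumes every document is already given as a list of basic forms, so word forms and basic forms coincide here; the function names and dictionary keys are hypothetical and not taken from lir.pl.

```python
from collections import Counter


def collection_stats(docs):
    """Compute the legend's quantities; docs maps a document id to its basic forms."""
    freq_wd = {d: Counter(words) for d, words in docs.items()}       # freq(W, D)
    stats = {
        "docNum": len(docs),
        "colLen": sum(len(words) for words in docs.values()),
        "docLen": {d: len(words) for d, words in docs.items()},      # docLen(D)
        "formLen": {d: len(c) for d, c in freq_wd.items()},          # formLen(D)
        "freq_wd": freq_wd,
        "freq_w": Counter(),                                         # freq(W)
        "docNum_w": Counter(),                                       # docNum(W)
    }
    for counts in freq_wd.values():
        stats["freq_w"].update(counts)           # add the occurrences per form
        stats["docNum_w"].update(counts.keys())  # count each document once per form
    return stats


def document_weight(term_weight, query_terms, doc_id, stats):
    """Overall weight of a document: the sum of w(W, D) over the query's terms."""
    return sum(term_weight(w, doc_id, stats) for w in query_terms)


docs = {1: ["index", "rank", "rank"], 2: ["index", "thesaurus"]}
stats = collection_stats(docs)


def raw_tf(w, d, s):
    """Stand-in for w(W, D): the plain term frequency."""
    return s["freq_wd"][d][w]


print(stats["docNum"], stats["colLen"], stats["docNum_w"]["index"])  # 2 5 2
print(document_weight(raw_tf, ["index", "rank"], 1, stats))          # 3
```

Any of the weighting formulas below could be passed in place of `raw_tf`.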

Salton

w(W, D) = freq(W, D) * log(docNum / docNum(W))
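
As a worked example, the Salton weight can be computed as follows. This is a minimal Python sketch; the function and parameter names are hypothetical, and the natural logarithm is assumed because `log` is documented as base e for custom rankings.

```python
import math


def salton_weight(freq_wd, doc_num, doc_num_w):
    """w(W, D) = freq(W, D) * log(docNum / docNum(W)), natural logarithm assumed."""
    if freq_wd == 0 or doc_num_w == 0:
        return 0.0  # assumption: a term that does not occur contributes nothing
    return freq_wd * math.log(doc_num / doc_num_w)


# A form occurring 3 times in the document and in 10 of 1000 documents:
print(round(salton_weight(3, 1000, 10), 4))  # 13.8155
```

A form that occurs in every document gets the weight 0, because log(1) = 0.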

Kascade einfach (simple)

w₁(W) = 1 - docNum(W) / freq(W)
w₂(W, D) = 1 - ((docLen(D) / colLen) / (freq(W, D) / freq(W))) [at least 0]
w₃(W) = log(len(W)) / 4

w(W, D) = c₁ * w₁ + c₂ * w₂ + c₃ * w₃

c₁, c₂, c₃: arbitrary constants, default value: 1
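
The three components can be combined as in the following Python sketch. The function and parameter names are hypothetical, the natural logarithm is assumed, and treating w₂ as 0 for a form that does not occur in the document is an assumption based on the "at least 0" bound.

```python
import math


def kascade_simple(freq_wd, freq_w, doc_num_w, doc_len, col_len, word_len,
                   c1=1.0, c2=1.0, c3=1.0):
    """Weight of a document for one basic form, following the three formulas above."""
    # w1: forms concentrated in few documents score higher.
    w1 = 1 - doc_num_w / freq_w
    # w2: compares the document's share of the collection with its share of
    # the form's occurrences; clamped to at least 0.
    if freq_wd == 0:
        w2 = 0.0  # assumption: the quotient grows without bound, so the clamp applies
    else:
        w2 = max(0.0, 1 - (doc_len / col_len) / (freq_wd / freq_w))
    # w3: longer forms score slightly higher.
    w3 = math.log(word_len) / 4
    return c1 * w1 + c2 * w2 + c3 * w3


# Form of length 11, 4 of its 20 occurrences in a 100-form document,
# found in 5 documents of a 10000-form collection:
print(round(kascade_simple(4, 20, 5, 100, 10000, 11), 4))  # 2.2995
```

Here w₁ = 0.75, w₂ = 0.95 and w₃ is about 0.5995, which add up to roughly 2.2995 with the default constants.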

Kascade komplex (complex)

w₁(W) = 1 - docNum(W) / E(docNum(W)) [at least 0]

E(docNum(W)) = formLen(D) * (1 - e^-λ)

λ = freq(W) / colLen

w₂(W, D) = (p(1) * 1 + ... + p(freq(W, D)) * freq(W, D)) / λ

p(i) = e^-λ * λⁱ / i!

λ = freq(W) * docLen(D) / colLen

w₃(W) = log(len(W)) / 4

w(W, D) = c₁ * w₁ + c₂ * w₂ + c₃ * w₃

c₁, c₂, c₃: arbitrary constants, default value: 1
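
The Poisson-based component w₂ is the least obvious part, so here is a small Python sketch of it alone; w₁ and w₃ are left out. The function and parameter names are hypothetical.

```python
import math


def poisson(i, lam):
    """p(i) = e^-lam * lam^i / i!"""
    return math.exp(-lam) * lam ** i / math.factorial(i)


def kascade_complex_w2(freq_wd, freq_w, doc_len, col_len):
    """w2(W, D) = (p(1) * 1 + ... + p(freq(W, D)) * freq(W, D)) / lambda."""
    # Expected number of occurrences of the form in a document of this length.
    lam = freq_w * doc_len / col_len
    return sum(poisson(i, lam) * i for i in range(1, freq_wd + 1)) / lam


# 20 occurrences in a 10000-form collection, document of 100 forms: lambda = 0.2
print(round(kascade_complex_w2(1, 20, 100, 10000), 4))  # 0.8187
print(round(kascade_complex_w2(2, 20, 100, 10000), 4))  # 0.9825
```

Since p(i) * i / λ equals p(i - 1), w₂ is the Poisson probability of at most freq(W, D) - 1 occurrences; it therefore stays below 1 and grows with the term frequency.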

Robertson

w(W, D) = (c + 1) * freq(W, D) / (c + freq(W, D)) * log((docNum - docNum(W) + 0.5) / (docNum(W) + 0.5))

c: arbitrary constant

IDF

w(W) = log(docNum / docNum(W))

Custom ranking

Build your own ranking algorithm by using the following components to form an arithmetic expression:

Special variables:

tf	Term frequency	= freq(W, D)
df	Document frequency	= docNum(W)
N	Number of documents in the collection	= docNum

Arithmetic operators:

+	Addition
-	Subtraction
*	Multiplication
/	Division
%	Modulus
**	Exponent
()	Parentheses

Arithmetic functions:

atan2(Y,X)	Arctangent of Y/X in the range -π to π
cos(EXPR)	Cosine of EXPR (expressed in radians)
exp(EXPR)	e to the power of EXPR
int(EXPR)	The integer portion of EXPR
log(EXPR)	Logarithm (base e) of EXPR
rand(EXPR)	A random fractional number between 0 and the value of EXPR (EXPR should be positive)
sin(EXPR)	Sine of EXPR (expressed in radians)
sqrt(EXPR)	Square root of EXPR

Examples:

Salton	`tf * log(N / df)`
Robertson	`(c + 1) * tf / (c + tf) * log((N - df + 0.5) / (df + 0.5))` (replace `c` with a suitable constant)
IDF	`log(N / df)`
Default:

TFIDF	`tf * log(N / df)`
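
To show how such an expression is evaluated for one term and one document, here is a rough Python sketch. It is not how lir.pl evaluates expressions; `evaluate_ranking` and `ALLOWED` are hypothetical names, and details of individual operators (for example `%` on non-integers) may differ from lir.pl's behaviour.

```python
import math
import random

# The arithmetic functions listed above, mapped to Python equivalents.
ALLOWED = {
    "atan2": math.atan2,
    "cos": math.cos,
    "exp": math.exp,
    "int": int,                              # truncates toward zero
    "log": math.log,                         # base e
    "rand": lambda x: random.random() * x,   # fraction in [0, x)
    "sin": math.sin,
    "sqrt": math.sqrt,
}


def evaluate_ranking(expression, tf, df, N):
    """Evaluate a custom ranking expression with the special variables bound."""
    names = dict(ALLOWED, tf=tf, df=df, N=N)
    # An empty __builtins__ hides Python's built-in functions from the
    # expression; this is a convenience for the sketch, not a security boundary.
    return eval(expression, {"__builtins__": {}}, names)


print(round(evaluate_ranking("tf * log(N / df)", tf=3, df=10, N=1000), 4))  # 13.8155
print(evaluate_ranking("int(sqrt(tf ** 2 + 7))", tf=3, df=10, N=1000))      # 4
```

The operators `+`, `-`, `*`, `/`, `%`, `**` and parentheses are written the same way in Python, which is why the expressions can be passed through unchanged.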

No ranking
Documents are sorted by document number instead.

Regular expressions

In short:

. stands for any character
[] stands for any enumerated character
E.g. [aeiou] matches any vowel.
(|) stands for any enumerated string
E.g. (ai|ei|ay|ey) matches any of "ai", "ei", "ay", "ey" (| meaning "or").
=> Not sure how to write Dewey? Try this one Dew(ai|ei|ay|ey).
* repeats the previous symbol zero or more times
? repeats the previous symbol zero or one time
+ repeats the previous symbol one or more times
This makes .* work like the familiar truncation operator
=> Indexierung.* finds anything beginning with "Indexierung".
() groups characters
E.g. (ierung)? matches "ierung" at most once.
=> So Index(ierung)? searches for "Index" or "Indexierung".
(?i:string) makes string matching case-insensitive
(While this is the default behaviour you can use (?-i:string) to match case-sensitively.)
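
The patterns above can be tried out with Python's `re` module, whose syntax agrees with Perl's for these constructs. The helper `matches` is hypothetical; matching the pattern against the whole word and ignoring case by default are assumptions chosen to mirror the behaviour described above.

```python
import re


def matches(pattern, word, ignore_case=True):
    """Return True if the pattern matches the whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, word, flags) is not None


print(matches("Dew(ai|ei|ay|ey)", "Dewey"))               # True
print(matches("Indexierung.*", "Indexierungsverfahren"))  # True
print(matches("Index(ierung)?", "Index"))                 # True
print(matches("Index(ierung)?", "Indexieren"))            # False
print(matches("dewey", "Dewey"))                          # True
print(matches("(?-i:Dewey)", "dewey"))                    # False
```

The last two lines show the case handling: matching ignores case by default, and `(?-i:...)` switches case sensitivity back on for the enclosed string.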

For more information see:

Regular expression - From Wikipedia, the free encyclopedia
Regular Expressions - A slightly outdated tutorial
Perl regular expressions - Perl 5 documentation

Options

regular expression
Regular expressions are a powerful means of formulating queries, but because searching for "fixed strings" is about 50% faster, you have to enable the "regular expression" option explicitly.
(NB: The "end-truncated" and "case-sensitive" options imply "regular expression"!)
end-truncated
...
case-sensitive
...

Footnotes

[1]	Strictly speaking, this applies only to Boolean AND queries. OR queries, however, carry no information about a result's relevance: the relevant results are simply scattered across the result list. In fact, a ranked query in its simplest form is a Boolean OR query with relevance ranking.
[2]	For a comprehensive survey see: Saracevic: RELEVANCE: A Review of and a Framework for the Thinking on the Notion in Information Science [pdf]. Probabilistic IR: Crestani et al.: "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval [pdf].
[3]	Lepsky: Automatisches Indexieren [pdf]. Larson & Hearst: Term Weighting and Ranking Algorithms [pdf]. Salton & Buckley: Term-weighting approaches in automatic text retrieval. In: Information Processing & Management vol. 24, no. 5, pp. 513-523, 1988.
[4]	Lohmann: KASCADE: Dokumentanreicherung und automatische Inhaltserschließung - Projektbericht und Ergebnisse des Retrievaltests. Düsseldorf, 2000. (Schriften der Universitäts- und Landesbibliothek Düsseldorf; 31).

*	The lir.pl databases "LIR" and "Literatur zur Inhaltserschließung" also offer many references on these and related topics.