Quick tips | – | What are a few quick tips? |
Ranked query | – | What's a ranked query? |
Algorithms | – | Which ranking/weighting algorithms are supported? |
Regular expressions | – | What are regular expressions? |
Options | – | What do these options mean? |
+ | A term preceded by this sign must occur. |
- | A term preceded by this sign must not occur. |
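As a rough illustration of how the two signs restrict a result list, the following Python sketch filters a toy collection. The function name `filter_by_signs` and the representation of each document as a set of terms are assumptions made for this example; they do not describe how lir.pl works internally.

```python
def filter_by_signs(query, docs):
    """Return the ids of documents that satisfy the +/- constraints of a query.

    query: string such as "+index -thesaurus ranking"
    docs:  dict mapping a document id to the set of terms it contains
    Terms without a sign impose no constraint here (they would only affect ranking).
    """
    required, forbidden = set(), set()
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])      # term must occur
        elif token.startswith("-") and len(token) > 1:
            forbidden.add(token[1:])     # term must not occur
    return [
        doc_id
        for doc_id, terms in docs.items()
        if required <= terms and not (forbidden & terms)
    ]


# Hypothetical toy collection
docs = {
    "d1": {"index", "ranking"},
    "d2": {"index", "thesaurus"},
    "d3": {"ranking"},
}
print(filter_by_signs("+index -thesaurus ranking", docs))  # ['d1']
```

Only `d1` survives: `d2` contains the excluded term and `d3` lacks the required one.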
Salton |
Kascade einfach (simple) |
Kascade komplex (complex) |
Robertson |
IDF |
Custom |
None |
docNum | N | Number of documents in the collection |
docNum(W) | df | Number of documents containing basic form W ("document frequency") |
colLen | | Number of detected word forms in the collection ("collection length") |
docLen(D) | | Number of detected word forms in document D |
formLen(D) | | Number of distinct basic forms in document D |
freq(W, D) | tf | Frequency of basic form W in document D ("term frequency") |
freq(W) | | Frequency of basic form W in the collection |
len(W) | | Character length of basic form W |
w(W, D) | | Weight of document D resulting from basic form W. The overall weight of the document, which expresses its relevance to the query, is the sum of the individual weights resulting from each of the query's terms. |
tf | Term frequency | = freq(W, D) |
df | Document frequency | = docNum(W) |
N | Number of documents in the collection | = docNum |
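To make these quantities concrete, here is a minimal Python sketch that derives them from a toy collection. The collection, the helper names (`doc_len`, `form_len`, `doc_num_containing`), and the simplification that every detected word form has already been reduced to exactly one basic form are assumptions for illustration only.

```python
# Hypothetical toy collection: each document is a list of basic forms,
# one entry per detected word form.
collection = {
    "d1": ["index", "term", "index", "weight"],
    "d2": ["term", "query"],
    "d3": ["index", "query", "query"],
}

doc_num = len(collection)                                   # docNum (alias N)
col_len = sum(len(forms) for forms in collection.values())  # colLen


def doc_len(d):
    """docLen(D): number of word forms in document d."""
    return len(collection[d])


def form_len(d):
    """formLen(D): number of distinct basic forms in document d."""
    return len(set(collection[d]))


def freq(w, d=None):
    """freq(W, D) (alias tf) if d is given, otherwise freq(W) over the collection."""
    if d is not None:
        return collection[d].count(w)
    return sum(forms.count(w) for forms in collection.values())


def doc_num_containing(w):
    """docNum(W) (alias df): number of documents containing basic form w."""
    return sum(1 for forms in collection.values() if w in forms)


print(doc_num, col_len)                      # 3 9
print(freq("index", "d1"), freq("index"))    # 2 3
print(doc_num_containing("index"))           # 2
```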
+ | Addition |
- | Subtraction |
* | Multiplication |
/ | Division |
% | Modulus |
** | Exponent |
() | Parentheses |
atan2(Y,X) | Arctangent of Y/X in the range -π to π |
cos(EXPR) | Cosine of EXPR (expressed in radians) |
exp(EXPR) | e to the power of EXPR |
int(EXPR) | The integer portion of EXPR |
log(EXPR) | Logarithm (base e) of EXPR |
rand(EXPR) | A random fractional number between 0 and the value of EXPR (EXPR should be positive) |
sin(EXPR) | Sine of EXPR (expressed in radians) |
sqrt(EXPR) | Square root of EXPR |
Salton | tf * log(N / df) |
Robertson | (c + 1) * tf / (c + tf) * log((N - df + 0.5) / (df + 0.5)), where c should be replaced with a suitable constant. |
IDF | log(N / df) |
Default: TF*IDF | tf * log(N / df) |
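The formulas above can be sketched in Python as follows. The function names, the default `c=1.0` (a placeholder, not a recommended value), and the decision to skip query terms that do not occur in the document (tf = 0) are assumptions of this example. The summation follows the definition of w(W, D): the overall document weight is the sum of the weights contributed by each query term.

```python
import math


def salton(tf, df, n):
    """Salton / TF*IDF: tf * log(N / df)."""
    return tf * math.log(n / df)


def robertson(tf, df, n, c=1.0):
    """Robertson: (c + 1) * tf / (c + tf) * log((N - df + 0.5) / (df + 0.5)).

    c=1.0 is only a placeholder; choose a constant that suits the collection.
    """
    return (c + 1) * tf / (c + tf) * math.log((n - df + 0.5) / (df + 0.5))


def idf(tf, df, n):
    """IDF: log(N / df); tf is accepted only to keep the signatures uniform."""
    return math.log(n / df)


def document_weight(term_stats, n, formula=salton):
    """Sum the per-term weights of one document.

    term_stats: list of (tf, df) pairs, one per query term.
    Terms absent from the document (tf == 0) contribute nothing (assumption).
    """
    return sum(formula(tf, df, n) for tf, df in term_stats if tf > 0)


# Hypothetical figures: 100 documents, two query terms.
stats = [(2, 10), (1, 50)]   # (tf, df) for each query term
for f in (salton, robertson, idf):
    print(f.__name__, round(document_weight(stats, 100, f), 3))
```

Note that the logarithm in the Robertson formula becomes negative once a term occurs in more than half of the documents (df > N / 2), so very common terms lower a document's weight under that formula.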
. | stands for any character |
[] | stands for any enumerated character: [aeiou] matches any vowel. |
(|) | stands for any enumerated string: (ai|ei|ay|ey) matches any of "ai", "ei", "ay", "ey" (| meaning "or"). |
* | repeats the previous symbol zero or more times |
? | repeats the previous symbol zero or one time |
+ | repeats the previous symbol one or more times |
.* | works like the well-known truncation |
() | groups characters: (ierung)? matches "ierung" at most once. |
(?i:string) | makes the matching of string case-insensitive (use (?-i:string) to match case-sensitively). |
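The constructs listed above can be tried out with Python's `re` module, which supports the same syntax for these cases. The word list and the helper `matching` are invented for this sketch, and the use of `re.fullmatch` (the pattern must cover the whole word) is an assumption; whether lir.pl anchors patterns this way is not stated here.

```python
import re

# Hypothetical word list
words = ["Indexierung", "Index", "indexieren", "Meier", "Mayer", "Maier", "Meyer"]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


# (|): any of the enumerated strings
print(matching(r"M(ai|ei|ay|ey)er", words))   # ['Meier', 'Mayer', 'Maier', 'Meyer']

# []: any of the enumerated characters
print(matching(r"M[ae][iy]er", words))        # ['Meier', 'Mayer', 'Maier', 'Meyer']

# .*: truncation
print(matching(r"Index.*", words))            # ['Indexierung', 'Index']

# ()?: the grouped characters occur at most once
print(matching(r"Index(ierung)?", words))     # ['Indexierung', 'Index']

# (?i:...): case-insensitive matching for the enclosed string
print(matching(r"(?i:index).*", words))       # ['Indexierung', 'Index', 'indexieren']
```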
[1] | Strictly speaking, this applies only to Boolean AND-queries. OR-queries, however, carry no information about the relevance of the results, so the relevant results are simply scattered across the results list. In fact, a ranked query in its simplest form is a Boolean OR-query with relevance ranking. |
[2] | For a comprehensive survey see: Saracevic: RELEVANCE: A Review of and a Framework for the Thinking on the Notion in Information Science. Probabilistic IR: Crestani et al.: "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval. |
[3] | Lepsky: Automatisches Indexieren. Larson & Hearst: Term Weighting and Ranking Algorithms. Salton & Buckley: Term-weighting approaches in automatic text retrieval. In: Information Processing & Management vol. 24, no. 5, pp. 513-523, 1988. |
[4] | Lohmann: KASCADE: Dokumentanreicherung und automatische Inhaltserschließung - Projektbericht und Ergebnisse des Retrievaltests. Düsseldorf, 2000. (Schriften der Universitäts- und Landesbibliothek Düsseldorf; 31). |
* | In addition, the lir.pl databases "LIR" and "Literatur zur Inhaltserschließung" offer many references on these and related topics. |