Document (#33955)

Author
Ringlstetter, C.
Stubbe, A.
Title
Practical aspects of automatic genre classification
Source
Bulletin of the American Society for Information Science and Technology. 34(2008) no.5, pp. 27-30
Year
2008
Abstract
In the field of automatic text processing, the technical term "genre" refers to the partition of documents into classes of documents with similar function and form. Genre represents an independent dimension, ideally orthogonal to topic. Traditionally, most work in the area of text classification, from a practical as well as a theoretical perspective, has focused on the problem of how to recognize thematic domains. However, given a user's information need, even prior to content, the genre of a document leads to a first coarse binary classification of the recall space into immediately rejected documents and those that require further processing. Depending on the information task at hand, each genre can represent a class of documents that should be filtered. For example, cooking recipes represent a kind of "noise" if someone needs to find articles about the economic outlook on fish breeding; one person might be interested only in prose about the Spanish Civil War, another only in military documents. In cases like these, a genre-triggered search can deliver significantly higher precision than a simple keyword search. If the documents are not tagged initially and the document base is too big for manual annotation, we need an automatic classification system.
Footnote
Available online at: http://www.asis.org/Bulletin/Jun-08/JunJul08_Ringlstetter_Stubbe.html.

Similar documents (content)

  1. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.29
    0.29379678 = sum of:
      0.29379678 = product of:
        1.0492742 = sum of:
          0.023260318 = weight(abstract_txt:text in 10) [ClassicSimilarity], result of:
            0.023260318 = score(doc=10,freq=1.0), product of:
              0.07367997 = queryWeight, product of:
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018233635 = queryNorm
              0.3156939 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.025699206 = weight(abstract_txt:need in 10) [ClassicSimilarity], result of:
            0.025699206 = score(doc=10,freq=1.0), product of:
              0.07874424 = queryWeight, product of:
                1.0337956 = boost
                4.1774464 = idf(docFreq=1851, maxDocs=44421)
                0.018233635 = queryNorm
              0.326363 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1774464 = idf(docFreq=1851, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.02791378 = weight(abstract_txt:document in 10) [ClassicSimilarity], result of:
            0.02791378 = score(doc=10,freq=1.0), product of:
              0.08320539 = queryWeight, product of:
                1.0626763 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.018233635 = queryNorm
              0.33548045 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.10488695 = weight(abstract_txt:automatic in 10) [ClassicSimilarity], result of:
            0.10488695 = score(doc=10,freq=2.0), product of:
              0.1827148 = queryWeight, product of:
                1.9286693 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.018233635 = queryNorm
              0.5740473 = fieldWeight in 10, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.06343891 = weight(abstract_txt:classification in 10) [ClassicSimilarity], result of:
            0.06343891 = score(doc=10,freq=2.0), product of:
              0.1438278 = queryWeight, product of:
                1.9758852 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.018233635 = queryNorm
              0.44107544 = fieldWeight in 10, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.18160416 = weight(abstract_txt:documents in 10) [ClassicSimilarity], result of:
            0.18160416 = score(doc=10,freq=6.0), product of:
              0.23015098 = queryWeight, product of:
                3.0612044 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018233635 = queryNorm
              0.7890653 = fieldWeight in 10, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.6224709 = weight(abstract_txt:genre in 10) [ClassicSimilarity], result of:
            0.6224709 = score(doc=10,freq=4.0), product of:
              0.5989246 = queryWeight, product of:
                4.9382377 = boost
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.018233635 = queryNorm
              1.0393144 = fieldWeight in 10, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
        0.28 = coord(7/25)
    
  2. Lim, C.S.; Lee, K.J.; Kim, G.C.: Multiple sets of features for automatic genre classification of web documents (2005) 0.15
    0.15159161 = sum of:
      0.15159161 = product of:
        0.75795805 = sum of:
          0.038678467 = weight(abstract_txt:document in 2048) [ClassicSimilarity], result of:
            0.038678467 = score(doc=2048,freq=3.0), product of:
              0.08320539 = queryWeight, product of:
                1.0626763 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.018233635 = queryNorm
              0.46485534 = fieldWeight in 2048, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=2048)
          0.05933302 = weight(abstract_txt:automatic in 2048) [ClassicSimilarity], result of:
            0.05933302 = score(doc=2048,freq=1.0), product of:
              0.1827148 = queryWeight, product of:
                1.9286693 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.018233635 = queryNorm
              0.32473022 = fieldWeight in 2048, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=2048)
          0.050751127 = weight(abstract_txt:classification in 2048) [ClassicSimilarity], result of:
            0.050751127 = score(doc=2048,freq=2.0), product of:
              0.1438278 = queryWeight, product of:
                1.9758852 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.018233635 = queryNorm
              0.35286036 = fieldWeight in 2048, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=2048)
          0.17793499 = weight(abstract_txt:documents in 2048) [ClassicSimilarity], result of:
            0.17793499 = score(doc=2048,freq=9.0), product of:
              0.23015098 = queryWeight, product of:
                3.0612044 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018233635 = queryNorm
              0.7731229 = fieldWeight in 2048, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=2048)
          0.4312605 = weight(abstract_txt:genre in 2048) [ClassicSimilarity], result of:
            0.4312605 = score(doc=2048,freq=3.0), product of:
              0.5989246 = queryWeight, product of:
                4.9382377 = boost
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.018233635 = queryNorm
              0.7200581 = fieldWeight in 2048, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.0625 = fieldNorm(doc=2048)
        0.2 = coord(5/25)
    
  3. Santini, M.: Zero, single, or multi? : genre of web pages through the users' perspective (2008) 0.13
    0.1307637 = sum of:
      0.1307637 = product of:
        0.8172731 = sum of:
          0.020559365 = weight(abstract_txt:need in 3059) [ClassicSimilarity], result of:
            0.020559365 = score(doc=3059,freq=1.0), product of:
              0.07874424 = queryWeight, product of:
                1.0337956 = boost
                4.1774464 = idf(docFreq=1851, maxDocs=44421)
                0.018233635 = queryNorm
              0.2610904 = fieldWeight in 3059, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1774464 = idf(docFreq=1851, maxDocs=44421)
                0.0625 = fieldNorm(doc=3059)
          0.030311132 = weight(abstract_txt:only in 3059) [ClassicSimilarity], result of:
            0.030311132 = score(doc=3059,freq=2.0), product of:
              0.08095999 = queryWeight, product of:
                1.0482395 = boost
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.018233635 = queryNorm
              0.37439644 = fieldWeight in 3059, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.0625 = fieldNorm(doc=3059)
          0.062157184 = weight(abstract_txt:classification in 3059) [ClassicSimilarity], result of:
            0.062157184 = score(doc=3059,freq=3.0), product of:
              0.1438278 = queryWeight, product of:
                1.9758852 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.018233635 = queryNorm
              0.43216392 = fieldWeight in 3059, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=3059)
          0.7042454 = weight(abstract_txt:genre in 3059) [ClassicSimilarity], result of:
            0.7042454 = score(doc=3059,freq=8.0), product of:
              0.5989246 = queryWeight, product of:
                4.9382377 = boost
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.018233635 = queryNorm
              1.1758499 = fieldWeight in 3059, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.0625 = fieldNorm(doc=3059)
        0.16 = coord(4/25)
    
  4. Morato, J.; Llorens, J.; Genova, G.; Moreiro, J.A.: Experiments in discourse analysis impact on information classification and retrieval algorithms (2003) 0.11
    0.110256724 = sum of:
      0.110256724 = product of:
        0.45940304 = sum of:
          0.026316045 = weight(abstract_txt:text in 2083) [ClassicSimilarity], result of:
            0.026316045 = score(doc=2083,freq=2.0), product of:
              0.07367997 = queryWeight, product of:
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018233635 = queryNorm
              0.3571669 = fieldWeight in 2083, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
          0.021433206 = weight(abstract_txt:only in 2083) [ClassicSimilarity], result of:
            0.021433206 = score(doc=2083,freq=1.0), product of:
              0.08095999 = queryWeight, product of:
                1.0482395 = boost
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.018233635 = queryNorm
              0.26473826 = fieldWeight in 2083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
          0.03158084 = weight(abstract_txt:document in 2083) [ClassicSimilarity], result of:
            0.03158084 = score(doc=2083,freq=2.0), product of:
              0.08320539 = queryWeight, product of:
                1.0626763 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.018233635 = queryNorm
              0.3795528 = fieldWeight in 2083, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
          0.07177293 = weight(abstract_txt:classification in 2083) [ClassicSimilarity], result of:
            0.07177293 = score(doc=2083,freq=4.0), product of:
              0.1438278 = queryWeight, product of:
                1.9758852 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.018233635 = queryNorm
              0.49901992 = fieldWeight in 2083, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
          0.059311662 = weight(abstract_txt:documents in 2083) [ClassicSimilarity], result of:
            0.059311662 = score(doc=2083,freq=1.0), product of:
              0.23015098 = queryWeight, product of:
                3.0612044 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018233635 = queryNorm
              0.25770763 = fieldWeight in 2083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
          0.24898836 = weight(abstract_txt:genre in 2083) [ClassicSimilarity], result of:
            0.24898836 = score(doc=2083,freq=1.0), product of:
              0.5989246 = queryWeight, product of:
                4.9382377 = boost
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.018233635 = queryNorm
              0.41572574 = fieldWeight in 2083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.651612 = idf(docFreq=155, maxDocs=44421)
                0.0625 = fieldNorm(doc=2083)
        0.24 = coord(6/25)
    
  5. Altinel, B.; Ganiz, M.C.: Semantic text classification : a survey of past and recent advances (2018) 0.11
    0.10620661 = sum of:
      0.10620661 = product of:
        0.37930933 = sum of:
          0.058706388 = weight(abstract_txt:text in 51) [ClassicSimilarity], result of:
            0.058706388 = score(doc=51,freq=13.0), product of:
              0.07367997 = queryWeight, product of:
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018233635 = queryNorm
              0.7967754 = fieldWeight in 51, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.018754056 = weight(abstract_txt:only in 51) [ClassicSimilarity], result of:
            0.018754056 = score(doc=51,freq=1.0), product of:
              0.08095999 = queryWeight, product of:
                1.0482395 = boost
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.018233635 = queryNorm
              0.23164597 = fieldWeight in 51, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.235812 = idf(docFreq=1746, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.03384366 = weight(abstract_txt:document in 51) [ClassicSimilarity], result of:
            0.03384366 = score(doc=51,freq=3.0), product of:
              0.08320539 = queryWeight, product of:
                1.0626763 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.018233635 = queryNorm
              0.4067484 = fieldWeight in 51, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.02947769 = weight(abstract_txt:processing in 51) [ClassicSimilarity], result of:
            0.02947769 = score(doc=51,freq=1.0), product of:
              0.10944668 = queryWeight, product of:
                1.2187835 = boost
                4.9249606 = idf(docFreq=876, maxDocs=44421)
                0.018233635 = queryNorm
              0.26933378 = fieldWeight in 51, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9249606 = idf(docFreq=876, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.05191639 = weight(abstract_txt:automatic in 51) [ClassicSimilarity], result of:
            0.05191639 = score(doc=51,freq=1.0), product of:
              0.1827148 = queryWeight, product of:
                1.9286693 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.018233635 = queryNorm
              0.28413895 = fieldWeight in 51, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.11321668 = weight(abstract_txt:classification in 51) [ClassicSimilarity], result of:
            0.11321668 = score(doc=51,freq=13.0), product of:
              0.1438278 = queryWeight, product of:
                1.9758852 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.018233635 = queryNorm
              0.7871683 = fieldWeight in 51, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
          0.07339444 = weight(abstract_txt:documents in 51) [ClassicSimilarity], result of:
            0.07339444 = score(doc=51,freq=2.0), product of:
              0.23015098 = queryWeight, product of:
                3.0612044 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018233635 = queryNorm
              0.31889692 = fieldWeight in 51, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0546875 = fieldNorm(doc=51)
        0.28 = coord(7/25)
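
The score trees above follow Lucene's ClassicSimilarity (TF-IDF) scheme: each matching term contributes queryWeight x fieldWeight, the contributions are summed, and the sum is scaled by coord (matched terms / query terms). As a minimal sketch, the following recomputes the score of entry 1 (doc 10) from the values printed in its tree. The function and variable names are illustrative assumptions, not part of the catalog software; the tf and idf formulas are those of ClassicSimilarity, and small deviations from the listed figures are expected because Lucene computes in 32-bit floats.

```python
import math

def idf(doc_freq, max_docs):
    # ClassicSimilarity: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# Constants copied from the explanation tree of entry 1 (doc 10).
MAX_DOCS = 44421
QUERY_NORM = 0.018233635
FIELD_NORM = 0.078125
QUERY_TERMS = 25  # denominator of coord(7/25)

# term -> (termFreq, docFreq, boost); "text" has no boost line, so 1.0 is assumed.
terms = {
    "text": (1.0, 2122, 1.0),
    "need": (1.0, 1851, 1.0337956),
    "document": (1.0, 1647, 1.0626763),
    "automatic": (2.0, 668, 1.9286693),
    "classification": (2.0, 2228, 1.9758852),
    "documents": (6.0, 1954, 3.0612044),
    "genre": (4.0, 155, 4.9382377),
}

term_scores = {
    term: term_score(freq, doc_freq, MAX_DOCS, QUERY_NORM, FIELD_NORM, boost)
    for term, (freq, doc_freq, boost) in terms.items()
}

# coord rewards documents that match more of the query terms.
coord = len(terms) / QUERY_TERMS
total = coord * sum(term_scores.values())

for term, score in term_scores.items():
    print(f"{term:15s} {score:.6f}")
print(f"total           {total:.6f}")
```

The printed total should agree with the 0.29379678 listed for entry 1 to within float rounding. The per-term values also show why "genre" dominates the ranking: it is the rarest term (docFreq=155), so its idf enters both queryWeight and fieldWeight, and it carries the largest boost.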