Document (#33872)

Author
Medelyan, O.
Witten, I.H.
Title
Domain-independent automatic keyphrase indexing with small training sets
Source
Journal of the American Society for Information Science and Technology. 59(2008) no.7, p.1026-1040
Year
2008
Abstract
Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.
Theme
Automatic indexing

Similar documents (author)

  1. Witten, I.H.; Frank, E.: Data Mining : Praktische Werkzeuge und Techniken für das maschinelle Lernen (2000) 4.61
    4.6082807 = sum of:
      4.6082807 = weight(author_txt:witten in 833) [ClassicSimilarity], result of:
        4.6082807 = fieldWeight in 833, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.216561 = idf(docFreq=11, maxDocs=44421)
          0.5 = fieldNorm(doc=833)
    
  2. Witten, I.H.; Bainbridge, D.: Creating digital library collections with Greenstone (2005) 4.61
    4.6082807 = sum of:
      4.6082807 = weight(author_txt:witten in 3578) [ClassicSimilarity], result of:
        4.6082807 = fieldWeight in 3578, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.216561 = idf(docFreq=11, maxDocs=44421)
          0.5 = fieldNorm(doc=3578)
    
  3. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.46
    3.4562106 = sum of:
      3.4562106 = weight(author_txt:witten in 4083) [ClassicSimilarity], result of:
        3.4562106 = fieldWeight in 4083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.216561 = idf(docFreq=11, maxDocs=44421)
          0.375 = fieldNorm(doc=4083)
    
  4. Bainbridge, D.; Dewsnip, M.; Witten, I.H.: Searching digital music libraries (2005) 3.46
    3.4562106 = sum of:
      3.4562106 = weight(author_txt:witten in 1997) [ClassicSimilarity], result of:
        3.4562106 = fieldWeight in 1997, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.216561 = idf(docFreq=11, maxDocs=44421)
          0.375 = fieldNorm(doc=1997)
    
  5. Witten, I.H.; Bainbridge, D.; Boddie, S.J.: Greenstone : open-source digital library software (2001) 3.46
    3.4562106 = sum of:
      3.4562106 = weight(author_txt:witten in 2225) [ClassicSimilarity], result of:
        3.4562106 = fieldWeight in 2225, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.216561 = idf(docFreq=11, maxDocs=44421)
          0.375 = fieldNorm(doc=2225)
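
The author scores above can be reproduced by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any library. Lucene computes these values in single precision, so the last digits may differ slightly.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Inverse document frequency as assumed for ClassicSimilarity.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq):
    # Term frequency factor: square root of the raw count.
    return math.sqrt(freq)

def author_field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explanation trees above.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Entries 1 and 2 use fieldNorm 0.5; entries 3 to 5 use fieldNorm 0.375.
weight_short_field = author_field_weight(1.0, 11, 44421, 0.5)
weight_long_field = author_field_weight(1.0, 11, 44421, 0.375)
print(weight_short_field, weight_long_field)
```

The only factor that differs between the five entries is fieldNorm, which is smaller for records with longer author fields; that alone accounts for the 4.61 versus 3.46 split.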
    

Similar documents (content)

  1. Jones, S.; Paynter, G.W.: Automatic extraction of document keyphrases for use in digital libraries : evaluations and applications (2002) 0.38
    0.38099822 = sum of:
      0.38099822 = product of:
        1.1906195 = sum of:
          0.056318272 = weight(abstract_txt:manually in 1601) [ClassicSimilarity], result of:
            0.056318272 = score(doc=1601,freq=1.0), product of:
              0.13560005 = queryWeight, product of:
                1.0295577 = boost
                6.6452217 = idf(docFreq=156, maxDocs=44421)
                0.019819818 = queryNorm
              0.41532636 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6452217 = idf(docFreq=156, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.05681023 = weight(abstract_txt:descriptors in 1601) [ClassicSimilarity], result of:
            0.05681023 = score(doc=1601,freq=1.0), product of:
              0.13638857 = queryWeight, product of:
                1.0325468 = boost
                6.664515 = idf(docFreq=153, maxDocs=44421)
                0.019819818 = queryNorm
              0.4165322 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.664515 = idf(docFreq=153, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.42159805 = weight(abstract_txt:keyphrases in 1601) [ClassicSimilarity], result of:
            0.42159805 = score(doc=1601,freq=7.0), product of:
              0.27126473 = queryWeight, product of:
                1.456188 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.019819818 = queryNorm
              1.5541941 = fieldWeight in 1601, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.051046394 = weight(abstract_txt:training in 1601) [ClassicSimilarity], result of:
            0.051046394 = score(doc=1601,freq=1.0), product of:
              0.16000995 = queryWeight, product of:
                1.5816458 = boost
                5.104322 = idf(docFreq=732, maxDocs=44421)
                0.019819818 = queryNorm
              0.31902012 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.104322 = idf(docFreq=732, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.07613668 = weight(abstract_txt:automatic in 1601) [ClassicSimilarity], result of:
            0.07613668 = score(doc=1601,freq=2.0), product of:
              0.1657892 = queryWeight, product of:
                1.6099555 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.019819818 = queryNorm
              0.45923787 = fieldWeight in 1601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.06088522 = weight(abstract_txt:domain in 1601) [ClassicSimilarity], result of:
            0.06088522 = score(doc=1601,freq=1.0), product of:
              0.20600383 = queryWeight, product of:
                2.1979563 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.019819818 = queryNorm
              0.29555383 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.05381739 = weight(abstract_txt:documents in 1601) [ClassicSimilarity], result of:
            0.05381739 = score(doc=1601,freq=1.0), product of:
              0.20883119 = queryWeight, product of:
                2.5553386 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.019819818 = queryNorm
              0.25770763 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.41400737 = weight(abstract_txt:keyphrase in 1601) [ClassicSimilarity], result of:
            0.41400737 = score(doc=1601,freq=2.0), product of:
              0.5126634 = queryWeight, product of:
                2.8310785 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.019819818 = queryNorm
              0.80756176 = fieldWeight in 1601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
        0.32 = coord(8/25)
    
  2. Wu, Y.-f.B.; Li, Q.; Bot, R.S.; Chen, X.: Finding nuggets in documents : a machine learning approach (2006) 0.35
    0.34897068 = sum of:
      0.34897068 = product of:
        1.2463238 = sum of:
          0.05198835 = weight(abstract_txt:summary in 290) [ClassicSimilarity], result of:
            0.05198835 = score(doc=290,freq=1.0), product of:
              0.12855756 = queryWeight, product of:
                1.0024658 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.019819818 = queryNorm
              0.40439743 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.056318272 = weight(abstract_txt:manually in 290) [ClassicSimilarity], result of:
            0.056318272 = score(doc=290,freq=1.0), product of:
              0.13560005 = queryWeight, product of:
                1.0295577 = boost
                6.6452217 = idf(docFreq=156, maxDocs=44421)
                0.019819818 = queryNorm
              0.41532636 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6452217 = idf(docFreq=156, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.06231449 = weight(abstract_txt:phrases in 290) [ClassicSimilarity], result of:
            0.06231449 = score(doc=290,freq=1.0), product of:
              0.14506178 = queryWeight, product of:
                1.0648717 = boost
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.019819818 = queryNorm
              0.4295721 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.42159805 = weight(abstract_txt:keyphrases in 290) [ClassicSimilarity], result of:
            0.42159805 = score(doc=290,freq=7.0), product of:
              0.27126473 = queryWeight, product of:
                1.456188 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.019819818 = queryNorm
              1.5541941 = fieldWeight in 290, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.053836763 = weight(abstract_txt:automatic in 290) [ClassicSimilarity], result of:
            0.053836763 = score(doc=290,freq=1.0), product of:
              0.1657892 = queryWeight, product of:
                1.6099555 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.019819818 = queryNorm
              0.32473022 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.09321445 = weight(abstract_txt:documents in 290) [ClassicSimilarity], result of:
            0.09321445 = score(doc=290,freq=3.0), product of:
              0.20883119 = queryWeight, product of:
                2.5553386 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.019819818 = queryNorm
              0.4463627 = fieldWeight in 290, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.50705343 = weight(abstract_txt:keyphrase in 290) [ClassicSimilarity], result of:
            0.50705343 = score(doc=290,freq=3.0), product of:
              0.5126634 = queryWeight, product of:
                2.8310785 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.019819818 = queryNorm
              0.9890571 = fieldWeight in 290, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
        0.28 = coord(7/25)
    
  3. Jiang, Y.; Meng, R.; Huang, Y.; Lu, W.; Liu, J.: Generating keyphrases for readers : a controllable keyphrase generation framework (2023) 0.28
    0.28181204 = sum of:
      0.28181204 = product of:
        1.1742169 = sum of:
          0.06231449 = weight(abstract_txt:phrases in 2014) [ClassicSimilarity], result of:
            0.06231449 = score(doc=2014,freq=1.0), product of:
              0.14506178 = queryWeight, product of:
                1.0648717 = boost
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.019819818 = queryNorm
              0.4295721 = fieldWeight in 2014, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.31869817 = weight(abstract_txt:keyphrases in 2014) [ClassicSimilarity], result of:
            0.31869817 = score(doc=2014,freq=4.0), product of:
              0.27126473 = queryWeight, product of:
                1.456188 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.019819818 = queryNorm
              1.1748604 = fieldWeight in 2014, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.02387908 = weight(abstract_txt:with in 2014) [ClassicSimilarity], result of:
            0.02387908 = score(doc=2014,freq=4.0), product of:
              0.076531224 = queryWeight, product of:
                1.5469279 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.019819818 = queryNorm
              0.31201747 = fieldWeight in 2014, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.053836763 = weight(abstract_txt:automatic in 2014) [ClassicSimilarity], result of:
            0.053836763 = score(doc=2014,freq=1.0), product of:
              0.1657892 = queryWeight, product of:
                1.6099555 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.019819818 = queryNorm
              0.32473022 = fieldWeight in 2014, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.06088522 = weight(abstract_txt:domain in 2014) [ClassicSimilarity], result of:
            0.06088522 = score(doc=2014,freq=1.0), product of:
              0.20600383 = queryWeight, product of:
                2.1979563 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.019819818 = queryNorm
              0.29555383 = fieldWeight in 2014, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.6546031 = weight(abstract_txt:keyphrase in 2014) [ClassicSimilarity], result of:
            0.6546031 = score(doc=2014,freq=5.0), product of:
              0.5126634 = queryWeight, product of:
                2.8310785 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.019819818 = queryNorm
              1.2768673 = fieldWeight in 2014, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
        0.24 = coord(6/25)
    
  4. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.19
    0.19226173 = sum of:
      0.19226173 = product of:
        0.9613086 = sum of:
          0.06717806 = weight(abstract_txt:trained in 816) [ClassicSimilarity], result of:
            0.06717806 = score(doc=816,freq=1.0), product of:
              0.15251479 = queryWeight, product of:
                1.0918846 = boost
                7.0475073 = idf(docFreq=104, maxDocs=44421)
                0.019819818 = queryNorm
              0.4404692 = fieldWeight in 816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0475073 = idf(docFreq=104, maxDocs=44421)
                0.0625 = fieldNorm(doc=816)
          0.01193954 = weight(abstract_txt:with in 816) [ClassicSimilarity], result of:
            0.01193954 = score(doc=816,freq=1.0), product of:
              0.076531224 = queryWeight, product of:
                1.5469279 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.019819818 = queryNorm
              0.15600874 = fieldWeight in 816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=816)
          0.053836763 = weight(abstract_txt:automatic in 816) [ClassicSimilarity], result of:
            0.053836763 = score(doc=816,freq=1.0), product of:
              0.1657892 = queryWeight, product of:
                1.6099555 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.019819818 = queryNorm
              0.32473022 = fieldWeight in 816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=816)
          0.05381739 = weight(abstract_txt:documents in 816) [ClassicSimilarity], result of:
            0.05381739 = score(doc=816,freq=1.0), product of:
              0.20883119 = queryWeight, product of:
                2.5553386 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.019819818 = queryNorm
              0.25770763 = fieldWeight in 816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=816)
          0.77453685 = weight(abstract_txt:keyphrase in 816) [ClassicSimilarity], result of:
            0.77453685 = score(doc=816,freq=7.0), product of:
              0.5126634 = queryWeight, product of:
                2.8310785 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.019819818 = queryNorm
              1.5108097 = fieldWeight in 816, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=816)
        0.2 = coord(5/25)
    
  5. Pirkola, A.: Constructing topic-specific search keyphrase suggestion tools for Web information retrieval (2010) 0.18
    0.18058512 = sum of:
      0.18058512 = product of:
        1.128657 = sum of:
          0.13491483 = weight(abstract_txt:phrases in 665) [ClassicSimilarity], result of:
            0.13491483 = score(doc=665,freq=3.0), product of:
              0.14506178 = queryWeight, product of:
                1.0648717 = boost
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.019819818 = queryNorm
              0.9300509 = fieldWeight in 665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8731537 = idf(docFreq=124, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          0.3450009 = weight(abstract_txt:keyphrases in 665) [ClassicSimilarity], result of:
            0.3450009 = score(doc=665,freq=3.0), product of:
              0.27126473 = queryWeight, product of:
                1.456188 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.019819818 = queryNorm
              1.2718236 = fieldWeight in 665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          0.014924424 = weight(abstract_txt:with in 665) [ClassicSimilarity], result of:
            0.014924424 = score(doc=665,freq=1.0), product of:
              0.076531224 = queryWeight, product of:
                1.5469279 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.019819818 = queryNorm
              0.19501092 = fieldWeight in 665, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          0.6338168 = weight(abstract_txt:keyphrase in 665) [ClassicSimilarity], result of:
            0.6338168 = score(doc=665,freq=3.0), product of:
              0.5126634 = queryWeight, product of:
                2.8310785 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.019819818 = queryNorm
              1.2363214 = fieldWeight in 665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
        0.16 = coord(4/25)
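
The content scores above combine a query-side weight, a field-side weight, and a coordination factor. As a minimal sketch, assuming the same classic Lucene formulas (queryWeight = boost * idf * queryNorm, fieldWeight = sqrt(freq) * idf * fieldNorm, coord = matched clauses / total clauses), the score of entry 5 (doc 665) can be recomputed from the values listed in its explanation. The helper names are hypothetical; the numbers are copied from the tree above.

```python
import math

QUERY_NORM = 0.019819818  # queryNorm as reported in the explanations
MAX_DOCS = 44421          # maxDocs as reported in the explanations

def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(boost, freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    # Per-term score = queryWeight * fieldWeight.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(terms, field_norm, total_clauses):
    # Sum the per-term scores, then scale by coord(matched/total).
    raw = sum(term_score(b, f, df, field_norm) for b, f, df in terms)
    coord = len(terms) / total_clauses
    return raw * coord

# (boost, termFreq, docFreq) for the four matching terms of doc 665:
# phrases, keyphrases, with, keyphrase.
terms_665 = [
    (1.0648717, 3.0, 124),
    (1.456188, 3.0, 9),
    (1.5469279, 1.0, 9949),
    (2.8310785, 3.0, 12),
]
score_665 = document_score(terms_665, 0.078125, 25)
print(score_665)
```

The coord factor explains the ranking order of the five entries: each successive document matches one fewer of the 25 query terms (8, 7, 6, 5, 4), so its summed term score is scaled down further.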