Document (#30291)

Author
Wu, Y.-f.B.
Li, Q.
Bot, R.S.
Chen, X.
Title
Finding nuggets in documents : a machine learning approach
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.6, pp. 740-752
Year
2006
Abstract
Document keyphrases provide a concise summary of a document's content and serve as semantic metadata for it. They can be used in many applications related to knowledge management and text mining, such as automatic text summarization, development of search engines, document clustering, document classification, thesaurus construction, and browsing interfaces. Because only a small portion of documents have keyphrases assigned by authors, and because assigning keyphrases manually is time-consuming and costly, an algorithm that generates keyphrases for documents automatically is needed. This paper describes a Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified phrases to assign weights to the candidate keyphrases. The logic of our algorithm is: The more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. KIP's learning function can enrich the glossary database by automatically adding newly identified keyphrases to the database. KIP's personalization feature lets the user build a glossary database specifically suited to his/her area of interest. The evaluation results show that KIP performs better than the systems it was compared with and that the learning function is effective.
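The scoring logic stated in the abstract can be illustrated with a minimal sketch. The abstract does not give KIP's actual formula, so the additive score, the function name `score_candidate`, and the sample weights below are all assumptions made for illustration only.

```python
def score_candidate(phrase, keyword_weights):
    """Sum the weights of the known keywords in a candidate phrase.

    Assumption: an additive score. It rises with the number of keywords
    a phrase contains and with their significance, as the abstract says,
    but it is not taken from the paper itself.
    """
    return sum(keyword_weights.get(word, 0.0) for word in phrase.lower().split())


# Hypothetical keyword weights, standing in for a glossary database
# learned from human-identified keyphrases.
weights = {"keyphrase": 0.9, "extraction": 0.7, "document": 0.4}

candidates = ["keyphrase extraction", "document structure", "evaluation results"]

# Rank candidates from most to least likely keyphrase.
ranked = sorted(candidates, key=lambda p: score_candidate(p, weights), reverse=True)
print(ranked)
```

A phrase with no known keywords scores zero here; a real system would also need candidate phrase extraction and a way to update the weights, which this sketch leaves out.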
Theme
Automatic abstracting

Similar documents (author)

  1. Chen, Y.N.; Chen, S.J.: ¬A metadata practice of the IFLA FRBR model : a case study for the National Palace Museum in Taipei (2004) 4.34
    4.3394766 = sum of:
      4.3394766 = weight(author_txt:chen in 4384) [ClassicSimilarity], result of:
        4.3394766 = score(doc=4384,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          4.339477 = fieldWeight in 4384, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.5 = fieldNorm(doc=4384)
    
  2. Chen, C.C.; Chen, H.H.; Chen, K.H.: ¬The design of the XML/Metadata management system (2000) 3.99
    3.9860637 = sum of:
      3.9860637 = weight(author_txt:chen in 5633) [ClassicSimilarity], result of:
        3.9860637 = score(doc=5633,freq=3.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.986064 = fieldWeight in 5633, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.375 = fieldNorm(doc=5633)
    
  3. Chen, W.Y.: Observations on cataloguing and classification (1991) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 4183) [ClassicSimilarity], result of:
        3.8355918 = score(doc=4183,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 4183, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=4183)
    
  4. Chen, H.: Knowledge-based document retrieval : framework and design (1992) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 5282) [ClassicSimilarity], result of:
        3.8355918 = score(doc=5282,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 5282, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=5282)
    
  5. Chen, P.S.: On inference rules of logic-based information retrieval systems (1994) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 6730) [ClassicSimilarity], result of:
        3.8355918 = score(doc=6730,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 6730, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=6730)
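
The score breakdowns above can be reproduced by hand. The following is a minimal sketch of the per-term arithmetic, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the values listed. The function names are illustrative, and queryNorm and fieldNorm are simply copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    # Term frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # score = queryWeight * fieldWeight, as in the explain tree.
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Entry 1 of the author list: author_txt:chen in doc 4384.
score = term_score(freq=2.0, doc_freq=260, max_docs=44421,
                   query_norm=0.16294746, field_norm=0.5)
print(round(score, 2))
```

The result agrees with the listed 4.3394766 up to floating-point rounding, since the index computes in single precision.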
    

Similar documents (content)

  1. Jones, S.; Paynter, G.W.: Automatic extraction of document keyphrases for use in digital libraries : evaluations and applications (2002) 0.92
    0.9179231 = sum of:
      0.9179231 = product of:
        1.9123398 = sum of:
          0.03154948 = weight(abstract_txt:consuming in 1601) [ClassicSimilarity], result of:
            0.03154948 = score(doc=1601,freq=1.0), product of:
              0.06954187 = queryWeight, product of:
                7.2588162 = idf(docFreq=84, maxDocs=44421)
                0.009580332 = queryNorm
              0.45367602 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2588162 = idf(docFreq=84, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.038118117 = weight(abstract_txt:costly in 1601) [ClassicSimilarity], result of:
            0.038118117 = score(doc=1601,freq=1.0), product of:
              0.078887075 = queryWeight, product of:
                1.065074 = boost
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.009580332 = queryNorm
              0.4831985 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.04059 = weight(abstract_txt:concise in 1601) [ClassicSimilarity], result of:
            0.04059 = score(doc=1601,freq=1.0), product of:
              0.08226169 = queryWeight, product of:
                1.0876161 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.009580332 = queryNorm
              0.4934253 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.027770547 = weight(abstract_txt:automatically in 1601) [ClassicSimilarity], result of:
            0.027770547 = score(doc=1601,freq=1.0), product of:
              0.080473185 = queryWeight, product of:
                1.5213089 = boost
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.009580332 = queryNorm
              0.3450907 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.028546682 = weight(abstract_txt:function in 1601) [ClassicSimilarity], result of:
            0.028546682 = score(doc=1601,freq=1.0), product of:
              0.08196567 = queryWeight, product of:
                1.5353515 = boost
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.009580332 = queryNorm
              0.34827608 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.030633407 = weight(abstract_txt:algorithm in 1601) [ClassicSimilarity], result of:
            0.030633407 = score(doc=1601,freq=1.0), product of:
              0.08591291 = queryWeight, product of:
                1.571886 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.009580332 = queryNorm
              0.35656348 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.035713136 = weight(abstract_txt:keywords in 1601) [ClassicSimilarity], result of:
            0.035713136 = score(doc=1601,freq=1.0), product of:
              0.095165655 = queryWeight, product of:
                1.6543673 = boost
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.009580332 = queryNorm
              0.37527338 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.064364344 = weight(abstract_txt:assign in 1601) [ClassicSimilarity], result of:
            0.064364344 = score(doc=1601,freq=1.0), product of:
              0.14093703 = queryWeight, product of:
                2.0132809 = boost
                7.3070183 = idf(docFreq=80, maxDocs=44421)
                0.009580332 = queryNorm
              0.45668864 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3070183 = idf(docFreq=80, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.02313111 = weight(abstract_txt:documents in 1601) [ClassicSimilarity], result of:
            0.02313111 = score(doc=1601,freq=1.0), product of:
              0.08975718 = queryWeight, product of:
                2.2721732 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.009580332 = queryNorm
              0.25770763 = fieldWeight in 1601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.056566194 = weight(abstract_txt:document in 1601) [ClassicSimilarity], result of:
            0.056566194 = score(doc=1601,freq=3.0), product of:
              0.12168559 = queryWeight, product of:
                2.9578857 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.009580332 = queryNorm
              0.46485534 = fieldWeight in 1601, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          0.26691514 = weight(abstract_txt:keyphrase in 1601) [ClassicSimilarity], result of:
            0.26691514 = score(doc=1601,freq=2.0), product of:
              0.3305198 = queryWeight, product of:
                3.7760365 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.009580332 = queryNorm
              0.80756176 = fieldWeight in 1601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
          1.2684417 = weight(abstract_txt:keyphrases in 1601) [ClassicSimilarity], result of:
            1.2684417 = score(doc=1601,freq=7.0), product of:
              0.8161411 = queryWeight, product of:
                9.063762 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.009580332 = queryNorm
              1.5541941 = fieldWeight in 1601, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=1601)
        0.48 = coord(12/25)
    
  2. Jiang, Y.; Meng, R.; Huang, Y.; Lu, W.; Liu, J.: Generating keyphrases for readers : a controllable keyphrase generation framework (2023) 0.29
    0.29170546 = sum of:
      0.29170546 = product of:
        1.4585273 = sum of:
          0.010885623 = weight(abstract_txt:text in 2014) [ClassicSimilarity], result of:
            0.010885623 = score(doc=2014,freq=1.0), product of:
              0.04310197 = queryWeight, product of:
                1.1133722 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.009580332 = queryNorm
              0.25255513 = fieldWeight in 2014, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.040371105 = weight(abstract_txt:function in 2014) [ClassicSimilarity], result of:
            0.040371105 = score(doc=2014,freq=2.0), product of:
              0.08196567 = queryWeight, product of:
                1.5353515 = boost
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.009580332 = queryNorm
              0.49253675 = fieldWeight in 2014, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.026388803 = weight(abstract_txt:learning in 2014) [ClassicSimilarity], result of:
            0.026388803 = score(doc=2014,freq=1.0), product of:
              0.08903726 = queryWeight, product of:
                1.9598523 = boost
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.009580332 = queryNorm
              0.29637933 = fieldWeight in 2014, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.4220299 = weight(abstract_txt:keyphrase in 2014) [ClassicSimilarity], result of:
            0.4220299 = score(doc=2014,freq=5.0), product of:
              0.3305198 = queryWeight, product of:
                3.7760365 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.009580332 = queryNorm
              1.2768673 = fieldWeight in 2014, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
          0.9588519 = weight(abstract_txt:keyphrases in 2014) [ClassicSimilarity], result of:
            0.9588519 = score(doc=2014,freq=4.0), product of:
              0.8161411 = queryWeight, product of:
                9.063762 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.009580332 = queryNorm
              1.1748604 = fieldWeight in 2014, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=2014)
        0.2 = coord(5/25)
    
  3. Pirkola, A.: Constructing topic-specific search keyphrase suggestion tools for Web information retrieval (2010) 0.24
    0.2384438 = sum of:
      0.2384438 = product of:
        1.4902738 = sum of:
          0.019243246 = weight(abstract_txt:text in 665) [ClassicSimilarity], result of:
            0.019243246 = score(doc=665,freq=2.0), product of:
              0.04310197 = queryWeight, product of:
                1.1133722 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.009580332 = queryNorm
              0.4464586 = fieldWeight in 665, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          0.02441432 = weight(abstract_txt:identified in 665) [ClassicSimilarity], result of:
            0.02441432 = score(doc=665,freq=1.0), product of:
              0.0636431 = queryWeight, product of:
                1.3529055 = boost
                4.9102464 = idf(docFreq=889, maxDocs=44421)
                0.009580332 = queryNorm
              0.383613 = fieldWeight in 665, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9102464 = idf(docFreq=889, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          0.4086287 = weight(abstract_txt:keyphrase in 665) [ClassicSimilarity], result of:
            0.4086287 = score(doc=665,freq=3.0), product of:
              0.3305198 = queryWeight, product of:
                3.7760365 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.009580332 = queryNorm
              1.2363214 = fieldWeight in 665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
          1.0379876 = weight(abstract_txt:keyphrases in 665) [ClassicSimilarity], result of:
            1.0379876 = score(doc=665,freq=3.0), product of:
              0.8161411 = queryWeight, product of:
                9.063762 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.009580332 = queryNorm
              1.2718236 = fieldWeight in 665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.078125 = fieldNorm(doc=665)
        0.16 = coord(4/25)
    
  4. Daudaravicius, V.: ¬A framework for keyphrase extraction from scientific journals (2016) 0.23
    0.22680591 = sum of:
      0.22680591 = product of:
        1.417537 = sum of:
          0.041655824 = weight(abstract_txt:automatically in 3930) [ClassicSimilarity], result of:
            0.041655824 = score(doc=3930,freq=1.0), product of:
              0.080473185 = queryWeight, product of:
                1.5213089 = boost
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.009580332 = queryNorm
              0.51763606 = fieldWeight in 3930, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.09375 = fieldNorm(doc=3930)
          0.07575901 = weight(abstract_txt:keywords in 3930) [ClassicSimilarity], result of:
            0.07575901 = score(doc=3930,freq=2.0), product of:
              0.095165655 = queryWeight, product of:
                1.6543673 = boost
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.009580332 = queryNorm
              0.7960751 = fieldWeight in 3930, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.09375 = fieldNorm(doc=3930)
          0.28310627 = weight(abstract_txt:keyphrase in 3930) [ClassicSimilarity], result of:
            0.28310627 = score(doc=3930,freq=1.0), product of:
              0.3305198 = queryWeight, product of:
                3.7760365 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.009580332 = queryNorm
              0.8565486 = fieldWeight in 3930, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.09375 = fieldNorm(doc=3930)
          1.0170159 = weight(abstract_txt:keyphrases in 3930) [ClassicSimilarity], result of:
            1.0170159 = score(doc=3930,freq=2.0), product of:
              0.8161411 = queryWeight, product of:
                9.063762 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.009580332 = queryNorm
              1.2461276 = fieldWeight in 3930, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.09375 = fieldNorm(doc=3930)
        0.16 = coord(4/25)
    
  5. Martín-Moncunill, D.; García-Barriocanal, E.; Sicilia, M.-A.; Sánchez-Alonso, S.: Evaluating the practical applicability of thesaurus-based keyphrase extraction in the agricultural domain : insights from the VOA3R project (2015) 0.21
    0.21037939 = sum of:
      0.21037939 = product of:
        1.3148712 = sum of:
          0.009694101 = weight(abstract_txt:more in 3106) [ClassicSimilarity], result of:
            0.009694101 = score(doc=3106,freq=1.0), product of:
              0.045669932 = queryWeight, product of:
                1.40363 = boost
                3.3962307 = idf(docFreq=4044, maxDocs=44421)
                0.009580332 = queryNorm
              0.21226442 = fieldWeight in 3106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3962307 = idf(docFreq=4044, maxDocs=44421)
                0.0625 = fieldNorm(doc=3106)
          0.019422272 = weight(abstract_txt:database in 3106) [ClassicSimilarity], result of:
            0.019422272 = score(doc=3106,freq=1.0), product of:
              0.0725814 = queryWeight, product of:
                1.7694982 = boost
                4.2814875 = idf(docFreq=1668, maxDocs=44421)
                0.009580332 = queryNorm
              0.26759297 = fieldWeight in 3106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2814875 = idf(docFreq=1668, maxDocs=44421)
                0.0625 = fieldNorm(doc=3106)
          0.32690296 = weight(abstract_txt:keyphrase in 3106) [ClassicSimilarity], result of:
            0.32690296 = score(doc=3106,freq=3.0), product of:
              0.3305198 = queryWeight, product of:
                3.7760365 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.009580332 = queryNorm
              0.9890571 = fieldWeight in 3106, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=3106)
          0.9588519 = weight(abstract_txt:keyphrases in 3106) [ClassicSimilarity], result of:
            0.9588519 = score(doc=3106,freq=4.0), product of:
              0.8161411 = queryWeight, product of:
                9.063762 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.009580332 = queryNorm
              1.1748604 = fieldWeight in 3106, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=3106)
        0.16 = coord(4/25)
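
For the content-based list, each document's final score is the sum of its matching term scores multiplied by a coordination factor, coord(matched/total), where the total here is 25 query terms. A minimal sketch of that last step, using the per-term scores listed for entry 3 (doc 665); the function name `coord_score` is illustrative:

```python
def coord_score(term_scores, total_query_terms):
    # coord rewards documents that match more of the query terms.
    coord = len(term_scores) / total_query_terms
    return sum(term_scores) * coord


# Per-term scores for doc 665: text, identified, keyphrase, keyphrases.
pirkola_terms = [0.019243246, 0.02441432, 0.4086287, 1.0379876]

final = coord_score(pirkola_terms, total_query_terms=25)
print(round(final, 2))
```

This explains why entry 1, which matches 12 of the 25 terms, scores far higher than entries 2 to 5 even though its per-term sum is of a similar size.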