Document (#34287)

Author
Khoo, C.S.G.
Ou, S.
Title
Machine versus human clustering of concepts across documents
Source
Culture and identity in knowledge organization: Proceedings of the Tenth International ISKO Conference 5-8 August 2008, Montreal, Canada. Ed. by Clément Arsenault and Joseph T. Tennis
Imprint
Würzburg : Ergon Verlag
Year
2008
Pages
pp. 333-339
Series
Advances in knowledge organization; vol.11
Content
An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multidocument summarization. The clustering method combines lexical overlap between multiword terms, syntactic constraints, and semantic considerations based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. It was found that the overlap between machine-generated clusters and individual human-generated clusters is higher than that between two human-generated clusters. A qualitative analysis of the human clustering found that the clusters formed are either semantic-conceptual based or lexical based (similar to machine clustering). The semantic-conceptual based clusters tended to differ between human coders. This raises questions about whether machine-generated clustering can be evaluated by comparing it with human clustering.
Footnote
Cf.: http://www.ergon-verlag.de/isko_ko/tocs/0497f79b0c0b3ed06/0497f79b0c0b5550a/index.php.
Theme
Automatic classification

Similar documents (author)

  1. Khoo, C.S.G.; Poo, D.C.C.: An expert system approach to online catalog subject searching (1994) 6.10
    6.103445 = sum of:
      6.103445 = sum of:
        2.7373245 = weight(author_txt:khoo in 7302) [ClassicSimilarity], result of:
          2.7373245 = score(doc=7302,freq=1.0), product of:
            0.65689176 = queryWeight, product of:
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.078819074 = queryNorm
            4.167086 = fieldWeight in 7302, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.5 = fieldNorm(doc=7302)
        3.3661203 = weight(author_txt:c.s.g in 7302) [ClassicSimilarity], result of:
          3.3661203 = score(doc=7302,freq=1.0), product of:
            0.753985 = queryWeight, product of:
              1.0713576 = boost
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.078819074 = queryNorm
            4.4644394 = fieldWeight in 7302, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.5 = fieldNorm(doc=7302)
    
  2. Chaudhry, A.S.; Khoo, C.S.G.: A survey of the top-level categories in the structure of corporate Websites (2008) 6.10
    6.103445 = sum of:
      6.103445 = sum of:
        2.7373245 = weight(author_txt:khoo in 3259) [ClassicSimilarity], result of:
          2.7373245 = score(doc=3259,freq=1.0), product of:
            0.65689176 = queryWeight, product of:
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.078819074 = queryNorm
            4.167086 = fieldWeight in 3259, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.5 = fieldNorm(doc=3259)
        3.3661203 = weight(author_txt:c.s.g in 3259) [ClassicSimilarity], result of:
          3.3661203 = score(doc=3259,freq=1.0), product of:
            0.753985 = queryWeight, product of:
              1.0713576 = boost
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.078819074 = queryNorm
            4.4644394 = fieldWeight in 3259, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.5 = fieldNorm(doc=3259)
    
  3. Poo, D.C.C.; Khoo, C.S.G.: Online Catalog Subject Searching (2009) 6.10
    6.103445 = sum of:
      6.103445 = sum of:
        2.7373245 = weight(author_txt:khoo in 838) [ClassicSimilarity], result of:
          2.7373245 = score(doc=838,freq=1.0), product of:
            0.65689176 = queryWeight, product of:
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.078819074 = queryNorm
            4.167086 = fieldWeight in 838, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.5 = fieldNorm(doc=838)
        3.3661203 = weight(author_txt:c.s.g in 838) [ClassicSimilarity], result of:
          3.3661203 = score(doc=838,freq=1.0), product of:
            0.753985 = queryWeight, product of:
              1.0713576 = boost
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.078819074 = queryNorm
            4.4644394 = fieldWeight in 838, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.5 = fieldNorm(doc=838)
    
  4. Sun, G.; Khoo, C.S.G.: A framework to represent variables and values in social science research data sets to support data curation and reuse (2018) 6.10
    6.103445 = sum of:
      6.103445 = sum of:
        2.7373245 = weight(author_txt:khoo in 744) [ClassicSimilarity], result of:
          2.7373245 = score(doc=744,freq=1.0), product of:
            0.65689176 = queryWeight, product of:
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.078819074 = queryNorm
            4.167086 = fieldWeight in 744, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.5 = fieldNorm(doc=744)
        3.3661203 = weight(author_txt:c.s.g in 744) [ClassicSimilarity], result of:
          3.3661203 = score(doc=744,freq=1.0), product of:
            0.753985 = queryWeight, product of:
              1.0713576 = boost
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.078819074 = queryNorm
            4.4644394 = fieldWeight in 744, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.5 = fieldNorm(doc=744)
    
  5. Khoo, C.S.G.; Wan, K.-W.: A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 5.34
    5.340514 = sum of:
      5.340514 = sum of:
        2.395159 = weight(author_txt:khoo in 3509) [ClassicSimilarity], result of:
          2.395159 = score(doc=3509,freq=1.0), product of:
            0.65689176 = queryWeight, product of:
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.078819074 = queryNorm
            3.6462004 = fieldWeight in 3509, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.334172 = idf(docFreq=28, maxDocs=44421)
              0.4375 = fieldNorm(doc=3509)
        2.9453552 = weight(author_txt:c.s.g in 3509) [ClassicSimilarity], result of:
          2.9453552 = score(doc=3509,freq=1.0), product of:
            0.753985 = queryWeight, product of:
              1.0713576 = boost
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.078819074 = queryNorm
            3.9063845 = fieldWeight in 3509, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.928879 = idf(docFreq=15, maxDocs=44421)
              0.4375 = fieldNorm(doc=3509)
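
The score trees above can be reproduced by hand. The following is a minimal sketch, not the search engine's own code: it assumes each term weight is queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), with tf taken as the square root of the term frequency and idf as 1 + ln(maxDocs / (docFreq + 1)), which matches the values listed. The function names `idf` and `term_score` are hypothetical helpers introduced for illustration.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed idf form; it reproduces the listed values (e.g. 8.334172 for docFreq=28).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, idf_value, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf_value * query_norm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    field_weight = math.sqrt(freq) * idf_value * field_norm
    return query_weight * field_weight

QUERY_NORM = 0.078819074  # copied from the explain output
MAX_DOCS = 44421

def author_score(field_norm):
    # Sum of the two author_txt term weights ("khoo" and "c.s.g").
    khoo = term_score(1.0, idf(28, MAX_DOCS), QUERY_NORM, field_norm)
    csg = term_score(1.0, idf(15, MAX_DOCS), QUERY_NORM, field_norm,
                     boost=1.0713576)
    return khoo + csg

total_top = author_score(0.5)       # entries 1 to 4 (fieldNorm 0.5)
total_fifth = author_score(0.4375)  # entry 5 (fieldNorm 0.4375)
print(round(total_top, 4), round(total_fifth, 4))
```

Under these assumptions, the only input that separates entry 5 from entries 1 to 4 is the smaller fieldNorm, which accounts for its lower total.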
    

Similar documents (content)

  1. Huang, L.; Milne, D.; Frank, E.; Witten, I.H.: Learning a concept-based document similarity measure (2012) 0.40
    0.39958772 = sum of:
      0.39958772 = product of:
        0.6992785 = sum of:
          0.100626275 = weight(abstract_txt:documents in 1372) [ClassicSimilarity], result of:
            0.100626275 = score(doc=1372,freq=2.0), product of:
              0.2208814 = queryWeight, product of:
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0535688 = queryNorm
              0.455567 = fieldWeight in 1372, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=1372)
          0.10412324 = weight(abstract_txt:human in 1372) [ClassicSimilarity], result of:
            0.10412324 = score(doc=1372,freq=1.0), product of:
              0.28470385 = queryWeight, product of:
                1.1353168 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0535688 = queryNorm
              0.36572474 = fieldWeight in 1372, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.078125 = fieldNorm(doc=1372)
          0.1489762 = weight(abstract_txt:machine in 1372) [ClassicSimilarity], result of:
            0.1489762 = score(doc=1372,freq=1.0), product of:
              0.3614982 = queryWeight, product of:
                1.2793032 = boost
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.0535688 = queryNorm
              0.41210774 = fieldWeight in 1372, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.078125 = fieldNorm(doc=1372)
          0.34555277 = weight(abstract_txt:clustering in 1372) [ClassicSimilarity], result of:
            0.34555277 = score(doc=1372,freq=2.0), product of:
              0.50276047 = queryWeight, product of:
                1.5086933 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0535688 = queryNorm
              0.68731093 = fieldWeight in 1372, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=1372)
        0.5714286 = coord(4/7)
    
  2. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.38
    0.38047865 = sum of:
      0.38047865 = product of:
        0.6658376 = sum of:
          0.08050103 = weight(abstract_txt:documents in 152) [ClassicSimilarity], result of:
            0.08050103 = score(doc=152,freq=2.0), product of:
              0.2208814 = queryWeight, product of:
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0535688 = queryNorm
              0.3644536 = fieldWeight in 152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=152)
          0.2023591 = weight(abstract_txt:concepts in 152) [ClassicSimilarity], result of:
            0.2023591 = score(doc=152,freq=7.0), product of:
              0.26895773 = queryWeight, product of:
                1.1034749 = boost
                4.549982 = idf(docFreq=1275, maxDocs=44421)
                0.0535688 = queryNorm
              0.7523825 = fieldWeight in 152, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.549982 = idf(docFreq=1275, maxDocs=44421)
                0.0625 = fieldNorm(doc=152)
          0.10653525 = weight(abstract_txt:across in 152) [ClassicSimilarity], result of:
            0.10653525 = score(doc=152,freq=1.0), product of:
              0.33545202 = queryWeight, product of:
                1.2323544 = boost
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0535688 = queryNorm
              0.31758714 = fieldWeight in 152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0625 = fieldNorm(doc=152)
          0.2764422 = weight(abstract_txt:clustering in 152) [ClassicSimilarity], result of:
            0.2764422 = score(doc=152,freq=2.0), product of:
              0.50276047 = queryWeight, product of:
                1.5086933 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0535688 = queryNorm
              0.54984874 = fieldWeight in 152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=152)
        0.5714286 = coord(4/7)
    
  3. Golub, K.: Automatic subject indexing of text (2019) 0.32
    0.31947502 = sum of:
      0.31947502 = product of:
        0.55908126 = sum of:
          0.056922823 = weight(abstract_txt:documents in 268) [ClassicSimilarity], result of:
            0.056922823 = score(doc=268,freq=1.0), product of:
              0.2208814 = queryWeight, product of:
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0535688 = queryNorm
              0.25770763 = fieldWeight in 268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=268)
          0.10653525 = weight(abstract_txt:across in 268) [ClassicSimilarity], result of:
            0.10653525 = score(doc=268,freq=1.0), product of:
              0.33545202 = queryWeight, product of:
                1.2323544 = boost
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0535688 = queryNorm
              0.31758714 = fieldWeight in 268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0625 = fieldNorm(doc=268)
          0.11918097 = weight(abstract_txt:machine in 268) [ClassicSimilarity], result of:
            0.11918097 = score(doc=268,freq=1.0), product of:
              0.3614982 = queryWeight, product of:
                1.2793032 = boost
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.0535688 = queryNorm
              0.3296862 = fieldWeight in 268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.0625 = fieldNorm(doc=268)
          0.2764422 = weight(abstract_txt:clustering in 268) [ClassicSimilarity], result of:
            0.2764422 = score(doc=268,freq=2.0), product of:
              0.50276047 = queryWeight, product of:
                1.5086933 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0535688 = queryNorm
              0.54984874 = fieldWeight in 268, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=268)
        0.5714286 = coord(4/7)
    
  4. Baker, T.: Languages for Dublin Core (1998) 0.31
    0.30997336 = sum of:
      0.30997336 = product of:
        0.4339627 = sum of:
          0.035576764 = weight(abstract_txt:documents in 2257) [ClassicSimilarity], result of:
            0.035576764 = score(doc=2257,freq=1.0), product of:
              0.2208814 = queryWeight, product of:
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0535688 = queryNorm
              0.16106726 = fieldWeight in 2257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0390625 = fieldNorm(doc=2257)
          0.09017336 = weight(abstract_txt:human in 2257) [ClassicSimilarity], result of:
            0.09017336 = score(doc=2257,freq=3.0), product of:
              0.28470385 = queryWeight, product of:
                1.1353168 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0535688 = queryNorm
              0.3167269 = fieldWeight in 2257, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0390625 = fieldNorm(doc=2257)
          0.094164744 = weight(abstract_txt:across in 2257) [ClassicSimilarity], result of:
            0.094164744 = score(doc=2257,freq=2.0), product of:
              0.33545202 = queryWeight, product of:
                1.2323544 = boost
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0535688 = queryNorm
              0.28071 = fieldWeight in 2257, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.081394 = idf(docFreq=749, maxDocs=44421)
                0.0390625 = fieldNorm(doc=2257)
          0.0744881 = weight(abstract_txt:machine in 2257) [ClassicSimilarity], result of:
            0.0744881 = score(doc=2257,freq=1.0), product of:
              0.3614982 = queryWeight, product of:
                1.2793032 = boost
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.0535688 = queryNorm
              0.20605387 = fieldWeight in 2257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.274979 = idf(docFreq=617, maxDocs=44421)
                0.0390625 = fieldNorm(doc=2257)
          0.13955973 = weight(abstract_txt:versus in 2257) [ClassicSimilarity], result of:
            0.13955973 = score(doc=2257,freq=1.0), product of:
              0.5493995 = queryWeight, product of:
                1.5771194 = boost
                6.5029707 = idf(docFreq=180, maxDocs=44421)
                0.0535688 = queryNorm
              0.2540223 = fieldWeight in 2257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5029707 = idf(docFreq=180, maxDocs=44421)
                0.0390625 = fieldNorm(doc=2257)
        0.71428573 = coord(5/7)
    
  5. Losee, R.M.; Church Jr., L.: Are two document clusters better than one? : the cluster performance question for information retrieval (2005) 0.31
    0.30580235 = sum of:
      0.30580235 = product of:
        0.71353877 = sum of:
          0.085384235 = weight(abstract_txt:documents in 4270) [ClassicSimilarity], result of:
            0.085384235 = score(doc=4270,freq=1.0), product of:
              0.2208814 = queryWeight, product of:
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0535688 = queryNorm
              0.38656145 = fieldWeight in 4270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=4270)
          0.29321125 = weight(abstract_txt:clustering in 4270) [ClassicSimilarity], result of:
            0.29321125 = score(doc=4270,freq=1.0), product of:
              0.50276047 = queryWeight, product of:
                1.5086933 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0535688 = queryNorm
              0.58320266 = fieldWeight in 4270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.09375 = fieldNorm(doc=4270)
          0.33494332 = weight(abstract_txt:versus in 4270) [ClassicSimilarity], result of:
            0.33494332 = score(doc=4270,freq=1.0), product of:
              0.5493995 = queryWeight, product of:
                1.5771194 = boost
                6.5029707 = idf(docFreq=180, maxDocs=44421)
                0.0535688 = queryNorm
              0.6096535 = fieldWeight in 4270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5029707 = idf(docFreq=180, maxDocs=44421)
                0.09375 = fieldNorm(doc=4270)
        0.42857143 = coord(3/7)
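
The content-based scores add one step: the sum of the matched term weights is multiplied by a coordination factor, coord(k/7), where k is the number of query terms found in the abstract. The sketch below recomputes entry 5 (doc 4270) from the values listed above. It is an illustration under the same assumptions as the trees suggest (tf as the square root of the frequency, term weight as queryWeight times fieldWeight); `term_score` is a hypothetical helper, not part of any library.

```python
import math

def term_score(freq, idf_value, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf_value * query_norm
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    field_weight = math.sqrt(freq) * idf_value * field_norm
    return query_weight * field_weight

QUERY_NORM = 0.0535688   # copied from the explain output
FIELD_NORM = 0.09375     # fieldNorm(doc=4270)
TOTAL_QUERY_TERMS = 7    # denominator shown in coord(3/7)

# (freq, idf, boost) for each query term matched in doc 4270
matched = [
    (1.0, 4.123322, 1.0),         # documents (no boost listed)
    (1.0, 6.2208285, 1.5086933),  # clustering
    (1.0, 6.5029707, 1.5771194),  # versus
]

raw_sum = sum(term_score(f, i, QUERY_NORM, FIELD_NORM, b) for f, i, b in matched)
coord = len(matched) / TOTAL_QUERY_TERMS
final_score = raw_sum * coord
print(round(raw_sum, 5), round(coord, 5), round(final_score, 5))
```

The coordination factor explains why entry 5 has the largest raw sum in this list (0.71353877) yet ranks last: only 3 of the 7 query terms match, so the sum is scaled by 3/7.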