Document (#35916)

Author
Westerman, S.J.
Cribbin, T.
Collins, J.
Title
Human assessments of document similarity
Source
Journal of the American Society for Information Science and Technology. 61(2010) no.8, pp. 1535-1542
Year
2010
Abstract
Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems.
Theme
Indexing studies
Object
n-grams
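
The abstract reports that n-gram length influenced how well automatic text analysis matched human ratings. The record does not describe the authors' actual ATA procedure, so the following is only a minimal sketch of one common way to score document similarity with n-grams: character n-gram frequency profiles compared by cosine similarity. The function names (char_ngrams, cosine_similarity, ngram_similarity) and the choice of character-level n-grams with whitespace normalization are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
import math


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace (an illustrative normalization choice)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than n has no n-grams
    return dot / (norm_a * norm_b)


def ngram_similarity(doc_a: str, doc_b: str, n: int) -> float:
    """Similarity of two documents for a given n-gram length."""
    return cosine_similarity(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


if __name__ == "__main__":
    a = "Human assessments of document similarity"
    b = "Automatic assessment of similar documents"
    # Varying n shows how the string length changes the similarity score.
    for n in (2, 3, 4, 5):
        print(n, round(ngram_similarity(a, b, n), 3))
```

Running the loop over several values of n is one way to explore the abstract's point that the best string length may differ between kinds of text.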

Similar documents (author)

  1. Collins, B.R.: Beyond cruising : reviewing (1996) 5.15
    5.1473327 = sum of:
      5.1473327 = weight(author_txt:collins in 4807) [ClassicSimilarity], result of:
        5.1473327 = fieldWeight in 4807, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.235732 = idf(docFreq=31, maxDocs=44421)
          0.625 = fieldNorm(doc=4807)
    
  2. Collins, H.M.: ¬A review of Hubert Dreyfus' What computers still can't do (1996) 5.15
    5.1473327 = sum of:
      5.1473327 = weight(author_txt:collins in 6841) [ClassicSimilarity], result of:
        5.1473327 = fieldWeight in 6841, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.235732 = idf(docFreq=31, maxDocs=44421)
          0.625 = fieldNorm(doc=6841)
    
  3. Collins, B.R.: Webwatch (1996) 5.15
    5.1473327 = sum of:
      5.1473327 = weight(author_txt:collins in 25) [ClassicSimilarity], result of:
        5.1473327 = fieldWeight in 25, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.235732 = idf(docFreq=31, maxDocs=44421)
          0.625 = fieldNorm(doc=25)
    
  4. Collins, M.: Leveling the information playing field : Illinois public libraries (1996) 5.15
    5.1473327 = sum of:
      5.1473327 = weight(author_txt:collins in 387) [ClassicSimilarity], result of:
        5.1473327 = fieldWeight in 387, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.235732 = idf(docFreq=31, maxDocs=44421)
          0.625 = fieldNorm(doc=387)
    
  5. Collins, B.R.: Webwatch (1997) 5.15
    5.1473327 = sum of:
      5.1473327 = weight(author_txt:collins in 172) [ClassicSimilarity], result of:
        5.1473327 = fieldWeight in 172, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.235732 = idf(docFreq=31, maxDocs=44421)
          0.625 = fieldNorm(doc=172)
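
All five author matches above carry the same score because each is a single-term fieldWeight with identical inputs. The sketch below reproduces that number using the standard Lucene ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The helper names are illustrative, and the fieldNorm value is simply copied from the explanation rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """ClassicSimilarity term frequency factor."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as listed in the explanation."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Inputs copied from the author_txt:collins explanations.
author_score = field_weight(freq=1.0, doc_freq=31, max_docs=44421, field_norm=0.625)
print(round(author_score, 4))  # approximately 5.1473
```

Small differences in the last digits are expected, since Lucene computes these values in 32-bit floats while Python uses 64-bit floats.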
    

Similar documents (content)

  1. Ekmekcioglu, F.C.; Lynch, M.F.; Willett, P.: Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases (1995) 0.17
    0.1681742 = sum of:
      0.1681742 = product of:
        0.70072585 = sum of:
          0.07368184 = weight(abstract_txt:string in 5865) [ClassicSimilarity], result of:
            0.07368184 = score(doc=5865,freq=1.0), product of:
              0.10946724 = queryWeight, product of:
                1.0730622 = boost
                7.179679 = idf(docFreq=91, maxDocs=44421)
                0.014208697 = queryNorm
              0.67309487 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.179679 = idf(docFreq=91, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.037155136 = weight(abstract_txt:text in 5865) [ClassicSimilarity], result of:
            0.037155136 = score(doc=5865,freq=2.0), product of:
              0.06935159 = queryWeight, product of:
                1.2078862 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014208697 = queryNorm
              0.5357503 = fieldWeight in 5865, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.010533268 = weight(abstract_txt:that in 5865) [ClassicSimilarity], result of:
            0.010533268 = score(doc=5865,freq=1.0), product of:
              0.04750864 = queryWeight, product of:
                1.4138362 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.014208697 = queryNorm
              0.22171268 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.06305754 = weight(abstract_txt:document in 5865) [ClassicSimilarity], result of:
            0.06305754 = score(doc=5865,freq=1.0), product of:
              0.1566349 = queryWeight, product of:
                2.5671844 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.014208697 = queryNorm
              0.40257657 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.1176306 = weight(abstract_txt:similarity in 5865) [ClassicSimilarity], result of:
            0.1176306 = score(doc=5865,freq=1.0), product of:
              0.21565746 = queryWeight, product of:
                2.608709 = boost
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.014208697 = queryNorm
              0.5454511 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.39866745 = weight(abstract_txt:gram in 5865) [ClassicSimilarity], result of:
            0.39866745 = score(doc=5865,freq=1.0), product of:
              0.53555536 = queryWeight, product of:
                4.7469535 = boost
                7.9402676 = idf(docFreq=42, maxDocs=44421)
                0.014208697 = queryNorm
              0.7444001 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9402676 = idf(docFreq=42, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
        0.24 = coord(6/25)
    
  2. Losee, R.M.: Upper bounds for retrieval performance and their use measuring performance and generating optimal queries : can it get any better than this? (1994) 0.13
    0.13329855 = sum of:
      0.13329855 = product of:
        0.47606623 = sum of:
          0.08607265 = weight(abstract_txt:optimal in 7417) [ClassicSimilarity], result of:
            0.08607265 = score(doc=7417,freq=3.0), product of:
              0.09506801 = queryWeight, product of:
                6.690832 = idf(docFreq=149, maxDocs=44421)
                0.014208697 = queryNorm
              0.9053798 = fieldWeight in 7417, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.690832 = idf(docFreq=149, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.021893876 = weight(abstract_txt:text in 7417) [ClassicSimilarity], result of:
            0.021893876 = score(doc=7417,freq=1.0), product of:
              0.06935159 = queryWeight, product of:
                1.2078862 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014208697 = queryNorm
              0.3156939 = fieldWeight in 7417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.102731794 = weight(abstract_txt:optimum in 7417) [ClassicSimilarity], result of:
            0.102731794 = score(doc=7417,freq=1.0), product of:
              0.15427703 = queryWeight, product of:
                1.2738944 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014208697 = queryNorm
              0.6658917 = fieldWeight in 7417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.017555445 = weight(abstract_txt:that in 7417) [ClassicSimilarity], result of:
            0.017555445 = score(doc=7417,freq=4.0), product of:
              0.04750864 = queryWeight, product of:
                1.4138362 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.014208697 = queryNorm
              0.3695211 = fieldWeight in 7417, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.092919715 = weight(abstract_txt:length in 7417) [ClassicSimilarity], result of:
            0.092919715 = score(doc=7417,freq=1.0), product of:
              0.18179415 = queryWeight, product of:
                1.9556348 = boost
                6.5424123 = idf(docFreq=173, maxDocs=44421)
                0.014208697 = queryNorm
              0.511126 = fieldWeight in 7417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5424123 = idf(docFreq=173, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.052547947 = weight(abstract_txt:document in 7417) [ClassicSimilarity], result of:
            0.052547947 = score(doc=7417,freq=1.0), product of:
              0.1566349 = queryWeight, product of:
                2.5671844 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.014208697 = queryNorm
              0.33548045 = fieldWeight in 7417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
          0.102344796 = weight(abstract_txt:average in 7417) [ClassicSimilarity], result of:
            0.102344796 = score(doc=7417,freq=1.0), product of:
              0.22194684 = queryWeight, product of:
                2.6464758 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014208697 = queryNorm
              0.46112302 = fieldWeight in 7417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.078125 = fieldNorm(doc=7417)
        0.28 = coord(7/25)
    
  3. Ravana, S.D.; Rajagopal, P.; Balakrishnan, V.: Ranking retrieval systems using pseudo relevance judgments (2015) 0.11
    0.114936695 = sum of:
      0.114936695 = product of:
        0.4789029 = sum of:
          0.0175151 = weight(abstract_txt:text in 3591) [ClassicSimilarity], result of:
            0.0175151 = score(doc=3591,freq=1.0), product of:
              0.06935159 = queryWeight, product of:
                1.2078862 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014208697 = queryNorm
              0.25255513 = fieldWeight in 3591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
          0.012162768 = weight(abstract_txt:that in 3591) [ClassicSimilarity], result of:
            0.012162768 = score(doc=3591,freq=3.0), product of:
              0.04750864 = queryWeight, product of:
                1.4138362 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.014208697 = queryNorm
              0.25601172 = fieldWeight in 3591, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
          0.016455945 = weight(abstract_txt:between in 3591) [ClassicSimilarity], result of:
            0.016455945 = score(doc=3591,freq=1.0), product of:
              0.076154165 = queryWeight, product of:
                1.550209 = boost
                3.4573963 = idf(docFreq=3804, maxDocs=44421)
                0.014208697 = queryNorm
              0.21608727 = fieldWeight in 3591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4573963 = idf(docFreq=3804, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
          0.08407672 = weight(abstract_txt:document in 3591) [ClassicSimilarity], result of:
            0.08407672 = score(doc=3591,freq=4.0), product of:
              0.1566349 = queryWeight, product of:
                2.5671844 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.014208697 = queryNorm
              0.53676873 = fieldWeight in 3591, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
          0.08187584 = weight(abstract_txt:average in 3591) [ClassicSimilarity], result of:
            0.08187584 = score(doc=3591,freq=1.0), product of:
              0.22194684 = queryWeight, product of:
                2.6464758 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014208697 = queryNorm
              0.36889842 = fieldWeight in 3591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
          0.26681653 = weight(abstract_txt:human in 3591) [ClassicSimilarity], result of:
            0.26681653 = score(doc=3591,freq=6.0), product of:
              0.37229976 = queryWeight, product of:
                5.5972433 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.014208697 = queryNorm
              0.7166712 = fieldWeight in 3591, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0625 = fieldNorm(doc=3591)
        0.24 = coord(6/25)
    
  4. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: ¬A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.11
    0.11343776 = sum of:
      0.11343776 = product of:
        0.40513486 = sum of:
          0.030962614 = weight(abstract_txt:text in 2998) [ClassicSimilarity], result of:
            0.030962614 = score(doc=2998,freq=2.0), product of:
              0.06935159 = queryWeight, product of:
                1.2078862 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014208697 = queryNorm
              0.4464586 = fieldWeight in 2998, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.025580814 = weight(abstract_txt:studies in 2998) [ClassicSimilarity], result of:
            0.025580814 = score(doc=2998,freq=1.0), product of:
              0.076933876 = queryWeight, product of:
                1.2722036 = boost
                4.25605 = idf(docFreq=1711, maxDocs=44421)
                0.014208697 = queryNorm
              0.3325039 = fieldWeight in 2998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.25605 = idf(docFreq=1711, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.01520346 = weight(abstract_txt:that in 2998) [ClassicSimilarity], result of:
            0.01520346 = score(doc=2998,freq=3.0), product of:
              0.04750864 = queryWeight, product of:
                1.4138362 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.014208697 = queryNorm
              0.32001466 = fieldWeight in 2998, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.020569932 = weight(abstract_txt:between in 2998) [ClassicSimilarity], result of:
            0.020569932 = score(doc=2998,freq=1.0), product of:
              0.076154165 = queryWeight, product of:
                1.550209 = boost
                3.4573963 = idf(docFreq=3804, maxDocs=44421)
                0.014208697 = queryNorm
              0.2701091 = fieldWeight in 2998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4573963 = idf(docFreq=3804, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.07431402 = weight(abstract_txt:document in 2998) [ClassicSimilarity], result of:
            0.07431402 = score(doc=2998,freq=2.0), product of:
              0.1566349 = queryWeight, product of:
                2.5671844 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.014208697 = queryNorm
              0.47444102 = fieldWeight in 2998, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.102344796 = weight(abstract_txt:average in 2998) [ClassicSimilarity], result of:
            0.102344796 = score(doc=2998,freq=1.0), product of:
              0.22194684 = queryWeight, product of:
                2.6464758 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014208697 = queryNorm
              0.46112302 = fieldWeight in 2998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
          0.13615924 = weight(abstract_txt:human in 2998) [ClassicSimilarity], result of:
            0.13615924 = score(doc=2998,freq=1.0), product of:
              0.37229976 = queryWeight, product of:
                5.5972433 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.014208697 = queryNorm
              0.36572474 = fieldWeight in 2998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.078125 = fieldNorm(doc=2998)
        0.28 = coord(7/25)
    
  5. Mens, G. Le; Kovács, B.; Hannan, M.T.; Pros, G.: Uncovering the semantics of concepts using GPT-4 (2023) 0.11
    0.11159288 = sum of:
      0.11159288 = product of:
        0.46497035 = sum of:
          0.034269337 = weight(abstract_txt:text in 2305) [ClassicSimilarity], result of:
            0.034269337 = score(doc=2305,freq=5.0), product of:
              0.06935159 = queryWeight, product of:
                1.2078862 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014208697 = queryNorm
              0.49413916 = fieldWeight in 2305, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.01373931 = weight(abstract_txt:that in 2305) [ClassicSimilarity], result of:
            0.01373931 = score(doc=2305,freq=5.0), product of:
              0.04750864 = queryWeight, product of:
                1.4138362 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.014208697 = queryNorm
              0.28919604 = fieldWeight in 2305, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.09251491 = weight(abstract_txt:ratings in 2305) [ClassicSimilarity], result of:
            0.09251491 = score(doc=2305,freq=1.0), product of:
              0.22992374 = queryWeight, product of:
                2.1993265 = boost
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.014208697 = queryNorm
              0.40237215 = fieldWeight in 2305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.036783565 = weight(abstract_txt:document in 2305) [ClassicSimilarity], result of:
            0.036783565 = score(doc=2305,freq=1.0), product of:
              0.1566349 = queryWeight, product of:
                2.5671844 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.014208697 = queryNorm
              0.23483633 = fieldWeight in 2305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.097040296 = weight(abstract_txt:similarity in 2305) [ClassicSimilarity], result of:
            0.097040296 = score(doc=2305,freq=2.0), product of:
              0.21565746 = queryWeight, product of:
                2.608709 = boost
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.014208697 = queryNorm
              0.4499742 = fieldWeight in 2305, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.19062293 = weight(abstract_txt:human in 2305) [ClassicSimilarity], result of:
            0.19062293 = score(doc=2305,freq=4.0), product of:
              0.37229976 = queryWeight, product of:
                5.5972433 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.014208697 = queryNorm
              0.5120146 = fieldWeight in 2305, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
        0.24 = coord(6/25)
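
The content scores above combine, for each matching term, a queryWeight (boost * idf * queryNorm) with a fieldWeight (tf * idf * fieldNorm), sum the products, and scale the sum by coord (matched terms / query terms). As a sketch under the standard ClassicSimilarity formulas, the code below recomputes the score of the first hit (doc 5865). The boosts, docFreq values, fieldNorm and queryNorm are copied from the explanation; queryNorm in particular cannot be derived here because it depends on all 25 query terms. The function and variable names are illustrative only.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.014208697  # copied from the explanation, not derived


def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    """Score of one term: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm: float, matched: int, total: int) -> float:
    """Sum of term scores scaled by coord = matched / total."""
    raw = sum(term_score(boost, df, freq, field_norm) for _, boost, df, freq in terms)
    return raw * (matched / total)


# (term, boost, docFreq, freq) for doc 5865, copied from the explanation.
TERMS_5865 = [
    ("string", 1.0730622, 91, 1.0),
    ("text", 1.2078862, 2122, 2.0),
    ("that", 1.4138362, 11344, 1.0),
    ("document", 2.5671844, 1647, 1.0),
    ("similarity", 2.608709, 358, 1.0),
    ("gram", 4.7469535, 42, 1.0),
]

score_5865 = document_score(TERMS_5865, field_norm=0.09375, matched=6, total=25)
print(round(score_5865, 4))  # approximately 0.1682
```

Terms listed without a boost line in an explanation (for example "optimal" in the second hit) can be handled by passing a boost of 1.0.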