Document (#34544)

Author
Sitas, A.
Kapidakis, S.
Title
Duplicate detection algorithms of bibliographic descriptions
Source
Library hi tech. 26(2008) no.2, p.287-301
Year
2008
Abstract
Purpose - The purpose of this paper is to focus on duplicate record detection algorithms used in bibliographic databases. Design/methodology/approach - Individual algorithms, their application process for duplicate detection and their results are described based on available literature (published articles), information found at various library web sites and follow-up e-mail communications. Findings - Algorithms are categorized by whether they are applied as a single-step process or as two consecutive steps. The results of deletion, merging, and temporary and virtual consolidation of duplicate records are studied. Originality/value - The paper presents an overview of duplicate detection algorithms and the current state of their application in different library systems.
Theme
Descriptive cataloging (Formalerschließung)

Similar documents (content)

  1. Hustand, S.: Problems of duplicate records (1986) 0.28
    0.28360382 = sum of:
      0.28360382 = product of:
        1.1816826 = sum of:
          0.07497553 = weight(abstract_txt:merging in 265) [ClassicSimilarity], result of:
            0.07497553 = score(doc=265,freq=1.0), product of:
              0.1272358 = queryWeight, product of:
                1.2732646 = boost
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.013248614 = queryNorm
              0.5892644 = fieldWeight in 265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
          0.13357532 = weight(abstract_txt:duplication in 265) [ClassicSimilarity], result of:
            0.13357532 = score(doc=265,freq=2.0), product of:
              0.14841248 = queryWeight, product of:
                1.3751473 = boost
                8.146119 = idf(docFreq=34, maxDocs=44421)
                0.013248614 = queryNorm
              0.9000275 = fieldWeight in 265, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.146119 = idf(docFreq=34, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
          0.026461693 = weight(abstract_txt:bibliographic in 265) [ClassicSimilarity], result of:
            0.026461693 = score(doc=265,freq=1.0), product of:
              0.08006045 = queryWeight, product of:
                1.4283612 = boost
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.013248614 = queryNorm
              0.3305214 = fieldWeight in 265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
          0.16179517 = weight(abstract_txt:algorithms in 265) [ClassicSimilarity], result of:
            0.16179517 = score(doc=265,freq=1.0), product of:
              0.36332616 = queryWeight, product of:
                4.811133 = boost
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.013248614 = queryNorm
              0.4453166 = fieldWeight in 265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
          0.51328987 = weight(abstract_txt:duplicate in 265) [ClassicSimilarity], result of:
            0.51328987 = score(doc=265,freq=2.0), product of:
              0.57798254 = queryWeight, product of:
                5.4275193 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.013248614 = queryNorm
              0.88807154 = fieldWeight in 265, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
          0.27158496 = weight(abstract_txt:detection in 265) [ClassicSimilarity], result of:
            0.27158496 = score(doc=265,freq=1.0), product of:
              0.5131647 = queryWeight, product of:
                5.7177796 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.013248614 = queryNorm
              0.5292355 = fieldWeight in 265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.078125 = fieldNorm(doc=265)
        0.24 = coord(6/25)
    
  2. Cousins, S.A.: Duplicate detection and record consolidation in large bibliographic databases : the COPAC database experience (1998) 0.25
    0.2510006 = sum of:
      0.2510006 = product of:
        1.255003 = sum of:
          0.011331973 = weight(abstract_txt:library in 3833) [ClassicSimilarity], result of:
            0.011331973 = score(doc=3833,freq=1.0), product of:
              0.045485884 = queryWeight, product of:
                1.0766321 = boost
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.013248614 = queryNorm
              0.24913163 = fieldWeight in 3833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.078125 = fieldNorm(doc=3833)
          0.09445201 = weight(abstract_txt:duplication in 3833) [ClassicSimilarity], result of:
            0.09445201 = score(doc=3833,freq=1.0), product of:
              0.14841248 = queryWeight, product of:
                1.3751473 = boost
                8.146119 = idf(docFreq=34, maxDocs=44421)
                0.013248614 = queryNorm
              0.63641554 = fieldWeight in 3833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.146119 = idf(docFreq=34, maxDocs=44421)
                0.078125 = fieldNorm(doc=3833)
          0.13649079 = weight(abstract_txt:consolidation in 3833) [ClassicSimilarity], result of:
            0.13649079 = score(doc=3833,freq=2.0), product of:
              0.15056425 = queryWeight, product of:
                1.3850803 = boost
                8.20496 = idf(docFreq=32, maxDocs=44421)
                0.013248614 = queryNorm
              0.90652853 = fieldWeight in 3833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.20496 = idf(docFreq=32, maxDocs=44421)
                0.078125 = fieldNorm(doc=3833)
          0.6286491 = weight(abstract_txt:duplicate in 3833) [ClassicSimilarity], result of:
            0.6286491 = score(doc=3833,freq=3.0), product of:
              0.57798254 = queryWeight, product of:
                5.4275193 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.013248614 = queryNorm
              1.087661 = fieldWeight in 3833, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=3833)
          0.38407913 = weight(abstract_txt:detection in 3833) [ClassicSimilarity], result of:
            0.38407913 = score(doc=3833,freq=2.0), product of:
              0.5131647 = queryWeight, product of:
                5.7177796 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.013248614 = queryNorm
              0.74845195 = fieldWeight in 3833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.078125 = fieldNorm(doc=3833)
        0.2 = coord(5/25)
    
  3. Meir, D.D.; Lazinger, S.S.: Measuring the performance of a merging algorithm : mismatches, missed-matches, and overlap in Israel's union list (1998) 0.25
    0.24971403 = sum of:
      0.24971403 = product of:
        1.0404751 = sum of:
          0.02080341 = weight(abstract_txt:results in 4382) [ClassicSimilarity], result of:
            0.02080341 = score(doc=4382,freq=2.0), product of:
              0.05412767 = queryWeight, product of:
                1.1744612 = boost
                3.4786456 = idf(docFreq=3724, maxDocs=44421)
                0.013248614 = queryNorm
              0.38433966 = fieldWeight in 4382, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4786456 = idf(docFreq=3724, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
          0.12986141 = weight(abstract_txt:merging in 4382) [ClassicSimilarity], result of:
            0.12986141 = score(doc=4382,freq=3.0), product of:
              0.1272358 = queryWeight, product of:
                1.2732646 = boost
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.013248614 = queryNorm
              1.0206358 = fieldWeight in 4382, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
          0.026461693 = weight(abstract_txt:bibliographic in 4382) [ClassicSimilarity], result of:
            0.026461693 = score(doc=4382,freq=1.0), product of:
              0.08006045 = queryWeight, product of:
                1.4283612 = boost
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.013248614 = queryNorm
              0.3305214 = fieldWeight in 4382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
          0.22881293 = weight(abstract_txt:algorithms in 4382) [ClassicSimilarity], result of:
            0.22881293 = score(doc=4382,freq=2.0), product of:
              0.36332616 = queryWeight, product of:
                4.811133 = boost
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.013248614 = queryNorm
              0.6297728 = fieldWeight in 4382, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
          0.3629507 = weight(abstract_txt:duplicate in 4382) [ClassicSimilarity], result of:
            0.3629507 = score(doc=4382,freq=1.0), product of:
              0.57798254 = queryWeight, product of:
                5.4275193 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.013248614 = queryNorm
              0.6279614 = fieldWeight in 4382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
          0.27158496 = weight(abstract_txt:detection in 4382) [ClassicSimilarity], result of:
            0.27158496 = score(doc=4382,freq=1.0), product of:
              0.5131647 = queryWeight, product of:
                5.7177796 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.013248614 = queryNorm
              0.5292355 = fieldWeight in 4382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.078125 = fieldNorm(doc=4382)
        0.24 = coord(6/25)
    
  4. Sedhai, S.; Sun, A.: An analysis of 14 Million tweets on hashtag-oriented spamming* (2017) 0.14
    0.14167757 = sum of:
      0.14167757 = product of:
        0.70838785 = sum of:
          0.029057182 = weight(abstract_txt:descriptions in 4683) [ClassicSimilarity], result of:
            0.029057182 = score(doc=4683,freq=1.0), product of:
              0.078482345 = queryWeight, product of:
                5.9238153 = idf(docFreq=322, maxDocs=44421)
                0.013248614 = queryNorm
              0.37023845 = fieldWeight in 4683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9238153 = idf(docFreq=322, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
          0.016399419 = weight(abstract_txt:paper in 4683) [ClassicSimilarity], result of:
            0.016399419 = score(doc=4683,freq=2.0), product of:
              0.053598825 = queryWeight, product of:
                1.1687098 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.013248614 = queryNorm
              0.30596602 = fieldWeight in 4683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
          0.03503145 = weight(abstract_txt:their in 4683) [ClassicSimilarity], result of:
            0.03503145 = score(doc=4683,freq=4.0), product of:
              0.088901356 = queryWeight, product of:
                2.1286204 = boost
                3.1523883 = idf(docFreq=5161, maxDocs=44421)
                0.013248614 = queryNorm
              0.39404854 = fieldWeight in 4683, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.1523883 = idf(docFreq=5161, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
          0.41063187 = weight(abstract_txt:duplicate in 4683) [ClassicSimilarity], result of:
            0.41063187 = score(doc=4683,freq=2.0), product of:
              0.57798254 = queryWeight, product of:
                5.4275193 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.013248614 = queryNorm
              0.7104572 = fieldWeight in 4683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
          0.21726796 = weight(abstract_txt:detection in 4683) [ClassicSimilarity], result of:
            0.21726796 = score(doc=4683,freq=1.0), product of:
              0.5131647 = queryWeight, product of:
                5.7177796 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.013248614 = queryNorm
              0.42338836 = fieldWeight in 4683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
        0.2 = coord(5/25)
    
  5. Weiss, P.J.: Getting the expert into the system : expert systems and cataloging (1995) 0.14
    0.13622615 = sum of:
      0.13622615 = product of:
        1.1352179 = sum of:
          0.119960845 = weight(abstract_txt:merging in 2465) [ClassicSimilarity], result of:
            0.119960845 = score(doc=2465,freq=1.0), product of:
              0.1272358 = queryWeight, product of:
                1.2732646 = boost
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.013248614 = queryNorm
              0.94282305 = fieldWeight in 2465, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5425844 = idf(docFreq=63, maxDocs=44421)
                0.125 = fieldNorm(doc=2465)
          0.58072114 = weight(abstract_txt:duplicate in 2465) [ClassicSimilarity], result of:
            0.58072114 = score(doc=2465,freq=1.0), product of:
              0.57798254 = queryWeight, product of:
                5.4275193 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.013248614 = queryNorm
              1.0047382 = fieldWeight in 2465, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.125 = fieldNorm(doc=2465)
          0.43453592 = weight(abstract_txt:detection in 2465) [ClassicSimilarity], result of:
            0.43453592 = score(doc=2465,freq=1.0), product of:
              0.5131647 = queryWeight, product of:
                5.7177796 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.013248614 = queryNorm
              0.8467767 = fieldWeight in 2465, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.125 = fieldNorm(doc=2465)
        0.12 = coord(3/25)
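
The score trees above follow the structure of Lucene's ClassicSimilarity (TF-IDF) explanation: each term score is queryWeight times fieldWeight, the term scores are summed, and the sum is scaled by the coord factor. As a minimal sketch, the following Python recomputes the score of the first result (doc 265) from the values printed in its tree. The formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are the standard ClassicSimilarity definitions and agree with the listed values; the function and constant names are illustrative and do not come from the record itself.

```python
import math

# Constants copied from the explain tree of result 1 (doc 265).
MAX_DOCS = 44421
QUERY_NORM = 0.013248614
FIELD_NORM = 0.078125        # fieldNorm(doc=265)
TOTAL_QUERY_TERMS = 25       # denominator of coord(6/25)

# (term, boost, docFreq, termFreq) as listed for doc 265.
TERMS = [
    ("merging", 1.2732646, 63, 1),
    ("duplication", 1.3751473, 34, 2),
    ("bibliographic", 1.4283612, 1755, 1),
    ("algorithms", 4.811133, 403, 1),
    ("duplicate", 5.4275193, 38, 2),
    ("detection", 5.7177796, 137, 1),
]


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, term_freq, field_norm=FIELD_NORM):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(term_freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, total_terms=TOTAL_QUERY_TERMS):
    """Sum of term scores scaled by coord = matched terms / query terms."""
    coord = len(terms) / total_terms
    return coord * sum(term_score(b, df, tf) for _, b, df, tf in terms)


if __name__ == "__main__":
    for name, boost, df, tf in TERMS:
        print(f"{name:14s} {term_score(boost, df, tf):.8f}")
    print(f"{'total':14s} {document_score(TERMS):.8f}")
```

The result should agree with the listed 0.28360382 up to small rounding differences, since Lucene computes in single precision while Python uses double precision. The same functions reproduce the other results when FIELD_NORM, the term list, and the number of matched terms are taken from the corresponding tree. For example, result 5 has higher individual term scores (fieldNorm 0.125) but matches only 3 of 25 query terms, so its coord factor of 0.12 pulls the final score down to about 0.136.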