Document (#34109)

Author
Vu, Q.M.
Takasu, A.
Adachi, J.
Title
Improving the performance of personal name disambiguation using web directories
Source
Information processing and management. 44(2008) no.4, p.1546-1561
Year
2008
Abstract
Frequent requests from users to search engines on the World Wide Web are to search for information about people using personal names. Current search engines only return sets of documents containing the name queried, but, as several people usually share a personal name, the resulting sets often contain documents relevant to several people. It is necessary to disambiguate people in these result sets in order to help users find the person of interest more readily. In the task of name disambiguation, effective measurement of similarities in the documents is a crucial step towards the final disambiguation. We propose a new method that uses web directories as a knowledge base to find common contexts in documents and uses the common contexts measure to determine document similarities. Experiments, conducted on documents mentioning real people on the web, together with several famous web directory structures, suggest that there are significant advantages in using web directories to disambiguate people compared with other conventional methods.

Similar documents (content)

  1. Spink, A.; Jansen, B.J.; Pedersen, J.: Searching for people on Web search engines (2004) 0.24
    0.24205345 = sum of:
      0.24205345 = product of:
        0.8644766 = sum of:
          0.03201478 = weight(abstract_txt:common in 5429) [ClassicSimilarity], result of:
            0.03201478 = score(doc=5429,freq=1.0), product of:
              0.10660991 = queryWeight, product of:
                1.3881925 = boost
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.015983613 = queryNorm
              0.30029833 = fieldWeight in 5429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.017884322 = weight(abstract_txt:using in 5429) [ClassicSimilarity], result of:
            0.017884322 = score(doc=5429,freq=1.0), product of:
              0.08277693 = queryWeight, product of:
                1.4981359 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015983613 = queryNorm
              0.21605442 = fieldWeight in 5429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.104637675 = weight(abstract_txt:engines in 5429) [ClassicSimilarity], result of:
            0.104637675 = score(doc=5429,freq=6.0), product of:
              0.1292128 = queryWeight, product of:
                1.5282828 = boost
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.015983613 = queryNorm
              0.8098089 = fieldWeight in 5429, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.059770495 = weight(abstract_txt:search in 5429) [ClassicSimilarity], result of:
            0.059770495 = score(doc=5429,freq=8.0), product of:
              0.092517145 = queryWeight, product of:
                1.5838268 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.015983613 = queryNorm
              0.6460478 = fieldWeight in 5429, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.17681526 = weight(abstract_txt:personal in 5429) [ClassicSimilarity], result of:
            0.17681526 = score(doc=5429,freq=8.0), product of:
              0.19065323 = queryWeight, product of:
                2.273624 = boost
                5.246269 = idf(docFreq=635, maxDocs=44421)
                0.015983613 = queryNorm
              0.9274181 = fieldWeight in 5429, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                5.246269 = idf(docFreq=635, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.26819485 = weight(abstract_txt:name in 5429) [ClassicSimilarity], result of:
            0.26819485 = score(doc=5429,freq=6.0), product of:
              0.30489978 = queryWeight, product of:
                3.3200488 = boost
                5.7456303 = idf(docFreq=385, maxDocs=44421)
                0.015983613 = queryNorm
              0.87961644 = fieldWeight in 5429, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.7456303 = idf(docFreq=385, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
          0.20515923 = weight(abstract_txt:people in 5429) [ClassicSimilarity], result of:
            0.20515923 = score(doc=5429,freq=4.0), product of:
              0.3341784 = queryWeight, product of:
                4.2569714 = boost
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.015983613 = queryNorm
              0.6139213 = fieldWeight in 5429, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.0625 = fieldNorm(doc=5429)
        0.28 = coord(7/25)
    
  2. Kim, J.; Kim, J.; Owen-Smith, J.: Ethnicity-based name partitioning for author name disambiguation using supervised machine learning (2021) 0.22
    0.22009857 = sum of:
      0.22009857 = product of:
        1.1004928 = sum of:
          0.081254184 = weight(abstract_txt:similarities in 1312) [ClassicSimilarity], result of:
            0.081254184 = score(doc=1312,freq=1.0), product of:
              0.19836318 = queryWeight, product of:
                1.8935704 = boost
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.015983613 = queryNorm
              0.40962332 = fieldWeight in 1312, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.0625 = fieldNorm(doc=1312)
          0.04073837 = weight(abstract_txt:several in 1312) [ClassicSimilarity], result of:
            0.04073837 = score(doc=1312,freq=1.0), product of:
              0.14330569 = queryWeight, product of:
                1.9711889 = boost
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.015983613 = queryNorm
              0.284276 = fieldWeight in 1312, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.0625 = fieldNorm(doc=1312)
          0.18725172 = weight(abstract_txt:disambiguate in 1312) [ClassicSimilarity], result of:
            0.18725172 = score(doc=1312,freq=1.0), product of:
              0.34608367 = queryWeight, product of:
                2.50116 = boost
                8.656945 = idf(docFreq=20, maxDocs=44421)
                0.015983613 = queryNorm
              0.5410591 = fieldWeight in 1312, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.656945 = idf(docFreq=20, maxDocs=44421)
                0.0625 = fieldNorm(doc=1312)
          0.38157415 = weight(abstract_txt:disambiguation in 1312) [ClassicSimilarity], result of:
            0.38157415 = score(doc=1312,freq=5.0), product of:
              0.37238336 = queryWeight, product of:
                3.1775448 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.015983613 = queryNorm
              1.024681 = fieldWeight in 1312, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0625 = fieldNorm(doc=1312)
          0.4096744 = weight(abstract_txt:name in 1312) [ClassicSimilarity], result of:
            0.4096744 = score(doc=1312,freq=14.0), product of:
              0.30489978 = queryWeight, product of:
                3.3200488 = boost
                5.7456303 = idf(docFreq=385, maxDocs=44421)
                0.015983613 = queryNorm
              1.3436363 = fieldWeight in 1312, product of:
                3.7416575 = tf(freq=14.0), with freq of:
                  14.0 = termFreq=14.0
                5.7456303 = idf(docFreq=385, maxDocs=44421)
                0.0625 = fieldNorm(doc=1312)
        0.2 = coord(5/25)
    
  3. Dumais, S.T.: Latent semantic analysis (2003) 0.17
    0.17248394 = sum of:
      0.17248394 = product of:
        0.39200896 = sum of:
          0.0065511614 = weight(abstract_txt:users in 3462) [ClassicSimilarity], result of:
            0.0065511614 = score(doc=3462,freq=1.0), product of:
              0.058766447 = queryWeight, product of:
                1.0306605 = boost
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.015983613 = queryNorm
              0.11147792 = fieldWeight in 3462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.038471892 = weight(abstract_txt:return in 3462) [ClassicSimilarity], result of:
            0.038471892 = score(doc=3462,freq=2.0), product of:
              0.120501645 = queryWeight, product of:
                1.043596 = boost
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.015983613 = queryNorm
              0.31926447 = fieldWeight in 3462, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.029093826 = weight(abstract_txt:find in 3462) [ClassicSimilarity], result of:
            0.029093826 = score(doc=3462,freq=3.0), product of:
              0.110089034 = queryWeight, product of:
                1.4106619 = boost
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.015983613 = queryNorm
              0.26427543 = fieldWeight in 3462, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.012646125 = weight(abstract_txt:using in 3462) [ClassicSimilarity], result of:
            0.012646125 = score(doc=3462,freq=2.0), product of:
              0.08277693 = queryWeight, product of:
                1.4981359 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015983613 = queryNorm
              0.15277354 = fieldWeight in 3462, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.021359075 = weight(abstract_txt:engines in 3462) [ClassicSimilarity], result of:
            0.021359075 = score(doc=3462,freq=1.0), product of:
              0.1292128 = queryWeight, product of:
                1.5282828 = boost
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.015983613 = queryNorm
              0.16530155 = fieldWeight in 3462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.018300902 = weight(abstract_txt:search in 3462) [ClassicSimilarity], result of:
            0.018300902 = score(doc=3462,freq=3.0), product of:
              0.092517145 = queryWeight, product of:
                1.5838268 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.015983613 = queryNorm
              0.19781092 = fieldWeight in 3462, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.02733556 = weight(abstract_txt:contexts in 3462) [ClassicSimilarity], result of:
            0.02733556 = score(doc=3462,freq=1.0), product of:
              0.15231262 = queryWeight, product of:
                1.659277 = boost
                5.743043 = idf(docFreq=386, maxDocs=44421)
                0.015983613 = queryNorm
              0.17947009 = fieldWeight in 3462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.743043 = idf(docFreq=386, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.040627092 = weight(abstract_txt:similarities in 3462) [ClassicSimilarity], result of:
            0.040627092 = score(doc=3462,freq=1.0), product of:
              0.19836318 = queryWeight, product of:
                1.8935704 = boost
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.015983613 = queryNorm
              0.20481166 = fieldWeight in 3462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.028806377 = weight(abstract_txt:several in 3462) [ClassicSimilarity], result of:
            0.028806377 = score(doc=3462,freq=2.0), product of:
              0.14330569 = queryWeight, product of:
                1.9711889 = boost
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.015983613 = queryNorm
              0.20101349 = fieldWeight in 3462, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.079980396 = weight(abstract_txt:documents in 3462) [ClassicSimilarity], result of:
            0.079980396 = score(doc=3462,freq=10.0), product of:
              0.19628462 = queryWeight, product of:
                2.9782698 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015983613 = queryNorm
              0.40747154 = fieldWeight in 3462, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
          0.08883654 = weight(abstract_txt:people in 3462) [ClassicSimilarity], result of:
            0.08883654 = score(doc=3462,freq=3.0), product of:
              0.3341784 = queryWeight, product of:
                4.2569714 = boost
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.015983613 = queryNorm
              0.2658357 = fieldWeight in 3462, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.03125 = fieldNorm(doc=3462)
        0.44 = coord(11/25)
    
  4. Krovetz, R.; Croft, W.B.: Lexical ambiguity and information retrieval (1992) 0.13
    0.13151613 = sum of:
      0.13151613 = product of:
        0.6575806 = sum of:
          0.048022166 = weight(abstract_txt:common in 4027) [ClassicSimilarity], result of:
            0.048022166 = score(doc=4027,freq=1.0), product of:
              0.10660991 = queryWeight, product of:
                1.3881925 = boost
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.015983613 = queryNorm
              0.4504475 = fieldWeight in 4027, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.09375 = fieldNorm(doc=4027)
          0.055036753 = weight(abstract_txt:uses in 4027) [ClassicSimilarity], result of:
            0.055036753 = score(doc=4027,freq=1.0), product of:
              0.11675396 = queryWeight, product of:
                1.4527361 = boost
                5.0281696 = idf(docFreq=790, maxDocs=44421)
                0.015983613 = queryNorm
              0.4713909 = fieldWeight in 4027, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0281696 = idf(docFreq=790, maxDocs=44421)
                0.09375 = fieldNorm(doc=4027)
          0.061107554 = weight(abstract_txt:several in 4027) [ClassicSimilarity], result of:
            0.061107554 = score(doc=4027,freq=1.0), product of:
              0.14330569 = queryWeight, product of:
                1.9711889 = boost
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.015983613 = queryNorm
              0.426414 = fieldWeight in 4027, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.09375 = fieldNorm(doc=4027)
          0.1314212 = weight(abstract_txt:documents in 4027) [ClassicSimilarity], result of:
            0.1314212 = score(doc=4027,freq=3.0), product of:
              0.19628462 = queryWeight, product of:
                2.9782698 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015983613 = queryNorm
              0.66954404 = fieldWeight in 4027, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=4027)
          0.36199299 = weight(abstract_txt:disambiguation in 4027) [ClassicSimilarity], result of:
            0.36199299 = score(doc=4027,freq=2.0), product of:
              0.37238336 = queryWeight, product of:
                3.1775448 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.015983613 = queryNorm
              0.97209764 = fieldWeight in 4027, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.09375 = fieldNorm(doc=4027)
        0.2 = coord(5/25)
    
  5. Khoo, C.S.G.; Ng, K.; Ou, S.: An exploratory study of human clustering of Web pages (2003) 0.13
    0.13142744 = sum of:
      0.13142744 = product of:
        0.36507618 = sum of:
          0.021973263 = weight(abstract_txt:users in 3741) [ClassicSimilarity], result of:
            0.021973263 = score(doc=3741,freq=5.0), product of:
              0.058766447 = queryWeight, product of:
                1.0306605 = boost
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.015983613 = queryNorm
              0.3739083 = fieldWeight in 3741, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.024011083 = weight(abstract_txt:common in 3741) [ClassicSimilarity], result of:
            0.024011083 = score(doc=3741,freq=1.0), product of:
              0.10660991 = queryWeight, product of:
                1.3881925 = boost
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.015983613 = queryNorm
              0.22522375 = fieldWeight in 3741, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8047733 = idf(docFreq=988, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.035632513 = weight(abstract_txt:find in 3741) [ClassicSimilarity], result of:
            0.035632513 = score(doc=3741,freq=2.0), product of:
              0.110089034 = queryWeight, product of:
                1.4106619 = boost
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.015983613 = queryNorm
              0.32366997 = fieldWeight in 3741, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.01341324 = weight(abstract_txt:using in 3741) [ClassicSimilarity], result of:
            0.01341324 = score(doc=3741,freq=1.0), product of:
              0.08277693 = queryWeight, product of:
                1.4981359 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015983613 = queryNorm
              0.16204081 = fieldWeight in 3741, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.06407722 = weight(abstract_txt:engines in 3741) [ClassicSimilarity], result of:
            0.06407722 = score(doc=3741,freq=4.0), product of:
              0.1292128 = queryWeight, product of:
                1.5282828 = boost
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.015983613 = queryNorm
              0.49590462 = fieldWeight in 3741, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.2896495 = idf(docFreq=608, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.04482787 = weight(abstract_txt:search in 3741) [ClassicSimilarity], result of:
            0.04482787 = score(doc=3741,freq=8.0), product of:
              0.092517145 = queryWeight, product of:
                1.5838268 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.015983613 = queryNorm
              0.4845358 = fieldWeight in 3741, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.030553777 = weight(abstract_txt:several in 3741) [ClassicSimilarity], result of:
            0.030553777 = score(doc=3741,freq=1.0), product of:
              0.14330569 = queryWeight, product of:
                1.9711889 = boost
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.015983613 = queryNorm
              0.213207 = fieldWeight in 3741, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.548416 = idf(docFreq=1277, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.053652484 = weight(abstract_txt:documents in 3741) [ClassicSimilarity], result of:
            0.053652484 = score(doc=3741,freq=2.0), product of:
              0.19628462 = queryWeight, product of:
                2.9782698 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015983613 = queryNorm
              0.27334023 = fieldWeight in 3741, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
          0.07693471 = weight(abstract_txt:people in 3741) [ClassicSimilarity], result of:
            0.07693471 = score(doc=3741,freq=1.0), product of:
              0.3341784 = queryWeight, product of:
                4.2569714 = boost
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.015983613 = queryNorm
              0.23022048 = fieldWeight in 3741, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9113703 = idf(docFreq=888, maxDocs=44421)
                0.046875 = fieldNorm(doc=3741)
        0.36 = coord(9/25)
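
The score breakdowns above all follow the same arithmetic, which can be reproduced by hand. The following is a minimal sketch, not the catalog's actual code: it assumes the classic Lucene TF-IDF formulas that the displayed values appear to follow (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final multiplication by coord = matched terms / query terms). The function names and the TERMS_5429 table are illustrative; the numbers are copied from the explanation for the first similar document (doc 5429).

```python
import math

MAX_DOCS = 44421          # maxDocs as shown in every idf line
QUERY_NORM = 0.015983613  # queryNorm as shown in every queryWeight block


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed classic formula; it matches the displayed idf values closely.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    # One "weight(abstract_txt:term in doc)" entry of the explanation.
    idf_value = idf(doc_freq)
    query_weight = boost * idf_value * QUERY_NORM
    field_weight = math.sqrt(freq) * idf_value * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, query_terms=25):
    # Sum of the matching term scores, scaled by coord(matched/query_terms).
    raw_sum = sum(term_score(freq, doc_freq, boost, field_norm)
                  for _term, freq, doc_freq, boost in terms)
    coord = len(terms) / query_terms
    return raw_sum * coord


# (term, freq, docFreq, boost) copied from the explanation for doc 5429
TERMS_5429 = [
    ("common", 1.0, 988, 1.3881925),
    ("using", 1.0, 3806, 1.4981359),
    ("engines", 6.0, 608, 1.5282828),
    ("search", 8.0, 3123, 1.5838268),
    ("personal", 8.0, 635, 2.273624),
    ("name", 6.0, 385, 3.3200488),
    ("people", 4.0, 888, 4.2569714),
]

score_5429 = document_score(TERMS_5429, field_norm=0.0625)
print(round(score_5429, 4))
```

Under these assumptions the result should land close to the 0.24205345 shown for the first entry; small differences in the last digits are expected because the explanation was computed in single precision. The sketch also shows why a document matching many terms weakly (entry 3, coord 11/25) can rank below one matching fewer terms strongly (entry 2, coord 5/25).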