Document (#37247)

Author
Levin, M.
Krawczyk, S.
Bethard, S.
Jurafsky, D.
Title
Citation-based bootstrapping for large-scale author disambiguation
Source
Journal of the American Society for Information Science and Technology. 63(2012) no.5, pp. 1030-1047
Year
2012
Abstract
We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first "bootstrap" stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B³ F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.
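The abstract reports its results as a B-cubed F1 score. As a rough illustration of how such a score can be computed for a clustering of author instances, here is a minimal sketch using the common per-item definition of B-cubed precision and recall. The abstract does not say which variant the authors used, so the function name `b_cubed`, the dictionary layout, and the toy labels are all assumptions made for this example.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Return (precision, recall, F1) under the per-item B-cubed definition.

    Both arguments map an item (e.g. an author instance id) to a cluster
    label and are assumed to cover exactly the same items.
    """
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for item, label in predicted.items():
        pred_clusters[label].add(item)
    for item, label in gold.items():
        gold_clusters[label].add(item)

    precision = 0.0
    recall = 0.0
    for item in gold:
        p = pred_clusters[predicted[item]]  # items the system grouped with this one
        g = gold_clusters[gold[item]]       # items that truly belong with this one
        overlap = len(p & g)                # always >= 1, the item itself
        precision += overlap / len(p)
        recall += overlap / len(g)

    n = len(gold)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): five author instances, two true authors.
gold = {"a1": "A", "a2": "A", "a3": "A", "a4": "B", "a5": "B"}
predicted = {"a1": 1, "a2": 1, "a3": 2, "a4": 2, "a5": 2}
print(b_cubed(predicted, gold))
```

In the toy data the system wrongly moves instance `a3` into the second author's cluster, which lowers both precision and recall by the same amount.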
Theme
Informetrics
Computational linguistics

Similar documents (content)

  1. Ferreira, A.A.; Veloso, A.; Gonçalves, M.A.; Laender, A.H.F.: Self-training author name disambiguation for information scarce scenarios (2014) 0.30
    0.30118045 = sum of:
      0.30118045 = product of:
        0.94118893 = sum of:
          0.006589745 = weight(abstract_txt:from in 2292) [ClassicSimilarity], result of:
            0.006589745 = score(doc=2292,freq=1.0), product of:
              0.038209744 = queryWeight, product of:
                1.0298947 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0134451855 = queryNorm
              0.17246243 = fieldWeight in 2292, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.020959118 = weight(abstract_txt:authors in 2292) [ClassicSimilarity], result of:
            0.020959118 = score(doc=2292,freq=1.0), product of:
              0.072190486 = queryWeight, product of:
                1.1558465 = boost
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0134451855 = queryNorm
              0.29033077 = fieldWeight in 2292, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.09706443 = weight(abstract_txt:bootstrapping in 2292) [ClassicSimilarity], result of:
            0.09706443 = score(doc=2292,freq=1.0), product of:
              0.15919448 = queryWeight, product of:
                1.2136939 = boost
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0134451855 = queryNorm
              0.6097223 = fieldWeight in 2292, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.09269759 = weight(abstract_txt:self in 2292) [ClassicSimilarity], result of:
            0.09269759 = score(doc=2292,freq=3.0), product of:
              0.15438327 = queryWeight, product of:
                2.0701694 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.0134451855 = queryNorm
              0.60043806 = fieldWeight in 2292, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.115490936 = weight(abstract_txt:author in 2292) [ClassicSimilarity], result of:
            0.115490936 = score(doc=2292,freq=5.0), product of:
              0.16593954 = queryWeight, product of:
                2.4782784 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0134451855 = queryNorm
              0.69598204 = fieldWeight in 2292, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.21411972 = weight(abstract_txt:disambiguation in 2292) [ClassicSimilarity], result of:
            0.21411972 = score(doc=2292,freq=3.0), product of:
              0.26976922 = queryWeight, product of:
                2.736541 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0134451855 = queryNorm
              0.7937144 = fieldWeight in 2292, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.08645196 = weight(abstract_txt:citation in 2292) [ClassicSimilarity], result of:
            0.08645196 = score(doc=2292,freq=2.0), product of:
              0.20000976 = queryWeight, product of:
                3.0419757 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0134451855 = queryNorm
              0.43223873 = fieldWeight in 2292, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
          0.3078154 = weight(abstract_txt:supervised in 2292) [ClassicSimilarity], result of:
            0.3078154 = score(doc=2292,freq=2.0), product of:
              0.46636742 = queryWeight, product of:
                4.645091 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0134451855 = queryNorm
              0.6600277 = fieldWeight in 2292, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=2292)
        0.32 = coord(8/25)
    
  2. Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.25
    0.24729937 = sum of:
      0.24729937 = product of:
        0.88321203 = sum of:
          0.018330429 = weight(abstract_txt:large in 3452) [ClassicSimilarity], result of:
            0.018330429 = score(doc=3452,freq=1.0), product of:
              0.06602064 = queryWeight, product of:
                1.1053505 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0134451855 = queryNorm
              0.27764696 = fieldWeight in 3452, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.02032741 = weight(abstract_txt:approaches in 3452) [ClassicSimilarity], result of:
            0.02032741 = score(doc=3452,freq=1.0), product of:
              0.07073255 = queryWeight, product of:
                1.1441153 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0134451855 = queryNorm
              0.2873841 = fieldWeight in 3452, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.122757375 = weight(abstract_txt:unlabeled in 3452) [ClassicSimilarity], result of:
            0.122757375 = score(doc=3452,freq=2.0), product of:
              0.14776662 = queryWeight, product of:
                1.1693199 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0134451855 = queryNorm
              0.8307517 = fieldWeight in 3452, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.09706443 = weight(abstract_txt:bootstrapping in 3452) [ClassicSimilarity], result of:
            0.09706443 = score(doc=3452,freq=1.0), product of:
              0.15919448 = queryWeight, product of:
                1.2136939 = boost
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0134451855 = queryNorm
              0.6097223 = fieldWeight in 3452, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.06449196 = weight(abstract_txt:feature in 3452) [ClassicSimilarity], result of:
            0.06449196 = score(doc=3452,freq=1.0), product of:
              0.17482308 = queryWeight, product of:
                2.2029526 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0134451855 = queryNorm
              0.36889842 = fieldWeight in 3452, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.12492366 = weight(abstract_txt:expensive in 3452) [ClassicSimilarity], result of:
            0.12492366 = score(doc=3452,freq=1.0), product of:
              0.27165946 = queryWeight, product of:
                2.7461116 = boost
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.0134451855 = queryNorm
              0.4598539 = fieldWeight in 3452, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
          0.43531674 = weight(abstract_txt:supervised in 3452) [ClassicSimilarity], result of:
            0.43531674 = score(doc=3452,freq=4.0), product of:
              0.46636742 = queryWeight, product of:
                4.645091 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0134451855 = queryNorm
              0.9334201 = fieldWeight in 3452, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=3452)
        0.28 = coord(7/25)
    
  3. Suakkaphong, N.; Zhang, Z.; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields (2011) 0.20
    0.20295198 = sum of:
      0.20295198 = product of:
        0.63422495 = sum of:
          0.009319307 = weight(abstract_txt:from in 367) [ClassicSimilarity], result of:
            0.009319307 = score(doc=367,freq=2.0), product of:
              0.038209744 = queryWeight, product of:
                1.0298947 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0134451855 = queryNorm
              0.2438987 = fieldWeight in 367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.01798057 = weight(abstract_txt:databases in 367) [ClassicSimilarity], result of:
            0.01798057 = score(doc=367,freq=1.0), product of:
              0.06517788 = queryWeight, product of:
                1.0982729 = boost
                4.413907 = idf(docFreq=1461, maxDocs=44421)
                0.0134451855 = queryNorm
              0.2758692 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.413907 = idf(docFreq=1461, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.025923142 = weight(abstract_txt:large in 367) [ClassicSimilarity], result of:
            0.025923142 = score(doc=367,freq=2.0), product of:
              0.06602064 = queryWeight, product of:
                1.1053505 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0134451855 = queryNorm
              0.3926521 = fieldWeight in 367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.122757375 = weight(abstract_txt:unlabeled in 367) [ClassicSimilarity], result of:
            0.122757375 = score(doc=367,freq=2.0), product of:
              0.14776662 = queryWeight, product of:
                1.1693199 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0134451855 = queryNorm
              0.8307517 = fieldWeight in 367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.13726984 = weight(abstract_txt:bootstrapping in 367) [ClassicSimilarity], result of:
            0.13726984 = score(doc=367,freq=2.0), product of:
              0.15919448 = queryWeight, product of:
                1.2136939 = boost
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0134451855 = queryNorm
              0.86227757 = fieldWeight in 367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.038824372 = weight(abstract_txt:algorithm in 367) [ClassicSimilarity], result of:
            0.038824372 = score(doc=367,freq=1.0), product of:
              0.10888488 = queryWeight, product of:
                1.4195279 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0134451855 = queryNorm
              0.35656348 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.06449196 = weight(abstract_txt:feature in 367) [ClassicSimilarity], result of:
            0.06449196 = score(doc=367,freq=1.0), product of:
              0.17482308 = queryWeight, product of:
                2.2029526 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0134451855 = queryNorm
              0.36889842 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.21765837 = weight(abstract_txt:supervised in 367) [ClassicSimilarity], result of:
            0.21765837 = score(doc=367,freq=1.0), product of:
              0.46636742 = queryWeight, product of:
                4.645091 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0134451855 = queryNorm
              0.46671006 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
        0.32 = coord(8/25)
    
  4. Zhao, D.; Strotmann, A.: In-text author citation analysis : feasibility, benefits, and limitations (2014) 0.19
    0.18544558 = sum of:
      0.18544558 = product of:
        0.66230565 = sum of:
          0.009319307 = weight(abstract_txt:from in 2535) [ClassicSimilarity], result of:
            0.009319307 = score(doc=2535,freq=2.0), product of:
              0.038209744 = queryWeight, product of:
                1.0298947 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0134451855 = queryNorm
              0.2438987 = fieldWeight in 2535, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.03596114 = weight(abstract_txt:databases in 2535) [ClassicSimilarity], result of:
            0.03596114 = score(doc=2535,freq=4.0), product of:
              0.06517788 = queryWeight, product of:
                1.0982729 = boost
                4.413907 = idf(docFreq=1461, maxDocs=44421)
                0.0134451855 = queryNorm
              0.5517384 = fieldWeight in 2535, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.413907 = idf(docFreq=1461, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.029640669 = weight(abstract_txt:authors in 2535) [ClassicSimilarity], result of:
            0.029640669 = score(doc=2535,freq=2.0), product of:
              0.072190486 = queryWeight, product of:
                1.1558465 = boost
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0134451855 = queryNorm
              0.4105897 = fieldWeight in 2535, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.02971229 = weight(abstract_txt:without in 2535) [ClassicSimilarity], result of:
            0.02971229 = score(doc=2535,freq=1.0), product of:
              0.09110077 = queryWeight, product of:
                1.2984378 = boost
                5.2183604 = idf(docFreq=653, maxDocs=44421)
                0.0134451855 = queryNorm
              0.32614753 = fieldWeight in 2535, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2183604 = idf(docFreq=653, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.14608577 = weight(abstract_txt:author in 2535) [ClassicSimilarity], result of:
            0.14608577 = score(doc=2535,freq=8.0), product of:
              0.16593954 = queryWeight, product of:
                2.4782784 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0134451855 = queryNorm
              0.88035536 = fieldWeight in 2535, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.17482801 = weight(abstract_txt:disambiguation in 2535) [ClassicSimilarity], result of:
            0.17482801 = score(doc=2535,freq=2.0), product of:
              0.26976922 = queryWeight, product of:
                2.736541 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0134451855 = queryNorm
              0.6480651 = fieldWeight in 2535, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
          0.23675847 = weight(abstract_txt:citation in 2535) [ClassicSimilarity], result of:
            0.23675847 = score(doc=2535,freq=15.0), product of:
              0.20000976 = queryWeight, product of:
                3.0419757 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0134451855 = queryNorm
              1.1837345 = fieldWeight in 2535, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0625 = fieldNorm(doc=2535)
        0.28 = coord(7/25)
    
  5. Strotmann, A.; Zhao, D.: Author name disambiguation : what difference does it make in author-based citation analysis? (2012) 0.13
    0.13361555 = sum of:
      0.13361555 = product of:
        0.55673146 = sum of:
          0.006589745 = weight(abstract_txt:from in 1389) [ClassicSimilarity], result of:
            0.006589745 = score(doc=1389,freq=1.0), product of:
              0.038209744 = queryWeight, product of:
                1.0298947 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0134451855 = queryNorm
              0.17246243 = fieldWeight in 1389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
          0.02032741 = weight(abstract_txt:approaches in 1389) [ClassicSimilarity], result of:
            0.02032741 = score(doc=1389,freq=1.0), product of:
              0.07073255 = queryWeight, product of:
                1.1441153 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0134451855 = queryNorm
              0.2873841 = fieldWeight in 1389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
          0.041918237 = weight(abstract_txt:authors in 1389) [ClassicSimilarity], result of:
            0.041918237 = score(doc=1389,freq=4.0), product of:
              0.072190486 = queryWeight, product of:
                1.1558465 = boost
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0134451855 = queryNorm
              0.58066154 = fieldWeight in 1389, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
          0.16332886 = weight(abstract_txt:author in 1389) [ClassicSimilarity], result of:
            0.16332886 = score(doc=1389,freq=10.0), product of:
              0.16593954 = queryWeight, product of:
                2.4782784 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0134451855 = queryNorm
              0.98426723 = fieldWeight in 1389, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
          0.17482801 = weight(abstract_txt:disambiguation in 1389) [ClassicSimilarity], result of:
            0.17482801 = score(doc=1389,freq=2.0), product of:
              0.26976922 = queryWeight, product of:
                2.736541 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0134451855 = queryNorm
              0.6480651 = fieldWeight in 1389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
          0.1497392 = weight(abstract_txt:citation in 1389) [ClassicSimilarity], result of:
            0.1497392 = score(doc=1389,freq=6.0), product of:
              0.20000976 = queryWeight, product of:
                3.0419757 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0134451855 = queryNorm
              0.7486595 = fieldWeight in 1389, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0625 = fieldNorm(doc=1389)
        0.24 = coord(6/25)
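
The score breakdowns in the listing follow one pattern: each matching term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor. The sketch below recomputes the score of entry 1 (doc 2292) from the values printed in its breakdown. It assumes the classic TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the numbers shown; the helper names are made up for this example, and small differences from the listing are expected because the listing was computed in single precision.

```python
import math

# Constants copied from the breakdown of entry 1.
MAX_DOCS = 44421
QUERY_NORM = 0.0134451855
FIELD_NORM = 0.0625


def idf(doc_freq, max_docs=MAX_DOCS):
    # Classic inverse document frequency; matches e.g. 2.759399 for docFreq=7646.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(freq, doc_freq, boost):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


# (term, freq, docFreq, boost) as listed for doc 2292.
terms = [
    ("from", 1.0, 7646, 1.0298947),
    ("authors", 1.0, 1159, 1.1558465),
    ("bootstrapping", 1.0, 6, 1.2136939),
    ("self", 3.0, 470, 2.0701694),
    ("author", 5.0, 829, 2.4782784),
    ("disambiguation", 3.0, 78, 2.736541),
    ("citation", 2.0, 907, 3.0419757),
    ("supervised", 2.0, 68, 4.645091),
]

term_sum = sum(term_score(freq, df, boost) for _, freq, df, boost in terms)
coord = 8 / 25  # 8 of the 25 query terms matched this document
score = term_sum * coord
print(f"sum={term_sum:.6f} coord={coord:.2f} score={score:.6f}")
```

The result should land close to the 0.30118045 reported for entry 1. The same arithmetic applies to the other entries once their term frequencies and coord factors are substituted.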