Document (#34452)

Author
Sánchez-de-Madariaga, R.
Fernández-del-Castillo, J.R.
Title
The bootstrapping of the Yarowsky algorithm in real corpora
Source
Information processing and management. 45(2009) no.1, pp.55-69
Year
2009
Abstract
The Yarowsky bootstrapping algorithm resolves the word sense disambiguation (WSD) problem at the homograph level, which is the sense granularity required by real natural language processing (NLP) applications. It also avoids the knowledge acquisition bottleneck that affects most WSD algorithms, and it can easily be applied to foreign-language corpora. However, this paper shows that the Yarowsky algorithm is significantly less accurate when applied to real, domain-fluctuating corpora. The paper also introduces a new bootstrapping methodology that performs much better on these corpora. Even so, the new methodology does not reach the accuracy achieved on non-domain-fluctuating corpora, because of ambiguities inherent in domain fluctuation.
Theme
Retrieval algorithms

Similar documents (author)

  1. Castillo, M. Davey => Davey Castillo, M.: 0.99
    0.9924187 = sum of:
      0.9924187 = product of:
        2.977256 = sum of:
          2.977256 = weight(author_txt:castillo in 2446) [ClassicSimilarity], result of:
            2.977256 = score(doc=2446,freq=2.0), product of:
              0.6330409 = queryWeight, product of:
                1.0924256 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0653434 = queryNorm
              4.703102 = fieldWeight in 2446, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.375 = fieldNorm(doc=2446)
        0.33333334 = coord(1/3)
    
  2. Castillo, J. Ruiz -> Ruiz-Castillo, J.: 0.99
    0.9924187 = sum of:
      0.9924187 = product of:
        2.977256 = sum of:
          2.977256 = weight(author_txt:castillo in 8229) [ClassicSimilarity], result of:
            2.977256 = score(doc=8229,freq=2.0), product of:
              0.6330409 = queryWeight, product of:
                1.0924256 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0653434 = queryNorm
              4.703102 = fieldWeight in 8229, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.375 = fieldNorm(doc=8229)
        0.33333334 = coord(1/3)
    
  3. Castillo, J. Ruiz- => Ruiz-Castillo, J.: 0.99
    0.9924187 = sum of:
      0.9924187 = product of:
        2.977256 = sum of:
          2.977256 = weight(author_txt:castillo in 2887) [ClassicSimilarity], result of:
            2.977256 = score(doc=2887,freq=2.0), product of:
              0.6330409 = queryWeight, product of:
                1.0924256 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0653434 = queryNorm
              4.703102 = fieldWeight in 2887, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.375 = fieldNorm(doc=2887)
        0.33333334 = coord(1/3)
    
  4. Moreno Fernández, L.M. -> Fernández, L.M.M.: 0.97
    0.9731702 = sum of:
      0.9731702 = product of:
        2.9195106 = sum of:
          2.9195106 = weight(author_txt:fernández in 5950) [ClassicSimilarity], result of:
            2.9195106 = score(doc=5950,freq=2.0), product of:
              0.5638062 = queryWeight, product of:
                1.0309578 = boost
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.0653434 = queryNorm
              5.178217 = fieldWeight in 5950, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.4375 = fieldNorm(doc=5950)
        0.33333334 = coord(1/3)
    
  5. Sánchez, M.F.: Semantically enhanced Information Retrieval : an ontology-based approach (2006) 0.90
    0.89712536 = sum of:
      0.89712536 = product of:
        2.691376 = sum of:
          2.691376 = weight(author_txt:sánchez in 327) [ClassicSimilarity], result of:
            2.691376 = score(doc=327,freq=1.0), product of:
              0.5304544 = queryWeight, product of:
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.0653434 = queryNorm
              5.073718 = fieldWeight in 327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.625 = fieldNorm(doc=327)
        0.33333334 = coord(1/3)
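
The score trees above follow the usual Lucene ClassicSimilarity (TF-IDF) pattern: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and the term score is their product, scaled by coord. The following is a minimal sketch that recomputes the first author match (doc 2446) from the numbers shown in its tree. The function names are illustrative assumptions and are not part of the catalog or of Lucene's API.

```python
import math


def classic_idf(doc_freq, max_docs):
    # idf as reported in the explain output: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # Hypothetical helper: rebuilds one "weight(...)" node of the tree.
    tf = math.sqrt(freq)                      # tf(freq) = sqrt(termFreq)
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm   # "queryWeight, product of"
    field_weight = tf * idf * field_norm      # "fieldWeight in <doc>, product of"
    return query_weight * field_weight


# Values copied from the tree for author_txt:castillo in doc 2446.
term_score = classic_term_score(
    freq=2.0,
    doc_freq=16,
    max_docs=44421,
    query_norm=0.0653434,
    field_norm=0.375,
    boost=1.0924256,
)

# coord(1/3): one of three query clauses matched.
final_score = term_score * (1.0 / 3.0)
print(term_score, final_score)
```

Small differences in the last digits are expected, because the catalog reports single-precision values while this sketch uses double precision.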
    

Similar documents (content)

  1. Tsujii, J.-I.: Automatic acquisition of semantic collocation from corpora (1995) 0.13
    0.13215522 = sum of:
      0.13215522 = product of:
        0.8259702 = sum of:
          0.05934663 = weight(abstract_txt:acquisition in 4777) [ClassicSimilarity], result of:
            0.05934663 = score(doc=4777,freq=1.0), product of:
              0.07611427 = queryWeight, product of:
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.012202423 = queryNorm
              0.77970445 = fieldWeight in 4777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.10949643 = weight(abstract_txt:real in 4777) [ClassicSimilarity], result of:
            0.10949643 = score(doc=4777,freq=1.0), product of:
              0.16513625 = queryWeight, product of:
                2.5512252 = boost
                5.304538 = idf(docFreq=599, maxDocs=44421)
                0.012202423 = queryNorm
              0.6630672 = fieldWeight in 4777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.304538 = idf(docFreq=599, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.2359329 = weight(abstract_txt:algorithm in 4777) [ClassicSimilarity], result of:
            0.2359329 = score(doc=4777,freq=3.0), product of:
              0.19101216 = queryWeight, product of:
                2.7438357 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.012202423 = queryNorm
              1.2351722 = fieldWeight in 4777, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.42119423 = weight(abstract_txt:corpora in 4777) [ClassicSimilarity], result of:
            0.42119423 = score(doc=4777,freq=1.0), product of:
              0.48066992 = queryWeight, product of:
                5.619212 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.012202423 = queryNorm
              0.876265 = fieldWeight in 4777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
        0.16 = coord(4/25)
    
  2. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.11
    0.1142823 = sum of:
      0.1142823 = product of:
        0.47617626 = sum of:
          0.008875211 = weight(abstract_txt:paper in 2683) [ClassicSimilarity], result of:
            0.008875211 = score(doc=2683,freq=1.0), product of:
              0.04688268 = queryWeight, product of:
                1.1099111 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.012202423 = queryNorm
              0.18930681 = fieldWeight in 2683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
          0.031052092 = weight(abstract_txt:language in 2683) [ClassicSimilarity], result of:
            0.031052092 = score(doc=2683,freq=4.0), product of:
              0.068066575 = queryWeight, product of:
                1.3373617 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.012202423 = queryNorm
              0.45620176 = fieldWeight in 2683, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
          0.03366046 = weight(abstract_txt:level in 2683) [ClassicSimilarity], result of:
            0.03366046 = score(doc=2683,freq=3.0), product of:
              0.07905565 = queryWeight, product of:
                1.4412802 = boost
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.012202423 = queryNorm
              0.42578185 = fieldWeight in 2683, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
          0.04799767 = weight(abstract_txt:domain in 2683) [ClassicSimilarity], result of:
            0.04799767 = score(doc=2683,freq=2.0), product of:
              0.13123827 = queryWeight, product of:
                2.2743528 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.012202423 = queryNorm
              0.3657292 = fieldWeight in 2683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
          0.035421554 = weight(abstract_txt:applied in 2683) [ClassicSimilarity], result of:
            0.035421554 = score(doc=2683,freq=1.0), product of:
              0.13503161 = queryWeight, product of:
                2.306988 = boost
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.012202423 = queryNorm
              0.26232046 = fieldWeight in 2683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
          0.31916928 = weight(abstract_txt:corpora in 2683) [ClassicSimilarity], result of:
            0.31916928 = score(doc=2683,freq=3.0), product of:
              0.48066992 = queryWeight, product of:
                5.619212 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.012202423 = queryNorm
              0.6640093 = fieldWeight in 2683, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2683)
        0.24 = coord(6/25)
    
  3. Markó, K.G.: Foundation, implementation and evaluation of the MorphoSaurus system (2008) 0.10
    0.10294017 = sum of:
      0.10294017 = product of:
        0.2859449 = sum of:
          0.018545823 = weight(abstract_txt:acquisition in 415) [ClassicSimilarity], result of:
            0.018545823 = score(doc=415,freq=1.0), product of:
              0.07611427 = queryWeight, product of:
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.012202423 = queryNorm
              0.24365765 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.0218645 = weight(abstract_txt:inherent in 415) [ClassicSimilarity], result of:
            0.0218645 = score(doc=415,freq=1.0), product of:
              0.08494315 = queryWeight, product of:
                1.0564067 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.012202423 = queryNorm
              0.25740156 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.04259631 = weight(abstract_txt:disambiguation in 415) [ClassicSimilarity], result of:
            0.04259631 = score(doc=415,freq=2.0), product of:
              0.10516551 = queryWeight, product of:
                1.1754485 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.012202423 = queryNorm
              0.40504068 = fieldWeight in 415, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.022180066 = weight(abstract_txt:language in 415) [ClassicSimilarity], result of:
            0.022180066 = score(doc=415,freq=4.0), product of:
              0.068066575 = queryWeight, product of:
                1.3373617 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.012202423 = queryNorm
              0.3258584 = fieldWeight in 415, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.03018702 = weight(abstract_txt:sense in 415) [ClassicSimilarity], result of:
            0.03018702 = score(doc=415,freq=1.0), product of:
              0.1326963 = queryWeight, product of:
                1.8672882 = boost
                5.823732 = idf(docFreq=356, maxDocs=44421)
                0.012202423 = queryNorm
              0.22748953 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.823732 = idf(docFreq=356, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.048484966 = weight(abstract_txt:domain in 415) [ClassicSimilarity], result of:
            0.048484966 = score(doc=415,freq=4.0), product of:
              0.13123827 = queryWeight, product of:
                2.2743528 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.012202423 = queryNorm
              0.36944228 = fieldWeight in 415, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.025301108 = weight(abstract_txt:applied in 415) [ClassicSimilarity], result of:
            0.025301108 = score(doc=415,freq=1.0), product of:
              0.13503161 = queryWeight, product of:
                2.306988 = boost
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.012202423 = queryNorm
              0.18737175 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.034217637 = weight(abstract_txt:real in 415) [ClassicSimilarity], result of:
            0.034217637 = score(doc=415,freq=1.0), product of:
              0.16513625 = queryWeight, product of:
                2.5512252 = boost
                5.304538 = idf(docFreq=599, maxDocs=44421)
                0.012202423 = queryNorm
              0.20720851 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.304538 = idf(docFreq=599, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
          0.042567473 = weight(abstract_txt:algorithm in 415) [ClassicSimilarity], result of:
            0.042567473 = score(doc=415,freq=1.0), product of:
              0.19101216 = queryWeight, product of:
                2.7438357 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.012202423 = queryNorm
              0.22285217 = fieldWeight in 415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0390625 = fieldNorm(doc=415)
        0.36 = coord(9/25)
    
  4. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.10
    0.10130227 = sum of:
      0.10130227 = product of:
        0.42209283 = sum of:
          0.037091646 = weight(abstract_txt:acquisition in 3910) [ClassicSimilarity], result of:
            0.037091646 = score(doc=3910,freq=1.0), product of:
              0.07611427 = queryWeight, product of:
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.012202423 = queryNorm
              0.4873153 = fieldWeight in 3910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2376356 = idf(docFreq=235, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
          0.012678874 = weight(abstract_txt:paper in 3910) [ClassicSimilarity], result of:
            0.012678874 = score(doc=3910,freq=1.0), product of:
              0.04688268 = queryWeight, product of:
                1.1099111 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.012202423 = queryNorm
              0.2704383 = fieldWeight in 3910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
          0.03136735 = weight(abstract_txt:language in 3910) [ClassicSimilarity], result of:
            0.03136735 = score(doc=3910,freq=2.0), product of:
              0.068066575 = queryWeight, product of:
                1.3373617 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.012202423 = queryNorm
              0.46083337 = fieldWeight in 3910, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
          0.027106356 = weight(abstract_txt:problem in 3910) [ClassicSimilarity], result of:
            0.027106356 = score(doc=3910,freq=1.0), product of:
              0.07780475 = queryWeight, product of:
                1.429832 = boost
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.012202423 = queryNorm
              0.34838948 = fieldWeight in 3910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
          0.050602216 = weight(abstract_txt:applied in 3910) [ClassicSimilarity], result of:
            0.050602216 = score(doc=3910,freq=1.0), product of:
              0.13503161 = queryWeight, product of:
                2.306988 = boost
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.012202423 = queryNorm
              0.3747435 = fieldWeight in 3910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
          0.2632464 = weight(abstract_txt:corpora in 3910) [ClassicSimilarity], result of:
            0.2632464 = score(doc=3910,freq=1.0), product of:
              0.48066992 = queryWeight, product of:
                5.619212 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.012202423 = queryNorm
              0.5476656 = fieldWeight in 3910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.078125 = fieldNorm(doc=3910)
        0.24 = coord(6/25)
    
  5. Suakkaphong, N.; Zhang, Z.; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields (2011) 0.10
    0.10063857 = sum of:
      0.10063857 = product of:
        0.62899107 = sum of:
          0.038787972 = weight(abstract_txt:domain in 367) [ClassicSimilarity], result of:
            0.038787972 = score(doc=367,freq=1.0), product of:
              0.13123827 = queryWeight, product of:
                2.2743528 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.012202423 = queryNorm
              0.29555383 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.040481772 = weight(abstract_txt:applied in 367) [ClassicSimilarity], result of:
            0.040481772 = score(doc=367,freq=1.0), product of:
              0.13503161 = queryWeight, product of:
                2.306988 = boost
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.012202423 = queryNorm
              0.2997948 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7967167 = idf(docFreq=996, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.06810796 = weight(abstract_txt:algorithm in 367) [ClassicSimilarity], result of:
            0.06810796 = score(doc=367,freq=1.0), product of:
              0.19101216 = queryWeight, product of:
                2.7438357 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.012202423 = queryNorm
              0.35656348 = fieldWeight in 367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
          0.48161337 = weight(abstract_txt:bootstrapping in 367) [ClassicSimilarity], result of:
            0.48161337 = score(doc=367,freq=2.0), product of:
              0.55853635 = queryWeight, product of:
                4.6919494 = boost
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.012202423 = queryNorm
              0.86227757 = fieldWeight in 367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.0625 = fieldNorm(doc=367)
        0.16 = coord(4/25)
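
For the content-based matches, each tree sums the weights of the matching abstract terms and multiplies the sum by coord(matched / total), where the total here is 25 query terms. A minimal sketch of that last step, using the four term weights listed for the first content match (doc 4777); the helper name `combine_clauses` is an assumption made for illustration.

```python
def combine_clauses(clause_scores, total_clauses):
    # Sum the per-term weights, then scale by the fraction of
    # query clauses that matched (the coord factor in the tree).
    coord = len(clause_scores) / total_clauses
    return sum(clause_scores) * coord


# Per-term weights copied from the tree for doc 4777.
doc_4777_terms = {
    "acquisition": 0.05934663,
    "real": 0.10949643,
    "algorithm": 0.2359329,
    "corpora": 0.42119423,
}

doc_4777_score = combine_clauses(list(doc_4777_terms.values()), total_clauses=25)
print(doc_4777_score)
```

This reproduces the listed score of about 0.13 and shows why a document matching only 4 of 25 terms scores low even when a rare term such as "corpora" contributes a large individual weight.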