Document (#34083)

Author
Zhang, Y.
Xu, W.
Title
Fast exact maximum likelihood estimation for mixture of language model
Source
Information processing and management. 44(2008) no.3, pp. 1076-1085
Year
2008
Abstract
Language modeling is an effective and theoretically attractive probabilistic framework for text information retrieval. The basic idea of this approach is to estimate a language model for a given document (or document set) and then perform retrieval or classification based on this model. A common language modeling approach assumes that the data D is generated from a mixture of several language models. The core problem is to find the maximum likelihood estimate of one mixture component, given fixed mixture weights and the other mixture components. The EM algorithm is usually used to find the solution. In this paper, we prove that an exact maximum likelihood estimate of the unknown mixture component exists and can be calculated using the new algorithm we propose. We further improve the algorithm and provide an efficient algorithm of O(k) complexity to find the exact solution, where k is the number of words occurring at least once in data D. Furthermore, we prove that the probabilities of many words are exactly zero, so that the MLE acts explicitly as a feature selection technique.
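
The estimation problem in the abstract can be illustrated with a small sketch. It is not the paper's O(k) algorithm: it is a simpler sort-based variant (O(k log k)) derived from the Karush-Kuhn-Tucker conditions, and it assumes that all known mixture components have been merged into a single background distribution. The function name `exact_mixture_mle` and the parameters `counts`, `background`, and `lam` (the fixed weight of the unknown component) are hypothetical names chosen for illustration.

```python
def exact_mixture_mle(counts, background, lam):
    """Maximize sum_w counts[w] * log(lam * theta[w] + (1 - lam) * background[w])
    over probability vectors theta. Returns only the nonzero entries of theta.

    Assumption: one unknown component (weight lam) plus one known background
    model (weight 1 - lam); this is a sketch, not the paper's O(k) method.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    beta = (1.0 - lam) / lam

    # Only words that occur in the data can receive probability mass.
    words = [w for w, c in counts.items() if c > 0]
    if not words:
        raise ValueError("counts must contain at least one positive count")

    def ratio(w):
        p = background.get(w, 0.0)
        return counts[w] / p if p > 0 else float("inf")

    # Words with the largest count-to-background ratio enter first.
    words.sort(key=ratio, reverse=True)

    active, c_sum, p_sum = [], 0.0, 0.0
    for w in words:
        c, p = counts[w], background.get(w, 0.0)
        # Cross-multiplied form of: ratio(w) <= beta * c_sum / (1 + beta * p_sum).
        # Once one word fails this test, every later (smaller-ratio) word fails too.
        if c * (1.0 + beta * p_sum) <= beta * p * c_sum:
            break
        active.append(w)
        c_sum += c
        p_sum += p

    # Closed form on the active set; all other words get probability exactly 0.
    scale = (1.0 + beta * p_sum) / c_sum
    return {w: counts[w] * scale - beta * background.get(w, 0.0) for w in active}


theta = exact_mixture_mle(
    {"the": 5, "model": 4, "rare": 1},
    {"the": 0.8, "model": 0.1, "rare": 0.1},
    lam=0.3,
)
```

In this toy input the frequent word "the" is fully explained by the background model and receives probability zero, which is the feature-selection effect the abstract describes.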

Similar documents (author)

  1. Zhang, M.; Zhang, Y.: Professional organizations in Twittersphere : an empirical study of U.S. library and information science professional organizations-related Tweets (2020) 4.53
    4.5277104 = sum of:
      4.5277104 = weight(author_txt:zhang in 775) [ClassicSimilarity], result of:
        4.5277104 = score(doc=775,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.15617312 = queryNorm
          4.527711 = fieldWeight in 775, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.5 = fieldNorm(doc=775)
    
  2. Zhang, Y.; Zhang, C.: Enhancing keyphrase extraction from microblogs using human reading time (2021) 4.53
    4.5277104 = sum of:
      4.5277104 = weight(author_txt:zhang in 1238) [ClassicSimilarity], result of:
        4.5277104 = score(doc=1238,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.15617312 = queryNorm
          4.527711 = fieldWeight in 1238, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.5 = fieldNorm(doc=1238)
    
  3. Zhang, J.: TOFIR: A tool of facilitating information retrieval : introduce a visual retrieval model (2001) 4.00
    4.0019684 = sum of:
      4.0019684 = weight(author_txt:zhang in 7710) [ClassicSimilarity], result of:
        4.0019684 = score(doc=7710,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.15617312 = queryNorm
          4.001969 = fieldWeight in 7710, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.625 = fieldNorm(doc=7710)
    
  4. Zhang, A.: Multimedia file formats on the Internet : a beginner's guide for PC users (1995) 4.00
    4.0019684 = sum of:
      4.0019684 = weight(author_txt:zhang in 3280) [ClassicSimilarity], result of:
        4.0019684 = score(doc=3280,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.15617312 = queryNorm
          4.001969 = fieldWeight in 3280, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.625 = fieldNorm(doc=3280)
    
  5. Zhang, J.: A representational analysis of relational information displays (1996) 4.00
    4.0019684 = sum of:
      4.0019684 = weight(author_txt:zhang in 6471) [ClassicSimilarity], result of:
        4.0019684 = score(doc=6471,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.15617312 = queryNorm
          4.001969 = fieldWeight in 6471, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.40315 = idf(docFreq=199, maxDocs=44421)
            0.625 = fieldNorm(doc=6471)
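
The author scores above can be reproduced by hand. The sketch below assumes the standard ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))) and a single-term query, for which queryNorm reduces to 1 / idf. The function names are hypothetical, and Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Inverse document frequency as used by ClassicSimilarity.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def classic_tf(freq):
    # Term frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def single_term_score(freq, doc_freq, max_docs, field_norm):
    idf = classic_idf(doc_freq, max_docs)
    # Assumption: one query term with boost 1, so queryNorm = 1 / sqrt(idf^2).
    query_norm = 1.0 / idf
    query_weight = idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Values taken from entries 1 (doc 775) and 3 (doc 7710) of the list above.
score_775 = single_term_score(freq=2.0, doc_freq=199, max_docs=44421, field_norm=0.5)
score_7710 = single_term_score(freq=1.0, doc_freq=199, max_docs=44421, field_norm=0.625)
```

Entries 1 and 2 outrank entries 3 to 5 because "zhang" matches twice in their author field (tf = 1.4142135), which outweighs their smaller fieldNorm of 0.5.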
    

Similar documents (content)

  1. Ponte, J.M.: Language models for relevance feedback (2000) 0.17
    0.17206298 = sum of:
      0.17206298 = product of:
        0.61451066 = sum of:
          0.028776683 = weight(abstract_txt:approach in 1035) [ClassicSimilarity], result of:
            0.028776683 = score(doc=1035,freq=5.0), product of:
              0.044031274 = queryWeight, product of:
                1.0324498 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.011399554 = queryNorm
              0.653551 = fieldWeight in 1035, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.027522571 = weight(abstract_txt:document in 1035) [ClassicSimilarity], result of:
            0.027522571 = score(doc=1035,freq=2.0), product of:
              0.058010522 = queryWeight, product of:
                1.1850637 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.011399554 = queryNorm
              0.47444102 = fieldWeight in 1035, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.09184468 = weight(abstract_txt:modeling in 1035) [ClassicSimilarity], result of:
            0.09184468 = score(doc=1035,freq=3.0), product of:
              0.11316697 = queryWeight, product of:
                1.6551912 = boost
                5.997685 = idf(docFreq=299, maxDocs=44421)
                0.011399554 = queryNorm
              0.81158555 = fieldWeight in 1035, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.997685 = idf(docFreq=299, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.117067575 = weight(abstract_txt:proof in 1035) [ClassicSimilarity], result of:
            0.117067575 = score(doc=1035,freq=1.0), product of:
              0.19187358 = queryWeight, product of:
                2.1552415 = boost
                7.809647 = idf(docFreq=48, maxDocs=44421)
                0.011399554 = queryNorm
              0.6101287 = fieldWeight in 1035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.809647 = idf(docFreq=48, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.067236125 = weight(abstract_txt:model in 1035) [ClassicSimilarity], result of:
            0.067236125 = score(doc=1035,freq=3.0), product of:
              0.124757156 = queryWeight, product of:
                2.7478375 = boost
                3.9827821 = idf(docFreq=2249, maxDocs=44421)
                0.011399554 = queryNorm
              0.538936 = fieldWeight in 1035, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.9827821 = idf(docFreq=2249, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.15722202 = weight(abstract_txt:likelihood in 1035) [ClassicSimilarity], result of:
            0.15722202 = score(doc=1035,freq=1.0), product of:
              0.2673602 = queryWeight, product of:
                3.1158917 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.011399554 = queryNorm
              0.58805317 = fieldWeight in 1035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
          0.124840975 = weight(abstract_txt:language in 1035) [ClassicSimilarity], result of:
            0.124840975 = score(doc=1035,freq=4.0), product of:
              0.19155708 = queryWeight, product of:
                4.0287604 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.011399554 = queryNorm
              0.6517168 = fieldWeight in 1035, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=1035)
        0.28 = coord(7/25)
    
  2. Lassalle, E.; Lassalle, E.: Semantic models in information retrieval (2012) 0.17
    0.16995992 = sum of:
      0.16995992 = product of:
        0.70816636 = sum of:
          0.030207379 = weight(abstract_txt:words in 1097) [ClassicSimilarity], result of:
            0.030207379 = score(doc=1097,freq=1.0), product of:
              0.09024147 = queryWeight, product of:
                1.4780577 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.011399554 = queryNorm
              0.33473945 = fieldWeight in 1097, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
          0.04242124 = weight(abstract_txt:modeling in 1097) [ClassicSimilarity], result of:
            0.04242124 = score(doc=1097,freq=1.0), product of:
              0.11316697 = queryWeight, product of:
                1.6551912 = boost
                5.997685 = idf(docFreq=299, maxDocs=44421)
                0.011399554 = queryNorm
              0.3748553 = fieldWeight in 1097, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.997685 = idf(docFreq=299, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
          0.06211007 = weight(abstract_txt:model in 1097) [ClassicSimilarity], result of:
            0.06211007 = score(doc=1097,freq=4.0), product of:
              0.124757156 = queryWeight, product of:
                2.7478375 = boost
                3.9827821 = idf(docFreq=2249, maxDocs=44421)
                0.011399554 = queryNorm
              0.49784777 = fieldWeight in 1097, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9827821 = idf(docFreq=2249, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
          0.070620716 = weight(abstract_txt:language in 1097) [ClassicSimilarity], result of:
            0.070620716 = score(doc=1097,freq=2.0), product of:
              0.19155708 = queryWeight, product of:
                4.0287604 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.011399554 = queryNorm
              0.3686667 = fieldWeight in 1097, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
          0.19350252 = weight(abstract_txt:estimation in 1097) [ClassicSimilarity], result of:
            0.19350252 = score(doc=1097,freq=1.0), product of:
              0.39216173 = queryWeight, product of:
                4.3574853 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.011399554 = queryNorm
              0.4934253 = fieldWeight in 1097, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
          0.30930442 = weight(abstract_txt:mixture in 1097) [ClassicSimilarity], result of:
            0.30930442 = score(doc=1097,freq=1.0), product of:
              0.61370826 = queryWeight, product of:
                6.6762094 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.011399554 = queryNorm
              0.5039926 = fieldWeight in 1097, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.0625 = fieldNorm(doc=1097)
        0.24 = coord(6/25)
    
  3. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.15
    0.15416107 = sum of:
      0.15416107 = product of:
        0.6423378 = sum of:
          0.073213615 = weight(abstract_txt:weights in 2283) [ClassicSimilarity], result of:
            0.073213615 = score(doc=2283,freq=5.0), product of:
              0.08261394 = queryWeight, product of:
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.011399554 = queryNorm
              0.88621384 = fieldWeight in 2283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.009008526 = weight(abstract_txt:approach in 2283) [ClassicSimilarity], result of:
            0.009008526 = score(doc=2283,freq=1.0), product of:
              0.044031274 = queryWeight, product of:
                1.0324498 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.011399554 = queryNorm
              0.20459381 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.023595689 = weight(abstract_txt:document in 2283) [ClassicSimilarity], result of:
            0.023595689 = score(doc=2283,freq=3.0), product of:
              0.058010522 = queryWeight, product of:
                1.1850637 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.011399554 = queryNorm
              0.4067484 = fieldWeight in 2283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.026431456 = weight(abstract_txt:words in 2283) [ClassicSimilarity], result of:
            0.026431456 = score(doc=2283,freq=1.0), product of:
              0.09024147 = queryWeight, product of:
                1.4780577 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.011399554 = queryNorm
              0.29289702 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.23944715 = weight(abstract_txt:estimation in 2283) [ClassicSimilarity], result of:
            0.23944715 = score(doc=2283,freq=2.0), product of:
              0.39216173 = queryWeight, product of:
                4.3574853 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.011399554 = queryNorm
              0.61058265 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.2706414 = weight(abstract_txt:mixture in 2283) [ClassicSimilarity], result of:
            0.2706414 = score(doc=2283,freq=1.0), product of:
              0.61370826 = queryWeight, product of:
                6.6762094 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.011399554 = queryNorm
              0.44099355 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
        0.24 = coord(6/25)
    
  4. Bodoff, D.; Wu, B.; Wong, K.Y.M.: Relevance data for language models using maximum likelihood (2003) 0.13
    0.12566523 = sum of:
      0.12566523 = product of:
        0.6283262 = sum of:
          0.015443187 = weight(abstract_txt:approach in 2822) [ClassicSimilarity], result of:
            0.015443187 = score(doc=2822,freq=1.0), product of:
              0.044031274 = queryWeight, product of:
                1.0324498 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.011399554 = queryNorm
              0.35073224 = fieldWeight in 2822, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.09375 = fieldNorm(doc=2822)
          0.033027083 = weight(abstract_txt:document in 2822) [ClassicSimilarity], result of:
            0.033027083 = score(doc=2822,freq=2.0), product of:
              0.058010522 = queryWeight, product of:
                1.1850637 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.011399554 = queryNorm
              0.5693292 = fieldWeight in 2822, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.09375 = fieldNorm(doc=2822)
          0.23813671 = weight(abstract_txt:maximum in 2822) [ClassicSimilarity], result of:
            0.23813671 = score(doc=2822,freq=2.0), product of:
              0.24784184 = queryWeight, product of:
                3.0 = boost
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.011399554 = queryNorm
              0.9608415 = fieldWeight in 2822, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.09375 = fieldNorm(doc=2822)
          0.26681462 = weight(abstract_txt:likelihood in 2822) [ClassicSimilarity], result of:
            0.26681462 = score(doc=2822,freq=2.0), product of:
              0.2673602 = queryWeight, product of:
                3.1158917 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.011399554 = queryNorm
              0.9979593 = fieldWeight in 2822, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.09375 = fieldNorm(doc=2822)
          0.074904576 = weight(abstract_txt:language in 2822) [ClassicSimilarity], result of:
            0.074904576 = score(doc=2822,freq=1.0), product of:
              0.19155708 = queryWeight, product of:
                4.0287604 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.011399554 = queryNorm
              0.39103007 = fieldWeight in 2822, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.09375 = fieldNorm(doc=2822)
        0.2 = coord(5/25)
    
  5. Meij, E.; Rijke, M. de: Thesaurus-based feedback to support mixed search and browsing environments (2007) 0.12
    0.12152466 = sum of:
      0.12152466 = product of:
        0.5063528 = sum of:
          0.015569117 = weight(abstract_txt:document in 3432) [ClassicSimilarity], result of:
            0.015569117 = score(doc=3432,freq=1.0), product of:
              0.058010522 = queryWeight, product of:
                1.1850637 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.011399554 = queryNorm
              0.26838437 = fieldWeight in 3432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
          0.048548635 = weight(abstract_txt:find in 3432) [ClassicSimilarity], result of:
            0.048548635 = score(doc=3432,freq=2.0), product of:
              0.11249569 = queryWeight, product of:
                2.0211656 = boost
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.011399554 = queryNorm
              0.43155995 = fieldWeight in 3432, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
          0.12577762 = weight(abstract_txt:likelihood in 3432) [ClassicSimilarity], result of:
            0.12577762 = score(doc=3432,freq=1.0), product of:
              0.2673602 = queryWeight, product of:
                3.1158917 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.011399554 = queryNorm
              0.47044253 = fieldWeight in 3432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
          0.0730185 = weight(abstract_txt:algorithm in 3432) [ClassicSimilarity], result of:
            0.0730185 = score(doc=3432,freq=1.0), product of:
              0.204784 = queryWeight, product of:
                3.1488454 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.011399554 = queryNorm
              0.35656348 = fieldWeight in 3432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
          0.049936388 = weight(abstract_txt:language in 3432) [ClassicSimilarity], result of:
            0.049936388 = score(doc=3432,freq=1.0), product of:
              0.19155708 = queryWeight, product of:
                4.0287604 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.011399554 = queryNorm
              0.26068673 = fieldWeight in 3432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
          0.19350252 = weight(abstract_txt:estimation in 3432) [ClassicSimilarity], result of:
            0.19350252 = score(doc=3432,freq=1.0), product of:
              0.39216173 = queryWeight, product of:
                4.3574853 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.011399554 = queryNorm
              0.4934253 = fieldWeight in 3432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=3432)
        0.24 = coord(6/25)
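
The content scores combine per-term scores with a boost and a coordination factor. The sketch below assumes the ClassicSimilarity structure visible in the trees above: queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final score equal to the sum of the matched term scores multiplied by coord = matched terms / total query terms. The function names are hypothetical; the numbers are copied from entry 4 (doc 2822).

```python
def boosted_term_score(boost, idf, query_norm, tf, field_norm):
    # Per-term score: query-side weight times document-side weight.
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def coord_score(term_scores, total_query_terms):
    # coord rewards documents that match a larger share of the query terms.
    coord = len(term_scores) / total_query_terms
    return sum(term_scores) * coord


# Term "approach" in doc 2822: boost, idf, queryNorm, tf, fieldNorm from the tree.
approach_2822 = boosted_term_score(1.0324498, 3.741144, 0.011399554, 1.0, 0.09375)

# The five matched term scores of doc 2822, combined with coord(5/25).
doc_2822 = coord_score(
    [0.015443187, 0.033027083, 0.23813671, 0.26681462, 0.074904576],
    total_query_terms=25,
)
```

The coord factor explains why entry 4 ranks below entries 2 and 3 even though its summed term scores are comparable: it matches only 5 of the 25 query terms, while they match 6.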