Document (#28628)

Author
Dunning, T.
Title
Statistical identification of language
Source
http://citeseer.ist.psu.edu/cache/papers/cs/36/http:zSzzSzwww.comp.lancs.ac.ukzSzcomputingzSzresearchzSzucrelzSzpaperszSzlingdet.pdf/dunning94statistical.pdf
Year
1994
Series
Technical report CRL MCCS-94-273, Computing Research Lab, New Mexico State University
Abstract
A statistically based program has been written which learns to distinguish between languages. The amount of training text that such a program needs is surprisingly small, and the amount of text needed to make an identification is also quite small. The program incorporates no linguistic presuppositions other than the assumption that text can be encoded as a string of bytes. Such a program can be used to determine which language small bits of text are in. It also shows a potential for what might be called 'statistical philology' in that it may be applied directly to phonetic transcriptions to help elucidate family trees among language dialects. A variant of this program has been shown to be useful as a quality control in biochemistry. In this application, genetic sequences are assumed to be expressions in a language peculiar to the organism from which the sequence is taken. Thus language identification becomes species identification.
Theme
Computerlinguistik

Similar documents (content)

  1. Huang, X.; Peng, F.; An, A.; Schuurmans, D.: Dynamic Web log session identification with statistical language models (2004) 0.08
    0.08450743 = sum of:
      0.08450743 = product of:
        0.5281714 = sum of:
          0.014292829 = weight(abstract_txt:which in 4096) [ClassicSimilarity], result of:
            0.014292829 = score(doc=4096,freq=1.0), product of:
              0.06278704 = queryWeight, product of:
                1.1816807 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.018235251 = queryNorm
              0.2276398 = fieldWeight in 4096, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.078125 = fieldNorm(doc=4096)
          0.092950195 = weight(abstract_txt:statistical in 4096) [ClassicSimilarity], result of:
            0.092950195 = score(doc=4096,freq=2.0), product of:
              0.15167628 = queryWeight, product of:
                1.4996101 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.018235251 = queryNorm
              0.61281955 = fieldWeight in 4096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.078125 = fieldNorm(doc=4096)
          0.15624072 = weight(abstract_txt:language in 4096) [ClassicSimilarity], result of:
            0.15624072 = score(doc=4096,freq=5.0), product of:
              0.21442743 = queryWeight, product of:
                2.8192246 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018235251 = queryNorm
              0.7286415 = fieldWeight in 4096, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=4096)
          0.26468772 = weight(abstract_txt:identification in 4096) [ClassicSimilarity], result of:
            0.26468772 = score(doc=4096,freq=3.0), product of:
              0.33539218 = queryWeight, product of:
                3.1536317 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.018235251 = queryNorm
              0.7891887 = fieldWeight in 4096, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.078125 = fieldNorm(doc=4096)
        0.16 = coord(4/25)
    
  2. Liu, X.; Croft, W.B.: Statistical language modeling for information retrieval (2004) 0.08
    0.08271588 = sum of:
      0.08271588 = product of:
        0.41357937 = sum of:
          0.055179164 = weight(abstract_txt:sequences in 5277) [ClassicSimilarity], result of:
            0.055179164 = score(doc=5277,freq=1.0), product of:
              0.1358946 = queryWeight, product of:
                1.0037038 = boost
                7.4248013 = idf(docFreq=71, maxDocs=44421)
                0.018235251 = queryNorm
              0.40604383 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4248013 = idf(docFreq=71, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.01000498 = weight(abstract_txt:which in 5277) [ClassicSimilarity], result of:
            0.01000498 = score(doc=5277,freq=1.0), product of:
              0.06278704 = queryWeight, product of:
                1.1816807 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.018235251 = queryNorm
              0.15934785 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.121725716 = weight(abstract_txt:statistical in 5277) [ClassicSimilarity], result of:
            0.121725716 = score(doc=5277,freq=7.0), product of:
              0.15167628 = queryWeight, product of:
                1.4996101 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.018235251 = queryNorm
              0.80253625 = fieldWeight in 5277, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.05031809 = weight(abstract_txt:text in 5277) [ClassicSimilarity], result of:
            0.05031809 = score(doc=5277,freq=2.0), product of:
              0.16100705 = queryWeight, product of:
                2.1850276 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018235251 = queryNorm
              0.31252104 = fieldWeight in 5277, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.17635143 = weight(abstract_txt:language in 5277) [ClassicSimilarity], result of:
            0.17635143 = score(doc=5277,freq=13.0), product of:
              0.21442743 = queryWeight, product of:
                2.8192246 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018235251 = queryNorm
              0.8224294 = fieldWeight in 5277, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
        0.2 = coord(5/25)
    
  3. Wu, Y.-f.B.; Li, Q.; Bot, R.S.; Chen, X.: Finding nuggets in documents : a machine learning approach (2006) 0.08
    0.08071141 = sum of:
      0.08071141 = product of:
        0.40355703 = sum of:
          0.011434263 = weight(abstract_txt:which in 290) [ClassicSimilarity], result of:
            0.011434263 = score(doc=290,freq=1.0), product of:
              0.06278704 = queryWeight, product of:
                1.1816807 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.018235251 = queryNorm
              0.18211183 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.07025161 = weight(abstract_txt:small in 290) [ClassicSimilarity], result of:
            0.07025161 = score(doc=290,freq=1.0), product of:
              0.21062121 = queryWeight, product of:
                2.1642938 = boost
                5.3367167 = idf(docFreq=580, maxDocs=44421)
                0.018235251 = queryNorm
              0.3335448 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3367167 = idf(docFreq=580, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.057506386 = weight(abstract_txt:text in 290) [ClassicSimilarity], result of:
            0.057506386 = score(doc=290,freq=2.0), product of:
              0.16100705 = queryWeight, product of:
                2.1850276 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018235251 = queryNorm
              0.3571669 = fieldWeight in 290, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.12225402 = weight(abstract_txt:identification in 290) [ClassicSimilarity], result of:
            0.12225402 = score(doc=290,freq=1.0), product of:
              0.33539218 = queryWeight, product of:
                3.1536317 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.018235251 = queryNorm
              0.36451066 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
          0.14211076 = weight(abstract_txt:program in 290) [ClassicSimilarity], result of:
            0.14211076 = score(doc=290,freq=1.0), product of:
              0.39942214 = queryWeight, product of:
                3.8477387 = boost
                5.6926546 = idf(docFreq=406, maxDocs=44421)
                0.018235251 = queryNorm
              0.3557909 = fieldWeight in 290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6926546 = idf(docFreq=406, maxDocs=44421)
                0.0625 = fieldNorm(doc=290)
        0.2 = coord(5/25)
    
  4. Caseiro, D.: Automatic language identification bibliography : Last Update: 20 September 1999 (1999) 0.07
    0.07126097 = sum of:
      0.07126097 = product of:
        0.8907621 = sum of:
          0.27949193 = weight(abstract_txt:language in 1841) [ClassicSimilarity], result of:
            0.27949193 = score(doc=1841,freq=1.0), product of:
              0.21442743 = queryWeight, product of:
                2.8192246 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018235251 = queryNorm
              1.3034337 = fieldWeight in 1841, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.3125 = fieldNorm(doc=1841)
          0.6112701 = weight(abstract_txt:identification in 1841) [ClassicSimilarity], result of:
            0.6112701 = score(doc=1841,freq=1.0), product of:
              0.33539218 = queryWeight, product of:
                3.1536317 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.018235251 = queryNorm
              1.8225533 = fieldWeight in 1841, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.3125 = fieldNorm(doc=1841)
        0.08 = coord(2/25)
    
  5. Akman, K.I.: A new text compression technique based on natural language structure (2005) 0.07
    0.070010096 = sum of:
      0.070010096 = product of:
        0.35005048 = sum of:
          0.085341305 = weight(abstract_txt:trees in 1928) [ClassicSimilarity], result of:
            0.085341305 = score(doc=1928,freq=1.0), product of:
              0.14328156 = queryWeight, product of:
                1.0306226 = boost
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.018235251 = queryNorm
              0.59561956 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.078125 = fieldNorm(doc=1928)
          0.11925135 = weight(abstract_txt:bits in 1928) [ClassicSimilarity], result of:
            0.11925135 = score(doc=1928,freq=1.0), product of:
              0.1790852 = queryWeight, product of:
                1.1522171 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.018235251 = queryNorm
              0.6658917 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=1928)
          0.024755906 = weight(abstract_txt:which in 1928) [ClassicSimilarity], result of:
            0.024755906 = score(doc=1928,freq=3.0), product of:
              0.06278704 = queryWeight, product of:
                1.1816807 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.018235251 = queryNorm
              0.39428368 = fieldWeight in 1928, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.078125 = fieldNorm(doc=1928)
          0.050828945 = weight(abstract_txt:text in 1928) [ClassicSimilarity], result of:
            0.050828945 = score(doc=1928,freq=1.0), product of:
              0.16100705 = queryWeight, product of:
                2.1850276 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.018235251 = queryNorm
              0.3156939 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=1928)
          0.06987298 = weight(abstract_txt:language in 1928) [ClassicSimilarity], result of:
            0.06987298 = score(doc=1928,freq=1.0), product of:
              0.21442743 = queryWeight, product of:
                2.8192246 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018235251 = queryNorm
              0.3258584 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=1928)
        0.2 = coord(5/25)
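
The score breakdowns above all follow the same TF-IDF pattern: each matching term contributes queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), with tf being the square root of termFreq, and the sum over terms is scaled by coord (matched terms / query terms). The following is a minimal Python sketch of that arithmetic, not the search engine's own code. The function names are hypothetical, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is an assumption that happens to reproduce the idf values printed above. The input numbers are copied from entry 1 (doc 4096).

```python
import math

MAX_DOCS = 44421          # maxDocs, as printed in the breakdowns
QUERY_NORM = 0.018235251  # queryNorm, as printed in the breakdowns


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency (assumed formula; it matches the listed idf values)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf = sqrt(termFreq)
    return query_weight * field_weight


def document_score(terms, field_norm, matched, query_terms):
    """Sum the per-term scores, then scale by coord = matched / query_terms."""
    total = sum(term_score(freq, doc_freq, boost, field_norm)
                for freq, doc_freq, boost in terms)
    return total * (matched / query_terms)


# (termFreq, docFreq, boost) for the four terms matched in doc 4096 (entry 1).
terms_4096 = [
    (1.0, 6552, 1.1816807),  # which
    (2.0, 470, 1.4996101),   # statistical
    (5.0, 1863, 2.8192246),  # language
    (3.0, 353, 3.1536317),   # identification
]

score_4096 = document_score(terms_4096, field_norm=0.078125, matched=4, query_terms=25)
print(f"{score_4096:.6f}")  # should agree with the listed 0.08450743 up to float rounding
```

The breakdowns also show why entry 4 ranks below entries 1 to 3 despite its much larger per-term weights: its fieldNorm of 0.3125 indicates a very short abstract field, but only 2 of the 25 query terms match, so the coord factor of 0.08 pulls the total down.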