Document (#27093)

Comeau, D.C.
Wilbur, W.J.
Non-Word Identification or Spell Checking Without a Dictionary
Journal of the American Society for Information Science and Technology. 55(2004) no.2, p.169-177
MEDLINE is a collection of more than 12 million references and abstracts covering recent life science literature. With its continued growth and cutting-edge terminology, spell-checking with a traditional lexicon-based approach requires significant additional manual follow-up. In this work, an internal corpus-based context quality rating α, frequency, and simple misspelling transformations are used to rank words from most likely to be misspellings to least likely. Eleven-point average precisions of 0.891 have been achieved within a class of 42,340 all-alphabetic words having an α score less than 10. Our models predict that 16,274 or 38% of these words are misspellings. Based on test data, this result has a recall of 79% and a precision of 86%. In other words, spell checking can be done by statistics instead of with a dictionary. As an application we examine the time history of low-α words in MEDLINE titles and abstracts.
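
The abstract reports an eleven-point average precision for the ranked word list. As a rough illustration of how that measure is commonly computed, here is a minimal sketch of the standard interpolated definition. The function name and the toy labels are hypothetical, and the authors' exact evaluation procedure may differ.

```python
def eleven_point_average_precision(ranked_labels):
    """Interpolated 11-point average precision for a ranked list.

    ranked_labels: 1 for a true misspelling, 0 otherwise, in ranked order
    (most likely misspelling first). Toy input, not data from the paper.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) after each rank position
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0:
    # the best precision achieved at any recall >= that level.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


toy_ranking = [1, 1, 0, 1, 0]  # hypothetical judgments
print(eleven_point_average_precision(toy_ranking))
```

A perfect ranking (all true misspellings first) scores 1.0 under this definition.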

Similar documents (author)

  1. Wilbur, W.J.: Global term weights for document retrieval learned from TREC data (2001) 5.62
    5.620886 = sum of:
      5.620886 = weight(author_txt:wilbur in 2646) [ClassicSimilarity], result of:
        5.620886 = fieldWeight in 2646, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.993418 = idf(docFreq=14, maxDocs=44421)
          0.625 = fieldNorm(doc=2646)
  2. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1996) 5.62
    5.620886 = sum of:
      5.620886 = weight(author_txt:wilbur in 6675) [ClassicSimilarity], result of:
        5.620886 = fieldWeight in 6675, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.993418 = idf(docFreq=14, maxDocs=44421)
          0.625 = fieldNorm(doc=6675)
  3. Wilbur, W.J.: A comparison of group and individual performance among subject experts and untrained workers at the document retrieval task (1998) 5.62
    5.620886 = sum of:
      5.620886 = weight(author_txt:wilbur in 4263) [ClassicSimilarity], result of:
        5.620886 = fieldWeight in 4263, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.993418 = idf(docFreq=14, maxDocs=44421)
          0.625 = fieldNorm(doc=4263)
  4. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1999) 5.62
    5.620886 = sum of:
      5.620886 = weight(author_txt:wilbur in 5539) [ClassicSimilarity], result of:
        5.620886 = fieldWeight in 5539, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.993418 = idf(docFreq=14, maxDocs=44421)
          0.625 = fieldNorm(doc=5539)
  5. Wilbur, W.J.: A retrieval system based on automatic relevance weighting of search terms (1992) 5.62
    5.620886 = sum of:
      5.620886 = weight(author_txt:wilbur in 6269) [ClassicSimilarity], result of:
        5.620886 = fieldWeight in 6269, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.993418 = idf(docFreq=14, maxDocs=44421)
          0.625 = fieldNorm(doc=6269)
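
The author-match scores above can be reproduced by hand. The sketch below assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the numbers shown (other versions of the similarity may define idf slightly differently). The function names are illustrative only.

```python
import math


def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed idf form; it reproduces idf(docFreq=14, maxDocs=44421) = 8.993418.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explanation tree.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the author_txt:wilbur entries above.
author_score = field_weight(freq=1.0, doc_freq=14, max_docs=44421, field_norm=0.625)
print(round(author_score, 4))
```

All five author matches share the same freq, docFreq, and fieldNorm, which is why they all receive the same score of about 5.62.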

Similar documents (content)

  1. Lee, K.H.; Ng, M.K.M.; Lu, Q.: Text segmentation for Chinese spell checking (1999) 0.35
    0.35019076 = sum of:
      0.35019076 = product of:
        1.2506813 = sum of:
          0.009058472 = weight(abstract_txt:with in 4913) [ClassicSimilarity], result of:
            0.009058472 = score(doc=4913,freq=2.0), product of:
              0.041057363 = queryWeight, product of:
                1.2061743 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.013636788 = queryNorm
              0.22062966 = fieldWeight in 4913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.022904595 = weight(abstract_txt:than in 4913) [ClassicSimilarity], result of:
            0.022904595 = score(doc=4913,freq=2.0), product of:
              0.06656906 = queryWeight, product of:
                1.2540219 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.013636788 = queryNorm
              0.3440727 = fieldWeight in 4913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.018784037 = weight(abstract_txt:based in 4913) [ClassicSimilarity], result of:
            0.018784037 = score(doc=4913,freq=2.0), product of:
              0.06676472 = queryWeight, product of:
                1.5381123 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.013636788 = queryNorm
              0.28134674 = fieldWeight in 4913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.08033874 = weight(abstract_txt:dictionary in 4913) [ClassicSimilarity], result of:
            0.08033874 = score(doc=4913,freq=1.0), product of:
              0.19362019 = queryWeight, product of:
                2.138672 = boost
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.013636788 = queryNorm
              0.41492954 = fieldWeight in 4913, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.34524062 = weight(abstract_txt:checking in 4913) [ClassicSimilarity], result of:
            0.34524062 = score(doc=4913,freq=3.0), product of:
              0.40619874 = queryWeight, product of:
                3.793882 = boost
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.013636788 = queryNorm
              0.8499303 = fieldWeight in 4913, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.563447 = weight(abstract_txt:spell in 4913) [ClassicSimilarity], result of:
            0.563447 = score(doc=4913,freq=4.0), product of:
              0.51157945 = queryWeight, product of:
                4.257661 = boost
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.013636788 = queryNorm
              1.101387 = fieldWeight in 4913, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
          0.2109078 = weight(abstract_txt:words in 4913) [ClassicSimilarity], result of:
            0.2109078 = score(doc=4913,freq=4.0), product of:
              0.31503278 = queryWeight, product of:
                4.3133707 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.013636788 = queryNorm
              0.6694789 = fieldWeight in 4913, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=4913)
        0.28 = coord(7/25)
  2. Drabenstott, K.M.; Weller, M.S.: Handling spelling errors in online catalog searches (1996) 0.15
    0.14762354 = sum of:
      0.14762354 = product of:
        0.7381177 = sum of:
          0.012810615 = weight(abstract_txt:with in 6973) [ClassicSimilarity], result of:
            0.012810615 = score(doc=6973,freq=4.0), product of:
              0.041057363 = queryWeight, product of:
                1.2061743 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.013636788 = queryNorm
              0.31201747 = fieldWeight in 6973, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=6973)
          0.022904595 = weight(abstract_txt:than in 6973) [ClassicSimilarity], result of:
            0.022904595 = score(doc=6973,freq=2.0), product of:
              0.06656906 = queryWeight, product of:
                1.2540219 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.013636788 = queryNorm
              0.3440727 = fieldWeight in 6973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.0625 = fieldNorm(doc=6973)
          0.35394338 = weight(abstract_txt:misspellings in 6973) [ClassicSimilarity], result of:
            0.35394338 = score(doc=6973,freq=3.0), product of:
              0.36078578 = queryWeight, product of:
                2.9194 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.013636788 = queryNorm
              0.9810347 = fieldWeight in 6973, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=6973)
          0.19932476 = weight(abstract_txt:checking in 6973) [ClassicSimilarity], result of:
            0.19932476 = score(doc=6973,freq=1.0), product of:
              0.40619874 = queryWeight, product of:
                3.793882 = boost
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.013636788 = queryNorm
              0.4907075 = fieldWeight in 6973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.0625 = fieldNorm(doc=6973)
          0.14913432 = weight(abstract_txt:words in 6973) [ClassicSimilarity], result of:
            0.14913432 = score(doc=6973,freq=2.0), product of:
              0.31503278 = queryWeight, product of:
                4.3133707 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.013636788 = queryNorm
              0.47339305 = fieldWeight in 6973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=6973)
        0.2 = coord(5/25)
  3. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.12
    0.12401317 = sum of:
      0.12401317 = product of:
        0.44290417 = sum of:
          0.06932378 = weight(abstract_txt:score in 188) [ClassicSimilarity], result of:
            0.06932378 = score(doc=188,freq=4.0), product of:
              0.106296256 = queryWeight, product of:
                1.1205026 = boost
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.013636788 = queryNorm
              0.6521752 = fieldWeight in 188, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.008320739 = weight(abstract_txt:with in 188) [ClassicSimilarity], result of:
            0.008320739 = score(doc=188,freq=3.0), product of:
              0.041057363 = queryWeight, product of:
                1.2061743 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.013636788 = queryNorm
              0.2026613 = fieldWeight in 188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.012146997 = weight(abstract_txt:than in 188) [ClassicSimilarity], result of:
            0.012146997 = score(doc=188,freq=1.0), product of:
              0.06656906 = queryWeight, product of:
                1.2540219 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.013636788 = queryNorm
              0.18247211 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.061148275 = weight(abstract_txt:eleven in 188) [ClassicSimilarity], result of:
            0.061148275 = score(doc=188,freq=1.0), product of:
              0.15519318 = queryWeight, product of:
                1.3539113 = boost
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.013636788 = queryNorm
              0.39401394 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.00996174 = weight(abstract_txt:based in 188) [ClassicSimilarity], result of:
            0.00996174 = score(doc=188,freq=1.0), product of:
              0.06676472 = queryWeight, product of:
                1.5381123 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.013636788 = queryNorm
              0.14920665 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.14501402 = weight(abstract_txt:medline in 188) [ClassicSimilarity], result of:
            0.14501402 = score(doc=188,freq=5.0), product of:
              0.2033495 = queryWeight, product of:
                2.1917472 = boost
                6.803628 = idf(docFreq=133, maxDocs=44421)
                0.013636788 = queryNorm
              0.71312696 = fieldWeight in 188, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.803628 = idf(docFreq=133, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.13698862 = weight(abstract_txt:words in 188) [ClassicSimilarity], result of:
            0.13698862 = score(doc=188,freq=3.0), product of:
              0.31503278 = queryWeight, product of:
                4.3133707 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.013636788 = queryNorm
              0.43483928 = fieldWeight in 188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
        0.28 = coord(7/25)
  4. Rubashkin, V.S.; Lakhuti, D.G.: Semanticheskii (kontseptual'nyi) slovar' dlya informatsionnykh tekhnologii, ch.1 (1998) 0.10
    0.09941933 = sum of:
      0.09941933 = product of:
        0.62137085 = sum of:
          0.04048999 = weight(abstract_txt:than in 4253) [ClassicSimilarity], result of:
            0.04048999 = score(doc=4253,freq=1.0), product of:
              0.06656906 = queryWeight, product of:
                1.2540219 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.013636788 = queryNorm
              0.60824037 = fieldWeight in 4253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.15625 = fieldNorm(doc=4253)
          0.0332058 = weight(abstract_txt:based in 4253) [ClassicSimilarity], result of:
            0.0332058 = score(doc=4253,freq=1.0), product of:
              0.06676472 = queryWeight, product of:
                1.5381123 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.013636788 = queryNorm
              0.4973555 = fieldWeight in 4253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.15625 = fieldNorm(doc=4253)
          0.2840403 = weight(abstract_txt:dictionary in 4253) [ClassicSimilarity], result of:
            0.2840403 = score(doc=4253,freq=2.0), product of:
              0.19362019 = queryWeight, product of:
                2.138672 = boost
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.013636788 = queryNorm
              1.4669974 = fieldWeight in 4253, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.15625 = fieldNorm(doc=4253)
          0.26363474 = weight(abstract_txt:words in 4253) [ClassicSimilarity], result of:
            0.26363474 = score(doc=4253,freq=1.0), product of:
              0.31503278 = queryWeight, product of:
                4.3133707 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.013636788 = queryNorm
              0.8368486 = fieldWeight in 4253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.15625 = fieldNorm(doc=4253)
        0.16 = coord(4/25)
  5. Willson, R.; Given, L.M.: The effect of spelling and retrieval system familiarity on search behavior in online public access catalogs : a mixed methods study (2010) 0.09
    0.09306988 = sum of:
      0.09306988 = product of:
        0.7755824 = sum of:
          0.012810615 = weight(abstract_txt:with in 42) [ClassicSimilarity], result of:
            0.012810615 = score(doc=42,freq=4.0), product of:
              0.041057363 = queryWeight, product of:
                1.2061743 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.013636788 = queryNorm
              0.31201747 = fieldWeight in 42, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=42)
          0.19932476 = weight(abstract_txt:checking in 42) [ClassicSimilarity], result of:
            0.19932476 = score(doc=42,freq=1.0), product of:
              0.40619874 = queryWeight, product of:
                3.793882 = boost
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.013636788 = queryNorm
              0.4907075 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.85132 = idf(docFreq=46, maxDocs=44421)
                0.0625 = fieldNorm(doc=42)
          0.563447 = weight(abstract_txt:spell in 42) [ClassicSimilarity], result of:
            0.563447 = score(doc=42,freq=4.0), product of:
              0.51157945 = queryWeight, product of:
                4.257661 = boost
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.013636788 = queryNorm
              1.101387 = fieldWeight in 42, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.0625 = fieldNorm(doc=42)
        0.12 = coord(3/25)
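
The content-similarity scores combine per-term contributions with a coordination factor. As a sketch, the following recomputes the score of the first content match (doc 4913) from the values in its explanation tree, assuming score = coord × Σ(queryWeight × fieldWeight) with queryWeight = boost × idf × queryNorm and fieldWeight = tf × idf × fieldNorm. The helper names are illustrative, and the idf form is an assumption that happens to match the listed values.

```python
import math

QUERY_NORM = 0.013636788  # queryNorm as shown in the explanation
MAX_DOCS = 44421          # maxDocs as shown in the explanation


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Assumed idf form, consistent with the listed idf values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float) -> float:
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, freq, docFreq, boost) copied from the tree for doc 4913
terms_doc_4913 = [
    ("with", 2.0, 9949, 1.2061743),
    ("than", 2.0, 2461, 1.2540219),
    ("based", 2.0, 5005, 1.5381123),
    ("dictionary", 1.0, 157, 2.138672),
    ("checking", 3.0, 46, 3.793882),
    ("spell", 4.0, 17, 4.257661),
    ("words", 4.0, 569, 4.3133707),
]

raw_sum = sum(term_score(f, df, b, field_norm=0.0625) for _, f, df, b in terms_doc_4913)
coord = 7 / 25  # 7 of the 25 query terms matched
content_score = raw_sum * coord
print(round(content_score, 2))
```

The coord factor explains why a document matching few query terms (for example 3 of 25 in the last entry) ranks low even when individual term weights are high.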