Document (#29340)

Author
Brener, N.E.
Iyengar, S.S.
Pianykh, O.S.
Title
¬A conclusive methodology for rating OCR performance
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.12, p.1274-1287
Year
2005
Abstract
One of the most challenging topics in the automatic document rating process is the development of a rating scheme for the image quality of documents. As part of the Department of Energy (DOE) document declassification program, we have developed a generalized rating system to predict the optical character recognition (OCR) accuracy level that is achieved when processing a document. The need for such a system emerged from the declassification of degraded, typewriter-era documents, which is currently a time-consuming manual process. This article presents the statistical analysis of the most influential document quality features affecting OCR accuracy, develops consistent predictive models for four currently used OCR engines, and studies the applicability of different OCR products to the DOE document declassification process. This study is expected to lead to an efficient and completely automated document declassification system.
Object
OCR

Similar documents (content)

  1. Jiang, X.; Tan, A.-H.: CRCTOL: a semantic-based domain ontology learning system (2009) 0.15
    0.15029082 = sum of:
      0.15029082 = product of:
        0.53675294 = sum of:
          0.05245642 = weight(abstract_txt:generalized in 307) [ClassicSimilarity], result of:
            0.05245642 = score(doc=307,freq=1.0), product of:
              0.1357349 = queryWeight, product of:
                1.0866939 = boost
                7.0667386 = idf(docFreq=102, maxDocs=44421)
                0.017675238 = queryNorm
              0.38646227 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0667386 = idf(docFreq=102, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.020840745 = weight(abstract_txt:documents in 307) [ClassicSimilarity], result of:
            0.020840745 = score(doc=307,freq=1.0), product of:
              0.092422545 = queryWeight, product of:
                1.2681348 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.017675238 = queryNorm
              0.22549418 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.042119574 = weight(abstract_txt:quality in 307) [ClassicSimilarity], result of:
            0.042119574 = score(doc=307,freq=2.0), product of:
              0.11725961 = queryWeight, product of:
                1.4284028 = boost
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.017675238 = queryNorm
              0.35919935 = fieldWeight in 307, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.01710901 = weight(abstract_txt:system in 307) [ClassicSimilarity], result of:
            0.01710901 = score(doc=307,freq=1.0), product of:
              0.09275759 = queryWeight, product of:
                1.5559543 = boost
                3.372775 = idf(docFreq=4140, maxDocs=44421)
                0.017675238 = queryNorm
              0.18444863 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.372775 = idf(docFreq=4140, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.06278772 = weight(abstract_txt:accuracy in 307) [ClassicSimilarity], result of:
            0.06278772 = score(doc=307,freq=1.0), product of:
              0.19279048 = queryWeight, product of:
                1.831552 = boost
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.017675238 = queryNorm
              0.32567853 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.070619464 = weight(abstract_txt:document in 307) [ClassicSimilarity], result of:
            0.070619464 = score(doc=307,freq=1.0), product of:
              0.3007178 = queryWeight, product of:
                3.96202 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.017675238 = queryNorm
              0.23483633 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
          0.27082005 = weight(abstract_txt:rating in 307) [ClassicSimilarity], result of:
            0.27082005 = score(doc=307,freq=1.0), product of:
              0.64362514 = queryWeight, product of:
                4.732689 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.017675238 = queryNorm
              0.42077297 = fieldWeight in 307, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0546875 = fieldNorm(doc=307)
        0.28 = coord(7/25)
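
The score tree above can be checked by hand. The following is a minimal sketch, not the search engine's own code, that recomputes the listing for doc 307 under the classic TF-IDF formulas the tree appears to follow: idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final multiplication by coord. The function and variable names are illustrative assumptions; boost, docFreq, freq, queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math

# Values copied from the listing for doc 307 (not derived here).
MAX_DOCS = 44421
QUERY_NORM = 0.017675238
FIELD_NORM = 0.0546875


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency as it appears in the explain tree."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq,
               field_norm=FIELD_NORM, query_norm=QUERY_NORM):
    """Score of one query term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, boost, docFreq, freq) for the seven matching terms in doc 307.
DOC_307_TERMS = [
    ("generalized", 1.0866939, 102, 1.0),
    ("documents", 1.2681348, 1954, 1.0),
    ("quality", 1.4284028, 1160, 2.0),
    ("system", 1.5559543, 4140, 1.0),
    ("accuracy", 1.831552, 312, 1.0),
    ("document", 3.96202, 1647, 1.0),
    ("rating", 4.732689, 54, 1.0),
]


def document_score(terms, total_clauses):
    """Sum of term scores, scaled by coord = matched / total query clauses."""
    coord = len(terms) / total_clauses
    return coord * sum(term_score(b, df, f) for _, b, df, f in terms)


if __name__ == "__main__":
    for name, boost, doc_freq, freq in DOC_307_TERMS:
        print(f"{name:12s} {term_score(boost, doc_freq, freq):.8f}")
    print("total", document_score(DOC_307_TERMS, total_clauses=25))
```

Because the engine works in single precision, the recomputed values should agree with the listing only to about four or five significant digits. The same functions apply to the other four results if their fieldNorm and term lists are substituted.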
    
  2. Li, H.; Bhowmick, S.S.; Sun, A.: AffRank: affinity-driven ranking of products in online social rating networks (2011) 0.14
    0.13990466 = sum of:
      0.13990466 = product of:
        0.6995233 = sum of:
          0.05264059 = weight(abstract_txt:predict in 483) [ClassicSimilarity], result of:
            0.05264059 = score(doc=483,freq=1.0), product of:
              0.12446435 = queryWeight, product of:
                1.0406003 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.017675238 = queryNorm
              0.4229371 = fieldWeight in 483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.0625 = fieldNorm(doc=483)
          0.055943836 = weight(abstract_txt:affecting in 483) [ClassicSimilarity], result of:
            0.055943836 = score(doc=483,freq=1.0), product of:
              0.1296182 = queryWeight, product of:
                1.0619265 = boost
                6.905677 = idf(docFreq=120, maxDocs=44421)
                0.017675238 = queryNorm
              0.4316048 = fieldWeight in 483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.905677 = idf(docFreq=120, maxDocs=44421)
                0.0625 = fieldNorm(doc=483)
          0.020816412 = weight(abstract_txt:most in 483) [ClassicSimilarity], result of:
            0.020816412 = score(doc=483,freq=1.0), product of:
              0.08448476 = queryWeight, product of:
                1.2124552 = boost
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.017675238 = queryNorm
              0.2463925 = fieldWeight in 483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.0625 = fieldNorm(doc=483)
          0.034037758 = weight(abstract_txt:quality in 483) [ClassicSimilarity], result of:
            0.034037758 = score(doc=483,freq=1.0), product of:
              0.11725961 = queryWeight, product of:
                1.4284028 = boost
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.017675238 = queryNorm
              0.2902769 = fieldWeight in 483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.0625 = fieldNorm(doc=483)
          0.5360847 = weight(abstract_txt:rating in 483) [ClassicSimilarity], result of:
            0.5360847 = score(doc=483,freq=3.0), product of:
              0.64362514 = queryWeight, product of:
                4.732689 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.017675238 = queryNorm
              0.8329145 = fieldWeight in 483, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=483)
        0.2 = coord(5/25)
    
  3. Taylor, S.L.: Integrating natural language understanding with document structure analysis (1994) 0.14
    0.13878192 = sum of:
      0.13878192 = product of:
        0.57825804 = sum of:
          0.07061619 = weight(abstract_txt:character in 1862) [ClassicSimilarity], result of:
            0.07061619 = score(doc=1862,freq=1.0), product of:
              0.11553312 = queryWeight, product of:
                1.00257 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.017675238 = queryNorm
              0.61122036 = fieldWeight in 1862, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
          0.07434794 = weight(abstract_txt:develops in 1862) [ClassicSimilarity], result of:
            0.07434794 = score(doc=1862,freq=1.0), product of:
              0.11956836 = queryWeight, product of:
                1.0199282 = boost
                6.6325636 = idf(docFreq=158, maxDocs=44421)
                0.017675238 = queryNorm
              0.6218028 = fieldWeight in 1862, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6325636 = idf(docFreq=158, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
          0.102036834 = weight(abstract_txt:optical in 1862) [ClassicSimilarity], result of:
            0.102036834 = score(doc=1862,freq=1.0), product of:
              0.14766411 = queryWeight, product of:
                1.1334411 = boost
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.017675238 = queryNorm
              0.6910063 = fieldWeight in 1862, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
          0.03122462 = weight(abstract_txt:most in 1862) [ClassicSimilarity], result of:
            0.03122462 = score(doc=1862,freq=1.0), product of:
              0.08448476 = queryWeight, product of:
                1.2124552 = boost
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.017675238 = queryNorm
              0.36958876 = fieldWeight in 1862, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
          0.029329734 = weight(abstract_txt:system in 1862) [ClassicSimilarity], result of:
            0.029329734 = score(doc=1862,freq=1.0), product of:
              0.09275759 = queryWeight, product of:
                1.5559543 = boost
                3.372775 = idf(docFreq=4140, maxDocs=44421)
                0.017675238 = queryNorm
              0.31619766 = fieldWeight in 1862, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.372775 = idf(docFreq=4140, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
          0.27070272 = weight(abstract_txt:document in 1862) [ClassicSimilarity], result of:
            0.27070272 = score(doc=1862,freq=5.0), product of:
              0.3007178 = queryWeight, product of:
                3.96202 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.017675238 = queryNorm
              0.9001885 = fieldWeight in 1862, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.09375 = fieldNorm(doc=1862)
        0.24 = coord(6/25)
    
  4. Taghva, K.; Borsack, J.; Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model (1996) 0.12
    0.122216284 = sum of:
      0.122216284 = product of:
        0.6110814 = sum of:
          0.08238556 = weight(abstract_txt:character in 5019) [ClassicSimilarity], result of:
            0.08238556 = score(doc=5019,freq=1.0), product of:
              0.11553312 = queryWeight, product of:
                1.00257 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.017675238 = queryNorm
              0.7130904 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.11904298 = weight(abstract_txt:optical in 5019) [ClassicSimilarity], result of:
            0.11904298 = score(doc=5019,freq=1.0), product of:
              0.14766411 = queryWeight, product of:
                1.1334411 = boost
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.017675238 = queryNorm
              0.80617404 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.04168149 = weight(abstract_txt:documents in 5019) [ClassicSimilarity], result of:
            0.04168149 = score(doc=5019,freq=1.0), product of:
              0.092422545 = queryWeight, product of:
                1.2681348 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.017675238 = queryNorm
              0.45098835 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.22673248 = weight(abstract_txt:degraded in 5019) [ClassicSimilarity], result of:
            0.22673248 = score(doc=5019,freq=1.0), product of:
              0.22688979 = queryWeight, product of:
                1.4049761 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.017675238 = queryNorm
              0.9993067 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.14123893 = weight(abstract_txt:document in 5019) [ClassicSimilarity], result of:
            0.14123893 = score(doc=5019,freq=1.0), product of:
              0.3007178 = queryWeight, product of:
                3.96202 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.017675238 = queryNorm
              0.46967265 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
        0.2 = coord(5/25)
    
  5. Broadhurst, R.: ¬The digitisation of library material (1993) 0.11
    0.11368749 = sum of:
      0.11368749 = product of:
        0.56843746 = sum of:
          0.094154924 = weight(abstract_txt:character in 6255) [ClassicSimilarity], result of:
            0.094154924 = score(doc=6255,freq=1.0), product of:
              0.11553312 = queryWeight, product of:
                1.00257 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.017675238 = queryNorm
              0.8149605 = fieldWeight in 6255, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.125 = fieldNorm(doc=6255)
          0.13604912 = weight(abstract_txt:optical in 6255) [ClassicSimilarity], result of:
            0.13604912 = score(doc=6255,freq=1.0), product of:
              0.14766411 = queryWeight, product of:
                1.1334411 = boost
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.017675238 = queryNorm
              0.9213418 = fieldWeight in 6255, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.370734 = idf(docFreq=75, maxDocs=44421)
                0.125 = fieldNorm(doc=6255)
          0.10916205 = weight(abstract_txt:currently in 6255) [ClassicSimilarity], result of:
            0.10916205 = score(doc=6255,freq=1.0), product of:
              0.16064563 = queryWeight, product of:
                1.6719024 = boost
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.017675238 = queryNorm
              0.6795208 = fieldWeight in 6255, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.125 = fieldNorm(doc=6255)
          0.06765548 = weight(abstract_txt:process in 6255) [ClassicSimilarity], result of:
            0.06765548 = score(doc=6255,freq=1.0), product of:
              0.13367604 = queryWeight, product of:
                1.8678796 = boost
                4.048922 = idf(docFreq=2105, maxDocs=44421)
                0.017675238 = queryNorm
              0.50611526 = fieldWeight in 6255, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.048922 = idf(docFreq=2105, maxDocs=44421)
                0.125 = fieldNorm(doc=6255)
          0.16141592 = weight(abstract_txt:document in 6255) [ClassicSimilarity], result of:
            0.16141592 = score(doc=6255,freq=1.0), product of:
              0.3007178 = queryWeight, product of:
                3.96202 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.017675238 = queryNorm
              0.53676873 = fieldWeight in 6255, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.125 = fieldNorm(doc=6255)
        0.2 = coord(5/25)