Document (#35315)

Author
Roitblat, H.L.
Kershaw, A.
Oot, P.
Title
Document categorization in legal electronic discovery : computer classification vs. manual review
Source
Journal of the American Society for Information Science and Technology. 61(2010) no.1, pp. 70-80
Year
2009
Abstract
In litigation in the US, the parties are obligated to produce to one another, when requested, those documents that are potentially relevant to issues and facts of the litigation (called discovery). As the volume of electronic documents continues to grow, the expense of dealing with this obligation threatens to surpass the amounts at issue, and the time needed to identify these relevant documents can delay a case for months or years. The same holds true for government investigations and third parties served with subpoenas. As a result, litigants are looking for ways to reduce the time and expense of discovery. One approach is to supplant or reduce the traditional means of having people, usually attorneys, read each document, with automated procedures that use information retrieval and machine categorization to identify the relevant documents. This study compared an original categorization, obtained as part of a response to a Department of Justice Request and produced by having one or more of 225 attorneys review each document, with automated categorization systems provided by two legal service providers. The goal was to determine whether the automated systems could categorize documents at least as well as human reviewers could, thereby saving time and expense. The results support the idea that machine categorization is no less accurate at identifying relevant/responsive documents than employing a team of reviewers. Based on these results, it would appear that using machine categorization can be a reasonable substitute for human review.
Field
Law (Rechtswissenschaft)

Similar documents (content)

  1. Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.23
    0.22753014 = sum of:
      0.22753014 = product of:
        0.81260765 = sum of:
          0.0073966654 = weight(abstract_txt:that in 4267) [ClassicSimilarity], result of:
            0.0073966654 = score(doc=4267,freq=1.0), product of:
              0.04003379 = queryWeight, product of:
                1.0802455 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.015670577 = queryNorm
              0.18476056 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.012299854 = weight(abstract_txt:with in 4267) [ClassicSimilarity], result of:
            0.012299854 = score(doc=4267,freq=2.0), product of:
              0.044599093 = queryWeight, product of:
                1.1401767 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015670577 = queryNorm
              0.2757871 = fieldWeight in 4267, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.029948654 = weight(abstract_txt:time in 4267) [ClassicSimilarity], result of:
            0.029948654 = score(doc=4267,freq=1.0), product of:
              0.09240058 = queryWeight, product of:
                1.4212717 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.015670577 = queryNorm
              0.3241176 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.06942888 = weight(abstract_txt:reduce in 4267) [ClassicSimilarity], result of:
            0.06942888 = score(doc=4267,freq=1.0), product of:
              0.14139026 = queryWeight, product of:
                1.4355022 = boost
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.015670577 = queryNorm
              0.49104428 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.07370056 = weight(abstract_txt:automated in 4267) [ClassicSimilarity], result of:
            0.07370056 = score(doc=4267,freq=1.0), product of:
              0.16842388 = queryWeight, product of:
                1.9188524 = boost
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.015670577 = queryNorm
              0.43758973 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.083162256 = weight(abstract_txt:documents in 4267) [ClassicSimilarity], result of:
            0.083162256 = score(doc=4267,freq=2.0), product of:
              0.1825467 = queryWeight, product of:
                2.8251517 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015670577 = queryNorm
              0.455567 = fieldWeight in 4267, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
          0.5366708 = weight(abstract_txt:categorization in 4267) [ClassicSimilarity], result of:
            0.5366708 = score(doc=4267,freq=5.0), product of:
              0.4662102 = queryWeight, product of:
                4.514874 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.015670577 = queryNorm
              1.1511348 = fieldWeight in 4267, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=4267)
        0.28 = coord(7/25)
    
  2. Kim, J.-H.; Choi, K.-S.: Patent document categorization based on semantic structural information (2007) 0.21
    0.20845689 = sum of:
      0.20845689 = product of:
        0.7444889 = sum of:
          0.008368372 = weight(abstract_txt:that in 1933) [ClassicSimilarity], result of:
            0.008368372 = score(doc=1933,freq=2.0), product of:
              0.04003379 = queryWeight, product of:
                1.0802455 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.015670577 = queryNorm
              0.20903271 = fieldWeight in 1933, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.006957848 = weight(abstract_txt:with in 1933) [ClassicSimilarity], result of:
            0.006957848 = score(doc=1933,freq=1.0), product of:
              0.044599093 = queryWeight, product of:
                1.1401767 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015670577 = queryNorm
              0.15600874 = fieldWeight in 1933, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.023958925 = weight(abstract_txt:time in 1933) [ClassicSimilarity], result of:
            0.023958925 = score(doc=1933,freq=1.0), product of:
              0.09240058 = queryWeight, product of:
                1.4212717 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.015670577 = queryNorm
              0.2592941 = fieldWeight in 1933, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.037573017 = weight(abstract_txt:document in 1933) [ClassicSimilarity], result of:
            0.037573017 = score(doc=1933,freq=2.0), product of:
              0.09899286 = queryWeight, product of:
                1.4710983 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.015670577 = queryNorm
              0.3795528 = fieldWeight in 1933, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.044399846 = weight(abstract_txt:relevant in 1933) [ClassicSimilarity], result of:
            0.044399846 = score(doc=1933,freq=1.0), product of:
              0.1534371 = queryWeight, product of:
                2.1148243 = boost
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.015670577 = queryNorm
              0.2893684 = fieldWeight in 1933, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.11523301 = weight(abstract_txt:documents in 1933) [ClassicSimilarity], result of:
            0.11523301 = score(doc=1933,freq=6.0), product of:
              0.1825467 = queryWeight, product of:
                2.8251517 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015670577 = queryNorm
              0.6312522 = fieldWeight in 1933, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
          0.5079979 = weight(abstract_txt:categorization in 1933) [ClassicSimilarity], result of:
            0.5079979 = score(doc=1933,freq=7.0), product of:
              0.4662102 = queryWeight, product of:
                4.514874 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.015670577 = queryNorm
              1.0896327 = fieldWeight in 1933, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=1933)
        0.28 = coord(7/25)
    
  3. Goren-Bar, D.; Kuflik, T.: Supporting user-subjective categorization with self-organizing maps and learning vector quantization (2005) 0.20
    0.2022672 = sum of:
      0.2022672 = product of:
        0.84278 = sum of:
          0.022947282 = weight(abstract_txt:human in 4325) [ClassicSimilarity], result of:
            0.022947282 = score(doc=4325,freq=1.0), product of:
              0.07843085 = queryWeight, product of:
                1.0691473 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.015670577 = queryNorm
              0.2925798 = fieldWeight in 4325, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
          0.01024912 = weight(abstract_txt:that in 4325) [ClassicSimilarity], result of:
            0.01024912 = score(doc=4325,freq=3.0), product of:
              0.04003379 = queryWeight, product of:
                1.0802455 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.015670577 = queryNorm
              0.25601172 = fieldWeight in 4325, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
          0.009839883 = weight(abstract_txt:with in 4325) [ClassicSimilarity], result of:
            0.009839883 = score(doc=4325,freq=2.0), product of:
              0.044599093 = queryWeight, product of:
                1.1401767 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015670577 = queryNorm
              0.22062966 = fieldWeight in 4325, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
          0.053136274 = weight(abstract_txt:document in 4325) [ClassicSimilarity], result of:
            0.053136274 = score(doc=4325,freq=4.0), product of:
              0.09899286 = queryWeight, product of:
                1.4710983 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.015670577 = queryNorm
              0.53676873 = fieldWeight in 4325, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
          0.08148204 = weight(abstract_txt:documents in 4325) [ClassicSimilarity], result of:
            0.08148204 = score(doc=4325,freq=3.0), product of:
              0.1825467 = queryWeight, product of:
                2.8251517 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015670577 = queryNorm
              0.4463627 = fieldWeight in 4325, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
          0.66512537 = weight(abstract_txt:categorization in 4325) [ClassicSimilarity], result of:
            0.66512537 = score(doc=4325,freq=12.0), product of:
              0.4662102 = queryWeight, product of:
                4.514874 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.015670577 = queryNorm
              1.4266642 = fieldWeight in 4325, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=4325)
        0.24 = coord(6/25)
    
  4. Han, K.; Rezapour, R.; Nakamura, K.; Devkota, D.; Miller, D.C.; Diesner, J.: An expert-in-the-loop method for domain-specific document categorization based on small training data (2023) 0.19
    0.1935105 = sum of:
      0.1935105 = product of:
        0.60472035 = sum of:
          0.013231559 = weight(abstract_txt:that in 1969) [ClassicSimilarity], result of:
            0.013231559 = score(doc=1969,freq=5.0), product of:
              0.04003379 = queryWeight, product of:
                1.0802455 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.015670577 = queryNorm
              0.33050975 = fieldWeight in 1969, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.026739197 = weight(abstract_txt:identify in 1969) [ClassicSimilarity], result of:
            0.026739197 = score(doc=1969,freq=1.0), product of:
              0.08684903 = queryWeight, product of:
                1.1250623 = boost
                4.9261017 = idf(docFreq=875, maxDocs=44421)
                0.015670577 = queryNorm
              0.30788136 = fieldWeight in 1969, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9261017 = idf(docFreq=875, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.013915696 = weight(abstract_txt:with in 1969) [ClassicSimilarity], result of:
            0.013915696 = score(doc=1969,freq=4.0), product of:
              0.044599093 = queryWeight, product of:
                1.1401767 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015670577 = queryNorm
              0.31201747 = fieldWeight in 1969, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.023958925 = weight(abstract_txt:time in 1969) [ClassicSimilarity], result of:
            0.023958925 = score(doc=1969,freq=1.0), product of:
              0.09240058 = queryWeight, product of:
                1.4212717 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.015670577 = queryNorm
              0.2592941 = fieldWeight in 1969, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.083382666 = weight(abstract_txt:automated in 1969) [ClassicSimilarity], result of:
            0.083382666 = score(doc=1969,freq=2.0), product of:
              0.16842388 = queryWeight, product of:
                1.9188524 = boost
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.015670577 = queryNorm
              0.49507627 = fieldWeight in 1969, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.044399846 = weight(abstract_txt:relevant in 1969) [ClassicSimilarity], result of:
            0.044399846 = score(doc=1969,freq=1.0), product of:
              0.1534371 = queryWeight, product of:
                2.1148243 = boost
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.015670577 = queryNorm
              0.2893684 = fieldWeight in 1969, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.0665298 = weight(abstract_txt:documents in 1969) [ClassicSimilarity], result of:
            0.0665298 = score(doc=1969,freq=2.0), product of:
              0.1825467 = queryWeight, product of:
                2.8251517 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015670577 = queryNorm
              0.3644536 = fieldWeight in 1969, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
          0.33256269 = weight(abstract_txt:categorization in 1969) [ClassicSimilarity], result of:
            0.33256269 = score(doc=1969,freq=3.0), product of:
              0.4662102 = queryWeight, product of:
                4.514874 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.015670577 = queryNorm
              0.7133321 = fieldWeight in 1969, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=1969)
        0.32 = coord(8/25)
    
  5. Collins-Thompson, K.; Callan, J.: Predicting reading difficulty with statistical language models (2005) 0.17
    0.17048237 = sum of:
      0.17048237 = product of:
        0.5327574 = sum of:
          0.013231559 = weight(abstract_txt:that in 5579) [ClassicSimilarity], result of:
            0.013231559 = score(doc=5579,freq=5.0), product of:
              0.04003379 = queryWeight, product of:
                1.0802455 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.015670577 = queryNorm
              0.33050975 = fieldWeight in 5579, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.026739197 = weight(abstract_txt:identify in 5579) [ClassicSimilarity], result of:
            0.026739197 = score(doc=5579,freq=1.0), product of:
              0.08684903 = queryWeight, product of:
                1.1250623 = boost
                4.9261017 = idf(docFreq=875, maxDocs=44421)
                0.015670577 = queryNorm
              0.30788136 = fieldWeight in 5579, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9261017 = idf(docFreq=875, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.009839883 = weight(abstract_txt:with in 5579) [ClassicSimilarity], result of:
            0.009839883 = score(doc=5579,freq=2.0), product of:
              0.044599093 = queryWeight, product of:
                1.1401767 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015670577 = queryNorm
              0.22062966 = fieldWeight in 5579, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.026568137 = weight(abstract_txt:document in 5579) [ClassicSimilarity], result of:
            0.026568137 = score(doc=5579,freq=1.0), product of:
              0.09899286 = queryWeight, product of:
                1.4710983 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.015670577 = queryNorm
              0.26838437 = fieldWeight in 5579, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.05896045 = weight(abstract_txt:automated in 5579) [ClassicSimilarity], result of:
            0.05896045 = score(doc=5579,freq=1.0), product of:
              0.16842388 = queryWeight, product of:
                1.9188524 = boost
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.015670577 = queryNorm
              0.3500718 = fieldWeight in 5579, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.044399846 = weight(abstract_txt:relevant in 5579) [ClassicSimilarity], result of:
            0.044399846 = score(doc=5579,freq=1.0), product of:
              0.1534371 = queryWeight, product of:
                2.1148243 = boost
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.015670577 = queryNorm
              0.2893684 = fieldWeight in 5579, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6298943 = idf(docFreq=1177, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.08148204 = weight(abstract_txt:documents in 5579) [ClassicSimilarity], result of:
            0.08148204 = score(doc=5579,freq=3.0), product of:
              0.1825467 = queryWeight, product of:
                2.8251517 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015670577 = queryNorm
              0.4463627 = fieldWeight in 5579, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
          0.2715363 = weight(abstract_txt:categorization in 5579) [ClassicSimilarity], result of:
            0.2715363 = score(doc=5579,freq=2.0), product of:
              0.4662102 = queryWeight, product of:
                4.514874 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.015670577 = queryNorm
              0.5824332 = fieldWeight in 5579, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=5579)
        0.32 = coord(8/25)
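
The score trees above follow the classic Lucene TF-IDF pattern: each term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor. The following Python sketch recomputes the score of the first similar document (doc=4267) from the values printed in its tree. It assumes that tf is the square root of the term frequency and that idf is 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the numbers listed; the function and variable names are illustrative and are not part of the record or of any library API. The boost, queryNorm and fieldNorm values are copied from the listing rather than derived.

```python
import math

# Values copied from the explain output above (not derived here).
MAX_DOCS = 44421
QUERY_NORM = 0.015670577


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assuming idf = 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm, query_norm=QUERY_NORM):
    """Contribution of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


# (term, termFreq, docFreq, boost) for doc 4267, as listed under result 1.
TERMS_4267 = [
    ("that", 1.0, 11344, 1.0802455),
    ("with", 2.0, 9949, 1.1401767),
    ("time", 1.0, 1905, 1.4212717),
    ("reduce", 1.0, 224, 1.4355022),
    ("automated", 1.0, 445, 1.9188524),
    ("documents", 2.0, 1954, 2.8251517),
    ("categorization", 5.0, 165, 4.514874),
]
FIELD_NORM_4267 = 0.078125

# Sum of the per-term contributions, then scaled by coord(7/25):
# 7 of the 25 query terms matched this document.
raw_sum = sum(
    term_score(freq, doc_freq, boost, FIELD_NORM_4267)
    for _, freq, doc_freq, boost in TERMS_4267
)
coord = 7 / 25
score = raw_sum * coord

print(f"sum of term scores: {raw_sum:.6f}")
print(f"final score:        {score:.6f}")
```

Under these assumptions the result should agree with the 0.22753014 reported for result 1 up to small rounding differences, since the listing was computed in single precision. The same function can be applied to the other results by substituting their term lists, fieldNorm and coord values.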