Document (#29994)

Author
Adiego, J.
Navarro, G.
Fuente, P. de la
Title
Lempel-Ziv compression of highly structured documents
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.4, pp. 461-478
Year
2007
Abstract
The authors describe Lempel-Ziv to Compress Structure (LZCS), a novel Lempel-Ziv approach suitable for compressing structured documents. LZCS takes advantage of repeated substructures that may appear in the documents by replacing them with a backward reference to their previous occurrence. The result of the LZCS transformation is still a valid structured document, which is human-readable and can be transmitted over ASCII channels. Moreover, LZCS-transformed documents are easy to search, display, access at random, and navigate. In a second stage, the transformed documents can be further compressed using any semistatic technique, so that it is still possible to do all those operations efficiently, or with any adaptive technique to boost compression. LZCS is especially efficient in the compression of collections of highly structured data, such as extensible markup language (XML) forms, invoices, e-commerce, and Web-service exchange documents. The comparison with other structure-aware and standard compressors shows that LZCS is a competitive choice for this type of document, whereas the others are not well suited to support navigation or random access. When joined to an adaptive compressor, LZCS obtains by far the best compression ratios.
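
The core idea in the abstract (replace a repeated substructure with a backward reference, and keep the result a valid structured document) can be illustrated with a toy sketch. This is not the authors' LZCS algorithm or reference format, neither of which is given in this record. The `ref` tag, its `to` attribute, the visit-order numbering, and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def subtree_key(node):
    """Serialize a subtree without its trailing text, for equality checks."""
    tail, node.tail = node.tail, None
    try:
        return ET.tostring(node, encoding="unicode")
    finally:
        node.tail = tail


def replace_repeats(root):
    """Replace each repeated subtree with <ref to="n"/> (hypothetical tag)."""
    first_seen = {}   # subtree text -> visit number of its first occurrence
    counter = 0       # counts visited nodes below the root, in document order

    def visit(node):
        nonlocal counter
        for i, child in enumerate(list(node)):
            counter += 1
            key = subtree_key(child)
            if key in first_seen:
                # Seen before: swap in a backward reference, keep trailing text.
                ref = ET.Element("ref", {"to": str(first_seen[key])})
                ref.tail = child.tail
                node[i] = ref
            else:
                first_seen[key] = counter
                visit(child)

    visit(root)
    return root


doc = ("<orders>"
       "<item><sku>A1</sku><qty>2</qty></item>"
       "<item><sku>A1</sku><qty>2</qty></item>"
       "</orders>")
root = replace_repeats(ET.fromstring(doc))
print(ET.tostring(root, encoding="unicode"))
```

In this example the second `item` element becomes a `ref` element pointing at the first, and the output still parses as XML. The sketch replaces every repeat, including very short ones; a practical scheme would presumably replace a subtree only when the reference is shorter than the text it stands for.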

Similar documents (author)

  1. Navarro, M.A.E. -> Esteban Navarro, M.A.: 5.42
    5.418135 = sum of:
      5.418135 = weight(author_txt:navarro in 2821) [ClassicSimilarity], result of:
        5.418135 = fieldWeight in 2821, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.4375 = fieldNorm(doc=2821)
    
  2. Molina, C. Navarro- -> Navarro-Molina, C.: 4.64
    4.644116 = sum of:
      4.644116 = weight(author_txt:navarro in 945) [ClassicSimilarity], result of:
        4.644116 = fieldWeight in 945, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.375 = fieldNorm(doc=945)
    
  3. Navarro, M.A. Esteban -> Esteban Navarro, M.A.: 4.64
    4.644116 = sum of:
      4.644116 = weight(author_txt:navarro in 2551) [ClassicSimilarity], result of:
        4.644116 = fieldWeight in 2551, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.375 = fieldNorm(doc=2551)
    
  4. Esteban Navarro, M.A.: Aplicaciones de la terminologia para la docencia de la gestion de lenguajes documentales (1995) 4.38
    4.3785143 = sum of:
      4.3785143 = weight(author_txt:navarro in 5497) [ClassicSimilarity], result of:
        4.3785143 = fieldWeight in 5497, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.5 = fieldNorm(doc=5497)
    
  5. Esteban Navarro, M.A.: Fundamentos epistemologicos de la classificacion documental (1995) 4.38
    4.3785143 = sum of:
      4.3785143 = weight(author_txt:navarro in 5615) [ClassicSimilarity], result of:
        4.3785143 = fieldWeight in 5615, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.5 = fieldNorm(doc=5615)
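
The fieldWeight values in the explain trees above can be checked by hand. The formulas in this sketch are not stated in the record itself; they are the classic TF-IDF definitions (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))) that happen to reproduce the printed numbers. The function names are illustrative, not part of any library API.

```python
import math


def tf(freq):
    # Term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency, in the form that matches the printed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain output.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Entry 1 of the author list: termFreq=2.0, docFreq=18, maxDocs=44421, fieldNorm=0.4375
entry1 = field_weight(2.0, 18, 44421, 0.4375)
# Entry 4 of the author list: termFreq=1.0, same idf, fieldNorm=0.5
entry4 = field_weight(1.0, 18, 44421, 0.5)
print(round(entry1, 4), round(entry4, 4))
```

Small differences in the last digits are expected, since the catalog's values were evidently computed in single precision while Python uses double precision.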
    

Similar documents (content)

  1. Cannane, A.; Williams, H.E.: General-purpose compression for efficient retrieval (2001) 0.37
    0.37453246 = sum of:
      0.37453246 = product of:
        1.337616 = sum of:
          0.017620347 = weight(abstract_txt:access in 6705) [ClassicSimilarity], result of:
            0.017620347 = score(doc=6705,freq=1.0), product of:
              0.06177357 = queryWeight, product of:
                1.002683 = boost
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.01687397 = queryNorm
              0.2852409 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.008445924 = weight(abstract_txt:with in 6705) [ClassicSimilarity], result of:
            0.008445924 = score(doc=6705,freq=1.0), product of:
              0.04331001 = queryWeight, product of:
                1.028258 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.01687397 = queryNorm
              0.19501092 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.06338992 = weight(abstract_txt:technique in 6705) [ClassicSimilarity], result of:
            0.06338992 = score(doc=6705,freq=1.0), product of:
              0.14503512 = queryWeight, product of:
                1.5363809 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.01687397 = queryNorm
              0.437066 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.10085211 = weight(abstract_txt:random in 6705) [ClassicSimilarity], result of:
            0.10085211 = score(doc=6705,freq=1.0), product of:
              0.1976589 = queryWeight, product of:
                1.793579 = boost
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.01687397 = queryNorm
              0.5102331 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.12053026 = weight(abstract_txt:adaptive in 6705) [ClassicSimilarity], result of:
            0.12053026 = score(doc=6705,freq=1.0), product of:
              0.22259928 = queryWeight, product of:
                1.9033743 = boost
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.01687397 = queryNorm
              0.5414674 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.088829875 = weight(abstract_txt:documents in 6705) [ClassicSimilarity], result of:
            0.088829875 = score(doc=6705,freq=1.0), product of:
              0.27575397 = queryWeight, product of:
                3.9633026 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.01687397 = queryNorm
              0.32213452 = fieldWeight in 6705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
          0.9379476 = weight(abstract_txt:compression in 6705) [ClassicSimilarity], result of:
            0.9379476 = score(doc=6705,freq=9.0), product of:
              0.52946985 = queryWeight, product of:
                4.151432 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.01687397 = queryNorm
              1.7714844 = fieldWeight in 6705, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=6705)
        0.28 = coord(7/25)
    
  2. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.13
    0.12976182 = sum of:
      0.12976182 = product of:
        0.6488091 = sum of:
          0.017620347 = weight(abstract_txt:access in 2716) [ClassicSimilarity], result of:
            0.017620347 = score(doc=2716,freq=1.0), product of:
              0.06177357 = queryWeight, product of:
                1.002683 = boost
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.01687397 = queryNorm
              0.2852409 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.1288576 = weight(abstract_txt:compressed in 2716) [ClassicSimilarity], result of:
            0.1288576 = score(doc=2716,freq=1.0), product of:
              0.1847239 = queryWeight, product of:
                1.2260519 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.01687397 = queryNorm
              0.69756866 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.10085211 = weight(abstract_txt:random in 2716) [ClassicSimilarity], result of:
            0.10085211 = score(doc=2716,freq=1.0), product of:
              0.1976589 = queryWeight, product of:
                1.793579 = boost
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.01687397 = queryNorm
              0.5102331 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.088829875 = weight(abstract_txt:documents in 2716) [ClassicSimilarity], result of:
            0.088829875 = score(doc=2716,freq=1.0), product of:
              0.27575397 = queryWeight, product of:
                3.9633026 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.01687397 = queryNorm
              0.32213452 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.31264916 = weight(abstract_txt:compression in 2716) [ClassicSimilarity], result of:
            0.31264916 = score(doc=2716,freq=1.0), product of:
              0.52946985 = queryWeight, product of:
                4.151432 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.01687397 = queryNorm
              0.59049475 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
        0.2 = coord(5/25)
    
  3. Gillman, P.: Data handling and text compression (1992) 0.12
    0.124170475 = sum of:
      0.124170475 = product of:
        0.62085235 = sum of:
          0.0067567397 = weight(abstract_txt:with in 5305) [ClassicSimilarity], result of:
            0.0067567397 = score(doc=5305,freq=1.0), product of:
              0.04331001 = queryWeight, product of:
                1.028258 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.01687397 = queryNorm
              0.15600874 = fieldWeight in 5305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=5305)
          0.075934954 = weight(abstract_txt:ascii in 5305) [ClassicSimilarity], result of:
            0.075934954 = score(doc=5305,freq=1.0), product of:
              0.1506668 = queryWeight, product of:
                1.1072766 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.01687397 = queryNorm
              0.5039926 = fieldWeight in 5305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.0625 = fieldNorm(doc=5305)
          0.11337464 = weight(abstract_txt:compressing in 5305) [ClassicSimilarity], result of:
            0.11337464 = score(doc=5305,freq=1.0), product of:
              0.19681899 = queryWeight, product of:
                1.2655544 = boost
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.01687397 = queryNorm
              0.5760351 = fieldWeight in 5305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.0625 = fieldNorm(doc=5305)
          0.0710639 = weight(abstract_txt:documents in 5305) [ClassicSimilarity], result of:
            0.0710639 = score(doc=5305,freq=1.0), product of:
              0.27575397 = queryWeight, product of:
                3.9633026 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.01687397 = queryNorm
              0.25770763 = fieldWeight in 5305, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=5305)
          0.35372216 = weight(abstract_txt:compression in 5305) [ClassicSimilarity], result of:
            0.35372216 = score(doc=5305,freq=2.0), product of:
              0.52946985 = queryWeight, product of:
                4.151432 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.01687397 = queryNorm
              0.6680685 = fieldWeight in 5305, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.0625 = fieldNorm(doc=5305)
        0.2 = coord(5/25)
    
  4. Nomoto, T.: Discriminative sentence compression with conditional random fields (2007) 0.11
    0.109485805 = sum of:
      0.109485805 = product of:
        0.6842863 = sum of:
          0.0119443415 = weight(abstract_txt:with in 1945) [ClassicSimilarity], result of:
            0.0119443415 = score(doc=1945,freq=2.0), product of:
              0.04331001 = queryWeight, product of:
                1.028258 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.01687397 = queryNorm
              0.2757871 = fieldWeight in 1945, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.078125 = fieldNorm(doc=1945)
          0.02996562 = weight(abstract_txt:structure in 1945) [ClassicSimilarity], result of:
            0.02996562 = score(doc=1945,freq=1.0), product of:
              0.088012 = queryWeight, product of:
                1.1968322 = boost
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.01687397 = queryNorm
              0.34047198 = fieldWeight in 1945, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.078125 = fieldNorm(doc=1945)
          0.10085211 = weight(abstract_txt:random in 1945) [ClassicSimilarity], result of:
            0.10085211 = score(doc=1945,freq=1.0), product of:
              0.1976589 = queryWeight, product of:
                1.793579 = boost
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.01687397 = queryNorm
              0.5102331 = fieldWeight in 1945, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5309834 = idf(docFreq=175, maxDocs=44421)
                0.078125 = fieldNorm(doc=1945)
          0.54152423 = weight(abstract_txt:compression in 1945) [ClassicSimilarity], result of:
            0.54152423 = score(doc=1945,freq=3.0), product of:
              0.52946985 = queryWeight, product of:
                4.151432 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.01687397 = queryNorm
              1.022767 = fieldWeight in 1945, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=1945)
        0.16 = coord(4/25)
    
  5. Lalmas, M.: XML information retrieval (2009) 0.11
    0.106994964 = sum of:
      0.106994964 = product of:
        0.5349748 = sum of:
          0.021144416 = weight(abstract_txt:access in 867) [ClassicSimilarity], result of:
            0.021144416 = score(doc=867,freq=1.0), product of:
              0.06177357 = queryWeight, product of:
                1.002683 = boost
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.01687397 = queryNorm
              0.34228906 = fieldWeight in 867, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6510832 = idf(docFreq=3134, maxDocs=44421)
                0.09375 = fieldNorm(doc=867)
          0.08561925 = weight(abstract_txt:extensible in 867) [ClassicSimilarity], result of:
            0.08561925 = score(doc=867,freq=1.0), product of:
              0.124559395 = queryWeight, product of:
                1.0067823 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.01687397 = queryNorm
              0.68737686 = fieldWeight in 867, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.09375 = fieldNorm(doc=867)
          0.05085334 = weight(abstract_txt:structure in 867) [ClassicSimilarity], result of:
            0.05085334 = score(doc=867,freq=2.0), product of:
              0.088012 = queryWeight, product of:
                1.1968322 = boost
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.01687397 = queryNorm
              0.5778001 = fieldWeight in 867, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.09375 = fieldNorm(doc=867)
          0.13900223 = weight(abstract_txt:structured in 867) [ClassicSimilarity], result of:
            0.13900223 = score(doc=867,freq=1.0), product of:
              0.2731262 = queryWeight, product of:
                2.981666 = boost
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.01687397 = queryNorm
              0.5089304 = fieldWeight in 867, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.09375 = fieldNorm(doc=867)
          0.23835559 = weight(abstract_txt:documents in 867) [ClassicSimilarity], result of:
            0.23835559 = score(doc=867,freq=5.0), product of:
              0.27575397 = queryWeight, product of:
                3.9633026 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.01687397 = queryNorm
              0.86437774 = fieldWeight in 867, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=867)
        0.2 = coord(5/25)
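
The content-similarity scores combine more factors than the author scores: each matching term contributes queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), the contributions are summed, and the sum is scaled by the coord factor (matched terms / query terms). As a rough check, the sketch below recomputes the score of entry 2 (Moffat and Bell) from the boosts, frequencies, and norms printed in its explain tree. The tf and idf formulas are the classic TF-IDF definitions that reproduce the printed values, and the variable and function names are illustrative only.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.01687397
FIELD_NORM = 0.078125      # fieldNorm(doc=2716)
QUERY_TERMS = 25           # denominator of coord(5/25)

# (term, boost, termFreq, docFreq) as printed for doc 2716
terms = [
    ("access",      1.002683,  1.0, 3134),
    ("compressed",  1.2260519, 1.0, 15),
    ("random",      1.793579,  1.0, 175),
    ("documents",   3.9633026, 1.0, 1954),
    ("compression", 4.151432,  1.0, 62),
]


def idf(doc_freq, max_docs=MAX_DOCS):
    # Inverse document frequency in the form that matches the printed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, freq, doc_freq):
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    return query_weight * field_weight


total = sum(term_score(boost, freq, df) for _, boost, freq, df in terms)
coord = len(terms) / QUERY_TERMS
score = total * coord
print(round(total, 4), coord, round(score, 4))
```

Because idf enters both queryWeight and fieldWeight, a rare term such as "compression" (docFreq=62) dominates the sum even at a term frequency of 1.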