Document (#33045)

Author
Moffat, A.
Isal, R.Y.K.
Title
Word-based text compression using the Burrows-Wheeler transform
Source
Information processing and management. 41(2005) no.5, S.1175-1192
Year
2005
Abstract
Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (bwt); applying a move-to-front (mtf) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to mtf and uniform arithmetic coding. For large files of text, the combination of word-based modeling, bwt, and mtf-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.
Theme
Computerlinguistik

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 2716) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 2716, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=2716)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 1009) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 1009, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=1009)
    
  3. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 1081) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 1081, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=1081)
    
  4. Moffat, A.; Mackenzie, J.: How much freedom does an effectiveness metric really have? (2024) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 2289) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 2289, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=2289)
    
  5. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.61
    3.60826 = sum of:
      3.60826 = weight(author_txt:moffat in 4083) [ClassicSimilarity], result of:
        3.60826 = fieldWeight in 4083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.375 = fieldNorm(doc=4083)
    

Similar documents (content)

  1. Cheng, K.-S.; Young, G.H.; Wong, K.-F.: ¬A study on word-based and integral-bit Chinese text compression algorithms (1999) 0.33
    0.3309789 = sum of:
      0.3309789 = product of:
        1.3790787 = sum of:
          0.026446773 = weight(abstract_txt:text in 4056) [ClassicSimilarity], result of:
            0.026446773 = score(doc=4056,freq=1.0), product of:
              0.059838187 = queryWeight, product of:
                1.183921 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012507759 = queryNorm
              0.44197148 = fieldWeight in 4056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
          0.019389922 = weight(abstract_txt:based in 4056) [ClassicSimilarity], result of:
            0.019389922 = score(doc=4056,freq=1.0), product of:
              0.055694345 = queryWeight, product of:
                1.3988936 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.012507759 = queryNorm
              0.34814885 = fieldWeight in 4056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
          0.17509168 = weight(abstract_txt:coding in 4056) [ClassicSimilarity], result of:
            0.17509168 = score(doc=4056,freq=2.0), product of:
              0.16745457 = queryWeight, product of:
                1.9805326 = boost
                6.759825 = idf(docFreq=139, maxDocs=44421)
                0.012507759 = queryNorm
              1.0456071 = fieldWeight in 4056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.759825 = idf(docFreq=139, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
          0.13673654 = weight(abstract_txt:word in 4056) [ClassicSimilarity], result of:
            0.13673654 = score(doc=4056,freq=2.0), product of:
              0.16255741 = queryWeight, product of:
                2.3899155 = boost
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.012507759 = queryNorm
              0.84115845 = fieldWeight in 4056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
          0.4218813 = weight(abstract_txt:arithmetic in 4056) [ClassicSimilarity], result of:
            0.4218813 = score(doc=4056,freq=2.0), product of:
              0.30096328 = queryWeight, product of:
                2.6551573 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.012507759 = queryNorm
              1.40177 = fieldWeight in 4056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
          0.59953254 = weight(abstract_txt:compression in 4056) [ClassicSimilarity], result of:
            0.59953254 = score(doc=4056,freq=3.0), product of:
              0.4187049 = queryWeight, product of:
                4.428968 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.012507759 = queryNorm
              1.4318737 = fieldWeight in 4056, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.109375 = fieldNorm(doc=4056)
        0.24 = coord(6/25)
    
  2. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 0.17
    0.16787075 = sum of:
      0.16787075 = product of:
        0.8393537 = sum of:
          0.09221049 = weight(abstract_txt:blocks in 1081) [ClassicSimilarity], result of:
            0.09221049 = score(doc=1081,freq=2.0), product of:
              0.108471476 = queryWeight, product of:
                1.1271359 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.012507759 = queryNorm
              0.8500897 = fieldWeight in 1081, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=1081)
          0.018890552 = weight(abstract_txt:text in 1081) [ClassicSimilarity], result of:
            0.018890552 = score(doc=1081,freq=1.0), product of:
              0.059838187 = queryWeight, product of:
                1.183921 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012507759 = queryNorm
              0.3156939 = fieldWeight in 1081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=1081)
          0.072010994 = weight(abstract_txt:mechanism in 1081) [ClassicSimilarity], result of:
            0.072010994 = score(doc=1081,freq=1.0), product of:
              0.14602073 = queryWeight, product of:
                1.8494422 = boost
                6.312396 = idf(docFreq=218, maxDocs=44421)
                0.012507759 = queryNorm
              0.49315596 = fieldWeight in 1081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.312396 = idf(docFreq=218, maxDocs=44421)
                0.078125 = fieldNorm(doc=1081)
          0.24724305 = weight(abstract_txt:compression in 1081) [ClassicSimilarity], result of:
            0.24724305 = score(doc=1081,freq=1.0), product of:
              0.4187049 = queryWeight, product of:
                4.428968 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.012507759 = queryNorm
              0.59049475 = fieldWeight in 1081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=1081)
          0.40899858 = weight(abstract_txt:block in 1081) [ClassicSimilarity], result of:
            0.40899858 = score(doc=1081,freq=2.0), product of:
              0.46483254 = queryWeight, product of:
                4.666559 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.012507759 = queryNorm
              0.8798837 = fieldWeight in 1081, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=1081)
        0.2 = coord(5/25)
    
  3. Steinmetz, R.: Data compression in multimedia computing : principles and techniques (1994) 0.11
    0.110126905 = sum of:
      0.110126905 = product of:
        0.5506345 = sum of:
          0.081799716 = weight(abstract_txt:entropy in 8181) [ClassicSimilarity], result of:
            0.081799716 = score(doc=8181,freq=2.0), product of:
              0.116208136 = queryWeight, product of:
                1.1666398 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.012507759 = queryNorm
              0.70390695 = fieldWeight in 8181, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=8181)
          0.015112441 = weight(abstract_txt:text in 8181) [ClassicSimilarity], result of:
            0.015112441 = score(doc=8181,freq=1.0), product of:
              0.059838187 = queryWeight, product of:
                1.183921 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012507759 = queryNorm
              0.25255513 = fieldWeight in 8181, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=8181)
          0.011079956 = weight(abstract_txt:based in 8181) [ClassicSimilarity], result of:
            0.011079956 = score(doc=8181,freq=1.0), product of:
              0.055694345 = queryWeight, product of:
                1.3988936 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.012507759 = queryNorm
              0.1989422 = fieldWeight in 8181, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=8181)
          0.100052394 = weight(abstract_txt:coding in 8181) [ClassicSimilarity], result of:
            0.100052394 = score(doc=8181,freq=2.0), product of:
              0.16745457 = queryWeight, product of:
                1.9805326 = boost
                6.759825 = idf(docFreq=139, maxDocs=44421)
                0.012507759 = queryNorm
              0.5974898 = fieldWeight in 8181, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.759825 = idf(docFreq=139, maxDocs=44421)
                0.0625 = fieldNorm(doc=8181)
          0.34259 = weight(abstract_txt:compression in 8181) [ClassicSimilarity], result of:
            0.34259 = score(doc=8181,freq=3.0), product of:
              0.4187049 = queryWeight, product of:
                4.428968 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.012507759 = queryNorm
              0.8182135 = fieldWeight in 8181, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.0625 = fieldNorm(doc=8181)
        0.2 = coord(5/25)
    
  4. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.10
    0.103455 = sum of:
      0.103455 = product of:
        0.64659375 = sum of:
          0.11293432 = weight(abstract_txt:blocks in 3102) [ClassicSimilarity], result of:
            0.11293432 = score(doc=3102,freq=3.0), product of:
              0.108471476 = queryWeight, product of:
                1.1271359 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.012507759 = queryNorm
              1.0411431 = fieldWeight in 3102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
          0.018890552 = weight(abstract_txt:text in 3102) [ClassicSimilarity], result of:
            0.018890552 = score(doc=3102,freq=1.0), product of:
              0.059838187 = queryWeight, product of:
                1.183921 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012507759 = queryNorm
              0.3156939 = fieldWeight in 3102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
          0.013849944 = weight(abstract_txt:based in 3102) [ClassicSimilarity], result of:
            0.013849944 = score(doc=3102,freq=1.0), product of:
              0.055694345 = queryWeight, product of:
                1.3988936 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.012507759 = queryNorm
              0.24867775 = fieldWeight in 3102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
          0.5009189 = weight(abstract_txt:block in 3102) [ClassicSimilarity], result of:
            0.5009189 = score(doc=3102,freq=3.0), product of:
              0.46483254 = queryWeight, product of:
                4.666559 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.012507759 = queryNorm
              1.077633 = fieldWeight in 3102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
        0.16 = coord(4/25)
    
  5. Tsai, R.T.-H.; Chiu, B.; Wu, C.-E.: Visual webpage block importance prediction using conditional random fields (2011) 0.10
    0.097507 = sum of:
      0.097507 = product of:
        0.60941875 = sum of:
          0.07285451 = weight(abstract_txt:sequence in 924) [ClassicSimilarity], result of:
            0.07285451 = score(doc=924,freq=4.0), product of:
              0.085381344 = queryWeight, product of:
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.012507759 = queryNorm
              0.85328376 = fieldWeight in 924, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.11663807 = weight(abstract_txt:blocks in 924) [ClassicSimilarity], result of:
            0.11663807 = score(doc=924,freq=5.0), product of:
              0.108471476 = queryWeight, product of:
                1.1271359 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.012507759 = queryNorm
              1.0752879 = fieldWeight in 924, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.019191045 = weight(abstract_txt:based in 924) [ClassicSimilarity], result of:
            0.019191045 = score(doc=924,freq=3.0), product of:
              0.055694345 = queryWeight, product of:
                1.3988936 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.012507759 = queryNorm
              0.344578 = fieldWeight in 924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.4007351 = weight(abstract_txt:block in 924) [ClassicSimilarity], result of:
            0.4007351 = score(doc=924,freq=3.0), product of:
              0.46483254 = queryWeight, product of:
                4.666559 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.012507759 = queryNorm
              0.8621064 = fieldWeight in 924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
        0.16 = coord(4/25)