Document (#25304)

Author
Lam, W.
Wong, K.-F.
Wong, C.-Y.
Title
Chinese document indexing based on new partitioned signature file : model and evaluation
Source
Journal of the American Society for Information Science and Technology. 52(2001) no.7, pp. 584-597
Year
2001
Abstract
In this article we investigate the use of signature files in Chinese information retrieval systems and propose a new partitioning method for Chinese signature files based on the characteristics of Chinese words. Our partitioning method, called Partitioned Signature File for Chinese (PSFC), offers faster search than the traditional single signature file approach. We devise a general scheme for controlling the trade-off between false drops and storage overhead while maintaining the search space reduction in PSFC. An analytical study is presented to support the claims of our method. We also propose two new hashing methods for Chinese signature files so that the signature file is better suited to dynamic environments while retrieval performance is maintained. Furthermore, we have implemented PSFC and the new hashing methods and evaluated them using a large-scale real-world Chinese document corpus, namely the TREC-5 (Text REtrieval Conference) Chinese collection. The experimental results confirm the features of PSFC and demonstrate its superiority over the traditional single signature file method.
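
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a generic superimposed-coding signature, the building block of a single signature file. It is not the PSFC method or either of the hashing methods from the article; the signature width, the number of bits per term, the use of MD5, and all function names are illustrative assumptions.

```python
import hashlib

SIG_BITS = 64      # signature width in bits (illustrative assumption)
BITS_PER_TERM = 3  # bits set per term (illustrative assumption)


def term_signature(term: str) -> int:
    """Hash a term to a bit pattern with at most BITS_PER_TERM bits set."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.md5(f"{i}:{term}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << position
    return sig


def document_signature(terms) -> int:
    """Superimpose (bitwise OR) the signatures of all terms in a document."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


def may_contain(doc_sig: int, query_term: str) -> bool:
    """True if every bit of the query term is set in the document signature.

    A True result can be a false drop (the bits were set by other terms);
    a False result is definitive.
    """
    query_sig = term_signature(query_term)
    return (doc_sig & query_sig) == query_sig
```

A search scans the document signatures with `may_contain` and then verifies the surviving candidates against the full text. The false drops mentioned in the abstract are the candidates that pass the bit test without actually containing the term.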

Similar documents (author)

  1. Wong, S.K.M.: On modelling information retrieval with probabilistic inference (1995) 5.13
    5.1281 = sum of:
      5.1281 = weight(author_txt:wong in 2006) [ClassicSimilarity], result of:
        5.1281 = fieldWeight in 2006, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.20496 = idf(docFreq=32, maxDocs=44421)
          0.625 = fieldNorm(doc=2006)
    
  2. Wong, K.: Frühe Spuren des menschlichen Geistes (2005) 5.13
    5.1281 = sum of:
      5.1281 = weight(author_txt:wong in 1983) [ClassicSimilarity], result of:
        5.1281 = fieldWeight in 1983, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.20496 = idf(docFreq=32, maxDocs=44421)
          0.625 = fieldNorm(doc=1983)
    
  3. Salton, G.; Wong, A.: Generation and search of clustered files (1978) 4.10
    4.10248 = sum of:
      4.10248 = weight(author_txt:wong in 2410) [ClassicSimilarity], result of:
        4.10248 = fieldWeight in 2410, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.20496 = idf(docFreq=32, maxDocs=44421)
          0.5 = fieldNorm(doc=2410)
    
  4. Wong, S.K.M.; Yao, Y.Y.: An information-theoretic measure of term specificity (1992) 4.10
    4.10248 = sum of:
      4.10248 = weight(author_txt:wong in 4806) [ClassicSimilarity], result of:
        4.10248 = fieldWeight in 4806, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.20496 = idf(docFreq=32, maxDocs=44421)
          0.5 = fieldNorm(doc=4806)
    
  5. Wong, W.Y.P.; Lee, D.L.: Implementation of partial document ranking using inverted files (1993) 4.10
    4.10248 = sum of:
      4.10248 = weight(author_txt:wong in 6538) [ClassicSimilarity], result of:
        4.10248 = fieldWeight in 6538, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.20496 = idf(docFreq=32, maxDocs=44421)
          0.5 = fieldNorm(doc=6538)
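
The author-match scores above can be reproduced by hand. The sketch below assumes the classic TF-IDF formulas that match the numbers in these score trees: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), and the field weight is their product with the stored field norm. The function names are illustrative and are not Lucene API calls.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, in the form that reproduces the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the trees above."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Entries 1 and 2 use fieldNorm 0.625; entries 3 to 5 use fieldNorm 0.5.
weight_single_author = field_weight(1.0, 32, 44421, 0.625)
weight_two_authors = field_weight(1.0, 32, 44421, 0.5)
```

The only factor that differs between the five entries is the field norm, which is smaller when the author field holds more names. That is why the two-author records score 4.10 while the single-author records score 5.13.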
    

Similar documents (content)

  1. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.64
    0.63947797 = sum of:
      0.63947797 = product of:
        1.9983687 = sum of:
          0.011606079 = weight(abstract_txt:search in 3417) [ClassicSimilarity], result of:
            0.011606079 = score(doc=3417,freq=1.0), product of:
              0.033874635 = queryWeight, product of:
                1.0133603 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.009146841 = queryNorm
              0.34261855 = fieldWeight in 3417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.07401383 = weight(abstract_txt:false in 3417) [ClassicSimilarity], result of:
            0.07401383 = score(doc=3417,freq=2.0), product of:
              0.073384814 = queryWeight, product of:
                1.0546653 = boost
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.009146841 = queryNorm
              1.0085715 = fieldWeight in 3417, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.037655484 = weight(abstract_txt:document in 3417) [ClassicSimilarity], result of:
            0.037655484 = score(doc=3417,freq=4.0), product of:
              0.046768103 = queryWeight, product of:
                1.1906974 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.009146841 = queryNorm
              0.80515313 = fieldWeight in 3417, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.014985964 = weight(abstract_txt:retrieval in 3417) [ClassicSimilarity], result of:
            0.014985964 = score(doc=3417,freq=1.0), product of:
              0.04598023 = queryWeight, product of:
                1.4459649 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.009146841 = queryNorm
              0.3259219 = fieldWeight in 3417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.044618987 = weight(abstract_txt:files in 3417) [ClassicSimilarity], result of:
            0.044618987 = score(doc=3417,freq=1.0), product of:
              0.08313121 = queryWeight, product of:
                1.5874811 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.009146841 = queryNorm
              0.5367297 = fieldWeight in 3417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.2502723 = weight(abstract_txt:partitioned in 3417) [ClassicSimilarity], result of:
            0.2502723 = score(doc=3417,freq=2.0), product of:
              0.20829692 = queryWeight, product of:
                2.5128582 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.009146841 = queryNorm
              1.2015171 = fieldWeight in 3417, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.24831744 = weight(abstract_txt:file in 3417) [ClassicSimilarity], result of:
            0.24831744 = score(doc=3417,freq=4.0), product of:
              0.23719718 = queryWeight, product of:
                4.6445317 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.009146841 = queryNorm
              1.0468819 = fieldWeight in 3417, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          1.3168987 = weight(abstract_txt:signature in 3417) [ClassicSimilarity], result of:
            1.3168987 = score(doc=3417,freq=5.0), product of:
              0.7370255 = queryWeight, product of:
                9.453612 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.009146841 = queryNorm
              1.786775 = fieldWeight in 3417, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
        0.32 = coord(8/25)
    
  2. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.38
    0.38164413 = sum of:
      0.38164413 = product of:
        1.5901839 = sum of:
          0.013677895 = weight(abstract_txt:search in 42) [ClassicSimilarity], result of:
            0.013677895 = score(doc=42,freq=2.0), product of:
              0.033874635 = queryWeight, product of:
                1.0133603 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.009146841 = queryNorm
              0.40377986 = fieldWeight in 42, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.014095448 = weight(abstract_txt:methods in 42) [ClassicSimilarity], result of:
            0.014095448 = score(doc=42,freq=1.0), product of:
              0.043543603 = queryWeight, product of:
                1.1489172 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.009146841 = queryNorm
              0.3237088 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.07436498 = weight(abstract_txt:files in 42) [ClassicSimilarity], result of:
            0.07436498 = score(doc=42,freq=4.0), product of:
              0.08313121 = queryWeight, product of:
                1.5874811 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.009146841 = queryNorm
              0.8945495 = fieldWeight in 42, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.2443075 = weight(abstract_txt:partitioning in 42) [ClassicSimilarity], result of:
            0.2443075 = score(doc=42,freq=3.0), product of:
              0.20220377 = queryWeight, product of:
                2.475832 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.009146841 = queryNorm
              1.2082243 = fieldWeight in 42, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.14632244 = weight(abstract_txt:file in 42) [ClassicSimilarity], result of:
            0.14632244 = score(doc=42,freq=2.0), product of:
              0.23719718 = queryWeight, product of:
                4.6445317 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.009146841 = queryNorm
              0.6168811 = fieldWeight in 42, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          1.0974156 = weight(abstract_txt:signature in 42) [ClassicSimilarity], result of:
            1.0974156 = score(doc=42,freq=5.0), product of:
              0.7370255 = queryWeight, product of:
                9.453612 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.009146841 = queryNorm
              1.4889791 = fieldWeight in 42, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
        0.24 = coord(6/25)
    
  3. Carterette, B.; Can, F.: Comparing inverted files and signature files for searching a large lexicon (2005) 0.31
    0.31237465 = sum of:
      0.31237465 = product of:
        1.5618732 = sum of:
          0.037351854 = weight(abstract_txt:faster in 2029) [ClassicSimilarity], result of:
            0.037351854 = score(doc=2029,freq=1.0), product of:
              0.066181496 = queryWeight, product of:
                1.0015666 = boost
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.009146841 = queryNorm
              0.5643852 = fieldWeight in 2029, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.078125 = fieldNorm(doc=2029)
          0.06440196 = weight(abstract_txt:files in 2029) [ClassicSimilarity], result of:
            0.06440196 = score(doc=2029,freq=3.0), product of:
              0.08313121 = queryWeight, product of:
                1.5874811 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.009146841 = queryNorm
              0.77470255 = fieldWeight in 2029, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.078125 = fieldNorm(doc=2029)
          0.051029626 = weight(abstract_txt:method in 2029) [ClassicSimilarity], result of:
            0.051029626 = score(doc=2029,freq=2.0), product of:
              0.10266443 = queryWeight, product of:
                2.494891 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.009146841 = queryNorm
              0.4970526 = fieldWeight in 2029, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.078125 = fieldNorm(doc=2029)
          0.20693119 = weight(abstract_txt:file in 2029) [ClassicSimilarity], result of:
            0.20693119 = score(doc=2029,freq=4.0), product of:
              0.23719718 = queryWeight, product of:
                4.6445317 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.009146841 = queryNorm
              0.8724016 = fieldWeight in 2029, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.078125 = fieldNorm(doc=2029)
          1.2021586 = weight(abstract_txt:signature in 2029) [ClassicSimilarity], result of:
            1.2021586 = score(doc=2029,freq=6.0), product of:
              0.7370255 = queryWeight, product of:
                9.453612 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.009146841 = queryNorm
              1.6310949 = fieldWeight in 2029, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=2029)
        0.2 = coord(5/25)
    
  4. Faloutsos, C.: Signature files (1992) 0.17
    0.17197043 = sum of:
      0.17197043 = product of:
        1.4330869 = sum of:
          0.034179643 = weight(abstract_txt:methods in 4499) [ClassicSimilarity], result of:
            0.034179643 = score(doc=4499,freq=3.0), product of:
              0.043543603 = queryWeight, product of:
                1.1489172 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.009146841 = queryNorm
              0.7849521 = fieldWeight in 4499, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.109375 = fieldNorm(doc=4499)
          0.024725579 = weight(abstract_txt:retrieval in 4499) [ClassicSimilarity], result of:
            0.024725579 = score(doc=4499,freq=2.0), product of:
              0.04598023 = queryWeight, product of:
                1.4459649 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.009146841 = queryNorm
              0.5377437 = fieldWeight in 4499, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.109375 = fieldNorm(doc=4499)
          1.3741816 = weight(abstract_txt:signature in 4499) [ClassicSimilarity], result of:
            1.3741816 = score(doc=4499,freq=4.0), product of:
              0.7370255 = queryWeight, product of:
                9.453612 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.009146841 = queryNorm
              1.8644967 = fieldWeight in 4499, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.109375 = fieldNorm(doc=4499)
        0.12 = coord(3/25)
    
  5. Almerri, J.; McGregor, D.R.: Codon signatures : a document retrieval method (1996) 0.17
    0.16541965 = sum of:
      0.16541965 = product of:
        0.82709825 = sum of:
          0.015689785 = weight(abstract_txt:document in 39) [ClassicSimilarity], result of:
            0.015689785 = score(doc=39,freq=1.0), product of:
              0.046768103 = queryWeight, product of:
                1.1906974 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.009146841 = queryNorm
              0.33548045 = fieldWeight in 39, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=39)
          0.017661126 = weight(abstract_txt:retrieval in 39) [ClassicSimilarity], result of:
            0.017661126 = score(doc=39,freq=2.0), product of:
              0.04598023 = queryWeight, product of:
                1.4459649 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.009146841 = queryNorm
              0.3841026 = fieldWeight in 39, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.078125 = fieldNorm(doc=39)
          0.03718249 = weight(abstract_txt:files in 39) [ClassicSimilarity], result of:
            0.03718249 = score(doc=39,freq=1.0), product of:
              0.08313121 = queryWeight, product of:
                1.5874811 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.009146841 = queryNorm
              0.44727474 = fieldWeight in 39, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.078125 = fieldNorm(doc=39)
          0.06249827 = weight(abstract_txt:method in 39) [ClassicSimilarity], result of:
            0.06249827 = score(doc=39,freq=3.0), product of:
              0.10266443 = queryWeight, product of:
                2.494891 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.009146841 = queryNorm
              0.6087626 = fieldWeight in 39, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.078125 = fieldNorm(doc=39)
          0.6940666 = weight(abstract_txt:signature in 39) [ClassicSimilarity], result of:
            0.6940666 = score(doc=39,freq=2.0), product of:
              0.7370255 = queryWeight, product of:
                9.453612 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.009146841 = queryNorm
              0.9417131 = fieldWeight in 39, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=39)
        0.2 = coord(5/25)
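
The content-match trees add two factors to the field weight: a query weight (boost * idf * queryNorm) per term and a coordination factor (matched query terms / total query terms) applied to the sum. The sketch below recomputes values from the first entry (doc 3417) using the numbers printed in its tree. The helper names are illustrative, not Lucene API calls, and the boost and queryNorm values are simply copied from the listing.

```python
import math

QUERY_NORM = 0.009146841  # queryNorm as printed in the trees above


def term_score(boost: float, idf: float, freq: float, field_norm: float,
               query_norm: float = QUERY_NORM) -> float:
    """Per-term score = queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def coord(matched_terms: int, total_terms: int) -> float:
    """Fraction of query terms that occur in the document."""
    return matched_terms / total_terms


# "search" and "signature" in doc 3417, values taken from the first tree.
search_score = term_score(1.0133603, 3.654598, 1.0, 0.09375)
signature_score = term_score(9.453612, 8.523414, 5.0, 0.09375)

# Final score for doc 3417: sum of the eight term scores times coord(8/25).
doc_3417_score = 1.9983687 * coord(8, 25)
```

Because idf enters both the query weight and the field weight, rare terms such as "signature" dominate the sum. The coordination factor then scales the total down for documents that match only a few of the 25 query terms.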