Document (#34103)

Author
Fersini, E.
Messina, E.
Archetti, F.
Title
Enhancing web page classification through image-block importance analysis
Source
Information processing and management. 44(2008) no.4, S.1431-1447
Year
2008
Abstract
We present a term weighting approach for improving web page classification, based on the assumption that the images of a web page are those elements which mainly attract the attention of the user. This assumption implies that the text contained in the visual block in which an image is located, called image-block, should contain significant information about the page contents. In this paper we propose a new metric, called the Inverse Term Importance Metric, aimed at assigning higher weights to important terms contained in important image-blocks identified by performing a visual layout analysis. We propose different methods to estimate the importance of visual image-blocks, in order to smooth the term weight according to the importance of the blocks in which the term is located. The traditional TFxIDF model is modified accordingly and used in the classification task. The effectiveness of this new metric and of the proposed block evaluation methods has been validated using different classification algorithms.
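
The general idea of letting block importance scale a term's weight can be illustrated with a minimal sketch. This is not the paper's Inverse Term Importance Metric, whose formula the abstract does not give; the function name `block_weighted_tfidf`, the importance scores in [0, 1], and the plain logarithmic IDF are all assumptions made for illustration.

```python
import math
from collections import Counter

def block_weighted_tfidf(blocks, doc_freq, num_docs):
    """Hypothetical block-aware TFxIDF (illustration only, not the paper's metric).

    blocks:   list of (tokens, importance) pairs for one page, where
              importance is an assumed score in [0, 1] for that visual block.
    doc_freq: dict mapping each term to the number of documents containing it.
    num_docs: total number of documents in the collection.
    """
    weighted_tf = Counter()
    for tokens, importance in blocks:
        # Each occurrence counts in proportion to the importance of its block.
        for term, count in Counter(tokens).items():
            weighted_tf[term] += count * importance
    # Plain logarithmic IDF, chosen here only for simplicity.
    return {t: tf * math.log(num_docs / doc_freq[t]) for t, tf in weighted_tf.items()}

# Toy page: one important image-block and one low-importance navigation block.
page = [(["cat", "photo"], 1.0), (["cat", "menu"], 0.2)]
weights = block_weighted_tfidf(page, {"cat": 10, "photo": 10, "menu": 10}, 100)
```

With equal document frequencies, "photo" (found only in the important block) ends up weighted higher than "menu" (found only in the low-importance block), which is the qualitative effect the abstract describes.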

Similar documents (content)

  1. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.44
    0.44132185 = sum of:
      0.44132185 = product of:
        1.3791308 = sum of:
          0.088734046 = weight(abstract_txt:weight in 119) [ClassicSimilarity], result of:
            0.088734046 = score(doc=119,freq=2.0), product of:
              0.10856904 = queryWeight, product of:
                1.0123695 = boost
                7.3974023 = idf(docFreq=73, maxDocs=44421)
                0.014497319 = queryNorm
              0.8173052 = fieldWeight in 119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3974023 = idf(docFreq=73, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.07060197 = weight(abstract_txt:inverse in 119) [ClassicSimilarity], result of:
            0.07060197 = score(doc=119,freq=1.0), product of:
              0.117453784 = queryWeight, product of:
                1.0529786 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              0.60110426 = fieldWeight in 119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.02205266 = weight(abstract_txt:methods in 119) [ClassicSimilarity], result of:
            0.02205266 = score(doc=119,freq=1.0), product of:
              0.068124995 = queryWeight, product of:
                1.1341077 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014497319 = queryNorm
              0.3237088 = fieldWeight in 119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.042143088 = weight(abstract_txt:propose in 119) [ClassicSimilarity], result of:
            0.042143088 = score(doc=119,freq=1.0), product of:
              0.10490996 = queryWeight, product of:
                1.4073737 = boost
                5.1418524 = idf(docFreq=705, maxDocs=44421)
                0.014497319 = queryNorm
              0.40170723 = fieldWeight in 119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1418524 = idf(docFreq=705, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.09664967 = weight(abstract_txt:term in 119) [ClassicSimilarity], result of:
            0.09664967 = score(doc=119,freq=2.0), product of:
              0.18244532 = queryWeight, product of:
                2.6247165 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.014497319 = queryNorm
              0.52974594 = fieldWeight in 119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.29953876 = weight(abstract_txt:blocks in 119) [ClassicSimilarity], result of:
            0.29953876 = score(doc=119,freq=2.0), product of:
              0.35236135 = queryWeight, product of:
                3.158936 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              0.8500897 = fieldWeight in 119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.1331022 = weight(abstract_txt:page in 119) [ClassicSimilarity], result of:
            0.1331022 = score(doc=119,freq=1.0), product of:
              0.28453296 = queryWeight, product of:
                3.2777994 = boost
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.014497319 = queryNorm
              0.4677918 = fieldWeight in 119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.6263084 = weight(abstract_txt:block in 119) [ClassicSimilarity], result of:
            0.6263084 = score(doc=119,freq=4.0), product of:
              0.5033244 = queryWeight, product of:
                4.359534 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.014497319 = queryNorm
              1.2443434 = fieldWeight in 119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
        0.32 = coord(8/25)
    
  2. Tsai, R.T.-H.; Chiu, B.; Wu, C.-E.: Visual webpage block importance prediction using conditional random fields (2011) 0.21
    0.20604832 = sum of:
      0.20604832 = product of:
        1.0302416 = sum of:
          0.018405735 = weight(abstract_txt:which in 924) [ClassicSimilarity], result of:
            0.018405735 = score(doc=924,freq=4.0), product of:
              0.050534155 = queryWeight, product of:
                1.1962974 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.014497319 = queryNorm
              0.36422366 = fieldWeight in 924, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.092545085 = weight(abstract_txt:importance in 924) [ClassicSimilarity], result of:
            0.092545085 = score(doc=924,freq=2.0), product of:
              0.20567179 = queryWeight, product of:
                2.7867846 = boost
                5.0907717 = idf(docFreq=742, maxDocs=44421)
                0.014497319 = queryNorm
              0.44996488 = fieldWeight in 924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.0907717 = idf(docFreq=742, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.37888992 = weight(abstract_txt:blocks in 924) [ClassicSimilarity], result of:
            0.37888992 = score(doc=924,freq=5.0), product of:
              0.35236135 = queryWeight, product of:
                3.158936 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              1.0752879 = fieldWeight in 924, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.106481746 = weight(abstract_txt:page in 924) [ClassicSimilarity], result of:
            0.106481746 = score(doc=924,freq=1.0), product of:
              0.28453296 = queryWeight, product of:
                3.2777994 = boost
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.014497319 = queryNorm
              0.37423342 = fieldWeight in 924, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
          0.43391916 = weight(abstract_txt:block in 924) [ClassicSimilarity], result of:
            0.43391916 = score(doc=924,freq=3.0), product of:
              0.5033244 = queryWeight, product of:
                4.359534 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.014497319 = queryNorm
              0.8621064 = fieldWeight in 924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=924)
        0.2 = coord(5/25)
    
  3. Wan, X.; Yang, J.; Xiao, J.: Towards a unified approach to document similarity search using manifold-ranking of blocks (2008) 0.14
    0.14370508 = sum of:
      0.14370508 = product of:
        0.89815676 = sum of:
          0.05163319 = weight(abstract_txt:validated in 3081) [ClassicSimilarity], result of:
            0.05163319 = score(doc=3081,freq=1.0), product of:
              0.11063226 = queryWeight, product of:
                1.0219437 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.014497319 = queryNorm
              0.46671006 = fieldWeight in 3081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
          0.03371447 = weight(abstract_txt:propose in 3081) [ClassicSimilarity], result of:
            0.03371447 = score(doc=3081,freq=1.0), product of:
              0.10490996 = queryWeight, product of:
                1.4073737 = boost
                5.1418524 = idf(docFreq=705, maxDocs=44421)
                0.014497319 = queryNorm
              0.32136577 = fieldWeight in 3081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1418524 = idf(docFreq=705, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
          0.37888992 = weight(abstract_txt:blocks in 3081) [ClassicSimilarity], result of:
            0.37888992 = score(doc=3081,freq=5.0), product of:
              0.35236135 = queryWeight, product of:
                3.158936 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              1.0752879 = fieldWeight in 3081, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
          0.43391916 = weight(abstract_txt:block in 3081) [ClassicSimilarity], result of:
            0.43391916 = score(doc=3081,freq=3.0), product of:
              0.5033244 = queryWeight, product of:
                4.359534 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.014497319 = queryNorm
              0.8621064 = fieldWeight in 3081, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
        0.16 = coord(4/25)
    
  4. Baeza-Yates, R.; Navarro, G.: Block addressing indices for approximate text retrieval (2000) 0.13
    0.13292289 = sum of:
      0.13292289 = product of:
        0.5538454 = sum of:
          0.01851064 = weight(abstract_txt:important in 5295) [ClassicSimilarity], result of:
            0.01851064 = score(doc=5295,freq=1.0), product of:
              0.070342876 = queryWeight, product of:
                1.1524209 = boost
                4.21038 = idf(docFreq=1791, maxDocs=44421)
                0.014497319 = queryNorm
              0.26314875 = fieldWeight in 5295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.21038 = idf(docFreq=1791, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
          0.01301482 = weight(abstract_txt:which in 5295) [ClassicSimilarity], result of:
            0.01301482 = score(doc=5295,freq=2.0), product of:
              0.050534155 = queryWeight, product of:
                1.1962974 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.014497319 = queryNorm
              0.25754502 = fieldWeight in 5295, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
          0.036912598 = weight(abstract_txt:called in 5295) [ClassicSimilarity], result of:
            0.036912598 = score(doc=5295,freq=1.0), product of:
              0.11144371 = queryWeight, product of:
                1.4505371 = boost
                5.2995505 = idf(docFreq=602, maxDocs=44421)
                0.014497319 = queryNorm
              0.3312219 = fieldWeight in 5295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2995505 = idf(docFreq=602, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
          0.065439254 = weight(abstract_txt:importance in 5295) [ClassicSimilarity], result of:
            0.065439254 = score(doc=5295,freq=1.0), product of:
              0.20567179 = queryWeight, product of:
                2.7867846 = boost
                5.0907717 = idf(docFreq=742, maxDocs=44421)
                0.014497319 = queryNorm
              0.31817323 = fieldWeight in 5295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0907717 = idf(docFreq=742, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
          0.16944472 = weight(abstract_txt:blocks in 5295) [ClassicSimilarity], result of:
            0.16944472 = score(doc=5295,freq=1.0), product of:
              0.35236135 = queryWeight, product of:
                3.158936 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              0.4808834 = fieldWeight in 5295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
          0.25052336 = weight(abstract_txt:block in 5295) [ClassicSimilarity], result of:
            0.25052336 = score(doc=5295,freq=1.0), product of:
              0.5033244 = queryWeight, product of:
                4.359534 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.014497319 = queryNorm
              0.49773738 = fieldWeight in 5295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=5295)
        0.24 = coord(6/25)
    
  5. Riehm, S.M.: A first look at FirstSearch (1992) 0.12
    0.11971298 = sum of:
      0.11971298 = product of:
        0.74820614 = sum of:
          0.02205266 = weight(abstract_txt:methods in 2344) [ClassicSimilarity], result of:
            0.02205266 = score(doc=2344,freq=1.0), product of:
              0.068124995 = queryWeight, product of:
                1.1341077 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014497319 = queryNorm
              0.3237088 = fieldWeight in 2344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.078125 = fieldNorm(doc=2344)
          0.04614075 = weight(abstract_txt:called in 2344) [ClassicSimilarity], result of:
            0.04614075 = score(doc=2344,freq=1.0), product of:
              0.11144371 = queryWeight, product of:
                1.4505371 = boost
                5.2995505 = idf(docFreq=602, maxDocs=44421)
                0.014497319 = queryNorm
              0.4140274 = fieldWeight in 2344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2995505 = idf(docFreq=602, maxDocs=44421)
                0.078125 = fieldNorm(doc=2344)
          0.36685857 = weight(abstract_txt:blocks in 2344) [ClassicSimilarity], result of:
            0.36685857 = score(doc=2344,freq=3.0), product of:
              0.35236135 = queryWeight, product of:
                3.158936 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014497319 = queryNorm
              1.0411431 = fieldWeight in 2344, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=2344)
          0.3131542 = weight(abstract_txt:block in 2344) [ClassicSimilarity], result of:
            0.3131542 = score(doc=2344,freq=1.0), product of:
              0.5033244 = queryWeight, product of:
                4.359534 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.014497319 = queryNorm
              0.6221717 = fieldWeight in 2344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=2344)
        0.16 = coord(4/25)
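
The arithmetic behind these score trees can be reproduced with a short sketch. The formulas below (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1)), and a coordination factor applied to the sum) are inferred from the figures shown above and agree with them; the helper names `idf` and `term_score` are assumptions, and `fieldNorm` and `queryNorm` are taken as given rather than recomputed. The input values are those listed for entry 1 (doc 119).

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency as it appears in the trees above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight for a single matching term.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

MAX_DOCS = 44421
QUERY_NORM = 0.014497319
FIELD_NORM = 0.078125  # fieldNorm(doc=119)

# (term, termFreq, docFreq, boost) copied from the tree for entry 1.
terms_doc_119 = [
    ("weight",  2.0,   73, 1.0123695),
    ("inverse", 1.0,   54, 1.0529786),
    ("methods", 1.0, 1915, 1.1341077),
    ("propose", 1.0,  705, 1.4073737),
    ("term",    2.0,  998, 2.6247165),
    ("blocks",  2.0,   54, 3.158936),
    ("page",    1.0,  302, 3.2777994),
    ("block",   4.0,   41, 4.359534),
]

term_sum = sum(
    term_score(freq, df, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for _, freq, df, boost in terms_doc_119
)
coord = 8 / 25  # 8 of the 25 query terms matched
total = term_sum * coord
print(round(total, 4))  # about 0.4413, the figure shown for entry 1
```

Small differences in the last digits are expected, since the trees report single-precision values while this sketch uses double precision.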