Document (#28464)

Author
Thelwall, M.
Title
Text characteristics of English language university Web sites
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, p.609-619
Year
2005
Abstract
The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms or computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.

Similar documents (author)

  1. Thelwall, M.; Thelwall, S.: ¬A thematic analysis of highly retweeted early COVID-19 tweets : consensus, information, dissent and lockdown life (2020) 4.89
    4.888919 = sum of:
      4.888919 = weight(author_txt:thelwall in 1179) [ClassicSimilarity], result of:
        4.888919 = fieldWeight in 1179, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          6.9139757 = idf(docFreq=119, maxDocs=44421)
          0.5 = fieldNorm(doc=1179)
    
  2. Thelwall, M.: Extracting macroscopic information from Web links (2001) 4.32
    4.3212347 = sum of:
      4.3212347 = weight(author_txt:thelwall in 851) [ClassicSimilarity], result of:
        4.3212347 = fieldWeight in 851, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9139757 = idf(docFreq=119, maxDocs=44421)
          0.625 = fieldNorm(doc=851)
    
  3. Thelwall, M.: Conceptualizing documentation on the Web : an evaluation of different heuristic-based models for counting links between university Web sites (2002) 4.32
    4.3212347 = sum of:
      4.3212347 = weight(author_txt:thelwall in 1978) [ClassicSimilarity], result of:
        4.3212347 = fieldWeight in 1978, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9139757 = idf(docFreq=119, maxDocs=44421)
          0.625 = fieldNorm(doc=1978)
    
  4. Thelwall, M.: Bibliometrics to webometrics (2009) 4.32
    4.3212347 = sum of:
      4.3212347 = weight(author_txt:thelwall in 5239) [ClassicSimilarity], result of:
        4.3212347 = fieldWeight in 5239, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9139757 = idf(docFreq=119, maxDocs=44421)
          0.625 = fieldNorm(doc=5239)
    
  5. Thelwall, M.: ¬A layered approach for investigating the topological structure of communities in the Web (2003) 4.32
    4.3212347 = sum of:
      4.3212347 = weight(author_txt:thelwall in 5450) [ClassicSimilarity], result of:
        4.3212347 = fieldWeight in 5450, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9139757 = idf(docFreq=119, maxDocs=44421)
          0.625 = fieldNorm(doc=5450)
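
The author scores above are each a single fieldWeight, the product of tf, idf and fieldNorm. The following is a minimal sketch of that arithmetic, not the search engine's own code. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the listed values; the function names are hypothetical, and the fieldNorm values are copied from the listing rather than derived.

```python
import math


def tf(freq: float) -> float:
    # Assumed term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed inverse document frequency, consistent with the listed
    # value 6.9139757 for docFreq=119, maxDocs=44421.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm (fieldNorm is taken as given).
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Result 1 (doc 1179): "thelwall" occurs twice in the author field.
score_1 = field_weight(freq=2.0, doc_freq=119, max_docs=44421, field_norm=0.5)

# Results 2 to 5: "thelwall" occurs once, with a larger fieldNorm.
score_2 = field_weight(freq=1.0, doc_freq=119, max_docs=44421, field_norm=0.625)

print(round(score_1, 4), round(score_2, 4))
```

Under these assumptions the two values agree with the listed 4.888919 and 4.3212347 to within floating-point rounding.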
    

Similar documents (content)

  1. Price, L.; Thelwall, M.: ¬The clustering power of low frequency words in academic webs (2005) 0.27
    0.27097872 = sum of:
      0.27097872 = product of:
        1.129078 = sum of:
          0.13739082 = weight(abstract_txt:zealand in 4561) [ClassicSimilarity], result of:
            0.13739082 = score(doc=4561,freq=1.0), product of:
              0.18113752 = queryWeight, product of:
                1.1510834 = boost
                8.090549 = idf(docFreq=36, maxDocs=44421)
                0.01945018 = queryNorm
              0.758489 = fieldWeight in 4561, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.090549 = idf(docFreq=36, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
          0.09161821 = weight(abstract_txt:academic in 4561) [ClassicSimilarity], result of:
            0.09161821 = score(doc=4561,freq=3.0), product of:
              0.12077974 = queryWeight, product of:
                1.3292743 = boost
                4.6714945 = idf(docFreq=1129, maxDocs=44421)
                0.01945018 = queryNorm
              0.7585561 = fieldWeight in 4561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6714945 = idf(docFreq=1129, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
          0.08344261 = weight(abstract_txt:word in 4561) [ClassicSimilarity], result of:
            0.08344261 = score(doc=4561,freq=1.0), product of:
              0.16367105 = queryWeight, product of:
                1.5474032 = boost
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.01945018 = queryNorm
              0.50981903 = fieldWeight in 4561, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
          0.21278352 = weight(abstract_txt:sites in 4561) [ClassicSimilarity], result of:
            0.21278352 = score(doc=4561,freq=3.0), product of:
              0.24247219 = queryWeight, product of:
                2.3067162 = boost
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.01945018 = queryNorm
              0.87755847 = fieldWeight in 4561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
          0.3277056 = weight(abstract_txt:frequency in 4561) [ClassicSimilarity], result of:
            0.3277056 = score(doc=4561,freq=4.0), product of:
              0.29379627 = queryWeight, product of:
                2.5391383 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.01945018 = queryNorm
              1.1154178 = fieldWeight in 4561, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
          0.27613723 = weight(abstract_txt:words in 4561) [ClassicSimilarity], result of:
            0.27613723 = score(doc=4561,freq=3.0), product of:
              0.31751642 = queryWeight, product of:
                3.0480049 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.01945018 = queryNorm
              0.86967856 = fieldWeight in 4561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.09375 = fieldNorm(doc=4561)
        0.24 = coord(6/25)
    
  2. Spink, A.; Wolfram, D.; Jansen, B.J.; Saracevic, T.: Searching the Web : the public and their queries (2001) 0.19
    0.19272539 = sum of:
      0.19272539 = product of:
        0.60226685 = sum of:
          0.05826538 = weight(abstract_txt:spelling in 980) [ClassicSimilarity], result of:
            0.05826538 = score(doc=980,freq=1.0), product of:
              0.16230442 = queryWeight, product of:
                1.0896016 = boost
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.01945018 = queryNorm
              0.35898826 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.06437561 = weight(abstract_txt:minority in 980) [ClassicSimilarity], result of:
            0.06437561 = score(doc=980,freq=1.0), product of:
              0.17346193 = queryWeight, product of:
                1.1264311 = boost
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.01945018 = queryNorm
              0.3711224 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.018825233 = weight(abstract_txt:language in 980) [ClassicSimilarity], result of:
            0.018825233 = score(doc=980,freq=1.0), product of:
              0.09628534 = queryWeight, product of:
                1.186855 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.01945018 = queryNorm
              0.19551504 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.07604231 = weight(abstract_txt:mistakes in 980) [ClassicSimilarity], result of:
            0.07604231 = score(doc=980,freq=1.0), product of:
              0.1938326 = queryWeight, product of:
                1.1907374 = boost
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.01945018 = queryNorm
              0.3923092 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.0868685 = weight(abstract_txt:sites in 980) [ClassicSimilarity], result of:
            0.0868685 = score(doc=980,freq=2.0), product of:
              0.24247219 = queryWeight, product of:
                2.3067162 = boost
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.01945018 = queryNorm
              0.3582617 = fieldWeight in 980, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.0819264 = weight(abstract_txt:frequency in 980) [ClassicSimilarity], result of:
            0.0819264 = score(doc=980,freq=1.0), product of:
              0.29379627 = queryWeight, product of:
                2.5391383 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.01945018 = queryNorm
              0.27885446 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.112732545 = weight(abstract_txt:words in 980) [ClassicSimilarity], result of:
            0.112732545 = score(doc=980,freq=2.0), product of:
              0.31751642 = queryWeight, product of:
                3.0480049 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.01945018 = queryNorm
              0.35504478 = fieldWeight in 980, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
          0.10323082 = weight(abstract_txt:names in 980) [ClassicSimilarity], result of:
            0.10323082 = score(doc=980,freq=1.0), product of:
              0.37723866 = queryWeight, product of:
                3.322314 = boost
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.01945018 = queryNorm
              0.27364856 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.046875 = fieldNorm(doc=980)
        0.32 = coord(8/25)
    
  3. Thelwall, M.; Wilkinson, D.: Graph structure in three national academic Webs : power laws with anomalies (2003) 0.18
    0.17988297 = sum of:
      0.17988297 = product of:
        0.74951243 = sum of:
          0.11449235 = weight(abstract_txt:zealand in 2681) [ClassicSimilarity], result of:
            0.11449235 = score(doc=2681,freq=1.0), product of:
              0.18113752 = queryWeight, product of:
                1.1510834 = boost
                8.090549 = idf(docFreq=36, maxDocs=44421)
                0.01945018 = queryNorm
              0.6320742 = fieldWeight in 2681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.090549 = idf(docFreq=36, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
          0.11811982 = weight(abstract_txt:webs in 2681) [ClassicSimilarity], result of:
            0.11811982 = score(doc=2681,freq=1.0), product of:
              0.1849436 = queryWeight, product of:
                1.1631138 = boost
                8.175107 = idf(docFreq=33, maxDocs=44421)
                0.01945018 = queryNorm
              0.6386802 = fieldWeight in 2681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.175107 = idf(docFreq=33, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
          0.18661353 = weight(abstract_txt:anomalies in 2681) [ClassicSimilarity], result of:
            0.18661353 = score(doc=2681,freq=2.0), product of:
              0.19911756 = queryWeight, product of:
                1.2068613 = boost
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.01945018 = queryNorm
              0.9372028 = fieldWeight in 2681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
          0.04749405 = weight(abstract_txt:university in 2681) [ClassicSimilarity], result of:
            0.04749405 = score(doc=2681,freq=2.0), product of:
              0.10075121 = queryWeight, product of:
                1.2140671 = boost
                4.2666197 = idf(docFreq=1693, maxDocs=44421)
                0.01945018 = queryNorm
              0.4713993 = fieldWeight in 2681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2666197 = idf(docFreq=1693, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
          0.13801178 = weight(abstract_txt:regularities in 2681) [ClassicSimilarity], result of:
            0.13801178 = score(doc=2681,freq=1.0), product of:
              0.20516418 = queryWeight, product of:
                1.2250487 = boost
                8.610425 = idf(docFreq=21, maxDocs=44421)
                0.01945018 = queryNorm
              0.67268944 = fieldWeight in 2681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.610425 = idf(docFreq=21, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
          0.14478084 = weight(abstract_txt:sites in 2681) [ClassicSimilarity], result of:
            0.14478084 = score(doc=2681,freq=2.0), product of:
              0.24247219 = queryWeight, product of:
                2.3067162 = boost
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.01945018 = queryNorm
              0.5971029 = fieldWeight in 2681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4043584 = idf(docFreq=542, maxDocs=44421)
                0.078125 = fieldNorm(doc=2681)
        0.24 = coord(6/25)
    
  4. Wacholder, N.; Byrd, R.J.: Retrieving information from full text using linguistic knowledge (1994) 0.12
    0.11980128 = sum of:
      0.11980128 = product of:
        0.5990064 = sum of:
          0.09710896 = weight(abstract_txt:spelling in 138) [ClassicSimilarity], result of:
            0.09710896 = score(doc=138,freq=1.0), product of:
              0.16230442 = queryWeight, product of:
                1.0896016 = boost
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.01945018 = queryNorm
              0.59831375 = fieldWeight in 138, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.078125 = fieldNorm(doc=138)
          0.054343764 = weight(abstract_txt:language in 138) [ClassicSimilarity], result of:
            0.054343764 = score(doc=138,freq=3.0), product of:
              0.09628534 = queryWeight, product of:
                1.186855 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.01945018 = queryNorm
              0.5644033 = fieldWeight in 138, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=138)
          0.14264573 = weight(abstract_txt:acronyms in 138) [ClassicSimilarity], result of:
            0.14264573 = score(doc=138,freq=1.0), product of:
              0.20973133 = queryWeight, product of:
                1.238609 = boost
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.01945018 = queryNorm
              0.68013555 = fieldWeight in 138, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.078125 = fieldNorm(doc=138)
          0.13285659 = weight(abstract_txt:words in 138) [ClassicSimilarity], result of:
            0.13285659 = score(doc=138,freq=1.0), product of:
              0.31751642 = queryWeight, product of:
                3.0480049 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.01945018 = queryNorm
              0.4184243 = fieldWeight in 138, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.078125 = fieldNorm(doc=138)
          0.17205137 = weight(abstract_txt:names in 138) [ClassicSimilarity], result of:
            0.17205137 = score(doc=138,freq=1.0), product of:
              0.37723866 = queryWeight, product of:
                3.322314 = boost
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.01945018 = queryNorm
              0.45608097 = fieldWeight in 138, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.078125 = fieldNorm(doc=138)
        0.2 = coord(5/25)
    
  5. Riggs, F.W.: Information and social science : the need for onomantics (1989) 0.11
    0.11229215 = sum of:
      0.11229215 = product of:
        0.56146073 = sum of:
          0.0443715 = weight(abstract_txt:language in 2910) [ClassicSimilarity], result of:
            0.0443715 = score(doc=2910,freq=2.0), product of:
              0.09628534 = queryWeight, product of:
                1.186855 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.01945018 = queryNorm
              0.46083337 = fieldWeight in 2910, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=2910)
          0.14264573 = weight(abstract_txt:acronyms in 2910) [ClassicSimilarity], result of:
            0.14264573 = score(doc=2910,freq=1.0), product of:
              0.20973133 = queryWeight, product of:
                1.238609 = boost
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.01945018 = queryNorm
              0.68013555 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.078125 = fieldNorm(doc=2910)
          0.06953551 = weight(abstract_txt:word in 2910) [ClassicSimilarity], result of:
            0.06953551 = score(doc=2910,freq=1.0), product of:
              0.16367105 = queryWeight, product of:
                1.5474032 = boost
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.01945018 = queryNorm
              0.42484915 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4380693 = idf(docFreq=524, maxDocs=44421)
                0.078125 = fieldNorm(doc=2910)
          0.13285659 = weight(abstract_txt:words in 2910) [ClassicSimilarity], result of:
            0.13285659 = score(doc=2910,freq=1.0), product of:
              0.31751642 = queryWeight, product of:
                3.0480049 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.01945018 = queryNorm
              0.4184243 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.078125 = fieldNorm(doc=2910)
          0.17205137 = weight(abstract_txt:names in 2910) [ClassicSimilarity], result of:
            0.17205137 = score(doc=2910,freq=1.0), product of:
              0.37723866 = queryWeight, product of:
                3.322314 = boost
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.01945018 = queryNorm
              0.45608097 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8378363 = idf(docFreq=351, maxDocs=44421)
                0.078125 = fieldNorm(doc=2910)
        0.2 = coord(5/25)
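
The content scores above multiply a queryWeight (boost, idf and queryNorm) by a fieldWeight (tf, idf and fieldNorm) for each matching term, sum the results, and scale the sum by coord, the fraction of the 25 query terms that matched. The sketch below reworks the first result (doc 4561) under the same assumed tf and idf formulas. The boost, queryNorm and fieldNorm values are copied from the listing, not derived, and all names are hypothetical.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.01945018  # copied from the listing, not derived here
FIELD_NORM = 0.09375     # fieldNorm(doc=4561), copied from the listing
QUERY_TERMS = 25         # denominator of coord(6/25)

# (term, boost, docFreq, freq) for the six clauses matching doc 4561.
CLAUSES = [
    ("zealand", 1.1510834, 36, 1.0),
    ("academic", 1.3292743, 1129, 3.0),
    ("word", 1.5474032, 524, 1.0),
    ("sites", 2.3067162, 542, 3.0),
    ("frequency", 2.5391383, 314, 4.0),
    ("words", 3.0480049, 569, 3.0),
]


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Assumed idf formula, consistent with the listed idf values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def clause_score(boost: float, doc_freq: int, freq: float) -> float:
    # score = queryWeight * fieldWeight for one matching term.
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    return query_weight * field_weight


clause_scores = {term: clause_score(b, df, f) for term, b, df, f in CLAUSES}
coord = len(CLAUSES) / QUERY_TERMS
total = sum(clause_scores.values()) * coord

print(round(total, 4))
```

With these inputs the total agrees with the listed 0.27097872 to within rounding. The coord factor explains why result 2, which matches eight terms, can rank above result 3 despite a smaller sum of term scores.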