Document (#30116)

Author
Duwairi, R.M.
Title
Machine learning for Arabic text categorization
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.8, pp. 1005-1010
Year
2006
Abstract
In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
Theme
Computational linguistics
Automatic classification
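
The abstract describes a classifier that builds one feature vector per category during training and then assigns an unseen document to the closest category. The sketch below illustrates that general idea only; it is not the author's implementation. Cosine similarity, raw term counts, the identity `stem` placeholder, and the toy English tokens are all assumptions made for illustration, since the abstract does not name the distance measure, the weighting scheme, or the stemmer.

```python
from collections import Counter
import math


def stem(token):
    # Hypothetical placeholder: returns the token unchanged.
    # A real Arabic stemmer would be plugged in here to shrink the vocabulary.
    return token


def vectorize(tokens):
    # Term-count vector over stemmed tokens (assumed weighting scheme).
    return Counter(stem(t) for t in tokens)


def train(labelled_docs):
    # Learning phase: accumulate one term-count vector per category.
    category_vectors = {}
    for label, tokens in labelled_docs:
        category_vectors.setdefault(label, Counter()).update(vectorize(tokens))
    return category_vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors (assumed closeness measure).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(category_vectors, tokens):
    # Testing phase: pick the category whose vector is closest to the document.
    vec = vectorize(tokens)
    return max(category_vectors, key=lambda label: cosine(vec, category_vectors[label]))


# Toy training data, invented for this example.
training = [
    ("sports", ["match", "goal", "team", "goal"]),
    ("economy", ["market", "bank", "price"]),
    ("sports", ["team", "player", "match"]),
]
category_vectors = train(training)
predicted = classify(category_vectors, ["team", "goal", "coach"])
print(predicted)
```

Swapping `stem` for an actual stemmer reduces the number of distinct keys in each vector, which is the dimensionality reduction the abstract refers to.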

Similar documents (content)

  1. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.35
    0.35145172 = sum of:
      0.35145172 = product of:
        1.0982866 = sum of:
          0.04503458 = weight(abstract_txt:text in 2804) [ClassicSimilarity], result of:
            0.04503458 = score(doc=2804,freq=6.0), product of:
              0.07274341 = queryWeight, product of:
                1.2259675 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014672964 = queryNorm
              0.6190881 = fieldWeight in 2804, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.08723349 = weight(abstract_txt:dimensionality in 2804) [ClassicSimilarity], result of:
            0.08723349 = score(doc=2804,freq=1.0), product of:
              0.16302674 = queryWeight, product of:
                1.2977666 = boost
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.014672964 = queryNorm
              0.53508705 = fieldWeight in 2804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.022323536 = weight(abstract_txt:specific in 2804) [ClassicSimilarity], result of:
            0.022323536 = score(doc=2804,freq=1.0), product of:
              0.08279205 = queryWeight, product of:
                1.3079058 = boost
                4.314141 = idf(docFreq=1607, maxDocs=44218)
                0.014672964 = queryNorm
              0.2696338 = fieldWeight in 2804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.314141 = idf(docFreq=1607, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.04503651 = weight(abstract_txt:features in 2804) [ClassicSimilarity], result of:
            0.04503651 = score(doc=2804,freq=3.0), product of:
              0.09165357 = queryWeight, product of:
                1.3761218 = boost
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.014672964 = queryNorm
              0.49137756 = fieldWeight in 2804, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.1511361 = weight(abstract_txt:feature in 2804) [ClassicSimilarity], result of:
            0.1511361 = score(doc=2804,freq=7.0), product of:
              0.15489098 = queryWeight, product of:
                1.7889377 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014672964 = queryNorm
              0.9757579 = fieldWeight in 2804, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.05504635 = weight(abstract_txt:documents in 2804) [ClassicSimilarity], result of:
            0.05504635 = score(doc=2804,freq=2.0), product of:
              0.15111202 = queryWeight, product of:
                2.4988873 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014672964 = queryNorm
              0.36427513 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.10203348 = weight(abstract_txt:category in 2804) [ClassicSimilarity], result of:
            0.10203348 = score(doc=2804,freq=1.0), product of:
              0.26101905 = queryWeight, product of:
                2.8442247 = boost
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.014672964 = queryNorm
              0.39090434 = fieldWeight in 2804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.5904426 = weight(abstract_txt:classifier in 2804) [ClassicSimilarity], result of:
            0.5904426 = score(doc=2804,freq=5.0), product of:
              0.58334005 = queryWeight, product of:
                5.4892507 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014672964 = queryNorm
              1.0121757 = fieldWeight in 2804, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
        0.32 = coord(8/25)
    
  2. Duwairi, R.; Al-Refai, M.N.; Khasawneh, N.: Feature reduction techniques for Arabic text categorization (2009) 0.34
    0.34085017 = sum of:
      0.34085017 = product of:
        1.0651568 = sum of:
          0.114886135 = weight(abstract_txt:stemming in 3169) [ClassicSimilarity], result of:
            0.114886135 = score(doc=3169,freq=4.0), product of:
              0.123394296 = queryWeight, product of:
                1.129054 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.014672964 = queryNorm
              0.931049 = fieldWeight in 3169, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.01838529 = weight(abstract_txt:text in 3169) [ClassicSimilarity], result of:
            0.01838529 = score(doc=3169,freq=1.0), product of:
              0.07274341 = queryWeight, product of:
                1.2259675 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014672964 = queryNorm
              0.25274166 = fieldWeight in 3169, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.03859231 = weight(abstract_txt:categories in 3169) [ClassicSimilarity], result of:
            0.03859231 = score(doc=3169,freq=1.0), product of:
              0.11925607 = queryWeight, product of:
                1.5697207 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.014672964 = queryNorm
              0.32360876 = fieldWeight in 3169, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.05712408 = weight(abstract_txt:feature in 3169) [ClassicSimilarity], result of:
            0.05712408 = score(doc=3169,freq=1.0), product of:
              0.15489098 = queryWeight, product of:
                1.7889377 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014672964 = queryNorm
              0.36880183 = fieldWeight in 3169, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.32380652 = weight(abstract_txt:vectors in 3169) [ClassicSimilarity], result of:
            0.32380652 = score(doc=3169,freq=6.0), product of:
              0.27099 = queryWeight, product of:
                2.36624 = boost
                7.805067 = idf(docFreq=48, maxDocs=44218)
                0.014672964 = queryNorm
              1.1949021 = fieldWeight in 3169, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.805067 = idf(docFreq=48, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.06741773 = weight(abstract_txt:documents in 3169) [ClassicSimilarity], result of:
            0.06741773 = score(doc=3169,freq=3.0), product of:
              0.15111202 = queryWeight, product of:
                2.4988873 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014672964 = queryNorm
              0.44614407 = fieldWeight in 3169, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.18089077 = weight(abstract_txt:arabic in 3169) [ClassicSimilarity], result of:
            0.18089077 = score(doc=3169,freq=1.0), product of:
              0.38234437 = queryWeight, product of:
                3.4423509 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.014672964 = queryNorm
              0.47310954 = fieldWeight in 3169, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
          0.26405397 = weight(abstract_txt:classifier in 3169) [ClassicSimilarity], result of:
            0.26405397 = score(doc=3169,freq=1.0), product of:
              0.58334005 = queryWeight, product of:
                5.4892507 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014672964 = queryNorm
              0.45265874 = fieldWeight in 3169, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=3169)
        0.32 = coord(8/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.25
    0.24609844 = sum of:
      0.24609844 = product of:
        1.0254102 = sum of:
          0.022981614 = weight(abstract_txt:text in 3389) [ClassicSimilarity], result of:
            0.022981614 = score(doc=3389,freq=1.0), product of:
              0.07274341 = queryWeight, product of:
                1.2259675 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014672964 = queryNorm
              0.3159271 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.064546235 = weight(abstract_txt:learning in 3389) [ClassicSimilarity], result of:
            0.064546235 = score(doc=3389,freq=3.0), product of:
              0.10040304 = queryWeight, product of:
                1.4403087 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.014672964 = queryNorm
              0.6428713 = fieldWeight in 3389, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.06822221 = weight(abstract_txt:categories in 3389) [ClassicSimilarity], result of:
            0.06822221 = score(doc=3389,freq=2.0), product of:
              0.11925607 = queryWeight, product of:
                1.5697207 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.014672964 = queryNorm
              0.5720649 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.1407173 = weight(abstract_txt:categorization in 3389) [ClassicSimilarity], result of:
            0.1407173 = score(doc=3389,freq=2.0), product of:
              0.1932391 = queryWeight, product of:
                1.9981571 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.014672964 = queryNorm
              0.72820306 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.06880794 = weight(abstract_txt:documents in 3389) [ClassicSimilarity], result of:
            0.06880794 = score(doc=3389,freq=2.0), product of:
              0.15111202 = queryWeight, product of:
                2.4988873 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014672964 = queryNorm
              0.4553439 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.6601349 = weight(abstract_txt:classifier in 3389) [ClassicSimilarity], result of:
            0.6601349 = score(doc=3389,freq=4.0), product of:
              0.58334005 = queryWeight, product of:
                5.4892507 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014672964 = queryNorm
              1.1316469 = fieldWeight in 3389, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
        0.24 = coord(6/25)
    
  4. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.23
    0.22803165 = sum of:
      0.22803165 = product of:
        0.7125989 = sum of:
          0.01838529 = weight(abstract_txt:text in 4775) [ClassicSimilarity], result of:
            0.01838529 = score(doc=4775,freq=1.0), product of:
              0.07274341 = queryWeight, product of:
                1.2259675 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014672964 = queryNorm
              0.25274166 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.04503651 = weight(abstract_txt:features in 4775) [ClassicSimilarity], result of:
            0.04503651 = score(doc=4775,freq=3.0), product of:
              0.09165357 = queryWeight, product of:
                1.3761218 = boost
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.014672964 = queryNorm
              0.49137756 = fieldWeight in 4775, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.054577764 = weight(abstract_txt:categories in 4775) [ClassicSimilarity], result of:
            0.054577764 = score(doc=4775,freq=2.0), product of:
              0.11925607 = queryWeight, product of:
                1.5697207 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.014672964 = queryNorm
              0.45765188 = fieldWeight in 4775, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.12773332 = weight(abstract_txt:feature in 4775) [ClassicSimilarity], result of:
            0.12773332 = score(doc=4775,freq=5.0), product of:
              0.15489098 = queryWeight, product of:
                1.7889377 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014672964 = queryNorm
              0.82466596 = fieldWeight in 4775, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.068163976 = weight(abstract_txt:phase in 4775) [ClassicSimilarity], result of:
            0.068163976 = score(doc=4775,freq=1.0), product of:
              0.17425421 = queryWeight, product of:
                1.8974651 = boost
                6.258808 = idf(docFreq=229, maxDocs=44218)
                0.014672964 = queryNorm
              0.3911755 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.258808 = idf(docFreq=229, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.07960173 = weight(abstract_txt:categorization in 4775) [ClassicSimilarity], result of:
            0.07960173 = score(doc=4775,freq=1.0), product of:
              0.1932391 = queryWeight, product of:
                1.9981571 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.014672964 = queryNorm
              0.41193387 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.05504635 = weight(abstract_txt:documents in 4775) [ClassicSimilarity], result of:
            0.05504635 = score(doc=4775,freq=2.0), product of:
              0.15111202 = queryWeight, product of:
                2.4988873 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014672964 = queryNorm
              0.36427513 = fieldWeight in 4775, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.26405397 = weight(abstract_txt:classifier in 4775) [ClassicSimilarity], result of:
            0.26405397 = score(doc=4775,freq=1.0), product of:
              0.58334005 = queryWeight, product of:
                5.4892507 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014672964 = queryNorm
              0.45265874 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
        0.32 = coord(8/25)
    
  5. Yang, Y.; Liu, X.: A re-examination of text categorization methods (1999) 0.21
    0.20602055 = sum of:
      0.20602055 = product of:
        1.0301027 = sum of:
          0.027577935 = weight(abstract_txt:text in 3386) [ClassicSimilarity], result of:
            0.027577935 = score(doc=3386,freq=1.0), product of:
              0.07274341 = queryWeight, product of:
                1.2259675 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014672964 = queryNorm
              0.37911248 = fieldWeight in 3386, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=3386)
          0.057888463 = weight(abstract_txt:categories in 3386) [ClassicSimilarity], result of:
            0.057888463 = score(doc=3386,freq=1.0), product of:
              0.11925607 = queryWeight, product of:
                1.5697207 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.014672964 = queryNorm
              0.48541313 = fieldWeight in 3386, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.09375 = fieldNorm(doc=3386)
          0.11940259 = weight(abstract_txt:categorization in 3386) [ClassicSimilarity], result of:
            0.11940259 = score(doc=3386,freq=1.0), product of:
              0.1932391 = queryWeight, product of:
                1.9981571 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.014672964 = queryNorm
              0.6179008 = fieldWeight in 3386, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.09375 = fieldNorm(doc=3386)
          0.26509076 = weight(abstract_txt:category in 3386) [ClassicSimilarity], result of:
            0.26509076 = score(doc=3386,freq=3.0), product of:
              0.26101905 = queryWeight, product of:
                2.8442247 = boost
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.014672964 = queryNorm
              1.0155993 = fieldWeight in 3386, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.09375 = fieldNorm(doc=3386)
          0.56014305 = weight(abstract_txt:classifier in 3386) [ClassicSimilarity], result of:
            0.56014305 = score(doc=3386,freq=2.0), product of:
              0.58334005 = queryWeight, product of:
                5.4892507 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014672964 = queryNorm
              0.9602342 = fieldWeight in 3386, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.09375 = fieldNorm(doc=3386)
        0.2 = coord(5/25)
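
The score trees above can be checked by hand. The sketch below reproduces the arithmetic for entry 1 (doc 2804), assuming the classic TF-IDF formulas that the printed values are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final score equal to the sum of term scores times the coord factor. The function names are illustrative and are not part of any library API; the constants are copied from the listing.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.014672964  # queryNorm, as printed in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    # Inverse document frequency: rarer terms get larger values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    # Term frequency is dampened with a square root.
    return math.sqrt(freq)


def term_score(freq, doc_freq, boost, field_norm, query_norm=QUERY_NORM):
    # score = queryWeight * fieldWeight for a single query term.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, freq, docFreq, boost) for the eight terms matched in doc 2804.
terms_2804 = [
    ("text", 6.0, 2106, 1.2259675),
    ("dimensionality", 1.0, 22, 1.2977666),
    ("specific", 1.0, 1607, 1.3079058),
    ("features", 3.0, 1283, 1.3761218),
    ("feature", 7.0, 328, 1.7889377),
    ("documents", 2.0, 1949, 2.4988873),
    ("category", 1.0, 230, 2.8442247),
    ("classifier", 5.0, 85, 5.4892507),
]

FIELD_NORM_2804 = 0.0625
scores = {t: term_score(f, df, b, FIELD_NORM_2804) for t, f, df, b in terms_2804}

# coord(8/25): 8 of the 25 query terms matched this document.
total = sum(scores.values()) * (8 / 25)
print(round(total, 4))
```

The rare, heavily boosted term "classifier" contributes more than half of the summed term scores, which is why documents 3 and 5 still rank in the list despite matching fewer query terms.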