Document (#35097)

Author
Kanaan, G.
Al-Shalabi, R.
Ghwanmeh, S.
Al-Ma'adeed, H.
Title
¬A comparison of text-classification techniques applied to Arabic text
Source
Journal of the American Society for Information Science and Technology. 60(2009) no.9, pp. 1836-1844
Year
2009
Abstract
Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. The nature of Arabic text is different from that of English text, and preprocessing of Arabic text is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories has been automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The research results reveal that naïve Bayes was the best performer, followed by kNN and Rocchio.
Theme
Automatic classification
Object
Bayes algorithm
Naive Bayes algorithm
Rocchio algorithm
kNN algorithm

Similar documents (content)

  1. Rushdi-Saleh, M.; Martín-Valdivia, M.T.; Ureña-López, L.A.; Perea-Ortega, J.M.: OCA: Opinion corpus for Arabic (2011) 0.46
    0.45968276 = sum of:
      0.45968276 = product of:
        1.1492069 = sum of:
          0.05824666 = weight(abstract_txt:corpus in 360) [ClassicSimilarity], result of:
            0.05824666 = score(doc=360,freq=3.0), product of:
              0.07065791 = queryWeight, product of:
                1.1429944 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.010147453 = queryNorm
              0.8243474 = fieldWeight in 360, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.013269833 = weight(abstract_txt:research in 360) [ClassicSimilarity], result of:
            0.013269833 = score(doc=360,freq=2.0), product of:
              0.038012885 = queryWeight, product of:
                1.1856163 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.010147453 = queryNorm
              0.34908777 = fieldWeight in 360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.044686254 = weight(abstract_txt:challenging in 360) [ClassicSimilarity], result of:
            0.044686254 = score(doc=360,freq=1.0), product of:
              0.08540235 = queryWeight, product of:
                1.2566046 = boost
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.010147453 = queryNorm
              0.52324384 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.029798314 = weight(abstract_txt:been in 360) [ClassicSimilarity], result of:
            0.029798314 = score(doc=360,freq=2.0), product of:
              0.07461831 = queryWeight, product of:
                2.0344503 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.010147453 = queryNorm
              0.3993432 = fieldWeight in 360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.05153501 = weight(abstract_txt:english in 360) [ClassicSimilarity], result of:
            0.05153501 = score(doc=360,freq=1.0), product of:
              0.11833106 = queryWeight, product of:
                2.0918384 = boost
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.010147453 = queryNorm
              0.4355155 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.05509323 = weight(abstract_txt:algorithms in 360) [ClassicSimilarity], result of:
            0.05509323 = score(doc=360,freq=1.0), product of:
              0.12371699 = queryWeight, product of:
                2.1389143 = boost
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.010147453 = queryNorm
              0.4453166 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.055972174 = weight(abstract_txt:carried in 360) [ClassicSimilarity], result of:
            0.055972174 = score(doc=360,freq=1.0), product of:
              0.12502934 = queryWeight, product of:
                2.150229 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.010147453 = queryNorm
              0.4476723 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.1842054 = weight(abstract_txt:bayes in 360) [ClassicSimilarity], result of:
            0.1842054 = score(doc=360,freq=1.0), product of:
              0.27662966 = queryWeight, product of:
                3.1983654 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.010147453 = queryNorm
              0.6658917 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.107957505 = weight(abstract_txt:text in 360) [ClassicSimilarity], result of:
            0.107957505 = score(doc=360,freq=1.0), product of:
              0.34196892 = queryWeight, product of:
                8.339757 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.010147453 = queryNorm
              0.3156939 = fieldWeight in 360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
          0.5484425 = weight(abstract_txt:arabic in 360) [ClassicSimilarity], result of:
            0.5484425 = score(doc=360,freq=2.0), product of:
              0.65536267 = queryWeight, product of:
                8.526685 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010147453 = queryNorm
              0.83685344 = fieldWeight in 360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.078125 = fieldNorm(doc=360)
        0.4 = coord(10/25)
    
  2. Atlam, E.-S.; Morita, K.; Fuketa, M.; Aoe, J.-i.: ¬A new approach for Arabic text classification using Arabic field-association terms (2011) 0.37
    0.36883172 = sum of:
      0.36883172 = product of:
        1.0245326 = sum of:
          0.028326718 = weight(abstract_txt:automatically in 927) [ClassicSimilarity], result of:
            0.028326718 = score(doc=927,freq=2.0), product of:
              0.058042757 = queryWeight, product of:
                1.0359476 = boost
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.010147453 = queryNorm
              0.48803192 = fieldWeight in 927, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.521451 = idf(docFreq=482, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.02575875 = weight(abstract_txt:followed in 927) [ClassicSimilarity], result of:
            0.02575875 = score(doc=927,freq=1.0), product of:
              0.068639964 = queryWeight, product of:
                1.1265546 = boost
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.010147453 = queryNorm
              0.37527338 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.004374 = idf(docFreq=297, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.007506551 = weight(abstract_txt:research in 927) [ClassicSimilarity], result of:
            0.007506551 = score(doc=927,freq=1.0), product of:
              0.038012885 = queryWeight, product of:
                1.1856163 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.010147453 = queryNorm
              0.19747387 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.041228008 = weight(abstract_txt:english in 927) [ClassicSimilarity], result of:
            0.041228008 = score(doc=927,freq=1.0), product of:
              0.11833106 = queryWeight, product of:
                2.0918384 = boost
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.010147453 = queryNorm
              0.3484124 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.04477774 = weight(abstract_txt:carried in 927) [ClassicSimilarity], result of:
            0.04477774 = score(doc=927,freq=1.0), product of:
              0.12502934 = queryWeight, product of:
                2.150229 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.010147453 = queryNorm
              0.35813785 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.02271258 = weight(abstract_txt:classification in 927) [ClassicSimilarity], result of:
            0.02271258 = score(doc=927,freq=1.0), product of:
              0.09102876 = queryWeight, product of:
                2.2470548 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.010147453 = queryNorm
              0.24950996 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.14736432 = weight(abstract_txt:bayes in 927) [ClassicSimilarity], result of:
            0.14736432 = score(doc=927,freq=1.0), product of:
              0.27662966 = queryWeight, product of:
                3.1983654 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.010147453 = queryNorm
              0.53271335 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.086366005 = weight(abstract_txt:text in 927) [ClassicSimilarity], result of:
            0.086366005 = score(doc=927,freq=1.0), product of:
              0.34196892 = queryWeight, product of:
                8.339757 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.010147453 = queryNorm
              0.25255513 = fieldWeight in 927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
          0.6204919 = weight(abstract_txt:arabic in 927) [ClassicSimilarity], result of:
            0.6204919 = score(doc=927,freq=4.0), product of:
              0.65536267 = queryWeight, product of:
                8.526685 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010147453 = queryNorm
              0.94679165 = fieldWeight in 927, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.0625 = fieldNorm(doc=927)
        0.36 = coord(9/25)
    
  3. Kanan, T.; Fox, E.A.: Automated Arabic text classification with P-Stemmer, machine learning, and a tailored news article taxonomy (2016) 0.23
    0.23039214 = sum of:
      0.23039214 = product of:
        0.95996726 = sum of:
          0.010615867 = weight(abstract_txt:research in 4151) [ClassicSimilarity], result of:
            0.010615867 = score(doc=4151,freq=2.0), product of:
              0.038012885 = queryWeight, product of:
                1.1856163 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.010147453 = queryNorm
              0.27927023 = fieldWeight in 4151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
          0.031330697 = weight(abstract_txt:techniques in 4151) [ClassicSimilarity], result of:
            0.031330697 = score(doc=4151,freq=2.0), product of:
              0.07821209 = queryWeight, product of:
                1.7006528 = boost
                4.5321174 = idf(docFreq=1298, maxDocs=44421)
                0.010147453 = queryNorm
              0.40058637 = fieldWeight in 4151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5321174 = idf(docFreq=1298, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
          0.02383865 = weight(abstract_txt:been in 4151) [ClassicSimilarity], result of:
            0.02383865 = score(doc=4151,freq=2.0), product of:
              0.07461831 = queryWeight, product of:
                2.0344503 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.010147453 = queryNorm
              0.31947455 = fieldWeight in 4151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
          0.041228008 = weight(abstract_txt:english in 4151) [ClassicSimilarity], result of:
            0.041228008 = score(doc=4151,freq=1.0), product of:
              0.11833106 = queryWeight, product of:
                2.0918384 = boost
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.010147453 = queryNorm
              0.3484124 = fieldWeight in 4151, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5745983 = idf(docFreq=457, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
          0.03212044 = weight(abstract_txt:classification in 4151) [ClassicSimilarity], result of:
            0.03212044 = score(doc=4151,freq=2.0), product of:
              0.09102876 = queryWeight, product of:
                2.2470548 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.010147453 = queryNorm
              0.35286036 = fieldWeight in 4151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
          0.82083356 = weight(abstract_txt:arabic in 4151) [ClassicSimilarity], result of:
            0.82083356 = score(doc=4151,freq=7.0), product of:
              0.65536267 = queryWeight, product of:
                8.526685 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010147453 = queryNorm
              1.2524875 = fieldWeight in 4151, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.0625 = fieldNorm(doc=4151)
        0.24 = coord(6/25)
    
  4. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: ¬A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.18
    0.17723747 = sum of:
      0.17723747 = product of:
        0.8861873 = sum of:
          0.02690298 = weight(abstract_txt:corpus in 413) [ClassicSimilarity], result of:
            0.02690298 = score(doc=413,freq=1.0), product of:
              0.07065791 = queryWeight, product of:
                1.1429944 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.010147453 = queryNorm
              0.38074973 = fieldWeight in 413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.0625 = fieldNorm(doc=413)
          0.016856473 = weight(abstract_txt:been in 413) [ClassicSimilarity], result of:
            0.016856473 = score(doc=413,freq=1.0), product of:
              0.07461831 = queryWeight, product of:
                2.0344503 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.010147453 = queryNorm
              0.22590263 = fieldWeight in 413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0625 = fieldNorm(doc=413)
          0.06233087 = weight(abstract_txt:algorithms in 413) [ClassicSimilarity], result of:
            0.06233087 = score(doc=413,freq=2.0), product of:
              0.12371699 = queryWeight, product of:
                2.1389143 = boost
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.010147453 = queryNorm
              0.5038182 = fieldWeight in 413, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7000527 = idf(docFreq=403, maxDocs=44421)
                0.0625 = fieldNorm(doc=413)
          0.086366005 = weight(abstract_txt:text in 413) [ClassicSimilarity], result of:
            0.086366005 = score(doc=413,freq=1.0), product of:
              0.34196892 = queryWeight, product of:
                8.339757 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.010147453 = queryNorm
              0.25255513 = fieldWeight in 413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=413)
          0.693731 = weight(abstract_txt:arabic in 413) [ClassicSimilarity], result of:
            0.693731 = score(doc=413,freq=5.0), product of:
              0.65536267 = queryWeight, product of:
                8.526685 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010147453 = queryNorm
              1.0585452 = fieldWeight in 413, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.0625 = fieldNorm(doc=413)
        0.2 = coord(5/25)
    
  5. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.17
    0.17025189 = sum of:
      0.17025189 = product of:
        0.6080425 = sum of:
          0.022817867 = weight(abstract_txt:implemented in 1831) [ClassicSimilarity], result of:
            0.022817867 = score(doc=1831,freq=1.0), product of:
              0.06331071 = queryWeight, product of:
                1.0819379 = boost
                5.7665734 = idf(docFreq=377, maxDocs=44421)
                0.010147453 = queryNorm
              0.36041084 = fieldWeight in 1831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7665734 = idf(docFreq=377, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.007506551 = weight(abstract_txt:research in 1831) [ClassicSimilarity], result of:
            0.007506551 = score(doc=1831,freq=1.0), product of:
              0.038012885 = queryWeight, product of:
                1.1856163 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.010147453 = queryNorm
              0.19747387 = fieldWeight in 1831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.031330697 = weight(abstract_txt:techniques in 1831) [ClassicSimilarity], result of:
            0.031330697 = score(doc=1831,freq=2.0), product of:
              0.07821209 = queryWeight, product of:
                1.7006528 = boost
                4.5321174 = idf(docFreq=1298, maxDocs=44421)
                0.010147453 = queryNorm
              0.40058637 = fieldWeight in 1831, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5321174 = idf(docFreq=1298, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.06813775 = weight(abstract_txt:classification in 1831) [ClassicSimilarity], result of:
            0.06813775 = score(doc=1831,freq=9.0), product of:
              0.09102876 = queryWeight, product of:
                2.2470548 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.010147453 = queryNorm
              0.7485299 = fieldWeight in 1831, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.13776505 = weight(abstract_txt:naïve in 1831) [ClassicSimilarity], result of:
            0.13776505 = score(doc=1831,freq=1.0), product of:
              0.26448226 = queryWeight, product of:
                3.1273537 = boost
                8.334172 = idf(docFreq=28, maxDocs=44421)
                0.010147453 = queryNorm
              0.52088577 = fieldWeight in 1831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.334172 = idf(docFreq=28, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.14736432 = weight(abstract_txt:bayes in 1831) [ClassicSimilarity], result of:
            0.14736432 = score(doc=1831,freq=1.0), product of:
              0.27662966 = queryWeight, product of:
                3.1983654 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.010147453 = queryNorm
              0.53271335 = fieldWeight in 1831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
          0.19312027 = weight(abstract_txt:text in 1831) [ClassicSimilarity], result of:
            0.19312027 = score(doc=1831,freq=5.0), product of:
              0.34196892 = queryWeight, product of:
                8.339757 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.010147453 = queryNorm
              0.56473047 = fieldWeight in 1831, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=1831)
        0.28 = coord(7/25)
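
The score trees above all follow the same arithmetic, which can be sketched in a few lines of Python. This is only an illustration, not the catalog's actual code: the function names are hypothetical, and the formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1)), and a coord factor of matched terms over query terms) are inferred from the values printed in the listing. The inputs are copied from the first similar document (doc=360).

```python
import math


def idf(doc_freq, max_docs):
    # Inverse document frequency, in the form that reproduces the listed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in each "weight(...)" subtree.
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


def document_score(term_scores, matched_terms, query_terms):
    # The summed term scores are scaled by coord = matched / total query terms.
    return sum(term_scores) * (matched_terms / query_terms)


# Term "corpus" in doc 360 (values taken from the listing).
corpus_score = term_score(
    freq=3.0, doc_freq=272, max_docs=44421,
    boost=1.1429944, query_norm=0.010147453, field_norm=0.078125,
)

# The ten per-term scores listed for doc 360, combined with coord(10/25).
doc_360_terms = [
    0.05824666, 0.013269833, 0.044686254, 0.029798314, 0.05153501,
    0.05509323, 0.055972174, 0.1842054, 0.107957505, 0.5484425,
]
doc_score = document_score(doc_360_terms, matched_terms=10, query_terms=25)

print(f"{corpus_score:.4f}")  # 0.0582
print(f"{doc_score:.4f}")     # 0.4597
```

The results agree with the listed 0.05824666 and 0.45968276 up to single-precision rounding, which suggests the listing was computed with 32-bit floats.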