Document (#40152)

Author
Kanan, T.
Fox, E.A.
Title
Automated Arabic text classification with P-Stemmer, machine learning, and a tailored news article taxonomy
Source
Journal of the Association for Information Science and Technology. 67(2016) no.11, pp. 2667-2683
Year
2016
Abstract
Arabic news articles in electronic collections are difficult to study. Browsing by category is rarely supported. Although helpful machine-learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)-funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles, which should be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that is suitable for the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic-speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer called P-Stemmer) and automatic classification methods (the best being binary Support Vector Machines classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23609/full.
Theme
Automatic indexing

Similar documents (content)

  1. Atlam, E.-S.; Morita, K.; Fuketa, M.; Aoe, J.-i.: A new approach for Arabic text classification using Arabic field-association terms (2011) 0.22
    0.2230345 = sum of:
      0.2230345 = product of:
        0.92931044 = sum of:
          0.040079866 = weight(abstract_txt:classifiers in 4927) [ClassicSimilarity], result of:
            0.040079866 = score(doc=4927,freq=1.0), product of:
              0.08524797 = queryWeight, product of:
                1.0193919 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.011116823 = queryNorm
              0.47015625 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.01342746 = weight(abstract_txt:methods in 4927) [ClassicSimilarity], result of:
            0.01342746 = score(doc=4927,freq=1.0), product of:
              0.05180907 = queryWeight, product of:
                1.1238725 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.011116823 = queryNorm
              0.259172 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.017970374 = weight(abstract_txt:classification in 4927) [ClassicSimilarity], result of:
            0.017970374 = score(doc=4927,freq=1.0), product of:
              0.07202419 = queryWeight, product of:
                1.622927 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.011116823 = queryNorm
              0.2495047 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.007353444 = weight(abstract_txt:with in 4927) [ClassicSimilarity], result of:
            0.007353444 = score(doc=4927,freq=1.0), product of:
              0.04706706 = queryWeight, product of:
                1.6937242 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.011116823 = queryNorm
              0.15623334 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.19704081 = weight(abstract_txt:news in 4927) [ClassicSimilarity], result of:
            0.19704081 = score(doc=4927,freq=2.0), product of:
              0.37421975 = queryWeight, product of:
                5.6508207 = boost
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.011116823 = queryNorm
              0.5265377 = fieldWeight in 4927, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.6534385 = weight(abstract_txt:arabic in 4927) [ClassicSimilarity], result of:
            0.6534385 = score(doc=4927,freq=4.0), product of:
              0.69057846 = queryWeight, product of:
                8.206362 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.011116823 = queryNorm
              0.9462191 = fieldWeight in 4927, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
        0.24 = coord(6/25)
    
  2. Singh, V.K.; Ghosh, I.; Sonagara, D.: Detecting fake news stories via multimodal analysis (2021) 0.17
    0.17180283 = sum of:
      0.17180283 = product of:
        0.61358154 = sum of:
          0.053507876 = weight(abstract_txt:stories in 88) [ClassicSimilarity], result of:
            0.053507876 = score(doc=88,freq=2.0), product of:
              0.082035474 = queryWeight, product of:
                7.3793993 = idf(docFreq=74, maxDocs=44218)
                0.011116823 = queryNorm
              0.6522529 = fieldWeight in 88, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3793993 = idf(docFreq=74, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.017503344 = weight(abstract_txt:techniques in 88) [ClassicSimilarity], result of:
            0.017503344 = score(doc=88,freq=1.0), product of:
              0.06182402 = queryWeight, product of:
                1.2277014 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.011116823 = queryNorm
              0.2831156 = fieldWeight in 88, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.020192495 = weight(abstract_txt:learning in 88) [ClassicSimilarity], result of:
            0.020192495 = score(doc=88,freq=1.0), product of:
              0.068004325 = queryWeight, product of:
                1.2876043 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.011116823 = queryNorm
              0.29692957 = fieldWeight in 88, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.027695347 = weight(abstract_txt:machine in 88) [ClassicSimilarity], result of:
            0.027695347 = score(doc=88,freq=1.0), product of:
              0.08394878 = queryWeight, product of:
                1.4306103 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.011116823 = queryNorm
              0.32990766 = fieldWeight in 88, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.010399341 = weight(abstract_txt:with in 88) [ClassicSimilarity], result of:
            0.010399341 = score(doc=88,freq=2.0), product of:
              0.04706706 = queryWeight, product of:
                1.6937242 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.011116823 = queryNorm
              0.22094731 = fieldWeight in 88, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.043686505 = weight(abstract_txt:articles in 88) [ClassicSimilarity], result of:
            0.043686505 = score(doc=88,freq=2.0), product of:
              0.10335429 = queryWeight, product of:
                1.9441243 = boost
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.011116823 = queryNorm
              0.4226869 = fieldWeight in 88, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
          0.44059664 = weight(abstract_txt:news in 88) [ClassicSimilarity], result of:
            0.44059664 = score(doc=88,freq=10.0), product of:
              0.37421975 = queryWeight, product of:
                5.6508207 = boost
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.011116823 = queryNorm
              1.1773741 = fieldWeight in 88, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.0625 = fieldNorm(doc=88)
        0.28 = coord(7/25)
    
  3. Xu, J.; Weischedel, R.: Empirical studies on the impact of lexical resources on CLIR performance (2005) 0.16
    0.16031574 = sum of:
      0.16031574 = product of:
        1.0019734 = sum of:
          0.034619182 = weight(abstract_txt:machine in 1020) [ClassicSimilarity], result of:
            0.034619182 = score(doc=1020,freq=1.0), product of:
              0.08394878 = queryWeight, product of:
                1.4306103 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.011116823 = queryNorm
              0.41238457 = fieldWeight in 1020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.078125 = fieldNorm(doc=1020)
          0.012999176 = weight(abstract_txt:with in 1020) [ClassicSimilarity], result of:
            0.012999176 = score(doc=1020,freq=2.0), product of:
              0.04706706 = queryWeight, product of:
                1.6937242 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.011116823 = queryNorm
              0.27618414 = fieldWeight in 1020, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.078125 = fieldNorm(doc=1020)
          0.13755687 = weight(abstract_txt:stemming in 1020) [ClassicSimilarity], result of:
            0.13755687 = score(doc=1020,freq=2.0), product of:
              0.16715321 = queryWeight, product of:
                2.0186987 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.011116823 = queryNorm
              0.8229388 = fieldWeight in 1020, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.078125 = fieldNorm(doc=1020)
          0.81679815 = weight(abstract_txt:arabic in 1020) [ClassicSimilarity], result of:
            0.81679815 = score(doc=1020,freq=4.0), product of:
              0.69057846 = queryWeight, product of:
                8.206362 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.011116823 = queryNorm
              1.1827738 = fieldWeight in 1020, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.078125 = fieldNorm(doc=1020)
        0.16 = coord(4/25)
    
  4. Abdelali, A.: Localization in modern standard Arabic (2004) 0.16
    0.15705685 = sum of:
      0.15705685 = product of:
        0.98160535 = sum of:
          0.016784323 = weight(abstract_txt:methods in 2066) [ClassicSimilarity], result of:
            0.016784323 = score(doc=2066,freq=1.0), product of:
              0.05180907 = queryWeight, product of:
                1.1238725 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.011116823 = queryNorm
              0.32396498 = fieldWeight in 2066, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.078125 = fieldNorm(doc=2066)
          0.012999176 = weight(abstract_txt:with in 2066) [ClassicSimilarity], result of:
            0.012999176 = score(doc=2066,freq=2.0), product of:
              0.04706706 = queryWeight, product of:
                1.6937242 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.011116823 = queryNorm
              0.27618414 = fieldWeight in 2066, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.078125 = fieldNorm(doc=2066)
          0.038613778 = weight(abstract_txt:articles in 2066) [ClassicSimilarity], result of:
            0.038613778 = score(doc=2066,freq=1.0), product of:
              0.10335429 = queryWeight, product of:
                1.9441243 = boost
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.011116823 = queryNorm
              0.37360597 = fieldWeight in 2066, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.078125 = fieldNorm(doc=2066)
          0.91320807 = weight(abstract_txt:arabic in 2066) [ClassicSimilarity], result of:
            0.91320807 = score(doc=2066,freq=5.0), product of:
              0.69057846 = queryWeight, product of:
                8.206362 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.011116823 = queryNorm
              1.3223814 = fieldWeight in 2066, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.078125 = fieldNorm(doc=2066)
        0.16 = coord(4/25)
    
  5. Arapakis, I.; Cambazoglu, B.B.; Lalmas, M.: On the feasibility of predicting popular news at cold start (2017) 0.15
    0.14987242 = sum of:
      0.14987242 = product of:
        0.62446845 = sum of:
          0.017503344 = weight(abstract_txt:techniques in 3595) [ClassicSimilarity], result of:
            0.017503344 = score(doc=3595,freq=1.0), product of:
              0.06182402 = queryWeight, product of:
                1.2277014 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.011116823 = queryNorm
              0.2831156 = fieldWeight in 3595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
          0.020192495 = weight(abstract_txt:learning in 3595) [ClassicSimilarity], result of:
            0.020192495 = score(doc=3595,freq=1.0), product of:
              0.068004325 = queryWeight, product of:
                1.2876043 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.011116823 = queryNorm
              0.29692957 = fieldWeight in 3595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
          0.027695347 = weight(abstract_txt:machine in 3595) [ClassicSimilarity], result of:
            0.027695347 = score(doc=3595,freq=1.0), product of:
              0.08394878 = queryWeight, product of:
                1.4306103 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.011116823 = queryNorm
              0.32990766 = fieldWeight in 3595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
          0.007353444 = weight(abstract_txt:with in 3595) [ClassicSimilarity], result of:
            0.007353444 = score(doc=3595,freq=1.0), product of:
              0.04706706 = queryWeight, product of:
                1.6937242 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.011116823 = queryNorm
              0.15623334 = fieldWeight in 3595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
          0.06907443 = weight(abstract_txt:articles in 3595) [ClassicSimilarity], result of:
            0.06907443 = score(doc=3595,freq=5.0), product of:
              0.10335429 = queryWeight, product of:
                1.9441243 = boost
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.011116823 = queryNorm
              0.6683267 = fieldWeight in 3595, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
          0.48264942 = weight(abstract_txt:news in 3595) [ClassicSimilarity], result of:
            0.48264942 = score(doc=3595,freq=12.0), product of:
              0.37421975 = queryWeight, product of:
                5.6508207 = boost
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.011116823 = queryNorm
              1.2897487 = fieldWeight in 3595, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                5.957094 = idf(docFreq=310, maxDocs=44218)
                0.0625 = fieldNorm(doc=3595)
        0.24 = coord(6/25)
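
Note on the score breakdowns above: the numbers follow the usual TF-IDF pattern of Lucene's ClassicSimilarity, where each term contributes queryWeight times fieldWeight and the sum is scaled by the coord factor. The following is a minimal sketch, not the search engine's own code, that recomputes the 0.22 score of the first entry (doc 4927) from the values printed in its explanation. The function and variable names are illustrative assumptions; boost, queryNorm, and fieldNorm are copied from the listing rather than derived. Small deviations in the last digits are expected because the listing was produced with single-precision arithmetic.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.011116823  # queryNorm, as printed in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# term: (termFreq, docFreq, boost), copied from the explanation of doc 4927
terms_4927 = {
    "classifiers":    (1.0, 64,   1.0193919),
    "methods":        (1.0, 1900, 1.1238725),
    "classification": (1.0, 2218, 1.622927),
    "with":           (1.0, 9868, 1.6937242),
    "news":           (2.0, 310,  5.6508207),
    "arabic":         (4.0, 61,   8.206362),
}
FIELD_NORM_4927 = 0.0625  # fieldNorm(doc=4927)

per_term = {
    term: term_score(freq, doc_freq, boost, FIELD_NORM_4927)
    for term, (freq, doc_freq, boost) in terms_4927.items()
}

# coord(6/25): 6 of the 25 query terms occur in the document
coord = len(terms_4927) / 25
total = coord * sum(per_term.values())

for term, score in per_term.items():
    print(f"{term:15s} {score:.6f}")
print(f"total           {total:.6f}")
```

The breakdown shows why "arabic" dominates the ranking: it is rare in the collection (docFreq=61), carries the largest boost, and occurs four times in the matching abstract.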