Document (#43660)

Author
Suominen, O.
Koskenniemi, I.
Title
Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing
Source
Code4Lib journal. Issue 54(2022), [http://journal.code4lib.org]
Year
2022
Abstract
Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments have been performed using the open source Annif toolkit for automated subject indexing and classification, but should generalize also to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform baseline methods in text classification particularly for Finnish and Swedish text, but not English, where baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
Content
Cf.: https://journal.code4lib.org/articles/16719.
Theme
Automatic indexing

Similar documents (author)

  1. Suominen, V.: Linguistic / semiotic conditions of information retrieval / documentation in the light of a sausurean conception of language : 'organising knowledge' or 'communication concerning documents'? (1998) 6.01
    6.0137663 = sum of:
      6.0137663 = weight(author_txt:suominen in 1081) [ClassicSimilarity], result of:
        6.0137663 = fieldWeight in 1081, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.625 = fieldNorm(doc=1081)
    
  2. Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:suominen in 4121) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 4121, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=4121)
    
  3. Suominen, O.; Hyvönen, N.: From MARC silos to Linked Data silos? (2017) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:suominen in 4732) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 4732, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=4732)
    
  4. Suominen, V.; Tuomi, P.: Literacies, hermeneutics, and literature (2015) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:suominen in 543) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 543, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=543)
    
  5. Friman, M.; Jansson, P.; Suominen, V.: Chaos or order? : Aby Warburg's library of cultural history and its classification (1995) 3.61
    3.60826 = sum of:
      3.60826 = weight(author_txt:suominen in 1157) [ClassicSimilarity], result of:
        3.60826 = fieldWeight in 1157, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.375 = fieldNorm(doc=1157)
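
The author-match scores above can be checked by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the figures shown in this record. The function names are illustrative and are not Lucene API calls. fieldNorm is a length-normalization value stored in the index and is taken as given here.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as it appears in the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from entry 1 (doc=1081): freq=1.0, docFreq=7, maxDocs=44421, fieldNorm=0.625
idf_author = classic_idf(7, 44421)
weight_1081 = field_weight(1.0, 7, 44421, 0.625)

# Entry 5 (doc=1157) differs only in fieldNorm=0.375
weight_1157 = field_weight(1.0, 7, 44421, 0.375)

print(f"idf={idf_author:.6f} doc1081={weight_1081:.6f} doc1157={weight_1157:.6f}")
```

Because all five entries match the same single term with the same idf, their ranking is decided entirely by fieldNorm: entries whose author field holds fewer names get a higher norm and therefore a higher score.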
    

Similar documents (content)

  1. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.32
    0.31905344 = sum of:
      0.31905344 = product of:
        1.1394765 = sum of:
          0.045305658 = weight(abstract_txt:stemming in 5395) [ClassicSimilarity], result of:
            0.045305658 = score(doc=5395,freq=2.0), product of:
              0.07859951 = queryWeight, product of:
                1.0631733 = boost
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.00991942 = queryNorm
              0.5764114 = fieldWeight in 5395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.07259085 = weight(abstract_txt:finnish in 5395) [ClassicSimilarity], result of:
            0.07259085 = score(doc=5395,freq=4.0), product of:
              0.08542064 = queryWeight, product of:
                1.1083465 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.00991942 = queryNorm
              0.8498046 = fieldWeight in 5395, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.009482466 = weight(abstract_txt:most in 5395) [ClassicSimilarity], result of:
            0.009482466 = score(doc=5395,freq=1.0), product of:
              0.04398309 = queryWeight, product of:
                1.1247396 = boost
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.00991942 = queryNorm
              0.21559344 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.0331703 = weight(abstract_txt:choice in 5395) [ClassicSimilarity], result of:
            0.0331703 = score(doc=5395,freq=1.0), product of:
              0.10135329 = queryWeight, product of:
                1.7073716 = boost
                5.98444 = idf(docFreq=303, maxDocs=44421)
                0.00991942 = queryNorm
              0.32727405 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.98444 = idf(docFreq=303, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.01808312 = weight(abstract_txt:using in 5395) [ClassicSimilarity], result of:
            0.01808312 = score(doc=5395,freq=2.0), product of:
              0.067637436 = queryWeight, product of:
                1.9725031 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.00991942 = queryNorm
              0.2673537 = fieldWeight in 5395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.03853367 = weight(abstract_txt:methods in 5395) [ClassicSimilarity], result of:
            0.03853367 = score(doc=5395,freq=1.0), product of:
              0.17005438 = queryWeight, product of:
                4.137491 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.00991942 = queryNorm
              0.22659616 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.9223104 = weight(abstract_txt:lemmatization in 5395) [ClassicSimilarity], result of:
            0.9223104 = score(doc=5395,freq=6.0), product of:
              0.69478834 = queryWeight, product of:
                7.0681443 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.00991942 = queryNorm
              1.3274696 = fieldWeight in 5395, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
        0.28 = coord(7/25)
    
  2. Airio, E.; Kettunen, K.: Does dictionary based bilingual retrieval work in a non-normalized index? (2009) 0.16
    0.16019857 = sum of:
      0.16019857 = product of:
        0.66749406 = sum of:
          0.0366125 = weight(abstract_txt:stemming in 224) [ClassicSimilarity], result of:
            0.0366125 = score(doc=224,freq=1.0), product of:
              0.07859951 = queryWeight, product of:
                1.0631733 = boost
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.00991942 = queryNorm
              0.46581078 = fieldWeight in 224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
          0.08296097 = weight(abstract_txt:finnish in 224) [ClassicSimilarity], result of:
            0.08296097 = score(doc=224,freq=4.0), product of:
              0.08542064 = queryWeight, product of:
                1.1083465 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.00991942 = queryNorm
              0.97120523 = fieldWeight in 224, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
          0.08491814 = weight(abstract_txt:swedish in 224) [ClassicSimilarity], result of:
            0.08491814 = score(doc=224,freq=4.0), product of:
              0.086758874 = queryWeight, product of:
                1.1169946 = boost
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.00991942 = queryNorm
              0.9787833 = fieldWeight in 224, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
          0.010837104 = weight(abstract_txt:most in 224) [ClassicSimilarity], result of:
            0.010837104 = score(doc=224,freq=1.0), product of:
              0.04398309 = queryWeight, product of:
                1.1247396 = boost
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.00991942 = queryNorm
              0.2463925 = fieldWeight in 224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.94228 = idf(docFreq=2342, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
          0.021843517 = weight(abstract_txt:indexing in 224) [ClassicSimilarity], result of:
            0.021843517 = score(doc=224,freq=1.0), product of:
              0.080338255 = queryWeight, product of:
                1.8617268 = boost
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.00991942 = queryNorm
              0.27189434 = fieldWeight in 224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
          0.43032184 = weight(abstract_txt:lemmatization in 224) [ClassicSimilarity], result of:
            0.43032184 = score(doc=224,freq=1.0), product of:
              0.69478834 = queryWeight, product of:
                7.0681443 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.00991942 = queryNorm
              0.61935675 = fieldWeight in 224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0625 = fieldNorm(doc=224)
        0.24 = coord(6/25)
    
  3. Hahn, J.: Semi-automated methods for BIBFRAME work entity description (2021) 0.14
    0.13625176 = sum of:
      0.13625176 = product of:
        0.6812588 = sum of:
          0.03364188 = weight(abstract_txt:subject in 1726) [ClassicSimilarity], result of:
            0.03364188 = score(doc=1726,freq=2.0), product of:
              0.064896815 = queryWeight, product of:
                1.6732717 = boost
                3.9099448 = idf(docFreq=2419, maxDocs=44421)
                0.00991942 = queryNorm
              0.5183903 = fieldWeight in 1726, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9099448 = idf(docFreq=2419, maxDocs=44421)
                0.09375 = fieldNorm(doc=1726)
          0.032765273 = weight(abstract_txt:indexing in 1726) [ClassicSimilarity], result of:
            0.032765273 = score(doc=1726,freq=1.0), product of:
              0.080338255 = queryWeight, product of:
                1.8617268 = boost
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.00991942 = queryNorm
              0.4078415 = fieldWeight in 1726, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.09375 = fieldNorm(doc=1726)
          0.16150428 = weight(abstract_txt:automated in 1726) [ClassicSimilarity], result of:
            0.16150428 = score(doc=1726,freq=3.0), product of:
              0.17757222 = queryWeight, product of:
                3.1960359 = boost
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.00991942 = queryNorm
              0.90951324 = fieldWeight in 1726, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.6011486 = idf(docFreq=445, maxDocs=44421)
                0.09375 = fieldNorm(doc=1726)
          0.06605772 = weight(abstract_txt:methods in 1726) [ClassicSimilarity], result of:
            0.06605772 = score(doc=1726,freq=1.0), product of:
              0.17005438 = queryWeight, product of:
                4.137491 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.00991942 = queryNorm
              0.38845056 = fieldWeight in 1726, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.09375 = fieldNorm(doc=1726)
          0.38728967 = weight(abstract_txt:annif in 1726) [ClassicSimilarity], result of:
            0.38728967 = score(doc=1726,freq=1.0), product of:
              0.416873 = queryWeight, product of:
                4.2408867 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.00991942 = queryNorm
              0.9290351 = fieldWeight in 1726, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.09375 = fieldNorm(doc=1726)
        0.2 = coord(5/25)
    
  4. Ahlgren, P.; Kekäläinen, J.: Indexing strategies for Swedish full text retrieval under different user scenarios (2007) 0.13
    0.1255338 = sum of:
      0.1255338 = product of:
        0.5230575 = sum of:
          0.06004619 = weight(abstract_txt:swedish in 1896) [ClassicSimilarity], result of:
            0.06004619 = score(doc=1896,freq=2.0), product of:
              0.086758874 = queryWeight, product of:
                1.1169946 = boost
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.00991942 = queryNorm
              0.6921043 = fieldWeight in 1896, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
          0.06923849 = weight(abstract_txt:analyzer in 1896) [ClassicSimilarity], result of:
            0.06923849 = score(doc=1896,freq=1.0), product of:
              0.12019839 = queryWeight, product of:
                1.3147509 = boost
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.00991942 = queryNorm
              0.5760351 = fieldWeight in 1896, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
          0.043687034 = weight(abstract_txt:indexing in 1896) [ClassicSimilarity], result of:
            0.043687034 = score(doc=1896,freq=4.0), product of:
              0.080338255 = queryWeight, product of:
                1.8617268 = boost
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.00991942 = queryNorm
              0.5437887 = fieldWeight in 1896, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
          0.15471347 = weight(abstract_txt:normalization in 1896) [ClassicSimilarity], result of:
            0.15471347 = score(doc=1896,freq=4.0), product of:
              0.16305809 = queryWeight, product of:
                2.1656103 = boost
                7.590594 = idf(docFreq=60, maxDocs=44421)
                0.00991942 = queryNorm
              0.9488242 = fieldWeight in 1896, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.590594 = idf(docFreq=60, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
          0.14868926 = weight(abstract_txt:baseline in 1896) [ClassicSimilarity], result of:
            0.14868926 = score(doc=1896,freq=3.0), product of:
              0.20007217 = queryWeight, product of:
                2.9379752 = boost
                6.8651857 = idf(docFreq=125, maxDocs=44421)
                0.00991942 = queryNorm
              0.7431781 = fieldWeight in 1896, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8651857 = idf(docFreq=125, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
          0.04668307 = weight(abstract_txt:text in 1896) [ClassicSimilarity], result of:
            0.04668307 = score(doc=1896,freq=1.0), product of:
              0.18484308 = queryWeight, product of:
                4.611484 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.00991942 = queryNorm
              0.25255513 = fieldWeight in 1896, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=1896)
        0.24 = coord(6/25)
    
  5. Galvez, C.; Moya-Anegón, F. de: An evaluation of conflation accuracy using finite-state transducers (2006) 0.12
    0.11967452 = sum of:
      0.11967452 = product of:
        0.74796575 = sum of:
          0.018266708 = weight(abstract_txt:using in 599) [ClassicSimilarity], result of:
            0.018266708 = score(doc=599,freq=1.0), product of:
              0.067637436 = queryWeight, product of:
                1.9725031 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.00991942 = queryNorm
              0.27006802 = fieldWeight in 599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.078125 = fieldNorm(doc=599)
          0.13674866 = weight(abstract_txt:normalization in 599) [ClassicSimilarity], result of:
            0.13674866 = score(doc=599,freq=2.0), product of:
              0.16305809 = queryWeight, product of:
                2.1656103 = boost
                7.590594 = idf(docFreq=60, maxDocs=44421)
                0.00991942 = queryNorm
              0.83865 = fieldWeight in 599, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.590594 = idf(docFreq=60, maxDocs=44421)
                0.078125 = fieldNorm(doc=599)
          0.055048097 = weight(abstract_txt:methods in 599) [ClassicSimilarity], result of:
            0.055048097 = score(doc=599,freq=1.0), product of:
              0.17005438 = queryWeight, product of:
                4.137491 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.00991942 = queryNorm
              0.3237088 = fieldWeight in 599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.078125 = fieldNorm(doc=599)
          0.5379023 = weight(abstract_txt:lemmatization in 599) [ClassicSimilarity], result of:
            0.5379023 = score(doc=599,freq=1.0), product of:
              0.69478834 = queryWeight, product of:
                7.0681443 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.00991942 = queryNorm
              0.7741959 = fieldWeight in 599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.078125 = fieldNorm(doc=599)
        0.16 = coord(4/25)
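
The content-similarity scores combine more factors than the author scores. As a rough sketch under the same assumed TF-IDF formulas, each matching term contributes queryWeight * fieldWeight, where queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm. The contributions are summed and multiplied by coord, the fraction of the 25 query terms that matched. The helper names below are illustrative, not Lucene API; the numbers are copied from entry 1 (doc=5395).

```python
import math

QUERY_NORM = 0.00991942  # queryNorm as printed in the explain output
MAX_DOCS = 44421


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """idf = 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, freq: float, doc_freq: int, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, boost, freq, docFreq) for doc=5395, whose fieldNorm is 0.0546875
FIELD_NORM_5395 = 0.0546875
terms_5395 = [
    ("stemming", 1.0631733, 2.0, 69),
    ("finnish", 1.1083465, 4.0, 50),
    ("most", 1.1247396, 1.0, 2342),
    ("choice", 1.7073716, 1.0, 303),
    ("using", 1.9725031, 2.0, 3806),
    ("methods", 4.137491, 1.0, 1915),
    ("lemmatization", 7.0681443, 6.0, 5),
]

term_sum = sum(term_score(b, f, df, FIELD_NORM_5395) for _, b, f, df in terms_5395)
coord = 7 / 25  # 7 of the 25 query terms matched
final_score = term_sum * coord

print(f"sum={term_sum:.6f} coord={coord:.2f} score={final_score:.6f}")
```

Note that idf enters each term score twice, once through queryWeight and once through fieldWeight, which is why the rare term "lemmatization" (docFreq=5) dominates the total for this entry.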