Document (#39911)

Author
Snajder, J.
Dalbelo Basic, B.D.
Tadic, M.
Title
Automatic acquisition of inflectional lexica for morphological normalisation
Source
Information processing and management. 44(2008) no.5, pp.1720-1731
Year
2008
Abstract
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
Theme
Computational linguistics
Automatic indexing

Similar documents (content)

  1. Malenica, M.; Smuc, T.; Snajder, J.; Basic, B.D.: Language morphology offset : text classification on a Croatian-English parallel corpus (2008) 0.36
    0.36081067 = sum of:
      0.36081067 = product of:
        1.8040533 = sum of:
          0.015010662 = weight(abstract_txt:language in 3035) [ClassicSimilarity], result of:
            0.015010662 = score(doc=3035,freq=1.0), product of:
              0.046064984 = queryWeight, product of:
                1.2424006 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.008889357 = queryNorm
              0.3258584 = fieldWeight in 3035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=3035)
          0.121449806 = weight(abstract_txt:croatian in 3035) [ClassicSimilarity], result of:
            0.121449806 = score(doc=3035,freq=2.0), product of:
              0.11695413 = queryWeight, product of:
                1.3998097 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.008889357 = queryNorm
              1.0384396 = fieldWeight in 3035, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.078125 = fieldNorm(doc=3035)
          0.057828963 = weight(abstract_txt:languages in 3035) [ClassicSimilarity], result of:
            0.057828963 = score(doc=3035,freq=4.0), product of:
              0.071315065 = queryWeight, product of:
                1.5458483 = boost
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.008889357 = queryNorm
              0.8108941 = fieldWeight in 3035, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.078125 = fieldNorm(doc=3035)
          0.7690177 = weight(abstract_txt:normalisation in 3035) [ClassicSimilarity], result of:
            0.7690177 = score(doc=3035,freq=3.0), product of:
              0.5979545 = queryWeight, product of:
                7.0775065 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.008889357 = queryNorm
              1.2860806 = fieldWeight in 3035, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.078125 = fieldNorm(doc=3035)
          0.84074616 = weight(abstract_txt:morphological in 3035) [ClassicSimilarity], result of:
            0.84074616 = score(doc=3035,freq=5.0), product of:
              0.59875196 = queryWeight, product of:
                8.379801 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.008889357 = queryNorm
              1.4041643 = fieldWeight in 3035, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=3035)
        0.2 = coord(5/25)
    
  2. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.27
    0.26513264 = sum of:
      0.26513264 = product of:
        0.8285395 = sum of:
          0.007748682 = weight(abstract_txt:used in 5395) [ClassicSimilarity], result of:
            0.007748682 = score(doc=5395,freq=2.0), product of:
              0.029843349 = queryWeight, product of:
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.008889357 = queryNorm
              0.2596452 = fieldWeight in 5395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.04238915 = weight(abstract_txt:stemming in 5395) [ClassicSimilarity], result of:
            0.04238915 = score(doc=5395,freq=2.0), product of:
              0.07353975 = queryWeight, product of:
                1.1099982 = boost
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.008889357 = queryNorm
              0.5764114 = fieldWeight in 5395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.021014927 = weight(abstract_txt:language in 5395) [ClassicSimilarity], result of:
            0.021014927 = score(doc=5395,freq=4.0), product of:
              0.046064984 = queryWeight, product of:
                1.2424006 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.008889357 = queryNorm
              0.45620176 = fieldWeight in 5395, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.058304276 = weight(abstract_txt:morphologically in 5395) [ClassicSimilarity], result of:
            0.058304276 = score(doc=5395,freq=1.0), product of:
              0.11459419 = queryWeight, product of:
                1.3856149 = boost
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.008889357 = queryNorm
              0.5087891 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.01898402 = weight(abstract_txt:complex in 5395) [ClassicSimilarity], result of:
            0.01898402 = score(doc=5395,freq=1.0), product of:
              0.0683331 = queryWeight, product of:
                1.5131841 = boost
                5.080062 = idf(docFreq=750, maxDocs=44421)
                0.008889357 = queryNorm
              0.27781588 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.080062 = idf(docFreq=750, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.020240137 = weight(abstract_txt:languages in 5395) [ClassicSimilarity], result of:
            0.020240137 = score(doc=5395,freq=1.0), product of:
              0.071315065 = queryWeight, product of:
                1.5458483 = boost
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.008889357 = queryNorm
              0.28381294 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.015164371 = weight(abstract_txt:approach in 5395) [ClassicSimilarity], result of:
            0.015164371 = score(doc=5395,freq=1.0), product of:
              0.074119404 = queryWeight, product of:
                2.2287285 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.008889357 = queryNorm
              0.20459381 = fieldWeight in 5395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
          0.6446939 = weight(abstract_txt:morphological in 5395) [ClassicSimilarity], result of:
            0.6446939 = score(doc=5395,freq=6.0), product of:
              0.59875196 = queryWeight, product of:
                8.379801 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.008889357 = queryNorm
              1.0767295 = fieldWeight in 5395, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5395)
        0.32 = coord(8/25)
    
  3. Pirkola, A.: Morphological typology of languages for IR (2001) 0.25
    0.24669719 = sum of:
      0.24669719 = product of:
        1.027905 = sum of:
          0.007827352 = weight(abstract_txt:used in 5476) [ClassicSimilarity], result of:
            0.007827352 = score(doc=5476,freq=1.0), product of:
              0.029843349 = queryWeight, product of:
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.008889357 = queryNorm
              0.26228127 = fieldWeight in 5476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
          0.04281951 = weight(abstract_txt:stemming in 5476) [ClassicSimilarity], result of:
            0.04281951 = score(doc=5476,freq=1.0), product of:
              0.07353975 = queryWeight, product of:
                1.1099982 = boost
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.008889357 = queryNorm
              0.58226347 = fieldWeight in 5476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
          0.021228282 = weight(abstract_txt:language in 5476) [ClassicSimilarity], result of:
            0.021228282 = score(doc=5476,freq=2.0), product of:
              0.046064984 = queryWeight, product of:
                1.2424006 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.008889357 = queryNorm
              0.46083337 = fieldWeight in 5476, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
          0.05008135 = weight(abstract_txt:languages in 5476) [ClassicSimilarity], result of:
            0.05008135 = score(doc=5476,freq=3.0), product of:
              0.071315065 = queryWeight, product of:
                1.5458483 = boost
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.008889357 = queryNorm
              0.70225483 = fieldWeight in 5476, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
          0.15396227 = weight(abstract_txt:morphology in 5476) [ClassicSimilarity], result of:
            0.15396227 = score(doc=5476,freq=1.0), product of:
              0.2174606 = queryWeight, product of:
                2.6993954 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.008889357 = queryNorm
              0.7080008 = fieldWeight in 5476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
          0.75198627 = weight(abstract_txt:morphological in 5476) [ClassicSimilarity], result of:
            0.75198627 = score(doc=5476,freq=4.0), product of:
              0.59875196 = queryWeight, product of:
                8.379801 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.008889357 = queryNorm
              1.2559228 = fieldWeight in 5476, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=5476)
        0.24 = coord(6/25)
    
  4. Ekmekcioglu, F.C.; Lynch, M.F.; Willett, P.: Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases (1995) 0.19
    0.19274692 = sum of:
      0.19274692 = product of:
        0.96373457 = sum of:
          0.042757344 = weight(abstract_txt:corpora in 5865) [ClassicSimilarity], result of:
            0.042757344 = score(doc=5865,freq=1.0), product of:
              0.06505999 = queryWeight, product of:
                1.0440426 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.008889357 = queryNorm
              0.6571987 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.072667114 = weight(abstract_txt:stemming in 5865) [ClassicSimilarity], result of:
            0.072667114 = score(doc=5865,freq=2.0), product of:
              0.07353975 = queryWeight, product of:
                1.1099982 = boost
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.008889357 = queryNorm
              0.98813385 = fieldWeight in 5865, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4529724 = idf(docFreq=69, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.025473941 = weight(abstract_txt:language in 5865) [ClassicSimilarity], result of:
            0.025473941 = score(doc=5865,freq=2.0), product of:
              0.046064984 = queryWeight, product of:
                1.2424006 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.008889357 = queryNorm
              0.5530001 = fieldWeight in 5865, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.18475474 = weight(abstract_txt:morphology in 5865) [ClassicSimilarity], result of:
            0.18475474 = score(doc=5865,freq=1.0), product of:
              0.2174606 = queryWeight, product of:
                2.6993954 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.008889357 = queryNorm
              0.849601 = fieldWeight in 5865, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
          0.63808143 = weight(abstract_txt:morphological in 5865) [ClassicSimilarity], result of:
            0.63808143 = score(doc=5865,freq=2.0), product of:
              0.59875196 = queryWeight, product of:
                8.379801 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.008889357 = queryNorm
              1.0656857 = fieldWeight in 5865, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.09375 = fieldNorm(doc=5865)
        0.2 = coord(5/25)
    
  5. Kettunen, K.: Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval : an overview (2009) 0.16
    0.15856947 = sum of:
      0.15856947 = product of:
        0.66070616 = sum of:
          0.007827352 = weight(abstract_txt:used in 3835) [ClassicSimilarity], result of:
            0.007827352 = score(doc=3835,freq=1.0), product of:
              0.029843349 = queryWeight, product of:
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.008889357 = queryNorm
              0.26228127 = fieldWeight in 3835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3572001 = idf(docFreq=4205, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
          0.043578934 = weight(abstract_txt:variants in 3835) [ClassicSimilarity], result of:
            0.043578934 = score(doc=3835,freq=1.0), product of:
              0.074406706 = queryWeight, product of:
                1.116522 = boost
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.008889357 = queryNorm
              0.58568555 = fieldWeight in 3835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
          0.015010662 = weight(abstract_txt:language in 3835) [ClassicSimilarity], result of:
            0.015010662 = score(doc=3835,freq=1.0), product of:
              0.046064984 = queryWeight, product of:
                1.2424006 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.008889357 = queryNorm
              0.3258584 = fieldWeight in 3835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
          0.040891252 = weight(abstract_txt:languages in 3835) [ClassicSimilarity], result of:
            0.040891252 = score(doc=3835,freq=2.0), product of:
              0.071315065 = queryWeight, product of:
                1.5458483 = boost
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.008889357 = queryNorm
              0.5733887 = fieldWeight in 3835, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.189722 = idf(docFreq=672, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
          0.021663386 = weight(abstract_txt:approach in 3835) [ClassicSimilarity], result of:
            0.021663386 = score(doc=3835,freq=1.0), product of:
              0.074119404 = queryWeight, product of:
                2.2287285 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.008889357 = queryNorm
              0.29227686 = fieldWeight in 3835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
          0.5317346 = weight(abstract_txt:morphological in 3835) [ClassicSimilarity], result of:
            0.5317346 = score(doc=3835,freq=2.0), product of:
              0.59875196 = queryWeight, product of:
                8.379801 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.008889357 = queryNorm
              0.88807154 = fieldWeight in 3835, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=3835)
        0.24 = coord(6/25)
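
The score trees above all follow the same arithmetic: each matching term contributes queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), where tf is the square root of termFreq, and the sum of the term scores is multiplied by the coord factor. The following is a minimal sketch that recomputes the score of the first result (doc 3035) from the values listed in its tree. The names `term_score`, `document_score`, and `TERMS` are illustrative assumptions, not part of the search engine's API, and Python's double-precision floats may differ from the listed single-precision values in the last digits.

```python
import math

# Constants copied from the explain tree of result 1 (doc 3035).
QUERY_NORM = 0.008889357
FIELD_NORM = 0.078125
COORD = 5 / 25  # 5 of the 25 query terms matched

# (term, boost, idf, termFreq) as listed for doc 3035.
TERMS = [
    ("language", 1.2424006, 4.1709876, 1.0),
    ("croatian", 1.3998097, 9.398883, 2.0),
    ("languages", 1.5458483, 5.189722, 4.0),
    ("normalisation", 7.0775065, 9.504243, 3.0),
    ("morphological", 8.379801, 8.037906, 5.0),
]


def term_score(boost, idf, freq, query_norm=QUERY_NORM, field_norm=FIELD_NORM):
    """Score of one term: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm          # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score(terms, coord=COORD):
    """Sum of the term scores, scaled by the coord factor."""
    return coord * sum(term_score(b, i, f) for _, b, i, f in terms)


if __name__ == "__main__":
    for name, boost, idf, freq in TERMS:
        print(f"{name:15s} {term_score(boost, idf, freq):.6f}")
    print(f"{'document score':15s} {document_score(TERMS):.6f}")
```

Terms whose tree shows no boost line (for example "used" in doc 5395) behave as if the boost were 1.0, since their listed queryWeight equals idf x queryNorm. The other results can be checked the same way by swapping in their own fieldNorm, coord, and term values.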