Document (#40787)

Author
Kauchak, D.
Leroy, G.
Hogue, A.
Title
Measuring text difficulty using parse-tree frequency
Source
Journal of the Association for Information Science and Technology. 68(2017) no.9, pp. 2088-2100
Year
2017
Abstract
Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
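The grammar-familiarity idea in the abstract can be pictured with a small sketch: reduce each sentence's parse tree to its top three levels of labels, then count how often each reduced structure occurs. This is not the authors' pipeline. The Penn-style bracketed input, the helper names `parse_brackets` and `truncate`, the toy sentences, and the reading of "3rd level" as "labels down to depth three from the root" are all assumptions made for illustration.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                pos += 1  # skip the word itself; only the labels matter here
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def truncate(tree, depth):
    """Render the tree's labels down to `depth` levels, dropping everything below."""
    label, children = tree
    if depth == 1 or not children:
        return label
    inner = " ".join(truncate(child, depth - 1) for child in children)
    return "(" + label + " " + inner + ")"


# Toy parses (hypothetical data): the first two differ in wording only.
sentences = [
    "(S (NP (DT The) (NN cat)) (VP (VBD sat)))",
    "(S (NP (DT A) (NN dog)) (VP (VBD ran)))",
    "(S (VP (VB Go)))",
]

# Frequency of each depth-3 structure across the toy corpus.
counts = Counter(truncate(parse_brackets(s), 3) for s in sentences)
```

Here the first two sentences share the structure `(S (NP DT NN) (VP VBD))`, so that structure gets a count of 2 even though the surface words differ. How the counts are then grouped into the 11 frequency bins is not specified in the abstract and is left out of the sketch.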
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23855/full.
Theme
Computational linguistics

Similar documents (author)

  1. Leroy, G.; Chen, H.: Genescene: an ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts (2005) 4.88
    4.8777785 = sum of:
      4.8777785 = weight(author_txt:leroy in 259) [ClassicSimilarity], result of:
        4.8777785 = fieldWeight in 259, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.755557 = idf(docFreq=6, maxDocs=44421)
          0.5 = fieldNorm(doc=259)
    
  2. Leroy, S.Y.; Thomas, S.L.: Impact of Web access on cataloging (2004) 4.88
    4.8777785 = sum of:
      4.8777785 = weight(author_txt:leroy in 656) [ClassicSimilarity], result of:
        4.8777785 = fieldWeight in 656, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.755557 = idf(docFreq=6, maxDocs=44421)
          0.5 = fieldNorm(doc=656)
    
  3. Ku, C.-H.; Leroy, G.: A crime reports analysis system to identify related crimes (2011) 4.27
    4.2680564 = sum of:
      4.2680564 = weight(author_txt:leroy in 629) [ClassicSimilarity], result of:
        4.2680564 = fieldWeight in 629, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.755557 = idf(docFreq=6, maxDocs=44421)
          0.4375 = fieldNorm(doc=629)
    
  4. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 3.05
    3.0486116 = sum of:
      3.0486116 = weight(author_txt:leroy in 2998) [ClassicSimilarity], result of:
        3.0486116 = fieldWeight in 2998, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.755557 = idf(docFreq=6, maxDocs=44421)
          0.3125 = fieldNorm(doc=2998)
    
  5. Thirion, B.; Leroy, J.P.; Baudic, F.; Douyère, M.; Piot, J.; Darmoni, S.J.: SDI selecting, describing, and indexing : did you mean automatically? (2001) 2.44
    2.4388893 = sum of:
      2.4388893 = weight(author_txt:leroy in 198) [ClassicSimilarity], result of:
        2.4388893 = fieldWeight in 198, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.755557 = idf(docFreq=6, maxDocs=44421)
          0.25 = fieldNorm(doc=198)
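
The author scores above all follow the same pattern: fieldWeight = tf × idf × fieldNorm. A minimal sketch that reproduces the listed numbers, assuming the classic TF-IDF definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); these formulas are inferred from the values shown rather than stated in the record, and the function names are illustrative. The engine works in single precision, so results agree only to about five decimal places.

```python
import math


def classic_tf(freq):
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, in the form that matches the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the first author match (doc 259).
score_259 = field_weight(freq=1.0, doc_freq=6, max_docs=44421, field_norm=0.5)
```

Because tf and idf are identical for all five author matches, the ranking is decided entirely by fieldNorm (0.5, 0.5, 0.4375, 0.3125, 0.25).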
    

Similar documents (content)

  1. Fang, L.; Tuan, L.A.; Hui, S.C.; Wu, L.: Syntactic based approach for grammar question retrieval (2018) 0.23
    0.2317591 = sum of:
      0.2317591 = product of:
        1.1587955 = sum of:
          0.015701395 = weight(abstract_txt:with in 86) [ClassicSimilarity], result of:
            0.015701395 = score(doc=86,freq=4.0), product of:
              0.050322168 = queryWeight, product of:
                1.3015063 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015489741 = queryNorm
              0.31201747 = fieldWeight in 86, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=86)
          0.11237944 = weight(abstract_txt:tree in 86) [ClassicSimilarity], result of:
            0.11237944 = score(doc=86,freq=3.0), product of:
              0.15156235 = queryWeight, product of:
                1.4285395 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.015489741 = queryNorm
              0.7414733 = fieldWeight in 86, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=86)
          0.097323455 = weight(abstract_txt:sentence in 86) [ClassicSimilarity], result of:
            0.097323455 = score(doc=86,freq=1.0), product of:
              0.22734353 = queryWeight, product of:
                2.1428094 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.015489741 = queryNorm
              0.42808983 = fieldWeight in 86, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=86)
          0.4000881 = weight(abstract_txt:parse in 86) [ClassicSimilarity], result of:
            0.4000881 = score(doc=86,freq=3.0), product of:
              0.40451464 = queryWeight, product of:
                2.8583102 = boost
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.015489741 = queryNorm
              0.9890571 = fieldWeight in 86, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.1365185 = idf(docFreq=12, maxDocs=44421)
                0.0625 = fieldNorm(doc=86)
          0.5333031 = weight(abstract_txt:grammar in 86) [ClassicSimilarity], result of:
            0.5333031 = score(doc=86,freq=9.0), product of:
              0.37389734 = queryWeight, product of:
                3.1731296 = boost
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.015489741 = queryNorm
              1.4263356 = fieldWeight in 86, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.0625 = fieldNorm(doc=86)
        0.2 = coord(5/25)
    
  2. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.16
    0.16098085 = sum of:
      0.16098085 = product of:
        0.5749316 = sum of:
          0.022256117 = weight(abstract_txt:term in 3557) [ClassicSimilarity], result of:
            0.022256117 = score(doc=3557,freq=1.0), product of:
              0.07426886 = queryWeight, product of:
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.015489741 = queryNorm
              0.29966956 = fieldWeight in 3557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.017693643 = weight(abstract_txt:using in 3557) [ClassicSimilarity], result of:
            0.017693643 = score(doc=3557,freq=2.0), product of:
              0.057908073 = queryWeight, product of:
                1.0814633 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015489741 = queryNorm
              0.3055471 = fieldWeight in 3557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.03243693 = weight(abstract_txt:measure in 3557) [ClassicSimilarity], result of:
            0.03243693 = score(doc=3557,freq=1.0), product of:
              0.09547002 = queryWeight, product of:
                1.1337835 = boost
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.015489741 = queryNorm
              0.3397604 = fieldWeight in 3557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.007850697 = weight(abstract_txt:with in 3557) [ClassicSimilarity], result of:
            0.007850697 = score(doc=3557,freq=1.0), product of:
              0.050322168 = queryWeight, product of:
                1.3015063 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015489741 = queryNorm
              0.15600874 = fieldWeight in 3557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.097323455 = weight(abstract_txt:sentence in 3557) [ClassicSimilarity], result of:
            0.097323455 = score(doc=3557,freq=1.0), product of:
              0.22734353 = queryWeight, product of:
                2.1428094 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.015489741 = queryNorm
              0.42808983 = fieldWeight in 3557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.12023071 = weight(abstract_txt:frequency in 3557) [ClassicSimilarity], result of:
            0.12023071 = score(doc=3557,freq=2.0), product of:
              0.2286568 = queryWeight, product of:
                2.4814394 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.015489741 = queryNorm
              0.525813 = fieldWeight in 3557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
          0.2771401 = weight(abstract_txt:sentences in 3557) [ClassicSimilarity], result of:
            0.2771401 = score(doc=3557,freq=4.0), product of:
              0.3166869 = queryWeight, product of:
                2.9202945 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.015489741 = queryNorm
              0.8751234 = fieldWeight in 3557, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=3557)
        0.28 = coord(7/25)
    
  3. Doko, A.; Stula, M.; Seric, L.: Improved sentence retrieval using local context and sentence length (2013) 0.14
    0.14373921 = sum of:
      0.14373921 = product of:
        0.718696 = sum of:
          0.027820146 = weight(abstract_txt:term in 3705) [ClassicSimilarity], result of:
            0.027820146 = score(doc=3705,freq=1.0), product of:
              0.07426886 = queryWeight, product of:
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.015489741 = queryNorm
              0.37458694 = fieldWeight in 3705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=3705)
          0.022117054 = weight(abstract_txt:using in 3705) [ClassicSimilarity], result of:
            0.022117054 = score(doc=3705,freq=2.0), product of:
              0.057908073 = queryWeight, product of:
                1.0814633 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015489741 = queryNorm
              0.38193387 = fieldWeight in 3705, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.078125 = fieldNorm(doc=3705)
          0.1720452 = weight(abstract_txt:sentence in 3705) [ClassicSimilarity], result of:
            0.1720452 = score(doc=3705,freq=2.0), product of:
              0.22734353 = queryWeight, product of:
                2.1428094 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.015489741 = queryNorm
              0.7567631 = fieldWeight in 3705, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.078125 = fieldNorm(doc=3705)
          0.1502884 = weight(abstract_txt:frequency in 3705) [ClassicSimilarity], result of:
            0.1502884 = score(doc=3705,freq=2.0), product of:
              0.2286568 = queryWeight, product of:
                2.4814394 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.015489741 = queryNorm
              0.65726626 = fieldWeight in 3705, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.078125 = fieldNorm(doc=3705)
          0.34642515 = weight(abstract_txt:sentences in 3705) [ClassicSimilarity], result of:
            0.34642515 = score(doc=3705,freq=4.0), product of:
              0.3166869 = queryWeight, product of:
                2.9202945 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.015489741 = queryNorm
              1.0939043 = fieldWeight in 3705, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.078125 = fieldNorm(doc=3705)
        0.2 = coord(5/25)
    
  4. Mutawa, F.; Alnajem, S.; Alzhouri, F.: An HPSG approach to Arabic nominal sentences (2008) 0.13
    0.13450183 = sum of:
      0.13450183 = product of:
        1.1208487 = sum of:
          0.02502259 = weight(abstract_txt:using in 2368) [ClassicSimilarity], result of:
            0.02502259 = score(doc=2368,freq=1.0), product of:
              0.057908073 = queryWeight, product of:
                1.0814633 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015489741 = queryNorm
              0.43210885 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.125 = fieldNorm(doc=2368)
          0.48002076 = weight(abstract_txt:sentences in 2368) [ClassicSimilarity], result of:
            0.48002076 = score(doc=2368,freq=3.0), product of:
              0.3166869 = queryWeight, product of:
                2.9202945 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.015489741 = queryNorm
              1.5157582 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.125 = fieldNorm(doc=2368)
          0.6158053 = weight(abstract_txt:grammar in 2368) [ClassicSimilarity], result of:
            0.6158053 = score(doc=2368,freq=3.0), product of:
              0.37389734 = queryWeight, product of:
                3.1731296 = boost
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.015489741 = queryNorm
              1.6469904 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.125 = fieldNorm(doc=2368)
        0.12 = coord(3/25)
    
  5. Goh, A.; Hui, S.C.; Chan, S.K.: A text extraction system for news reports (1996) 0.13
    0.12718153 = sum of:
      0.12718153 = product of:
        0.5299231 = sum of:
          0.017693643 = weight(abstract_txt:using in 6669) [ClassicSimilarity], result of:
            0.017693643 = score(doc=6669,freq=2.0), product of:
              0.057908073 = queryWeight, product of:
                1.0814633 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.015489741 = queryNorm
              0.3055471 = fieldWeight in 6669, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
          0.011102563 = weight(abstract_txt:with in 6669) [ClassicSimilarity], result of:
            0.011102563 = score(doc=6669,freq=2.0), product of:
              0.050322168 = queryWeight, product of:
                1.3015063 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.015489741 = queryNorm
              0.22062966 = fieldWeight in 6669, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
          0.051574156 = weight(abstract_txt:measured in 6669) [ClassicSimilarity], result of:
            0.051574156 = score(doc=6669,freq=1.0), product of:
              0.13005546 = queryWeight, product of:
                1.3233079 = boost
                6.3448815 = idf(docFreq=211, maxDocs=44421)
                0.015489741 = queryNorm
              0.3965551 = fieldWeight in 6669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3448815 = idf(docFreq=211, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
          0.16856916 = weight(abstract_txt:sentence in 6669) [ClassicSimilarity], result of:
            0.16856916 = score(doc=6669,freq=3.0), product of:
              0.22734353 = queryWeight, product of:
                2.1428094 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.015489741 = queryNorm
              0.7414733 = fieldWeight in 6669, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
          0.08501595 = weight(abstract_txt:frequency in 6669) [ClassicSimilarity], result of:
            0.08501595 = score(doc=6669,freq=1.0), product of:
              0.2286568 = queryWeight, product of:
                2.4814394 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.015489741 = queryNorm
              0.37180594 = fieldWeight in 6669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
          0.19596764 = weight(abstract_txt:sentences in 6669) [ClassicSimilarity], result of:
            0.19596764 = score(doc=6669,freq=2.0), product of:
              0.3166869 = queryWeight, product of:
                2.9202945 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.015489741 = queryNorm
              0.61880565 = fieldWeight in 6669, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=6669)
        0.24 = coord(6/25)
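
The content scores combine several matching terms: each term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is scaled by coord (matched terms / query terms). The sketch below recomputes the first content match (doc 86) from the numbers printed in its explain tree. The idf formula 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq) are inferred from the listed values, the function and variable names are illustrative, and the result matches only to single-precision rounding.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.015489741  # queryNorm as printed in the explain output
FIELD_NORM = 0.0625       # fieldNorm(doc=86)


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, in the form that matches the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(boost, doc_freq, freq, field_norm):
    """One term's contribution: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, boost, docFreq, termFreq) copied from the explain tree for doc 86.
terms = [
    ("with", 1.3015063, 9949, 4.0),
    ("tree", 1.4285395, 127, 3.0),
    ("sentence", 2.1428094, 127, 1.0),
    ("parse", 2.8583102, 12, 3.0),
    ("grammar", 3.1731296, 59, 9.0),
]

raw_sum = sum(term_score(b, df, f, FIELD_NORM) for _, b, df, f in terms)
coord = 5 / 25  # 5 of the 25 query terms matched
score = raw_sum * coord
```

The rare terms dominate: "grammar" and "parse" together account for most of the raw sum, while the very common "with" adds little despite occurring four times.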