Document (#44370)

Author
Krasakis, A.M.
Yates, A.
Kanoulas, E.
Title
Corpus-informed Retrieval Augmented Generation of Clarifying Questions
Source
https://www.researchgate.net/publication/384447709_Corpus-informed_Retrieval_Augmented_Generation_of_Clarifying_Questions
Year
2024
Abstract
This study aims to develop models that generate corpus-informed clarifying questions for web search, in a way that ensures the questions align with the available information in the retrieval corpus. We demonstrate the effectiveness of Retrieval Augmented Language Models (RAG) in this process, emphasising their ability to (i) jointly model the user query and retrieval corpus to pinpoint the uncertainty and ask for clarifications end-to-end and (ii) model more evidence documents, which can be used towards increasing the breadth of the questions asked. However, we observe that in current datasets search intents are largely unsupported by the corpus, which is problematic both for training and evaluation. This causes question generation models to "hallucinate", i.e., suggest intents that are not in the corpus, which can have detrimental effects on performance. To address this, we propose dataset augmentation methods that align the ground truth clarifications with the retrieval corpus. Additionally, we explore techniques to enhance the relevance of the evidence pool during inference, but find that identifying ground truth intents within the corpus remains challenging. Our analysis suggests that this challenge is partly due to the bias of current datasets towards clarification taxonomies and calls for data that can support generating corpus-informed clarifications.
Theme
Retrieval algorithms

Similar documents (author)

  1. Baeza-Yates, R.A.: Introduction to data structures and algorithms related to information retrieval (1992) 4.38
    4.3785143 = sum of:
      4.3785143 = weight(author_txt:yates in 4082) [ClassicSimilarity], result of:
        4.3785143 = fieldWeight in 4082, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.5 = fieldNorm(doc=4082)
    
  2. Baeza-Yates, R.A.: String searching algorithms (1992) 4.38
    4.3785143 = sum of:
      4.3785143 = weight(author_txt:yates in 4505) [ClassicSimilarity], result of:
        4.3785143 = fieldWeight in 4505, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.5 = fieldNorm(doc=4505)
    
  3. Gill, H.S.; Yates-Mercer, P.: The dissemination of information by local authorities on the World Wide Web (1998) 3.83
    3.8312001 = sum of:
      3.8312001 = weight(author_txt:yates in 4435) [ClassicSimilarity], result of:
        3.8312001 = fieldWeight in 4435, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.4375 = fieldNorm(doc=4435)
    
  4. Baeza-Yates, R.; Navarro, G.: Block addressing indices for approximate text retrieval (2000) 3.83
    3.8312001 = sum of:
      3.8312001 = weight(author_txt:yates in 5295) [ClassicSimilarity], result of:
        3.8312001 = fieldWeight in 5295, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.4375 = fieldNorm(doc=5295)
    
  5. Baeza-Yates, R.; Navarro, G.: XQL and proximal nodes (2002) 3.83
    3.8312001 = sum of:
      3.8312001 = weight(author_txt:yates in 1454) [ClassicSimilarity], result of:
        3.8312001 = fieldWeight in 1454, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.757029 = idf(docFreq=18, maxDocs=44421)
          0.4375 = fieldNorm(doc=1454)
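
The author scores above can be recomputed by hand from the three factors each tree lists. The following is a minimal sketch, assuming the tf and idf definitions that reproduce the figures shown here (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any Lucene API.

```python
import math

def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency, in the form that matches the trees above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as each explain tree states.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values taken from the "author_txt:yates" trees: docFreq=18, maxDocs=44421.
w_half = field_weight(freq=1.0, doc_freq=18, max_docs=44421, field_norm=0.5)
w_lower = field_weight(freq=1.0, doc_freq=18, max_docs=44421, field_norm=0.4375)
print(f"{w_half:.2f} {w_lower:.2f}")
```

Since tf and idf are identical for all five hits, the only factor separating the 4.38 entries from the 3.83 entries is fieldNorm (0.5 versus 0.4375), which is stored per document and is typically smaller for longer author fields.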
    

Similar documents (content)

  1. Zhang, Y.; Zhang, C.: Enhancing keyphrase extraction from microblogs using human reading time (2021) 0.15
    0.15342352 = sum of:
      0.15342352 = product of:
        0.4794485 = sum of:
          0.0072049885 = weight(abstract_txt:which in 1238) [ClassicSimilarity], result of:
            0.0072049885 = score(doc=1238,freq=1.0), product of:
              0.03956354 = queryWeight, product of:
                1.0488585 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.012945537 = queryNorm
              0.18211183 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.009563935 = weight(abstract_txt:this in 1238) [ClassicSimilarity], result of:
            0.009563935 = score(doc=1238,freq=2.0), product of:
              0.044968005 = queryWeight, product of:
                1.4435955 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.012945537 = queryNorm
              0.21268311 = fieldWeight in 1238, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.05451658 = weight(abstract_txt:datasets in 1238) [ClassicSimilarity], result of:
            0.05451658 = score(doc=1238,freq=1.0), product of:
              0.13320737 = queryWeight, product of:
                1.571404 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.012945537 = queryNorm
              0.409261 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.057556726 = weight(abstract_txt:models in 1238) [ClassicSimilarity], result of:
            0.057556726 = score(doc=1238,freq=4.0), product of:
              0.09959794 = queryWeight, product of:
                1.6641579 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.012945537 = queryNorm
              0.57789075 = fieldWeight in 1238, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.065861 = weight(abstract_txt:ground in 1238) [ClassicSimilarity], result of:
            0.065861 = score(doc=1238,freq=1.0), product of:
              0.15109894 = queryWeight, product of:
                1.6736106 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.012945537 = queryNorm
              0.43587998 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.07693081 = weight(abstract_txt:truth in 1238) [ClassicSimilarity], result of:
            0.07693081 = score(doc=1238,freq=1.0), product of:
              0.16758794 = queryWeight, product of:
                1.7625648 = boost
                7.344759 = idf(docFreq=77, maxDocs=44421)
                0.012945537 = queryNorm
              0.45904744 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.344759 = idf(docFreq=77, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.010272717 = weight(abstract_txt:that in 1238) [ClassicSimilarity], result of:
            0.010272717 = score(doc=1238,freq=1.0), product of:
              0.0695002 = queryWeight, product of:
                2.2701092 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.012945537 = queryNorm
              0.14780845 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
          0.19754176 = weight(abstract_txt:corpus in 1238) [ClassicSimilarity], result of:
            0.19754176 = score(doc=1238,freq=1.0), product of:
              0.5188231 = queryWeight, product of:
                6.578693 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.012945537 = queryNorm
              0.38074973 = fieldWeight in 1238, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.0625 = fieldNorm(doc=1238)
        0.32 = coord(8/25)
    
  2. Tsai, C.-F.; McGarry, K.; Tait, J.: Qualitative evaluation of automatic assignment of keywords to images (2006) 0.14
    0.13665488 = sum of:
      0.13665488 = product of:
        0.42704654 = sum of:
          0.015394016 = weight(abstract_txt:current in 1963) [ClassicSimilarity], result of:
            0.015394016 = score(doc=1963,freq=1.0), product of:
              0.057333767 = queryWeight, product of:
                1.0309294 = boost
                4.295972 = idf(docFreq=1644, maxDocs=44421)
                0.012945537 = queryNorm
              0.26849824 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.295972 = idf(docFreq=1644, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.0072049885 = weight(abstract_txt:which in 1963) [ClassicSimilarity], result of:
            0.0072049885 = score(doc=1963,freq=1.0), product of:
              0.03956354 = queryWeight, product of:
                1.0488585 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.012945537 = queryNorm
              0.18211183 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.03432565 = weight(abstract_txt:generation in 1963) [ClassicSimilarity], result of:
            0.03432565 = score(doc=1963,freq=1.0), product of:
              0.09785621 = queryWeight, product of:
                1.3468459 = boost
                5.612423 = idf(docFreq=440, maxDocs=44421)
                0.012945537 = queryNorm
              0.35077643 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.612423 = idf(docFreq=440, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.011713381 = weight(abstract_txt:this in 1963) [ClassicSimilarity], result of:
            0.011713381 = score(doc=1963,freq=3.0), product of:
              0.044968005 = queryWeight, product of:
                1.4435955 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.012945537 = queryNorm
              0.26048255 = fieldWeight in 1963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.14726968 = weight(abstract_txt:ground in 1963) [ClassicSimilarity], result of:
            0.14726968 = score(doc=1963,freq=5.0), product of:
              0.15109894 = queryWeight, product of:
                1.6736106 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.012945537 = queryNorm
              0.9746573 = fieldWeight in 1963, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.17202252 = weight(abstract_txt:truth in 1963) [ClassicSimilarity], result of:
            0.17202252 = score(doc=1963,freq=5.0), product of:
              0.16758794 = queryWeight, product of:
                1.7625648 = boost
                7.344759 = idf(docFreq=77, maxDocs=44421)
                0.012945537 = queryNorm
              1.0264612 = fieldWeight in 1963, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.344759 = idf(docFreq=77, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.02884359 = weight(abstract_txt:retrieval in 1963) [ClassicSimilarity], result of:
            0.02884359 = score(doc=1963,freq=2.0), product of:
              0.09386681 = queryWeight, product of:
                2.08569 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.012945537 = queryNorm
              0.3072821 = fieldWeight in 1963, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
          0.010272717 = weight(abstract_txt:that in 1963) [ClassicSimilarity], result of:
            0.010272717 = score(doc=1963,freq=1.0), product of:
              0.0695002 = queryWeight, product of:
                2.2701092 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.012945537 = queryNorm
              0.14780845 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=1963)
        0.32 = coord(8/25)
    
  3. White, H.: Patrick Wilson (2019) 0.12
    0.11855391 = sum of:
      0.11855391 = product of:
        0.4234068 = sum of:
          0.0072049885 = weight(abstract_txt:which in 314) [ClassicSimilarity], result of:
            0.0072049885 = score(doc=314,freq=1.0), product of:
              0.03956354 = queryWeight, product of:
                1.0488585 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.012945537 = queryNorm
              0.18211183 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.0067627234 = weight(abstract_txt:this in 314) [ClassicSimilarity], result of:
            0.0067627234 = score(doc=314,freq=1.0), product of:
              0.044968005 = queryWeight, product of:
                1.4435955 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.012945537 = queryNorm
              0.15038967 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.020395499 = weight(abstract_txt:retrieval in 314) [ClassicSimilarity], result of:
            0.020395499 = score(doc=314,freq=1.0), product of:
              0.09386681 = queryWeight, product of:
                2.08569 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.012945537 = queryNorm
              0.21728125 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.014527815 = weight(abstract_txt:that in 314) [ClassicSimilarity], result of:
            0.014527815 = score(doc=314,freq=2.0), product of:
              0.0695002 = queryWeight, product of:
                2.2701092 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.012945537 = queryNorm
              0.20903271 = fieldWeight in 314, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.045199748 = weight(abstract_txt:questions in 314) [ClassicSimilarity], result of:
            0.045199748 = score(doc=314,freq=1.0), product of:
              0.14811869 = queryWeight, product of:
                2.343385 = boost
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.012945537 = queryNorm
              0.30515897 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8825436 = idf(docFreq=914, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.08749861 = weight(abstract_txt:informed in 314) [ClassicSimilarity], result of:
            0.08749861 = score(doc=314,freq=1.0), product of:
              0.20902924 = queryWeight, product of:
                2.4108648 = boost
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.012945537 = queryNorm
              0.41859508 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
          0.24181741 = weight(abstract_txt:clarifications in 314) [ClassicSimilarity], result of:
            0.24181741 = score(doc=314,freq=1.0), product of:
              0.41165304 = queryWeight, product of:
                3.3832572 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.012945537 = queryNorm
              0.5874302 = fieldWeight in 314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=314)
        0.28 = coord(7/25)
    
  4. Thelwall, M.; Prabowo, R.: Identifying and characterizing public science-related fears from RSS feeds (2007) 0.11
    0.10659448 = sum of:
      0.10659448 = product of:
        0.5329724 = sum of:
          0.037508532 = weight(abstract_txt:evidence in 1137) [ClassicSimilarity], result of:
            0.037508532 = score(doc=1137,freq=1.0), product of:
              0.089465566 = queryWeight, product of:
                1.2878096 = boost
                5.3664136 = idf(docFreq=563, maxDocs=44421)
                0.012945537 = queryNorm
              0.41925105 = fieldWeight in 1137, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3664136 = idf(docFreq=563, maxDocs=44421)
                0.078125 = fieldNorm(doc=1137)
          0.014641725 = weight(abstract_txt:this in 1137) [ClassicSimilarity], result of:
            0.014641725 = score(doc=1137,freq=3.0), product of:
              0.044968005 = queryWeight, product of:
                1.4435955 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.012945537 = queryNorm
              0.3256032 = fieldWeight in 1137, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.078125 = fieldNorm(doc=1137)
          0.022241082 = weight(abstract_txt:that in 1137) [ClassicSimilarity], result of:
            0.022241082 = score(doc=1137,freq=3.0), product of:
              0.0695002 = queryWeight, product of:
                2.2701092 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.012945537 = queryNorm
              0.32001466 = fieldWeight in 1137, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=1137)
          0.109373264 = weight(abstract_txt:informed in 1137) [ClassicSimilarity], result of:
            0.109373264 = score(doc=1137,freq=1.0), product of:
              0.20902924 = queryWeight, product of:
                2.4108648 = boost
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.012945537 = queryNorm
              0.52324384 = fieldWeight in 1137, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.078125 = fieldNorm(doc=1137)
          0.3492078 = weight(abstract_txt:corpus in 1137) [ClassicSimilarity], result of:
            0.3492078 = score(doc=1137,freq=2.0), product of:
              0.5188231 = queryWeight, product of:
                6.578693 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.012945537 = queryNorm
              0.6730768 = fieldWeight in 1137, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.078125 = fieldNorm(doc=1137)
        0.2 = coord(5/25)
    
  5. Tsujii, J.-I.: Automatic acquisition of semantic collocation from corpora (1995) 0.11
    0.10574745 = sum of:
      0.10574745 = product of:
        0.6609216 = sum of:
          0.014409977 = weight(abstract_txt:which in 4777) [ClassicSimilarity], result of:
            0.014409977 = score(doc=4777,freq=1.0), product of:
              0.03956354 = queryWeight, product of:
                1.0488585 = boost
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.012945537 = queryNorm
              0.36422366 = fieldWeight in 4777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9137893 = idf(docFreq=6552, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.0686513 = weight(abstract_txt:towards in 4777) [ClassicSimilarity], result of:
            0.0686513 = score(doc=4777,freq=1.0), product of:
              0.09785621 = queryWeight, product of:
                1.3468459 = boost
                5.612423 = idf(docFreq=440, maxDocs=44421)
                0.012945537 = queryNorm
              0.70155287 = fieldWeight in 4777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.612423 = idf(docFreq=440, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.01912787 = weight(abstract_txt:this in 4777) [ClassicSimilarity], result of:
            0.01912787 = score(doc=4777,freq=2.0), product of:
              0.044968005 = queryWeight, product of:
                1.4435955 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.012945537 = queryNorm
              0.42536622 = fieldWeight in 4777, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
          0.55873245 = weight(abstract_txt:corpus in 4777) [ClassicSimilarity], result of:
            0.55873245 = score(doc=4777,freq=2.0), product of:
              0.5188231 = queryWeight, product of:
                6.578693 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.012945537 = queryNorm
              1.0769229 = fieldWeight in 4777, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.125 = fieldNorm(doc=4777)
        0.16 = coord(4/25)
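
The content scores combine more factors than the author scores: each matched term contributes queryWeight * fieldWeight, and the sum is scaled by coord, the fraction of the 25 query terms that matched. The sketch below recomputes the last entry (doc 4777) under those assumptions, with the boosts, document frequencies and norms copied from its tree; the helper names are hypothetical.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.012945537   # queryNorm shown in every tree
FIELD_NORM = 0.125         # fieldNorm(doc=4777)
QUERY_TERMS = 25           # denominator of coord(4/25)

# (term, boost, docFreq, freq) for the four terms matched in doc 4777.
MATCHED = [
    ("which", 1.0488585, 6552, 1.0),
    ("towards", 1.3468459, 440, 1.0),
    ("this", 1.4435955, 10885, 2.0),
    ("corpus", 6.578693, 272, 2.0),
]

def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Form that reproduces the idf values printed in the trees.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(boost: float, doc_freq: int, freq: float) -> float:
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM            # query side
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM  # document side
    return query_weight * field_weight

def document_score(matched, query_terms: int = QUERY_TERMS) -> float:
    coord = len(matched) / query_terms
    return coord * sum(term_score(b, df, f) for _, b, df, f in matched)

score = document_score(MATCHED)
print(f"{score:.4f}")
```

Note how the heavily boosted term "corpus" accounts for most of the sum, while the low coord factor (4 of 25 terms matched) pulls the final score down to roughly 0.11.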