Document (#44395)

Author: Williams, S.; Huckle, J.
Title: Easy problems that LLMs get wrong
Source: arXiv:2405.19616v1 [cs.AI] 30 May 2024 [DOI: 10.48550/arXiv.2405.19616]
Year: 2024
Abstract: We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.
Content: Cf.: https://www.researchgate.net/publication/381006169_Easy_Problems_That_LLMs_Get_Wrong.
Theme: Computational linguistics

Similar documents (content)

Hou, Y.; Pascale, A.; Carnerero-Cano, J.; Sattigeri, P.; Tchrakian, T.; Marinescu, R.; Daly, E.; Padhi, I.: WikiContradict : a benchmark for evaluating LLMs on real-world knowledge conflicts from Wikipedia (2024) 0.38

0.3783962 = sum of:
  0.3783962 = product of:
    1.1824882 = sum of:
      0.046020538 = weight(abstract_txt:regarded in 2368) [ClassicSimilarity], result of:
        0.046020538 = score(doc=2368,freq=1.0), product of:
          0.120510116 = queryWeight, product of:
            1.0252309 = boost
            6.982969 = idf(docFreq=111, maxDocs=44421)
            0.016833007 = queryNorm
          0.38188112 = fieldWeight in 2368, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.982969 = idf(docFreq=111, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.07346258 = weight(abstract_txt:benchmark in 2368) [ClassicSimilarity], result of:
        0.07346258 = score(doc=2368,freq=2.0), product of:
          0.13064411 = queryWeight, product of:
            1.0674679 = boost
            7.270651 = idf(docFreq=83, maxDocs=44421)
            0.016833007 = queryNorm
          0.5623107 = fieldWeight in 2368, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            7.270651 = idf(docFreq=83, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.08249299 = weight(abstract_txt:mitigate in 2368) [ClassicSimilarity], result of:
        0.08249299 = score(doc=2368,freq=1.0), product of:
          0.17782812 = queryWeight, product of:
            1.2454036 = boost
            8.482592 = idf(docFreq=24, maxDocs=44421)
            0.016833007 = queryNorm
          0.46389174 = fieldWeight in 2368, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            8.482592 = idf(docFreq=24, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.048030253 = weight(abstract_txt:human in 2368) [ClassicSimilarity], result of:
        0.048030253 = score(doc=2368,freq=3.0), product of:
          0.108318314 = queryWeight, product of:
            1.3745985 = boost
            4.681277 = idf(docFreq=1118, maxDocs=44421)
            0.016833007 = queryNorm
          0.44341767 = fieldWeight in 2368, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            4.681277 = idf(docFreq=1118, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.055000927 = weight(abstract_txt:limitations in 2368) [ClassicSimilarity], result of:
        0.055000927 = score(doc=2368,freq=2.0), product of:
          0.13571745 = queryWeight, product of:
            1.5386604 = boost
            5.2399993 = idf(docFreq=639, maxDocs=44421)
            0.016833007 = queryNorm
          0.40526053 = fieldWeight in 2368, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            5.2399993 = idf(docFreq=639, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.06889143 = weight(abstract_txt:reasoning in 2368) [ClassicSimilarity], result of:
        0.06889143 = score(doc=2368,freq=1.0), product of:
          0.19868992 = queryWeight, product of:
            1.8617135 = boost
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.016833007 = queryNorm
          0.34672835 = fieldWeight in 2368, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.069393754 = weight(abstract_txt:models in 2368) [ClassicSimilarity], result of:
        0.069393754 = score(doc=2368,freq=3.0), product of:
          0.15846595 = queryWeight, product of:
            2.036285 = boost
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.016833007 = queryNorm
          0.43790957 = fieldWeight in 2368, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
      0.73919564 = weight(abstract_txt:llms in 2368) [ClassicSimilarity], result of:
        0.73919564 = score(doc=2368,freq=6.0), product of:
          0.6089084 = queryWeight, product of:
            3.9915955 = boost
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.016833007 = queryNorm
          1.2139685 = fieldWeight in 2368, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.0546875 = fieldNorm(doc=2368)
    0.32 = coord(8/25)
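
The per-term lines in the listing above follow the classic TF-IDF pattern: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), queryWeight is boost * idf * queryNorm, and fieldWeight is tf * idf * fieldNorm. The following is a minimal sketch that recomputes one term score from the inputs shown in the listing. The function names `idf` and `term_score` are illustrative and not part of any library; the search engine works in single precision, so the last digits may differ slightly.

```python
import math


def idf(doc_freq, max_docs):
    # Inverse document frequency as it appears in the listing:
    # idf(docFreq=13, maxDocs=44421) is about 9.06241.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # Hypothetical helper: score of one query term in one document.
    tf = math.sqrt(freq)                          # tf(freq=6.0) is about 2.4494898
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm  # "queryWeight" line
    field_weight = tf * term_idf * field_norm     # "fieldWeight" line
    return query_weight * field_weight


# Inputs copied from the "llms" term of doc 2368 in the listing above.
llms_score = term_score(
    freq=6.0,
    doc_freq=13,
    max_docs=44421,
    boost=3.9915955,
    query_norm=0.016833007,
    field_norm=0.0546875,
)
print(llms_score)  # compare with 0.73919564 in the listing
```

The same function can be applied to any other term line by swapping in its freq, docFreq, boost and fieldNorm values.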

Yang, L.; Chen, H.; Li, Z.; Ding, X.; Wu, X.: Give us the facts : enhancing Large Language Models with knowledge graphs for fact-aware language modeling (2024) 0.19

0.18866225 = sum of:
  0.18866225 = product of:
    0.9433112 = sum of:
      0.051051304 = weight(abstract_txt:humans in 2337) [ClassicSimilarity], result of:
        0.051051304 = score(doc=2337,freq=1.0), product of:
          0.11814054 = queryWeight, product of:
            1.0151013 = boost
            6.9139757 = idf(docFreq=119, maxDocs=44421)
            0.016833007 = queryNorm
          0.43212348 = fieldWeight in 2337, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.9139757 = idf(docFreq=119, maxDocs=44421)
            0.0625 = fieldNorm(doc=2337)
      0.04444746 = weight(abstract_txt:limitations in 2337) [ClassicSimilarity], result of:
        0.04444746 = score(doc=2337,freq=1.0), product of:
          0.13571745 = queryWeight, product of:
            1.5386604 = boost
            5.2399993 = idf(docFreq=639, maxDocs=44421)
            0.016833007 = queryNorm
          0.32749996 = fieldWeight in 2337, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            5.2399993 = idf(docFreq=639, maxDocs=44421)
            0.0625 = fieldNorm(doc=2337)
      0.078733064 = weight(abstract_txt:reasoning in 2337) [ClassicSimilarity], result of:
        0.078733064 = score(doc=2337,freq=1.0), product of:
          0.19868992 = queryWeight, product of:
            1.8617135 = boost
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.016833007 = queryNorm
          0.39626098 = fieldWeight in 2337, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.0625 = fieldNorm(doc=2337)
      0.07930715 = weight(abstract_txt:models in 2337) [ClassicSimilarity], result of:
        0.07930715 = score(doc=2337,freq=3.0), product of:
          0.15846595 = queryWeight, product of:
            2.036285 = boost
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.016833007 = queryNorm
          0.5004681 = fieldWeight in 2337, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.0625 = fieldNorm(doc=2337)
      0.68977225 = weight(abstract_txt:llms in 2337) [ClassicSimilarity], result of:
        0.68977225 = score(doc=2337,freq=4.0), product of:
          0.6089084 = queryWeight, product of:
            3.9915955 = boost
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.016833007 = queryNorm
          1.1328013 = fieldWeight in 2337, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.0625 = fieldNorm(doc=2337)
    0.2 = coord(5/25)

Gao, T.; Yen, H.; Yu, J.; Chen, D.: Enabling large language models to generate text with citations (2023) 0.15

0.14738598 = sum of:
  0.14738598 = product of:
    0.9211624 = sum of:
      0.059366733 = weight(abstract_txt:benchmark in 2295) [ClassicSimilarity], result of:
        0.059366733 = score(doc=2295,freq=1.0), product of:
          0.13064411 = queryWeight, product of:
            1.0674679 = boost
            7.270651 = idf(docFreq=83, maxDocs=44421)
            0.016833007 = queryNorm
          0.45441568 = fieldWeight in 2295, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.270651 = idf(docFreq=83, maxDocs=44421)
            0.0625 = fieldNorm(doc=2295)
      0.044818904 = weight(abstract_txt:human in 2295) [ClassicSimilarity], result of:
        0.044818904 = score(doc=2295,freq=2.0), product of:
          0.108318314 = queryWeight, product of:
            1.3745985 = boost
            4.681277 = idf(docFreq=1118, maxDocs=44421)
            0.016833007 = queryNorm
          0.41377032 = fieldWeight in 2295, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.681277 = idf(docFreq=1118, maxDocs=44421)
            0.0625 = fieldNorm(doc=2295)
      0.045788005 = weight(abstract_txt:models in 2295) [ClassicSimilarity], result of:
        0.045788005 = score(doc=2295,freq=1.0), product of:
          0.15846595 = queryWeight, product of:
            2.036285 = boost
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.016833007 = queryNorm
          0.28894538 = fieldWeight in 2295, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.0625 = fieldNorm(doc=2295)
      0.7711888 = weight(abstract_txt:llms in 2295) [ClassicSimilarity], result of:
        0.7711888 = score(doc=2295,freq=5.0), product of:
          0.6089084 = queryWeight, product of:
            3.9915955 = boost
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.016833007 = queryNorm
          1.2665104 = fieldWeight in 2295, product of:
            2.236068 = tf(freq=5.0), with freq of:
              5.0 = termFreq=5.0
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.0625 = fieldNorm(doc=2295)
    0.16 = coord(4/25)

Luo, L.; Ju, J.; Li, Y.-F.; Haffari, G.; Xiong, B.; Pan, S.: ChatRule: mining logical rules with large language models for knowledge graph reasoning (2023) 0.14

0.14406388 = sum of:
  0.14406388 = product of:
    0.90039927 = sum of:
      0.10191535 = weight(abstract_txt:prompt in 2173) [ClassicSimilarity], result of:
        0.10191535 = score(doc=2173,freq=1.0), product of:
          0.18730706 = queryWeight, product of:
            1.2781652 = boost
            8.705735 = idf(docFreq=19, maxDocs=44421)
            0.016833007 = queryNorm
          0.54410845 = fieldWeight in 2173, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            8.705735 = idf(docFreq=19, maxDocs=44421)
            0.0625 = fieldNorm(doc=2173)
      0.13636966 = weight(abstract_txt:reasoning in 2173) [ClassicSimilarity], result of:
        0.13636966 = score(doc=2173,freq=3.0), product of:
          0.19868992 = queryWeight, product of:
            1.8617135 = boost
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.016833007 = queryNorm
          0.68634415 = fieldWeight in 2173, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.0625 = fieldNorm(doc=2173)
      0.06475402 = weight(abstract_txt:models in 2173) [ClassicSimilarity], result of:
        0.06475402 = score(doc=2173,freq=2.0), product of:
          0.15846595 = queryWeight, product of:
            2.036285 = boost
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.016833007 = queryNorm
          0.40863046 = fieldWeight in 2173, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.0625 = fieldNorm(doc=2173)
      0.59736025 = weight(abstract_txt:llms in 2173) [ClassicSimilarity], result of:
        0.59736025 = score(doc=2173,freq=3.0), product of:
          0.6089084 = queryWeight, product of:
            3.9915955 = boost
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.016833007 = queryNorm
          0.9810347 = fieldWeight in 2173, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.0625 = fieldNorm(doc=2173)
    0.16 = coord(4/25)

Khorashadizadeh, H.; Amara, F.Z.; Ezzabady, M.; Ieng, F.; Tiwari, S.; Mihindukulasooriya, N.; Groppe, J.; Sahri, S.; Benamara, F.; Groppe, S.: Research trends for the interplay between Large Language Models and Knowledge Graphs (2024) 0.12

0.12214399 = sum of:
  0.12214399 = product of:
    1.0178666 = sum of:
      0.09841633 = weight(abstract_txt:reasoning in 2335) [ClassicSimilarity], result of:
        0.09841633 = score(doc=2335,freq=1.0), product of:
          0.19868992 = queryWeight, product of:
            1.8617135 = boost
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.016833007 = queryNorm
          0.49532622 = fieldWeight in 2335, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.3401756 = idf(docFreq=212, maxDocs=44421)
            0.078125 = fieldNorm(doc=2335)
      0.057235006 = weight(abstract_txt:models in 2335) [ClassicSimilarity], result of:
        0.057235006 = score(doc=2335,freq=1.0), product of:
          0.15846595 = queryWeight, product of:
            2.036285 = boost
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.016833007 = queryNorm
          0.36118174 = fieldWeight in 2335, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            4.623126 = idf(docFreq=1185, maxDocs=44421)
            0.078125 = fieldNorm(doc=2335)
      0.8622153 = weight(abstract_txt:llms in 2335) [ClassicSimilarity], result of:
        0.8622153 = score(doc=2335,freq=4.0), product of:
          0.6089084 = queryWeight, product of:
            3.9915955 = boost
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.016833007 = queryNorm
          1.4160016 = fieldWeight in 2335, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            9.06241 = idf(docFreq=13, maxDocs=44421)
            0.078125 = fieldNorm(doc=2335)
    0.12 = coord(3/25)
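
Each final similarity value above is the sum of the matching term scores multiplied by the coord factor, that is, the share of the 25 query terms found in the document. The sketch below recomputes the five final scores from the sums and coord counts shown in the listings. The names `listings` and `final_score` are illustrative assumptions, not part of the catalog software.

```python
# doc id -> (sum of term scores, number of matching query terms),
# copied from the five listings above.
listings = {
    2368: (1.1824882, 8),
    2337: (0.9433112, 5),
    2295: (0.9211624, 4),
    2173: (0.90039927, 4),
    2335: (1.0178666, 3),
}

QUERY_TERMS = 25  # denominator of every coord(n/25) line


def final_score(term_sum, matched, total=QUERY_TERMS):
    # Hypothetical helper: apply the coord factor to the summed term scores.
    return term_sum * matched / total


ranked = sorted(listings, key=lambda doc: final_score(*listings[doc]), reverse=True)
for doc in ranked:
    print(doc, round(final_score(*listings[doc]), 2))
```

Doc 2335 has the second-largest sum of term scores but matches only 3 of the 25 query terms, so the coord factor places it last.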
