Document (#44395)

Author
Williams, S.
Huckle, J.
Title
Easy problems that LLMs get wrong
Source
arXiv:2405.19616v1 [cs.AI] 30 May 2024 [DOI: 10.48550/arXiv.2405.19616]
Year
2024
Abstract
We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.
Content
Cf.: https://www.researchgate.net/publication/381006169_Easy_Problems_That_LLMs_Get_Wrong.
Theme
Computational linguistics

Similar documents (author)

  1. Williams, R.M.: ISI search network research front specialties (1983) 4.51
    4.5080194 = sum of:
      4.5080194 = weight(author_txt:williams in 1473) [ClassicSimilarity], result of:
        4.5080194 = fieldWeight in 1473, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.212831 = idf(docFreq=88, maxDocs=44421)
          0.625 = fieldNorm(doc=1473)
    
  2. Williams, J.W.: Serials cataloging, 1985-1990 : an overview of a half-decade (1992) 4.51
    4.5080194 = sum of:
      4.5080194 = weight(author_txt:williams in 4206) [ClassicSimilarity], result of:
        4.5080194 = fieldWeight in 4206, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.212831 = idf(docFreq=88, maxDocs=44421)
          0.625 = fieldNorm(doc=4206)
    
  3. Williams, D.A.: Information skills in the school curriculum (1991) 4.51
    4.5080194 = sum of:
      4.5080194 = weight(author_txt:williams in 4834) [ClassicSimilarity], result of:
        4.5080194 = fieldWeight in 4834, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.212831 = idf(docFreq=88, maxDocs=44421)
          0.625 = fieldNorm(doc=4834)
    
  4. Williams, M.: Transparent information systems through gateways, front ends, intermediaries, and interfaces (1986) 4.51
    4.5080194 = sum of:
      4.5080194 = weight(author_txt:williams in 5134) [ClassicSimilarity], result of:
        4.5080194 = fieldWeight in 5134, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.212831 = idf(docFreq=88, maxDocs=44421)
          0.625 = fieldNorm(doc=5134)
    
  5. Williams, F.: Appraisal and evaluation of software products (1992) 4.51
    4.5080194 = sum of:
      4.5080194 = weight(author_txt:williams in 5306) [ClassicSimilarity], result of:
        4.5080194 = fieldWeight in 5306, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.212831 = idf(docFreq=88, maxDocs=44421)
          0.625 = fieldNorm(doc=5306)
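
The author scores above can be reproduced by hand. The following is a minimal sketch, assuming the listing uses the default formulas of Lucene's ClassicSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any library. The input values are copied from the listing.

```python
import math


def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency, assumed ClassicSimilarity default.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explain output.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values for the term "williams" in author_txt, taken from the listing.
author_idf = classic_idf(doc_freq=88, max_docs=44421)
author_score = field_weight(freq=1.0, doc_freq=88, max_docs=44421, field_norm=0.625)

print(f"idf   = {author_idf:.6f}")    # about 7.2128
print(f"score = {author_score:.6f}")  # about 4.5080
```

Because every hit matches the single term "williams" once, with the same idf and the same fieldNorm, all five author matches receive the identical score of 4.51.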
    

Similar documents (content)

  1. Hou, Y.; Pascale, A.; Carnerero-Cano, J.; Sattigeri, P.; Tchrakian, T.; Marinescu, R.; Daly, E.; Padhi, I.: WikiContradict : a benchmark for evaluating LLMs on real-world knowledge conflicts from Wikipedia (2024) 0.38
    0.3783962 = sum of:
      0.3783962 = product of:
        1.1824882 = sum of:
          0.046020538 = weight(abstract_txt:regarded in 2368) [ClassicSimilarity], result of:
            0.046020538 = score(doc=2368,freq=1.0), product of:
              0.120510116 = queryWeight, product of:
                1.0252309 = boost
                6.982969 = idf(docFreq=111, maxDocs=44421)
                0.016833007 = queryNorm
              0.38188112 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.982969 = idf(docFreq=111, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.07346258 = weight(abstract_txt:benchmark in 2368) [ClassicSimilarity], result of:
            0.07346258 = score(doc=2368,freq=2.0), product of:
              0.13064411 = queryWeight, product of:
                1.0674679 = boost
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.016833007 = queryNorm
              0.5623107 = fieldWeight in 2368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.08249299 = weight(abstract_txt:mitigate in 2368) [ClassicSimilarity], result of:
            0.08249299 = score(doc=2368,freq=1.0), product of:
              0.17782812 = queryWeight, product of:
                1.2454036 = boost
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.016833007 = queryNorm
              0.46389174 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.048030253 = weight(abstract_txt:human in 2368) [ClassicSimilarity], result of:
            0.048030253 = score(doc=2368,freq=3.0), product of:
              0.108318314 = queryWeight, product of:
                1.3745985 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.016833007 = queryNorm
              0.44341767 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.055000927 = weight(abstract_txt:limitations in 2368) [ClassicSimilarity], result of:
            0.055000927 = score(doc=2368,freq=2.0), product of:
              0.13571745 = queryWeight, product of:
                1.5386604 = boost
                5.2399993 = idf(docFreq=639, maxDocs=44421)
                0.016833007 = queryNorm
              0.40526053 = fieldWeight in 2368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2399993 = idf(docFreq=639, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.06889143 = weight(abstract_txt:reasoning in 2368) [ClassicSimilarity], result of:
            0.06889143 = score(doc=2368,freq=1.0), product of:
              0.19868992 = queryWeight, product of:
                1.8617135 = boost
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.016833007 = queryNorm
              0.34672835 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.069393754 = weight(abstract_txt:models in 2368) [ClassicSimilarity], result of:
            0.069393754 = score(doc=2368,freq=3.0), product of:
              0.15846595 = queryWeight, product of:
                2.036285 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.016833007 = queryNorm
              0.43790957 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.73919564 = weight(abstract_txt:llms in 2368) [ClassicSimilarity], result of:
            0.73919564 = score(doc=2368,freq=6.0), product of:
              0.6089084 = queryWeight, product of:
                3.9915955 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.016833007 = queryNorm
              1.2139685 = fieldWeight in 2368, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
        0.32 = coord(8/25)
    
  2. Yang, L.; Chen, H.; Li, Z.; Ding, X.; Wu, X.: Give us the facts : enhancing Large Language Models with knowledge graphs for fact-aware language modeling (2024) 0.19
    0.18866225 = sum of:
      0.18866225 = product of:
        0.9433112 = sum of:
          0.051051304 = weight(abstract_txt:humans in 2337) [ClassicSimilarity], result of:
            0.051051304 = score(doc=2337,freq=1.0), product of:
              0.11814054 = queryWeight, product of:
                1.0151013 = boost
                6.9139757 = idf(docFreq=119, maxDocs=44421)
                0.016833007 = queryNorm
              0.43212348 = fieldWeight in 2337, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9139757 = idf(docFreq=119, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.04444746 = weight(abstract_txt:limitations in 2337) [ClassicSimilarity], result of:
            0.04444746 = score(doc=2337,freq=1.0), product of:
              0.13571745 = queryWeight, product of:
                1.5386604 = boost
                5.2399993 = idf(docFreq=639, maxDocs=44421)
                0.016833007 = queryNorm
              0.32749996 = fieldWeight in 2337, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2399993 = idf(docFreq=639, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.078733064 = weight(abstract_txt:reasoning in 2337) [ClassicSimilarity], result of:
            0.078733064 = score(doc=2337,freq=1.0), product of:
              0.19868992 = queryWeight, product of:
                1.8617135 = boost
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.016833007 = queryNorm
              0.39626098 = fieldWeight in 2337, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.07930715 = weight(abstract_txt:models in 2337) [ClassicSimilarity], result of:
            0.07930715 = score(doc=2337,freq=3.0), product of:
              0.15846595 = queryWeight, product of:
                2.036285 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.016833007 = queryNorm
              0.5004681 = fieldWeight in 2337, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.68977225 = weight(abstract_txt:llms in 2337) [ClassicSimilarity], result of:
            0.68977225 = score(doc=2337,freq=4.0), product of:
              0.6089084 = queryWeight, product of:
                3.9915955 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.016833007 = queryNorm
              1.1328013 = fieldWeight in 2337, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
        0.2 = coord(5/25)
    
  3. Gao, T.; Yen, H.; Yu, J.; Chen, D.: Enabling large language models to generate text with citations (2023) 0.15
    0.14738598 = sum of:
      0.14738598 = product of:
        0.9211624 = sum of:
          0.059366733 = weight(abstract_txt:benchmark in 2295) [ClassicSimilarity], result of:
            0.059366733 = score(doc=2295,freq=1.0), product of:
              0.13064411 = queryWeight, product of:
                1.0674679 = boost
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.016833007 = queryNorm
              0.45441568 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.044818904 = weight(abstract_txt:human in 2295) [ClassicSimilarity], result of:
            0.044818904 = score(doc=2295,freq=2.0), product of:
              0.108318314 = queryWeight, product of:
                1.3745985 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.016833007 = queryNorm
              0.41377032 = fieldWeight in 2295, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.045788005 = weight(abstract_txt:models in 2295) [ClassicSimilarity], result of:
            0.045788005 = score(doc=2295,freq=1.0), product of:
              0.15846595 = queryWeight, product of:
                2.036285 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.016833007 = queryNorm
              0.28894538 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.7711888 = weight(abstract_txt:llms in 2295) [ClassicSimilarity], result of:
            0.7711888 = score(doc=2295,freq=5.0), product of:
              0.6089084 = queryWeight, product of:
                3.9915955 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.016833007 = queryNorm
              1.2665104 = fieldWeight in 2295, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
        0.16 = coord(4/25)
    
  4. Luo, L.; Ju, J.; Li, Y.-F.; Haffari, G.; Xiong, B.; Pan, S.: ChatRule: mining logical rules with large language models for knowledge graph reasoning (2023) 0.14
    0.14406388 = sum of:
      0.14406388 = product of:
        0.90039927 = sum of:
          0.10191535 = weight(abstract_txt:prompt in 2173) [ClassicSimilarity], result of:
            0.10191535 = score(doc=2173,freq=1.0), product of:
              0.18730706 = queryWeight, product of:
                1.2781652 = boost
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.016833007 = queryNorm
              0.54410845 = fieldWeight in 2173, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.0625 = fieldNorm(doc=2173)
          0.13636966 = weight(abstract_txt:reasoning in 2173) [ClassicSimilarity], result of:
            0.13636966 = score(doc=2173,freq=3.0), product of:
              0.19868992 = queryWeight, product of:
                1.8617135 = boost
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.016833007 = queryNorm
              0.68634415 = fieldWeight in 2173, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.0625 = fieldNorm(doc=2173)
          0.06475402 = weight(abstract_txt:models in 2173) [ClassicSimilarity], result of:
            0.06475402 = score(doc=2173,freq=2.0), product of:
              0.15846595 = queryWeight, product of:
                2.036285 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.016833007 = queryNorm
              0.40863046 = fieldWeight in 2173, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=2173)
          0.59736025 = weight(abstract_txt:llms in 2173) [ClassicSimilarity], result of:
            0.59736025 = score(doc=2173,freq=3.0), product of:
              0.6089084 = queryWeight, product of:
                3.9915955 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.016833007 = queryNorm
              0.9810347 = fieldWeight in 2173, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2173)
        0.16 = coord(4/25)
    
  5. Khorashadizadeh, H.; Amara, F.Z.; Ezzabady, M.; Ieng, F.; Tiwari, S.; Mihindukulasooriya, N.; Groppe, J.; Sahri, S.; Benamara, F.; Groppe, S.: Research trends for the interplay between Large Language Models and Knowledge Graphs (2024) 0.12
    0.12214399 = sum of:
      0.12214399 = product of:
        1.0178666 = sum of:
          0.09841633 = weight(abstract_txt:reasoning in 2335) [ClassicSimilarity], result of:
            0.09841633 = score(doc=2335,freq=1.0), product of:
              0.19868992 = queryWeight, product of:
                1.8617135 = boost
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.016833007 = queryNorm
              0.49532622 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.057235006 = weight(abstract_txt:models in 2335) [ClassicSimilarity], result of:
            0.057235006 = score(doc=2335,freq=1.0), product of:
              0.15846595 = queryWeight, product of:
                2.036285 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.016833007 = queryNorm
              0.36118174 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.8622153 = weight(abstract_txt:llms in 2335) [ClassicSimilarity], result of:
            0.8622153 = score(doc=2335,freq=4.0), product of:
              0.6089084 = queryWeight, product of:
                3.9915955 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.016833007 = queryNorm
              1.4160016 = fieldWeight in 2335, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
        0.12 = coord(3/25)
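
The content scores combine several matched terms. As a rough sketch, again assuming Lucene ClassicSimilarity defaults, each term contributes queryWeight * fieldWeight, and the sum is scaled by the coordination factor (matched terms / query terms). The helper names below are hypothetical; the numbers are copied from entry 5 (doc=2335) of the listing.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed ClassicSimilarity default: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm          # query-side weight
    field_weight = math.sqrt(freq) * idf * field_norm  # document-side weight
    return query_weight * field_weight


MAX_DOCS = 44421
QUERY_NORM = 0.016833007
FIELD_NORM = 0.078125  # fieldNorm(doc=2335)

# (termFreq, docFreq, boost) for the matched terms "reasoning", "models", "llms".
matched_terms = [
    (1.0, 212, 1.8617135),
    (1.0, 1185, 2.036285),
    (4.0, 13, 3.9915955),
]

raw_sum = sum(
    term_score(freq, doc_freq, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for freq, doc_freq, boost in matched_terms
)
coord = 3 / 25  # 3 of the 25 query terms matched
total = coord * raw_sum

print(f"sum of term scores = {raw_sum:.6f}")  # about 1.0179
print(f"final score        = {total:.6f}")    # about 0.1221
```

The rare term "llms" (docFreq=13) dominates every content score in the listing, while the coord factor explains why entry 1, with 8 of 25 terms matched, ranks well above entries that match only 3 to 5 terms.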