Document (#44384)

Author
El Hamdani, R.
Bonald, T.
Malliaros, F.
Suchanek, F.
Holzenberger, N.
Title
The factuality of Large Language Models in the legal domain
Source
arXiv:2409.11798v1 [cs.CL] 18 Sep 2024 [DOI: 10.48550/arXiv.2409.11798]
Year
2024
Abstract
This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
Content
Cf.: https://www.researchgate.net/publication/384115774_The_Factuality_of_Large_Language_Models_in_the_Legal_Domain.
Theme
Computational linguistics
Field
Law

Similar documents (content)

  1. Gao, T.; Yen, H.; Yu, J.; Chen, D.: Enabling large language models to generate text with citations (2023) 0.31
    0.3133809 = sum of:
      0.3133809 = product of:
        0.9793153 = sum of:
          0.007027066 = weight(abstract_txt:that in 2295) [ClassicSimilarity], result of:
            0.007027066 = score(doc=2295,freq=1.0), product of:
              0.047541708 = queryWeight, product of:
                1.1148466 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.018031856 = queryNorm
              0.14780845 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.025700511 = weight(abstract_txt:language in 2295) [ClassicSimilarity], result of:
            0.025700511 = score(doc=2295,freq=1.0), product of:
              0.09858773 = queryWeight, product of:
                1.3108214 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018031856 = queryNorm
              0.26068673 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.031050155 = weight(abstract_txt:large in 2295) [ClassicSimilarity], result of:
            0.031050155 = score(doc=2295,freq=1.0), product of:
              0.11183322 = queryWeight, product of:
                1.3961031 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.018031856 = queryNorm
              0.27764696 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.034997117 = weight(abstract_txt:models in 2295) [ClassicSimilarity], result of:
            0.034997117 = score(doc=2295,freq=1.0), product of:
              0.12112018 = queryWeight, product of:
                1.4529154 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.018031856 = queryNorm
              0.28894538 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.035822347 = weight(abstract_txt:further in 2295) [ClassicSimilarity], result of:
            0.035822347 = score(doc=2295,freq=1.0), product of:
              0.12301678 = queryWeight, product of:
                1.4642467 = boost
                4.6591816 = idf(docFreq=1143, maxDocs=44421)
                0.018031856 = queryNorm
              0.29119885 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6591816 = idf(docFreq=1143, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.10514864 = weight(abstract_txt:dataset in 2295) [ClassicSimilarity], result of:
            0.10514864 = score(doc=2295,freq=1.0), product of:
              0.2521917 = queryWeight, product of:
                2.0965126 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.018031856 = queryNorm
              0.41693935 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.15012719 = weight(abstract_txt:factual in 2295) [ClassicSimilarity], result of:
            0.15012719 = score(doc=2295,freq=1.0), product of:
              0.31976768 = queryWeight, product of:
                2.3607466 = boost
                7.5118127 = idf(docFreq=65, maxDocs=44421)
                0.018031856 = queryNorm
              0.4694883 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5118127 = idf(docFreq=65, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.58944225 = weight(abstract_txt:llms in 2295) [ClassicSimilarity], result of:
            0.58944225 = score(doc=2295,freq=5.0), product of:
              0.46540657 = queryWeight, product of:
                2.848055 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.018031856 = queryNorm
              1.2665104 = fieldWeight in 2295, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
        0.32 = coord(8/25)
    
  2. Khorashadizadeh, H.; Amara, F.Z.; Ezzabady, M.; Ieng, F.; Tiwari, S.; Mihindukulasooriya, N.; Groppe, J.; Sahri, S.; Benamara, F.; Groppe, S.: Research trends for the interplay between Large Language Models and Knowledge Graphs (2024) 0.26
    0.25635472 = sum of:
      0.25635472 = product of:
        0.9155526 = sum of:
          0.064772114 = weight(abstract_txt:answering in 2335) [ClassicSimilarity], result of:
            0.064772114 = score(doc=2335,freq=1.0), product of:
              0.124883115 = queryWeight, product of:
                1.0432034 = boost
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.018031856 = queryNorm
              0.5186619 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6388726 = idf(docFreq=157, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.008783832 = weight(abstract_txt:that in 2335) [ClassicSimilarity], result of:
            0.008783832 = score(doc=2335,freq=1.0), product of:
              0.047541708 = queryWeight, product of:
                1.1148466 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.018031856 = queryNorm
              0.18476056 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.055643238 = weight(abstract_txt:language in 2335) [ClassicSimilarity], result of:
            0.055643238 = score(doc=2335,freq=3.0), product of:
              0.09858773 = queryWeight, product of:
                1.3108214 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018031856 = queryNorm
              0.5644033 = fieldWeight in 2335, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.038812693 = weight(abstract_txt:large in 2335) [ClassicSimilarity], result of:
            0.038812693 = score(doc=2335,freq=1.0), product of:
              0.11183322 = queryWeight, product of:
                1.3961031 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.018031856 = queryNorm
              0.3470587 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.043746397 = weight(abstract_txt:models in 2335) [ClassicSimilarity], result of:
            0.043746397 = score(doc=2335,freq=1.0), product of:
              0.12112018 = queryWeight, product of:
                1.4529154 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.018031856 = queryNorm
              0.36118174 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.044777934 = weight(abstract_txt:further in 2335) [ClassicSimilarity], result of:
            0.044777934 = score(doc=2335,freq=1.0), product of:
              0.12301678 = queryWeight, product of:
                1.4642467 = boost
                4.6591816 = idf(docFreq=1143, maxDocs=44421)
                0.018031856 = queryNorm
              0.36399856 = fieldWeight in 2335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6591816 = idf(docFreq=1143, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
          0.65901643 = weight(abstract_txt:llms in 2335) [ClassicSimilarity], result of:
            0.65901643 = score(doc=2335,freq=4.0), product of:
              0.46540657 = queryWeight, product of:
                2.848055 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.018031856 = queryNorm
              1.4160016 = fieldWeight in 2335, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.078125 = fieldNorm(doc=2335)
        0.28 = coord(7/25)
    
  3. Ghali, M.-K.; Farrag, A.; Won, D.; Jin, Y.: Enhancing knowledge retrieval with in-context learning and semantic search through Generative AI (2024) 0.26
    0.25526604 = sum of:
      0.25526604 = product of:
        0.79770637 = sum of:
          0.006148683 = weight(abstract_txt:that in 2367) [ClassicSimilarity], result of:
            0.006148683 = score(doc=2367,freq=1.0), product of:
              0.047541708 = queryWeight, product of:
                1.1148466 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.018031856 = queryNorm
              0.1293324 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.057876267 = weight(abstract_txt:exact in 2367) [ClassicSimilarity], result of:
            0.057876267 = score(doc=2367,freq=1.0), product of:
              0.14695351 = queryWeight, product of:
                1.1316369 = boost
                7.201658 = idf(docFreq=89, maxDocs=44421)
                0.018031856 = queryNorm
              0.39384067 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.201658 = idf(docFreq=89, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.031802762 = weight(abstract_txt:language in 2367) [ClassicSimilarity], result of:
            0.031802762 = score(doc=2367,freq=2.0), product of:
              0.09858773 = queryWeight, product of:
                1.3108214 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018031856 = queryNorm
              0.32258338 = fieldWeight in 2367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.047057886 = weight(abstract_txt:large in 2367) [ClassicSimilarity], result of:
            0.047057886 = score(doc=2367,freq=3.0), product of:
              0.11183322 = queryWeight, product of:
                1.3961031 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.018031856 = queryNorm
              0.4207863 = fieldWeight in 2367, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.030622475 = weight(abstract_txt:models in 2367) [ClassicSimilarity], result of:
            0.030622475 = score(doc=2367,freq=1.0), product of:
              0.12112018 = queryWeight, product of:
                1.4529154 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.018031856 = queryNorm
              0.2528272 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.032771993 = weight(abstract_txt:domain in 2367) [ClassicSimilarity], result of:
            0.032771993 = score(doc=2367,freq=1.0), product of:
              0.12672381 = queryWeight, product of:
                1.486145 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.018031856 = queryNorm
              0.2586096 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.13011481 = weight(abstract_txt:dataset in 2367) [ClassicSimilarity], result of:
            0.13011481 = score(doc=2367,freq=2.0), product of:
              0.2521917 = queryWeight, product of:
                2.0965126 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.018031856 = queryNorm
              0.51593614 = fieldWeight in 2367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.46131152 = weight(abstract_txt:llms in 2367) [ClassicSimilarity], result of:
            0.46131152 = score(doc=2367,freq=4.0), product of:
              0.46540657 = queryWeight, product of:
                2.848055 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.018031856 = queryNorm
              0.99120116 = fieldWeight in 2367, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
        0.32 = coord(8/25)
    
  4. Yang, L.; Chen, H.; Li, Z.; Ding, X.; Wu, X.: Give us the facts : enhancing Large Language Models with knowledge graphs for fact-aware language modeling (2024) 0.24
    0.23877124 = sum of:
      0.23877124 = product of:
        0.9948802 = sum of:
          0.045642685 = weight(abstract_txt:bases in 2337) [ClassicSimilarity], result of:
            0.045642685 = score(doc=2337,freq=1.0), product of:
              0.11475346 = queryWeight, product of:
                6.3639297 = idf(docFreq=207, maxDocs=44421)
                0.018031856 = queryNorm
              0.3977456 = fieldWeight in 2337, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3639297 = idf(docFreq=207, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.05746809 = weight(abstract_txt:language in 2337) [ClassicSimilarity], result of:
            0.05746809 = score(doc=2337,freq=5.0), product of:
              0.09858773 = queryWeight, product of:
                1.3108214 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018031856 = queryNorm
              0.5829132 = fieldWeight in 2337, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.04391155 = weight(abstract_txt:large in 2337) [ClassicSimilarity], result of:
            0.04391155 = score(doc=2337,freq=2.0), product of:
              0.11183322 = queryWeight, product of:
                1.3961031 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.018031856 = queryNorm
              0.3926521 = fieldWeight in 2337, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.060616784 = weight(abstract_txt:models in 2337) [ClassicSimilarity], result of:
            0.060616784 = score(doc=2337,freq=3.0), product of:
              0.12112018 = queryWeight, product of:
                1.4529154 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.018031856 = queryNorm
              0.5004681 = fieldWeight in 2337, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.26002792 = weight(abstract_txt:factual in 2337) [ClassicSimilarity], result of:
            0.26002792 = score(doc=2337,freq=3.0), product of:
              0.31976768 = queryWeight, product of:
                2.3607466 = boost
                7.5118127 = idf(docFreq=65, maxDocs=44421)
                0.018031856 = queryNorm
              0.8131776 = fieldWeight in 2337, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5118127 = idf(docFreq=65, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
          0.52721316 = weight(abstract_txt:llms in 2337) [ClassicSimilarity], result of:
            0.52721316 = score(doc=2337,freq=4.0), product of:
              0.46540657 = queryWeight, product of:
                2.848055 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.018031856 = queryNorm
              1.1328013 = fieldWeight in 2337, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2337)
        0.24 = coord(6/25)
    
  5. Hou, Y.; Pascale, A.; Carnerero-Cano, J.; Sattigeri, P.; Tchrakian, T.; Marinescu, R.; Daly, E.; Padhi, I.: WikiContradict : a benchmark for evaluating LLMs on real-world knowledge conflicts from Wikipedia (2024) 0.23
    0.23018838 = sum of:
      0.23018838 = product of:
        0.82210135 = sum of:
          0.01064983 = weight(abstract_txt:that in 2368) [ClassicSimilarity], result of:
            0.01064983 = score(doc=2368,freq=3.0), product of:
              0.047541708 = queryWeight, product of:
                1.1148466 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.018031856 = queryNorm
              0.22401026 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.031802762 = weight(abstract_txt:language in 2368) [ClassicSimilarity], result of:
            0.031802762 = score(doc=2368,freq=2.0), product of:
              0.09858773 = queryWeight, product of:
                1.3108214 = boost
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.018031856 = queryNorm
              0.32258338 = fieldWeight in 2368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1709876 = idf(docFreq=1863, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.027168885 = weight(abstract_txt:large in 2368) [ClassicSimilarity], result of:
            0.027168885 = score(doc=2368,freq=1.0), product of:
              0.11183322 = queryWeight, product of:
                1.3961031 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.018031856 = queryNorm
              0.24294108 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.053039685 = weight(abstract_txt:models in 2368) [ClassicSimilarity], result of:
            0.053039685 = score(doc=2368,freq=3.0), product of:
              0.12112018 = queryWeight, product of:
                1.4529154 = boost
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.018031856 = queryNorm
              0.43790957 = fieldWeight in 2368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.623126 = idf(docFreq=1185, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.0424462 = weight(abstract_txt:under in 2368) [ClassicSimilarity], result of:
            0.0424462 = score(doc=2368,freq=1.0), product of:
              0.15057361 = queryWeight, product of:
                1.6199683 = boost
                5.154682 = idf(docFreq=696, maxDocs=44421)
                0.018031856 = queryNorm
              0.28189668 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.154682 = idf(docFreq=696, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.09200506 = weight(abstract_txt:dataset in 2368) [ClassicSimilarity], result of:
            0.09200506 = score(doc=2368,freq=1.0), product of:
              0.2521917 = queryWeight, product of:
                2.0965126 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.018031856 = queryNorm
              0.36482194 = fieldWeight in 2368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
          0.5649889 = weight(abstract_txt:llms in 2368) [ClassicSimilarity], result of:
            0.5649889 = score(doc=2368,freq=6.0), product of:
              0.46540657 = queryWeight, product of:
                2.848055 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.018031856 = queryNorm
              1.2139685 = fieldWeight in 2368, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2368)
        0.28 = coord(7/25)
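
The score breakdowns above all follow the same arithmetic, and it may help to see it reproduced once. The following is a minimal sketch that recomputes the 0.31 score of entry 1 (doc 2295) from the values printed in its breakdown. It assumes the classic TF-IDF formulas that the printed numbers are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm. The function and variable names are illustrative and not part of any library; boost, queryNorm, and fieldNorm are taken as given inputs rather than derived.

```python
import math

# Constants copied from the breakdown of entry 1 (doc 2295).
MAX_DOCS = 44421
QUERY_NORM = 0.018031856
FIELD_NORM = 0.0625


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float,
               field_norm: float = FIELD_NORM) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (termFreq, docFreq, boost) for the eight abstract terms matched in doc 2295.
terms = {
    "that":     (1.0, 11344, 1.1148466),
    "language": (1.0, 1863, 1.3108214),
    "large":    (1.0, 1420, 1.3961031),
    "models":   (1.0, 1185, 1.4529154),
    "further":  (1.0, 1143, 1.4642467),
    "dataset":  (1.0, 152, 2.0965126),
    "factual":  (1.0, 65, 2.3607466),
    "llms":     (5.0, 13, 2.848055),
}

# coord rewards documents that match more of the 25 query terms.
coord = len(terms) / 25

term_scores = {name: term_score(*values) for name, values in terms.items()}
total = coord * sum(term_scores.values())

for name, value in term_scores.items():
    print(f"{name:10s} {value:.6f}")
print(f"total      {total:.6f}")
```

The result agrees with the listed 0.3133809 up to small rounding differences, since the breakdown was presumably computed in single precision. The rare term "llms" (docFreq=13) contributes well over half of the sum, which is why all five similar documents rank mainly on that term.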