Document (#44244)

Author
Törnberg, P.
Title
How to use LLMs for text analysis
Source
arXiv:2307.13106v1 [cs.CL] 24 Jul 2023 [https://www.researchgate.net/publication/372625394_How_to_use_LLMs_for_Text_Analysis]
Year
2023
Abstract
This guide introduces Large Language Models (LLMs) as a highly versatile text analysis method within the social sciences. As LLMs are easy to use, cheap, fast, and applicable to a broad range of text analysis tasks, ranging from text annotation and classification to sentiment analysis and critical discourse analysis, many scholars believe that LLMs will transform how we do text analysis. This how-to guide is aimed at students and researchers with limited programming experience, and offers a simple introduction to how LLMs can be used for text analysis in your own research project, as well as advice on best practices. We will go through each of the steps of analyzing textual data with LLMs using Python: installing the software, setting up the API, loading the data, developing an analysis prompt, analyzing the text, and validating the results. As an illustrative example, we will use the challenging task of identifying populism in political texts, and show how LLMs move beyond the existing state of the art.
Theme
Computational linguistics

Similar documents (content)

  1. Kau, A.; He, X.; Nambissan, A.; Astudillo, A.; Yin, H.; Aryani, A.: Combining graphs and Large Language Models (2024) 0.19
    0.19032773 = sum of:
      0.19032773 = product of:
        1.1895484 = sum of:
          0.06744933 = weight(abstract_txt:versatile in 2336) [ClassicSimilarity], result of:
            0.06744933 = score(doc=2336,freq=1.0), product of:
              0.1253352 = queryWeight, product of:
                1.3030862 = boost
                8.610425 = idf(docFreq=21, maxDocs=44421)
                0.011170571 = queryNorm
              0.53815156 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.610425 = idf(docFreq=21, maxDocs=44421)
                0.0625 = fieldNorm(doc=2336)
          0.018251961 = weight(abstract_txt:will in 2336) [ClassicSimilarity], result of:
            0.018251961 = score(doc=2336,freq=1.0), product of:
              0.075625464 = queryWeight, product of:
                1.7531991 = boost
                3.8615482 = idf(docFreq=2539, maxDocs=44421)
                0.011170571 = queryNorm
              0.24134676 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8615482 = idf(docFreq=2539, maxDocs=44421)
                0.0625 = fieldNorm(doc=2336)
          0.0488012 = weight(abstract_txt:text in 2336) [ClassicSimilarity], result of:
            0.0488012 = score(doc=2336,freq=1.0), product of:
              0.19322988 = queryWeight, product of:
                4.2807784 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.011170571 = queryNorm
              0.25255513 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2336)
          1.0550458 = weight(abstract_txt:llms in 2336) [ClassicSimilarity], result of:
            1.0550458 = score(doc=2336,freq=5.0), product of:
              0.8330337 = queryWeight, product of:
                8.228932 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.011170571 = queryNorm
              1.2665104 = fieldWeight in 2336, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2336)
        0.16 = coord(4/25)
    
  2. Jha, A.: Why GPT-4 isn't all it's cracked up to be (2023) 0.15
    0.14648004 = sum of:
      0.14648004 = product of:
        0.9155003 = sum of:
          0.006898423 = weight(abstract_txt:data in 1924) [ClassicSimilarity], result of:
            0.006898423 = score(doc=1924,freq=2.0), product of:
              0.0374974 = queryWeight, product of:
                1.0079806 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.011170571 = queryNorm
              0.1839707 = fieldWeight in 1924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.0390625 = fieldNorm(doc=1924)
          0.05170002 = weight(abstract_txt:cheap in 1924) [ClassicSimilarity], result of:
            0.05170002 = score(doc=1924,freq=1.0), product of:
              0.14360242 = queryWeight, product of:
                1.3948177 = boost
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.011170571 = queryNorm
              0.36002192 = fieldWeight in 1924, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.0390625 = fieldNorm(doc=1924)
          0.022814952 = weight(abstract_txt:will in 1924) [ClassicSimilarity], result of:
            0.022814952 = score(doc=1924,freq=4.0), product of:
              0.075625464 = queryWeight, product of:
                1.7531991 = boost
                3.8615482 = idf(docFreq=2539, maxDocs=44421)
                0.011170571 = queryNorm
              0.30168346 = fieldWeight in 1924, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.8615482 = idf(docFreq=2539, maxDocs=44421)
                0.0390625 = fieldNorm(doc=1924)
          0.8340869 = weight(abstract_txt:llms in 1924) [ClassicSimilarity], result of:
            0.8340869 = score(doc=1924,freq=8.0), product of:
              0.8330337 = queryWeight, product of:
                8.228932 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.011170571 = queryNorm
              1.0012643 = fieldWeight in 1924, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0390625 = fieldNorm(doc=1924)
        0.16 = coord(4/25)
    
  3. Mens, G. Le; Kovács, B.; Hannan, M.T.; Pros, G.: Uncovering the semantics of concepts using GPT-4 (2023) 0.14
    0.14151415 = sum of:
      0.14151415 = product of:
        0.8844634 = sum of:
          0.011828332 = weight(abstract_txt:data in 2305) [ClassicSimilarity], result of:
            0.011828332 = score(doc=2305,freq=3.0), product of:
              0.0374974 = queryWeight, product of:
                1.0079806 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.011170571 = queryNorm
              0.31544405 = fieldWeight in 2305, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.09548245 = weight(abstract_txt:text in 2305) [ClassicSimilarity], result of:
            0.09548245 = score(doc=2305,freq=5.0), product of:
              0.19322988 = queryWeight, product of:
                4.2807784 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.011170571 = queryNorm
              0.49413916 = fieldWeight in 2305, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.062072106 = weight(abstract_txt:analysis in 2305) [ClassicSimilarity], result of:
            0.062072106 = score(doc=2305,freq=3.0), product of:
              0.17975038 = queryWeight, product of:
                4.413839 = boost
                3.6456752 = idf(docFreq=3151, maxDocs=44421)
                0.011170571 = queryNorm
              0.34532392 = fieldWeight in 2305, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6456752 = idf(docFreq=3151, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
          0.71508056 = weight(abstract_txt:llms in 2305) [ClassicSimilarity], result of:
            0.71508056 = score(doc=2305,freq=3.0), product of:
              0.8330337 = queryWeight, product of:
                8.228932 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.011170571 = queryNorm
              0.85840535 = fieldWeight in 2305, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2305)
        0.16 = coord(4/25)
    
  4. Gao, T.; Yen, H.; Yu, J.; Chen, D.: Enabling large language models to generate text with citations (2023) 0.14
    0.13627079 = sum of:
      0.13627079 = product of:
        1.13559 = sum of:
          0.031742867 = weight(abstract_txt:challenging in 2295) [ClassicSimilarity], result of:
            0.031742867 = score(doc=2295,freq=1.0), product of:
              0.07583192 = queryWeight, product of:
                1.0135907 = boost
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.011170571 = queryNorm
              0.41859508 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.697521 = idf(docFreq=148, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          0.0488012 = weight(abstract_txt:text in 2295) [ClassicSimilarity], result of:
            0.0488012 = score(doc=2295,freq=1.0), product of:
              0.19322988 = queryWeight, product of:
                4.2807784 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.011170571 = queryNorm
              0.25255513 = fieldWeight in 2295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
          1.0550458 = weight(abstract_txt:llms in 2295) [ClassicSimilarity], result of:
            1.0550458 = score(doc=2295,freq=5.0), product of:
              0.8330337 = queryWeight, product of:
                8.228932 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.011170571 = queryNorm
              1.2665104 = fieldWeight in 2295, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=2295)
        0.12 = coord(3/25)
    
  5. Ghali, M.-K.; Farrag, A.; Won, D.; Jin, Y.: Enhancing knowledge retrieval with in-context learning and semantic search through Generative AI (2024) 0.11
    0.109118655 = sum of:
      0.109118655 = product of:
        0.90932214 = sum of:
          0.009657794 = weight(abstract_txt:data in 2367) [ClassicSimilarity], result of:
            0.009657794 = score(doc=2367,freq=2.0), product of:
              0.0374974 = queryWeight, product of:
                1.0079806 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.011170571 = queryNorm
              0.257559 = fieldWeight in 2367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.073960386 = weight(abstract_txt:text in 2367) [ClassicSimilarity], result of:
            0.073960386 = score(doc=2367,freq=3.0), product of:
              0.19322988 = queryWeight, product of:
                4.2807784 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.011170571 = queryNorm
              0.38275853 = fieldWeight in 2367, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
          0.825704 = weight(abstract_txt:llms in 2367) [ClassicSimilarity], result of:
            0.825704 = score(doc=2367,freq=4.0), product of:
              0.8330337 = queryWeight, product of:
                8.228932 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.011170571 = queryNorm
              0.99120116 = fieldWeight in 2367, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2367)
        0.12 = coord(3/25)
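
The score trees above follow the usual ClassicSimilarity (TF-IDF) arithmetic: each matching term contributes queryWeight × fieldWeight, the term scores are summed, and the sum is scaled by the coord factor. The following Python sketch recomputes the score of the first similar document (doc 2336) from the values printed in its tree. The function and variable names are illustrative and not part of the record; the formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are assumed because they reproduce the tf and idf values shown in the listing.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed idf formula; it reproduces e.g. 8.610425 for docFreq=21, maxDocs=44421.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# Values copied from the explanation tree of doc 2336.
MAX_DOCS = 44421
QUERY_NORM = 0.011170571
FIELD_NORM = 0.0625

# (term, freq, docFreq, boost)
matched_terms = [
    ("versatile", 1.0, 21, 1.3030862),
    ("will", 1.0, 2539, 1.7531991),
    ("text", 1.0, 2122, 4.2807784),
    ("llms", 5.0, 13, 8.228932),
]

raw_sum = sum(
    term_score(freq, doc_freq, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for _term, freq, doc_freq, boost in matched_terms
)

# coord(4/25): 4 of the 25 query terms matched this document.
coord = 4 / 25
score = raw_sum * coord

print(f"sum of term scores: {raw_sum:.4f}")
print(f"final score:        {score:.4f}")
```

The result should agree with the listed 0.19032773 to about four decimal places; small deviations are expected because the search engine computes in single precision while Python uses double precision. The tree also shows that the term "llms" dominates the ranking: its high boost and high idf (docFreq=13) account for most of the summed score.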