Document (#43264)

Author
Du, C.
Cohoon, J.
Lopez, P.
Howison, J.
Title
Softcite dataset : a dataset of software mentions in biomedical and economic research publications
Source
Journal of the Association for Information Science and Technology. 72(2021) no.7, pp. 870-884
Year
2021
Abstract
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
Content
Cf.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24454.
Form
Software

Similar documents (author)

  1. Lopez, C.G.: Technical processes and the technological development of the library system in the National Autonomous University of Mexico (2000) 5.25
    5.2535195 = sum of:
      5.2535195 = weight(author_txt:lopez in 371) [ClassicSimilarity], result of:
        5.2535195 = fieldWeight in 371, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.405631 = idf(docFreq=26, maxDocs=44421)
          0.625 = fieldNorm(doc=371)
    
  2. Lopez, P.: Artificial Intelligence und die normative Kraft des Faktischen (2021) 5.25
    5.2535195 = sum of:
      5.2535195 = weight(author_txt:lopez in 2027) [ClassicSimilarity], result of:
        5.2535195 = fieldWeight in 2027, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.405631 = idf(docFreq=26, maxDocs=44421)
          0.625 = fieldNorm(doc=2027)
    
  3. Lopez, P.: ChatGPT und der Unterschied zwischen Form und Inhalt (2023) 5.25
    5.2535195 = sum of:
      5.2535195 = weight(author_txt:lopez in 2029) [ClassicSimilarity], result of:
        5.2535195 = fieldWeight in 2029, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.405631 = idf(docFreq=26, maxDocs=44421)
          0.625 = fieldNorm(doc=2029)
    
  4. Cozar, E.D. Lopez- -> Lopez-Cozar, E.D.: 4.46
    4.457759 = sum of:
      4.457759 = weight(author_txt:lopez in 1256) [ClassicSimilarity], result of:
        4.457759 = fieldWeight in 1256, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.405631 = idf(docFreq=26, maxDocs=44421)
          0.375 = fieldNorm(doc=1256)
    
  5. Pujalte, C. Lopez- -> Lopez-Pujalte, C.: 4.46
    4.457759 = sum of:
      4.457759 = weight(author_txt:lopez in 3746) [ClassicSimilarity], result of:
        4.457759 = fieldWeight in 3746, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.405631 = idf(docFreq=26, maxDocs=44421)
          0.375 = fieldNorm(doc=3746)
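
The author scores listed above can be reproduced by hand. The sketch below assumes the standard ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes fieldNorm directly from the listing. The function names are illustrative and are not part of any Lucene API.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as defined by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)."""
    tf = math.sqrt(freq)
    return tf * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: the term "lopez" occurs in 26 of 44421 documents.
single_match = field_weight(freq=1.0, doc_freq=26, max_docs=44421, field_norm=0.625)
double_match = field_weight(freq=2.0, doc_freq=26, max_docs=44421, field_norm=0.375)

print(round(single_match, 4))  # entries 1 to 3
print(round(double_match, 4))  # entries 4 and 5
```

Entries 4 and 5 contain the term twice, yet they score lower because their fieldNorm is smaller (0.375 against 0.625), which outweighs the gain from tf = sqrt(2).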
    

Similar documents (content)

  1. Ahmed, M.: Automatic indexing for agriculture : designing a framework by deploying Agrovoc, Agris and Annif (2023) 0.18
    0.17823747 = sum of:
      0.17823747 = product of:
        0.8911873 = sum of:
          0.042815167 = weight(abstract_txt:learned in 2026) [ClassicSimilarity], result of:
            0.042815167 = score(doc=2026,freq=1.0), product of:
              0.10443036 = queryWeight, product of:
                1.0480511 = boost
                6.559804 = idf(docFreq=170, maxDocs=44421)
                0.015189849 = queryNorm
              0.40998775 = fieldWeight in 2026, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.559804 = idf(docFreq=170, maxDocs=44421)
                0.0625 = fieldNorm(doc=2026)
          0.06315774 = weight(abstract_txt:supervised in 2026) [ClassicSimilarity], result of:
            0.06315774 = score(doc=2026,freq=1.0), product of:
              0.13532542 = queryWeight, product of:
                1.1930503 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.015189849 = queryNorm
              0.46671006 = fieldWeight in 2026, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=2026)
          0.01435275 = weight(abstract_txt:research in 2026) [ClassicSimilarity], result of:
            0.01435275 = score(doc=2026,freq=1.0), product of:
              0.07268177 = queryWeight, product of:
                1.5144064 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.015189849 = queryNorm
              0.19747387 = fieldWeight in 2026, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=2026)
          0.056029838 = weight(abstract_txt:learning in 2026) [ClassicSimilarity], result of:
            0.056029838 = score(doc=2026,freq=3.0), product of:
              0.10914676 = queryWeight, product of:
                1.5152681 = boost
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.015189849 = queryNorm
              0.51334405 = fieldWeight in 2026, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.0625 = fieldNorm(doc=2026)
          0.7148318 = weight(abstract_txt:dataset in 2026) [ClassicSimilarity], result of:
            0.7148318 = score(doc=2026,freq=7.0), product of:
              0.64801043 = queryWeight, product of:
                6.3949285 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.015189849 = queryNorm
              1.1031178 = fieldWeight in 2026, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0625 = fieldNorm(doc=2026)
        0.2 = coord(5/25)
    
  2. The Computer Science Ontology (CSO) (2018) 0.15
    0.14881082 = sum of:
      0.14881082 = product of:
        0.5314672 = sum of:
          0.0287055 = weight(abstract_txt:research in 429) [ClassicSimilarity], result of:
            0.0287055 = score(doc=429,freq=4.0), product of:
              0.07268177 = queryWeight, product of:
                1.5144064 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.015189849 = queryNorm
              0.39494774 = fieldWeight in 429, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.03234884 = weight(abstract_txt:learning in 429) [ClassicSimilarity], result of:
            0.03234884 = score(doc=429,freq=1.0), product of:
              0.10914676 = queryWeight, product of:
                1.5152681 = boost
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.015189849 = queryNorm
              0.29637933 = fieldWeight in 429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.041171513 = weight(abstract_txt:discussion in 429) [ClassicSimilarity], result of:
            0.041171513 = score(doc=429,freq=1.0), product of:
              0.1281847 = queryWeight, product of:
                1.6421098 = boost
                5.1390233 = idf(docFreq=707, maxDocs=44421)
                0.015189849 = queryNorm
              0.32118896 = fieldWeight in 429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1390233 = idf(docFreq=707, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.062453434 = weight(abstract_txt:publications in 429) [ClassicSimilarity], result of:
            0.062453434 = score(doc=429,freq=2.0), product of:
              0.13431749 = queryWeight, product of:
                1.6809328 = boost
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.015189849 = queryNorm
              0.46496874 = fieldWeight in 429, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.046388876 = weight(abstract_txt:academic in 429) [ClassicSimilarity], result of:
            0.046388876 = score(doc=429,freq=1.0), product of:
              0.1588832 = queryWeight, product of:
                2.2390752 = boost
                4.6714945 = idf(docFreq=1129, maxDocs=44421)
                0.015189849 = queryNorm
              0.2919684 = fieldWeight in 429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6714945 = idf(docFreq=1129, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.05021796 = weight(abstract_txt:software in 429) [ClassicSimilarity], result of:
            0.05021796 = score(doc=429,freq=1.0), product of:
              0.18436892 = queryWeight, product of:
                2.7851136 = boost
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.015189849 = queryNorm
              0.27237758 = fieldWeight in 429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
          0.27018106 = weight(abstract_txt:dataset in 429) [ClassicSimilarity], result of:
            0.27018106 = score(doc=429,freq=1.0), product of:
              0.64801043 = queryWeight, product of:
                6.3949285 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.015189849 = queryNorm
              0.41693935 = fieldWeight in 429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0625 = fieldNorm(doc=429)
        0.28 = coord(7/25)
    
  3. Yu, M.; Sun, A.: Dataset versus reality : understanding model performance from the perspective of information need (2023) 0.14
    0.14188822 = sum of:
      0.14188822 = product of:
        0.70944107 = sum of:
          0.09127876 = weight(abstract_txt:datasets in 2075) [ClassicSimilarity], result of:
            0.09127876 = score(doc=2075,freq=6.0), product of:
              0.10406045 = queryWeight, product of:
                1.0461932 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.015189849 = queryNorm
              0.87717056 = fieldWeight in 2075, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2075)
          0.025117313 = weight(abstract_txt:research in 2075) [ClassicSimilarity], result of:
            0.025117313 = score(doc=2075,freq=4.0), product of:
              0.07268177 = queryWeight, product of:
                1.5144064 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.015189849 = queryNorm
              0.34557927 = fieldWeight in 2075, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2075)
          0.028305238 = weight(abstract_txt:learning in 2075) [ClassicSimilarity], result of:
            0.028305238 = score(doc=2075,freq=1.0), product of:
              0.10914676 = queryWeight, product of:
                1.5152681 = boost
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.015189849 = queryNorm
              0.2593319 = fieldWeight in 2075, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2075)
          0.036114458 = weight(abstract_txt:creation in 2075) [ClassicSimilarity], result of:
            0.036114458 = score(doc=2075,freq=1.0), product of:
              0.12839665 = queryWeight, product of:
                1.6434667 = boost
                5.14327 = idf(docFreq=704, maxDocs=44421)
                0.015189849 = queryNorm
              0.2812726 = fieldWeight in 2075, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.14327 = idf(docFreq=704, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2075)
          0.5286253 = weight(abstract_txt:dataset in 2075) [ClassicSimilarity], result of:
            0.5286253 = score(doc=2075,freq=5.0), product of:
              0.64801043 = queryWeight, product of:
                6.3949285 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.015189849 = queryNorm
              0.81576663 = fieldWeight in 2075, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2075)
        0.2 = coord(5/25)
    
  4. Jiao, H.; Qiu, Y.; Ma, X.; Yang, B.: Dissemination effect of data papers on scientific datasets (2024) 0.12
    0.119337164 = sum of:
      0.119337164 = product of:
        0.5966858 = sum of:
          0.0952294 = weight(abstract_txt:datasets in 2206) [ClassicSimilarity], result of:
            0.0952294 = score(doc=2206,freq=5.0), product of:
              0.10406045 = queryWeight, product of:
                1.0461932 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.015189849 = queryNorm
              0.9151354 = fieldWeight in 2206, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0625 = fieldNorm(doc=2206)
          0.054903604 = weight(abstract_txt:biomedical in 2206) [ClassicSimilarity], result of:
            0.054903604 = score(doc=2206,freq=1.0), product of:
              0.12326191 = queryWeight, product of:
                1.1386323 = boost
                7.1267567 = idf(docFreq=96, maxDocs=44421)
                0.015189849 = queryNorm
              0.4454223 = fieldWeight in 2206, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.1267567 = idf(docFreq=96, maxDocs=44421)
                0.0625 = fieldNorm(doc=2206)
          0.020297855 = weight(abstract_txt:research in 2206) [ClassicSimilarity], result of:
            0.020297855 = score(doc=2206,freq=2.0), product of:
              0.07268177 = queryWeight, product of:
                1.5144064 = boost
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.015189849 = queryNorm
              0.27927023 = fieldWeight in 2206, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.159582 = idf(docFreq=5124, maxDocs=44421)
                0.0625 = fieldNorm(doc=2206)
          0.04416125 = weight(abstract_txt:publications in 2206) [ClassicSimilarity], result of:
            0.04416125 = score(doc=2206,freq=1.0), product of:
              0.13431749 = queryWeight, product of:
                1.6809328 = boost
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.015189849 = queryNorm
              0.32878256 = fieldWeight in 2206, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.0625 = fieldNorm(doc=2206)
          0.3820937 = weight(abstract_txt:dataset in 2206) [ClassicSimilarity], result of:
            0.3820937 = score(doc=2206,freq=2.0), product of:
              0.64801043 = queryWeight, product of:
                6.3949285 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.015189849 = queryNorm
              0.5896413 = fieldWeight in 2206, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0625 = fieldNorm(doc=2206)
        0.2 = coord(5/25)
    
  5. Mai, F.; Galke, L.; Scherp, A.: Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text (2018) 0.11
    0.11472616 = sum of:
      0.11472616 = product of:
        0.5736308 = sum of:
          0.052699815 = weight(abstract_txt:datasets in 93) [ClassicSimilarity], result of:
            0.052699815 = score(doc=93,freq=2.0), product of:
              0.10406045 = queryWeight, product of:
                1.0461932 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.015189849 = queryNorm
              0.5064346 = fieldWeight in 93, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0546875 = fieldNorm(doc=93)
          0.044513278 = weight(abstract_txt:economics in 93) [ClassicSimilarity], result of:
            0.044513278 = score(doc=93,freq=1.0), product of:
              0.1171519 = queryWeight, product of:
                1.1100531 = boost
                6.9478774 = idf(docFreq=115, maxDocs=44421)
                0.015189849 = queryNorm
              0.37996206 = fieldWeight in 93, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9478774 = idf(docFreq=115, maxDocs=44421)
                0.0546875 = fieldNorm(doc=93)
          0.028305238 = weight(abstract_txt:learning in 93) [ClassicSimilarity], result of:
            0.028305238 = score(doc=93,freq=1.0), product of:
              0.10914676 = queryWeight, product of:
                1.5152681 = boost
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.015189849 = queryNorm
              0.2593319 = fieldWeight in 93, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7420692 = idf(docFreq=1052, maxDocs=44421)
                0.0546875 = fieldNorm(doc=93)
          0.03864109 = weight(abstract_txt:publications in 93) [ClassicSimilarity], result of:
            0.03864109 = score(doc=93,freq=1.0), product of:
              0.13431749 = queryWeight, product of:
                1.6809328 = boost
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.015189849 = queryNorm
              0.28768474 = fieldWeight in 93, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.260521 = idf(docFreq=626, maxDocs=44421)
                0.0546875 = fieldNorm(doc=93)
          0.40947136 = weight(abstract_txt:dataset in 93) [ClassicSimilarity], result of:
            0.40947136 = score(doc=93,freq=3.0), product of:
              0.64801043 = queryWeight, product of:
                6.3949285 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.015189849 = queryNorm
              0.63189006 = fieldWeight in 93, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.0546875 = fieldNorm(doc=93)
        0.2 = coord(5/25)
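
The content scores combine three factors per matching term (queryWeight, fieldWeight, and a final coord factor). As a rough check, the sketch below recomputes the total for the first content match (doc=2026) from the boost, docFreq, freq, fieldNorm, and queryNorm values shown in the listing. It assumes the ClassicSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the helper names are hypothetical, not Lucene API calls.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.015189849   # queryNorm from the listing
FIELD_NORM = 0.0625        # fieldNorm(doc=2026) from the listing


def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency as defined by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float) -> float:
    """score = queryWeight * fieldWeight for one matching term."""
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


# (term, freq, docFreq, boost), copied from the explanation for doc=2026.
matched_terms = [
    ("learned", 1.0, 170, 1.0480511),
    ("supervised", 1.0, 68, 1.1930503),
    ("research", 1.0, 5124, 1.5144064),
    ("learning", 3.0, 1052, 1.5152681),
    ("dataset", 7.0, 152, 6.3949285),
]

term_sum = sum(term_score(freq, df, boost) for _, freq, df, boost in matched_terms)
coord = len(matched_terms) / 25   # 5 of the 25 query terms matched
total = term_sum * coord

print(round(term_sum, 4), round(total, 4))
```

The term "dataset" dominates the sum because it has both the largest boost and the highest frequency in the matched abstract. The coord factor then scales the sum by the fraction of query terms that matched.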