Document (#43670)

Author
Collard, J.
Paiva, V. de
Fong, B.
Subrahmanian, E.
Title
Extracting mathematical concepts from text
Source
arXiv, [https://doi.org/10.48550/arXiv.2208.13830]
Year
2022
Abstract
We investigate different systems for extracting mathematical entities from English texts in the mathematical field of category theory as a first step for constructing a mathematical knowledge graph. We consider four different term extractors and compare their results. This small experiment showcases some of the issues with the construction and evaluation of terms extracted from noisy domain text. We also make available two open corpora in research mathematics, in particular in category theory: a small corpus of 755 abstracts from the journal TAC (3188 sentences), and a larger corpus from the nLab community wiki (15,000 sentences).
Theme
Computational linguistics
Knowledge representation
Field
Mathematics

Similar documents (author)

  1. Fong, W.W.: Searching the World Wide Web (1996) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:fong in 6665) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 6665, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=6665)
    
  2. Fong, K.Y.: Interpretive object-oriented facility which can access precompiled classes (1995) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:fong in 6902) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 6902, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=6902)
    
  3. Fong, A.C.M.: Mining a Web citation database for document clustering (2002) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:fong in 4940) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 4940, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=4940)
    
  4. Tho, Q.T.; Hui, S.C.; Fong, A.C.M.: A citation-based document retrieval system for finding research expertise (2007) 3.72
    3.7161405 = sum of:
      3.7161405 = weight(author_txt:fong in 1956) [ClassicSimilarity], result of:
        3.7161405 = fieldWeight in 1956, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.375 = fieldNorm(doc=1956)
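
The author-match scores listed above are each a product of three factors: tf, idf and fieldNorm. A minimal sketch that reproduces them follows. It assumes tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the idf form is inferred from the listed values (docFreq=5, maxDocs=44421 gives 9.909708) and is not confirmed for this installation. The function names are hypothetical.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed form, inferred from the listing)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explanation tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author term "fong", docFreq=5, maxDocs=44421.
score_6665 = field_weight(1.0, 5, 44421, 0.625)  # listed as 6.1935673
score_1956 = field_weight(1.0, 5, 44421, 0.375)  # listed as 3.7161405
print(score_6665, score_1956)
```

The only factor that differs between the first three hits and the fourth is fieldNorm (0.625 versus 0.375), which is lower for the record with three authors in the author field.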
    

Similar documents (content)

  1. Yang, Y.; Lu, Q.; Zhao, T.: A delimiter-based general approach for Chinese term extraction (2009) 0.17
    0.1721968 = sum of:
      0.1721968 = product of:
        0.538115 = sum of:
          0.07459924 = weight(abstract_txt:step in 302) [ClassicSimilarity], result of:
            0.07459924 = score(doc=302,freq=3.0), product of:
              0.11338938 = queryWeight, product of:
                1.0780051 = boost
                6.0774503 = idf(docFreq=276, maxDocs=44421)
                0.017307334 = queryNorm
              0.65790325 = fieldWeight in 302, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.0774503 = idf(docFreq=276, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.04468291 = weight(abstract_txt:extracted in 302) [ClassicSimilarity], result of:
            0.04468291 = score(doc=302,freq=1.0), product of:
              0.11620303 = queryWeight, product of:
                1.091298 = boost
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.017307334 = queryNorm
              0.38452446 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.018809956 = weight(abstract_txt:different in 302) [ClassicSimilarity], result of:
            0.018809956 = score(doc=302,freq=1.0), product of:
              0.082235314 = queryWeight, product of:
                1.2983111 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.017307334 = queryNorm
              0.2287333 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.025320256 = weight(abstract_txt:text in 302) [ClassicSimilarity], result of:
            0.025320256 = score(doc=302,freq=1.0), product of:
              0.10025635 = queryWeight, product of:
                1.4335259 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017307334 = queryNorm
              0.25255513 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.08675976 = weight(abstract_txt:corpus in 302) [ClassicSimilarity], result of:
            0.08675976 = score(doc=302,freq=1.0), product of:
              0.22786559 = queryWeight, product of:
                2.1611702 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.017307334 = queryNorm
              0.38074973 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.028506115 = weight(abstract_txt:from in 302) [ClassicSimilarity], result of:
            0.028506115 = score(doc=302,freq=2.0), product of:
              0.11687686 = queryWeight, product of:
                2.4472814 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.017307334 = queryNorm
              0.2438987 = fieldWeight in 302, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.12775752 = weight(abstract_txt:extracting in 302) [ClassicSimilarity], result of:
            0.12775752 = score(doc=302,freq=1.0), product of:
              0.29493353 = queryWeight, product of:
                2.4587348 = boost
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.017307334 = queryNorm
              0.43317392 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
          0.13167928 = weight(abstract_txt:sentences in 302) [ClassicSimilarity], result of:
            0.13167928 = score(doc=302,freq=1.0), product of:
              0.30093879 = queryWeight, product of:
                2.4836402 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.017307334 = queryNorm
              0.4375617 = fieldWeight in 302, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=302)
        0.32 = coord(8/25)
    
  2. Ling, X.; Jiang, J.; He, X.; Mei, Q.; Zhai, C.; Schatz, B.: Generating gene summaries from biomedical literature : a study of semi-structured summarization (2007) 0.15
    0.14515448 = sum of:
      0.14515448 = product of:
        0.51840883 = sum of:
          0.034768876 = weight(abstract_txt:experiment in 1946) [ClassicSimilarity], result of:
            0.034768876 = score(doc=1946,freq=1.0), product of:
              0.09830681 = queryWeight, product of:
                1.003752 = boost
                5.658835 = idf(docFreq=420, maxDocs=44421)
                0.017307334 = queryNorm
              0.35367718 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.658835 = idf(docFreq=420, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.04468291 = weight(abstract_txt:extracted in 1946) [ClassicSimilarity], result of:
            0.04468291 = score(doc=1946,freq=1.0), product of:
              0.11620303 = queryWeight, product of:
                1.091298 = boost
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.017307334 = queryNorm
              0.38452446 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.018809956 = weight(abstract_txt:different in 1946) [ClassicSimilarity], result of:
            0.018809956 = score(doc=1946,freq=1.0), product of:
              0.082235314 = queryWeight, product of:
                1.2983111 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.017307334 = queryNorm
              0.2287333 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.03580825 = weight(abstract_txt:text in 1946) [ClassicSimilarity], result of:
            0.03580825 = score(doc=1946,freq=2.0), product of:
              0.10025635 = queryWeight, product of:
                1.4335259 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017307334 = queryNorm
              0.3571669 = fieldWeight in 1946, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.028506115 = weight(abstract_txt:from in 1946) [ClassicSimilarity], result of:
            0.028506115 = score(doc=1946,freq=2.0), product of:
              0.11687686 = queryWeight, product of:
                2.4472814 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.017307334 = queryNorm
              0.2438987 = fieldWeight in 1946, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.12775752 = weight(abstract_txt:extracting in 1946) [ClassicSimilarity], result of:
            0.12775752 = score(doc=1946,freq=1.0), product of:
              0.29493353 = queryWeight, product of:
                2.4587348 = boost
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.017307334 = queryNorm
              0.43317392 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.2280752 = weight(abstract_txt:sentences in 1946) [ClassicSimilarity], result of:
            0.2280752 = score(doc=1946,freq=3.0), product of:
              0.30093879 = queryWeight, product of:
                2.4836402 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.017307334 = queryNorm
              0.7578791 = fieldWeight in 1946, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
        0.28 = coord(7/25)
    
  3. Stathopoulos, Y.; Baker, S.; Rei, M.; Teufel, S.: Variable typing : assigning meaning to variables in mathematical text (2018) 0.14
    0.14178552 = sum of:
      0.14178552 = product of:
        0.7089276 = sum of:
          0.055853635 = weight(abstract_txt:extracted in 432) [ClassicSimilarity], result of:
            0.055853635 = score(doc=432,freq=1.0), product of:
              0.11620303 = queryWeight, product of:
                1.091298 = boost
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.017307334 = queryNorm
              0.48065558 = fieldWeight in 432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1523914 = idf(docFreq=256, maxDocs=44421)
                0.078125 = fieldNorm(doc=432)
          0.023512444 = weight(abstract_txt:different in 432) [ClassicSimilarity], result of:
            0.023512444 = score(doc=432,freq=1.0), product of:
              0.082235314 = queryWeight, product of:
                1.2983111 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.017307334 = queryNorm
              0.28591663 = fieldWeight in 432, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.078125 = fieldNorm(doc=432)
          0.044760313 = weight(abstract_txt:text in 432) [ClassicSimilarity], result of:
            0.044760313 = score(doc=432,freq=2.0), product of:
              0.10025635 = queryWeight, product of:
                1.4335259 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017307334 = queryNorm
              0.4464586 = fieldWeight in 432, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=432)
          0.035632644 = weight(abstract_txt:from in 432) [ClassicSimilarity], result of:
            0.035632644 = score(doc=432,freq=2.0), product of:
              0.11687686 = queryWeight, product of:
                2.4472814 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.017307334 = queryNorm
              0.30487338 = fieldWeight in 432, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.078125 = fieldNorm(doc=432)
          0.5491685 = weight(abstract_txt:mathematical in 432) [ClassicSimilarity], result of:
            0.5491685 = score(doc=432,freq=5.0), product of:
              0.49508935 = queryWeight, product of:
                4.5051203 = boost
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.017307334 = queryNorm
              1.1092311 = fieldWeight in 432, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.078125 = fieldNorm(doc=432)
        0.2 = coord(5/25)
    
  4. Li, J.; Zhang, Z.; Li, X.; Chen, H.: Kernel-based learning for biomedical relation extraction (2008) 0.14
    0.14063069 = sum of:
      0.14063069 = product of:
        0.50225246 = sum of:
          0.04757844 = weight(abstract_txt:entities in 2611) [ClassicSimilarity], result of:
            0.04757844 = score(doc=2611,freq=1.0), product of:
              0.104421504 = queryWeight, product of:
                1.0344979 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.017307334 = queryNorm
              0.45563832 = fieldWeight in 2611, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.08262204 = weight(abstract_txt:corpora in 2611) [ClassicSimilarity], result of:
            0.08262204 = score(doc=2611,freq=1.0), product of:
              0.1508622 = queryWeight, product of:
                1.24344 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.017307334 = queryNorm
              0.5476656 = fieldWeight in 2611, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.023512444 = weight(abstract_txt:different in 2611) [ClassicSimilarity], result of:
            0.023512444 = score(doc=2611,freq=1.0), product of:
              0.082235314 = queryWeight, product of:
                1.2983111 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.017307334 = queryNorm
              0.28591663 = fieldWeight in 2611, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.044760313 = weight(abstract_txt:text in 2611) [ClassicSimilarity], result of:
            0.044760313 = score(doc=2611,freq=2.0), product of:
              0.10025635 = queryWeight, product of:
                1.4335259 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017307334 = queryNorm
              0.4464586 = fieldWeight in 2611, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.108449705 = weight(abstract_txt:corpus in 2611) [ClassicSimilarity], result of:
            0.108449705 = score(doc=2611,freq=1.0), product of:
              0.22786559 = queryWeight, product of:
                2.1611702 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.017307334 = queryNorm
              0.47593716 = fieldWeight in 2611, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.035632644 = weight(abstract_txt:from in 2611) [ClassicSimilarity], result of:
            0.035632644 = score(doc=2611,freq=2.0), product of:
              0.11687686 = queryWeight, product of:
                2.4472814 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.017307334 = queryNorm
              0.30487338 = fieldWeight in 2611, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
          0.15969689 = weight(abstract_txt:extracting in 2611) [ClassicSimilarity], result of:
            0.15969689 = score(doc=2611,freq=1.0), product of:
              0.29493353 = queryWeight, product of:
                2.4587348 = boost
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.017307334 = queryNorm
              0.5414674 = fieldWeight in 2611, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.930783 = idf(docFreq=117, maxDocs=44421)
                0.078125 = fieldNorm(doc=2611)
        0.28 = coord(7/25)
    
  5. Fraser, C.: Mathematics in library and review classification systems : an historical overview (2020) 0.14
    0.13960917 = sum of:
      0.13960917 = product of:
        0.69804585 = sum of:
          0.05336334 = weight(abstract_txt:larger in 900) [ClassicSimilarity], result of:
            0.05336334 = score(doc=900,freq=1.0), product of:
              0.11272282 = queryWeight, product of:
                1.0748318 = boost
                6.059561 = idf(docFreq=281, maxDocs=44421)
                0.017307334 = queryNorm
              0.4734032 = fieldWeight in 900, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.059561 = idf(docFreq=281, maxDocs=44421)
                0.078125 = fieldNorm(doc=900)
          0.14240596 = weight(abstract_txt:mathematics in 900) [ClassicSimilarity], result of:
            0.14240596 = score(doc=900,freq=4.0), product of:
              0.13662031 = queryWeight, product of:
                1.1832929 = boost
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.017307334 = queryNorm
              1.0423484 = fieldWeight in 900, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.6710296 = idf(docFreq=152, maxDocs=44421)
                0.078125 = fieldNorm(doc=900)
          0.033251617 = weight(abstract_txt:different in 900) [ClassicSimilarity], result of:
            0.033251617 = score(doc=900,freq=2.0), product of:
              0.082235314 = queryWeight, product of:
                1.2983111 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.017307334 = queryNorm
              0.40434718 = fieldWeight in 900, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.078125 = fieldNorm(doc=900)
          0.043640897 = weight(abstract_txt:from in 900) [ClassicSimilarity], result of:
            0.043640897 = score(doc=900,freq=3.0), product of:
              0.11687686 = queryWeight, product of:
                2.4472814 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.017307334 = queryNorm
              0.3733921 = fieldWeight in 900, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.078125 = fieldNorm(doc=900)
          0.42538407 = weight(abstract_txt:mathematical in 900) [ClassicSimilarity], result of:
            0.42538407 = score(doc=900,freq=3.0), product of:
              0.49508935 = queryWeight, product of:
                4.5051203 = boost
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.017307334 = queryNorm
              0.8592067 = fieldWeight in 900, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.078125 = fieldNorm(doc=900)
        0.2 = coord(5/25)
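
The content-match scores above combine per-term scores with a coordination factor. Each term contributes queryWeight * fieldWeight, where queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm; the sum is then multiplied by coord (matched terms / query terms). The sketch below recomputes the score of the last entry (doc=900) from the numbers in its listing. The idf values are taken as given rather than recomputed, tf = sqrt(termFreq) is assumed, and all function and variable names are hypothetical.

```python
import math

QUERY_NORM = 0.017307334  # queryNorm from the listing
FIELD_NORM = 0.078125     # fieldNorm(doc=900) from the listing

# (term, boost, idf, termFreq), copied from the explanation for doc=900
TERMS_900 = [
    ("larger", 1.0748318, 6.059561, 1.0),
    ("mathematics", 1.1832929, 6.6710296, 4.0),
    ("different", 1.2983111, 3.6597328, 2.0),
    ("from", 2.4472814, 2.759399, 3.0),
    ("mathematical", 4.5051203, 6.3496094, 3.0),
]


def term_score(boost, idf, freq, query_norm, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def document_score(terms, matched, total, query_norm, field_norm):
    """Sum of the term scores, scaled by coord(matched/total)."""
    raw = sum(term_score(b, i, f, query_norm, field_norm) for _, b, i, f in terms)
    return raw * (matched / total)


score_900 = document_score(TERMS_900, 5, 25, QUERY_NORM, FIELD_NORM)
print(score_900)  # listed value: 0.13960917
```

Because coord scales the whole sum, a record matching only 5 of the 25 query terms (coord 0.2) needs a much larger raw sum to rank alongside one matching 8 of 25 (coord 0.32).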