Document (#41310)

Author
Toepfer, M.
Seifert, C.
Title
Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints
Issue
[Submitted on 7 Jun 2018].
Source
https://arxiv.org/abs/1806.02743
Abstract
Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
Content
This is an authors' manuscript version of a paper accepted for proceedings of TPDL-2018, Porto, Portugal, Sept 10-13. The final authenticated publication is available online at https://doi.org/will be added as soon as available.
Theme
Automatic indexing
Retrieval studies

Similar documents (author)

  1. Seifert, S.: Johann Samuel Ersch, der Begründer der neueren Bibliographie in Deutschland : seine Entwicklung bis zum 'Allgemeinen Repertorium der Literatur' (1968) 6.01
    6.0137663 = sum of:
      6.0137663 = weight(author_txt:seifert in 4967) [ClassicSimilarity], result of:
        6.0137663 = fieldWeight in 4967, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.625 = fieldNorm(doc=4967)
    
  2. Seifert, S.: Universelle bibliographische Verzeichnisse an der Wende vom 18. zum 19. Jahrhundert : historische Analyse und aktuelle Schlußfolgerungen (1987) 6.01
    6.0137663 = sum of:
      6.0137663 = weight(author_txt:seifert in 2539) [ClassicSimilarity], result of:
        6.0137663 = fieldWeight in 2539, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.625 = fieldNorm(doc=2539)
    
  3. Seifert, W.: Herausforderungen bei der Abbildung von Regionalstudien in der Regensburger Verbundklassifikation (2018) 6.01
    6.0137663 = sum of:
      6.0137663 = weight(author_txt:seifert in 600) [ClassicSimilarity], result of:
        6.0137663 = fieldWeight in 600, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.625 = fieldNorm(doc=600)
    
  4. Graner, B.; Fresenborg, M.; Lühr, A.; Seifert, J.; Sünkler, S.: Schriftgutverwaltung an der Hochschule : Entwicklung eines aufgabenorientierten Aktenplans für die Hochschule für Angewandte Wissenschaften Hamburg (2009) 3.01
    3.0068831 = sum of:
      3.0068831 = weight(author_txt:seifert in 121) [ClassicSimilarity], result of:
        3.0068831 = fieldWeight in 121, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.3125 = fieldNorm(doc=121)
    
  5. Böhm, A.; Seifert, C.; Schlötterer, J.; Granitzer, M.: Identifying tweets from the economic domain (2017) 3.01
    3.0068831 = sum of:
      3.0068831 = weight(author_txt:seifert in 4495) [ClassicSimilarity], result of:
        3.0068831 = fieldWeight in 4495, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.3125 = fieldNorm(doc=4495)
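
The author-match scores above can be reproduced by hand. The following is a minimal Python sketch, assuming the classic Lucene TF-IDF definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the idf value listed above. The function names are illustrative and not part of any library, and fieldNorm is copied from the explain output rather than recomputed.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term frequency."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed form: 1 + ln(maxDocs / (docFreq + 1)))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain trees above."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Entry 1 (fieldNorm 0.625); the explain output reports 6.0137663.
print(field_weight(1.0, 7, 44421, 0.625))

# Entry 4 (fieldNorm 0.3125); the explain output reports 3.0068831.
print(field_weight(1.0, 7, 44421, 0.3125))
```

Entries 4 and 5 score half as much as entries 1 to 3 only because their fieldNorm is halved (0.3125 versus 0.625); tf and idf are identical across all five entries. These records have more names in the author field, which is consistent with a length-based norm.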
    

Similar documents (content)

  1. Costers, L.: ¬The electronic library and its organizational management (1994) 0.13
    0.13275026 = sum of:
      0.13275026 = product of:
        0.66375124 = sum of:
          0.16294654 = weight(abstract_txt:layered in 1282) [ClassicSimilarity], result of:
            0.16294654 = score(doc=1282,freq=1.0), product of:
              0.1808943 = queryWeight, product of:
                1.0896229 = boost
                8.235732 = idf(docFreq=31, maxDocs=44421)
                0.020157956 = queryNorm
              0.9007832 = fieldWeight in 1282, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.235732 = idf(docFreq=31, maxDocs=44421)
                0.109375 = fieldNorm(doc=1282)
          0.07493743 = weight(abstract_txt:level in 1282) [ClassicSimilarity], result of:
            0.07493743 = score(doc=1282,freq=2.0), product of:
              0.107777305 = queryWeight, product of:
                1.1894397 = boost
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.020157956 = queryNorm
              0.69529885 = fieldWeight in 1282, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.109375 = fieldNorm(doc=1282)
          0.045821894 = weight(abstract_txt:approach in 1282) [ClassicSimilarity], result of:
            0.045821894 = score(doc=1282,freq=1.0), product of:
              0.1119826 = queryWeight, product of:
                1.4849083 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.020157956 = queryNorm
              0.40918761 = fieldWeight in 1282, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.109375 = fieldNorm(doc=1282)
          0.12696025 = weight(abstract_txt:where in 1282) [ClassicSimilarity], result of:
            0.12696025 = score(doc=1282,freq=2.0), product of:
              0.1753357 = queryWeight, product of:
                1.8580593 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.020157956 = queryNorm
              0.7240981 = fieldWeight in 1282, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.109375 = fieldNorm(doc=1282)
          0.2530851 = weight(abstract_txt:quality in 1282) [ClassicSimilarity], result of:
            0.2530851 = score(doc=1282,freq=3.0), product of:
              0.28764406 = queryWeight, product of:
                3.0723908 = boost
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.020157956 = queryNorm
              0.87985516 = fieldWeight in 1282, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.109375 = fieldNorm(doc=1282)
        0.2 = coord(5/25)
    
  2. Buchholz, K.: Criteria for the analysis of scientific quality (1995) 0.13
    0.13019288 = sum of:
      0.13019288 = product of:
        0.54247034 = sum of:
          0.0899681 = weight(abstract_txt:notably in 2518) [ClassicSimilarity], result of:
            0.0899681 = score(doc=2518,freq=1.0), product of:
              0.15236054 = queryWeight, product of:
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.020157956 = queryNorm
              0.59049475 = fieldWeight in 2518, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
          0.03784912 = weight(abstract_txt:level in 2518) [ClassicSimilarity], result of:
            0.03784912 = score(doc=2518,freq=1.0), product of:
              0.107777305 = queryWeight, product of:
                1.1894397 = boost
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.020157956 = queryNorm
              0.35117894 = fieldWeight in 2518, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4950905 = idf(docFreq=1347, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
          0.11491524 = weight(abstract_txt:short in 2518) [ClassicSimilarity], result of:
            0.11491524 = score(doc=2518,freq=2.0), product of:
              0.1793626 = queryWeight, product of:
                1.5344216 = boost
                5.7988343 = idf(docFreq=365, maxDocs=44421)
                0.020157956 = queryNorm
              0.64068675 = fieldWeight in 2518, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7988343 = idf(docFreq=365, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
          0.09128333 = weight(abstract_txt:indicators in 2518) [ClassicSimilarity], result of:
            0.09128333 = score(doc=2518,freq=1.0), product of:
              0.19382857 = queryWeight, product of:
                1.595099 = boost
                6.0281444 = idf(docFreq=290, maxDocs=44421)
                0.020157956 = queryNorm
              0.4709488 = fieldWeight in 2518, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0281444 = idf(docFreq=290, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
          0.060852338 = weight(abstract_txt:content in 2518) [ClassicSimilarity], result of:
            0.060852338 = score(doc=2518,freq=1.0), product of:
              0.18635955 = queryWeight, product of:
                2.2119207 = boost
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.020157956 = queryNorm
              0.3265319 = fieldWeight in 2518, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
          0.14760223 = weight(abstract_txt:quality in 2518) [ClassicSimilarity], result of:
            0.14760223 = score(doc=2518,freq=2.0), product of:
              0.28764406 = queryWeight, product of:
                3.0723908 = boost
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.020157956 = queryNorm
              0.51314193 = fieldWeight in 2518, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.078125 = fieldNorm(doc=2518)
        0.24 = coord(6/25)
    
  3. Harrow, J.; Wickersham, L.; Rotherham, S.; Ella Farnsworth, E.; McElhenny, G.: Contextual depth projection in Large Language Models through semantic lattice frameworks (2024) 0.10
    0.09923627 = sum of:
      0.09923627 = product of:
        0.41348448 = sum of:
          0.08002581 = weight(abstract_txt:gains in 2403) [ClassicSimilarity], result of:
            0.08002581 = score(doc=2403,freq=1.0), product of:
              0.16352099 = queryWeight, product of:
                1.035978 = boost
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.020157956 = queryNorm
              0.48939165 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8302665 = idf(docFreq=47, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
          0.09649124 = weight(abstract_txt:ensures in 2403) [ClassicSimilarity], result of:
            0.09649124 = score(doc=2403,freq=1.0), product of:
              0.18524453 = queryWeight, product of:
                1.1026468 = boost
                8.334172 = idf(docFreq=28, maxDocs=44421)
                0.020157956 = queryNorm
              0.52088577 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.334172 = idf(docFreq=28, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
          0.056179654 = weight(abstract_txt:precision in 2403) [ClassicSimilarity], result of:
            0.056179654 = score(doc=2403,freq=1.0), product of:
              0.1627357 = queryWeight, product of:
                1.4615718 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020157956 = queryNorm
              0.3452202 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
          0.026183939 = weight(abstract_txt:approach in 2403) [ClassicSimilarity], result of:
            0.026183939 = score(doc=2403,freq=1.0), product of:
              0.1119826 = queryWeight, product of:
                1.4849083 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.020157956 = queryNorm
              0.2338215 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
          0.10330415 = weight(abstract_txt:constraints in 2403) [ClassicSimilarity], result of:
            0.10330415 = score(doc=2403,freq=1.0), product of:
              0.24425417 = queryWeight, product of:
                1.7906048 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.020157956 = queryNorm
              0.4229371 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
          0.051299684 = weight(abstract_txt:where in 2403) [ClassicSimilarity], result of:
            0.051299684 = score(doc=2403,freq=1.0), product of:
              0.1753357 = queryWeight, product of:
                1.8580593 = boost
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.020157956 = queryNorm
              0.2925798 = fieldWeight in 2403, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.681277 = idf(docFreq=1118, maxDocs=44421)
                0.0625 = fieldNorm(doc=2403)
        0.24 = coord(6/25)
    
  4. Kenter, T.; Balog, K.; Rijke, M. de: Evaluating document filtering systems over time (2015) 0.09
    0.09288547 = sum of:
      0.09288547 = product of:
        0.46442735 = sum of:
          0.056577764 = weight(abstract_txt:document in 3672) [ClassicSimilarity], result of:
            0.056577764 = score(doc=3672,freq=6.0), product of:
              0.09835691 = queryWeight, product of:
                1.1362691 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.020157956 = queryNorm
              0.57522917 = fieldWeight in 3672, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=3672)
          0.069518775 = weight(abstract_txt:precision in 3672) [ClassicSimilarity], result of:
            0.069518775 = score(doc=3672,freq=2.0), product of:
              0.1627357 = queryWeight, product of:
                1.4615718 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020157956 = queryNorm
              0.42718822 = fieldWeight in 3672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.0546875 = fieldNorm(doc=3672)
          0.07845922 = weight(abstract_txt:recall in 3672) [ClassicSimilarity], result of:
            0.07845922 = score(doc=3672,freq=2.0), product of:
              0.17640495 = queryWeight, product of:
                1.5217178 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020157956 = queryNorm
              0.44476765 = fieldWeight in 3672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.0546875 = fieldNorm(doc=3672)
          0.05688014 = weight(abstract_txt:short in 3672) [ClassicSimilarity], result of:
            0.05688014 = score(doc=3672,freq=1.0), product of:
              0.1793626 = queryWeight, product of:
                1.5344216 = boost
                5.7988343 = idf(docFreq=365, maxDocs=44421)
                0.020157956 = queryNorm
              0.31712374 = fieldWeight in 3672, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7988343 = idf(docFreq=365, maxDocs=44421)
                0.0546875 = fieldNorm(doc=3672)
          0.20299144 = weight(abstract_txt:estimation in 3672) [ClassicSimilarity], result of:
            0.20299144 = score(doc=3672,freq=2.0), product of:
              0.3324553 = queryWeight, product of:
                2.0890334 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.020157956 = queryNorm
              0.61058265 = fieldWeight in 3672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0546875 = fieldNorm(doc=3672)
        0.2 = coord(5/25)
    
  5. Daranyi, S.; Zawiasa, R.; Hajnal, Z.: Conceptual mapping of a database in the humanities : first results of an experiment with Sophia (1996) 0.09
    0.090439506 = sum of:
      0.090439506 = product of:
        0.56524694 = sum of:
          0.15148416 = weight(abstract_txt:configurations in 4565) [ClassicSimilarity], result of:
            0.15148416 = score(doc=4565,freq=1.0), product of:
              0.17230833 = queryWeight, product of:
                1.0634495 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.020157956 = queryNorm
              0.8791459 = fieldWeight in 4565, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.109375 = fieldNorm(doc=4565)
          0.14783776 = weight(abstract_txt:texts in 4565) [ClassicSimilarity], result of:
            0.14783776 = score(doc=4565,freq=2.0), product of:
              0.169532 = queryWeight, product of:
                1.4917793 = boost
                5.6376824 = idf(docFreq=429, maxDocs=44421)
                0.020157956 = queryNorm
              0.87203455 = fieldWeight in 4565, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6376824 = idf(docFreq=429, maxDocs=44421)
                0.109375 = fieldNorm(doc=4565)
          0.18073176 = weight(abstract_txt:indicators in 4565) [ClassicSimilarity], result of:
            0.18073176 = score(doc=4565,freq=2.0), product of:
              0.19382857 = queryWeight, product of:
                1.595099 = boost
                6.0281444 = idf(docFreq=290, maxDocs=44421)
                0.020157956 = queryNorm
              0.932431 = fieldWeight in 4565, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0281444 = idf(docFreq=290, maxDocs=44421)
                0.109375 = fieldNorm(doc=4565)
          0.08519328 = weight(abstract_txt:content in 4565) [ClassicSimilarity], result of:
            0.08519328 = score(doc=4565,freq=1.0), product of:
              0.18635955 = queryWeight, product of:
                2.2119207 = boost
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.020157956 = queryNorm
              0.45714468 = fieldWeight in 4565, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.109375 = fieldNorm(doc=4565)
        0.16 = coord(4/25)
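
The content-match scores combine several terms. As a hedged sketch, assuming the same classic TF-IDF definitions (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), each term contributes queryWeight * fieldWeight, and the sum is scaled by the coord factor (matched terms / query terms). The Python below recomputes entry 5 from the values in its explain tree; the function names and the tuple layout are illustrative assumptions.

```python
import math

MAX_DOCS = 44421          # maxDocs from the explain output
QUERY_NORM = 0.020157956  # queryNorm from the explain output


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency (assumed form: 1 + ln(maxDocs / (docFreq + 1)))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, freq: float, doc_freq: int, field_norm: float) -> float:
    """Score of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, query_terms: int) -> float:
    """Sum of the term scores, scaled by coord = matched terms / query terms."""
    coord = len(terms) / query_terms
    return coord * sum(term_score(*t) for t in terms)


# Entry 5 (doc 4565): (boost, termFreq, docFreq, fieldNorm) for each matched term.
entry_5 = [
    (1.0634495, 1.0, 38, 0.109375),    # configurations
    (1.4917793, 2.0, 429, 0.109375),   # texts
    (1.595099, 2.0, 290, 0.109375),    # indicators
    (2.2119207, 1.0, 1847, 0.109375),  # content
]

# The explain output reports 0.090439506 for this entry.
print(document_score(entry_5, query_terms=25))
```

The coord factor explains why entry 5 ranks last even though its term sum (0.565) exceeds that of entries 2 to 4: it matches only 4 of the 25 query terms.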