Document (#39723)

Author
Borodin, Y.
Polishchuk, V.
Mahmud, J.
Ramakrishnan, I.V.
Stent, A.
Title
Live and learn from mistakes : a lightweight system for document classification
Source
Information processing and management. 49(2013) no.1, pp. 83-98
Year
2013
Abstract
We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
Content
Cf.: doi:10.1016/j.ipm.2012.02.001.
Theme
Automatic classification

Similar documents (content)

  1. Wang, J.: An extensive study on automated Dewey Decimal Classification (2009) 0.19
    0.19164607 = sum of:
      0.19164607 = product of:
        0.5323502 = sum of:
          0.058822013 = weight(abstract_txt:distribution in 3172) [ClassicSimilarity], result of:
            0.058822013 = score(doc=3172,freq=2.0), product of:
              0.11891129 = queryWeight, product of:
                1.2788578 = boost
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.016614184 = queryNorm
              0.4946714 = fieldWeight in 3172, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.046891674 = weight(abstract_txt:classes in 3172) [ClassicSimilarity], result of:
            0.046891674 = score(doc=3172,freq=1.0), product of:
              0.12880626 = queryWeight, product of:
                1.3310035 = boost
                5.8247695 = idf(docFreq=354, maxDocs=44218)
                0.016614184 = queryNorm
              0.3640481 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8247695 = idf(docFreq=354, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.048528057 = weight(abstract_txt:class in 3172) [ClassicSimilarity], result of:
            0.048528057 = score(doc=3172,freq=1.0), product of:
              0.13178574 = queryWeight, product of:
                1.3463095 = boost
                5.8917522 = idf(docFreq=331, maxDocs=44218)
                0.016614184 = queryNorm
              0.36823452 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8917522 = idf(docFreq=331, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.050488334 = weight(abstract_txt:accuracy in 3172) [ClassicSimilarity], result of:
            0.050488334 = score(doc=3172,freq=1.0), product of:
              0.13531123 = queryWeight, product of:
                1.3641988 = boost
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.016614184 = queryNorm
              0.37312746 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.03848936 = weight(abstract_txt:over in 3172) [ClassicSimilarity], result of:
            0.03848936 = score(doc=3172,freq=2.0), product of:
              0.102593705 = queryWeight, product of:
                1.454845 = boost
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.016614184 = queryNorm
              0.375163 = fieldWeight in 3172, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.037536457 = weight(abstract_txt:document in 3172) [ClassicSimilarity], result of:
            0.037536457 = score(doc=3172,freq=1.0), product of:
              0.13991104 = queryWeight, product of:
                1.9617864 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016614184 = queryNorm
              0.26828802 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.09984921 = weight(abstract_txt:classification in 3172) [ClassicSimilarity], result of:
            0.09984921 = score(doc=3172,freq=7.0), product of:
              0.1512575 = queryWeight, product of:
                2.2805479 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016614184 = queryNorm
              0.66012734 = fieldWeight in 3172, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.088135935 = weight(abstract_txt:algorithm in 3172) [ClassicSimilarity], result of:
            0.088135935 = score(doc=3172,freq=1.0), product of:
              0.24716397 = queryWeight, product of:
                2.6074638 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.016614184 = queryNorm
              0.35658893 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
          0.063609175 = weight(abstract_txt:learning in 3172) [ClassicSimilarity], result of:
            0.063609175 = score(doc=3172,freq=1.0), product of:
              0.21422312 = queryWeight, product of:
                2.7140253 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.016614184 = queryNorm
              0.29692957 = fieldWeight in 3172, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=3172)
        0.36 = coord(9/25)
    
  2. Li, Y.; Shawe-Taylor, J.: Advanced learning algorithms for cross-language patent retrieval and classification (2007) 0.19
    0.1858676 = sum of:
      0.1858676 = product of:
        0.58083624 = sum of:
          0.014414053 = weight(abstract_txt:based in 931) [ClassicSimilarity], result of:
            0.014414053 = score(doc=931,freq=1.0), product of:
              0.057874553 = queryWeight, product of:
                1.0926981 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.016614184 = queryNorm
              0.24905685 = fieldWeight in 931, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.0147191025 = weight(abstract_txt:which in 931) [ClassicSimilarity], result of:
            0.0147191025 = score(doc=931,freq=1.0), product of:
              0.06459477 = queryWeight, product of:
                1.3329823 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.016614184 = queryNorm
              0.22786833 = fieldWeight in 931, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.04404324 = weight(abstract_txt:documents in 931) [ClassicSimilarity], result of:
            0.04404324 = score(doc=931,freq=2.0), product of:
              0.09672522 = queryWeight, product of:
                1.4126228 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016614184 = queryNorm
              0.4553439 = fieldWeight in 931, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.0734872 = weight(abstract_txt:learn in 931) [ClassicSimilarity], result of:
            0.0734872 = score(doc=931,freq=1.0), product of:
              0.14976406 = queryWeight, product of:
                1.4352069 = boost
                6.280787 = idf(docFreq=224, maxDocs=44218)
                0.016614184 = queryNorm
              0.49068648 = fieldWeight in 931, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.280787 = idf(docFreq=224, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.046920568 = weight(abstract_txt:document in 931) [ClassicSimilarity], result of:
            0.046920568 = score(doc=931,freq=1.0), product of:
              0.13991104 = queryWeight, product of:
                1.9617864 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016614184 = queryNorm
              0.33536002 = fieldWeight in 931, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.06671456 = weight(abstract_txt:classification in 931) [ClassicSimilarity], result of:
            0.06671456 = score(doc=931,freq=2.0), product of:
              0.1512575 = queryWeight, product of:
                2.2805479 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016614184 = queryNorm
              0.44106615 = fieldWeight in 931, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.11016992 = weight(abstract_txt:algorithm in 931) [ClassicSimilarity], result of:
            0.11016992 = score(doc=931,freq=1.0), product of:
              0.24716397 = queryWeight, product of:
                2.6074638 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.016614184 = queryNorm
              0.44573617 = fieldWeight in 931, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
          0.21036759 = weight(abstract_txt:learning in 931) [ClassicSimilarity], result of:
            0.21036759 = score(doc=931,freq=7.0), product of:
              0.21422312 = queryWeight, product of:
                2.7140253 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.016614184 = queryNorm
              0.98200226 = fieldWeight in 931, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.078125 = fieldNorm(doc=931)
        0.32 = coord(8/25)
    
  3. Wolfram, D.; Zhang, J.: An investigation of the influence of indexing exhaustivity and term distributions on a document space (2002) 0.16
    0.16269374 = sum of:
      0.16269374 = product of:
        0.6778906 = sum of:
          0.023062486 = weight(abstract_txt:based in 5238) [ClassicSimilarity], result of:
            0.023062486 = score(doc=5238,freq=4.0), product of:
              0.057874553 = queryWeight, product of:
                1.0926981 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.016614184 = queryNorm
              0.39849097 = fieldWeight in 5238, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
          0.058822013 = weight(abstract_txt:distribution in 5238) [ClassicSimilarity], result of:
            0.058822013 = score(doc=5238,freq=2.0), product of:
              0.11891129 = queryWeight, product of:
                1.2788578 = boost
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.016614184 = queryNorm
              0.4946714 = fieldWeight in 5238, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
          0.016652763 = weight(abstract_txt:which in 5238) [ClassicSimilarity], result of:
            0.016652763 = score(doc=5238,freq=2.0), product of:
              0.06459477 = queryWeight, product of:
                1.3329823 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.016614184 = queryNorm
              0.2578036 = fieldWeight in 5238, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
          0.035234593 = weight(abstract_txt:documents in 5238) [ClassicSimilarity], result of:
            0.035234593 = score(doc=5238,freq=2.0), product of:
              0.09672522 = queryWeight, product of:
                1.4126228 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016614184 = queryNorm
              0.36427513 = fieldWeight in 5238, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
          0.112609364 = weight(abstract_txt:document in 5238) [ClassicSimilarity], result of:
            0.112609364 = score(doc=5238,freq=9.0), product of:
              0.13991104 = queryWeight, product of:
                1.9617864 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016614184 = queryNorm
              0.80486405 = fieldWeight in 5238, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
          0.43150938 = weight(abstract_txt:centroid in 5238) [ClassicSimilarity], result of:
            0.43150938 = score(doc=5238,freq=2.0), product of:
              0.51391 = queryWeight, product of:
                3.2561162 = boost
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.016614184 = queryNorm
              0.83965945 = fieldWeight in 5238, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.0625 = fieldNorm(doc=5238)
        0.24 = coord(6/25)
    
  4. Hofferer, M.: Heuristic search in information retrieval (1994) 0.14
    0.13521871 = sum of:
      0.13521871 = product of:
        0.5634113 = sum of:
          0.023062486 = weight(abstract_txt:based in 1070) [ClassicSimilarity], result of:
            0.023062486 = score(doc=1070,freq=1.0), product of:
              0.057874553 = queryWeight, product of:
                1.0926981 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.016614184 = queryNorm
              0.39849097 = fieldWeight in 1070, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
          0.086306766 = weight(abstract_txt:documents in 1070) [ClassicSimilarity], result of:
            0.086306766 = score(doc=1070,freq=3.0), product of:
              0.09672522 = queryWeight, product of:
                1.4126228 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016614184 = queryNorm
              0.89228815 = fieldWeight in 1070, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
          0.075072914 = weight(abstract_txt:document in 1070) [ClassicSimilarity], result of:
            0.075072914 = score(doc=1070,freq=1.0), product of:
              0.13991104 = queryWeight, product of:
                1.9617864 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016614184 = queryNorm
              0.53657603 = fieldWeight in 1070, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
          0.07547891 = weight(abstract_txt:classification in 1070) [ClassicSimilarity], result of:
            0.07547891 = score(doc=1070,freq=1.0), product of:
              0.1512575 = queryWeight, product of:
                2.2805479 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016614184 = queryNorm
              0.4990094 = fieldWeight in 1070, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
          0.17627187 = weight(abstract_txt:algorithm in 1070) [ClassicSimilarity], result of:
            0.17627187 = score(doc=1070,freq=1.0), product of:
              0.24716397 = queryWeight, product of:
                2.6074638 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.016614184 = queryNorm
              0.71317786 = fieldWeight in 1070, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
          0.12721835 = weight(abstract_txt:learning in 1070) [ClassicSimilarity], result of:
            0.12721835 = score(doc=1070,freq=1.0), product of:
              0.21422312 = queryWeight, product of:
                2.7140253 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.016614184 = queryNorm
              0.59385914 = fieldWeight in 1070, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.125 = fieldNorm(doc=1070)
        0.24 = coord(6/25)
    
  5. Krulwich, B.; Burkey, C.: Jack and the InfoFinder agent (1997) 0.13
    0.13394247 = sum of:
      0.13394247 = product of:
        0.5580936 = sum of:
          0.17294218 = weight(abstract_txt:heuristically in 3262) [ClassicSimilarity], result of:
            0.17294218 = score(doc=3262,freq=1.0), product of:
              0.18623856 = queryWeight, product of:
                1.1316979 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.016614184 = queryNorm
              0.9286057 = fieldWeight in 3262, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
          0.017662922 = weight(abstract_txt:which in 3262) [ClassicSimilarity], result of:
            0.017662922 = score(doc=3262,freq=1.0), product of:
              0.06459477 = queryWeight, product of:
                1.3329823 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.016614184 = queryNorm
              0.273442 = fieldWeight in 3262, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
          0.083566174 = weight(abstract_txt:documents in 3262) [ClassicSimilarity], result of:
            0.083566174 = score(doc=3262,freq=5.0), product of:
              0.09672522 = queryWeight, product of:
                1.4126228 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016614184 = queryNorm
              0.86395437 = fieldWeight in 3262, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
          0.056304682 = weight(abstract_txt:document in 3262) [ClassicSimilarity], result of:
            0.056304682 = score(doc=3262,freq=1.0), product of:
              0.13991104 = queryWeight, product of:
                1.9617864 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016614184 = queryNorm
              0.40243202 = fieldWeight in 3262, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
          0.13220389 = weight(abstract_txt:algorithm in 3262) [ClassicSimilarity], result of:
            0.13220389 = score(doc=3262,freq=1.0), product of:
              0.24716397 = queryWeight, product of:
                2.6074638 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.016614184 = queryNorm
              0.5348834 = fieldWeight in 3262, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
          0.09541376 = weight(abstract_txt:learning in 3262) [ClassicSimilarity], result of:
            0.09541376 = score(doc=3262,freq=1.0), product of:
              0.21422312 = queryWeight, product of:
                2.7140253 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.016614184 = queryNorm
              0.44539434 = fieldWeight in 3262, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.09375 = fieldNorm(doc=3262)
        0.24 = coord(6/25)
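
The score breakdowns above all follow one pattern, and it can be checked by hand. The sketch below recomputes the first entry (doc 3172) from the values printed in the listing. The formulas are read off the listing itself (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, total = coord * sum of term scores); the function and variable names are illustrative assumptions and are not part of any library API. The search engine computes in single precision, so small differences in the last digits are expected.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.016614184  # queryNorm, as printed in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """idf as it appears in the listing: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


# (term, freq, docFreq, boost) copied from the entry for doc 3172.
TERMS_3172 = [
    ("distribution", 2.0, 445, 1.2788578),
    ("classes", 1.0, 354, 1.3310035),
    ("class", 1.0, 331, 1.3463095),
    ("accuracy", 1.0, 306, 1.3641988),
    ("over", 2.0, 1723, 1.454845),
    ("document", 1.0, 1642, 1.9617864),
    ("classification", 7.0, 2218, 2.2805479),
    ("algorithm", 1.0, 399, 2.6074638),
    ("learning", 1.0, 1038, 2.7140253),
]
FIELD_NORM_3172 = 0.0625
QUERY_TERM_COUNT = 25  # denominator of coord(9/25) in the listing

# Sum the per-term scores, then scale by the fraction of query terms matched.
term_sum = sum(term_score(f, df, b, FIELD_NORM_3172) for _, f, df, b in TERMS_3172)
coord = len(TERMS_3172) / QUERY_TERM_COUNT
total = term_sum * coord

print(f"sum of term scores: {term_sum:.6f}")
print(f"coord: {coord:.2f}")
print(f"total: {total:.6f}")
```

The printed values should agree with the "sum of", "coord" and top-level figures of the first entry to within rounding. The same two functions apply to the other four entries once their term lists and fieldNorm values are substituted.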