Document (#42089)

Ruano-Ordás, D.
Fdez-Riverola, F.
Méndez, J.R.
Using evolutionary computation for discovering spam patterns from e-mail samples
Information processing and management. 54(2018) no.2, pp.303-317
One of the most relevant problems affecting the efficient use of e-mail to communicate worldwide is the spam phenomenon. Spamming involves flooding the Internet with undesired messages aimed at promoting illegal or low-value products and services. Beyond the existence of different well-known machine learning techniques, collaborative schemes and other complementary approaches, some popular anti-spam frameworks such as SpamAssassin or Wirebrush4SPAM have made it possible to use regular expressions to effectively improve filter performance. In this work, we provide a review of existing proposals to automatically generate fully functional regular expressions from any input dataset combining spam and ham messages. Due to the configuration difficulties and low performance of the analysed schemes, we introduce DiscoverRegex, a novel automatic spam pattern-finding tool. Patterns generated by DiscoverRegex outperform those created by existing approaches (being able to avoid FP errors) whilst minimising the computational resources required for its proper operation. DiscoverRegex source code is publicly available at

Similar documents (author)

  1. Martinez Méndez, F.J.: Aproximacion general a la evaluacion de la recuperacion mediante motores de busqueda en Internet (2001) 1.72
    1.7155864 = sum of:
      1.7155864 = product of:
        3.4311728 = sum of:
          3.4311728 = weight(author_txt:méndez in 4803) [ClassicSimilarity], result of:
            3.4311728 = score(doc=4803,freq=1.0), product of:
              0.7220297 = queryWeight, product of:
                1.0215691 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.074365206 = queryNorm
              4.7521214 = fieldWeight in 4803, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.5 = fieldNorm(doc=4803)
        0.5 = coord(1/2)
  2. Valdivia, J. Fdez- -> Fdez-Valdivia, J.: 1.71
    1.7068114 = sum of:
      1.7068114 = product of:
        3.4136229 = sum of:
          3.4136229 = weight(author_txt:fdez in 1041) [ClassicSimilarity], result of:
            3.4136229 = score(doc=1041,freq=2.0), product of:
              0.6918621 = queryWeight, product of:
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.074365206 = queryNorm
              4.9339643 = fieldWeight in 1041, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.375 = fieldNorm(doc=1041)
        0.5 = coord(1/2)
  3. Valdivia, J. Fdez -> Fdez-Valdivia, J.: 1.71
    1.7068114 = sum of:
      1.7068114 = product of:
        3.4136229 = sum of:
          3.4136229 = weight(author_txt:fdez in 1502) [ClassicSimilarity], result of:
            3.4136229 = score(doc=1502,freq=2.0), product of:
              0.6918621 = queryWeight, product of:
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.074365206 = queryNorm
              4.9339643 = fieldWeight in 1502, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.375 = fieldNorm(doc=1502)
        0.5 = coord(1/2)
  4. Rodríguez, E.M.M. -> Méndez Rodríguez, E.M.: 1.50
    1.5011381 = sum of:
      1.5011381 = product of:
        3.0022762 = sum of:
          3.0022762 = weight(author_txt:méndez in 2855) [ClassicSimilarity], result of:
            3.0022762 = score(doc=2855,freq=1.0), product of:
              0.7220297 = queryWeight, product of:
                1.0215691 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.074365206 = queryNorm
              4.1581063 = fieldWeight in 2855, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.4375 = fieldNorm(doc=2855)
        0.5 = coord(1/2)
  5. Greenberg, J.; Méndez Rodríguez, E.M.: Introduction: toward a more library-like Web via semantic knitting (2006) 1.50
    1.5011381 = sum of:
      1.5011381 = product of:
        3.0022762 = sum of:
          3.0022762 = weight(author_txt:méndez in 349) [ClassicSimilarity], result of:
            3.0022762 = score(doc=349,freq=1.0), product of:
              0.7220297 = queryWeight, product of:
                1.0215691 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.074365206 = queryNorm
              4.1581063 = fieldWeight in 349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.4375 = fieldNorm(doc=349)
        0.5 = coord(1/2)
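
The score trees above follow the TF-IDF arithmetic of Lucene's ClassicSimilarity. The following is a minimal sketch of that arithmetic, not the catalog's own code: the function names are hypothetical, and the idf form 1 + ln(maxDocs / (docFreq + 1)) is an assumption chosen because it reproduces the idf values listed above. All numeric inputs are copied from entries 1 and 2.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed classic idf form; it matches the listed 9.504243 and 9.303573.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def tf(freq):
    # The trees show tf(freq=2.0) = 1.4142135, i.e. the square root of freq.
    return math.sqrt(freq)

def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    # score = queryWeight * fieldWeight, as in the "product of" lines.
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight

QUERY_NORM = 0.074365206
MAX_DOCS = 44421
COORD = 0.5  # coord(1/2): one of two query terms matched

# Entry 1 (doc 4803, term "méndez", boosted).
score_4803 = COORD * term_score(1.0, 8, MAX_DOCS, 0.5, QUERY_NORM, boost=1.0215691)

# Entry 2 (doc 1041, term "fdez", no boost line in the tree).
score_1041 = COORD * term_score(2.0, 10, MAX_DOCS, 0.375, QUERY_NORM)

print(round(score_4803, 4), round(score_1041, 4))
```

Small differences in the last digits relative to the listed values are expected, since the listing appears to use single-precision floats while the sketch uses Python doubles.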

Similar documents (content)

  1. Pera, M.S.; Ng, Y.-K.: SpamED : a spam E-mail detection approach based on phrase similarity (2009) 0.36
    0.3587819 = sum of:
      0.3587819 = product of:
        1.7939094 = sum of:
          0.030988451 = weight(abstract_txt:approaches in 3721) [ClassicSimilarity], result of:
            0.030988451 = score(doc=3721,freq=1.0), product of:
              0.086263515 = queryWeight, product of:
                1.2828094 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.014624543 = queryNorm
              0.3592301 = fieldWeight in 3721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.078125 = fieldNorm(doc=3721)
          0.031898204 = weight(abstract_txt:existing in 3721) [ClassicSimilarity], result of:
            0.031898204 = score(doc=3721,freq=1.0), product of:
              0.087943695 = queryWeight, product of:
                1.295242 = boost
                4.6427093 = idf(docFreq=1162, maxDocs=44421)
                0.014624543 = queryNorm
              0.36271167 = fieldWeight in 3721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6427093 = idf(docFreq=1162, maxDocs=44421)
                0.078125 = fieldNorm(doc=3721)
          0.16930108 = weight(abstract_txt:mail in 3721) [ClassicSimilarity], result of:
            0.16930108 = score(doc=3721,freq=6.0), product of:
              0.14725949 = queryWeight, product of:
                1.6760625 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.014624543 = queryNorm
              1.1496786 = fieldWeight in 3721, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.078125 = fieldNorm(doc=3721)
          0.25620782 = weight(abstract_txt:messages in 3721) [ClassicSimilarity], result of:
            0.25620782 = score(doc=3721,freq=6.0), product of:
              0.19410574 = queryWeight, product of:
                1.9242778 = boost
                6.8974466 = idf(docFreq=121, maxDocs=44421)
                0.014624543 = queryNorm
              1.3199395 = fieldWeight in 3721, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.8974466 = idf(docFreq=121, maxDocs=44421)
                0.078125 = fieldNorm(doc=3721)
          1.3055139 = weight(abstract_txt:spam in 3721) [ClassicSimilarity], result of:
            1.3055139 = score(doc=3721,freq=7.0), product of:
              0.7410182 = queryWeight, product of:
                5.9447417 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014624543 = queryNorm
              1.7617838 = fieldWeight in 3721, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=3721)
        0.2 = coord(5/25)
  2. Sedhai, S.; Sun, A.: An analysis of 14 million tweets on hashtag-oriented spamming (2017) 0.09
    0.08973918 = sum of:
      0.08973918 = product of:
        1.1217397 = sum of:
          0.15480296 = weight(abstract_txt:spamming in 4683) [ClassicSimilarity], result of:
            0.15480296 = score(doc=4683,freq=2.0), product of:
              0.18427514 = queryWeight, product of:
                1.3257662 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.014624543 = queryNorm
              0.8400643 = fieldWeight in 4683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
          0.9669368 = weight(abstract_txt:spam in 4683) [ClassicSimilarity], result of:
            0.9669368 = score(doc=4683,freq=6.0), product of:
              0.7410182 = queryWeight, product of:
                5.9447417 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014624543 = queryNorm
              1.304876 = fieldWeight in 4683, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.0625 = fieldNorm(doc=4683)
        0.08 = coord(2/25)
  3. Goodman, J.; Heckerman, D.; Rounthwaite, R.: Schutzwälle gegen Spam (2005) 0.08
    0.08202712 = sum of:
      0.08202712 = product of:
        1.025339 = sum of:
          0.048381813 = weight(abstract_txt:mail in 4696) [ClassicSimilarity], result of:
            0.048381813 = score(doc=4696,freq=1.0), product of:
              0.14725949 = queryWeight, product of:
                1.6760625 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.014624543 = queryNorm
              0.328548 = fieldWeight in 4696, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.0546875 = fieldNorm(doc=4696)
          0.97695714 = weight(abstract_txt:spam in 4696) [ClassicSimilarity], result of:
            0.97695714 = score(doc=4696,freq=8.0), product of:
              0.7410182 = queryWeight, product of:
                5.9447417 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014624543 = queryNorm
              1.3183984 = fieldWeight in 4696, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.0546875 = fieldNorm(doc=4696)
        0.08 = coord(2/25)
  4. Krüger, K.: Suchmaschinen-Spamming : Vergleichend-kritische Analysen zur Wirkung kommerzieller Strategien der Website-Optimierung auf das Ranking in www-Suchmaschinen (2004) 0.08
    0.08012681 = sum of:
      0.08012681 = product of:
        1.0015851 = sum of:
          0.16419335 = weight(abstract_txt:spamming in 4700) [ClassicSimilarity], result of:
            0.16419335 = score(doc=4700,freq=1.0), product of:
              0.18427514 = queryWeight, product of:
                1.3257662 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.014624543 = queryNorm
              0.8910228 = fieldWeight in 4700, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.09375 = fieldNorm(doc=4700)
          0.8373918 = weight(abstract_txt:spam in 4700) [ClassicSimilarity], result of:
            0.8373918 = score(doc=4700,freq=2.0), product of:
              0.7410182 = queryWeight, product of:
                5.9447417 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014624543 = queryNorm
              1.1300557 = fieldWeight in 4700, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.09375 = fieldNorm(doc=4700)
        0.08 = coord(2/25)
  5. Heidrich, J.: Illegale E-Mail-Filterung : Eigenmächtiges Unterdrücken elektronischer Post ist strafbar (2005) 0.07
    0.072007 = sum of:
      0.072007 = product of:
        0.90008754 = sum of:
          0.110587 = weight(abstract_txt:mail in 4239) [ClassicSimilarity], result of:
            0.110587 = score(doc=4239,freq=1.0), product of:
              0.14725949 = queryWeight, product of:
                1.6760625 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.014624543 = queryNorm
              0.7509669 = fieldWeight in 4239, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.125 = fieldNorm(doc=4239)
          0.78950053 = weight(abstract_txt:spam in 4239) [ClassicSimilarity], result of:
            0.78950053 = score(doc=4239,freq=1.0), product of:
              0.7410182 = queryWeight, product of:
                5.9447417 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.014624543 = queryNorm
              1.0654267 = fieldWeight in 4239, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.125 = fieldNorm(doc=4239)
        0.08 = coord(2/25)
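
For the content-based list, each tree sums the per-term scores and then multiplies by a coordination factor (matched terms divided by the 25 query terms). A hedged sketch of that step follows; the helper name `classic_score` is hypothetical, the idf form is the same assumption as for ClassicSimilarity in general (1 + ln(maxDocs / (docFreq + 1))), and the inputs are copied from entries 1 and 5.

```python
import math

def classic_score(terms, field_norm, query_norm, max_docs, query_terms):
    """Sum per-term TF-IDF scores, then scale by coord = matched / query_terms.

    terms: list of (freq, doc_freq, boost) tuples for the matched terms.
    """
    total = 0.0
    for freq, doc_freq, boost in terms:
        idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # assumed idf form
        query_weight = boost * idf * query_norm
        field_weight = math.sqrt(freq) * idf * field_norm
        total += query_weight * field_weight
    coord = len(terms) / query_terms
    return total * coord

QUERY_NORM = 0.014624543
MAX_DOCS = 44421
QUERY_TERMS = 25  # denominator of coord(5/25) and coord(2/25)

# Entry 1 (doc 3721): approaches, existing, mail, messages, spam.
doc_3721 = [
    (1.0, 1215, 1.2828094),
    (1.0, 1162, 1.295242),
    (6.0, 296, 1.6760625),
    (6.0, 121, 1.9242778),
    (7.0, 23, 5.9447417),
]
score_3721 = classic_score(doc_3721, 0.078125, QUERY_NORM, MAX_DOCS, QUERY_TERMS)

# Entry 5 (doc 4239): mail, spam.
doc_4239 = [(1.0, 296, 1.6760625), (1.0, 23, 5.9447417)]
score_4239 = classic_score(doc_4239, 0.125, QUERY_NORM, MAX_DOCS, QUERY_TERMS)

print(round(score_3721, 4), round(score_4239, 4))
```

The coordination factor explains why entry 1 ranks far above the rest: it matches five of the 25 query terms, whereas entries 2 to 5 match only two each.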