Document (#40124)

Author
Pertile, S. de L.
Moreira, V.P.
Title
Comparing and combining content- and citation-based approaches for plagiarism detection
Source
Journal of the Association for Information Science and Technology. 67(2016) no.10, p.2511-2526
Year
2016
Abstract
The vast amount of scientific publications available online makes it easier for students and researchers to reuse text from other authors and makes it harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. The contributions of this paper are twofold: (a) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (b) we compare content- and citation-based approaches with the goal of evaluating whether they are complementary and whether their combination can improve the quality of the detection. We carried out experiments with real data sets of scientific papers and concluded that a combination of the methods can be beneficial.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23593/full.

Similar documents (author)

  1. Moreira, F. Mosso => Mosso Moreira, F.: 4.81
    4.8060684 = sum of:
      4.8060684 = weight(author_txt:moreira in 730) [ClassicSimilarity], result of:
        4.8060684 = fieldWeight in 730, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          9.06241 = idf(docFreq=13, maxDocs=44421)
          0.375 = fieldNorm(doc=730)
    
  2. Flores, F.N.; Moreira, V.P.: Assessing the impact of stemming accuracy on information retrieval : a multilingual perspective (2016) 4.53
    4.531205 = sum of:
      4.531205 = weight(author_txt:moreira in 4187) [ClassicSimilarity], result of:
        4.531205 = fieldWeight in 4187, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.06241 = idf(docFreq=13, maxDocs=44421)
          0.5 = fieldNorm(doc=4187)
    
  3. Santos Macula, B.C. Moreira dos => Moreira dos Santos Macula, B.C.: 4.01
    4.0050573 = sum of:
      4.0050573 = weight(author_txt:moreira in 2122) [ClassicSimilarity], result of:
        4.0050573 = fieldWeight in 2122, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          9.06241 = idf(docFreq=13, maxDocs=44421)
          0.3125 = fieldNorm(doc=2122)
    
  4. Orengo, V.M. -> Moreira Orengo, V.: 3.96
    3.9648046 = sum of:
      3.9648046 = weight(author_txt:moreira in 410) [ClassicSimilarity], result of:
        3.9648046 = fieldWeight in 410, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.06241 = idf(docFreq=13, maxDocs=44421)
          0.4375 = fieldNorm(doc=410)
    
  5. Moreira Orengo, V.; Huyck, C.: Relevance feedback and cross-language information retrieval (2006) 3.96
    3.9648046 = sum of:
      3.9648046 = weight(author_txt:moreira in 1970) [ClassicSimilarity], result of:
        3.9648046 = fieldWeight in 1970, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.06241 = idf(docFreq=13, maxDocs=44421)
          0.4375 = fieldNorm(doc=1970)
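
The author scores above each break down into a single fieldWeight, the product of tf, idf and fieldNorm. The following is a minimal Python sketch of that arithmetic, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the figures printed in the dump. The function names are illustrative and are not part of Lucene; fieldNorm is taken from the dump as given, since it is stored in the index in a lossy encoded form.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm (fieldNorm copied from the explain dump)."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values for the first author hit (doc 730): freq=2.0, docFreq=13, maxDocs=44421.
weight_730 = field_weight(freq=2.0, doc_freq=13, max_docs=44421, field_norm=0.375)
print(round(weight_730, 2))
```

Small differences in the last digits relative to the dump are expected, because Lucene computes these factors in single-precision floats.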
    

Similar documents (content)

  1. Gipp, B.; Meuschke, N.; Breitinger, C.: Citation-based plagiarism detection : practicability on a large-scale scientific corpus (2014) 0.50
    0.49704552 = sum of:
      0.49704552 = product of:
        1.7751626 = sum of:
          0.05474838 = weight(abstract_txt:detecting in 4332) [ClassicSimilarity], result of:
            0.05474838 = score(doc=4332,freq=1.0), product of:
              0.113849595 = queryWeight, product of:
                1.0382035 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014252441 = queryNorm
              0.4808834 = fieldWeight in 4332, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          0.0562261 = weight(abstract_txt:citation in 4332) [ClassicSimilarity], result of:
            0.0562261 = score(doc=4332,freq=4.0), product of:
              0.09198125 = queryWeight, product of:
                1.3197187 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.014252441 = queryNorm
              0.6112779 = fieldWeight in 4332, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          0.060718633 = weight(abstract_txt:approaches in 4332) [ClassicSimilarity], result of:
            0.060718633 = score(doc=4332,freq=3.0), product of:
              0.12198281 = queryWeight, product of:
                1.8613441 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.014252441 = queryNorm
              0.49776384 = fieldWeight in 4332, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          0.051280428 = weight(abstract_txt:based in 4332) [ClassicSimilarity], result of:
            0.051280428 = score(doc=4332,freq=7.0), product of:
              0.09742619 = queryWeight, product of:
                2.1475317 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.014252441 = queryNorm
              0.5263516 = fieldWeight in 4332, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          0.04486374 = weight(abstract_txt:text in 4332) [ClassicSimilarity], result of:
            0.04486374 = score(doc=4332,freq=2.0), product of:
              0.12561002 = queryWeight, product of:
                2.181016 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014252441 = queryNorm
              0.3571669 = fieldWeight in 4332, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          0.29657796 = weight(abstract_txt:detection in 4332) [ClassicSimilarity], result of:
            0.29657796 = score(doc=4332,freq=7.0), product of:
              0.26475915 = queryWeight, product of:
                2.7422235 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.014252441 = queryNorm
              1.1201802 = fieldWeight in 4332, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
          1.2107474 = weight(abstract_txt:plagiarism in 4332) [ClassicSimilarity], result of:
            1.2107474 = score(doc=4332,freq=9.0), product of:
              0.737387 = queryWeight, product of:
                5.908122 = boost
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.014252441 = queryNorm
              1.6419429 = fieldWeight in 4332, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.0625 = fieldNorm(doc=4332)
        0.28 = coord(7/25)
    
  2. Vani, K.; Gupta, D.: Integrating syntax-semantic-based text analysis with structural and citation information for scientific plagiarism detection (2018) 0.43
    0.43326366 = sum of:
      0.43326366 = product of:
        1.5473702 = sum of:
          0.062862694 = weight(abstract_txt:citation in 543) [ClassicSimilarity], result of:
            0.062862694 = score(doc=543,freq=5.0), product of:
              0.09198125 = queryWeight, product of:
                1.3197187 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.014252441 = queryNorm
              0.6834295 = fieldWeight in 543, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          0.03505592 = weight(abstract_txt:approaches in 543) [ClassicSimilarity], result of:
            0.03505592 = score(doc=543,freq=1.0), product of:
              0.12198281 = queryWeight, product of:
                1.8613441 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.014252441 = queryNorm
              0.2873841 = fieldWeight in 543, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          0.050694272 = weight(abstract_txt:scientific in 543) [ClassicSimilarity], result of:
            0.050694272 = score(doc=543,freq=2.0), product of:
              0.12380941 = queryWeight, product of:
                1.8752284 = boost
                4.6324444 = idf(docFreq=1174, maxDocs=44421)
                0.014252441 = queryNorm
              0.4094541 = fieldWeight in 543, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6324444 = idf(docFreq=1174, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          0.043339875 = weight(abstract_txt:based in 543) [ClassicSimilarity], result of:
            0.043339875 = score(doc=543,freq=5.0), product of:
              0.09742619 = queryWeight, product of:
                2.1475317 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.014252441 = queryNorm
              0.4448483 = fieldWeight in 543, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          0.06344691 = weight(abstract_txt:text in 543) [ClassicSimilarity], result of:
            0.06344691 = score(doc=543,freq=4.0), product of:
              0.12561002 = queryWeight, product of:
                2.181016 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014252441 = queryNorm
              0.50511026 = fieldWeight in 543, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          0.22419189 = weight(abstract_txt:detection in 543) [ClassicSimilarity], result of:
            0.22419189 = score(doc=543,freq=4.0), product of:
              0.26475915 = queryWeight, product of:
                2.7422235 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.014252441 = queryNorm
              0.8467767 = fieldWeight in 543, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
          1.0677787 = weight(abstract_txt:plagiarism in 543) [ClassicSimilarity], result of:
            1.0677787 = score(doc=543,freq=7.0), product of:
              0.737387 = queryWeight, product of:
                5.908122 = boost
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.014252441 = queryNorm
              1.4480574 = fieldWeight in 543, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.0625 = fieldNorm(doc=543)
        0.28 = coord(7/25)
    
  3. Alzahrani, S.; Palade, V.; Salim, N.; Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications (2012) 0.34
    0.3363788 = sum of:
      0.3363788 = product of:
        1.4015784 = sum of:
          0.029818388 = weight(abstract_txt:authors in 982) [ClassicSimilarity], result of:
            0.029818388 = score(doc=982,freq=2.0), product of:
              0.08299808 = queryWeight, product of:
                1.2536196 = boost
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.014252441 = queryNorm
              0.35926598 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6452923 = idf(docFreq=1159, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.042606577 = weight(abstract_txt:citation in 982) [ClassicSimilarity], result of:
            0.042606577 = score(doc=982,freq=3.0), product of:
              0.09198125 = queryWeight, product of:
                1.3197187 = boost
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.014252441 = queryNorm
              0.4632094 = fieldWeight in 982, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.890223 = idf(docFreq=907, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.04435749 = weight(abstract_txt:scientific in 982) [ClassicSimilarity], result of:
            0.04435749 = score(doc=982,freq=2.0), product of:
              0.12380941 = queryWeight, product of:
                1.8752284 = boost
                4.6324444 = idf(docFreq=1174, maxDocs=44421)
                0.014252441 = queryNorm
              0.35827234 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6324444 = idf(docFreq=1174, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.029374557 = weight(abstract_txt:based in 982) [ClassicSimilarity], result of:
            0.029374557 = score(doc=982,freq=3.0), product of:
              0.09742619 = queryWeight, product of:
                2.1475317 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.014252441 = queryNorm
              0.30150574 = fieldWeight in 982, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.13871165 = weight(abstract_txt:detection in 982) [ClassicSimilarity], result of:
            0.13871165 = score(doc=982,freq=2.0), product of:
              0.26475915 = queryWeight, product of:
                2.7422235 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.014252441 = queryNorm
              0.52391636 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          1.1167098 = weight(abstract_txt:plagiarism in 982) [ClassicSimilarity], result of:
            1.1167098 = score(doc=982,freq=10.0), product of:
              0.737387 = queryWeight, product of:
                5.908122 = boost
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.014252441 = queryNorm
              1.5144148 = fieldWeight in 982, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
        0.24 = coord(6/25)
    
  4. Stamatatos, E.: Plagiarism detection using stopword n-grams (2011) 0.20
    0.19802062 = sum of:
      0.19802062 = product of:
        0.99010307 = sum of:
          0.068435475 = weight(abstract_txt:detecting in 955) [ClassicSimilarity], result of:
            0.068435475 = score(doc=955,freq=1.0), product of:
              0.113849595 = queryWeight, product of:
                1.0382035 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014252441 = queryNorm
              0.60110426 = fieldWeight in 955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=955)
          0.024227725 = weight(abstract_txt:based in 955) [ClassicSimilarity], result of:
            0.024227725 = score(doc=955,freq=1.0), product of:
              0.09742619 = queryWeight, product of:
                2.1475317 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.014252441 = queryNorm
              0.24867775 = fieldWeight in 955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.078125 = fieldNorm(doc=955)
          0.04388021 = weight(abstract_txt:content in 955) [ClassicSimilarity], result of:
            0.04388021 = score(doc=955,freq=1.0), product of:
              0.13438262 = queryWeight, product of:
                2.2558918 = boost
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.014252441 = queryNorm
              0.3265319 = fieldWeight in 955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1796083 = idf(docFreq=1847, maxDocs=44421)
                0.078125 = fieldNorm(doc=955)
          0.14011994 = weight(abstract_txt:detection in 955) [ClassicSimilarity], result of:
            0.14011994 = score(doc=955,freq=1.0), product of:
              0.26475915 = queryWeight, product of:
                2.7422235 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.014252441 = queryNorm
              0.5292355 = fieldWeight in 955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.078125 = fieldNorm(doc=955)
          0.7134397 = weight(abstract_txt:plagiarism in 955) [ClassicSimilarity], result of:
            0.7134397 = score(doc=955,freq=2.0), product of:
              0.737387 = queryWeight, product of:
                5.908122 = boost
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.014252441 = queryNorm
              0.9675241 = fieldWeight in 955, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.078125 = fieldNorm(doc=955)
        0.2 = coord(5/25)
    
  5. Agarwal, B.; Ramampiaro, H.; Langseth, H.; Ruocco, M.: A deep network model for paraphrase detection in short text messages (2018) 0.19
    0.1885151 = sum of:
      0.1885151 = product of:
        0.7854796 = sum of:
          0.05474838 = weight(abstract_txt:detecting in 43) [ClassicSimilarity], result of:
            0.05474838 = score(doc=43,freq=1.0), product of:
              0.113849595 = queryWeight, product of:
                1.0382035 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.014252441 = queryNorm
              0.4808834 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
          0.060718633 = weight(abstract_txt:approaches in 43) [ClassicSimilarity], result of:
            0.060718633 = score(doc=43,freq=3.0), product of:
              0.12198281 = queryWeight, product of:
                1.8613441 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.014252441 = queryNorm
              0.49776384 = fieldWeight in 43, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
          0.02741054 = weight(abstract_txt:based in 43) [ClassicSimilarity], result of:
            0.02741054 = score(doc=43,freq=2.0), product of:
              0.09742619 = queryWeight, product of:
                2.1475317 = boost
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.014252441 = queryNorm
              0.28134674 = fieldWeight in 43, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1830752 = idf(docFreq=5005, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
          0.04486374 = weight(abstract_txt:text in 43) [ClassicSimilarity], result of:
            0.04486374 = score(doc=43,freq=2.0), product of:
              0.12561002 = queryWeight, product of:
                2.181016 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014252441 = queryNorm
              0.3571669 = fieldWeight in 43, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
          0.19415586 = weight(abstract_txt:detection in 43) [ClassicSimilarity], result of:
            0.19415586 = score(doc=43,freq=3.0), product of:
              0.26475915 = queryWeight, product of:
                2.7422235 = boost
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.014252441 = queryNorm
              0.73333013 = fieldWeight in 43, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.774214 = idf(docFreq=137, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
          0.40358245 = weight(abstract_txt:plagiarism in 43) [ClassicSimilarity], result of:
            0.40358245 = score(doc=43,freq=1.0), product of:
              0.737387 = queryWeight, product of:
                5.908122 = boost
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.014252441 = queryNorm
              0.5473143 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.757029 = idf(docFreq=18, maxDocs=44421)
                0.0625 = fieldNorm(doc=43)
        0.24 = coord(6/25)
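
The content scores above combine several term matches: each term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor (matched query terms / total query terms). The sketch below reproduces hit 4 (doc 955) under those assumptions. The boost, idf, queryNorm and fieldNorm values are copied from the dump rather than recomputed, and the helper names are hypothetical, not Lucene API.

```python
import math

QUERY_NORM = 0.014252441   # queryNorm as printed in the dump
FIELD_NORM_955 = 0.078125  # fieldNorm(doc=955) as printed in the dump


def term_score(boost: float, idf: float, freq: float,
               query_norm: float, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm          # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score(terms, matched: int, total: int,
                   query_norm: float, field_norm: float) -> float:
    """Sum of the term scores, scaled by coord = matched / total."""
    raw = sum(term_score(b, i, f, query_norm, field_norm) for b, i, f in terms)
    return raw * (matched / total)


# (boost, idf, freq) for the five terms matched in doc 955.
terms_955 = [
    (1.0382035, 7.694134, 1.0),   # detecting
    (2.1475317, 3.1830752, 1.0),  # based
    (2.2558918, 4.1796083, 1.0),  # content
    (2.7422235, 6.774214, 1.0),   # detection
    (5.908122, 8.757029, 2.0),    # plagiarism
]

score_955 = document_score(terms_955, matched=5, total=25,
                           query_norm=QUERY_NORM, field_norm=FIELD_NORM_955)
print(round(score_955, 2))
```

The coord factor explains why hit 4 ranks below hits 1 to 3 despite a strong match on "plagiarism": only 5 of the 25 query terms occur in its abstract.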