Document (#41240)

Author
Klic, L.
Miller, M.
Nelson, J.K.
Germann, J.E.
Title
Approaching the largest 'API' : extracting information from the Internet with Python
Source
Code4Lib journal. Issue 39(2018), [http://journal.code4lib.org]
Year
2018
Abstract
This article explores the need for libraries to algorithmically access and manipulate the world's largest API: the Internet. The billions of pages on the 'Internet API' (HTTP, HTML, CSS, XPath, DOM, etc.) are easily accessible and manipulable. Libraries can assist in creating meaning through the datafication of information on the World Wide Web. Because most information is created for human consumption, some programming is required for automated extraction. Python is an easy-to-learn programming language with extensive packages and community support for web page automation. Four Python packages (Urllib, Selenium, BeautifulSoup, Scrapy) can automate almost any web page for projects of all sizes. An example warrant data project is explained to illustrate how well Python packages can manipulate web pages to create meaning by assembling custom datasets.
Content
Cf.: http://journal.code4lib.org/articles/13197.
Theme
Internet
Object
Python
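
The abstract describes extracting structured data from pages written for human readers. The following is a minimal sketch of that idea, not the authors' code: it uses only the Python standard library (html.parser and urllib.parse) rather than BeautifulSoup or Scrapy, and it parses an inline HTML string instead of fetching a live page. The class name LinkExtractor and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample page; a real script would download the HTML first,
# for example with urllib.request.urlopen.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="/articles/13197">Approaching the largest 'API'</a></li>
    <li><a href="http://journal.code4lib.org">Code4Lib Journal</a></li>
  </ul>
</body></html>
"""

parser = LinkExtractor("http://journal.code4lib.org/")
parser.feed(SAMPLE_HTML)
parser.close()
links = parser.links

for url, text in links:
    print(url, "|", text)
```

For pages that build their content with JavaScript, a browser-automation package such as Selenium would be needed instead of a plain HTML parser.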

Similar documents (author)

  1. Klic, L.; Miller, M.; Nelson, J.K.; Pattuelli, C.; Provo, A.: The drawings of the Florentine painters : from print catalog to linked open data (2017) 3.35
    3.3534017 = sum of:
      3.3534017 = sum of:
        1.3808911 = weight(author_txt:miller in 105) [ClassicSimilarity], result of:
          1.3808911 = score(doc=105,freq=1.0), product of:
            0.6191366 = queryWeight, product of:
              7.1371193 = idf(docFreq=95, maxDocs=44421)
              0.0867488 = queryNorm
            2.2303498 = fieldWeight in 105, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              7.1371193 = idf(docFreq=95, maxDocs=44421)
              0.3125 = fieldNorm(doc=105)
        1.9725105 = weight(author_txt:nelson in 105) [ClassicSimilarity], result of:
          1.9725105 = score(doc=105,freq=1.0), product of:
            0.7852833 = queryWeight, product of:
              1.1262115 = boost
              8.037906 = idf(docFreq=38, maxDocs=44421)
              0.0867488 = queryNorm
            2.5118456 = fieldWeight in 105, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.037906 = idf(docFreq=38, maxDocs=44421)
              0.3125 = fieldNorm(doc=105)
    
  2. Nelson, M.J.: Correlation of term usage and term indexing frequencies (1988) 1.97
    1.9725105 = sum of:
      1.9725105 = product of:
        3.945021 = sum of:
          3.945021 = weight(author_txt:nelson in 650) [ClassicSimilarity], result of:
            3.945021 = score(doc=650,freq=1.0), product of:
              0.7852833 = queryWeight, product of:
                1.1262115 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0867488 = queryNorm
              5.023691 = fieldWeight in 650, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.625 = fieldNorm(doc=650)
        0.5 = coord(1/2)
    
  3. Nelson, M.G.: Catalogers as librarians (1986) 1.97
    1.9725105 = sum of:
      1.9725105 = product of:
        3.945021 = sum of:
          3.945021 = weight(author_txt:nelson in 2879) [ClassicSimilarity], result of:
            3.945021 = score(doc=2879,freq=1.0), product of:
              0.7852833 = queryWeight, product of:
                1.1262115 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0867488 = queryNorm
              5.023691 = fieldWeight in 2879, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.625 = fieldNorm(doc=2879)
        0.5 = coord(1/2)
    
  4. Nelson, T.H.: A file structure for the complex, the changing, and the indeterminate (1965) 1.97
    1.9725105 = sum of:
      1.9725105 = product of:
        3.945021 = sum of:
          3.945021 = weight(author_txt:nelson in 4467) [ClassicSimilarity], result of:
            3.945021 = score(doc=4467,freq=1.0), product of:
              0.7852833 = queryWeight, product of:
                1.1262115 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0867488 = queryNorm
              5.023691 = fieldWeight in 4467, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.625 = fieldNorm(doc=4467)
        0.5 = coord(1/2)
    
  5. Nelson, M.J.: The design of a hypertext interface for information retrieval (1991) 1.97
    1.9725105 = sum of:
      1.9725105 = product of:
        3.945021 = sum of:
          3.945021 = weight(author_txt:nelson in 4804) [ClassicSimilarity], result of:
            3.945021 = score(doc=4804,freq=1.0), product of:
              0.7852833 = queryWeight, product of:
                1.1262115 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0867488 = queryNorm
              5.023691 = fieldWeight in 4804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.625 = fieldNorm(doc=4804)
        0.5 = coord(1/2)
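
The per-term scores in the listings above can be reproduced by hand. The sketch below recomputes the "nelson" score for document 650. It assumes the formulas that the listed values are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The function names are illustrative. The search engine computes in single precision, so the last digits may differ slightly.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency, as inferred from the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    idf_value = idf(doc_freq, max_docs)
    query_weight = boost * idf_value * query_norm
    field_weight = math.sqrt(freq) * idf_value * field_norm
    return query_weight * field_weight


# Values copied from the explanation for "author_txt:nelson in 650".
nelson_650 = term_score(
    freq=1.0,
    doc_freq=38,
    max_docs=44421,
    query_norm=0.0867488,
    field_norm=0.625,
    boost=1.1262115,
)

# Only one of the two author query terms matched, hence coord(1/2).
final_650 = nelson_650 * (1 / 2)

print(round(nelson_650, 4), round(final_650, 4))
```

The same function, called with docFreq=95, fieldNorm=0.3125, and no boost, gives the "miller" contribution for document 105.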
    

Similar documents (content)

  1. Eiter, T.; Kaminski, T.; Redl, C.; Schüller, P.; Weinzierl, A.: Answer set programming with external source access (2017) 0.07
    0.074905924 = sum of:
      0.074905924 = product of:
        0.46816206 = sum of:
          0.0091878055 = weight(abstract_txt:information in 4938) [ClassicSimilarity], result of:
            0.0091878055 = score(doc=4938,freq=3.0), product of:
              0.035087574 = queryWeight, product of:
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.014505593 = queryNorm
              0.26185355 = fieldWeight in 4938, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.0625 = fieldNorm(doc=4938)
          0.015989179 = weight(abstract_txt:through in 4938) [ClassicSimilarity], result of:
            0.015989179 = score(doc=4938,freq=1.0), product of:
              0.06395967 = queryWeight, product of:
                1.1023787 = boost
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.014505593 = queryNorm
              0.24998845 = fieldWeight in 4938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.0625 = fieldNorm(doc=4938)
          0.079479806 = weight(abstract_txt:programming in 4938) [ClassicSimilarity], result of:
            0.079479806 = score(doc=4938,freq=1.0), product of:
              0.18629162 = queryWeight, product of:
                1.8813707 = boost
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.014505593 = queryNorm
              0.42664188 = fieldWeight in 4938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.0625 = fieldNorm(doc=4938)
          0.36350527 = weight(abstract_txt:python in 4938) [ClassicSimilarity], result of:
            0.36350527 = score(doc=4938,freq=1.0), product of:
              0.64670455 = queryWeight, product of:
                4.9573054 = boost
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.014505593 = queryNorm
              0.5620886 = fieldWeight in 4938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.0625 = fieldNorm(doc=4938)
        0.16 = coord(4/25)
    
  2. Falk, H.: Internet browsing tools (1995) 0.07
    0.07115399 = sum of:
      0.07115399 = product of:
        0.44471246 = sum of:
          0.013261456 = weight(abstract_txt:information in 2499) [ClassicSimilarity], result of:
            0.013261456 = score(doc=2499,freq=1.0), product of:
              0.035087574 = queryWeight, product of:
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.014505593 = queryNorm
              0.37795305 = fieldWeight in 2499, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.15625 = fieldNorm(doc=2499)
          0.03997295 = weight(abstract_txt:through in 2499) [ClassicSimilarity], result of:
            0.03997295 = score(doc=2499,freq=1.0), product of:
              0.06395967 = queryWeight, product of:
                1.1023787 = boost
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.014505593 = queryNorm
              0.62497115 = fieldWeight in 2499, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.15625 = fieldNorm(doc=2499)
          0.068692595 = weight(abstract_txt:internet in 2499) [ClassicSimilarity], result of:
            0.068692595 = score(doc=2499,freq=2.0), product of:
              0.08337244 = queryWeight, product of:
                1.5414683 = boost
                3.7286568 = idf(docFreq=2900, maxDocs=44421)
                0.014505593 = queryNorm
              0.82392454 = fieldWeight in 2499, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.7286568 = idf(docFreq=2900, maxDocs=44421)
                0.15625 = fieldNorm(doc=2499)
          0.32278547 = weight(abstract_txt:packages in 2499) [ClassicSimilarity], result of:
            0.32278547 = score(doc=2499,freq=1.0), product of:
              0.29469213 = queryWeight, product of:
                2.8980615 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.014505593 = queryNorm
              1.0953312 = fieldWeight in 2499, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.15625 = fieldNorm(doc=2499)
        0.16 = coord(4/25)
    
  3. Priss, U.: Alternatives to the "Semantic Web" : multi-strategy knowledge representation (2003) 0.06
    0.06430003 = sum of:
      0.06430003 = product of:
        0.4018752 = sum of:
          0.01193531 = weight(abstract_txt:information in 3733) [ClassicSimilarity], result of:
            0.01193531 = score(doc=3733,freq=9.0), product of:
              0.035087574 = queryWeight, product of:
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.014505593 = queryNorm
              0.34015775 = fieldWeight in 3733, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.046875 = fieldNorm(doc=3733)
          0.03300989 = weight(abstract_txt:pages in 3733) [ClassicSimilarity], result of:
            0.03300989 = score(doc=3733,freq=1.0), product of:
              0.12562537 = queryWeight, product of:
                1.5449569 = boost
                5.6056433 = idf(docFreq=443, maxDocs=44421)
                0.014505593 = queryNorm
              0.2627645 = fieldWeight in 3733, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6056433 = idf(docFreq=443, maxDocs=44421)
                0.046875 = fieldNorm(doc=3733)
          0.08430106 = weight(abstract_txt:programming in 3733) [ClassicSimilarity], result of:
            0.08430106 = score(doc=3733,freq=2.0), product of:
              0.18629162 = queryWeight, product of:
                1.8813707 = boost
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.014505593 = queryNorm
              0.45252204 = fieldWeight in 3733, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.82627 = idf(docFreq=130, maxDocs=44421)
                0.046875 = fieldNorm(doc=3733)
          0.27262893 = weight(abstract_txt:python in 3733) [ClassicSimilarity], result of:
            0.27262893 = score(doc=3733,freq=1.0), product of:
              0.64670455 = queryWeight, product of:
                4.9573054 = boost
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.014505593 = queryNorm
              0.42156646 = fieldWeight in 3733, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.046875 = fieldNorm(doc=3733)
        0.16 = coord(4/25)
    
  4. Falk, H.: Library databases on the Web (1996) 0.06
    0.061432708 = sum of:
      0.061432708 = product of:
        0.38395444 = sum of:
          0.010609165 = weight(abstract_txt:information in 6974) [ClassicSimilarity], result of:
            0.010609165 = score(doc=6974,freq=1.0), product of:
              0.035087574 = queryWeight, product of:
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.014505593 = queryNorm
              0.30236244 = fieldWeight in 6974, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.125 = fieldNorm(doc=6974)
          0.027090503 = weight(abstract_txt:libraries in 6974) [ClassicSimilarity], result of:
            0.027090503 = score(doc=6974,freq=1.0), product of:
              0.057263803 = queryWeight, product of:
                1.0430804 = boost
                3.78466 = idf(docFreq=2742, maxDocs=44421)
                0.014505593 = queryNorm
              0.4730825 = fieldWeight in 6974, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.78466 = idf(docFreq=2742, maxDocs=44421)
                0.125 = fieldNorm(doc=6974)
          0.088026375 = weight(abstract_txt:pages in 6974) [ClassicSimilarity], result of:
            0.088026375 = score(doc=6974,freq=1.0), product of:
              0.12562537 = queryWeight, product of:
                1.5449569 = boost
                5.6056433 = idf(docFreq=443, maxDocs=44421)
                0.014505593 = queryNorm
              0.7007054 = fieldWeight in 6974, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6056433 = idf(docFreq=443, maxDocs=44421)
                0.125 = fieldNorm(doc=6974)
          0.2582284 = weight(abstract_txt:packages in 6974) [ClassicSimilarity], result of:
            0.2582284 = score(doc=6974,freq=1.0), product of:
              0.29469213 = queryWeight, product of:
                2.8980615 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.014505593 = queryNorm
              0.876265 = fieldWeight in 6974, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.125 = fieldNorm(doc=6974)
        0.16 = coord(4/25)
    
  5. Hilts, P.: Mosaic provides stained-glass windows into the world of the Internet (1994) 0.06
    0.061088502 = sum of:
      0.061088502 = product of:
        0.38180315 = sum of:
          0.027981065 = weight(abstract_txt:through in 847) [ClassicSimilarity], result of:
            0.027981065 = score(doc=847,freq=1.0), product of:
              0.06395967 = queryWeight, product of:
                1.1023787 = boost
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.014505593 = queryNorm
              0.4374798 = fieldWeight in 847, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9998152 = idf(docFreq=2211, maxDocs=44421)
                0.109375 = fieldNorm(doc=847)
          0.0340011 = weight(abstract_txt:internet in 847) [ClassicSimilarity], result of:
            0.0340011 = score(doc=847,freq=1.0), product of:
              0.08337244 = queryWeight, product of:
                1.5414683 = boost
                3.7286568 = idf(docFreq=2900, maxDocs=44421)
                0.014505593 = queryNorm
              0.40782183 = fieldWeight in 847, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7286568 = idf(docFreq=2900, maxDocs=44421)
                0.109375 = fieldNorm(doc=847)
          0.09387114 = weight(abstract_txt:page in 847) [ClassicSimilarity], result of:
            0.09387114 = score(doc=847,freq=1.0), product of:
              0.14333475 = queryWeight, product of:
                1.6502641 = boost
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.014505593 = queryNorm
              0.6549085 = fieldWeight in 847, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.987735 = idf(docFreq=302, maxDocs=44421)
                0.109375 = fieldNorm(doc=847)
          0.22594984 = weight(abstract_txt:packages in 847) [ClassicSimilarity], result of:
            0.22594984 = score(doc=847,freq=1.0), product of:
              0.29469213 = queryWeight, product of:
                2.8980615 = boost
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.014505593 = queryNorm
              0.76673186 = fieldWeight in 847, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.01012 = idf(docFreq=108, maxDocs=44421)
                0.109375 = fieldNorm(doc=847)
        0.16 = coord(4/25)
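
In the content-based listings, each document's final score is the sum of its matching term scores multiplied by a coordination factor: the number of matched query terms divided by the total number of query terms (here 4 of 25). As a sketch, the arithmetic for the first entry (document 4938) can be checked as follows; the variable names are illustrative and the term scores are copied from the listing.

```python
# Per-term scores for document 4938, copied from the explanation above.
term_scores = {
    "information": 0.0091878055,
    "through": 0.015989179,
    "programming": 0.079479806,
    "python": 0.36350527,
}

matched_terms = len(term_scores)   # 4 query terms occur in the abstract
total_terms = 25                   # total query terms, from coord(4/25)

raw_sum = sum(term_scores.values())
coord = matched_terms / total_terms
final_score = raw_sum * coord

print(f"sum={raw_sum:.6f} coord={coord:.2f} score={final_score:.6f}")
```

Because every content match here shares the same coord factor of 0.16, the ranking among these five documents is decided by the raw sums, which are dominated by the rare, heavily boosted terms "python" and "packages".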