Document (#30809)

Author
Baumgartner, R.
Title
Methoden und Werkzeuge zur Webdatenextraktion
Source
Semantic Web: Wege zur vernetzten Wissensgesellschaft. Eds.: T. Pellegrini and A. Blumauer
Imprint
Berlin : Springer
Year
2006
Pages
pp. 419-435
Series
X.media.press
Abstract
The World Wide Web can be regarded as the largest "database" known to us. Unfortunately, today's Web is largely designed for presentation to human users and consists of very heterogeneous data collections. Moreover, the Web lacks the means to query information in a structured way and aggregated across different sources. Today's Web is therefore not suited to automatic machine processing. To make effective use of Web data nonetheless, languages, methods, and tools for extracting and aggregating these data have been developed. This article gives an overview and a categorization of different approaches to data extraction from the Web. Several example scenarios in B2B data exchange, in business intelligence, and in particular in the generation of data for Semantic Web ontologies illustrate the effective use of these technologies.
Theme
Data Mining

Similar documents (content)

  1. Frohner, H.: Social Tagging : Grundlagen, Anwendungen, Auswirkungen auf Wissensorganisation und soziale Strukturen der User (2010) 0.13
    0.12785794 = sum of:
      0.12785794 = product of:
        0.5327414 = sum of:
          0.07316245 = weight(abstract_txt:ansätzen in 723) [ClassicSimilarity], result of:
            0.07316245 = score(doc=723,freq=1.0), product of:
              0.15354276 = queryWeight, product of:
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.02013958 = queryNorm
              0.47649562 = fieldWeight in 723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
          0.07316245 = weight(abstract_txt:heterogenen in 723) [ClassicSimilarity], result of:
            0.07316245 = score(doc=723,freq=1.0), product of:
              0.15354276 = queryWeight, product of:
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.02013958 = queryNorm
              0.47649562 = fieldWeight in 723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
          0.08338968 = weight(abstract_txt:effektiv in 723) [ClassicSimilarity], result of:
            0.08338968 = score(doc=723,freq=1.0), product of:
              0.1675375 = queryWeight, product of:
                1.0445791 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.02013958 = queryNorm
              0.49773738 = fieldWeight in 723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
          0.19386314 = weight(abstract_txt:kategorisierung in 723) [ClassicSimilarity], result of:
            0.19386314 = score(doc=723,freq=2.0), product of:
              0.2333587 = queryWeight, product of:
                1.2328134 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.02013958 = queryNorm
              0.8307517 = fieldWeight in 723, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
          0.067611694 = weight(abstract_txt:daten in 723) [ClassicSimilarity], result of:
            0.067611694 = score(doc=723,freq=2.0), product of:
              0.145675 = queryWeight, product of:
                1.377504 = boost
                5.250997 = idf(docFreq=632, maxDocs=44421)
                0.02013958 = queryNorm
              0.46412694 = fieldWeight in 723, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.250997 = idf(docFreq=632, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
          0.041552007 = weight(abstract_txt:dieser in 723) [ClassicSimilarity], result of:
            0.041552007 = score(doc=723,freq=1.0), product of:
              0.15187009 = queryWeight, product of:
                1.7225907 = boost
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.02013958 = queryNorm
              0.2736023 = fieldWeight in 723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.0625 = fieldNorm(doc=723)
        0.24 = coord(6/25)
    
  2. Röhle, T.: Die Demontage der Gatekeeper : relationale Perspektiven zur Macht der Suchmaschinen (2009) 0.08
    0.08098604 = sum of:
      0.08098604 = product of:
        0.40493017 = sum of:
          0.07316245 = weight(abstract_txt:ansätzen in 1023) [ClassicSimilarity], result of:
            0.07316245 = score(doc=1023,freq=1.0), product of:
              0.15354276 = queryWeight, product of:
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.02013958 = queryNorm
              0.47649562 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.0625 = fieldNorm(doc=1023)
          0.08056446 = weight(abstract_txt:strukturiert in 1023) [ClassicSimilarity], result of:
            0.08056446 = score(doc=1023,freq=1.0), product of:
              0.16373172 = queryWeight, product of:
                1.0326467 = boost
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.02013958 = queryNorm
              0.49205163 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.0625 = fieldNorm(doc=1023)
          0.056936186 = weight(abstract_txt:verschiedenen in 1023) [ClassicSimilarity], result of:
            0.056936186 = score(doc=1023,freq=1.0), product of:
              0.16367137 = queryWeight, product of:
                1.4601138 = boost
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.02013958 = queryNorm
              0.34786892 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.0625 = fieldNorm(doc=1023)
          0.083104014 = weight(abstract_txt:dieser in 1023) [ClassicSimilarity], result of:
            0.083104014 = score(doc=1023,freq=4.0), product of:
              0.15187009 = queryWeight, product of:
                1.7225907 = boost
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.02013958 = queryNorm
              0.5472046 = fieldWeight in 1023, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.0625 = fieldNorm(doc=1023)
          0.11116307 = weight(abstract_txt:werkzeuge in 1023) [ClassicSimilarity], result of:
            0.11116307 = score(doc=1023,freq=1.0), product of:
              0.25567457 = queryWeight, product of:
                1.8249211 = boost
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.02013958 = queryNorm
              0.43478346 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.0625 = fieldNorm(doc=1023)
        0.2 = coord(5/25)
    
  3. Weigel, U.: Internet - (k)ein Netz mit doppeltem Boden? : T.1: Eine erste Annäherung; T.2: Dienste; T.3: World-Wide Web (1994) 0.08
    0.080687635 = sum of:
      0.080687635 = product of:
        1.0085955 = sum of:
          0.3416171 = weight(abstract_txt:verschiedenen in 126) [ClassicSimilarity], result of:
            0.3416171 = score(doc=126,freq=1.0), product of:
              0.16367137 = queryWeight, product of:
                1.4601138 = boost
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.02013958 = queryNorm
              2.0872135 = fieldWeight in 126, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.375 = fieldNorm(doc=126)
          0.6669784 = weight(abstract_txt:werkzeuge in 126) [ClassicSimilarity], result of:
            0.6669784 = score(doc=126,freq=1.0), product of:
              0.25567457 = queryWeight, product of:
                1.8249211 = boost
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.02013958 = queryNorm
              2.6087008 = fieldWeight in 126, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.375 = fieldNorm(doc=126)
        0.08 = coord(2/25)
    
  4. Krüger, S.: Wissen ist Macht : Portale weisen den Weg und öffnen Türen (2001) 0.07
    0.070651494 = sum of:
      0.070651494 = product of:
        0.29438123 = sum of:
          0.045726534 = weight(abstract_txt:ansätzen in 6737) [ClassicSimilarity], result of:
            0.045726534 = score(doc=6737,freq=1.0), product of:
              0.15354276 = queryWeight, product of:
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.02013958 = queryNorm
              0.29780978 = fieldWeight in 6737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
          0.05035279 = weight(abstract_txt:strukturiert in 6737) [ClassicSimilarity], result of:
            0.05035279 = score(doc=6737,freq=1.0), product of:
              0.16373172 = queryWeight, product of:
                1.0326467 = boost
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.02013958 = queryNorm
              0.30753228 = fieldWeight in 6737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
          0.035585117 = weight(abstract_txt:verschiedenen in 6737) [ClassicSimilarity], result of:
            0.035585117 = score(doc=6737,freq=1.0), product of:
              0.16367137 = queryWeight, product of:
                1.4601138 = boost
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.02013958 = queryNorm
              0.21741807 = fieldWeight in 6737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5659027 = idf(docFreq=461, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
          0.056512747 = weight(abstract_txt:methoden in 6737) [ClassicSimilarity], result of:
            0.056512747 = score(doc=6737,freq=2.0), product of:
              0.1768268 = queryWeight, product of:
                1.5176597 = boost
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.02013958 = queryNorm
              0.3195938 = fieldWeight in 6737, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
          0.03672713 = weight(abstract_txt:dieser in 6737) [ClassicSimilarity], result of:
            0.03672713 = score(doc=6737,freq=2.0), product of:
              0.15187009 = queryWeight, product of:
                1.7225907 = boost
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.02013958 = queryNorm
              0.24183255 = fieldWeight in 6737, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
          0.069476925 = weight(abstract_txt:werkzeuge in 6737) [ClassicSimilarity], result of:
            0.069476925 = score(doc=6737,freq=1.0), product of:
              0.25567457 = queryWeight, product of:
                1.8249211 = boost
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.02013958 = queryNorm
              0.27173966 = fieldWeight in 6737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.0390625 = fieldNorm(doc=6737)
        0.24 = coord(6/25)
    
  5. Cejpek, J.: Wie die neuen Medien bewerten : die Informationswissenschaft als Wissenschaft mit Gewissen (1996) 0.07
    0.068430945 = sum of:
      0.068430945 = product of:
        0.5702579 = sum of:
          0.07271601 = weight(abstract_txt:dieser in 6344) [ClassicSimilarity], result of:
            0.07271601 = score(doc=6344,freq=1.0), product of:
              0.15187009 = queryWeight, product of:
                1.7225907 = boost
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.02013958 = queryNorm
              0.47880405 = fieldWeight in 6344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.377637 = idf(docFreq=1515, maxDocs=44421)
                0.109375 = fieldNorm(doc=6344)
          0.19453537 = weight(abstract_txt:werkzeuge in 6344) [ClassicSimilarity], result of:
            0.19453537 = score(doc=6344,freq=1.0), product of:
              0.25567457 = queryWeight, product of:
                1.8249211 = boost
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.02013958 = queryNorm
              0.76087105 = fieldWeight in 6344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9565353 = idf(docFreq=114, maxDocs=44421)
                0.109375 = fieldNorm(doc=6344)
          0.30300656 = weight(abstract_txt:heutige in 6344) [ClassicSimilarity], result of:
            0.30300656 = score(doc=6344,freq=1.0), product of:
              0.34354988 = queryWeight, product of:
                2.1154134 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.02013958 = queryNorm
              0.8819871 = fieldWeight in 6344, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.109375 = fieldNorm(doc=6344)
        0.12 = coord(3/25)
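
The score breakdowns above can be reproduced by hand. The following is a minimal sketch, assuming the listing follows the usual TF-IDF pattern: tf is the square root of the term frequency, and idf is 1 + ln(maxDocs / (docFreq + 1)), which agrees with the idf values printed in the listing. The function names are hypothetical and chosen for illustration; fieldNorm, boost, and queryNorm are taken from the listing as given, not derived.

```python
import math

# Hypothetical helper names; the formulas are inferred from the listing above.

def tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # Inverse document frequency, consistent with the printed idf values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # score = queryWeight * fieldWeight, as in each "weight(...)" subtree.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(term_scores, matched, total):
    # Sum of the term scores, scaled by coord(matched/total).
    return sum(term_scores) * (matched / total)

MAX_DOCS = 44421
QUERY_NORM = 0.02013958

# Result 3 (doc 126): two matching terms out of 25 query terms.
verschiedenen = term_score(1.0, 461, MAX_DOCS, QUERY_NORM, 0.375, boost=1.4601138)
werkzeuge = term_score(1.0, 114, MAX_DOCS, QUERY_NORM, 0.375, boost=1.8249211)
score_126 = document_score([verschiedenen, werkzeuge], matched=2, total=25)

print(round(verschiedenen, 4), round(werkzeuge, 4), round(score_126, 4))
```

Small differences in the last digits against the listing are to be expected, since the listing appears to use single-precision arithmetic while this sketch uses Python floats.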