Document (#38875)

Author
Vinyals, O.
Toshev, A.
Bengio, S.
Erhan, D.
Title
¬A picture is worth a thousand (coherent) words : building a natural description of images
Source
http://googleresearch.blogspot.de/2014/11/a-picture-is-worth-thousand-coherent.html
Year
2014
Content
"People can summarize a complex scene in a few words without thinking twice. It's much more difficult for computers. But we've just gotten a bit closer -- we've developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images. Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what's going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language. Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it? This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German. Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN's last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. 
But if we remove that final layer, we can instead feed the CNN's rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences. A picture may be worth a thousand words, but sometimes it's the words that are most useful -- so it's important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here." Cf. also: https://news.ycombinator.com/item?id=8621658.
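
Annotation: the training idea described in the quoted text (feed the CNN's image encoding into an RNN and maximize the likelihood of the training caption) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a plain tanh RNN stands in for the decoder, the image encoding is a random vector rather than real CNN output, and the names caption_nll, W_img, W_emb, W_hh and W_out as well as all dimensions and token ids are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def caption_nll(image_features, caption_ids, params):
    """Negative log-likelihood of one caption given one image encoding.

    Minimizing this value over many (image, caption) pairs is the same as
    maximizing the likelihood of the training captions.
    """
    W_img, W_emb, W_hh, W_out = params
    # Assumption: the image encoding initializes the decoder state.
    h = np.tanh(W_img @ image_features)
    nll = 0.0
    prev = 0  # hypothetical start-of-sentence token id
    for word in caption_ids:
        # One recurrent step: previous word embedding plus previous state.
        h = np.tanh(W_emb[prev] + W_hh @ h)
        probs = softmax(W_out @ h)  # distribution over the vocabulary
        nll -= np.log(probs[word])  # penalize low probability of the true word
        prev = word
    return nll

# Toy setup with hypothetical sizes; real systems use far larger ones.
rng = np.random.default_rng(0)
feat_dim, hidden, vocab = 8, 16, 5
params = (
    rng.normal(scale=0.1, size=(hidden, feat_dim)),  # W_img
    rng.normal(scale=0.1, size=(vocab, hidden)),     # W_emb
    rng.normal(scale=0.1, size=(hidden, hidden)),    # W_hh
    rng.normal(scale=0.1, size=(vocab, hidden)),     # W_out
)
image_features = rng.normal(size=feat_dim)  # stand-in for a CNN encoding
caption = [2, 4, 1]                         # hypothetical word ids, 1 = end of sentence
loss = caption_nll(image_features, caption, params)
print(f"negative log-likelihood: {loss:.3f}")
```

In a real system the gradient of this loss would be propagated through both the decoder and the image encoder, which is what makes the approach end-to-end.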
Footnote
Cf.: http://arxiv.org/abs/1411.4555.
Theme
Automatic indexing
Form
Images
Object
Google

Similar documents (content)

  1. Graphic details : a scientific study of the importance of diagrams to science (2016) 0.54
    0.5413458 = sum of:
      0.5413458 = product of:
        1.218028 = sum of:
          0.13331926 = weight(abstract_txt:words in 4035) [ClassicSimilarity], result of:
            0.13331926 = score(doc=4035,freq=1.0), product of:
              0.22758727 = queryWeight, product of:
                1.1068717 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.038390502 = queryNorm
              0.58579403 = fieldWeight in 4035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.109375 = fieldNorm(doc=4035)
          0.22728886 = weight(abstract_txt:picture in 4035) [ClassicSimilarity], result of:
            0.22728886 = score(doc=4035,freq=1.0), product of:
              0.3247916 = queryWeight, product of:
                1.322287 = boost
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.038390502 = queryNorm
              0.69979906 = fieldWeight in 4035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.109375 = fieldNorm(doc=4035)
          0.34204495 = weight(abstract_txt:worth in 4035) [ClassicSimilarity], result of:
            0.34204495 = score(doc=4035,freq=1.0), product of:
              0.42652205 = queryWeight, product of:
                1.5152841 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.038390502 = queryNorm
              0.80193967 = fieldWeight in 4035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.109375 = fieldNorm(doc=4035)
          0.5153749 = weight(abstract_txt:thousand in 4035) [ClassicSimilarity], result of:
            0.5153749 = score(doc=4035,freq=1.0), product of:
              0.5605765 = queryWeight, product of:
                1.7371637 = boost
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.038390502 = queryNorm
              0.9193659 = fieldWeight in 4035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.109375 = fieldNorm(doc=4035)
        0.44444445 = coord(4/9)
    
  2. Rolling, L.: ¬The role of graphic display of concept relationships in indexing and retrieval vocabularies (1985) 0.23
    0.23200533 = sum of:
      0.23200533 = product of:
        0.522012 = sum of:
          0.057136826 = weight(abstract_txt:words in 4646) [ClassicSimilarity], result of:
            0.057136826 = score(doc=4646,freq=1.0), product of:
              0.22758727 = queryWeight, product of:
                1.1068717 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.038390502 = queryNorm
              0.25105458 = fieldWeight in 4646, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.046875 = fieldNorm(doc=4646)
          0.09740952 = weight(abstract_txt:picture in 4646) [ClassicSimilarity], result of:
            0.09740952 = score(doc=4646,freq=1.0), product of:
              0.3247916 = queryWeight, product of:
                1.322287 = boost
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.038390502 = queryNorm
              0.29991388 = fieldWeight in 4646, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.046875 = fieldNorm(doc=4646)
          0.1465907 = weight(abstract_txt:worth in 4646) [ClassicSimilarity], result of:
            0.1465907 = score(doc=4646,freq=1.0), product of:
              0.42652205 = queryWeight, product of:
                1.5152841 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.038390502 = queryNorm
              0.34368843 = fieldWeight in 4646, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.046875 = fieldNorm(doc=4646)
          0.22087495 = weight(abstract_txt:thousand in 4646) [ClassicSimilarity], result of:
            0.22087495 = score(doc=4646,freq=1.0), product of:
              0.5605765 = queryWeight, product of:
                1.7371637 = boost
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.038390502 = queryNorm
              0.39401394 = fieldWeight in 4646, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.405631 = idf(docFreq=26, maxDocs=44421)
                0.046875 = fieldNorm(doc=4646)
        0.44444445 = coord(4/9)
    
  3. Jacsó, P.: Searching for images on the Web : pt.2 (1997) 0.23
    0.2281777 = sum of:
      0.2281777 = product of:
        0.68453306 = sum of:
          0.16151053 = weight(abstract_txt:natural in 548) [ClassicSimilarity], result of:
            0.16151053 = score(doc=548,freq=1.0), product of:
              0.20390065 = queryWeight, product of:
                1.0476896 = boost
                5.0694656 = idf(docFreq=758, maxDocs=44421)
                0.038390502 = queryNorm
              0.792104 = fieldWeight in 548, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0694656 = idf(docFreq=758, maxDocs=44421)
                0.15625 = fieldNorm(doc=548)
          0.19832413 = weight(abstract_txt:images in 548) [ClassicSimilarity], result of:
            0.19832413 = score(doc=548,freq=1.0), product of:
              0.23381288 = queryWeight, product of:
                1.1219088 = boost
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.038390502 = queryNorm
              0.8482173 = fieldWeight in 548, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.15625 = fieldNorm(doc=548)
          0.3246984 = weight(abstract_txt:picture in 548) [ClassicSimilarity], result of:
            0.3246984 = score(doc=548,freq=1.0), product of:
              0.3247916 = queryWeight, product of:
                1.322287 = boost
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.038390502 = queryNorm
              0.99971294 = fieldWeight in 548, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.15625 = fieldNorm(doc=548)
        0.33333334 = coord(3/9)
    
  4. Cawkell, A.E.: Developments in indexing picture collections (1993) 0.18
    0.18318927 = sum of:
      0.18318927 = product of:
        0.8243517 = sum of:
          0.23798896 = weight(abstract_txt:images in 7444) [ClassicSimilarity], result of:
            0.23798896 = score(doc=7444,freq=1.0), product of:
              0.23381288 = queryWeight, product of:
                1.1219088 = boost
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.038390502 = queryNorm
              1.0178608 = fieldWeight in 7444, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.1875 = fieldNorm(doc=7444)
          0.5863628 = weight(abstract_txt:worth in 7444) [ClassicSimilarity], result of:
            0.5863628 = score(doc=7444,freq=1.0), product of:
              0.42652205 = queryWeight, product of:
                1.5152841 = boost
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.038390502 = queryNorm
              1.3747537 = fieldWeight in 7444, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.33202 = idf(docFreq=78, maxDocs=44421)
                0.1875 = fieldNorm(doc=7444)
        0.22222222 = coord(2/9)
    
  5. Cawkell, A.E.: Imaging systems and picture collection management : a review (1992) 0.17
    0.17397682 = sum of:
      0.17397682 = product of:
        0.52193046 = sum of:
          0.098310746 = weight(abstract_txt:description in 5096) [ClassicSimilarity], result of:
            0.098310746 = score(doc=5096,freq=1.0), product of:
              0.1857605 = queryWeight, product of:
                4.83871 = idf(docFreq=955, maxDocs=44421)
                0.038390502 = queryNorm
              0.5292339 = fieldWeight in 5096, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.83871 = idf(docFreq=955, maxDocs=44421)
                0.109375 = fieldNorm(doc=5096)
          0.19633088 = weight(abstract_txt:images in 5096) [ClassicSimilarity], result of:
            0.19633088 = score(doc=5096,freq=2.0), product of:
              0.23381288 = queryWeight, product of:
                1.1219088 = boost
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.038390502 = queryNorm
              0.8396923 = fieldWeight in 5096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.428591 = idf(docFreq=529, maxDocs=44421)
                0.109375 = fieldNorm(doc=5096)
          0.22728886 = weight(abstract_txt:picture in 5096) [ClassicSimilarity], result of:
            0.22728886 = score(doc=5096,freq=1.0), product of:
              0.3247916 = queryWeight, product of:
                1.322287 = boost
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.038390502 = queryNorm
              0.69979906 = fieldWeight in 5096, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.398163 = idf(docFreq=200, maxDocs=44421)
                0.109375 = fieldNorm(doc=5096)
        0.33333334 = coord(3/9)
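
Annotation: the score trees above follow the classic TF-IDF pattern reported under [ClassicSimilarity]. Each matching term contributes queryWeight times fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor. The sketch below recomputes the score of the first similar document (doc=4035) from the numbers printed in its tree. The function names are hypothetical, and the idf and tf formulas are assumptions inferred from the listed values (they reproduce the printed idf figures and the tf of 1.4142135 shown for freq=2.0).

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.038390502

def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assumed to be 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, boost, field_norm):
    """Contribution of one query term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight

# (term, boost, docFreq) as listed for doc=4035; every termFreq is 1.0.
terms = [
    ("words", 1.1068717, 569),
    ("picture", 1.322287, 200),
    ("worth", 1.5152841, 78),
    ("thousand", 1.7371637, 26),
]
FIELD_NORM = 0.109375  # fieldNorm(doc=4035)

total = sum(term_score(1.0, doc_freq, boost, FIELD_NORM) for _, boost, doc_freq in terms)
score = total * (4 / 9)  # coord(4/9): 4 of the 9 query terms matched
print(f"sum = {total:.6f}, score = {score:.6f}")
```

Up to floating-point rounding, the result agrees with the listed sum of 1.218028 and final score of 0.5413458. The same calculation applies to the other entries with their own fieldNorm, term list and coord factor.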