[Two stars] Programming, any language: grab the images from a web page

zldrobit · #31

Since nobody has posted PHP yet, here's a PHP version. It's pretty ugly...

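Code:

<?php
	// Placeholder: replace with the page to scrape (should end with '/').
	// The original used a joke string here: "a site that creepy uncles like".
	$url = 'http://example.com/';
	$content = file_get_contents($url);
	if ($content === false) {
		die("Could not fetch $url");
	}
	// Test case for a scheme-less URL; uncomment to try it.
	// $content = $content . '<img src="abc.def/abc.jpg" />';
	// Capture the src attribute of every <img> tag.
	preg_match_all('/<img\s[^>]*?src\s*=\s*"(.*?)"/i', $content, $matches, PREG_SET_ORDER);
	echo count($matches) . "<br /><br />";

	foreach ($matches as $val) {
		$pic_url = $val[1];
		if (strpos($val[1], '//') !== false) {
			// Contains '//': treat it as an absolute URL and use it as is.
		}
		elseif (preg_match('@^(.*?)/@', $val[1], $inner_matches) == 0) {
			// No slash at all: a bare file name relative to the page.
			$pic_url = $url . $val[1];
		}
		elseif (preg_match('@[:.]@', $inner_matches[1]) == 0) {
			// First path segment does not look like a host name: relative path.
			$pic_url = $url . $val[1];
		}
		$pic = @file_get_contents($pic_url);
		if ($pic === false) {
			continue;
		}
		// Use the last path segment as the local file name.
		if (preg_match('@/([^/]+)$@', $pic_url, $tmp_matches) == 0) {
			continue;
		}
		$pic_file_name = $tmp_matches[1];
		$f = fopen("/home/robit/pic/" . $pic_file_name, "wb");
		fwrite($f, $pic);
		fclose($f);
	}
	echo "<br /><br />Download complete!<br /><br />";
	//echo htmlentities($content);
?>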
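
For what it's worth, the absolute-versus-relative branching in the PHP above can be handed off to a URL library. Below is a minimal sketch in Python 3 using the standard urllib.parse.urljoin. The function name resolve_image_url and the example.com addresses are made up for illustration, not taken from the post.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url: str, src: str) -> str:
    """Resolve an <img> src value against the URL of the page it came from."""
    # urljoin leaves absolute URLs alone and resolves relative ones
    # (bare file names, root-relative paths, protocol-relative URLs).
    return urljoin(page_url, src)


if __name__ == "__main__":
    page = "http://example.com/gallery/index.html"  # hypothetical page URL
    for src in ("a.jpg", "/img/b.png", "http://cdn.example.org/c.gif"):
        print(resolve_image_url(page, src))
```

With this, the page URL no longer has to end in a slash, and src values such as /img/b.png resolve against the site root rather than being appended to the page address.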

zldrobit · #32

My line breaks got eaten... ugh... Did it go through br2nl or something? Ugh...

SmallV · #33

pocoyo wrote:
tenzu wrote: The avatar on post #10...
Damn, I can't take it...

I really can't take it either

sunfish · #34

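SmallV wrote:
pocoyo wrote:
tenzu wrote: The avatar on post #10...
Damn, I can't take it...
I really can't take it either

I admired it for a while, and when I went to the next page I was a little reluctant to leave and took one more look.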

月下叹逍遥 · #35

SmallV wrote:
pocoyo wrote:
tenzu wrote: The avatar on post #10...
Damn, I can't take it...
I really can't take it either

It gives 饭团's avatar a run for its money.

crazyyujie · #36

I couldn't follow those languages... I'm just not good enough yet.

bjlbeyond · #37

roylez wrote: Did this purely out of boredom

Code:

#!/usr/bin/env ruby
# coding: utf-8
# Author: Roy L Zuo (roylzuo at gmail dot com)
require 'open-uri'
require 'hpricot'

# Return the sorted, de-duplicated absolute URLs of all <img> tags on the page.
# Note: on Ruby 2.5+ call URI.open instead of open; Kernel#open no longer
# accepts URLs as of Ruby 3.0.
def parse_img_list(url)
    u = URI.parse(url)
    doc = Hpricot.parse(open(url).read)
    (doc/"img").collect {|v| u.merge(v.attributes['src']).to_s}.uniq.sort
end

# Download every URL in its own thread, saving each under its last path segment.
# write (not puts) is used so no trailing newline is added to the binary data.
def buster_img_list(*list)
    list.collect {|i|
        Thread.new { open(i.split("/").last,'wb') {|f| f.write open(i).read } }
    }.each {|t| t.join}
end

if __FILE__==$0
    require 'optparse'
    options = {}
    parser = OptionParser.new { |opts|
        opts.banner = "Usage: #{$0} [-l] URL"
        options[:list_only] = false
        opts.on('-l','--list-only','Show the image list') { options[:list_only] = true }
    }
    parser.parse!
    exit unless ARGV[0]
    l = parse_img_list ARGV[0]
    options[:list_only] ? l.each{|u| puts u } : (buster_img_list *l )
end

Code:

roylez@Lancelot> ruby imgbuster.rb -h
Usage: imgbuster.rb [-l] URL
    -l, --list-only                  Show the image list
roylez@Lancelot> ruby imgbuster.rb -l http://forum.ubuntu.org.cn
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read_locked.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read_subforum.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_unread.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/icon_topic_latest.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_faq.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_login.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_register.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/whosonline.gif

You're an expert! Where am I supposed to put this code, though? Please explain.

wang153723482 · #38

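I once wrote a Python script to grab the images from XX web pages, but the image URLs on the many XX sites have nothing in common, so a single regex can't cover every case.
In the end I just wrote a throwaway regex each time I used it...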

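Code:

import re
import urllib.request

# Page address
urlItem = urllib.request.urlopen("http://wang153723482.blog.163.com/blog/static/118649845201061053229326/")
# The page may not be UTF-8; undecodable bytes are dropped, since only the ASCII URLs matter.
htmlSource = urlItem.read().decode("utf-8", errors="ignore")
urlItem.close()

# Matches the image addresses on this page; [^/"\s]* keeps the match inside the host name
p = re.compile(r'http://img[^/"\s]*\.126\.net/[a-zA-Z0-9_-]*==/[0-9]*\.jpg')

# Save each match as img1.jpg, img2.jpg, ...
for i, m in enumerate(p.finditer(htmlSource), start=1):
    page = urllib.request.urlopen(m.group())
    my_picture = page.read()
    page.close()
    with open("img" + str(i) + ".jpg", "wb") as fileObj:
        fileObj.write(my_picture)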

Online regex testers: http://regexpal.com/ http://www.osctools.net/
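
One way around writing a new regex for every site, as described above, is to parse the HTML and read the src attribute of each img tag. Below is a minimal sketch using only Python 3's standard html.parser module. The names ImgSrcCollector and extract_img_sources are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; self-closing <img />
        # tags are also routed here by the default handle_startendtag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_img_sources(html: str) -> list:
    """Return the src values of all <img> tags, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


if __name__ == "__main__":
    sample = '<p><img alt="x" src="a.jpg"><IMG SRC=\'b.png\' /></p>'
    print(extract_img_sources(sample))
```

This does not care about attribute order, quoting style, or letter case, which are the things a hand-written pattern tends to trip over. It still misses images that a page loads through scripts or CSS.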

rootstar · #39

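Code:

#!/bin/bash
# Image extensions to download; add more as needed.
s='\.jpg|\.png|\.gif'
read -r -p "Please input a URL: " url
# Split the page on double quotes, then keep the tokens that start with http
# and end with one of the extensions above.
ul=$(curl -s -m 10 "$url" | tr '"' '\n' | grep '^http' | grep -E "($s)\$")
# $ul is left unquoted on purpose so that each URL becomes a separate argument.
wget -c $ul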

Just threw this together, so it's a bit rough.

秋景雨 · #40

I'll take my time with this, learning as I go.

happytor · #41

Code:

from pyquery import PyQuery

url = 'http://forum.ubuntu.org.cn/'
d = PyQuery(url=url)
# Rewrite relative href/src attributes against the page URL, once for the whole document.
d.make_links_absolute()

for img in d("img").items():
    print(img.attr('src'))

Ce L-sky · #42

PHP, three minutes.
Using php-simple-dom.

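Code:

<?php
// Assumption: php-simple-dom.php is the PHP Simple HTML DOM Parser,
// which provides file_get_html() and find().
include('php-simple-dom.php');
$html = file_get_html('page URL');  // replace with the address of the page
$srcs = '';
// Collect the src attribute of every <img> element, one per line.
foreach ($html->find('img') as $img) {
	$srcs .= $img->src . "\r\n";
}
echo $srcs;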

Samuelwise · #43

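I did this in MATLAB once: pull all the image links out of a page into a txt file, then batch-download them automatically. It was for that Sara set on Beautiful Leg, I think. I also wrote something that merged the sub-pages behind a page's second-level links into one file with bookmarks, but some of the results came out at 10-20 MB or more, and apart from IE and FF, nothing else (FrontPage, dm, and so on) could handle them.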
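
The two-step workflow described here (dump the links to a txt file first, batch-download afterwards) could look roughly like this in Python 3. This is only a sketch under assumptions: the function names, the one-URL-per-line file format, and the fallback file naming are all made up for illustration and have nothing to do with the MATLAB original.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def local_name(url: str, index: int) -> str:
    """File name from the last path segment, or a numbered fallback."""
    name = os.path.basename(urlsplit(url).path)
    return name or "img%d" % index


def read_url_list(path: str) -> list:
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def download_all(list_path: str, out_dir: str) -> None:
    """Download every URL in the list file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(read_url_list(list_path), start=1):
        target = os.path.join(out_dir, local_name(url, i))
        try:
            urllib.request.urlretrieve(url, target)
        except (OSError, ValueError) as err:
            # Network failures and malformed URLs are reported and skipped.
            print("skipped", url, err)
```

Keeping the link list in a plain file means it can be inspected or trimmed by hand before anything is downloaded, and the download step can be re-run without scraping the page again.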

biog8888 · #44

Happily checking in.

blune68 · #45

http = require 'http'
urllib = require 'url'
queryString = require 'querystring'

# Serves a small form; submitting a URL returns the <img> tags found on that page.
server = http.createServer (req, resp) ->
  urlObj = urllib.parse(req.url)
  pathname = urlObj.pathname

  initHtml = "<form action='/submit' method='get'><div><input type='text' name='reqUrl'><input type='submit'></div></form>"

  if pathname is '/submit'
    reqObj = queryString.parse(urlObj.query)
    resp.setHeader 'Content-Type', 'text/html'

    if reqObj.reqUrl
      backServer reqObj.reqUrl, (err, datas) ->
        parseHtml datas, (imgs) ->
          if imgs and imgs.length > 0
            resp.write "#{initHtml}<div>#{imgs.join('')}</div>"
            resp.end()
          else
            resp.write('No images found')
            resp.end()
    else
      resp.write('Missing URL parameter')
      resp.end()
  else
    resp.setHeader 'Content-Type', 'text/html'
    resp.write(initHtml)
    resp.end()

server.listen(9999)

# Fetch the given URL and pass the response body (a Buffer) to cb.
backServer = (url, cb) ->
  options = urllib.parse url
  options.method = 'GET'

  req = http.request options, (res) ->
    chunks = []
    res.on 'data', (chunk) ->
      chunks.push chunk

    res.on 'end', () ->
      buffer = Buffer.concat chunks
      cb(null, buffer)

    res.on 'error', (err) ->
      cb(err)

  req.on 'error', (err) ->
    cb(err)

  req.end()

# Pull every <img ... src="..."> tag out of the page with a regex.
parseHtml = (datas, cb) ->
  reg = /<img[^>]+src="[^"]+"[^>]*>/g
  # Fall back to an empty string when the fetch failed and datas is undefined.
  htmlToString = if datas then datas.toString() else ''
  imgs = htmlToString.match reg
  cb(imgs)
