What's a good way to extract the link URLs from a web page?

谢宝良 · #1

As shown in the screenshot:

astolia · #2

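If the page's markup is well-formed and regular, just match it with a regular expression. Otherwise, parse it first and extract the links from the DOM.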
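
A minimal sketch of the regular-expression route in Python, assuming every link is written as a double-quoted `href="..."` inside an `<a>` tag. The sample markup and the name `extract_hrefs` are made up for illustration; irregular markup would need the parse-then-DOM route instead.

```python
import re

# Matches href="..." inside an <a ...> tag; assumes double-quoted attribute values.
HREF_RE = re.compile(r'<a\b[^>]*?\shref\s*=\s*"([^"]*)"', re.IGNORECASE)


def extract_hrefs(html):
    """Return every double-quoted href found in <a> tags, in document order."""
    return HREF_RE.findall(html)


# Hypothetical sample markup, for illustration only.
sample = (
    '<li><a href="http://example.com/a">A</a></li>'
    '<li><a class="x" href="http://example.com/b">B</a></li>'
)
for url in extract_hrefs(sample):
    print(url)
```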

lilydjwg · #3

Python + lxml + XPath.

onlylove · #4

beautiful soap?

lilydjwg · #5

onlylove wrote: beautiful soap?

Pfft, a beautiful bar of soap~

谢宝良 · #6

I'm copying them by hand for now. I can't get it to work.

谢宝良 · #7

Code:

                      	<span class="stu-l">已学完(00:08:32)</span>
                      
                    	<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/6bac03f0-e10f-4f9b-a250-ac329f031077">2.1“互联网+”的前世今生(00:08:32)</a>
                    </li>
                   
                    <li> 
                      
                      	
                      	
                      	<span class="stu-l">已学完(00:08:33)</span>
                      
                    	<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/422b98c1-aa23-41f6-904d-c64d54af0a41">3.1“互联网+”概述(00:08:33)</a>
                    </li>
                   
                    <li> 
                      
                      	<span class="stu-l">未学习(00:00:00)</span>
                      	
                      	
                      
                    	<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/365ef433-b12f-4635-b6d5-127c5753ca38">3.2“互联网+”的动力(00:08:00)</a>
                    </li>
                   
                    <li> 
                      
                      	<span class="stu-l">未学习(00:00:00)</span>
                      	
                      	
                      
                    	<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/e0c67032-ab6a-43aa-841d-0577886c395c">4.1互联网思维和互联网渠道(00:09:23)</a>
                    </li>
                   
                    <li> 
                      
                      	<span class="stu-l">未学习(00:00:00)</span>
                      	
                      	
                      
                    	<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/c01d8cec-ec57-4692-bbcd-18e1e446a691">4.2互联网平台及物联网(00:07:55)</a>
                    </li>
                   
                    <li>

How do I extract the link URLs of the items marked 未学习 (not yet studied)?

vickycq · #8

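I had never used BeautifulSoup before; I picked it up just now for this, and the code is untested.

Assume the page URL is http://aaa.bbb.ccc
The links end up in a file named output

Code: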

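# Python 3; needs the beautifulsoup4 and lxml packages
from urllib.request import urlopen

from bs4 import BeautifulSoup

html_doc = urlopen("http://aaa.bbb.ccc").read()
soup = BeautifulSoup(html_doc, 'lxml')

lis = soup.find_all('li')

# Collect title -> URL for every <li> whose markup contains 未学习 (not yet studied)
hrefs = {}
for each_li in lis:
    if '未学习' in str(each_li):
        # "as" is a reserved word in Python, so it cannot be a variable name
        anchors = each_li.find_all('a')
        for each_a in anchors:
            if each_a.has_attr('href'):
                title = each_a.get_text()
                link = each_a.attrs['href']
                hrefs[title] = link

# Write one "title - URL" line per link
with open('output', 'w', encoding='utf-8') as f:
    for each_title, each_link in hrefs.items():
        f.write('%s - %s\n' % (each_title, each_link))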

AutoXBC · #9

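Given labels like 未学习 (not yet studied) and 已学完 (completed), the page obviously requires a login, and adding a simulated login to crawler code would be a real hassle. This job is better handled inside the browser: a very simple UserJS snippet gets you what you need.

Code: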

// Log the href of the element right after each .stu-l label that contains 未学习 (not yet studied)
Array.prototype.slice.call(document.querySelectorAll('.stu-l')).forEach(function(e){
	if(e.textContent.match('未学习'))
		console.log(e.nextElementSibling.href);
});

谢宝良 · #10

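Thanks to both of you above for helping in the middle of the night. I haven't touched Python in a long time; which version is this for? Actually, a semi-automatic approach would be fine for me: I save the page by hand to get an HTML file, then use sed or awk to extract the URLs.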

谢宝良 · #11

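AutoXBC wrote: Given labels like 未学习 (not yet studied) and 已学完 (completed), the page obviously requires a login, and adding a simulated login to crawler code would be a real hassle. This job is better handled inside the browser: a very simple UserJS snippet gets you what you need.
Code: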
Array.prototype.slice.call(document.querySelectorAll('.stu-l')).forEach(function(e){
	if(e.textContent.match('未学习'))
		console.log(e.nextElementSibling.href);
});

Is that a SeaMonkey extension? How do I run the code above?

谢宝良 · #12

Could someone who knows sed or awk well help me extract the URLs that follow 未学习 (not yet studied) and turn them into:
xdg-open URL
sleep 600
xdg-open URL
sleep 600
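
For the save-the-page-by-hand workflow, here is a rough sketch of that transformation in Python rather than sed or awk, using only the standard library. It assumes the saved file keeps the structure quoted earlier in the thread, where each `<span class="stu-l">` status label is followed by its `<a>` link. The names `UnstudiedLinkParser`, `unstudied_links`, and `to_shell_script` are made up for illustration, and the sample is trimmed from the HTML posted above.

```python
import shlex
from html.parser import HTMLParser


class UnstudiedLinkParser(HTMLParser):
    """Collect the href of the first <a> after each span.stu-l containing 未学习."""

    def __init__(self):
        super().__init__()
        self.in_status_span = False  # currently inside <span class="stu-l">
        self.want_link = False       # the latest status label said 未学习
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "stu-l" in (attrs.get("class") or "").split():
            self.in_status_span = True
            self.want_link = False   # a new label resets the decision
        elif tag == "a" and self.want_link and attrs.get("href"):
            self.links.append(attrs["href"])
            self.want_link = False

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_status_span = False

    def handle_data(self, data):
        if self.in_status_span and "未学习" in data:
            self.want_link = True


def unstudied_links(html):
    """Return the URLs of all links whose status label contains 未学习."""
    parser = UnstudiedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def to_shell_script(links, delay=600):
    """Emit 'xdg-open URL' followed by 'sleep delay' for each link."""
    return "".join(
        "xdg-open %s\nsleep %d\n" % (shlex.quote(url), delay) for url in links
    )


# Two items trimmed from the HTML posted earlier in the thread.
sample = """
<li><span class="stu-l">已学完(00:08:33)</span>
<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/422b98c1-aa23-41f6-904d-c64d54af0a41">3.1</a></li>
<li><span class="stu-l">未学习(00:00:00)</span>
<a href="http://yun.chinahrt.com/studentCoursePage/chapterDetail/365ef433-b12f-4635-b6d5-127c5753ca38">3.2</a></li>
"""
print(to_shell_script(unstudied_links(sample)), end="")
```

To run it on a real saved page, read the file instead of `sample`, for example `open("saved.html", encoding="utf-8").read()`, where `saved.html` is a placeholder name and the encoding is a guess that may need adjusting.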

AutoXBC · #13

谢宝良 wrote: Is that a SeaMonkey extension? How do I run the code above?

No extension is needed. Save it as a bookmarklet and click it to run.
https://zh.wikipedia.org/wiki/%E5%B0%8F ... 6%E7%AD%BE

Code:

javascript:alert(Array.prototype.slice.call(document.querySelectorAll('.stu-l')).reduce(function(pre,cur){return pre+(cur.textContent.match('未学习')?'xdg-open '+cur.nextElementSibling.href+'\nsleep 600\n':'')},''))

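Rather than saving the HTML by hand and then processing strings in the shell, it is simpler to generate the target script directly and copy it. No tool is better suited to working with the DOM than the browser.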

谢宝良 · #14

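AutoXBC wrote:
谢宝良 wrote: Is that a SeaMonkey extension? How do I run the code above?
No extension is needed. Save it as a bookmarklet and click it to run.
https://zh.wikipedia.org/wiki/%E5%B0%8F ... 6%E7%AD%BE
Code:
javascript:alert(Array.prototype.slice.call(document.querySelectorAll('.stu-l')).reduce(function(pre,cur){return pre+(cur.textContent.match('未学习')?'xdg-open '+cur.nextElementSibling.href+'\nsleep 600\n':'')},''))
Rather than saving the HTML by hand and then processing strings in the shell, it is simpler to generate the target script directly and copy it. No tool is better suited to working with the DOM than the browser.
Attachment: 2017-05-23_214645.png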

Many thanks, friend; I've learned something new again. It would be great if the extracted content could be written to a file.

AutoXBC · #15

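谢宝良 wrote: It would be great if the extracted content could be written to a file.

ECMAScript itself does not define file I/O, but newer browsers provide APIs that let a script save a local file, and some browsers have private APIs that can do it too. This raises compatibility issues, so you would have to write and debug the code for your own environment. For browsers that cannot write files directly, you can run a local web server with a CGI script: the browser encodes the text and passes it in a GET URL, and the web server receives it and saves it locally. That also takes a bit of code of your own.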
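
A rough sketch of the receiving side in Python, for illustration only. It swaps the separate CGI script for a plain request handler from the standard library's `http.server`, and it assumes the browser sends the text URL-encoded in a query parameter named `text` (for example via `encodeURIComponent`). The parameter name, the port, and the names `save_query_text` and `make_save_server` are all assumptions, not anything from the thread.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def save_query_text(request_path, out_path):
    """Append the URL-decoded 'text' query parameter of request_path to out_path."""
    query = parse_qs(urlparse(request_path).query)  # parse_qs also URL-decodes
    text = query.get("text", [""])[0]
    with open(out_path, "a", encoding="utf-8") as f:
        f.write(text)
    return text


def make_save_server(out_path, port=8000):
    """Build a localhost-only server that saves the 'text' parameter of GET requests."""

    class SaveHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            save_query_text(self.path, out_path)
            self.send_response(204)  # No Content: nothing to send back
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep the console quiet

    return HTTPServer(("127.0.0.1", port), SaveHandler)


# To run it (blocks until interrupted); "links.sh" is a placeholder file name:
#   make_save_server("links.sh").serve_forever()
```

Binding to 127.0.0.1 keeps the server reachable only from the local machine. Browsers and servers limit how long a GET URL can be, so a long list of links may need to be sent in several requests.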
