diff --git a/README.md b/README.md
index c82d5fba..342067dd 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-#WebCollector
+# WebCollector
 WebCollector is an open source web crawler framework based on Java.It provides
 some simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
@@ -6,15 +6,15 @@ WebCollector is an open source web crawler framework based on Java.It provides
-##HomePage
+## HomePage
 [https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)
-##Document
+## Document
 [WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)
-##Installation
+## Installation
 ### Without Maven
 WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).
@@ -22,7 +22,7 @@ WebCollector jars are available on the [HomePage](https://github.com/CrawlScript
 + __webcollector-version-bin.zip__ contains core jars.
-##Quickstart
+## Quickstart
 Lets crawl some news from hfut news.This demo prints out the titles and contents extracted from news of hfut news.
 [NewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/NewsCrawler.java):
@@ -96,7 +96,7 @@ public class NewsCrawler extends BreadthCrawler {
-##Content Extraction
+## Content Extraction
 WebCollector could automatically extract content from news web-pages:
 ```java
@@ -114,6 +114,6 @@ Element contentElement = ContentExtractor.getContentElementByUrl(url);
 ```
-##Other Documentation
+## Other Documentation
 + [中文文档](https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md)
diff --git a/README.zh-cn.md b/README.zh-cn.md
index 6481e6e7..150f3cfc 100644
--- a/README.zh-cn.md
+++ b/README.zh-cn.md
@@ -1,17 +1,17 @@
 WebCollector
 ============
-###爬虫简介
+### 爬虫简介
 WebCollector是一个无须配置、便于二次开发的JAVA爬虫框架(内核),它提供精简的的API,只需少量代码即可实现一个功能强大的爬虫。
-###爬虫内核:
+### 爬虫内核:
 WebCollector致力于维护一个稳定、可扩的爬虫内核,便于开发者进行灵活的二次开发。内核具有很强的扩展性,用户可以在内核基础上开发自己想要的爬虫。源码中集成了Jsoup,可进行精准的网页解析。
-###教程:
+### 教程:
 WebCollector的开源中国项目主页中可找到教程列表:[http://www.oschina.net/p/webcollector](http://www.oschina.net/p/webcollector)
-###2.x:
+### 2.x:
 WebCollector 2.x版本特性:
 * 1)自定义遍历策略,可完成更为复杂的遍历业务,例如分页、AJAX
 * 2)可以为每个URL设置附加信息(MetaData),利用附加信息可以完成很多复杂业务,例如深度获取、锚文本获取、引用页面获取、POST参数传递、增量更新等。
@@ -24,7 +24,7 @@ WebCollector 2.x版本特性:
-###Jar包
+### Jar包
 可在[WebCollector的github主页](https://github.com/CrawlScript/WebCollector)下载所需jar包.
 + __webcollector-version-bin.zip__ 包含核心jar包.
@@ -32,7 +32,7 @@ WebCollector 2.x版本特性:
-###__通过捐款支持WebCollector__
+### __通过捐款支持WebCollector__
 维护WebCollector及教程需要花费较大的时间和精力,如果你喜欢WebCollector的话,欢迎通过捐款的方式,支持开发者的工作,非常感谢!