Scraping Website Data with JavaScript

Introduction

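Most crawlers are written in Python, typically with Scrapy, but Python can feel unfamiliar to front-end engineers. Can JavaScript, the language we already know, do the job? It can, and it may even be a better fit for scraping than Python.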

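The reasons are as follows:

  1. JavaScript's asynchronous I/O model suits scraping, which is an I/O-bound task. Callbacks come naturally in JavaScript, and asynchronous network requests keep the CPU from sitting idle while it waits on the network.
  2. jQuery is arguably the most powerful tool for querying HTML, so writing the scraper in JavaScript means less to learn and less to remember. Python has PyQuery, but it never feels quite as natural as jQuery.
  3. Scraped results usually end up as JSON, and JavaScript is the language best suited to handling JSON.

Source: https://zhuanlan.zhihu.com/p/53763115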
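
To make the first point concrete, here is a minimal sketch of overlapping I/O waits with `Promise.all`. The `fakeRequest` helper and its delays are hypothetical stand-ins for real network calls, so the example runs without network access.

```javascript
// Hypothetical stand-in for a network request: resolves after `ms` milliseconds.
const fakeRequest = (name, ms) =>
  new Promise(resolve => setTimeout(() => resolve(`${name} done`), ms));

// Start all requests at once; the waits overlap instead of adding up.
const fetchAll = async () => {
  const started = Date.now();
  const results = await Promise.all([
    fakeRequest("page-1", 100),
    fakeRequest("page-2", 100),
    fakeRequest("page-3", 100)
  ]);
  return { results, elapsedMs: Date.now() - started };
};

fetchAll().then(({ results, elapsedMs }) => {
  // Total time is close to the slowest request, not the sum of all three.
  console.log(results, `${elapsedMs} ms`);
});
```

Because the three waits run concurrently, the total time tracks the slowest request rather than the sum, which is the property that makes this model a good fit for scraping many pages.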

Target site

GitHub's Trending page

Data we want:

  1. Repository name (repositoryName)
  2. Author (author)
  3. stars
  4. Description (description)
  5. Language (language)

Libraries used

cheerio

> Fast, flexible & lean implementation of core jQuery designed specifically for the server.

GITHUB: https://github.com/cheeriojs/cheerio

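cheerio parses HTML. Its interface is almost identical to jQuery's, so you can start using it right away. The reason to use it instead of jQuery is that cheerio implements only the DOM-related part of jQuery, which makes it lighter.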

isomorphic-fetch

> Isomorphic WHATWG Fetch API, for Node & Browserify

GITHUB: https://github.com/matthew-andrews/isomorphic-fetch

I'm used to the fetch interface, and isomorphic-fetch works in both the browser and Node.js. Feel free to use whichever HTTP library you prefer.
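
One way to keep that choice in a single place is sketched below. It assumes the runtime may already expose a global `fetch` (Node.js 18 and later do by default) and falls back to `isomorphic-fetch` otherwise. The `resolveFetch` helper name is hypothetical.

```javascript
// Hypothetical helper: prefer the runtime's built-in fetch, else load the library.
const resolveFetch = () => {
  if (typeof globalThis.fetch === "function") {
    return globalThis.fetch;
  }
  // Only reached on runtimes without a global fetch (e.g. Node.js before 18).
  return require("isomorphic-fetch");
};

const fetchImpl = resolveFetch();
console.log(typeof fetchImpl);
```

The rest of the scraper can then call `fetchImpl(url)` without caring which implementation is behind it.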

Getting started

  1. Fetch the HTML
const response = await fetch(targetUrl);
const html = await response.text();
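
The two lines above assume the request succeeds. A slightly more defensive sketch checks the HTTP status before reading the body. The `fetchHtml` name and the injectable `fetchImpl` parameter are assumptions added for illustration, and the stub at the bottom stands in for a real server so the example runs offline.

```javascript
// Hypothetical helper: fetch a page and fail loudly on non-2xx responses.
const fetchHtml = async (url, fetchImpl = globalThis.fetch) => {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }
  return response.text();
};

// Stub fetch for the demo: returns a fixed page for one URL, 404 otherwise.
const stubFetch = async url =>
  url === "https://example.com/ok"
    ? { ok: true, status: 200, text: async () => "<h1>Hello</h1>" }
    : { ok: false, status: 404, text: async () => "" };

fetchHtml("https://example.com/ok", stubFetch).then(html => console.log(html));
fetchHtml("https://example.com/missing", stubFetch).catch(error =>
  console.log(error.message)
);
```

Without the status check, an error page would be parsed as if it were the target page and the selectors would silently match nothing.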
  2. Parse the HTML
const $ = cheerio.load(html);
  3. Extract the required data
const data = [];

$(".Box-row").each((index, element) => {
  // Combined "author / repository" text from the heading link
  const repositoryNameAndAuthor = $(element)
    .find("h1 a")
    .text()
    .trim();

  // Author
  const author = repositoryNameAndAuthor.split("/")[0].trim();

  // Repository name
  const repositoryName = repositoryNameAndAuthor.split("/")[1].trim();

  // Description
  const description = $(element)
    .find("p")
    .text()
    .trim();

  // Language
  const language = $(element)
    .find('span[itemprop="programmingLanguage"]')
    .text()
    .trim();

  // Stars
  const stars = $(element)
    .find(".muted-link")
    .first()
    .text()
    .trim();

  data.push({
    repositoryName,
    author,
    description,
    language,
    stars
  });
});
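
The extracted values are raw strings. The sketch below shows one way to normalize them: splitting the combined "author / repository" text once, with a guard for unexpected input, and turning a comma-grouped star count such as "54,397" into a number. Both helper names are hypothetical, and `parseStars` assumes the count is a plain integer with optional separators.

```javascript
// Hypothetical helper: split "author / repository" text into its two parts.
// Missing parts default to an empty string instead of throwing.
const splitAuthorAndRepository = text => {
  const [author = "", repositoryName = ""] = text
    .split("/")
    .map(part => part.trim());
  return { author, repositoryName };
};

// Hypothetical helper: strip non-digits and convert to a number.
// Returns null when the text contains no digits at all.
const parseStars = text => {
  const digits = text.replace(/[^0-9]/g, "");
  return digits === "" ? null : Number(digits);
};

console.log(splitAuthorAndRepository("iluwatar / java-design-patterns"));
console.log(parseStars("54,397"));
```

Storing stars as numbers makes later sorting and filtering straightforward, whereas comparing the raw strings would order "801" above "54,397".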

Full code

const cheerio = require("cheerio");
const fetch = require("isomorphic-fetch");

const targetUrl = "https://github.com/trending";

const scrape = async () => {
  // Download the page and read the body as text
  const response = await fetch(targetUrl);
  const html = await response.text();

  // Load the HTML into cheerio for jQuery-style querying
  const $ = cheerio.load(html);

  const data = [];

  // Each trending repository is rendered as one ".Box-row" element.
  // These selectors reflect GitHub's markup when the article was written
  // and may need adjusting if the page structure changes.
  $(".Box-row").each((index, element) => {
    const repositoryNameAndAuthor = $(element)
      .find("h1 a")
      .text()
      .trim();
    const author = repositoryNameAndAuthor.split("/")[0].trim();
    const repositoryName = repositoryNameAndAuthor.split("/")[1].trim();
    const description = $(element)
      .find("p")
      .text()
      .trim();

    const language = $(element)
      .find('span[itemprop="programmingLanguage"]')
      .text()
      .trim();

    const stars = $(element)
      .find(".muted-link")
      .first()
      .text()
      .trim();

    data.push({
      repositoryName,
      author,
      description,
      language,
      stars
    });
  });

  return data;
};

scrape()
  .then(data => console.log(data))
  .catch(error => console.error(error));
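
Since the result is a plain array of objects, persisting it as JSON takes only a few lines. This sketch uses Node's built-in `fs` module. The `saveAsJson` helper, the sample record, and the output path are assumptions for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: write scraped records to a pretty-printed JSON file.
const saveAsJson = (records, filePath) => {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2), "utf8");
  return filePath;
};

// Sample record shaped like the scraper's output.
const sampleRecords = [
  {
    repositoryName: "linux",
    author: "torvalds",
    description: "Linux kernel source tree",
    language: "C",
    stars: "84,872"
  }
];

const outputPath = saveAsJson(
  sampleRecords,
  path.join(os.tmpdir(), "trending.json")
);
console.log(`Saved ${sampleRecords.length} record(s) to ${outputPath}`);
```

In the real script, the array returned by `scrape()` would be passed in place of `sampleRecords`.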

Result

[
{
"repositoryName": "Kotlin-Pokedex",
"author": "mrcsxsiq",
"description": "🌀 A Pokedex app using ViewModel, LiveData, Room and Navigation",
"language": "Kotlin",
"stars": "343"
},
{
"repositoryName": "java-design-patterns",
"author": "iluwatar",
"description": "Design patterns implemented in Java",
"language": "Java",
"stars": "54,397"
},
{
"repositoryName": "chaos-mesh",
"author": "pingcap",
"description": "A Chaos Engineering Platform for Kubernetes",
"language": "Go",
"stars": "801"
},
{
"repositoryName": "deeplearning-models",
"author": "rasbt",
"description": "A collection of various deep learning architectures, models, and tips",
"language": "Jupyter Notebook",
"stars": "10,903"
},
{
"repositoryName": "AndroidKnowledgeSystem",
"author": "feelschaotic",
"description": "The most complete Android advanced route knowledge map ⭐️你想要的最全 Android 进阶路线知识图谱+干货资料收集🚀",
"language": "JavaScript",
"stars": "1,234"
},
{
"repositoryName": "fxxkmakeding",
"author": "xyjoey",
"description": "",
"language": "",
"stars": "487"
},
{
"repositoryName": "pulse-sms-android",
"author": "klinker-apps",
"description": "The ultimate SMS app for Android, available across all of your devices.",
"language": "Kotlin",
"stars": "298"
},
{
"repositoryName": "nodetube",
"author": "mayeaux",
"description": "Open-source YouTube alternative that also supports image and audio uploads. Powered by NodeJS",
"language": "JavaScript",
"stars": "939"
},
{
"repositoryName": "coding_challenge-25",
"author": "zero-to-mastery",
"description": "",
"language": "",
"stars": "51"
},
{
"repositoryName": "computer-science",
"author": "ossu",
"description": "🎓 Path to a free self-taught education in Computer Science!",
"language": "",
"stars": "53,446"
},
{
"repositoryName": "GitHubDaily",
"author": "GitHubDaily",
"description": "GitHubDaily 分享内容定期整理与分类。欢迎推荐、自荐项目,让更多人知道你的项目。",
"language": "",
"stars": "5,411"
},
{
"repositoryName": "ALBERT",
"author": "google-research",
"description": "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations",
"language": "Python",
"stars": "728"
},
{
"repositoryName": "architect-awesome",
"author": "xingshaocheng",
"description": "后端架构师技术图谱",
"language": "",
"stars": "41,832"
},
{
"repositoryName": "isocity",
"author": "victorqribeiro",
"description": "A isometric city builder in JavaScript",
"language": "JavaScript",
"stars": "1,644"
},
{
"repositoryName": "dockerlabs",
"author": "collabnix",
"description": "Docker - Beginners | Intermediate | Advanced",
"language": "PHP",
"stars": "736"
},
{
"repositoryName": "rhasspy",
"author": "synesthesiam",
"description": "Rhasspy voice assistant for Home Assistant and Hass.IO",
"language": "HTML",
"stars": "485"
},
{
"repositoryName": "flutter",
"author": "flutter",
"description": "Flutter makes it easy and fast to build beautiful mobile apps.",
"language": "Dart",
"stars": "83,214"
},
{
"repositoryName": "coding-interview-university",
"author": "jwasham",
"description": "A complete computer science study plan to become a software engineer.",
"language": "",
"stars": "95,699"
},
{
"repositoryName": "tailwind-starter-kit",
"author": "creativetimofficial",
"description": "Tailwind Starter Kit a beautiful extension for TailwindCSS, Free and Open Source",
"language": "CSS",
"stars": "318"
},
{
"repositoryName": "flowy",
"author": "alyssaxuu",
"description": "The minimal javascript library to create flowcharts ✨",
"language": "JavaScript",
"stars": "5,350"
},
{
"repositoryName": "awesome-python",
"author": "vinta",
"description": "A curated list of awesome Python frameworks, libraries, software and resources",
"language": "Python",
"stars": "77,646"
},
{
"repositoryName": "overreacted.io",
"author": "gaearon",
"description": "Personal blog by Dan Abramov.",
"language": "JavaScript",
"stars": "4,903"
},
{
"repositoryName": "spring-boot-demo",
"author": "xkcoding",
"description": "spring boot demo 是一个用来深度学习并实战 spring boot 的项目,目前总共包含 63 个集成demo,已经完成 52 个。 该项目已成功集成 actuator(监控)、admin(可视化监控)、logback(日志)、aopLog(通过AOP记录web请求日志)、统一异常处理(json级别和页面级别)、freemarker(模板引擎)、thymeleaf(模板引擎)、Beetl(模板引擎)、Enjoy(模板引擎)、JdbcTemplate(通用JDBC操作数据库)、JPA(强大的ORM框架)、mybatis(强大的ORM框架)、通用Mapper(快速操作Mybatis)、PageHelper(通用的Mybatis分页插件)、mybatis-plus(快速操作M…",
"language": "Java",
"stars": "9,139"
},
{
"repositoryName": "hasskit",
"author": "tuanha2000vn",
"description": "HassKit is a Touch-Friendly - Zero Config App to help users instantly start using Home Assistant",
"language": "Dart",
"stars": "196"
},
{
"repositoryName": "linux",
"author": "torvalds",
"description": "Linux kernel source tree",
"language": "C",
"stars": "84,872"
}
]
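
Once the data is in this shape, it can be processed with ordinary array methods. A small sketch, using three records trimmed from the output above, that groups repository names by language. The `groupByLanguage` helper and the "Unknown" label for empty languages are assumptions.

```javascript
// Three records taken from the scraped output, reduced to the fields used here.
const records = [
  { repositoryName: "Kotlin-Pokedex", language: "Kotlin", stars: "343" },
  { repositoryName: "pulse-sms-android", language: "Kotlin", stars: "298" },
  { repositoryName: "computer-science", language: "", stars: "53,446" }
];

// Hypothetical helper: map each language to the list of repository names.
const groupByLanguage = items =>
  items.reduce((groups, item) => {
    // Repositories without a detected language come back as an empty string.
    const key = item.language || "Unknown";
    (groups[key] = groups[key] || []).push(item.repositoryName);
    return groups;
  }, {});

console.log(groupByLanguage(records));
```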