TypeScript 编写爬虫工具
初始化
// 生成package.json
npm init -y
// tsconfig.json
tsc --init
// 卸载全局ts-node
npm uninstall ts-node -g
// 安装ts-node typescript在dev环境
npm install -D ts-node
npm install typescript -D
- 新建 src 下的 crawler.ts
console.log('test');
- 更改 package.json 中的执行语句
"scripts": {
"dev": "ts-node ./src/crawler.ts"
}
- 测试
PS E:\typescript\crawler> npm run dev
> crawler@1.0.0 dev E:\typescript\crawler
> ts-node ./src/crawler.ts
hello world
class Crawler {
private secret = 'secretKey';
private url = `
http://www.dell-lee.com/typescript/demo.html?secret?=${this.secret}
`;
constructor() {
console.log('constructor');
}
}
// secret是类里面的一个属性,需要通过this获取
const crawler = new Crawler();
SuperAgent
superagent 可以获取到远程网址上的 html
npm install superagent --save
- --save:dependencies 生产环境用到的模块
- --–save-dev: devDependencies 开发环境模块(-D)
类型定义文件@types
TypeScript 引用 JavaScript 会报错,且无法提供只能提醒
import Superagent from 'superagent';
需要提供 .d.ts 的翻译文件,将 js 文件里面的类型文件进行补全
ts => .d.ts 翻译文件 @types/ => js
无法找到模块“superagent”的声明文件。“e:/typescript/crawler/node_modules/superagent/lib/node/index.js”隐式拥有 "any" 类型。
Try `npm install @types/superagent` if it exists or add a new declaration (.d.ts) file containing `declare module 'superagent';`ts(7016)
解决:在开发环境下引入翻译文件
npm install @types/superagent -D
Htmt 获取的实现
import Superagent from 'superagent';
class Crawler {
private secret = 'secretKey';
private url = `http://www.dell-lee.com/typescript/demo.html?secret?=${this.secret}`;
private rawHtml = '';
async getRawHtml() {
const result = await Superagent.get(this.url);
this.rawHtml = result.text;
}
constructor() {
this.getRawHtml();
}
}
const crawler = new Crawler();
cheerio 数据获取
cheerio 库引入
cheerio 可以读取 html 字符串,让我们能够以 jQuery 的方式操作获取数据
npm install cheerio --save
npm install @types/cheerio -D
代码实现
cheerio 中的 map((index,element)=>{})方法的参数和 JS 的 map((element,index)=>{})方法参数相反
// https://cheerio.js.org/ 文档实例
$('li')
.map(function(i, el) {
// this === el
return $(this).text();
})
.get()
.join(' ');
//=> "apple orange pear"
import superagent from 'superagent';
import cheerio from 'cheerio';
interface Course {
title: string;
count: number;
}
class Crowller {
private secret = 'secretKey';
private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;
getCourseInfo(html: string) {
const $ = cheerio.load(html);
const courseItems = $('.course-item');
const courseInfo: Course[] = [];
courseItems.map((index, ele) => {
const descs = $(ele).find('.course-desc');
const title = descs.eq(0).text();
const count = parseInt(
descs
.eq(1)
.text()
.split(':')[1],
10
);
courseInfo.push({
title: title,
count: count,
});
});
const result = {
time: new Date().getTime(),
data: courseInfo,
};
console.log(result);
}
async getRawHtml() {
const result = await superagent.get(this.url);
this.getCourseInfo(result.text);
}
constructor() {
this.getRawHtml();
}
}
const crowller = new Crowller();
结果
> ts-node ./src/crawler.ts
{
time: 1582818112855,
data: [
{ title: 'Vue2.5开发去哪儿网App', count: 18 },
{ title: 'React 16.4 开发简书项目', count: 74 },
{ title: 'React服务器渲染原理解析与实践', count: 10 },
{ title: '手把手带你掌握新版Webpack4.0', count: 41 }
]
}
组合设计模式优化
crawler
import fs from 'fs';
import path from 'path';
import superagent from 'superagent';
import CaffreyAnalyzer from './specialAnalyzer';
export interface Analyzer {
analyze: (html: string, filePath: string) => string;
}
class Crowller {
private filePath = path.resolve(__dirname, '../data/course.json');
async getRawHtml() {
const result = await superagent.get(url);
return result.text;
}
private writeFile(content: string) {
fs.writeFileSync(this.filePath, content);
}
async initSpiderProcess() {
const html = await this.getRawHtml();
const fileContent = this.analyzer.analyze(html, this.filePath);
this.writeFile(fileContent);
}
constructor(private url: string, private analyzer: Analyzer) {
this.initSpiderProcess();
}
}
const secret = 'secretKey';
const url = `http://www.dell-lee.com/typescript/demo.html?secret=${secret}`;
const analyzer = new CaffreyAnalyzer();
new Crowller(url, analyzer);
analyzer.js
- class implements interface
import fs from 'fs';
import cheerio from 'cheerio';
import { Analyzer } from './crowller';
interface Course {
title: string;
count: number;
}
interface CourseResult {
time: number;
data: Course[];
}
interface Content {
[propName: number]: Course[];
}
export default class CaffreyAnalyzer implements Analyzer {
getCourseInfo(html: string) {
const $ = cheerio.load(html);
const courseItems = $('.course-item');
const courseInfos: Course[] = [];
courseItems.map((index, element) => {
const descs = $(element).find('.course-desc');
const title = descs.eq(0).text();
const count = parseInt(
descs
.eq(1)
.text()
.split(':')[1],
10
);
courseInfos.push({ title, count });
});
return {
time: new Date().getTime(),
data: courseInfos,
};
}
generateJsonContent(courseInfo: CourseResult, filePath: string) {
let fileContent: Content = {};
if (fs.existsSync(filePath)) {
fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
}
fileContent[courseInfo.time] = courseInfo.data;
return fileContent;
}
public analyze(html: string, filePath: string) {
const courseInfo = this.getCourseInfo(html);
const fileContent = this.generateJsonContent(courseInfo, filePath);
return JSON.stringify(fileContent);
}
}
单例模式实战
specialAnalyzer.ts
import fs from 'fs';
import cheerio from 'cheerio';
import { Analyzer } from './crowller';
interface Course {
title: string;
count: number;
}
interface CourseResult {
time: number;
data: Course[];
}
interface Content {
[propName: number]: Course[];
}
export default class CaffreyAnalyzer implements Analyzer {
// static静态属性,将方法直接挂载在类上面,而不是类的实例上面
private static instance: CaffreyAnalyzer;
static getInstance() {
// 只生成一次
if (!CaffreyAnalyzer.instance) {
CaffreyAnalyzer.instance = new CaffreyAnalyzer();
}
return CaffreyAnalyzer.instance;
}
private getCourseInfo(html: string) {
const $ = cheerio.load(html);
const courseItems = $('.course-item');
const courseInfos: Course[] = [];
courseItems.map((index, element) => {
const descs = $(element).find('.course-desc');
const title = descs.eq(0).text();
const count = parseInt(
descs
.eq(1)
.text()
.split(':')[1],
10
);
courseInfos.push({ title, count });
});
return {
time: new Date().getTime(),
data: courseInfos,
};
}
private generateJsonContent(courseInfo: CourseResult, filePath: string) {
let fileContent: Content = {};
if (fs.existsSync(filePath)) {
fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
}
fileContent[courseInfo.time] = courseInfo.data;
return fileContent;
}
public analyze(html: string, filePath: string) {
const courseInfo = this.getCourseInfo(html);
const fileContent = this.generateJsonContent(courseInfo, filePath);
return JSON.stringify(fileContent);
}
// private私有限制符,只允许内部调用 禁止new 实例
private constructor() {}
}
引用
const analyzer = CaffreyAnalyzer.getInstance();
new Crowller(url, analyzer);
编译过程
初始配置
将 ts 文件编译为 js 文件,然后运行该文件
"scripts": {
"dev": "ts-node ./src/crawler.ts"
}
打开 tsconfig.json 修改编译路径
"outDir": "./build"
typescript 文件是不能直接运行的
node ./build/crawler.js
//报错
node src/crawler.ts
自动编译 ts 文件
通过 npm run build 后,如果后续 ts 文件有修改,会自动编译更新 js 文件
"scripts": {
"build": "tsc -w"
}
自动执行 js 文件
监控整个项目文件变化后执行动作,安装 nodemon(npm install nodemon -D)
- nodemon 默认不会监测 TypeScript 的文件变化(可配置修改)
"scripts": {
"build": "tsc -w",
"start": "nodemon node ./build/crawler.js"
}
tips: 第一次运行的 npm run start 的时候会先执行一次,导致生成了 data 文件夹下面的 course.json; 而当前的文件变化又导致了 nodemon 的监测重新执行,如此反复循环运行 craw.js,需要在 package.json 增加 json 配置
"nodemonConfig": {
"ignore": [
"data/*"
]
}
合并命令
concurrently并行执行命令(npm install concurrently -D)
"scripts": {
"dev:build": "tsc -w",
"dev:start": "nodemon node ./build/crawler.js",
"dev": "concurrently npm run dev:build & npm run dev:start"
}
npm:dev:*相当于 npm run dev: 下的所有命令
"scripts": {
"dev:build": "tsc -w",
"dev: start": "nodemon node ./build/crawler.js",
"dev": "concurrently npm:dev:*"
}