
How to scrape product urls from text file using NodeJS

Our goal in this article is to create a function that reads product URLs from a plain text file and downloads the HTML of each one.

The file format is simple: one product URL per line.

SAMPLE:

[code lang="txt"]
https://www.bhphotovideo.com/c/product/4831-REG/Berg_BBTS32_Toner_for_Black.html
https://www.bhphotovideo.com/c/product/4832-REG/Berg_BCTS32_Toner_for_Black.html
https://www.bhphotovideo.com/c/product/4837-REG/Berg_TRG_Toner_for_Black.html
https://www.bhphotovideo.com/c/product/2436-REG/Arkay_60207724_FC_10b_9_3_4_Water_Filter.html
https://www.bhphotovideo.com/c/product/2846-REG/Arkay_S247210_Premium_Stainless_Steel_Photo.html
https://www.bhphotovideo.com/c/product/2848-REG/Arkay_S248410_Premium_Stainless_Steel_Photo.html
https://www.bhphotovideo.com/c/product/3360-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3370-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3374-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3376-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3382-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3388-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3400-REG/Arkay_Stainless_Steel_Stand_for.html
https://www.bhphotovideo.com/c/product/3404-REG/Arkay_Stainless_Steel_Stand_for.html
[/code]
The code below goes through the file from the first line to the last and lets you handle each line.

[code lang="js"]
const fs = require('fs');
const readline = require('readline');

async function readTxtFileLineByLine()
{
  const fileStream = fs.createReadStream('productUrlsForScraping.txt');
  // crlfDelay: Infinity treats \r\n as a single line break
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  // each iteration yields one line of the file, without the line ending
  for await (const line of rl)
  {
    console.log("product Url from file: " + line);
  }
}

readTxtFileLineByLine();
[/code]
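
Real-world URL lists often contain blank lines or stray whitespace. Here is a minimal sketch of a per-line check you could call inside the loop. `parseUrlLine` is a hypothetical helper name, not part of the original code, and it assumes that only http and https URLs are wanted.

```javascript
// Hypothetical helper: returns the trimmed URL string,
// or null if the line is empty or is not a valid http(s) URL.
function parseUrlLine(line) {
  const trimmed = line.trim();
  if (trimmed === '') return null;

  try {
    // the URL constructor throws a TypeError on input it cannot parse
    const url = new URL(trimmed);
    if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
    return trimmed;
  } catch (err) {
    return null;
  }
}
```

Inside the loop, something like `const url = parseUrlLine(line); if (url === null) continue;` would skip the lines that cannot be scraped.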

OK, the next step is to create a function that downloads the HTML of a product URL.

You can take the sample from our post where we extracted data using the axios library for Node.js:

How to create a web scraper using nodejs and axios

[code lang="js"]
const axios = require('axios');

async function getHtml(url)
{
  // axios.get resolves to a response object; the page body is in response.data
  const response = await axios.get(url);

  return response.data;
}
[/code]
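
By default `axios.get` throws when a request fails (a 404 response or a network error, for example), and one bad URL would then stop the whole run. Here is a minimal sketch of one way to guard against that. `getHtmlOrNull` and its `fetchFn` parameter are hypothetical names introduced for this example. The fetcher is injectable so the wrapper can be tried without network access.

```javascript
// Default fetcher: the same axios call as above. axios is required lazily,
// so the package is only loaded when a real request is made.
async function getHtml(url) {
  const axios = require('axios');
  const response = await axios.get(url);
  return response.data;
}

// Hypothetical wrapper: returns the HTML, or null if the request fails.
async function getHtmlOrNull(url, fetchFn = getHtml) {
  try {
    return await fetchFn(url);
  } catch (err) {
    console.error('failed to fetch ' + url + ': ' + err.message);
    return null;
  }
}
```

The caller can then check for `null` and move on to the next URL instead of aborting.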

The final step is to call this function for every product URL read from the text file.

[code lang="js"]
async function readTxtFileLineByLine()
{
  const fileStream = fs.createReadStream('productUrlsForScraping.txt');
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  for await (const line of rl)
  {
    console.log("product Url from file: " + line);

    // download the page for this product URL
    const html = await getHtml(line);
  }
}

readTxtFileLineByLine();
[/code]
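
The loop above downloads each page but does not keep it. Here is a minimal sketch of collecting the results so they can be processed after the file has been read. `scrapeUrlsFromFile` and `fetchFn` are hypothetical names. `fetchFn` stands in for a function like `getHtml` and is assumed to handle its own errors.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper: reads URLs from filePath, fetches each one with fetchFn,
// and returns a Map of url -> html in file order.
async function scrapeUrlsFromFile(filePath, fetchFn) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  });

  const results = new Map();
  for await (const line of rl) {
    const url = line.trim();
    if (url === '') continue; // skip blank lines

    // requests run one at a time, in the order the URLs appear in the file
    results.set(url, await fetchFn(url));
  }
  return results;
}
```

A call such as `scrapeUrlsFromFile('productUrlsForScraping.txt', getHtml)` would then resolve to a Map with one entry per product URL.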