gs scraping with node.js
introduction
- Google Scholar data can be scraped without coding by using an online web scraping tool, e.g. Octoparse (Jones, 2023) or ParseHub (Miranda, 2022).
- Python with Beautiful Soup, one of its packages for parsing HTML and XML, can also be used to scrape Google Scholar data (iWeb Data Scraping, 2023), although attention must be paid to IP blocks due to the simple API (Ganesan, 2020).
- There is also a Python package written for Google Scholar scraping, which requires only a few lines of code (DataKund, 2022).
- PHP, with the help of Simple HTML DOM Parser, is able to collect data on hundreds of lectures from several faculties in a university (Putri et al., 2021).
reading materials
- How to use Google Scholar: A short introduction to searching the literature on the internet.
- Install Node.js: Server-side web applications using JavaScript.
- Scraping your webpage: Using Node.js to scrape your GitHub Pages webpage.
- Node.js and Google SERP: Following the steps from Khandelwal's story.
- Random HTTP User-Agent headers: One way to reduce the chance of being blocked while web scraping.
- Scrape Google Scholar SERP with Node.js (planned).
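The random User-Agent idea from the reading list can be sketched in a few lines of Node.js. This is only an illustration under stated assumptions: the strings in `USER_AGENTS` are example values rather than a maintained list, and the names `randomUserAgent` and `buildHeaders` are hypothetical helpers that do not come from the reading materials.

```javascript
// Hypothetical pool of User-Agent strings (example values, assumed for illustration).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0",
];

// Pick one entry of the pool at random.
// The random source is a parameter so that the choice can be made deterministic in tests.
function randomUserAgent(pool = USER_AGENTS, rng = Math.random) {
  const index = Math.floor(rng() * pool.length);
  return pool[index];
}

// Build a headers object that can be passed to an HTTP client on each request.
function buildHeaders(rng = Math.random) {
  return {
    "User-Agent": randomUserAgent(USER_AGENTS, rng),
    "Accept-Language": "en-US,en;q=0.9",
  };
}

console.log(buildHeaders());
```

Calling `buildHeaders()` before every request gives each request a header set chosen independently, which is the point of the technique. It may reduce blocking, but it does not guarantee it.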
assignment
- Read the reading materials and perform the activities described there, if you have not done so yet.
- Modify the given code to scrape Google Scholar SERP for three search words.
- Record the statistics, in the form of program output, for
  - each word on its own
  - each pair of two words in arbitrary order
  - all three words in arbitrary order
  - all three words as a phrase
- Analyze the four results and draw a conclusion.
- Write a report for this assignment in the form of a story on Medium.
- Report the link on https://github.com/dudung/lecture-notes/issues/18 before Friday, 27 October 2023, 09:00 GMT+07.
- Prepare a presentation for the class on Friday, 27 October 2023, 09:00-10:40 GMT+07.
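The four kinds of search in the assignment can be generated from three words as in the following sketch. It assumes that unquoted words are treated as an arbitrary-order search and that double quotes request a phrase search. The example words, the `hl=en` parameter, and the helper names `buildQueries` and `buildUrl` are assumptions for illustration, not part of the given code.

```javascript
// Base address of the Google Scholar search page (same path as in the hint below).
const BASE_URL = "https://scholar.google.com/scholar";

// Build the eight query strings for three search words:
// three single words, three pairs, all three words, and all three as a quoted phrase.
function buildQueries([a, b, c]) {
  return [
    a,
    b,
    c,
    `${a} ${b}`,
    `${a} ${c}`,
    `${b} ${c}`,
    `${a} ${b} ${c}`,
    `"${a} ${b} ${c}"`,
  ];
}

// Turn one query string into a search URL; URLSearchParams handles the encoding.
function buildUrl(query, lang = "en") {
  const params = new URLSearchParams({ q: query, hl: lang });
  return `${BASE_URL}?${params.toString()}`;
}

// Hypothetical example words; replace them with your own three search words.
const words = ["physics", "education", "simulation"];
for (const query of buildQueries(words)) {
  console.log(query, "->", buildUrl(query));
}
```

Each URL can then be fetched with the HTTP client used in the given code, and the number of results recorded per query.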
hint
<ul id="gs_res_sb_yyl">
<li class="gs_ind gs_bdy_sb_sel">
<a href="/scholar?q=physics&hl=de&as_sdt=0,5">Beliebige Zeit</a>
</li>
<li class="gs_ind">
<a href="/scholar?as_ylo=2023&q=physics&hl=de&as_sdt=0,5">Seit 2023</a>
</li>
<li class="gs_ind">
<a href="/scholar?as_ylo=2022&q=physics&hl=de&as_sdt=0,5">Seit 2022</a>
</li>
<li class="gs_ind">
<a href="/scholar?as_ylo=2019&q=physics&hl=de&as_sdt=0,5">Seit 2019</a>
</li>
<li class="gs_ind">
<a id="gs_res_sb_yyc" href="javascript:void(0)">Zeitraum wählen...</a>
</li>
</ul>
<div class="gs_ab_mdw">
Ungefähr 6.250.000 Ergebnisse (<b>0,07</b> Sek.)
</div>
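One possible way to read the number of results from the `gs_ab_mdw` block in the hint is sketched below. To stay free of dependencies it uses regular expressions on the HTML string instead of a DOM parser, which is a simplification and may break if Google Scholar changes its markup. The function name `parseResultCount` is hypothetical, and the sketch assumes that the block contains no nested `div` and that the elapsed time appears in parentheses, as in the hint.

```javascript
// Extract the result count from a Google Scholar results page given as an HTML string.
// Returns a number, or null when no count is found.
function parseResultCount(html) {
  // Look at every <div class="gs_ab_mdw">...</div> block in the page.
  const divPattern = /<div class="gs_ab_mdw">([\s\S]*?)<\/div>/g;
  for (const match of html.matchAll(divPattern)) {
    // Remove the elapsed time in parentheses, then any remaining tags.
    const text = match[1]
      .replace(/\([\s\S]*?\)/g, "")
      .replace(/<[^>]*>/g, "");
    // Take the first run of digits with thousands separators (".", "," or spaces).
    const number = text.match(/\d[\d.,\s]*/);
    if (number) {
      // Drop the separators so that "6.250.000" and "6,250,000" both parse.
      return Number(number[0].replace(/\D/g, ""));
    }
  }
  return null;
}

// The block from the hint, as returned for hl=de.
const sample = `<div class="gs_ab_mdw">
Ungefähr 6.250.000 Ergebnisse (<b>0,07</b> Sek.)
</div>`;
console.log(parseResultCount(sample));
```

For the block in the hint this yields 6250000. Removing every non-digit character makes the function independent of the interface language, since the thousands separator differs between locales.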