introduction

  • Google Scholar data can be scraped without coding by using an online web scraping tool, e.g. Octoparse (Jones, 2023) or ParseHub (Miranda, 2022).
  • Python and Beautiful Soup, one of its packages for parsing HTML and XML, can also be used to scrape Google Scholar data (iWeb Data Scraping, 2023), but attention must be paid to IP blocks due to the simple API (Ganesan, 2020).
  • There is also a Python package written for Google Scholar scraping, which requires only a few lines of code (DataKund, 2022).
  • PHP, with the help of Simple HTML DOM Parser, is able to collect data on hundreds of lecturers from several faculties in a university (Putri et al., 2021).

reading materials

assignment

  • Read the reading materials and perform the activities described there, if you have not done so yet.
  • Modify the given code to scrape the Google Scholar SERP for three search words.
  • Record the statistics, in the form of program output, for
    • each word result
    • each pair of two words in arbitrary order
    • all three words in arbitrary order
    • all three words in a phrase
  • Analyze the four results and draw a conclusion.
  • Write a report on this assignment in the form of a story on Medium.
  • Report the link on https://github.com/dudung/lecture-notes/issues/18 before Friday, 27 October 2023, 09:00 GMT+07.
  • Prepare a presentation for the class on Friday, 27 October 2023, 09:00-10:40 GMT+07.
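
The four cases listed above can be generated from three search words rather than typed by hand. The following is only a minimal sketch: the function names `build_queries` and `build_url`, the words "education" and "simulation", and the `hl=en` parameter are assumptions made for illustration, and the `/scholar?q=...` path is taken from the links in the hint. It builds the URLs but does not send any request.

```python
from itertools import combinations
from urllib.parse import urlencode

# Host assumed for illustration; the /scholar path and q/hl parameters
# appear in the links of the hint section.
BASE_URL = "https://scholar.google.com/scholar"


def build_queries(words):
    """Return the query strings for the four cases of the assignment."""
    queries = []
    # Case 1 and 2: each single word, then each unordered pair of words.
    for size in (1, 2):
        for combo in combinations(words, size):
            queries.append(" ".join(combo))
    # Case 3: all words, unquoted, so they may match in arbitrary order.
    queries.append(" ".join(words))
    # Case 4: all words inside double quotes, searched as a phrase.
    queries.append('"' + " ".join(words) + '"')
    return queries


def build_url(query, lang="en"):
    """Encode one query string into a search URL (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"q": query, "hl": lang})


if __name__ == "__main__":
    # "physics" comes from the hint; the other two words are placeholders.
    for query in build_queries(["physics", "education", "simulation"]):
        print(query, "->", build_url(query))
```

For three words this gives eight queries: three single words, three pairs, one unquoted triple, and one quoted phrase.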

hint

<ul id="gs_res_sb_yyl">
  <li class="gs_ind gs_bdy_sb_sel">
    <a href="/scholar?q=physics&amp;hl=de&amp;as_sdt=0,5">Beliebige Zeit</a>
  </li>
  <li class="gs_ind">
    <a href="/scholar?as_ylo=2023&amp;q=physics&amp;hl=de&amp;as_sdt=0,5">Seit 2023</a>
  </li>
  <li class="gs_ind">
    <a href="/scholar?as_ylo=2022&amp;q=physics&amp;hl=de&amp;as_sdt=0,5">Seit 2022</a>
  </li>
  <li class="gs_ind">
    <a href="/scholar?as_ylo=2019&amp;q=physics&amp;hl=de&amp;as_sdt=0,5">Seit 2019</a>
  </li>
  <li class="gs_ind">
    <a id="gs_res_sb_yyc" href="javascript:void(0)">Zeitraum wählen...</a>
  </li>
</ul>
<div class="gs_ab_mdw">
  Ungefähr 6.250.000 Ergebnisse (<b>0,07</b> Sek.)
</div>
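
In the snippet above, the number of results sits in the text of the `<div class="gs_ab_mdw">` element. One possible way to extract it is sketched below, using Python's standard `html.parser` module as a stand-in for Beautiful Soup so that nothing needs to be installed. The names `ResultStatsParser` and `extract_result_count` are hypothetical, and the sketch assumes that the first number inside the element is the result count, which holds for the snippet above but may not hold for every page.

```python
import re
from html.parser import HTMLParser


class ResultStatsParser(HTMLParser):
    """Collect the text found inside <div class="gs_ab_mdw"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a matching div
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching div, or track a div nested inside one.
        if self.depth or "gs_ab_mdw" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_result_count(html):
    """Return the first number in the stats text as an int, or None."""
    parser = ResultStatsParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Digits grouped by "." or "," (e.g. 6.250.000 or 6,250,000).
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    return int(re.sub(r"\D", "", match.group()))


if __name__ == "__main__":
    sample = (
        '<div class="gs_ab_mdw">\n'
        "  Ungefähr 6.250.000 Ergebnisse (<b>0,07</b> Sek.)\n"
        "</div>"
    )
    print(extract_result_count(sample))
```

Removing every non-digit character before converting makes the same function work for both the German grouping (6.250.000) and the English one (6,250,000).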