Automated Google Scholar Crawling with a Web-Based Tool for Publication Data Management

  • Yudha Islami Sulistya, Telkom University
  • Ariq Cahya Wardhana, Telkom University
  • Maie Istighosah, Telkom University
  • Arif Riyandi, Telkom University
Keywords: Google Scholar, automated web crawling, data management, web-based tool

Abstract

The rapid growth of academic publications calls for efficient tools to extract and manage publication data, especially from widely used platforms such as Google Scholar. To address this need, an automated web-based tool was developed that simplifies the crawling, extraction, and management of publication data, allowing researchers to handle large volumes of academic publications more effectively. The tool supports a simple and a detailed crawling mode and accepts multiple Google Scholar URLs, writing the extracted data to CSV files. When several URLs are supplied, the output is a ZIP file containing a separate CSV file for each source, which keeps the publication data organized and accessible. The tool was tested with datasets of different sizes. For 41 entries, the simple mode completed extraction in 9.054 seconds, while the detailed mode took 71.898 seconds. For a smaller dataset of 5 entries, the simple mode ran in 3.283 seconds and the detailed mode in 11.908 seconds. These results indicate that the tool performs well on both dataset sizes tested, with the detailed mode taking about 7.9 times as long as the simple mode for 41 entries and about 3.6 times as long for 5 entries. This difference lets users trade speed against depth of extraction according to their research needs. The tool not only automates data extraction from Google Scholar but also improves the organization and accessibility of publication data, making it useful for researchers and institutions that manage publication data.
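
The output packaging described in the abstract (one CSV per source, bundled into a ZIP when several URLs are given) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the column names in FIELDS, the function names, and the file names are all assumptions, and the crawling step itself is left out, so the input is taken to be already-extracted records keyed by source URL.

```python
import csv
import io
import zipfile

# Hypothetical column names; the abstract does not list the extracted fields.
FIELDS = ["title", "authors", "year", "citations"]


def records_to_csv(records):
    """Serialize a list of publication dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def package_results(results_by_url):
    """Return (filename, payload_bytes) for already-crawled results.

    One source yields a single CSV; several sources yield a ZIP
    holding one CSV per source (file names are illustrative).
    """
    if not results_by_url:
        raise ValueError("no crawl results to package")
    csv_files = {
        f"source_{index}.csv": records_to_csv(records)
        for index, records in enumerate(results_by_url.values(), start=1)
    }
    if len(csv_files) == 1:
        name, text = next(iter(csv_files.items()))
        return name, text.encode("utf-8")
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in csv_files.items():
            zf.writestr(name, text)
    return "publications.zip", archive.getvalue()
```

Building the archive in memory suits a web tool that returns the result as a download; writing to disk instead would only require passing a file path to zipfile.ZipFile.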

Published
2024-10-06
How to Cite
Sulistya, Y., Wardhana, A. C., Istighosah, M., & Riyandi, A. (2024). Automated Google Scholar Crawling with a Web-Based Tool for Publication Data Management. Jurnal Teknologi Dan Sistem Informasi Bisnis, 6(4), 768-773. https://doi.org/10.47233/jteksis.v6i4.1604
Section
Articles