David L. answered 05/19/23
Expert, Easy-to-Understand Python Tutoring (No Pandas or Data Science)
Web scraping means going to a specific web page, or a set of connected web pages, and sifting through the HTML for the text that you want. It is common to write web scraping programs in Python, but other languages work too. I helped someone use Excel VBA to do web scraping, and it is my understanding that Java, PHP, C++, JavaScript, Go, Ruby, Perl, and Rust can all be used for web scraping.
To illustrate, I'll use an example. A big fan of the WNBA hired me to help him extract the full history of basketball games from the WNBA website, www.wnba.com/schedule?season=2023&month=all&hidepast=true. If you go to that web page, you can see a schedule of all the future games for the current season, and if you turn off "Hide Previous Games" at the top, you can also see the scores of the past games for the current season. You can examine the page's HTML to see how the visible information is built into it. Then you can write a program that goes through the entire page, finds the HTML elements that hold the information you want, extracts that information, and writes it to a file. A web scraping program can also follow links automatically, navigating through a series of web pages and extracting content from all of them.
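As a minimal sketch of that find, extract, and write sequence, here is a version that uses only Python's standard library. The HTML snippet, the "game" class name, the team names, and the column headings are all made up for illustration; the real WNBA page is laid out differently, so you would need to inspect its HTML and adjust the parser.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup, invented for this example. It is NOT the real WNBA page.
SAMPLE_HTML = """
<table>
  <tr class="game"><td>2023-05-19</td><td>Team A</td><td>Team B</td><td>80-71</td></tr>
  <tr class="game"><td>2023-05-20</td><td>Team C</td><td>Team D</td><td>94-85</td></tr>
</table>
"""


class GameRowParser(HTMLParser):
    """Collects the text of every <td> inside each <tr class="game">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a game row
        self._in_cell = False  # True while we are between <td> and </td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and ("class", "game") in attrs:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_games(html):
    """Return one list of cell strings per game row found in the HTML."""
    parser = GameRowParser()
    parser.feed(html)
    return parser.rows


def write_csv(rows, out):
    """Write a header line and the extracted rows to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["date", "home", "away", "score"])
    writer.writerows(rows)


games = extract_games(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a real file
write_csv(games, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, you could pass `open("games.csv", "w", newline="")` to `write_csv`.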
There are two Python libraries I've used for web scraping: Beautiful Soup and Selenium. Beautiful Soup is easier to use, but it is only an HTML parser. It works on static HTML, it does not run JavaScript, and by itself it does not fetch pages or deal with cookies (it is usually paired with an HTTP library such as requests for that). One workaround for JavaScript is to open the page in a browser, let the JavaScript run, save the resulting page to a file, and then read that file with Beautiful Soup. But logging into a website to access password-protected content is not something Beautiful Soup can do on its own.
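Here is a small sketch of what "static HTML only" means in practice, assuming the `beautifulsoup4` package is installed. The HTML string and its `id` values are invented for the example. The script in the page would fill in the score in a browser, but Beautiful Soup never runs it, so the `div` stays empty.

```python
from bs4 import BeautifulSoup

# Invented page: the <script> would insert the score if a browser ran it.
HTML = """
<html><body>
  <h1 id="title">Schedule</h1>
  <div id="score"></div>
  <script>document.getElementById("score").textContent = "80-71";</script>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser package is needed.
soup = BeautifulSoup(HTML, "html.parser")

title = soup.find(id="title").get_text(strip=True)
score = soup.find(id="score").get_text(strip=True)

print(title)        # Schedule
print(repr(score))  # '' because the JavaScript was never executed
```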
Selenium is the other Python library I've used for web scraping (there are also versions of Selenium for VBA, Java, and other languages). Selenium does not come with its own browsers. Instead, it drives a real browser, like Chrome, Firefox, or Edge, through a small driver program, so that your program can run that browser and scrape the HTML it receives and constructs. Because it is a real browser, it handles cookies and JavaScript, so it can deal with password-protected websites and with web pages where JavaScript dynamically builds at least some of the HTML.
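A rough sketch of the Selenium approach might look like the following. It assumes the `selenium` package is installed and that Chrome is available locally (recent Selenium 4 releases fetch the matching driver for you; older ones need it on your PATH). The URL and the `h1` selector are placeholders, and `scrape_text` and `main` are names I made up for this example.

```python
def scrape_text(driver, url, css_selector):
    """Load url in a Selenium driver and return the text of each matching element."""
    driver.get(url)
    # "css selector" is the string value of selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", css_selector)
    return [element.text for element in elements]


def main():
    # Imported here so scrape_text can be tried out without Selenium or a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is installed on this machine
    driver.implicitly_wait(10)   # wait up to 10 seconds for JavaScript to add elements
    try:
        # Placeholder URL and selector; replace them with the page you want to scrape.
        for line in scrape_text(driver, "https://example.com/", "h1"):
            print(line)
    finally:
        driver.quit()            # always close the browser, even after an error
```

Calling `main()` opens a Chrome window, loads the page, prints the text of each matching element, and closes the browser. Because the text is read from the live browser, it includes anything JavaScript added after the page loaded.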