A Shortcut to Web Scraping in Python
Web scraping is an important part of data aggregation, and is one of the many reasons people flock to Python. It can be hard to get started with, but Pandas has a rarely talked about function that makes web scraping super simple.
The read_html function parses a web page and returns every HTML table it finds as a list of DataFrames. It is important to note that this function only works for tabular data, meaning content marked up as an HTML table. If you’re new to web scraping, this should not be a problem. If you are looking for more advanced scraping, skip to the end of this article.
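To make the "tabular data only" point concrete, here is a minimal sketch that runs read_html on two small HTML strings instead of a live page. The HTML snippets and the team names are made up for illustration, and the sketch assumes a parser such as lxml is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content: one paragraph and one table.
html_with_table = """
<html><body>
<p>Some text that read_html ignores.</p>
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Team A</td><td>3</td></tr>
  <tr><td>Team B</td><td>1</td></tr>
</table>
</body></html>
"""

# Hypothetical page content with no table at all.
html_without_table = "<html><body><p>No tables here.</p></body></html>"

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html_with_table))
first = tables[0]
print(first)

# With no <table> element on the page, read_html raises a ValueError.
try:
    pd.read_html(StringIO(html_without_table))
    found_tables = True
except ValueError:
    found_tables = False
print("Tables found on the second page:", found_tables)
```

Wrapping the string in StringIO keeps the example offline; with a real page you would pass the URL instead.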
Here is how to use the function:
First, select the URL you want to scrape data from. Wikipedia pages are a great place to find simple tables to scrape.
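For this example, we’ll use NBA Finals data from the Wikipedia page https://en.wikipedia.org/wiki/NBA_Finals.
1. Import the required packages. (Make sure to pip install these first if you do not already have them. read_html also needs an HTML parser such as lxml to be installed.)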
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
2. Create the table variable (you can name this whatever you like) and set it equal to the Pandas read_html function with your URL as the first argument, and some text that appears in your table, such as its heading, as the argument to the “match” parameter. The match parameter keeps only the tables that contain text matching this string (or regular expression), which narrows down the tables to choose from.
table_NBA = pd.read_html('https://en.wikipedia.org/wiki/NBA_Finals', match='Finals appearances')
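If you want to see what match does without hitting the network, here is a small sketch on a made-up HTML string with two tables. The table contents are placeholders, not real NBA data, and the variable names are only for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first mentions "Finals appearances".
html = """
<html><body>
<table>
  <caption>Finals appearances</caption>
  <tr><th>Team</th><th>Appearances</th></tr>
  <tr><td>Team A</td><td>5</td></tr>
</table>
<table>
  <caption>Other stats</caption>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Team A</td><td>100</td></tr>
</table>
</body></html>
"""

# Without match, every table on the page comes back.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are kept.
matched = pd.read_html(StringIO(html), match="Finals appearances")

print(len(all_tables), "tables in total")
print(len(matched), "table(s) after matching")
```

The match text can sit anywhere inside the table, so a caption, a header cell, or a regular cell all count.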
3. read_html returns a list of tables, so checking its length, for example with len(table_NBA), will output how many tables meet your match requirements.
4. Create a data frame by selecting one table from your table variable by its index. Since there were two possible tables to pick from, we could have used “table_NBA[0]” or “table_NBA[1]”.
df = table_NBA[0]
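Steps 3 and 4 can be sketched on a small offline example. The HTML below is invented so that two tables match, mirroring the situation described above; the team, year, and variable names are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables that both mention "Finals".
html = """
<html><body>
<table>
  <caption>Finals appearances by team</caption>
  <tr><th>Team</th><th>Appearances</th></tr>
  <tr><td>Team A</td><td>5</td></tr>
</table>
<table>
  <caption>Finals results by year</caption>
  <tr><th>Year</th><th>Winner</th></tr>
  <tr><td>2001</td><td>Team A</td></tr>
</table>
</body></html>
"""

tables = pd.read_html(StringIO(html), match="Finals")

# Step 3: the result is a plain Python list, so len() counts the matches.
n_tables = len(tables)
print(n_tables, "tables matched")

# Step 4: index into the list to get an actual DataFrame.
df = tables[0]
second = tables[1]
print(df)
```

The tables come back in the order they appear on the page, so it is worth printing each one once to confirm which index you want.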
5. Printing this data frame will give us our data from the website in a Pandas DataFrame.
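Scraped tables often need a little cleanup before they are pleasant to work with. The imports above include normalize from unicodedata, which is presumably meant for this kind of tidying. As a hedged sketch, assuming your column names contain non-breaking spaces (common in text copied from web pages), you could normalize them like this. The DataFrame here is a made-up stand-in for a scraped table.

```python
from unicodedata import normalize

import pandas as pd

# Hypothetical stand-in for a scraped table whose headers contain
# non-breaking spaces (\xa0) instead of regular spaces.
df = pd.DataFrame(
    {
        "Team\xa0name": ["Team A", "Team B"],
        "Finals\xa0appearances": [5, 2],
    }
)

# NFKC normalization turns the non-breaking space into a regular space.
df.columns = [normalize("NFKC", str(col)).strip() for col in df.columns]

# Quick checks on what was scraped: first rows and column types.
print(df.head())
print(df.dtypes)
```

After this, df["Team name"] works as you would expect, without the invisible character getting in the way.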
Again, this is very simple web scraping, but it covers a large percentage of use cases, especially as you are becoming more familiar with the technique.
If you are looking for more advanced web scraping, I will cover it in subsequent posts.