Web Scraping Libraries | Python Resources

Everything you need for the web scraping workshop

Use pip to install all packages.

pip is a package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI). pip is included by default with Python 2.7.9 and later (in the Python 2 series) and with Python 3.4 and later (where it is also available as pip3).

For more info and installation:

Pip and virtualenv on Mac
Pip and virtualenv on Windows

Fetching URLs

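urllib module for Python

urllib is a Python module for fetching URLs. It is part of Python's standard library, so you do not have to install it. In Python 3 the fetching functions live in urllib.request; in Python 2.7 the equivalent module is urllib2. For Python 3.6 use: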

https://docs.python.org/3.6/howto/urllib2.html

For Python 2.7 use:

https://docs.python.org/2.7/howto/urllib2.html
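
As a quick illustration of fetching a URL with the standard library, here is a minimal sketch for Python 3. The helper name fetch_text, the 10-second timeout, and the example.com address in the comment are assumptions made for this example, not part of the workshop material.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Hypothetical usage (example.com is only a placeholder address):
# html = fetch_text("https://example.com")
# print(html[:200])
```

On Python 2.7 the same idea would use urllib2.urlopen instead, as described in the 2.7 how-to linked above.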

Requests library

Requests is an HTTP library for Python. Official documentation is here:

http://docs.python-requests.org/en/master/

Installation:

pip install requests
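
Once Requests is installed, a GET request could look roughly like the sketch below. The helper names build_get and fetch_html, the timeout value, and the example.com address with its query parameters are assumptions chosen for illustration. The first helper only prepares the request so the final URL can be inspected without touching the network; the second one actually sends it.

```python
import requests


def build_get(url, params=None):
    """Prepare a GET request without sending it, to inspect the final URL."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_html(url, params=None, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Placeholder address and parameters, used only to show URL building.
prepared = build_get("https://example.com/search", {"q": "python", "page": 2})
print(prepared.url)
```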

wget library

wget is a Python download utility. Official documentation is here:

https://pypi.python.org/pypi/wget

Installation:

pip install wget

Scraping

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Official documentation is here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all

Installation:

pip install beautifulsoup4
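
To give a feel for pulling data out of HTML, here is a small sketch that extracts links from a page. The inline HTML snippet and the helper name extract_links are made up for this example; in the workshop the HTML would normally come from one of the fetching libraries above. It uses Python's built-in "html.parser" so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A small inline HTML document, invented for this example.
HTML = """
<html><body>
  <h1>Workshop links</h1>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
  <a>No link here</a>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(HTML)
print(links)
```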

PDFminer3k

PDFminer3k is a PDF parser and analyzer. Official documentation is here:

https://pypi.python.org/pypi/pdfminer3k

Installation:

pip install pdfminer3k