10.03.2018       Issue 220 (05.03.2018 - 11.03.2018)       Articles

How to Scrape Websites with Python

Uses urllib

Read >>




Experimental feature:

Below is the text of the article at the link, so you can quickly decide whether it is worth reading.

Please note that the text at the link and the text shown here may differ.

Web scraping is a technique for extracting data from web pages with a computer program. In this tutorial we will demonstrate, for educational purposes, how to collect news links and titles from a newspaper website. Along the way we will use Python's built-in urllib library to connect to the website and retrieve data, and the third-party BeautifulSoup library to parse the HTML so we can filter out our target links. We will also use the built-in logging library to write errors to a text file if anything goes wrong while the program runs. The main scraping library is designed in an object-oriented (OOP) style.

We have a video tutorial on this chapter for Bangla-speaking viewers:
@Web Scraping

If we visit http://www.aljazeera.com we can see the news they have published. Today the homepage looks like this, but you may see different news, as it changes daily.

1. aljazeera.png

The browser receives the HTML and CSS code and renders it, which is why we see a nice layout with news, images, and so on. This is how a website works inside the browser.
In Firefox we can view any webpage's source code via Tools->Web Developer->Page Source.
In Google Chrome we can get the same thing via View->Developer->View Source.

2. Page Source.png

Here we can see the HTML code, along with links to CSS and JavaScript files. What we will do is grab this code with a program and investigate it.

3. Souce HTML Code.png

Our goal is to parse the HTML, grab the news links and titles from the home page, and build a simple HTML page of our own listing those titles and links.

4. Links.png

Before writing the main script, we will write a small logging script, so that if any error occurs during execution it is saved to a text file on disk. At https://github.com/mahmudahsan/thinkdiff/tree/master/python-advanced we can see the directory structure. chapter4.py is the main file, which we will run in a terminal or by some other means, such as cron on a server. In the same directory as chapter4.py we will create two more directories: html and webscrap.

5. Directory-Structure.png

Now, in the webscrap directory, we will create a file named wlog.py with the following code:

import logging

def set_custom_log_info(filename):
    # Send log records to the given file; INFO and above are recorded
    logging.basicConfig(filename=filename, level=logging.INFO)

def report(e: Exception):
    # Log the exception message together with the current traceback
    logging.exception(str(e))

We import the built-in logging library and define two functions. When we call set_custom_log_info(filename), we pass a filename or file path that will be used to store logging information such as errors, warnings, or tracebacks. Inside the function we call logging.basicConfig(filename=filename, level=logging.INFO). This line tells the logging machinery where to save the output and the minimum log level. We could also pass logging.ERROR or logging.WARNING; for simplicity we use logging.INFO.

In report(e: Exception) we use logging.exception(str(e)) to save to the log file any exception passed in as the 'e' parameter.

Now we will create another file named wscrap.py for the scraping and parsing logic. Before writing more code, we first need to install the BeautifulSoup library. It is easy to install if you have pip on your machine. To install BeautifulSoup with pip on a Mac:

pip3 install beautifulsoup4

In Windows:

pip install beautifulsoup4

If the installation succeeds, we can use the library in our code.

Here is the full code of wscrap.py; let's discuss it piece by piece:

from urllib.request import urlopen 
from bs4 import BeautifulSoup
# Global variables
url_aj   = 'http://www.aljazeera.com'
filepath = 'html/aj.html'

First we import the urlopen function from the urllib.request module, then the BeautifulSoup class from the bs4 library. Next we define two variables: url_aj is assigned the Al Jazeera website URL, and filepath is assigned 'html/aj.html'. When chapter4.py runs and our code scrapes the homepage's HTML, it will be saved to the aj.html file in the html directory. That way, while experimenting, we don't have to request Al Jazeera's website every time; we can work with the data already saved in aj.html.
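As a quick aside, urlopen() returns a file-like object whose read() yields raw bytes, which is why aj.html is later written in binary mode. A minimal sketch, using a data: URL so it runs without a network connection:

```python
from urllib.request import urlopen

# data: URLs have been handled by urllib since Python 3.4,
# so no network access is needed for this demo
response = urlopen('data:text/html,<h1>Hello</h1>')
payload = response.read()
print(type(payload))  # <class 'bytes'>
print(payload)        # b'<h1>Hello</h1>'
```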

Now let's write a class named NewsScraper with a constructor method:

class NewsScraper:
    __url   = ''
    __data  = ''
    __wlog  = None
    __soup  = None 
    
    def __init__(self, url, wlog):
        self.__url  = url 
        self.__wlog = wlog

When we create a NewsScraper object in chapter4.py we will pass url and wlog arguments, which are assigned to the __url and __wlog variables. By Python convention a __VARIABLE is called private, although it is not truly private, since Python has no access specifiers. We prefix these variables with __ to signal that they should not be accessed directly, but only indirectly through the related methods.
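The double-underscore prefix also triggers Python's name mangling: outside the class, __url is stored under the name _NewsScraper__url. A small illustration (get_url() is a hypothetical accessor added for the demo, not part of the article's class):

```python
class NewsScraper:
    __url = ''

    def __init__(self, url):
        self.__url = url

    def get_url(self):
        # Indirect access, as the convention intends
        return self.__url

scraper = NewsScraper('http://example.com')
print(scraper.get_url())          # http://example.com
# The attribute is not reachable under its plain name...
print(hasattr(scraper, '__url'))  # False
# ...because Python mangles it to _ClassName__name
print(scraper._NewsScraper__url)  # http://example.com
```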

Now we will write some more methods.

    def retrieve_webpage(self):
        try:
            html = urlopen(self.__url)
        except Exception as e:
            print(e)
            self.__wlog.report(str(e))
        else:
            self.__data = html.read()
            if len(self.__data) > 0:
                print("Retrieved successfully")

The retrieve_webpage() method uses the urlopen() function to fetch self.__url and stores the response in the local variable html. If an error occurs, it is written to the error.log file in the html directory. If no error occurs, the else block calls html.read() to read the contents, assigns them to self.__data, and prints a message.

    def write_webpage_as_html(self, filepath=filepath, data=''):
        try:
            with open(filepath, 'wb') as fobj:
                if data:
                    fobj.write(data)
                else:
                    fobj.write(self.__data)
        except Exception as e:
            print(e)
            self.__wlog.report(str(e))
            
    def read_webpage_from_html(self, filepath=filepath):
        try:
            with open(filepath) as fobj:
                self.__data = fobj.read()
        except Exception as e:
            print (e)
            self.__wlog.report(str(e))

Next we write the write_webpage_as_html() method, which saves data to the file given by filepath. By default the global filepath variable and empty data are used as arguments. In one case, however, we will pass both data and filepath from chapter4.py, so the method writes that data to the path we provide.
We also write a read_webpage_from_html() method, which reads the data from the file given by filepath and assigns it to self.__data.

    def change_url(self, url):
        self.__url = url
            
    def print_data(self):
        print (self.__data)

We also define a change_url() method to change __url if we want, and a print_data() method to inspect __data.

    def convert_data_to_bs4(self):
        self.__soup = BeautifulSoup(self.__data, "html.parser")

Then we define the convert_data_to_bs4() method, which converts __data into a BeautifulSoup object and assigns it to self.__soup.

    def parse_soup_to_simple_html(self):
        news_list = self.__soup.find_all(['h1', 'h2']) # h1
        
        #print (news_list)
        
        htmltext = '''
<html>
    <head><title>Simple News Link Scrapper</title></head>
    <body>
        {NEWS_LINKS}
    </body>
</html>
'''
        
        news_links = '<ol>'
        
        for tag in news_list:
            if tag.parent.get('href'):
                # print (self.__url + tag.parent.get('href'), tag.string)
                link  = self.__url + tag.parent.get('href')
                title = tag.string
                news_links += "<li><a href='{}' target='_blank'>{}</a></li>\n".format(link, title)
                
        news_links += '</ol>'
        htmltext = htmltext.format(NEWS_LINKS=news_links)
        
        # print(htmltext)
        self.write_webpage_as_html(filepath="html/simplenews.html", data=htmltext.encode())

Finally, in the NewsScraper class, we write parse_soup_to_simple_html(). At the beginning we call __soup.find_all(['h1', 'h2']) to collect all h1 and h2 elements from the HTML. We could also include other heading tags such as h3, h4, h5, and h6.

The reason we look for heading tags is that, when we analyze the homepage, we see that all the news is published inside headings; that is why we filter them out.

6. headings.png
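To see how find_all() and tag.parent.get('href') work together, here is a toy example with hypothetical markup mimicking the homepage structure, where headings sit inside anchor tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading inside an anchor, one without
html = ('<a href="/news/example"><h1>Example headline</h1></a>'
        '<h2>Heading without a link</h2>')

soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(['h1', 'h2']):
    # parent.get('href') returns None when the parent is not an anchor
    print(tag.string, '->', tag.parent.get('href'))
# Example headline -> /news/example
# Heading without a link -> None
```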

Then we put an HTML template structure in the htmltext variable. Later we will replace the {NEWS_LINKS} placeholder in that text with the actual news links.
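The substitution relies on str.format() with a named placeholder; a tiny sketch of the same mechanism:

```python
# A miniature version of the htmltext template
template = '<body>\n{NEWS_LINKS}\n</body>'
result = template.format(NEWS_LINKS='<ol><li>A headline</li></ol>')
print(result)
```

Note that any literal braces in such a template would have to be doubled ({{ and }}), because format() treats single braces as placeholders.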

Then, in the for loop, we iterate over the news_list variable and check whether the heading tag's parent has an href attribute. If it does, we build a link by joining the base URL with the path taken from the parent. As the picture above shows, the parent of the h1 tag is an anchor tag.
Some headings have no anchor parent or href value, which is why we need the if condition. Finally, we update htmltext by replacing the {NEWS_LINKS} placeholder with the news_links string generated by our program.

At the end we convert htmltext to binary format and save it to the html/simplenews.html file.
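Since write_webpage_as_html() opens the file in 'wb' (binary) mode, write() needs bytes rather than str; str.encode() does the conversion (UTF-8 by default). A minimal sketch with a throwaway temp file standing in for html/simplenews.html:

```python
import os
import tempfile

htmltext = '<html><body><ol><li>News</li></ol></body></html>'
path = os.path.join(tempfile.gettempdir(), 'simplenews_demo.html')

with open(path, 'wb') as fobj:       # binary mode expects bytes
    fobj.write(htmltext.encode())    # str -> bytes, UTF-8 by default

with open(path, encoding='utf-8') as fobj:
    print(fobj.read() == htmltext)   # True: round-trip preserved the text
```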

Now, in chapter4.py, we write the following code:

from webscrap import wlog
from webscrap import wscrap
# Define log file location
wlog.set_custom_log_info('html/error.log')
'''
# Testing log file reporting
try:
    raise Exception
except Exception as e:
    wlog.report(str(e))
'''
news_scrap = wscrap.NewsScraper(wscrap.url_aj, wlog)
news_scrap.retrieve_webpage()
news_scrap.write_webpage_as_html()
news_scrap.read_webpage_from_html()
news_scrap.convert_data_to_bs4()
#news_scrap.print_beautiful_soup()
news_scrap.parse_soup_to_simple_html()

We simply import our own modules from the webscrap directory, then set the log file path to 'html/error.log'.

Then we create a news_scrap object from the NewsScraper class, call retrieve_webpage() to fetch the site, and call write_webpage_as_html() to save that data to a text file.

Then we call read_webpage_from_html() to load the data back. After the first successful retrieval we should comment out retrieve_webpage() and write_webpage_as_html(), so that during our experiments we don't hit the web server every time; the data is already saved in a text file on disk:

#news_scrap.retrieve_webpage()
#news_scrap.write_webpage_as_html()

Finally, convert_data_to_bs4() converts the data into a BeautifulSoup object, and calling parse_soup_to_simple_html() parses it and creates an HTML file, simplenews.html, which will look like this. Your data may look different, as the news changes daily.

4. Links.png

So these are some simple web-scraping techniques. Some sites use React.js or other JavaScript frameworks to assign URLs dynamically; those are not easy to scrape and require more code to parse. If you want to study web scraping further, explore the BeautifulSoup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and research other scraping techniques.

Github: @Web Scraping
Directories: webscrap | html
Main Script: chapter4.py
Originally Published: Thinkdiff.net



