Python Beautiful Soup Basics Tutorial
This tutorial covers the basics of the Python Beautiful Soup library including installation, parsing HTML/XML, finding elements and getting element data.
What is Beautiful Soup?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is commonly used for scraping websites and simply getting data out of a known HTML/XML structure.
Beautiful Soup has excellent documentation at www.crummy.com/software/BeautifulSoup/bs4/doc/, which covers every function the library offers along with many examples. In this tutorial, I will cover a subset of those functions, with examples that I feel give a good starting point to someone new. I will cover some common expectations of a library like Beautiful Soup, including:
- Searching for elements
- Getting element contents and attributes
- Getting the children and parent of an element
Getting Your Data To Parse
Before you start using Beautiful Soup, you’ll first need to get your data source ready. For most of my examples, I’ll be using some hard-coded example HTML. Here are a few ways you could source your data.
A string literal is simply a string with your HTML or XML in it; for example:
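A minimal sketch of the string-literal approach; the markup itself is a made-up placeholder, not taken from any real page:

```python
# Hypothetical HTML used only for illustration
html = """
<html>
  <body>
    <h1>Hello</h1>
    <p>Some example text</p>
  </body>
</html>
"""
```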
HTML or XML File
If your HTML or XML is in a file, you will need to read it into a variable so Beautiful Soup can use it; for example:
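As a rough sketch, reading a file could look like the following. The file name page.html is an assumption, and the file is created on the fly here only so the snippet runs on its own:

```python
import tempfile
from pathlib import Path

# Create a throwaway file so the sketch is runnable; in practice the
# file (hypothetically "page.html") would already exist on disk.
path = Path(tempfile.mkdtemp()) / "page.html"
path.write_text("<p>Hello from a file</p>", encoding="utf-8")

# Read the whole file into a string for Beautiful Soup to parse later
with open(path, encoding="utf-8") as f:
    html = f.read()
```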
The variable html now holds your data, just as in the string literal example.
If you want to get a webpage, you can use a library such as requests to fetch the page. For example, to get https://example.com/, you would do:
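One possible way to do this with requests is sketched below. The helper name fetch_html and the 10-second timeout are my own choices for illustration:

```python
import requests


def fetch_html(url: str) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html = fetch_html("https://example.com/")
```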
The variable html now holds your data, just as in the string literal example.
Installing Beautiful Soup
To install Beautiful Soup, simply go to the command line and execute:
python -m pip install beautifulsoup4
If you can’t import BeautifulSoup later on, make sure you installed Beautiful Soup in the same distribution of Python that you’re trying to import it in. Go to my tutorial on How to Manage Multiple Python Distributions if you’re having some issues or are unsure.
Using Beautiful Soup
Parsing Your HTML/XML
When you have your HTML or XML data, you now want Beautiful Soup to parse it into a BeautifulSoup object using the following:
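A minimal sketch of the parsing step, assuming Python's built-in html.parser and a placeholder html string:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"  # placeholder markup

# "html.parser" ships with Python; "lxml" is a common alternative
# if that package is installed.
soup = BeautifulSoup(html, "html.parser")
```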
The variable soup now contains a BeautifulSoup object that you can use to traverse the document, starting from the root element.
Note: In all of the following examples, the variable html contains the HTML defined above the usage of it.
The Four Main Kinds Of Objects
When using Beautiful Soup, you will mostly encounter four types of objects: BeautifulSoup, Tag, NavigableString and Comment.
Please note that other types of objects can be returned too; these are just the common ones.
The BeautifulSoup object represents the parsed document as a whole. It inherits from Tag, so most calls you can make on a Tag object can also be made on a BeautifulSoup object.
A Tag object corresponds to an XML or HTML tag in the original document.
The Tag object allows us to access attributes on a tag using dictionary-like methods and also search for other tags under this tag.
To get the name of the current tag, access tag.name :
This tutorial covers more of what we can get out of a Tag under Getting Data From An Element / Tag And Other Elements.
A NavigableString corresponds to a bit of text within a tag. When accessing the content of a Tag object, a NavigableString object will be returned.
To make this a string and drop the object altogether, cast the object to a string: str(tag.string).
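To make the object types above more concrete, here is a small sketch using a one-line placeholder document:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p id="intro">Hello</p>', "html.parser")

tag = soup.p        # first <p> in the document, a Tag
print(tag.name)     # name of the tag

text = tag.string   # NavigableString, still attached to the parse tree
plain = str(text)   # plain Python str, detached from the tree

print(type(soup), type(tag), type(text), type(plain))
```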
Ways to Search For Elements / Tags
Searching Using .find vs .find_all
On any BeautifulSoup or Tag object, we can search for elements under the current tag (a BeautifulSoup object will sit at the root of the tree the majority of the time). To search for other elements/tags, we can use .find and .find_all.
These two calls are very similar and take the same inputs, but .find returns the first tag found, whereas .find_all returns all tags found, if there are any.
For example, if we had the following BeautifulSoup object:
Using .find to get the first span, we would do:
This returned object is of type bs4.element.Tag so we could further search under this tag. If there was no matching element, we would get None , for example:
Using .find_all to get all the spans, we would do:
This has returned a list of bs4.element.Tag objects, so pulling out an individual object would allow us to perform more tag operations. If there was no matching element, we would get an empty list, for example:
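The behaviour of .find and .find_all described above can be sketched as follows; the two-span document is a placeholder of my own:

```python
from bs4 import BeautifulSoup

html = "<div><span>One</span><span>Two</span></div>"
soup = BeautifulSoup(html, "html.parser")

first_span = soup.find("span")        # first matching Tag
all_spans = soup.find_all("span")     # list-like collection of every match

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first_span, all_spans, missing_one, missing_all)
```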
All of the following examples will use one of .find or .find_all but they can both be used interchangeably to get the first or all of the target elements.
Search For Elements By Tag Name
As displayed in the examples above, using .find or .find_all and passing a tag name, we can search for elements with a specific tag. For example, if we had:
If we wanted to get the a tag, we would execute:
If we wanted to get the p tag, we would execute:
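A short sketch of searching by tag name, assuming a placeholder snippet that contains one a tag and one p tag:

```python
from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">Link</a><p>Paragraph</p></div>'
soup = BeautifulSoup(html, "html.parser")

a_tag = soup.find("a")  # the <a> element
p_tag = soup.find("p")  # the <p> element

print(a_tag)
print(p_tag)
```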
Search For Elements By Id
Passing the id argument to .find allows us to search for an element by id, for example:
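Here is a minimal sketch with made-up markup in which only one element carries the id we are after:

```python
from bs4 import BeautifulSoup

html = '<div><p id="target">Found me</p><span id="other">Not me</span></div>'
soup = BeautifulSoup(html, "html.parser")

# No tag name is given, so any element with a matching id qualifies
target = soup.find(id="target")
print(target)
```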
Notice how this has found the element with the id «target» regardless of its tag name.
Search For Elements By Class Name
Similar to searching by an id, we can also search for elements with a specific class by passing the class we want to search for, for example:
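A small sketch, using placeholder div elements where the class class_c appears once on its own and once alongside another class:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="class_a">A</div>'
    '<div class="class_c">C</div>'
    '<div class="class_b class_c">B and C</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) matches any element whose
# class attribute contains "class_c"
matches = soup.find_all(class_="class_c")
print(matches)
```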
In this example, we found both the element with just the class «class_c» and the element with «class_c» being within other classes. This shows that this search will find the class name anywhere in the class attribute.
Note that we had to use class_ as an argument to .find_all ; this is because class is a reserved keyword in Python.
Search For Elements By A Combination Of Attributes
Building on the searches above, we can search for elements by multiple attributes at once. To do this, pass the tag name as the first positional argument and the attribute names as keyword arguments. For example, if we had something like:
And wanted to identify the p element with the class «bold», we would do:
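As a sketch, with placeholder markup where both a div and a p share the class bold:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="bold">Div</div>'
    '<p class="bold">Bold paragraph</p>'
    '<p>Plain paragraph</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name first, then attribute filters as keyword arguments
bold_p = soup.find("p", class_="bold")
print(bold_p)
```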
We are not only bound to search for tags, id and classes though; as stated above, providing other keyword arguments allows us to search for other attributes.
For example, if we have:
We can get the iframe with the title «Nitratine» by doing:
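A possible version of this search is sketched here; the two iframe elements and their src values are invented for the example:

```python
from bs4 import BeautifulSoup

html = (
    '<iframe title="Other" src="https://example.com/a"></iframe>'
    '<iframe title="Nitratine" src="https://example.com/b"></iframe>'
)
soup = BeautifulSoup(html, "html.parser")

# Any attribute name can be passed as a keyword argument
frame = soup.find("iframe", title="Nitratine")
print(frame)
```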
Search For Elements By Text Content
Aside from searching for things on the element itself, we can search for an element using expected text content. For example, if we have:
And we want to get the element with the text «This is also a paragraph» to check what class it has, we can do:
In the example above, we said to search for a p tag with the text «This is also a paragraph». We needed to specify the tag name otherwise we would get back a NavigableString object as shown below.
However, to get around providing the tag it’s in, we can get the parent of the NavigableString object to get the p tag that it’s located in.
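The text search and the parent lookup described above could look like the sketch below. It uses the string argument, which current Beautiful Soup versions prefer over the older text argument; the markup is a placeholder:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = (
    '<p class="a">This is a paragraph</p>'
    '<p class="b">This is also a paragraph</p>'
)
soup = BeautifulSoup(html, "html.parser")

# With a tag name: returns the matching <p> Tag
p_tag = soup.find("p", string="This is also a paragraph")

# Without a tag name: returns the NavigableString itself
text_node = soup.find(string="This is also a paragraph")

# Walk up from the string to the tag that contains it
parent = text_node.parent

print(p_tag["class"], type(text_node), parent.name)
```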
Search For Elements Using a Query Selector
For anyone who has used CSS or JavaScript's document.querySelector / document.querySelectorAll, Beautiful Soup offers methods to search by CSS selector. Using .select() and .select_one(), we can pass a CSS selector to get elements/tags.
The difference between .select() and .select_one() mirrors that between .find_all() and .find(): .select() finds many like .find_all(), and .select_one() finds only one like .find().
Let's say we had:
And wanted to find the p tag with the class «red», we would do:
If we had used soup.select() , we would get a list with the single item:
Using this same idea, we can also get all the p tags:
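The selector calls above can be sketched like this, assuming a placeholder document with one red and one blue paragraph:

```python
from bs4 import BeautifulSoup

html = '<div><p class="red">Red</p><p class="blue">Blue</p></div>'
soup = BeautifulSoup(html, "html.parser")

red = soup.select_one("p.red")  # single Tag (or None when nothing matches)
red_list = soup.select("p.red")  # list containing the single match
all_p = soup.select("p")         # list of every <p>

print(red, red_list, all_p)
```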
Searching Using Lambdas
For searching that needs some more advanced logic, you can pass a lambda to the .find() / .find_all() functions to do a more powerful search. For example, if we had:
And we wanted to get all the p tags under the div with the id «main_content», we could do:
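One way to write such a lambda search is sketched below; the sidebar div is an extra placeholder added to show that its paragraph is skipped:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main_content"><p>First</p><p>Second</p></div>'
    '<div id="sidebar"><p>Other</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# The lambda is called once per tag and must return True for a match
paragraphs = soup.find_all(
    lambda tag: tag.name == "p"
    and tag.parent.attrs.get("id") == "main_content"
)
print(paragraphs)
```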
We can see that every tag in the parsed tree has been passed to the lambda function which then checks if the tag is a p tag and that the id attribute on its parent is «main_content».
We will look at what Tag.name , Tag.parent and Tag.attrs are soon.
Sometimes these lambda searches can be less performant than doing intermediate searches, so you could chain searches as demonstrated below to speed this operation up.
When using one of the find or select queries to get a Tag object, you can also then use this Tag object to search further. Chaining searches can lead to performance increases as you reduce the search space for each step. For example, if we had what we used above:
And we wanted to get all the p tags under the div with the id «main_content», we would do:
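A sketch of the chained version, using the same kind of placeholder markup as before:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main_content"><p>First</p><p>Second</p></div>'
    '<div id="sidebar"><p>Other</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow the search space to the one div we care about
main = soup.find(id="main_content")

# Step 2: search only inside that div
paragraphs = main.find_all("p")
print(paragraphs)
```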
Getting Data From An Element / Tag And Other Elements
Once you have a Tag object, getting data off it is pretty easy.
Getting The Tag Name Of The Current Tag
To get the name of the current tag, we can call tag.name :
Getting The Text Inside The Current Tag
To get the text inside the current tag, we can call tag.text or tag.string :
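The two attributes are not always interchangeable. In this small sketch with placeholder markup, tag.text joins all nested text, while tag.string only returns a value when the tag has a single child:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

full_text = p.text     # all text inside <p>, including nested tags
direct = p.string      # None here, because <p> has more than one child
inner = p.b.string     # <b> has a single text child, so this works

print(repr(full_text), direct, repr(inner))
```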
Getting The Attributes Of The Current Tag
To get the attributes inside the current tag, we can access them using tag.attrs . This will return something that looks and functions like a dictionary.
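A quick sketch with an invented div that has an id, a title and two classes:

```python
from bs4 import BeautifulSoup

html = '<div id="box" title="A box" class="big red">Content</div>'
soup = BeautifulSoup(html, "html.parser")

attrs = soup.div.attrs  # dictionary-like mapping of attribute names to values
print(attrs)
```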
Notice how id and title have a string value whereas class has a list of string as its value; this is demonstrating Beautiful Soup handling attributes with multiple values.
If you want to get a particular attribute from an element, we can use .get() as it may not always be there:
In the case the attribute does not exist, the second parameter passed to .get() is returned:
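The .get() behaviour can be sketched as follows; data-missing is a made-up attribute name that is deliberately absent from the tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box">Content</div>', "html.parser")
div = soup.div

present = div.get("id")                        # attribute exists
absent = div.get("data-missing")               # missing, default is None
fallback = div.get("data-missing", "fallback")  # missing, custom default

print(present, absent, fallback)
```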
Getting The Parent Of The Current Tag
To get a tag’s parent (the tag it’s located in), we can access tag.parent:
None will be returned if the element has no parent.
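A minimal sketch of both cases, using a placeholder div with a single paragraph inside it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Child</p></div>", "html.parser")

parent = soup.p.parent  # the <div> that contains the <p>
top = soup.parent       # the BeautifulSoup object itself has no parent

print(parent.name, top)
```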
Getting The Children Of The Current Tag
To get all the elements under a given element, we can call tag.children :
This has returned an iterator which finds the children on-demand to potentially reduce memory and CPU consumption. We can see that this has also returned elements that look like ‘\n’ ; looking at these more closely, we can see they are NavigableString objects:
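The sketch below, built on a placeholder list, shows the newline strings mixed in with the tags and one way to filter them out:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<ul>\n<li>One</li>\n<li>Two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# .children is an iterator; wrap it in list() to inspect everything at once
children = list(soup.ul.children)

# Keep only real tags, dropping the whitespace-only NavigableStrings
tags_only = [child for child in children if isinstance(child, Tag)]

print(children)
print(tags_only)
```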
Example 1 — Scraping Data From A Table
- Data source: Custom
- Target: Read the HTML table into a Python array
Getting the table data:
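Since the data source is custom, here is one possible sketch with an invented three-row table; the names and ages are placeholders:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Header cells are <th>, data cells are <td>; accept both
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```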
Example 2 — Read A Single Value On The Page
- Data source: testing-ground.scraping.pro/whoami (No longer exists)
- Target: Read IP address
HTML preview (cut down version):
Getting the IP address:
Owner of PyTutorials and creator of auto-py-to-exe. I enjoy making quick tutorials for people new to particular topics in Python and tools that help fix small things.
Pip Install Beautifulsoup: How to Install Beautifulsoup (Windows & Linux)

Beautifulsoup is a Python package for scraping web content from a given URL. It parses HTML and XML documents and creates a parse tree that is used to extract data from them. But how do you install it on your system? In this tutorial you will learn how to install Beautifulsoup using the pip command.
Steps to Install Beautifulsoup using PIP
This section covers all the steps required to install Beautifulsoup on your system, for both the Windows and Linux operating systems.
Install Beautifulsoup on Windows
Step 1: Open your command prompt
Step 2: Check your Python version, for example with python --version.

Step 3: Install the beautifulsoup using pip
After checking your Python version, you can install Beautifulsoup with the pip command that matches it.
For Python 3.xx: pip3 install beautifulsoup4
For Python 2.xx: pip install beautifulsoup4
On my system the Python version is 3.xx, so I will use the pip3 command.

Install Beautifulsoup on Linux
Now let’s install Beautifulsoup on Linux. You have to follow the below steps to install it.
Step 1: Update Linux

To update the system, use your distribution's package update command (for example, sudo apt update on Debian-based systems).
Step 2: Check the python version

Check the Python version, for example with python3 --version. If Python is installed, you will get the version number. If you get a "command not found" error instead, you have to install it.
Python itself is installed with the system package manager rather than with pip (for example, sudo apt install python3 python3-pip on Debian-based systems).
This will install a Python 3.xx version.
Step 3: Install the Beautifulsoup
After installing Python, install Beautifulsoup using the pip command, for example pip3 install beautifulsoup4.
This will install Beautifulsoup on the Linux OS.
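On either operating system, a quick way to confirm that the installation worked is to import the package from Python. This is only a sanity check of my own and not part of the installation steps:

```python
import bs4

# If this import succeeds, the package is installed for this interpreter
version = bs4.__version__
print("Beautiful Soup version:", version)
```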
Conclusion
If you want to make your own scraper, the Beautifulsoup Python package is very useful. These are the steps to install Beautifulsoup using the pip command on Linux and Windows. Hope this tutorial has solved your queries. If you want more help, you can contact us anytime. We are always ready to help you. Apart from installation, there are many operations you should explore with the Beautifulsoup package, such as extracting data using the find_all() function, parsing HTML and using select.
Beautiful Soup Tutorial — How to Parse Web Data With Python

Although web scraping in its totality is a complex and nuanced field of knowledge, building your own basic web scraper isn’t all that difficult. And that’s mostly due to coding languages such as Python. This language makes the process much more straightforward thanks to its relative ease of use and the many useful libraries that it offers. In this tutorial, we’ll be focusing on one of these wildly popular libraries named Beautiful Soup, a Python package used for parsing HTML and XML documents.
If you want to build your first web scraper, we recommend checking our video tutorial below or our article that details everything you need to know to get started with Python web scraping. Yet, in this tutorial, we’ll focus specifically on parsing a sample HTML file in Python and using Selenium to render dynamic pages.
This tutorial is useful for those seeking to quickly grasp the value that Python and Beautiful Soup 4 offer. After following the provided examples, you should be able to understand the basic principles of how to parse HTML data. The examples will demonstrate traversing a document for HTML tags, printing the full content of the tags, finding elements by ID, extracting text from specified tags, and exporting it to a CSV file.
Before getting to the matter at hand, let’s first take a look at some of the fundamentals of this topic.
What is data parsing?
Data parsing is a process during which a piece of data gets converted into a different type of data according to specified criteria. It’s an important part of web scraping since it helps transform raw HTML data into a more easily readable format that can be understood and analyzed.
What does a parser do?
A well-built parser will identify the needed HTML string and the relevant information within it. Based on predefined criteria and the rules of the parser, it’ll filter and combine the needed information into CSV, JSON, or any other format.
Our previous article on what is parsing sums up this topic nicely.
What is Beautiful Soup?
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed web pages based on specific criteria that can be used to extract, navigate, search, and modify data from HTML, which is mostly used for web scraping. Beautiful Soup 4 is supported on Python versions 3.6 and greater. Being a useful library, it can save programmers loads of time when collecting data and parsing it.
1. Install the Beautiful Soup library
Before following this tutorial, you should have a Python programming environment set up on your machine. For this tutorial, we’ll assume that PyCharm is used since it’s a convenient choice even for the less experienced with Python and is a great starting point. Otherwise, simply use your go-to IDE.
On Windows, when installing Python, make sure to tick the PATH installation checkbox. PATH installation adds executables to the default OS Command Prompt executable search. The OS will then recognize commands like pip or python without having to point to the directory of the executable, which makes things more convenient.
The next step is to install the Beautiful Soup 4 library on your system. No matter the OS, you can easily do it by running this command in the terminal to install the latest version of Beautiful Soup: pip install beautifulsoup4
If you’re using Windows, it’s recommended to run the terminal as administrator to ensure that everything works out smoothly.
Finally, since this article explores working with a sample file written in HTML, you should be at least somewhat familiar with the HTML structure.
2. Inspect your target HTML
A sample HTML document will help demonstrate the main methods of how Beautiful Soup parses data. This file is much simpler than your average modern website; however, it’ll be sufficient for the scope of this tutorial.
For PyCharm to use this file, simply copy it to any text editor and save it with the .html extension to the directory of your PyCharm project. Alternatively, you can create an HTML file in PyCharm by right-clicking on the project area, then navigating to New > HTML File and pasting the HTML code from above.
Going further, you can create a new Python file by navigating to New > Python File. Congratulations, and welcome to your new playground!
3. Find the HTML tags
First, you can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this step, you can use the soup.descendants generator:
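A sketch of this step is given below. Instead of opening a file, it uses an inline contents string as a stand-in for the sample HTML file; the exact markup is my assumption, loosely modelled on the proxy-types list this tutorial refers to:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in for the sample HTML file (assumed markup)
contents = """<html>
<head><title>What is a Proxy?</title></head>
<body>
<h2>Proxy types</h2>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
</ul>
</body>
</html>"""

soup = BeautifulSoup(contents, "html.parser")

# .descendants yields every node in document order; keep only real tags
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
for name in tag_names:
    print(name)
```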
Click the Run button, and you should get the below output:
Beautiful Soup traversed our HTML file and printed all the HTML tags that it found sequentially. Let’s take a quick look at what each line did:
This tells Python to import the Beautiful Soup library.
This code snippet above, as you could probably guess, gives an instruction to open our sample HTML file, read its contents, and store them in the contents variable.
This line creates a Beautiful Soup object by parsing the contents with Python’s built-in HTML parser. Other parsers, such as lxml, might also be used, but lxml is a separate external library, and for the purpose of this tutorial, the built-in parser will do just fine.
The final piece of code, namely the soup.descendants generator, instructs Beautiful Soup to look for HTML tag names and print them in the PyCharm console. The results can also easily be exported to a CSV file, but we’ll get to this later.
4. Extract the full content from HTML tags
To extract the content of HTML tags, this is what you can do:
It’s a simple parsing instruction that outputs the HTML tag with its full content in the specified order. Here’s what the output should look like:
Additionally, you can remove the HTML tags and print the text only by adding .text :
Which gives the following output:
Note that this only prints the first instance of the specified tag. Let’s continue to see how to find an HTML element by ID and use the find_all method to filter all elements by specific criteria.
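Both forms can be sketched as follows, using a short stand-in for the sample file (assumed markup):

```python
from bs4 import BeautifulSoup

contents = (
    "<h2>Proxy types</h2>"
    '<ul id="proxytypes">'
    "<li>Residential proxies</li><li>Datacenter proxies</li>"
    "</ul>"
)
soup = BeautifulSoup(contents, "html.parser")

heading = soup.h2            # first <h2>, tag included
heading_text = soup.h2.text  # text only
first_item = soup.li.text    # only the first <li> is returned

print(heading)
print(heading_text)
print(first_item)
```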
5. Find elements by ID
You can use two similar ways to find elements by ID:
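The two forms might look like this, assuming the list in the sample file has the id proxytypes as in the stand-in markup below:

```python
from bs4 import BeautifulSoup

contents = '<ul id="proxytypes"><li>Residential proxies</li></ul>'
soup = BeautifulSoup(contents, "html.parser")

# Form 1: attributes passed as a dictionary
by_attrs = soup.find("ul", attrs={"id": "proxytypes"})

# Form 2: attribute passed as a keyword argument
by_kwarg = soup.find("ul", id="proxytypes")

print(by_attrs)
print(by_kwarg)
```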
Both of these will output the same result in the Python Console:
6. Find all instances of a tag and extract text
The find_all method is a great way to extract all the data stored in specific elements from an HTML file. It accepts many criteria that make it a flexible tool allowing users to filter data in convenient ways. Let’s find all the items within the <li> tags and print them as text only:
This is what the full code should look like:
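A self-contained sketch of that step, with an inline stand-in for the sample file (assumed markup):

```python
from bs4 import BeautifulSoup

contents = (
    '<ul id="proxytypes">'
    "<li>Residential proxies</li>"
    "<li>Datacenter proxies</li>"
    "<li>Shared proxies</li>"
    "</ul>"
)
soup = BeautifulSoup(contents, "html.parser")

# find_all returns every <li>; .text strips the surrounding tags
items = [tag.text for tag in soup.find_all("li")]
for item in items:
    print(item)
```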
And here’s the output:
7. Parse elements by CSS selectors
Beautiful Soup has excellent support for CSS selectors as it provides several methods to interact with HTML content using selectors. Under the hood, Beautiful Soup uses the soupsieve package. When you install Beautiful Soup with Python’s package-management system pip, it’ll automatically install the soupsieve dependency for you. Be sure to check out their documentation to learn more about the supported CSS selectors.
Beautiful Soup primarily provides two methods to interact with HTML web page content using CSS selectors: select and select_one . Let’s try out both of them.
Using the select method
You can grab the title from our HTML sample file using the select method. Your code should look like the below:
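For instance, with a minimal stand-in document (assumed markup), a descendant selector that walks from html through head to title could be used like this:

```python
from bs4 import BeautifulSoup

contents = (
    "<html><head><title>What is a Proxy?</title></head>"
    "<body><h2>Proxy types</h2></body></html>"
)
soup = BeautifulSoup(contents, "html.parser")

# select always returns a list, even for a single match
titles = soup.select("html head title")
print(titles)
```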
Simple, isn’t it? Notice how the CSS selector navigates the HTML by going through the hierarchy of the HTML elements sequentially.
Using the select_one method
This method is useful when you need to grab only one element using a CSS selector that matches multiple elements. For instance, our HTML sample has several <li> elements. If you want to grab only the first one, you can use the following CSS selector:
This will pick the first <li> element of the <ul> tag, which has several other <li> elements.
To extract a specific <li> element, you can add :nth-of-type(n) to your CSS selector. For instance, you can extract the third <li> element, which in our HTML file is <li>Shared proxies</li> , using the following line:
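Both selectors can be sketched as below; the list is a stand-in for the sample file, with Shared proxies placed third as described above:

```python
from bs4 import BeautifulSoup

contents = (
    "<html><body><ul>"
    "<li>Residential proxies</li>"
    "<li>Datacenter proxies</li>"
    "<li>Shared proxies</li>"
    "</ul></body></html>"
)
soup = BeautifulSoup(contents, "html.parser")

first = soup.select_one("body ul li")                 # first matching <li>
third = soup.select_one("body ul li:nth-of-type(3)")  # third <li> in the list

print(first)
print(third)
```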
8. How to parse dynamic elements
Most websites these days tend to load content dynamically, meaning data can be left out if JavaScript isn’t triggered to load the content. The requests and Beautiful Soup libraries aren’t equipped to handle JavaScript-rendered web pages. Consequently, using these libraries to download the HTML document of a website would exclude any dynamically-loaded content.
You’ll have to use other libraries that can render the website by executing JavaScript to parse dynamic elements. Python’s Selenium package offers powerful capabilities to interact with and manipulate DOM elements. In a nutshell, its WebDriver drives popular web browsers and renders JavaScript-based dynamic websites. By combining Beautiful Soup with Selenium WebDriver, you can parse dynamic content from most websites.
Additionally, there are other ways you can scrape dynamic websites that we have explored in our Playwright and Scrapy Splash tutorials.
Step 1: Install Selenium
First, install Selenium with the command below: pip install selenium
As of Selenium 4.6, the browser driver is downloaded automatically. Yet, if you’re using an older version of Selenium or the driver wasn’t found, you’ll have to manually download the WebDriver. Visit this page to find the driver download links for the supported web browsers.
Step 2: Import the necessary libraries
Now that you’ve installed all the required dependencies, you can jump right into writing the code. Let’s begin by importing the newly installed library and Beautiful Soup:
Step 3: Launch the browser
Next, you’ll have to initiate a browser instance using the below code:
The above code uses the Chrome() driver to launch an instance of a Chrome browser.
Step 4: Fetch content from a dynamic website
Now, you can use this driver object to fetch dynamic content. So let’s extract the HTML of this JavaScript-rendered dummy website http://quotes.toscrape.com/js/:
As soon as you execute the above code, you’ll notice the Chrome browser instance automatically navigating to the desired website and rendering the JavaScript-based content. The new object named js_content contains the HTML content of the website.
Step 5: Parse the HTML content using Beautiful Soup
Now that you’ve got the HTML content in a string format, you can simply use the BeautifulSoup() constructor to create the Beautiful Soup object with parsed data:
You can now navigate the soup object with Beautiful Soup and parse any HTML element using the methods outlined previously. For example, let’s extract the first quote found on our target website. Every quote is within the <span> tag with an attribute set to class="text" , so the code line to extract the content from the quote can look like this:
Note the additional underscore _ within class_="text" – you must use it. Otherwise, Python will interpret it as a reserved class keyword.
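The parsing step can be tried without a browser. In this sketch, js_content is a hand-written stand-in for the rendered page HTML that the driver would return; the quotes and authors are placeholders:

```python
from bs4 import BeautifulSoup

# Stand-in for the rendered HTML obtained from the browser
js_content = (
    '<div class="quote"><span class="text">First quote</span>'
    '<small class="author">Author A</small></div>'
    '<div class="quote"><span class="text">Second quote</span>'
    '<small class="author">Author B</small></div>'
)
soup = BeautifulSoup(js_content, "html.parser")

# class_ avoids a clash with Python's reserved keyword "class"
quote = soup.find("span", class_="text")
print(quote.text)
```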
When parsing dynamic websites, keep in mind that some websites have strong anti-bot measures that can easily detect Selenium-based web scrapers. Mostly, this is achieved by identifying the Selenium web driver's common request patterns and using various other fingerprinting techniques. Thus, it’s extremely difficult to avoid such anti-bot measures. In case your IP address gets blocked, you might want to consider using proxies and implementing other anti-detection methods.
By now, you should have a basic understanding of how Beautiful Soup can be used to parse and extract data. It should be noted that the information presented in this article is useful as introductory material, yet real-world web scraping and parsing with BeautifulSoup is usually much more complicated than this. For a more in-depth look at Beautiful Soup, you’ll hardly find a better source than its official documentation, so be sure to check it out too.
9. Export data to a CSV file
A very common real-world application would be exporting data to a CSV file for later analysis. Although this is outside the scope of this tutorial, let’s take a quick look at how this might be achieved.
First, you would need to install an additional Python library called pandas, which helps Python create structured data. This can be easily done by entering the following line in your terminal: pip install pandas
You should also add this line to the beginning of your code to import the library: import pandas as pd
Going further, let’s add some lines that’ll export the list we extracted earlier to a CSV file. This is what your full code should look like:
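A possible version of the export is sketched here. The column name Names and the stand-in markup are assumptions, and the file is written to a temporary directory rather than the project directory so the sketch leaves nothing behind:

```python
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

contents = (
    '<ul id="proxytypes">'
    "<li>Residential proxies</li><li>Datacenter proxies</li>"
    "</ul>"
)
soup = BeautifulSoup(contents, "html.parser")

# Collect the text of every <li> into a plain list
results = [li.text for li in soup.find_all("li")]

# One-column table; index=False keeps row numbers out of the file
df = pd.DataFrame({"Names": results})
out_path = os.path.join(tempfile.mkdtemp(), "names.csv")
df.to_csv(out_path, index=False, encoding="utf-8")
```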
What happens here exactly? Let’s take a look:
This line finds all instances of the <li> tag and stores it in the results object.
And here, we see the pandas library at work, storing our results into a table (DataFrame) and exporting it to a CSV file.
If all goes well, a new file titled names.csv should appear in the running directory of your Python project, and inside, you should see a table with the proxy types list. That’s it! Now you not only know how data extraction from an HTML document works, but you can also programmatically export the data to a new file.
Conclusion
As you can see, Beautiful Soup is a very useful HTML parser. With its gentle learning curve, you can quickly grasp how to navigate, search, and modify the parse tree. With the addition of libraries such as pandas, you can further manipulate and analyze the data, which offers a powerful package for a wide range of data collection and analysis use cases.
And if you’d like to expand your knowledge on Python web scraping in general and get familiar with other Python libraries, we recommend heading over to What is Python used for? and Python Requests blog posts.
Frequently asked questions
Is Beautiful Soup easy to learn?
Yes, Beautiful Soup is relatively easy to learn. It offers a straightforward way to extract data by navigating and searching through the HTML structure. In addition, the Beautiful Soup documentation offers in-depth explanations with examples, so you can be sure to find most of the answers to your questions.
While the Beautiful Soup library is pretty simple to use, it still requires you to have, at the very least, basic Python coding knowledge and an understanding of HTML structure.
Is Beautiful Soup better than Scrapy?
The answer really depends on what you’re trying to achieve. Beautiful Soup is a lightweight Python library that focuses on data parsing, while Scrapy is a full-fledged web scraping infrastructure that allows users to make HTTP requests, scrape data, and parse it.
In essence, Beautiful Soup is better when working with small-scale web scraping projects that don’t require complex web scraping techniques. On the other hand, Scrapy is considerably better suited to medium and large-scale operations. It offers many more features, such as web crawling, the ability to follow links, concurrency and asynchronous web scraping, cookie management, and more. Using Scrapy for larger projects typically gives better overall performance and speed.
Take a look at our blog post on Web Scraping with Scrapy to learn more and see the tool in action.
Is Beautiful Soup good for web scraping?
Yes, Beautiful Soup is highly regarded for most web scraping projects. Its ease of use through intuitive functions makes it one of the most popular Python parsing libraries. It offers all the fundamentals required to parse HTML and XML files and allows users to search for elements based on HTML tags, attributes, text, and more.
While it lacks some functionality for more complex web scraping tasks, it’s certainly one of the better web scraping libraries for beginner and advanced programmers.
Web Scraping From Scratch in Python: The Beautiful Soup Library

Data is everywhere, on every website you visit. Most of the time it is already presented in a readable text format suitable for use in a new project. Although you can always copy and paste the text you need straight from a page, scraping comes to the rescue when you are dealing with big data, such as text from tens of thousands of websites.
Learning web scraping is hard at first, but if you start working with big data using the right tools, the road ahead becomes much easier.
In this step-by-step guide, you will learn how to scrape multiple pages of a website using the simplest and most popular Python scraping library: Beautiful Soup.
The guide has two sections. The first covers how to scrape a single page; the second covers how to scrape several pages at once, building on the code from the first section.
Topics covered in this guide:
- What do you need to start web scraping?
- Installing and setting up the Beautiful Soup Python library.
- Section 1: scraping a single page:
  - Importing the libraries.
  - Getting the HTML content of a website.
  - Analyzing the website and its HTML markup.
  - Finding several HTML elements at once with Beautiful Soup.
  - Exporting data to a txt file.
- Section 2: scraping multiple pages:
  - Getting the href attribute.
  - Finding multiple elements with Beautiful Soup.
  - Following each of the required links.
Что нужно для начала веб-скрейпинга?
- Beautiful Soup: это пакет Python для анализа веб-сайтов, построенных на технологиях HTML и CSS, то есть, без использования JavaScript-фреймворков вроде React, Angular, VueJS. Beautiful Soup хорошо справляется с разбором HTML и XML документов на части: библиотека создаёт дерево синтаксического анализа веб-страниц для последующего извлечения с них разнообразных данных в удобном для программиста формате. Не волнуйтесь, вам не нужны предыдущие знания о Beautiful Soup, чтобы выполнять указания из руководства — во время чтения вы всему научитесь с нуля!
- Библиотека requests: это стандарт индустрии для выполнения HTTP-запросов на языке программирования Python. Данная библиотека используется в дополнение к Beautiful Soup, когда нужно получить HTML-файл с веб-сайта.
- Python: чтобы следовать руководству, вам не нужно быть экспертом в Python, однако, по крайней мере, вы должны знать, как работают циклы for и списки.
Перед началом руководства, убедитесь, что на вашем компьютере установлен Python 3 версии.
Давайте начнем ознакомление с пособием для новичков по настройке Beautiful Soup в Python!
Установка, запуск и настройка Python-библиотеки Beautiful Soup
- Начнём с установки Beautiful Soup на ваш компьютер. Для этого выполните следующую команду в командной строке или терминале:
- Теперь установите парсер: он понадобится для извлечения данных из HTML-документов. В руководстве предлагается использовать библиотеку-парсер lxml , следовательно, выполните установочную команду:
- Пришло время устанавливать библиотеку requests . Для этого выполните следующую команду в командной строке или терминале:
With the environment set up, you can start writing code.
Section 1: Scraping a single page
This guide walks you through every line of code needed to build your first web scraper. The complete code is at the very end of the article. Let's get started!
Importing the libraries
Scraping requires the BeautifulSoup class and the requests library, so import them with two lines of code:
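The two import lines would typically look like this. Note that the package is installed as beautifulsoup4 but imported as bs4.

```python
# BeautifulSoup parses HTML; requests downloads it.
from bs4 import BeautifulSoup
import requests
```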
Getting the HTML content of a website
For educational purposes, we will scrape a website that holds hundreds of pages of movie transcripts. We start with a single page and then move on to scraping many pages.
First, choose the link to scrape. The example uses the transcript page for the movie "Titanic", but you can pick any movie. Next, send an HTTP request to the site to get a response that contains the page for that movie. Store the response in a variable named result, then use its .text attribute to get the page content:
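A minimal sketch of this request step follows. The exact URL is an assumption (the text only says the page is the Titanic transcript on the site), and the fetch_html helper, the timeout, and the status check are additions for illustration rather than part of the original steps.

```python
import requests

# Assumed URL: the Titanic transcript page on subslikescript.com.
# Replace it with the page of whichever movie you picked.
WEBSITE = "https://subslikescript.com/movie/Titanic-120338"


def fetch_html(url, timeout=10):
    """Send a GET request and return the page body as text."""
    result = requests.get(url, timeout=timeout)
    result.raise_for_status()  # fail loudly on 4xx/5xx responses
    return result.text  # .text is an attribute, not a method


# Usage (needs network access):
# content = fetch_html(WEBSITE)
```

Wrapping the request in a function keeps the network call in one place, which makes it easy to reuse when several pages are scraped later.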
Finally, pass the page content to the lxml parser to get the "soup": an object that holds the whole document as a nested structure, which we will work with in the following steps:
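The parsing step can be sketched as below. The content string is a hypothetical stand-in for the downloaded page, and the fallback to Python's built-in html.parser is an addition so the snippet still runs where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page text (hypothetical content).
content = "<html><body><h1>Titanic</h1><p>Sample line.</p></body></html>"

# The tutorial uses lxml; fall back to the built-in parser if it is missing.
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

# The second argument names the parser Beautiful Soup should use.
soup = BeautifulSoup(content, parser)
print(type(soup).__name__)
```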
Now that we have the soup object, the .prettify() method returns its HTML in a readable, indented form. You could search this output by hand in a text editor, but it is far more efficient to jump straight to the HTML of the element you need, which is what the next step does.
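As a small illustration of .prettify(), the sketch below formats a one-line document. The HTML is a hypothetical sample, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Titanic</h1></body></html>"  # hypothetical sample
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```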
Analyzing the website and its HTML markup
Before writing any code, it is important to analyze the website you are scraping and the HTML it returns, so that you can pick the best approach. The screenshot below shows the movie's transcript page; the elements whose text we want are the movie title and the transcript:
Movie Title is the name of the movie; Transcript is the movie's dialogue.
Now we need to work out how to get the HTML of just these two elements. Follow these steps:
- Go to the transcript page of the movie you chose.
- Hover over the movie title or the transcript and right-click. In the context menu, choose "Inspect" to open the browser's developer tools with the page's HTML scrolled to that element.
Below is a reduced version of the HTML shown after clicking "Inspect". The following steps use this markup as a reference for locating elements.
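For experimenting offline, a stand-in for that reduced markup can be kept in a Python string. The sketch below is hypothetical: only the tag names and classes (article.main-article, h1, div.full-script) come from the steps that follow, while the text content and the plain p tag for the description are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: tag names and classes follow the
# tutorial text; all text content is placeholder material.
SAMPLE_HTML = """
<article class="main-article">
  <h1>Titanic (1997) - full transcript</h1>
  <p>Placeholder description of the movie.</p>
  <div class="full-script">
    First line of dialogue.<br>
    Second line of dialogue.<br>
    Third line of dialogue.
  </div>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")  # built-in parser, no lxml needed
print(soup.find("article")["class"])  # class is multi-valued, so this is a list
```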
Finding an HTML element with Beautiful Soup
Finding a specific HTML element in a Beautiful Soup object is straightforward: call the .find() method on the soup object created earlier.
As an example, let's find the box that contains the movie title, description, and transcript. It sits inside an article tag with the class main-article, and a single line of code retrieves it:
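A minimal sketch of that lookup, run against a hypothetical sample page instead of the live site (html.parser is used so it works without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the structure described above.
SAMPLE_HTML = """
<article class="main-article">
  <h1>Titanic (1997) - full transcript</h1>
  <div class="full-script">First line.<br>Second line.</div>
</article>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ has a trailing underscore because "class" is a Python keyword.
box = soup.find("article", class_="main-article")
print(box.name)
```

Note that .find() returns None when nothing matches, so a real scraper may want to check the result before using it.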
Now let's find the movie title and the transcript. The title sits inside an h1 tag with no class, which does not prevent us from finding it. Once the element is found, call .get_text() to extract its text:
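The title lookup might look like the following sketch, again on hypothetical sample markup rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the structure described above.
SAMPLE_HTML = """
<article class="main-article">
  <h1>Titanic (1997) - full transcript</h1>
  <div class="full-script">First line.<br>Second line.</div>
</article>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find("article", class_="main-article")

# Searching inside box limits the lookup to that part of the page.
title = box.find("h1").get_text()
print(title)
```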
The transcript sits inside a div tag with the class full-script. To get clean text here, change two default arguments of .get_text(). First, set strip=True to strip leading and trailing whitespace, including line breaks (\n), from every piece of text. Then set separator=" " so that the pieces are joined with a single space instead of running together:
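The effect of the two arguments is easiest to see on a small example. The markup below is a hypothetical stand-in for the transcript block; html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical transcript block: lines separated by <br> and newlines.
html = """
<div class="full-script">
  First line of dialogue.<br>
  Second line of dialogue.<br>
  Third line of dialogue.
</div>
"""
soup = BeautifulSoup(html, "html.parser")
script = soup.find("div", class_="full-script")

# strip=True trims each text fragment; separator=" " joins the fragments.
transcript = script.get_text(strip=True, separator=" ")
print(transcript)
```

Without these arguments, .get_text() would return the fragments with their original newlines and indentation.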
At this point, scraping a single page is complete. Print the title and transcript variables to the console to check that your code works correctly.
Exporting the data to a .txt file
You will probably want to store the scraped data for later use, for example by exporting it to .csv, .json, or another format. In this example, the extracted data is saved to a .txt file.
The with keyword comes in handy for the export, because it closes the file automatically:
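A sketch of the export step is shown here. The save_transcript helper, its directory parameter, and the placeholder title and transcript are assumptions made for illustration; the demo writes into a temporary directory so it leaves no files behind.

```python
import os
import tempfile


def save_transcript(title, transcript, directory="."):
    """Write the transcript to '<title>.txt' in directory; return the path."""
    path = os.path.join(directory, f"{title}.txt")  # f-string builds the name
    # "with" closes the file automatically, even if write() raises.
    with open(path, "w", encoding="utf-8") as file:
        file.write(transcript)
    return path


# Demo in a throwaway directory; omit the directory argument to write
# into the current working directory instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_transcript(
        "Titanic (1997) - full transcript", "First line. Second line.", tmp
    )
    with open(saved, encoding="utf-8") as file:
        read_back = file.read()

print(read_back)
```

A title that contains characters such as "/" is not a valid file name, so a real scraper may need to clean the title first.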
Note that the example uses an f-string (a formatted string literal) to set the movie title as the file name.
After running the code, a .txt file should appear in your working directory.
Now that you have scraped a single page, you are ready to scrape several pages at once.
Section 2: Scraping multiple pages
Below is a screenshot of the first page of the transcript site. The site has 1,234 such pages, each listing about 30 movies:
Pages is the list of site pages; List of Movie Transcripts is the list of movies.
In this second part you will learn how to scrape multiple links by getting the href attribute of each one. First, change the website being scraped from the one used in the first example to the one shown above.
The new website variable, which holds the link to the site, is defined as follows:
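The two variables could be defined as in the sketch below. The site name comes from the tutorial, but the https scheme and the /movies path of the listing page are assumptions.

```python
# Site root, reused later to turn relative links into full URLs.
root = "https://subslikescript.com"

# Assumed path of the page that lists the movies.
website = f"{root}/movies"
print(website)
```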
You may notice that the example also introduces another new variable, root. It will be needed later.
Getting the href attribute
Next you will learn how to get the href attribute for all 30 movies listed on one page. Start by picking any movie title from the "List of Movie Transcripts" box in the screenshot above.
Now inspect the page's HTML. Each <a></a> tag corresponds to one movie title, and the <a></a> link tag of the movie you picked should be highlighted in blue, as the screenshot shows:

The screenshot shows that the links inside href do not include the site root, subslikescript.com. This is why the root variable created earlier is needed to build the full web address.
Now let's find all the <a></a> link tags on the page that lists the movies.
Finding multiple HTML elements at once with Beautiful Soup
To find multiple elements, Beautiful Soup provides the .find_all() method. Passing href=True restricts the result to tags that have an href attribute, which makes it easy to collect the links to each transcript page:
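The sketch below shows .find_all() with href=True on a hypothetical listing page. The link targets and movie names are made up, and the third anchor has no href on purpose to show that it is filtered out.

```python
from bs4 import BeautifulSoup

# Hypothetical listing page; the hrefs are relative, as on the real site.
LIST_HTML = """
<article class="main-article">
  <ul>
    <li><a href="/movie/Titanic-120338">Titanic (1997)</a></li>
    <li><a href="/movie/Another_Movie-0000001">Another Movie</a></li>
    <li><a name="top">Anchor without href</a></li>
  </ul>
</article>
"""
soup = BeautifulSoup(LIST_HTML, "html.parser")
box = soup.find("article", class_="main-article")

# href=True keeps only <a> tags that actually carry an href attribute.
link_tags = box.find_all("a", href=True)
print(len(link_tags))
```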
To extract the links themselves, add ['href'] to the expression above. However, .find_all() returns a list of tags rather than a single tag, so the href attribute has to be read from each tag in turn inside a loop:
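A loop of that kind might look like this sketch, which uses a hypothetical two-link page:

```python
from bs4 import BeautifulSoup

# Hypothetical listing page with relative links.
LIST_HTML = """
<article class="main-article">
  <a href="/movie/Titanic-120338">Titanic (1997)</a>
  <a href="/movie/Another_Movie-0000001">Another Movie</a>
</article>
"""
soup = BeautifulSoup(LIST_HTML, "html.parser")
box = soup.find("article", class_="main-article")

hrefs = []
for link in box.find_all("a", href=True):
    # Indexing a tag like a dictionary returns the attribute's value.
    hrefs.append(link["href"])
    print(link["href"])
```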
It is convenient to collect the links in the required form into a list of your own, which a list comprehension does in one line:
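The same extraction written as a list comprehension, sketched on a hypothetical two-link page:

```python
from bs4 import BeautifulSoup

# Hypothetical listing page with relative links.
LIST_HTML = """
<article class="main-article">
  <a href="/movie/Titanic-120338">Titanic (1997)</a>
  <a href="/movie/Another_Movie-0000001">Another Movie</a>
</article>
"""
soup = BeautifulSoup(LIST_HTML, "html.parser")
box = soup.find("article", class_="main-article")

# One href string per matching <a> tag, in document order.
links = [link["href"] for link in box.find_all("a", href=True)]
print(links)
```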
If you print the list with print(), you will see all the links whose pages we are going to scrape. The next step scrapes each of them.
Following each of the required links
To get the transcript behind each of the links collected in the previous step, repeat the same steps used for a single link, this time inside a for loop that runs them once per link:
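One way to structure that loop is sketched below. The function names, the https scheme of ROOT, and the sample page are assumptions for illustration; the download function is passed in as a parameter so the demo can run offline with a stand-in page.

```python
from bs4 import BeautifulSoup
import requests

ROOT = "https://subslikescript.com"  # site root; https scheme assumed


def parse_movie_page(html):
    """Return (title, transcript) from one transcript page's HTML."""
    soup = BeautifulSoup(html, "html.parser")  # the tutorial passes "lxml"
    box = soup.find("article", class_="main-article")
    title = box.find("h1").get_text()
    script = box.find("div", class_="full-script")
    return title, script.get_text(strip=True, separator=" ")


def fetch(url):
    """Download one page (needs network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_links(links, get_html=fetch):
    """Visit every relative link and collect the parsed results."""
    results = []
    for link in links:
        html = get_html(f"{ROOT}{link}")  # root + relative href = full URL
        results.append(parse_movie_page(html))
    return results


# Offline demo: a hypothetical page replaces the real network call.
FAKE_PAGE = """
<article class="main-article">
  <h1>Titanic (1997) - full transcript</h1>
  <div class="full-script">First line.<br>Second line.</div>
</article>
"""
for title, transcript in scrape_links(["/movie/Titanic-120338"],
                                      get_html=lambda url: FAKE_PAGE):
    print(title, "->", transcript)
```

To scrape the real site, call scrape_links(links) with the list of hrefs collected earlier and leave get_html at its default.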