I am scraping a few links with BeautifulSoup; however, it seems to completely ignore <br>
tags.
Here is the relevant portion of the source code of the URL I am scraping:
<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span id="something"></span></h1>
Here is my BeautifulSoup code (relevant part only) to get the text within h1
tags:
soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.text.strip()
print title
This gives the following output:
A quick brown fox jumps overthe lazy dog
Whereas I am expecting:
A quick brown fox jumps over the lazy dog
How can I replace the <br> with a space in my code?
How about using .get_text() with the separator parameter?
from bs4 import BeautifulSoup
page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''
soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.get_text(separator=" ").strip()
print (title)
Output:
A quick brown fox jumps over the lazy dog
some stuff here
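If the goal is a single line with no leftover newline, one option is to let get_text() strip each text fragment as well. The following is only a sketch of that idea, reusing the same sample markup; it relies on the strip argument of get_text(), which trims every fragment and skips empty ones before joining them with the separator.

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# strip=True trims each text fragment and drops empty ones,
# then the fragments are joined with the separator.
title = title_box.get_text(separator=" ", strip=True)
print(title)
```

With this variant the trailing .strip() call should no longer be needed, since every fragment is already trimmed.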
Using replace() on the html before parsing:
from bs4 import BeautifulSoup
html = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''
html = html.replace("<br>", " ")
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.get_text().strip()
print (title)
OUTPUT:
A quick brown fox jumps over the lazy dog
some stuff here
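Note that a plain string replace only matches the exact spelling "<br>", so markup written as <br/> or <br /> would slip through. A possible alternative, sketched here under the assumption that the page may use the self-closing form, is to parse first and then swap each br element in the tree for a space with replace_with():

```python
from bs4 import BeautifulSoup

# Same sample as above, but with the self-closing <br/> spelling (an assumption).
html = '''<h1 class="para-title">A quick brown fox jumps over<br/>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# Replace every <br> element, however it was written, with a single space.
for br in title_box.find_all('br'):
    br.replace_with(' ')

title = title_box.get_text().strip()
print(title)
```

Because the replacement happens on the parsed tree, it does not depend on how the tag was typed in the source.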
EDIT:
For the part the OP mentioned in the comments below:
html = '''<div class="description">Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
from bs4 import BeautifulSoup
html = html.replace("\n", ". ")
soup = BeautifulSoup(html, 'html.parser')
div_box = soup.find('div', attrs={'class': 'description'})
divText= div_box.get_text().strip()
print (divText)
OUTPUT:
Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four..
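The doubled period at the end appears because the last line already ends with "." before the newline is replaced. If that matters, one way around it is to join the lines of the extracted text yourself and add a period only where one is missing. This is a rough sketch on a plain string; join_lines is a hypothetical helper name, and the sample text is an abbreviated version of the div above.

```python
def join_lines(text):
    """Join non-empty lines into one string, ending each with a period."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not line.endswith(('.', '!', '?')):
            line += '.'  # add a period only when the line lacks end punctuation
        parts.append(line)
    return ' '.join(parts)


# Abbreviated sample of the text that get_text() would return for the div.
div_text = '''Planet Nine was initially proposed to explain the clustering of orbits
Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
'''

print(join_lines(div_text))
```

In the code above you would call it on div_box.get_text() without doing the html.replace("\n", ". ") step first.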
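Use the str.replace function on the page source before parsing it (once the text has been extracted, the <br> tag is already gone, so replacing it in title has no effect):
page = page.replace("<br>", " ")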