Extracting a URL in Python
In response to the OP's edit, I hijacked Find Hyperlinks in Text using Python (Twitter related) and came up with this:
import re
myString = "This is my tweet check it out http://example.com/blah"
print(re.search("(?P<url>https?://[^\s]+)", myString).group("url"))
How do you extract a URL from a string using Python?
There may be a few ways to do this, but the cleanest is to use a regex:
>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If there can be multiple links, you can use something similar to the below:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
>>>
Extract all URLs in a string with Python 3
Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.
Apparently, it tries to find any occurrence of a TLD in the given text. If a TLD is found, it expands the boundaries from that position in both directions, searching for a "stop character" (usually whitespace, a comma, or a single or double quote).
For example:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
print(urls) # prints: ['youfellasleepwhilewritingyourtitle.com']
This module also seems to have an update() method, which lets you update the TLD list cache file.
However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of URLs:
results = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
You can then build other lists which hold the allowed protocols / TLDs / domains, etc.:
allowed_protocols = ['protocol_1', 'protocol_2']
allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
allowed_domains = ['domain_1']
for each_url in results:
# here, check each url against your rules
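As a sketch of what such a check could look like (the allowed-protocol and allowed-domain values here are illustrative placeholders, not from the original answer), the standard library's urllib.parse.urlparse can split each URL into its scheme and host:

```python
from urllib.parse import urlparse

results = ['https://www.lorem.com/ipsum.php?q=suas',
           'https://www.lorem.org',
           'http://news.bbc.co.uk']

# placeholder rules -- substitute your own protocols/domains
allowed_protocols = ['http', 'https']
allowed_domains = ['www.lorem.com', 'news.bbc.co.uk']

filtered = []
for each_url in results:
    parts = urlparse(each_url)
    # keep the URL only if both its scheme and its host pass the checks
    if parts.scheme in allowed_protocols and parts.netloc in allowed_domains:
        filtered.append(each_url)

print(filtered)  # ['https://www.lorem.com/ipsum.php?q=suas', 'http://news.bbc.co.uk']
```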
How to extract a URL from a string and save to a list
Are you just after the "next page" or do you want all the links?
So do you want just:
/jobs?q=software+engineer+&l=Kerala&start=10
or are you after all of these?
/jobs?q=software+engineer+&l=Kerala&start=10
/jobs?q=software+engineer+&l=Kerala&start=20
/jobs?q=software+engineer+&l=Kerala&start=30
/jobs?q=software+engineer+&l=Kerala&start=40
A few issues:
- Links1 is a list of elements, and you are then using .find('a') on a list, which won't work.
- Since you want href attributes, consider using find('a', href=True).
So here's how I would go about it:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class":"pagination"})
url = [tag.find('a',href=True)['href'] for tag in Links1]
website=f'https://in.indeed.com{url[0]}'
print(website)
Output:
https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
To get all those links:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div",{"class":"pagination"})
urls = [tag['href'] for tag in Links1.find_all('a',href=True)]
# prefix each relative link with the site root
websites = [f'https://in.indeed.com{u}' for u in urls]
Use RegEx in Python to extract URL and optional query string from web server log data
You can use
^(?P<url>[^?]+)(?P<querystr>\?.*)?$
Details:
^ - start of string
(?P<url>[^?]+) - Group "url": any one or more chars other than ?
(?P<querystr>\?.*)? - an optional Group "querystr": a ? char and then any zero or more chars other than line break chars, as many as possible
$ - end of string
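A quick sketch of that pattern applied to one of the job-search paths from earlier:

```python
import re

pattern = re.compile(r'^(?P<url>[^?]+)(?P<querystr>\?.*)?$')

m = pattern.match('/jobs?q=software+engineer+&l=Kerala&start=10')
print(m.group('url'))       # /jobs
print(m.group('querystr'))  # ?q=software+engineer+&l=Kerala&start=10

# the query-string group is optional, so plain paths still match
m = pattern.match('/jobs')
print(m.group('querystr'))  # None
```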
Extracting URL link using regular expression re - string matching - Python
re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))
The [^\s<>"]+ part matches any non-whitespace, non-quote, non-angle-bracket character, to avoid matching strings like:
<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
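For example, running the pattern against an HTML fragment like the one above keeps the URL but not the surrounding markup:

```python
import re

text = '<a href="http://www.example.com/stuff">http://www.example.com/stuff</a>'
urls = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', text)
print(urls)  # ['http://www.example.com/stuff', 'http://www.example.com/stuff']
```

Each match stops at the closing quote or angle bracket, so the href value and the link text are extracted cleanly.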