
Building an Efficient Web Crawler with Python: From Raw data to a Dashboard

From Raw Data to a Dashboard: how to efficiently extract and process large volumes of data from websites using Python.

Web scraping is a powerful technique for extracting data from websites. When dealing with large datasets or numerous pages, efficiency becomes crucial. This blog post demonstrates how I built a scalable web crawler using Python. We will break down a complete web crawler that paginates through search results, scrapes individual job details, and saves the data for use in a dashboard.

The target site is seek.co.nz, a popular job board with thousands of listings across all of New Zealand. I chose this site because I used it when I was living in New Zealand.

So let's start.

Setting Up the Environment

Before diving into the code, we have to ensure the necessary libraries are installed.

pip install requests lxml
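For reference, these are the imports the snippets throughout this post rely on. requests and lxml come from pip; the rest is either standard library or the project's own modules (config, scraper, saver, and a small Tools helper, whose module name I'm assuming here):

import concurrent.futures
import json
import math
import re

import requests
from lxml import html

import config
import scraper
import saver
from tools import Tools  # project logging helper; module name assumed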

Initialization

The __init__ method initializes the necessary tools and configurations.

class Crawler:
    def __init__(self):
        # To log events
        self.__tools_obj = Tools()
        self.__logger = self.__tools_obj.get_logger("./")

        # Scraping module
        self.__scraper = scraper.Scraper()

        # Saving module
        self.__saver = saver.Saver()

        # To keep all the job URLs
        self.__totalJobsUrls = []

        # To keep the jobs data
        self.__totalJobsData = []

        # String template that represents the base URL for the API request
        self.__startURL = config.startURL.format(config.where, config.page, config.keywords)
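The Tools class itself is not shown in this post. As a rough sketch of what its get_logger helper might look like (this is my assumption, not the actual implementation), it could simply configure a standard library logger that writes a log file next to the given path; the self.logging calls in the snippets below presumably delegate to this logger:

import logging
import os


class Tools:
    def get_logger(self, path):
        # Hypothetical implementation: a file logger stored alongside the given path
        logger = logging.getLogger("crawler")
        if not logger.handlers:
            handler = logging.FileHandler(os.path.join(path, "crawler.log"))
            handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
            logger.addHandler(handler)
            logger.setLevel(logging.INFO)
        return logger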

Crawl request

The “crawl_request” method orchestrates the crawling process, starting with pagination and then scraping. I pass the startURL to the pagination function and, once all the job URLs have been gathered, I pass the totalJobsUrls list to the scraping function to extract the data from each of them.

def crawl_request(self):
    try:
        # Start pagination process
        self.pagination(self.__startURL)

        # Start scraping process
        self.scraping(self.__totalJobsUrls)

    except Exception as e:
        self.logging('Error in crawl_request function: {}'.format(e))
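For context, kicking off the whole process is just a matter of instantiating the class and calling crawl_request; the file layout below is my assumption, since the post does not show it:

# main.py (hypothetical entry point)
if __name__ == "__main__":
    crawler = Crawler()
    crawler.crawl_request()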

Pagination process

Here, the pagination method calculates the total number of pages and uses “ThreadPoolExecutor” for concurrent execution. The try_request function wraps the Python requests library; I use it to make a request to the startURL and get the total count of posted jobs, which tells me how many pages I have to iterate through looking for job posts.

# Pagination process
def pagination(self, startURL):
    try:
        # Calculate the number of pages (the API returns 22 results per page)
        response_raw = self.try_request(startURL)
        response_json = json.loads(response_raw.text)
        totalItems = response_json["totalCount"]
        calc_pages = totalItems / 22
        pages = math.ceil(calc_pages)

        # Paginate with parallel execution
        with concurrent.futures.ThreadPoolExecutor(max_workers=config.workers) as executor:
            for page in range(1, pages + 1):
                # Submit one task per results page
                executor.submit(self.parallel_pagination, page)
    except Exception as e:
        self.logging('ERROR in pagination - {}'.format(e))
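As a quick sanity check of the page calculation: with 22 results per page and a totalCount of, say, 460 jobs (a made-up number for illustration), it works out like this:

import math

totalItems = 460               # example totalCount from the API response
pages = math.ceil(totalItems / 22)
print(pages)                   # 21 pages to paginate through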

# Python Requests
def try_request(self, url):
    try:
        response = requests.request("GET", url)
        return response
    except Exception as e:
        self.logging('ERROR in try requests: {}'.format(e))
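If you want the request step to be a bit more defensive, a timeout and an explicit status check help; this is just a sketch of a possible variant, not what the original code does:

def try_request(self, url):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()   # turn HTTP error codes into exceptions
        return response
    except Exception as e:
        self.logging('ERROR in try requests: {}'.format(e))
        return None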

To speed up the pagination process I'm using parallel execution with 10 workers, defined in a config.py file, which also holds the start URL and the other search parameters.

# config.py file

startURL = 'https://www.seek.co.nz/api/chalice-search/v4/search?siteKey=NZ-Main&sourcesystem=houston&userqueryid=43b3a37fc63aa9dfa2bc5edbd5dba489-3490962&userid=a185362f-be3a-4ad6-a6b9-33f499f10989&usersessionid=a185362f-be3a-4ad6-a6b9-33f499f10989&eventCaptureSessionId=a185362f-be3a-4ad6-a6b9-33f499f10989&where={}&page={}&seekSelectAllPages=true&keywords={}&include=seodata&locale=en-NZ&solId=94e340f0-cd6d-4c35-ba5a-8614c67aad65'
jobURL = 'https://www.seek.co.nz/job/'

workers = 10

# Search parameters
keywords = 'project manager'
page = 1
where = 'All+New+Zealand'
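To make the template mechanics concrete, this is roughly what filling in the placeholders looks like. The crawler passes keywords straight into the URL; if the space in 'project manager' causes problems, urllib.parse.quote_plus is one way to encode it (that encoding step is my addition, not part of the original crawler):

from urllib.parse import quote_plus

import config

url = config.startURL.format(config.where, config.page, quote_plus(config.keywords))
# ...&where=All+New+Zealand&page=1&...&keywords=project+manager&...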

Parallel Pagination

The parallel_pagination method processes each page concurrently, building job URLs.

For that, I make a request to the startURL, look up the “data” value inside the JSON response, and then iterate over it to build all the job URLs and store them in the totalJobsUrls list.

# Parallel Pagination process

def parallel_pagination(self, page):
    try:
        response_raw = self.try_request(config.startURL.format(config.where, page, config.keywords))
        response_json = json.loads(response_raw.text)

        # Get the job list located in the "data" value
        jobs_list = response_json['data']
        for job in jobs_list:
            job_id = job['id']

            # Build the job URL
            job_url = config.jobURL + str(job_id)
            self.__totalJobsUrls.append(job_url)
    except Exception as e:
        self.logging('ERROR in parallel_pagination: {}'.format(e))
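Each job id from the search response simply gets appended to the job base URL, so a built entry in totalJobsUrls looks like this (the id is made up):

print(config.jobURL + str(75123456))
# https://www.seek.co.nz/job/75123456

Appending to the shared totalJobsUrls list from several worker threads is fine here, because list.append is thread-safe in CPython.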

Scraping

The scraping method processes each job URL concurrently. I pass the jobs_list and, using parallel execution with the same 10 workers as in the pagination process, data extraction runs up to roughly 10 times faster than a sequential run. Because the ThreadPoolExecutor is used as a context manager, all scraping tasks finish before the saver is called.

# Scraping process

def scraping(self, jobs_list):
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=config.workers) as executor:
            for url in jobs_list:
                self.logging('Scraping process')
                executor.submit(self.parallel_scraping, url)

        # Saver process (runs once every scraping task has finished)
        self.__saver.saver(self.__totalJobsData)
    except Exception as e:
        self.logging('ERROR in scraping - {}'.format(e))
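The Saver module is not shown in the post; it only says the scraped data ends up stored for the dashboard, later described as a database. Purely as an assumption about what its saver method could do, a minimal sketch might dump each job dictionary into SQLite:

import json
import sqlite3


class Saver:
    def saver(self, jobs_data):
        # Hypothetical implementation: store each scraped job dict as a JSON row
        conn = sqlite3.connect('jobs.db')
        conn.execute('CREATE TABLE IF NOT EXISTS jobs (data TEXT)')
        conn.executemany('INSERT INTO jobs (data) VALUES (?)',
                         [(json.dumps(job),) for job in jobs_data])
        conn.commit()
        conn.close()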

Parallel Scraping

The parallel_scraping method extracts job details from each URL concurrently. After each request to the job URL I get the raw HTML, parse it with the html.fromstring method, and then use XPath to find the script tag that contains all the relevant data I was looking for.

Cleaning and converting the JSON data

The extracted script tag content includes the window.SEEK_REDUX_DATA variable, which holds the JSON data. I use regular expressions to isolate this JSON string and then replace occurrences of undefined with null, making the JSON string valid.

With all this done, all that remains is to send the jobJson object to the Scraper's create_dict method.

def parallel_scraping(self, url):
    try:
        # Extract the JSON embedded in the HTML
        response_raw = self.try_request(url)
        response_html = html.fromstring(response_raw.text)
        responseJson_raw = response_html.xpath('//script[@data-automation]/text()')[0]

        # Cleaning and converting the JSON data
        responseJson_raw = re.findall(r'window\.SEEK_REDUX_DATA = (.*);', responseJson_raw)[0]
        responseJson_raw = re.sub(r'undefined', 'null', responseJson_raw)
        responseJson = json.loads(responseJson_raw)
        jobJson = responseJson["jobdetails"]["result"]["job"]

        # Scraper process begins here
        result_dict = self.__scraper.create_dict(jobJson)
        self.__totalJobsData.append(result_dict)
    except Exception as e:
        self.logging('Error in parallel_scraping function: {}'.format(e))
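To see the cleaning step in isolation, here is the same regex logic applied to a tiny made-up script snippet:

import json
import re

sample_script = 'window.SEEK_REDUX_DATA = {"jobdetails": {"salary": undefined}};'

raw = re.findall(r'window\.SEEK_REDUX_DATA = (.*);', sample_script)[0]
raw = re.sub(r'undefined', 'null', raw)   # undefined is not valid JSON
print(json.loads(raw))                    # {'jobdetails': {'salary': None}}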

The Scraper class

I defined a result_dict dictionary first to hold the data before saving it into the database.

class Scraper:
    def create_dict(self, jobJson):

        result_dict = {"titulo": None,
                       "phone_1": None,
                       "phone_2": None,
                       "phone_3": None,
                       "summary": None,
                       "type": None,
                       "salary": None,
                       "location": None,
                       "email_1": None,
                       "email_2": None,
                       "link": None,
                       }

        # Extract the data from the JSON
        try:
            result_dict["titulo"] = jobJson["title"]
            result_dict["phone_1"] = jobJson["phoneNumber"]

            # Job description text (field name is my assumption), searched for extra phones and emails
            content = jobJson.get("content", "") or ""

            # Find extra phones
            phone_list = re.findall(r'\d(?:\s?\d){5,}', content)
            if len(phone_list) > 0:
                try:
                    result_dict["phone_2"] = phone_list[0]
                    result_dict["phone_3"] = phone_list[1]
                except Exception:
                    pass

            # Emails
            email_list = re.findall(r'\b([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b', content)
            if len(email_list) > 0:
                try:
                    result_dict["email_1"] = email_list[0]
                    result_dict["email_2"] = email_list[1]
                except Exception:
                    pass

            try:
                # Salary
                salary_label = jobJson["salary"]["label"]
                result_dict["salary"] = salary_label
            except Exception:
                pass

            # rest of the code

        except Exception:
            pass

        return result_dict

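To see what create_dict produces, here is a quick test with a made-up jobJson; the field values are illustrative, and it assumes the elided parts of the method end by returning result_dict as shown above:

scraper_obj = Scraper()

fake_job = {
    "title": "Project Manager",
    "phoneNumber": "09 123 4567",
    "content": "Contact us on 021 555 1234 or at careers@example.co.nz",
    "salary": {"label": "$120k - $140k per year"},
}

print(scraper_obj.create_dict(fake_job))
# {'titulo': 'Project Manager', 'phone_1': '09 123 4567', 'phone_2': '021 555 1234', ...}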
Conclusion

This web crawler efficiently scrapes job listings from the Seek.co.nz website. By breaking the process down into pagination and scraping tasks and running them concurrently, we achieve a scalable and performant crawler.

Next Steps

In the next part of this post, I will explain how I built the API in Node to create the endpoints and consume them on the front end. Read the next post here.

Leave your comments below, and happy scraping!