Extracting and Analyzing Car Listings from OLX – A Web Scraping Adventure

January 9, 2024

Introduction

Web scraping is a powerful technique to extract valuable information from websites. In this blog post, we explore the process of scraping car listings from OLX, focusing on the Tamil Nadu region. We will cover topics such as web scraping, data cleaning, and parsing, providing both code snippets and detailed explanations.

Web Scraping OLX Car Listings

We start by using the requests library to fetch the HTML of OLX’s car listings for Tamil Nadu. BeautifulSoup is imported for HTML parsing, but in this first step a plain string search for a key marker (“myads”) is enough to narrow the content down to the relevant section.

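import requests
from bs4 import BeautifulSoup

url = "https://www.olx.in/tamil-nadu_g2001173/cars_c84/q-cars"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
# Use the decoded text; str(response.content) would yield a "b'...'" repr
content = response.text

# Keep everything from the "myads" marker onward, if the marker is present
p = content.find("myads")
if p != -1:
    content = content[p:]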
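One detail worth checking in the marker step: str.find returns -1 when the marker is missing, and slicing with content[-1:] would then keep only the last character of the page. The sketch below shows a guarded version on a made-up page string; the helper name slice_from_marker and the sample HTML are illustrative assumptions, not part of the original script.

```python
def slice_from_marker(text: str, marker: str) -> str:
    """Return text from the first occurrence of marker onward.

    If the marker is absent, return the text unchanged instead of
    slicing from index -1.
    """
    pos = text.find(marker)
    return text if pos == -1 else text[pos:]


# Made-up page content, only for illustration
sample = '<html><script>window.__STATE__={"myads":{"items":[]}}</script></html>'

trimmed = slice_from_marker(sample, "myads")
untouched = slice_from_marker(sample, "no-such-marker")
```

Here trimmed begins at the marker, while untouched is the full sample because the marker was not found.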

Cleaning and Extracting Data

The raw HTML content is then saved to a file for reference. Next, we split the content on the “title” keyword to obtain a list of data chunks. Each chunk that survives the cleaning steps represents one car listing.

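content_list = content.split("title")

dlist = []
for txt in content_list:
    # Restore the "title" keyword that split() removed
    val = 'title' + txt
    # Replace the escaped forward slash with a space
    val = val.replace("\\u002F", " ")
    # Cut the text at "spell", then trim stray quotes and commas
    val = val[:val.find("spell")].strip('"').strip(",")
    # Remove the image metadata between "images" and "package"
    val = val.replace(val[val.find("images"):val.find("package")], "")
    # Keep only the text up to and including the first "]}"
    val = val[:val.find("]}")+2]

    # Skip chunks that were reduced to almost nothing
    if len(val) > 2:
        dlist.append(val)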
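To see what the split-and-clean step does, it can help to run it on a tiny input. The sketch below applies the same steps to a made-up string; the sample structure is an assumption for illustration and is not the actual OLX payload. The helper name clean_chunk is hypothetical, and unlike the loop above it skips a step when its marker is absent rather than slicing from index -1.

```python
def clean_chunk(txt: str) -> str:
    """Apply the cleaning steps to one chunk, skipping steps whose marker is absent."""
    val = "title" + txt
    val = val.replace("\\u002F", " ")      # escaped "/" becomes a space
    cut = val.find("spell")
    if cut != -1:
        val = val[:cut]
    val = val.strip('"').strip(",")
    start, end = val.find("images"), val.find("package")
    if start != -1 and end > start:
        val = val[:start] + val[end:]      # drop the image metadata
    close = val.find("]}")
    return val[:close + 2] if close != -1 else ""


# Made-up sample with two listings (an assumption, not real OLX data)
sample = (
    'myads":[{"title":"Maruti Swift 2018","price":{"value":{"raw":450000,"currency":"INR"}},'
    '"images":[{"id":1}],"package":null,"tags":["a"]},'
    '{"title":"Honda City Petrol\\u002FCNG","price":{"value":{"raw":380000,"currency":"INR"}},'
    '"images":[{"id":2}],"package":null,"tags":["b"]}],"spell":{}'
)

dlist = []
for txt in sample.split("title"):
    val = clean_chunk(txt)
    if len(val) > 2:
        dlist.append(val)
```

The first chunk (everything before the first “title”) has no "]}" and is dropped, so dlist ends up with one entry per listing.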

Extracting Relevant Information

With the data chunks in hand, we filter out unwanted information and extract relevant details such as car titles and prices. We create a list final_data to store this refined information.

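final_data = []
for data in dlist:
    # Skip chunks that cannot be car listings
    if ":" not in data: continue
    if 'title"' not in data: continue
    if "OLX" in data: continue
    if "Length" in data: continue
    if "_length" in data: continue

    # The title runs from just after 'title":' up to the first '","'
    title = data[data.find("title")+7 : data.find('","')+1]
    # The raw price sits between '"raw":' and ',"currency"'
    value = data[data.find('"raw":')+6 : data.find(',"currency"')]

    # Keep only rows whose price is purely numeric
    if not value.isdigit(): continue
    final_data.append([title, value])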
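As an alternative to index arithmetic, the same two fields can be pulled out with regular expressions. This is a sketch of a different technique, not the method used above: the patterns, the helper name extract_title_price, and the sample chunks are all assumptions for illustration. It also returns the title without surrounding quotes and the price as an int.

```python
import re

# Patterns assume the chunk layout used in the sample below
TITLE_RE = re.compile(r'title":"([^"]*)"')
PRICE_RE = re.compile(r'"raw":(\d+),"currency"')


def extract_title_price(chunk: str):
    """Return [title, price] for a chunk, or None if either field is missing."""
    title = TITLE_RE.search(chunk)
    price = PRICE_RE.search(chunk)
    if title is None or price is None:
        return None
    return [title.group(1), int(price.group(1))]


# Made-up chunks, only for illustration
chunks = [
    'title":"Maruti Swift 2018","price":{"value":{"raw":450000,"currency":"INR"}},"package":null}',
    'title":"No price here","package":null}',
]

rows = [row for row in map(extract_title_price, chunks) if row is not None]
```

Chunks without a numeric price are skipped, which mirrors the isdigit() check in the loop above.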

Parsing and Cleaning HTML Content

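To better understand the extracted information, we define two helper functions. cleanhtml replaces every HTML tag with a " || " separator, and start_word returns the first word of a description that word_mix (a helper not shown in this post) maps to the set {',', 'n'}.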

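import re

def cleanhtml(raw_html):
    # Replace every HTML tag with a " || " separator
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' || ', raw_html)
    return cleantext

def start_word(txt):
    # word_mix is a helper defined elsewhere; it is not shown in this post
    sw = ""
    for w in txt.split():
        if word_mix(w) == {',', 'n'}:
            sw = w
            break
    return sw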
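A quick way to see what cleanhtml produces is to run it on a small fragment. The sketch below repeats the function so it is self-contained and then splits the result on the separator; the sample HTML and the follow-up split into fields are assumptions for illustration.

```python
import re

TAG_RE = re.compile(r"<.*?>")


def cleanhtml(raw_html: str) -> str:
    """Replace every HTML tag with a ' || ' separator."""
    return TAG_RE.sub(" || ", raw_html)


# Made-up fragment, only for illustration
sample_html = "<div><b>Swift</b><span>2018</span></div>"

cleaned = cleanhtml(sample_html)
# Split on the separator and drop the empty pieces left by adjacent tags
fields = [part.strip() for part in cleaned.split("||") if part.strip()]
```

Adjacent tags leave empty pieces between separators, so filtering them out is usually the first thing to do with the cleaned text.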

Parsing Car Descriptions

We create a parser function to process the cleaned HTML content, extracting relevant details. The results are then written to an output file.

# parser and car_content are defined elsewhere and are not shown in this post
with open("output_cars.txt", "w") as f:
    for item in car_content:
        f.write(str(parser(item)) + "\n")
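Since the post does not show parser or car_content, the sketch below uses hypothetical stand-ins to illustrate the write step end to end. The parser here simply splits "||"-separated text into fields, and the input list is made up. It also writes each record with json.dumps rather than str, an assumption chosen so the file can be read back reliably.

```python
import json
import os
import tempfile


def parser(cleaned_text: str) -> list:
    """Hypothetical stand-in: split '||'-separated text into non-empty fields."""
    return [part.strip() for part in cleaned_text.split("||") if part.strip()]


# Made-up cleaned descriptions, only for illustration
car_content = [" || Swift ||  || 2018 || ", " || City || 2015 || "]

# Write to a temporary directory so the sketch leaves the working folder untouched
out_path = os.path.join(tempfile.mkdtemp(), "output_cars.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for item in car_content:
        f.write(json.dumps(parser(item)) + "\n")

# Read the records back, one JSON list per line
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Writing one JSON value per line keeps the file easy to inspect by eye and easy to load again for analysis.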

Conclusion

Web scraping is a valuable skill for extracting information from websites. In this journey, we’ve explored the OLX car listings in Tamil Nadu, delving into web scraping, data cleaning, and parsing techniques. By combining these skills, we can transform raw HTML content into structured data for further analysis or visualization.
