Year: 2025
Role: Machine Learning Engineer
Duration: 2 weeks
Summary: Multi-modal semantic search engine for real estate listings on Rightmove.
I built a multi-modal semantic search system for real estate listings, enabling users to search across both images and text using natural language queries — e.g., "flat with a blue sofa and big windows".
This was motivated by a real need I experienced while searching for a flat, where visual and textual cues were often disconnected in traditional search interfaces.
Although focused on real estate, the same approach can generalize to other domains — like searching for system architecture diagrams or screen recordings based on descriptive queries.
Indexing Pipeline:
Loads listing data, including descriptions, images, and metadata
Either uses pre-computed OpenAI embeddings or computes them through inference (more on this below)
Builds an in-memory FAISS index optimized for cosine similarity (see the sketch after this list)
Makes the index available for the search API
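A minimal sketch of the index-building step, assuming the listing embeddings are already loaded as a NumPy array (the function name is mine, not from the actual codebase):

```python
import numpy as np
import faiss

def build_listing_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Build an in-memory FAISS index that supports cosine-similarity search."""
    vectors = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(vectors)                # L2-normalise so inner product equals cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)                         # add all listing vectors to the flat index
    return index
```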
Search Flow:
User enters a natural language query in the frontend
Query is sent to the FastAPI backend
Backend converts the query to an embedding vector using the same query tower model.
FAISS index performs efficient similarity search to find the most semantically similar listings
Top-N results are returned and displayed in the frontend with images and details (a sketch of this flow follows below)
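A hedged sketch of that search flow as a FastAPI endpoint. Here `embed_query` (the query tower), `index`, and `listings` are assumed to come from the indexing pipeline above; the names are illustrative.

```python
import faiss
from fastapi import FastAPI

app = FastAPI()

@app.get("/search")
def search(q: str, top_n: int = 10):
    # Embed the natural language query with the same query tower used at index time
    query_vec = embed_query(q).astype("float32").reshape(1, -1)
    faiss.normalize_L2(query_vec)
    scores, ids = index.search(query_vec, top_n)     # cosine-similarity search over listings
    return [
        {"listing": listings[int(i)], "score": float(s)}
        for i, s in zip(ids[0], scores[0])
    ]
```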
I trained the models on a dataset of 5,000 Rightmove listings, each including:
Textual descriptions
Images
Metadata such as location, price, number of bedrooms, etc.
To generate training and evaluation queries, I used ChatGPT to create semantic search prompts tailored to each listing. These queries referenced both textual features (e.g., “Victorian-style fireplace”) and visual features (e.g., “room with skylights and exposed brick”), simulating realistic multi-modal search behaviour.
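The generation step looked roughly like the sketch below; the exact prompt wording and model name are my assumptions rather than the precise ones used.

```python
from openai import OpenAI

client = OpenAI()

def generate_queries(description: str, n: int = 5) -> list[str]:
    """Ask the model for n realistic search queries for a single listing."""
    prompt = (
        f"Here is a property listing description:\n{description}\n\n"
        f"Write {n} short, natural search queries a user might type to find this listing. "
        "Mix textual features (e.g. 'Victorian-style fireplace') with visual ones "
        "(e.g. 'room with skylights and exposed brick'). Return one query per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]
```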
As a first step, I built a text-only semantic search baseline:
Used OpenAI’s text-embedding-ada-002 model to encode listing descriptions and queries.
Performed retrieval using FAISS with k-nearest neighbours (kNN) — simple but surprisingly effective.
To push performance further, I fine-tuned the model using a Two-Tower architecture with InfoNCE loss on query–listing pairs, which led to noticeable improvements in relevance and ranking quality.
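The core of that fine-tuning step is the InfoNCE objective over in-batch negatives. A simplified sketch follows; the projection-head sizes and the assumption that the base embeddings stay frozen are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Small projection head applied on top of the frozen base embeddings (1536-d for ada-002)."""
    def __init__(self, dim_in: int = 1536, dim_out: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.ReLU(), nn.Linear(dim_out, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)   # unit vectors, so dot product = cosine similarity

def info_nce_loss(query_emb: torch.Tensor, listing_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the i-th listing is the positive for the i-th query."""
    logits = query_emb @ listing_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```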
To move beyond text-only search, I used CLIP for multi-modal semantic retrieval. However, CLIP’s text encoder context limit (77 tokens) wasn’t enough to handle full queries or listing descriptions.
To work around this, I:
Used a small LLaMA-1B model to break long queries and descriptions into short, self-contained statements (e.g. “blue sofa”, “has a garden”).
Computed CLIP embeddings for each of these fragments and for all listing images (see the sketch after this list).
Treated each image and text fragment as an individual embedding, and performed search over this entire pool.
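A rough sketch of the fragment-embedding step using Hugging Face’s CLIP implementation; the LLaMA-based splitting is elided here, so these helpers assume the short fragments have already been produced.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text_fragments(fragments: list[str]) -> torch.Tensor:
    """Embed short self-contained statements (e.g. 'blue sofa', 'has a garden')."""
    inputs = processor(text=fragments, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return F.normalize(feats, dim=-1)

def embed_listing_images(paths: list[str]) -> torch.Tensor:
    """Embed every listing photo with CLIP's image encoder."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)
```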
The retrieval process looked something like this:
A query like “blue sofa with garden” is split into components.
Each component is searched separately over all fragments.
The intersection of matching results (e.g. listings with both “blue sofa” and “garden”) is returned, as sketched below.
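In code, the naive intersection looked roughly like this; `search_fragments` is assumed to return the IDs of listings whose fragments or images match a single component.

```python
def intersect_search(query_components: list[str], top_k: int = 50) -> set[int]:
    """Keep only the listings that match every component of the query."""
    result_sets = [set(search_fragments(component, top_k)) for component in query_components]
    return set.intersection(*result_sets) if result_sets else set()
```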
This approach had clear limitations: inference was very slow, and the naive intersection logic missed listings where the relevant information was present but phrased differently or spread across modalities.
To improve this ranking, I:
Used average pooling of all matched embeddings per listing.
Computed the final relevance score as the average similarity between query components and the listing’s top-matching embeddings (sketched below).
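A sketch of that scoring rule, assuming both query components and listing fragments are unit-normalised embeddings:

```python
import torch

def score_listing(component_embs: torch.Tensor, listing_embs: torch.Tensor) -> float:
    """component_embs: (C, d) query components; listing_embs: (F, d) fragment/image embeddings."""
    sims = component_embs @ listing_embs.T        # (C, F) cosine similarities
    best_per_component = sims.max(dim=1).values   # each component's top-matching listing embedding
    return best_per_component.mean().item()       # average over components = relevance score
```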
This architecture is shown in the diagram above and served as the multi-modal baseline.
To improve alignment between text and image embeddings, I moved beyond average pooling and integrated attention mechanisms. Specifically:
LLaMA was used to generate query and description embeddings.
CLIP handled image embeddings.
A Transformer encoder was applied on top of these embeddings to better align text and image representations in a shared space.
For training, I again used a two-tower model with InfoNCE loss to fine-tune these embeddings, so that the model learned to align text and images more effectively.
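A hedged sketch of the fusion encoder: a small Transformer over the projected LLaMA text embedding and the CLIP image embeddings, mean-pooled into a single listing vector. The dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoder(nn.Module):
    def __init__(self, text_dim: int = 2048, image_dim: int = 512, d_model: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)    # project LLaMA description/query embeddings
        self.image_proj = nn.Linear(image_dim, d_model)  # project CLIP image embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, text_dim); image_embs: (B, n_images, image_dim)
        tokens = torch.cat(
            [self.text_proj(text_emb).unsqueeze(1), self.image_proj(image_embs)], dim=1
        )
        fused = self.encoder(tokens)                     # self-attention aligns text and image tokens
        return F.normalize(fused.mean(dim=1), dim=-1)    # pooled representation for the listing tower
```

The pooled outputs then feed the same two-tower InfoNCE training described above.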
This new architecture outperformed the previous iterations, significantly improving result quality and handling the nuances of multi-modal queries far better.