Ask HN: Tips on building a search engine?
I've recently been thinking about building my own search engine from scratch that can be deployed on cheap hardware like an RPi or an old Android phone.
If I were to start today, what are some good resources for building a search engine like Google (in terms of quality, reliability, and following web standards)?
If you've ever worked on anything like this, I'd like to know what challenges I might face when starting out or deploying to production.
Viktor Lofgren's blog is probably worth a read — he is developing the rather delightful marginalia search engine.
https://www.marginalia.nu/log/
> that can be deployed over a cheap hardware like RPi, or an old android phone.
Power usage will dominate hardware costs, making that hardware expensive, not cheap, probably more expensive than hardware designed to run 24/7 in a data center.
> build a search engine like Google (including in terms of quality, reliability and following web standards).
You won’t get reliability from “an old android phone”.
> what challenges I may face since starting or deploying in production.
Buy a napkin first, and use it to make some calculations. Starting numbers: according to https://blog.hubspot.com/marketing/google-search-statistics, Google handles about 250k queries per second, and according to https://zyppy.com/seo/google-index-size/ they have an index of about 400 billion documents.
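To make that concrete, here's rough napkin math in Python (the per-document size, index ratio, and per-node throughput are my own guesses, not numbers from those links):

    docs = 400e9                   # documents in the index (zyppy figure)
    raw_bytes = docs * 10e3        # assume ~10 KB of text per document
    index_bytes = raw_bytes * 0.3  # assume the index is ~30% of raw size
    qps = 250e3                    # queries/second (hubspot figure)
    per_node_qps = 100             # guess for one cheap machine
    print(f"raw text: {raw_bytes / 1e15:.0f} PB")
    print(f"index:    {index_bytes / 1e15:.1f} PB")
    print(f"machines: {qps / per_node_qps:,.0f} just to serve queries")

Even if every guess there is off by 10x in your favor, none of it fits on an RPi.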
Yeah, the estimates do not seem favorable, but if you look at this the right way, I think I only have to handle storage, caching, and indexing for the 400 billion documents first; scaling up to 250k queries per second is a separate problem.
You're a few years behind the times if you consider Google to be a standard of quality and reliability. Search is all but dead now, killed by an ocean of AI slop and ads because Google wanted search queries to be a moneymaker.
I think the only viable solutions going forward in a post-AI world will be decentralized, small scale and non-general, curated by human beings based on a reputation system rather than algorithms, possibly even torrent-based and not touching the web at all.
I think all the AI systems wouldn't have been possible without Google Search structuring all the random data on the internet; we would not have been able to reach this stage otherwise.
So, in theory, a better indexing engine would potentially lead to a much larger and more interesting set of problems. It's similar to a CPU: each component depends on some other component, but there are always performance bottlenecks.
The easiest way would be running Postgres:
https://www.crunchydata.com/blog/postgres-full-text-search-a...
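To sketch what the article covers (the table and query here are made up; assumes a local Postgres 12+ and psycopg2):

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=search")
    cur = conn.cursor()
    # a generated tsvector column keeps the search index in sync with the text
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id   bigserial PRIMARY KEY,
            url  text,
            body text,
            tsv  tsvector GENERATED ALWAYS AS
                 (to_tsvector('english', body)) STORED
        );
        CREATE INDEX IF NOT EXISTS pages_tsv_idx ON pages USING GIN (tsv);
    """)
    conn.commit()
    # rank matches for a free-text query
    cur.execute("""
        SELECT url, ts_rank(tsv, q) AS rank
        FROM pages, websearch_to_tsquery('english', %s) AS q
        WHERE tsv @@ q
        ORDER BY rank DESC
        LIMIT 10;
    """, ("raspberry pi search engine",))
    print(cur.fetchall())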
The problem is getting all the data. If you try to scrape another search engine, it will punish you.
You could consider a distributed, collaborative search engine that scrapes the web from the IP address of each device.
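Coordinating the frontier across devices is the hard part, but the per-device fetch side is small. A minimal polite-fetch sketch using only the Python standard library (the user-agent string is invented; a real crawler also needs rate limiting):

    import urllib.robotparser
    import urllib.request

    def polite_fetch(url, agent="hobby-crawler"):
        # honor robots.txt before touching the page (the web-standards part)
        root = "/".join(url.split("/")[:3])  # scheme://host
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()
        if not rp.can_fetch(agent, url):
            return None                      # disallowed: skip the page
        req = urllib.request.Request(url, headers={"User-Agent": agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()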
Yes, given how the web is designed there will always be different subsections of the internet I can scrape, but crawling outward from one site like Wikipedia would only cover one subsection of it, mainly reference material and news coverage. If I wanted an index of spam, ads, and untrusted blog posts, that dataset would never yield correct or even nearly correct results.
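Crawling outward from one seed is just breadth-first search over links, so you stay in whatever neighborhood the seed lives in. A rough sketch (fetch_links is a placeholder for whatever fetch-and-parse step you use):

    from collections import deque

    def crawl(seed, fetch_links, limit=1000):
        # classic BFS frontier: explores the seed's link neighborhood first
        seen, frontier = {seed}, deque([seed])
        while frontier and len(seen) < limit:
            url = frontier.popleft()
            for link in fetch_links(url):  # caller supplies the fetcher
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen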
You're going to face a scalability challenge. With an index anywhere near Google's size, you can't fit every embedding vector in memory. You'll need a hybrid of in-memory caching and disk-persistent indices.
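One way to picture that hybrid: the full vector file lives on disk behind a memmap, with an LRU of hot vectors in RAM. A toy sketch (dimensions, file name, and cache size are invented; a real system would use a proper ANN index instead of raw vectors):

    from collections import OrderedDict
    import numpy as np

    DIM, N = 384, 10_000
    # mode="w+" creates a zeroed demo file; a real index would be
    # built offline and opened read-only with mode="r"
    vectors = np.memmap("embeddings.f32", dtype="float32",
                        mode="w+", shape=(N, DIM))

    class VectorCache:
        """LRU of hot vectors in RAM; misses fall through to disk."""
        def __init__(self, capacity=1_000):
            self.capacity = capacity
            self.lru = OrderedDict()

        def get(self, doc_id):
            if doc_id in self.lru:
                self.lru.move_to_end(doc_id)  # mark as recently used
                return self.lru[doc_id]
            v = np.array(vectors[doc_id])     # disk read on a cache miss
            self.lru[doc_id] = v
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)  # evict the coldest entry
            return v

    cache = VectorCache()
    print(cache.get(42).shape)                # -> (384,)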
At least one piece of OSS software for this should already exist, right?
I'm just hoping to achieve better search results, ones that value quality over algorithms.