Ask HN: Tips on building a search engine?
I've recently been thinking about building my own search engine from scratch that can be deployed on cheap hardware like an RPi or an old Android phone.
If I were to start today, what are some good resources for building a search engine like Google (in terms of quality, reliability, and following web standards)?
If you've ever worked on anything like this, I'd like to know what challenges I might face when starting out or deploying to production.
Viktor Lofgren's blog is probably worth a read — he is developing the rather delightful marginalia search engine.
https://www.marginalia.nu/log/
> that can be deployed over a cheap hardware like RPi, or an old android phone.
Power usage will dominate hardware costs, making that hardware expensive, not cheap, probably more expensive than hardware designed to run 24/7 in a data center.
> build a search engine like Google (including in terms of quality, reliability and following web standards).
You won’t get reliability from “an old android phone”.
> what challenges I may face since starting or deploying in production.
Buy a napkin first, and use it to make some calculations. Starting numbers: according to https://blog.hubspot.com/marketing/google-search-statistics, Google handles about 250k queries per second, and according to https://zyppy.com/seo/google-index-size/ they have an index of about 400 billion documents.
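To make that concrete, here's rough napkin math in Python (the per-document size, index ratio, and per-node throughput are my own guesses, not numbers from those links):

    docs = 400e9                   # documents in the index (zyppy figure)
    raw_bytes = docs * 10e3        # assume ~10 KB of text per document
    index_bytes = raw_bytes * 0.3  # assume the index is ~30% of raw size
    qps = 250e3                    # queries/second (hubspot figure)
    per_node_qps = 100             # guess for one cheap machine
    print(f"raw text: {raw_bytes / 1e15:.0f} PB")
    print(f"index:    {index_bytes / 1e15:.1f} PB")
    print(f"machines: {qps / per_node_qps:,.0f} just to serve queries")

Even if every guess there is off by 10x in your favor, none of it fits on an RPi.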
Yeah, the estimates do not seem favorable, but if you look at this the right way, I think I only have to handle storage, caching, and indexing for the 400 billion documents first; scaling up to 250k queries per second is a separate problem.
You're a few years behind the times if you consider Google to be a standard of quality and reliability. Search is all but dead now, killed by an ocean of AI slop and ads because Google wanted search queries to be a moneymaker.
I think the only viable solutions going forward in a post-AI world will be decentralized, small scale and non-general, curated by human beings based on a reputation system rather than algorithms, possibly even torrent-based and not touching the web at all.
I think all the AI systems wouldn't have been possible without Google Search structuring all the random data on the internet; we would not have been able to reach this stage otherwise.
So, in theory, a better indexing engine would potentially lead to a much larger and more interesting set of problems. It's similar to a CPU: each component depends on some other component, but there are always performance bottlenecks.
The easiest way would be running Postgres:
https://www.crunchydata.com/blog/postgres-full-text-search-a...
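To sketch what the article covers (the table and query here are made up; assumes a local Postgres 12+ and psycopg2):

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=search")
    cur = conn.cursor()
    # a generated tsvector column keeps the search index in sync with the text
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id   bigserial PRIMARY KEY,
            url  text,
            body text,
            tsv  tsvector GENERATED ALWAYS AS
                 (to_tsvector('english', body)) STORED
        );
        CREATE INDEX IF NOT EXISTS pages_tsv_idx ON pages USING GIN (tsv);
    """)
    conn.commit()
    # rank matches for a free-text query
    cur.execute("""
        SELECT url, ts_rank(tsv, q) AS rank
        FROM pages, websearch_to_tsquery('english', %s) AS q
        WHERE tsv @@ q
        ORDER BY rank DESC
        LIMIT 10;
    """, ("raspberry pi search engine",))
    print(cur.fetchall())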
The problem is getting all the data. If you try to scrape another search engine, it will punish you.
You could consider a distributed, collaborative search engine that scrapes the web from the IP address of each device.
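Coordinating the frontier across devices is the hard part, but the per-device fetch side is small. A minimal polite-fetch sketch using only the Python standard library (the user-agent string is invented; a real crawler also needs rate limiting):

    import urllib.robotparser
    import urllib.request

    def polite_fetch(url, agent="hobby-crawler"):
        # honor robots.txt before touching the page (the web-standards part)
        root = "/".join(url.split("/")[:3])  # scheme://host
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()
        if not rp.can_fetch(agent, url):
            return None                      # disallowed: skip the page
        req = urllib.request.Request(url, headers={"User-Agent": agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()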
Yes, given how the web is designed there will always be different subsections of the internet I can scrape, but crawling outward from one site like Wikipedia would only cover one subsection of it, mainly reference material and news coverage. If I wanted an index of spam, ads, and untrusted blog posts, that dataset would never yield correct or even nearly correct results.
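Crawling outward from one seed is just breadth-first search over links, so you stay in whatever neighborhood the seed lives in. A rough sketch (fetch_links is a placeholder for whatever fetch-and-parse step you use):

    from collections import deque

    def crawl(seed, fetch_links, limit=1000):
        # classic BFS frontier: explores the seed's link neighborhood first
        seen, frontier = {seed}, deque([seed])
        while frontier and len(seen) < limit:
            url = frontier.popleft()
            for link in fetch_links(url):  # caller supplies the fetcher
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen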
You're going to face a scalability challenge. With an index anywhere near Google's size, you can't fit every embedding vector in memory. You'll need a hybrid of in-memory caching and disk-persistent indices.
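One way to picture that hybrid: the full vector file lives on disk behind a memmap, with an LRU of hot vectors in RAM. A toy sketch (dimensions, file name, and cache size are invented; a real system would use a proper ANN index instead of raw vectors):

    from collections import OrderedDict
    import numpy as np

    DIM, N = 384, 10_000
    # mode="w+" creates a zeroed demo file; a real index would be
    # built offline and opened read-only with mode="r"
    vectors = np.memmap("embeddings.f32", dtype="float32",
                        mode="w+", shape=(N, DIM))

    class VectorCache:
        """LRU of hot vectors in RAM; misses fall through to disk."""
        def __init__(self, capacity=1_000):
            self.capacity = capacity
            self.lru = OrderedDict()

        def get(self, doc_id):
            if doc_id in self.lru:
                self.lru.move_to_end(doc_id)  # mark as recently used
                return self.lru[doc_id]
            v = np.array(vectors[doc_id])     # disk read on a cache miss
            self.lru[doc_id] = v
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)  # evict the coldest entry
            return v

    cache = VectorCache()
    print(cache.get(42).shape)                # -> (384,)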
At least one piece of OSS software for this should already exist, right?
I'm just hoping to achieve better search results, ones that value quality over algorithms.