Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.
This proposal won’t do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes them freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.
(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there’s less documentation available.)
The comments here that PageRank is Google’s secret sauce also aren’t really true – Google hasn’t used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn’t have an effective Google competitor.
The real reason Google’s still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn’t just better, it’s way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.
If you really want to open the search-engine space to competition, you’d have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you’d also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.
> 1) a record of searches and user clicks for the past 20 years
From what I can tell, Google cares a lot more about recency.
When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.
After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for ‘array length’ Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!
As much as I try to use Duck Duck Go, Google is just too magic.
But I don’t think it is because they have my complete search history.
Also people forget that the creepy stuff Google does is super useful.
For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!
It really is impressive.
Here’s what the guidelines say:
“Please don’t complain that a submission is inappropriate. If a story is spam or off-topic, flag it. Don’t feed egregious comments by replying; flag them instead. If you flag something, please don’t also comment that you did.
Please don’t use Hacker News primarily for political or ideological battle. This destroys intellectual curiosity, so we ban accounts that do it.”
By 2000 everyone knew there was going to be a big correction. Few accurately imagined how big. It was like going to the doctors because you knew something was ‘a little off’ and finding out that you were going to die in three months. Then talking to all your friends and realizing that everyone you knew had the same disease.
Initially many thought it would be similar to the early 90s recession, i.e. a few layoffs with everyone getting hired back again two years later. It wasn’t until 2002 that people realized that springtime wasn’t coming back.
It wasn’t just places like pets.com that crashed. You had every single non-tech company simultaneously scaling back IT initiatives. At the time there was still a large cohort of management types who considered the Internet more of a fad than a new paradigm. These types took full advantage of the shifting winds to cut deeply into anything tech-related. This broad overcorrection did a massive amount of damage to the industry.
You can still see the impact crater. Remember the big talent shortages during 2008-2012? That’s because you no longer had a cohort of mid-career professionals to draw from, only the thin number of people who survived the collapse and a bunch of novices who were just getting started. Everyone was missing that valuable group of mid-career people with 8-12 years of experience. Basically a generation of careers strangled. Which is why you’ll often see no sympathy for the ‘talent shortage’ complaint. Five years after cutting everything and leaving people to starve, you have the same cohort of managers demanding to know “where are all the people with five years of recent experience?”