To Break Google’s Monopoly on Search, Make Its Index Public

Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.

This proposal won’t do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.

(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there’s less documentation available.)

The comments here that PageRank is Google’s secret sauce also aren’t really true – Google hasn’t used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn’t have an effective Google competitor.

The real reason Google’s still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn’t just better, it’s way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.

If you really want to open the search-engine space to competition, you’d have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you’d also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.

Sure, it costs $50 to grep it, but how much does it cost to host an in-memory index with all the data?

This is not a proposal to just share the crawl data, but the actual searchable index, presumably at arm’s-length cost both internally & externally.

The same ideas could be extended to the Knowledge Graph, etc.

IMO the goal here should not be to kill Google, but to keep Google on their toes by removing barriers to competition.

This ^ times 1000.

Google simply has the best search product. They invest in it like crazy.

I’ve tried bing multiple times. It’s slow, it spams msn ads in your face on the homepage. Microsoft just doesn’t get the value of a clean UX.

DuckDuckGo results were pretty irrelevant the last time I tried them. Nothing comes close to Google’s usability. To get people to switch, an alternative has to be much, much better than Google. Chances are that if something is, Google will buy it.

One thing to keep in mind when comparing DuckDuckGo to Google is that people do not use Google with an alternative backup in mind. When you DDG something and it fails, you can always switch to google.

But what about when Google fails? Unlike DDG, there is no culture of switching between search engines when googling. Typically, you’ll just rewrite the query for google. And as rewriting the query is an entrenched part of googling, you are less likely to notice this as a failure. It is this training that’s the core advantage nostrademons points out.

Webspam is a really big problem, yes. It’s very unlikely that you’d be able to catch up or keep up in that regard without Google’s resources.

Building the index itself is relatively easy. There are some subtleties that most people don’t think about (e.g. dupe detection and redirects are surprisingly complicated, and CJK segmentation is a pre-req for tokenizing), but things like tokenizing, building posting lists, and finding backlinks are trivial – a competent programmer could get basic English-only implementations of all three running in a day.
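
To make the "running in a day" claim concrete, here is a rough sketch of those three pieces in Python. It is not how Google or CommonCrawl does it. The function names (tokenize, build_index, search), the regex-based link extraction, and the tiny two-page corpus are all assumptions made up for illustration, and it skips everything called hard above (dupe detection, redirects, CJK segmentation, webspam).

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")          # English-only: runs of ASCII letters/digits
LINK_RE = re.compile(r'href="([^"]+)"')      # naive link extraction, not a real HTML parser
TAG_RE = re.compile(r"<[^>]+>")              # strip tags before tokenizing


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return TOKEN_RE.findall(text.lower())


def build_index(pages):
    """pages maps url -> HTML-ish text. Returns (postings, backlinks).

    postings:  term -> sorted list of urls whose visible text contains the term
    backlinks: target url -> set of urls that link to it
    """
    postings = defaultdict(list)
    backlinks = defaultdict(set)
    for url in sorted(pages):                # sorted so each posting list stays ordered
        body = pages[url]
        for target in LINK_RE.findall(body):
            backlinks[target].add(url)
        visible = TAG_RE.sub(" ", body)
        for term in set(tokenize(visible)):  # one posting per (term, url) pair
            postings[term].append(url)
    return postings, backlinks


def search(postings, query):
    """AND query: return urls that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(postings.get(terms[0], []))
    for term in terms[1:]:
        result &= set(postings.get(term, []))
    return sorted(result)


# Hypothetical two-page corpus for illustration only.
pages = {
    "a.example": '<a href="b.example">cats</a> and dogs',
    "b.example": "dogs only",
}
postings, backlinks = build_index(pages)
print(search(postings, "cats dogs"))
print(sorted(backlinks["b.example"]))
```

Ranking, freshness, and spam filtering are all absent here, which is the point of the comment: the index is the easy part.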

> 1) a record of searches and user clicks for the past 20 years

From what I can tell, Google cares a lot more about recency.

When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.

After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for ‘array length’ Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!

As much as I try to use Duck Duck Go, Google is just too magic.

But I don’t think it is because they have my complete search history.

Also people forget that the creepy stuff Google does is super useful.

For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!

It really is impressive.

 

 

 

Google Prepares to Launch New Privacy Tools to Limit Cookies

Google is set to launch new tools to limit the use of tracking cookies, a move that could strengthen the search giant’s advertising dominance and deal a blow to other digital-marketing companies, according to people familiar with the matter.

After years of internal debate, Google could as soon as this week roll out a dashboard-like function in its Chrome browser that will give internet users more information about what cookies are tracking them and offer options to fend them off, the people said.

This is a more incremental approach than that taken by less-popular browsers, such as Apple Inc.’s Safari and Mozilla Corp.’s Firefox, which introduced updates to restrict by default the majority of tracking cookies in 2017 and 2018, respectively.

Google’s move, which could be announced at its developer conference in Mountain View, Calif., starting Tuesday, is expected to be touted as part of the company’s commitment to privacy—a complicated sell, given the torrent of data it continues to store on users—and press its sizable advantage over online-advertising rivals.

The unit of Alphabet Inc. is the world’s largest digital ad seller. The coming changes aren’t expected to significantly curtail Google’s ability to collect data.

.. Yet cookies also boost competition in the advertising landscape by allowing hundreds of digital firms—large and small—to collect their own user data and sell higher-priced ads based on it. Any restriction on them is a boon to the biggest tech companies, including Google, which can target ads based on the slew of other information it collects on users through its many products.

Google, like its browser rivals, isn’t planning to end the use of cookies that websites use to make their own users’ experience smoother, such as those that store login information so users don’t have to enter it every time. Instead, it is mostly targeting cookies installed by profit-seeking third parties, separate from the owner of the website a user is actively visiting.
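
The first-party versus third-party distinction described above can be sketched roughly in Python. This is only an illustration, not how Chrome or any other browser decides: the helper names site_of and is_third_party are made up here, the domains are placeholders, and treating the last two DNS labels as the "site" is a simplifying assumption (real browsers consult the Public Suffix List, so this sketch misjudges domains such as example.co.uk).

```python
from urllib.parse import urlsplit


def site_of(host):
    """Crude 'site' key: the last two DNS labels of a hostname (assumption, see lead-in)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_third_party(page_url, cookie_domain):
    """True when the cookie's domain belongs to a different site than the page being visited."""
    page_host = urlsplit(page_url).hostname or ""
    return site_of(page_host) != site_of(cookie_domain.lstrip("."))


# A login cookie set by the site itself versus one set by a hypothetical ad network.
print(is_third_party("https://news.example.com/story", ".example.com"))
print(is_third_party("https://news.example.com/story", "tracker.adnetwork.net"))
```

Under this rule, the login cookie counts as first-party and would be left alone, while the ad network's cookie is the kind the article says is being targeted.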

If the new Google tools prompt users to broadly reject tracking cookies, some people in the industry think it could mean the long-predicted demise of a technology that is both widely criticized and used.

“It really strikes at the Achilles’ heel of the ad tech ecosystem,” said Ratko Vidakovic, a Toronto-based consultant in the digital ad industry.

.. The changes could be damaging to Google competitors that use cookies or resell data collected via cookies to companies hoping to better target ads. Shares in one such company, Paris-based Criteo SA, which helps sites tag cookies on their visitors, are down 27% since Adweek reported in late March that Google was considering new restrictions.

Criteo’s chief executive during its recent quarterly earnings call flagged risks due to coming restrictions on cookies, and said the company was working to become less reliant on cookies.

Google Quietly Disbanded Another AI Review Board Following Disagreements

LONDON—Google is disbanding a panel here to review its artificial-intelligence work in health care, people familiar with the matter say, as disagreements about its effectiveness dogged one of the tech industry’s highest-profile efforts to govern itself.

The Alphabet Inc. unit is struggling with how best to set guidelines for its sometimes-sensitive work in AI, the ability for computers to replicate tasks that only humans could do in the past. It also highlights the challenges Silicon Valley faces in setting up self-governance systems as governments around the world scrutinize issues ranging from privacy and consent to the growing influence of social media and screen addiction among children.

AI has recently become a target in that stepped-up push for oversight as some sensitive decision-making—including employee recruitment, health-care diagnoses and law-enforcement profiling—is increasingly being outsourced to algorithms. The European Commission is proposing a set of AI ethical guidelines and researchers have urged companies to adopt similar rules. But industry efforts to conduct such oversight in-house have been mixed.

But the move also came amid disagreements between panel members and DeepMind, Google’s U.K.-based AI research unit, according to people familiar with the matter. Those differences centered on the review panel’s ability to access information about research and products, the binding power of their recommendations and the amount of independence that DeepMind could maintain from Google, according to these people.

A spokeswoman for DeepMind’s health-care unit in the U.K. declined to comment specifically about the board’s deliberations. After the reorganization, the company found that the board, called the Independent Review Panel, was “unlikely to be the right structure in the future.”

Google bought DeepMind in 2014, promising it a degree of autonomy to pursue its research on artificial intelligence. In 2016, DeepMind set up a special unit called DeepMind Health to focus on health-care-related opportunities. At the same time, DeepMind co-founder Mustafa Suleyman unveiled a board of nine veterans of government and industry, drawn from the arts, sciences and technology sectors, to meet once a quarter and scrutinize its work with the U.K.’s publicly funded health service. Among its tasks, the group had to produce a public annual report.

Google said it would build the app into an “AI-powered assistant for nurses and doctors everywhere.” That caused concern in public health and privacy circles because of previous assurances from Google and DeepMind that the two wouldn’t share health records.

DeepMind Health was renamed Google Health, becoming part of an umbrella division uniting Google’s other health-focused units like health-tracking platform Google Fit and Verily, a life-sciences research arm.

Inside the review board, many directors felt blindsided, according to people familiar with the matter. Some directors complained they could have played a helpful role in explaining the change of control of the Streams app to the public if given earlier insight.

The review panel still plans to publish a final “lessons learned” report, according to a person familiar with the matter, which will make recommendations about how better to set up such boards in the future.