Annotations of Google Tweet about AI Jobs

Translation: We fired and harassed out our top AI researchers from historically marginalized groups who did the work for us, and we are now looking for more people from historically marginalized groups to burn out, exploit, and expend.
Quote Tweet
@JeffDean
I encourage students from historically marginalized groups who are interested in learning to conduct research in AI/ML, CS or related areas to consider applying for our CSRMP mentorship program! We have 100s of researchers @GoogleAI who are excited to work with you. twitter.com/GoogleAI/statu…
@hondanhon
the birdhouse annotations on that tweet must be something to behold if birdhouse were actually a thing
Jean-Michel Plourde
@j_mplourde
aka: we are looking for more obedient marginalized folks for the PR but yeah don’t cross the line or else…

Google Dominates Thanks to an Unrivaled View of the Web

As regulators seek ways to curb the company’s power, there is more focus on the vast index — hundreds of billions of web pages — behind its search engine.

In 2000, just two years after it was founded, Google reached a milestone that would lay the foundation for its dominance over the next 20 years: It became the world’s largest search engine, with an index of more than one billion web pages.

The rest of the internet never caught up, and Google’s index just kept on getting bigger. Today, it’s somewhere between 500 billion and 600 billion web pages, according to estimates.

Now, as regulators around the world examine ways to curb Google’s power, including a search monopoly case expected from state attorneys general as early as this week and the antitrust lawsuit the Justice Department filed in October, they are wrestling with a company whose sheer size has allowed it to squash competitors. And those competitors are pointing investigators toward that enormous index, the gravitational center of the company.

“If people are on a search engine with a smaller index, they’re not always going to get the results they want. And then they go to Google and stay at Google,” said Matt Wells, who started Gigablast, a search engine with an index of around five billion web pages, about 20 years ago. “A little guy like me can’t compete.”

Understanding how Google’s search works is a key to figuring out why so many companies find it nearly impossible to compete and, in fact, go out of their way to cater to its needs.

Every search request provides Google with more data to make its search algorithm smarter. Google has performed so many more searches than any other search engine that it has established a huge advantage over rivals in understanding what consumers are looking for. That lead only continues to widen, since Google has a market share of about 90 percent.

Google directs billions of users to locations across the internet, and websites, hungry for that traffic, create a different set of rules for the company. Websites often provide greater and more frequent access to Google’s so-called web crawlers — computers that automatically scour the internet and scan web pages — allowing the company to offer a more extensive and up-to-date index of what is available on the internet.

When he was working at the music site Bandcamp, Zack Maril, a software engineer, became concerned about how Google’s dominance had made it so essential to websites.

In 2018, when Google said its crawler, Googlebot, was having trouble with one of Bandcamp’s pages, Mr. Maril made fixing the problem a priority because Google was critical to the site’s traffic. When other crawlers encountered problems, Bandcamp would usually block them.

Mr. Maril continued to research the different ways that websites opened doors for Google and closed them for others. Last year, he sent a 20-page report, “Understanding Google,” to a House antitrust subcommittee and then met with investigators to explain why other companies could not recreate Google’s index.

“It’s largely an unchecked source of power for its monopoly,” said Mr. Maril, 29, who works at another technology company that does not compete directly with Google. He asked that The New York Times not identify his employer since he was not speaking for it.

A report this year by the House subcommittee cited Mr. Maril’s research on Google’s efforts to create a real-time map of the internet and how this had “locked in its dominance.” While the Justice Department is looking to unwind Google’s business deals that put its search engine front and center on billions of smartphones and computers, Mr. Maril is urging the government to intervene and regulate Google’s index. A Google spokeswoman declined to comment.

Websites and search engines are symbiotic. Websites rely on search engines for traffic, while search engines need access to crawl the sites to provide relevant results for users. But each crawler puts a strain on a website’s resources in server and bandwidth costs, and some aggressive crawlers resemble security risks that can take down a site.

Since having their pages crawled costs money, websites have an incentive to let it be done only by search engines that direct enough traffic to them. In the current world of search, that leaves Google and — in some cases — Microsoft’s Bing.

Google and Microsoft are the only search engines that spend hundreds of millions of dollars annually to maintain a real-time map of the English-language internet. That’s in addition to the billions they’ve spent over the years to build out their indexes, according to a report this summer from Britain’s Competition and Markets Authority.

Google has a significant leg up on Microsoft in more than market share. British competition authorities said Google’s index included about 500 billion to 600 billion web pages, compared with 100 billion to 200 billion for Microsoft.

Other large tech companies deploy crawlers for other purposes. Facebook has a crawler for links that appear on its site or services. Amazon says its crawler helps improve its voice-based assistant, Alexa. Apple has its own crawler, Applebot, which has fueled speculation that it might be looking to build its own search engine.

But indexing has always been a challenge for companies without deep pockets.

The privacy-minded search engine DuckDuckGo decided to stop crawling the entire web more than a decade ago and now syndicates results from Microsoft. It still crawls sites like Wikipedia to provide results for answer boxes that appear in its results, but maintaining its own index does not usually make financial sense for the company.

“It costs more money than we can afford,” said Gabriel Weinberg, chief executive of DuckDuckGo. In a written statement for the House antitrust subcommittee last year, the company said that “an aspiring search engine start-up today (and in the foreseeable future) cannot avoid the need” to turn to Microsoft or Google for its search results.

When FindX started to develop an alternative to Google in 2015, the Danish company set out to create its own index and offered a build-your-own algorithm to provide individualized results.

FindX quickly ran into problems. Large website operators, such as Yelp and LinkedIn, did not allow the fledgling search engine to crawl their sites. Because of a bug in its code, FindX’s computers that crawled the internet were flagged as a security risk and blocked by a group of the internet’s largest infrastructure providers. The pages it did manage to collect were frequently spam or malicious.

“If you have to do the indexing, that’s the hardest thing to do,” said Brian Schildt Laursen, one of the founders of FindX, which shut down in 2018.

Mr. Schildt Laursen launched a new search engine last year, Givero, which offered users the option to donate a portion of the company’s revenue to charitable causes. When he started Givero, he syndicated search results from Microsoft.

Most large websites are judicious about who can crawl their pages. In general, Google and Microsoft get more access because they have more users, while smaller search engines have to ask for permission.

“You need the traffic to convince the websites to allow you to copy and crawl, but you also need the content to grow your index and pull up your traffic,” said Marc Al-Hames, a co-chief executive of Cliqz, a German search engine that closed this year after seven years of operation. “It’s a chicken-and-egg problem.”

In Europe, a group called the Open Search Foundation has proposed a plan to create a common internet index that can underpin many European search engines. It’s essential to have a diversity of options for search results, said Stefan Voigt, the group’s chairman and founder, because it is not good for only a handful of companies to determine what links people are shown and not shown.

“We just can’t leave this to one or two companies,” Mr. Voigt said.

When Mr. Maril started researching how sites treated Google’s crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.

ScienceDirect, a site for peer-reviewed papers, permits only Google’s crawler to have access to links containing PDF documents. Only Google’s computers get access to listings on PBS Kids. On Alibaba.com, the U.S. site of the Chinese e-commerce giant Alibaba, only Google’s crawler is given access to pages that list products.
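
Those robots.txt rules are machine-readable, so the asymmetry is easy to check yourself. Here is a minimal Python sketch using the standard library’s robotparser; the domain, path, and crawler names are placeholders for illustration, not the actual rules of the sites above.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's published crawler rules

# compare what the same path looks like to different crawlers
for agent in ("Googlebot", "bingbot", "SomeSmallCrawler"):
    allowed = rp.can_fetch(agent, "https://example.com/articles/paper.pdf")
    print(agent, "allowed" if allowed else "blocked")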

This year, Mr. Maril started an organization, the Knuckleheads’ Club (“because only a knucklehead would take on Google”), and a website to raise awareness about Google’s web-crawling monopoly.

“Google has all this power in society,” Mr. Maril said. “But I think there should be democratic — small d — control of that power.”

Google pushes “text fragment links” with new Chrome extension

New feature can deep-link to specific text on a Web page, with highlighting.

Google has been cooking up an extension to the URL standard called “Text Fragments.” The new link style will allow you to link not just to a page but to specific text on a page, which will get scrolled to and highlighted automatically once the page loads. It’s like an anchor link, but with highlighting and creatable by anyone.

The feature has actually been supported in Chrome since version 80, which hit the stable channel in February. Now a new extension from Google makes it easy to create this new link type, which will work for anyone else using Chrome on desktop OSes and Android. Google has proposed the idea to the W3C and hopes other browsers will adopt it, but even if they don’t, the links are backward-compatible.

The syntax for this URL is pretty strange-looking. After the URL, the magic is in the string “#:~:text=” and then whatever text you want to match. So a full link would look like this:

https://en.wikipedia.org/wiki/Cat#:~:text=Most breeds of cat have a noted fondness for sitting in high places

If you copy and paste this into Chrome, the browser will open Wikipedia’s cat page, scroll to the first text that matches “Most breeds of cat have a noted fondness for sitting in high places,” and will highlight it. If the text doesn’t match anything, the page will still load. Backward compatibility works because browsers already treat the number sign (#) as the start of a URI fragment, which usually gets used for anchor links made by the page creator. If you paste a text fragment link into a browser that doesn’t support it, the page will still load, and everything after the number sign will just be ignored as a bad anchor link. So far, so good.
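
To see why older browsers degrade gracefully, here is a rough Python sketch of the split a text-fragment-aware client performs; this illustrates the URL mechanics only, not Chrome’s actual implementation.

from urllib.parse import urldefrag, unquote

url = "https://en.wikipedia.org/wiki/Cat#:~:text=Most%20breeds%20of%20cat"
base, fragment = urldefrag(url)  # an older browser effectively stops here

# a text-fragment-aware browser also looks for the ":~:text=" directive
if ":~:text=" in fragment:
    target = unquote(fragment.split(":~:text=", 1)[1])
    print(base)    # https://en.wikipedia.org/wiki/Cat
    print(target)  # Most breeds of cat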

One problem is that text fragments mean you can have spaces in a URL. On a webpage or forum, you can hand-code the link with an href tag (or whatever the non-HTML equivalent is) and everything will work. For instant messengers and social media, though, which don’t allow code and use automatic URL parsers, things get a bit more complicated. Every URL parser treats a space as the end of a URL, so you’ll need to use percent-encoding to replace all the spaces with the equivalent “%20.” URL parsers now have a shot at linkifying this correctly, but it looks like a mess:

https://en.wikipedia.org/wiki/Cat#:~:text=Most%20breeds%20of%20cat%20have%20a%20noted%20fondness%20for%20sitting%20in%20high%20places.

Spaces aren’t the only characters that can cause problems. The standard RFC 3986 defines several “reserved” characters as having a special meaning in a URL, so they can’t appear as literal data: ! * ' ( ) ; : @ & = + $ , / ? # [ ]. Web-page-authoring tools tend to handle these characters automatically, but now that you’re embedding arbitrary sentences in a URL for highlighting, there’s a higher chance you’ll run into one of them. They all need to be percent-encoded in order for the URL to work, and Google’s extension takes care of that for you.
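
The encoding itself is mechanical, so it’s easy to reproduce what the extension does. A minimal Python sketch, assuming you want spaces and all reserved characters escaped:

from urllib.parse import quote

page = "https://en.wikipedia.org/wiki/Cat"
text = "Most breeds of cat have a noted fondness for sitting in high places."

# safe="" percent-encodes everything except RFC 3986's unreserved
# characters, which covers spaces and the full reserved set above
link = page + "#:~:text=" + quote(text, safe="")
print(link)

The output matches the percent-encoded Wikipedia link shown above.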

Google’s new Chrome extension, called “Link to Text Fragment” (it’s also on GitHub), will put a new entry in Chrome’s right-click menu. You just highlight text on a page, right-click it, and hit “Copy link to selected text.” Like magic, a text fragment link will end up on your clipboard. All the text encoding is done automatically, so the link should work with most websites and messengers.

Google seems poised to push out support for text fragments across its Web ecosystem, even without the W3C. The links have already started to show up in some Google search results, allowing Chrome users to zip right to the relevant text. It’s probably only a matter of time before link creation moves from an extension to a normal Chrome feature.

The Great Google Revolt

Some of its employees tried to stop their company from doing work they saw as unethical. It blew up in their faces.