IBL News | New York

Chatbots mimic human speech because the AI that powers them has ingested a huge amount of text, mostly scraped from the Internet. The Washington Post analyzed the websites used to train AI, although companies like OpenAI have not disclosed which datasets they use. The newspaper worked with researchers at the Allen Institute for AI and categorized the websites using data from the analytics firm Similarweb. The analysis began with Google’s C4 data set, which includes 15 million websites from journalism, entertainment, software development, medicine, and content creation, among other industries. The three biggest sites were patents.google.com (which contains text from patents issued around the world), wikipedia.org, and scribd.com (a subscription-only digital library).
Source: The Guardian June 24, 2023 07:29 UTC