Why Wordfreq Won't Be Updated: Generative AI's Impact

By Mark Howell | 19 September 2024 | 6 min read

Today in Edworking News we want to talk about why wordfreq will not be updated.

The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. There are several reasons why it will not be updated anymore.

Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable.

Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere. As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more conversational language usage from two sources in particular: Twitter and Reddit. The Twitter data was always built on sand.

Even when Twitter allowed free access to a portion of their "firehose", the terms of use did not allow me to distribute that data outside of the company where I collected it (Luminoso). wordfreq has the frequencies that were built with that data as input, but the collected data didn't belong to me and I don't have it anymore. Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there. Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay. And given what's happening to the field, I don't blame them.

I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money. It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.

wordfreq was built by collecting a whole lot of text in a lot of languages. That used to be a pretty reasonable thing to do, and not the kind of thing someone would be likely to object to. Now, the text-slurping tools are mostly used for training generative AI, and people are quite rightly on the defensive. If someone is collecting all the text from your books, articles, Web site, or public posts, it's very likely because they are creating a plagiarism machine that will claim your words as its own.

So I don't want to work on anything that could be confused with generative AI, or that could benefit generative AI. OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.

Statement by Robyn Speer.

Summary

The wordfreq project, which provides word frequency data based on a snapshot of language from various online sources up through 2021, will no longer be updated. Its author, Robyn Speer, gives three reasons: generative AI has polluted the available text, data that used to be free has become expensive, and the field of natural language processing has changed in ways the author no longer wants to be part of.

Generative AI's Impact

One of the primary reasons for discontinuing updates to wordfreq is the pollution of data by generative AI. The open Web, which was one of wordfreq's data sources, is now inundated with content generated by large language models. This content has no human intent behind it, so including it skews word frequencies and makes the data unreliable as a record of how people actually write. For instance, Philip Shapira reports that ChatGPT uses the word "delve" far more than people ever have, and that this has increased the word's overall frequency by an order of magnitude.
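To make the skew concrete, here is a minimal sketch in Python of how mixing machine-generated text into a corpus can inflate one word's relative frequency. It is not wordfreq's actual pipeline: the two toy corpora are invented for illustration, and the helper names `tokenize` and `relative_frequency` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(word: str, tokens: list[str]) -> float:
    """Return the share of all tokens in the corpus that equal `word`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


# Toy corpora, invented for illustration only (not real measurements).
human_text = "we looked into the data and then looked into the results " * 50
human_text += "let us delve into the details "
synthetic_text = "let us delve into the topic and delve deeper " * 10

# "before" is human text only; "after" mixes in the synthetic text.
before = tokenize(human_text)
after = tokenize(human_text + synthetic_text)

freq_before = relative_frequency("delve", before)
freq_after = relative_frequency("delve", after)
ratio = freq_after / freq_before

print(f"'delve' before: {freq_before:.5f}")
print(f"'delve' after:  {freq_after:.5f}")
print(f"increase:       {ratio:.1f}x")
```

In this toy setup the synthetic text is a small fraction of the combined corpus, yet it is enough to raise the relative frequency of "delve" by more than a factor of ten. A frequency list built from the mixed corpus would no longer describe the human text alone.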

The Cost of Data

Another significant factor is the increasing cost of data that was once freely accessible. Twitter and Reddit were wordfreq's two main sources of conversational language. Twitter's public APIs have since shut down, and the author argues that the site, now called X, would offer no valuable language data even if it made its raw feed available. Reddit has stopped providing public data archives and now sells them at a price that, in the author's words, only OpenAI will pay. This shift makes it impractical for smaller projects like wordfreq to keep using these sources.

Shifting Focus in NLP

The field of Natural Language Processing (NLP) has also undergone significant changes. The rise of generative AI has overshadowed other NLP techniques, drawing most of the attention and funding. The author notes that it is now rare to see NLP research that does not depend on closed data controlled by OpenAI and Google. Collecting large amounts of text, once an uncontroversial practice in corpus linguistics, is now mostly done to train generative AI models, so people are understandably defensive about how their writing is collected and used.

Remember these 3 key ideas for your startup:

  1. Data Integrity is Crucial: Ensure that the data you rely on is free from significant distortions. The rise of generative AI has shown how easily data can be polluted, affecting the reliability of your insights and decisions.

  2. Adapt to Changing Data Accessibility: Be prepared for shifts in how data is accessed and priced. Platforms that once offered free data may start charging for it, impacting your operational costs and data strategies.

  3. Stay Ethical and Transparent: As the landscape of NLP and AI evolves, maintain ethical standards in data collection and usage. Avoid practices that could be perceived as exploitative or invasive, and be transparent with your stakeholders about your data sources and methodologies.


For more details, see the original source.

