American inventor James Spangler sold his idea for an electric broomstick-like cleaner (with a cloth filter and dust-collection bag attached to the long handle) to William Hoover in 1908. His invention became the first truly practical domestic vacuum cleaner. And just as the US brand “Kleenex” became a generic name for soft facial tissues, “to hoover” has become a generic verb meaning to suck things up.
OpenAI and other LLM providers have taken the hoovering of data to an extreme that many never dreamed possible, with recent versions reportedly trained on trillions of “tokens” — the fragments of words and numbers that have been scraped from the web (with and without authorization) and used to train the AI.
Not long after ChatGPT’s popularity reached high orbit and became seemingly unstoppable, companies began realizing that the massive data-sucking monsters created by OpenAI and others were happy to hoover up data offered from within corporate firewalls, and to train on that as well, making the resulting knowledge readily available to all comers. Amazon appeared to be the first major company to ban employees from using ChatGPT, in January, and other tech companies including Samsung and Apple have followed suit, as have banks and most others who care about their confidential information. Famously, Italy also temporarily banned ChatGPT, although it’s unclear how a national ban can be enforced without copying the Great Firewall of China.
I’ve been working with about 20 of our portfolio companies across the Global South since February to help raise awareness of the opportunities with GAI, and also of the near guarantee of data leakage when using public LLM services. Many had no idea of the risk, and many more who had embraced the idea of rapid prototyping didn’t realize that the data-hoovering problem could be avoided.
There are two fundamentally different approaches to addressing the data leakage problem. The first is to use the right cloud LLM provider in the right way, and to trust that provider; the second is to have your data used only with a local LLM (inside the firewall) for training and/or fine-tuning, and then to host the tuned model either securely in the cloud or within your corporate firewall. I will discuss the second approach in part two of this blog.
Selecting a Trusted GAI Provider: Azure with GPT-4 or OpenAI with GPT-4?
Microsoft was the first to offer hosted versions of OpenAI’s GPT-4 on Azure with different privacy models, from a “don’t train on my data” model for its publicly hosted instance to a privately hosted model that does train on your data but is accessible only to you and your customers. Microsoft describes its privacy model here and summarizes it with “…No prompts or completions are stored in the model during these operations, and prompts and completions are not used to train, retrain or improve the models.”
Not to be left out of the data privacy race, in late April OpenAI announced “…an upcoming ChatGPT Business subscription in addition to its $20 / month ChatGPT Plus plan. The Business variant targets ‘professionals who need more control over their data as well as enterprises seeking to manage their end users.’ The new plan will follow the same data-usage policies as its API, meaning it won’t use your data for training by default. The plan will become available ‘in the coming months.’”
Of course, by the time I’ve finished writing this and getting it in front of you, some of these things will have changed; new providers with alternative secure hosting and fine-tuning models will have appeared. Do look around before making a decision, and don’t make decisions you can’t easily undo when something better comes along a month or two later.
Privacy and IP / data security are not being discussed enough and not being acted on as broadly as they should be. In an otherwise-helpful 20+ page paper released by McKinsey in May, “What Every CEO Needs to Know About Generative AI,” there are only two paragraphs on IP and privacy issues, both framed as possibilities or risks rather than certainties. In my view, every CEO needs to take immediate steps to ensure: 1/ that your employees are not inadvertently leaking proprietary data into the public GAI systems, and 2/ that you have a strategy for how to prototype and quickly deploy data-secure GAI, either in trusted cloud implementations like Azure or with a privately trained and hosted small model, described in part two of this blog.
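As a concrete illustration of point 1/, one common first step is an egress rule that flags or blocks employee traffic to public LLM endpoints while letting a privately hosted instance through. The sketch below is a hypothetical Python example, not a vetted implementation: the hostname blocklist and the helper name are my own illustrative assumptions, and a real deployment would live in the corporate proxy or firewall configuration rather than in application code.

```python
from urllib.parse import urlparse

# Illustrative (and deliberately incomplete) blocklist of public GAI
# endpoints; a real egress policy would be maintained by the security team.
PUBLIC_LLM_HOSTS = {
    "api.openai.com",
    "chat.openai.com",
}

def is_public_llm_request(url: str) -> bool:
    """Return True if the URL points at a known public LLM endpoint."""
    host = urlparse(url).hostname or ""
    # Match the host exactly, or as a subdomain of a blocked host.
    return host in PUBLIC_LLM_HOSTS or any(
        host.endswith("." + blocked) for blocked in PUBLIC_LLM_HOSTS
    )
```

With a rule like this in place, requests to the public ChatGPT and OpenAI APIs can be logged or rejected, while traffic to a company’s own privately hosted endpoint (for example, a dedicated Azure resource) passes through untouched.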