
The ultimate punishment for A.I. model companies: algorithmic disgorgement

STEPHEN PASTIS

2023-09-02

The Federal Trade Commission rarely uses this power, and typically directs it at companies that have misused data.


Image credit: JAAP ARRIENS, NURPHOTO/Getty Images


Chinese translation: Liu Jinlong

Chinese edition review: Wang Hao


It all started with an email James Zou received.

The email was making a request that seemed reasonable, but which Zou realized would be nearly impossible to fulfill.

“Dear Researcher,” the email began. “As you are aware, participants are free to withdraw from the UK Biobank at any time and request that their data no longer be used. Since our last review, some participants involved with Application [REDACTED] have requested that their data should no longer be used.”

The email was from the UK Biobank, a large-scale database of health and genetic data drawn from 500,000 British residents that is widely available to the public and private sectors.

Zou, a professor at Stanford University and prominent biomedical data scientist, had already fed the Biobank’s data to an algorithm and used it to train an A.I. model. Now, the email was requesting the data’s removal. “Here’s where it gets hairy,” Zou explained in a 2019 seminar he gave on the matter.

That’s because, as it turns out, it’s nearly impossible to remove a user’s data from a trained A.I. model without resetting the model and forfeiting the extensive money and effort put into training it. To use a human analogy, once an A.I. has “seen” something, there is no easy way to tell the model to “forget” what it saw. And deleting the model entirely is also surprisingly difficult.

This represents one of the thorniest unresolved challenges of our incipient artificial intelligence era, alongside issues like A.I. “hallucinations” and the difficulties of explaining certain A.I. outputs. According to many experts, the A.I. unlearning problem is on a collision course with inadequate regulations around privacy and misinformation. As A.I. models get larger and hoover up ever more data, without solutions to delete data from a model, and potentially to delete the model itself, the people affected won’t just be those who have participated in a health study. It will be a salient problem for everyone.

Why A.I. models are as difficult to kill as a zombie

In the years since Zou’s initial predicament, the excitement over generative A.I. tools like ChatGPT has caused a boom in the creation and proliferation of A.I. models. What’s more, those models are getting bigger, meaning they ingest more data during their training.

Many of these models are being put to work in industries like medical care and finance where it’s especially important to be careful about data privacy and data usage.

But as Zou discovered when he set out to find a solution to removing data, there’s no simple way to do it. That’s because an A.I. model isn’t just lines of code. It’s a learned set of statistical relations between points in a particular dataset, encompassing subtle relationships that are often far too complex for human understanding. Once the model learns these relationships, there’s no simple way to get the model to ignore some portion of what it has learned.

“If a machine learning-based system has been trained on data, the only way to retroactively remove a portion of that data is by re-training the algorithms from scratch,” Anasse Bari, an A.I. expert and computer science professor at New York University, told Fortune.
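What retraining from scratch means can be sketched with a toy example. The snippet below is only an illustration under simple assumptions: the "model" is a straight line fitted by ordinary least squares, and the names `train` and `forget_by_retraining` and the sample records are hypothetical. The trained model is just two numbers, so nothing in it records which data point contributed what; the reliable way to forget one record is to drop it and fit again on everything that remains.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (feature, label)


def train(data: List[Point]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forget_by_retraining(data: List[Point], record: Point) -> Tuple[float, float]:
    """Drop one record and retrain from scratch on what remains."""
    remaining = [p for p in data if p != record]
    return train(remaining)


# Hypothetical records; the last one belongs to a participant who withdraws.
records = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 10.0)]
model = train(records)
retrained = forget_by_retraining(records, (3.0, 10.0))
```

Here the retrained model matches one that never saw the withdrawn record, but getting there required a full pass over the remaining data. For a four-point line that is trivial; for a model that costs millions of dollars to train, it is the expense Bari describes.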

The problem goes beyond private data. If an A.I. model is discovered to have gleaned biased or toxic data, say from racist social media posts, weeding out the bad data will be tricky.

Training or retraining an A.I. model is expensive. This is particularly true for the ultra-large “foundation models” that are currently powering the boom in generative A.I. Sam Altman, the CEO of OpenAI, has reportedly said that GPT-4, the large language model that powers its premium version of ChatGPT, cost in excess of $100 million to train.

That’s why companies developing A.I. models fear a powerful tool that the U.S. Federal Trade Commission can use to punish companies it finds have violated U.S. trade laws. The tool is called “algorithmic disgorgement.” It’s a legal process that penalizes the law-breaking company by forcing it to delete an offending A.I. model in its entirety. The FTC has used that power only a handful of times, typically against companies that have misused data. One well-known case involved a company called Everalbum, which trained a facial recognition system using people’s biometric data without their permission.

But Bari says that algorithmic disgorgement assumes those creating A.I. systems can even identify which part of a dataset was illegally collected, which is sometimes not the case. Data easily traverses various internet locations, and is increasingly “scraped” from its original source without permission, making it challenging to determine its original ownership.

Another problem with algorithmic disgorgement is that, in practice, A.I. models can be as difficult to kill as zombies.

“Trying to delete an AI model might seem exceedingly simple, namely just press a delete button and the matter is entirely concluded, but that’s not how things work in the real world,” Lance Elliot, an A.I. expert, told Fortune in an email.

A.I. models can be easily reinstated after deletion because other digital copies of the model are likely to exist, Elliot writes.

Zou says that, the way things stand, either the technology needs to change substantially so that companies can comply with the law, or lawmakers need to rethink the regulations and how they can make companies comply.

Building smaller models is good for privacy

In their research, Zou and his collaborators did come up with some ways to delete data from simple machine learning models based on a technique known as clustering, without compromising the entire model. But those same methods won’t work for more complex models, such as most of the deep learning systems that underpin today’s generative A.I. boom. For these models, a different kind of training regime may have to be used in the first place, Zou and his co-authors suggested in a 2019 research paper, to make it possible to delete certain statistical pathways in the model without compromising the whole model’s performance or requiring the entire model to be retrained.
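Why clustering models are easier to edit than deep networks can be illustrated with a short sketch. This is not the algorithm from Zou’s paper; it is a simplified illustration, and the `ClusterStats` class and the sample points are hypothetical. It assumes each cluster is summarized by a running sum and a count, and that removing a point does not change which cluster the other points belong to, which a real clustering method would still have to check.

```python
from typing import List, Sequence


class ClusterStats:
    """Running sum and count for one cluster (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.count = 0
        self.total: List[float] = [0.0] * dim

    def add(self, point: Sequence[float]) -> None:
        self.count += 1
        self.total = [t + p for t, p in zip(self.total, point)]

    def remove(self, point: Sequence[float]) -> None:
        """Subtract a point's contribution without re-reading other points."""
        if self.count == 0:
            raise ValueError("cluster is empty")
        self.count -= 1
        self.total = [t - p for t, p in zip(self.total, point)]

    def centroid(self) -> List[float]:
        if self.count == 0:
            raise ValueError("cluster is empty")
        return [t / self.count for t in self.total]


cluster = ClusterStats(dim=2)
for point in [(1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]:
    cluster.add(point)
before = cluster.centroid()
cluster.remove((5.0, 5.0))  # one participant withdraws
after = cluster.centroid()
```

After the removal, the centroid is the same as if the withdrawn point had never been added. A deep learning model has no comparable per-record summary to subtract from, which is why the same trick does not carry over.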

For companies worried about the requirement that they be able to delete users’ data upon request, which is a part of several European data privacy laws, other methods may be needed. In fact, there’s at least one A.I. company that has built its entire business around this idea.

Xayn is a German company that makes private, personalized A.I. search and recommendation technology. Xayn’s technology works by using a base model and then training a separate small model for each user. That makes it very easy to delete any of these individual users’ models upon request.
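The general idea of a shared base model plus a small model per user can be sketched as follows. This is not Xayn’s actual architecture, which the article does not describe in detail; the `PersonalizedRanker` class, its method names, and the topic scores are all hypothetical. The sketch assumes the base was built without any individual user’s data, so that everything learned from a user lives only in that user’s own small model.

```python
from typing import Dict


class PersonalizedRanker:
    """Shared base scores plus one small, deletable model per user."""

    def __init__(self, base_scores: Dict[str, float]) -> None:
        self._base = dict(base_scores)  # assumed to contain no user data
        self._user_models: Dict[str, Dict[str, float]] = {}

    def learn(self, user_id: str, topic: str, signal: float) -> None:
        """Record a user's preference in that user's model only."""
        model = self._user_models.setdefault(user_id, {})
        model[topic] = model.get(topic, 0.0) + signal

    def score(self, user_id: str, topic: str) -> float:
        personal = self._user_models.get(user_id, {}).get(topic, 0.0)
        return self._base.get(topic, 0.0) + personal

    def forget_user(self, user_id: str) -> None:
        """Delete the user's model; the base and other users are untouched."""
        self._user_models.pop(user_id, None)


ranker = PersonalizedRanker({"health": 0.5, "finance": 0.25})
ranker.learn("alice", "health", 1.0)
ranker.learn("bob", "finance", 2.0)
ranker.forget_user("alice")  # deletion request from one user
```

Because a user’s data never flows into the shared base, honoring a deletion request amounts to discarding one small object, and no retraining is needed.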

“This problem of your data floating into the big model never happens with us,” Leif-Nissen Lundbæk, the CEO and co-founder of Xayn, said.

Lundbæk said he thinks Xayn’s small, individual A.I. models represent a more viable way to build A.I. that complies with data privacy requirements than the massive large language models being built by companies such as OpenAI, Google, Anthropic, Inflection, and others. Those models suck up vast amounts of data from the internet, including personal information, so much that the companies themselves often have poor insight into exactly what data is contained in the training set. And these massive models are extremely expensive to train and maintain, Lundbæk said.

Privacy businesses and artificial intelligence businesses are currently developing in parallel, he said.

Another A.I. company trying to bridge the gap between privacy and A.I. is SpotLab, which builds models for clinical research. Its founder and CEO Miguel Luengo-Oroz previously worked at the United Nations as a researcher and chief data scientist. In 20 years of studying A.I., he says he has often thought about this missing piece: an A.I. system’s ability to unlearn.

He says that one reason little progress has been made on the issue is that, until recently, there was no data privacy regulation forcing companies and researchers to expend serious effort to address it. That has changed recently in Europe, but in the U.S., rules that would require companies to make it easy to delete people’s data are still absent.

Some people are hoping the courts will step in where lawmakers have so far failed. One recent lawsuit alleges OpenAI stole “millions of Americans'” data to train ChatGPT’s model.

And there are signs that some big tech companies may be starting to think harder about the problem. In June, Google announced a competition for researchers to come up with solutions to A.I.’s inability to forget.

But until more progress is made, user data will continue to float around in an expanding constellation of A.I. models, leaving it vulnerable to dubious, or even threatening, actions.

“I think it’s dangerous and if someone got access to this data, let’s say, some kind of intelligence agencies or even other countries, I mean, I think it can really be used in a bad way,” Lundbæk said.
