A.I. Goes Bad Too Easily. What Can Be Done?
Chinese translation (Fortune China) by Feng Feng; reviewed by Xia Lin.
WHEN TAY MADE HER DEBUT in March 2016, Microsoft had high hopes for the artificial intelligence–powered “social chatbot.” Like the automated, text-based chat programs that many people had already encountered on e-commerce sites and in customer service conversations, Tay could answer written questions; by doing so on Twitter and other social media, she could engage with the masses. But rather than simply doling out facts, Tay was engineered to converse in a more sophisticated way—one that had an emotional dimension. She would be able to show a sense of humor, to banter with people like a friend. Her creators had even engineered her to talk like a wisecracking teenage girl. When Twitter users asked Tay who her parents were, she might respond, “Oh a team of scientists in a Microsoft lab. They’re what u would call my parents.” If someone asked her how her day had been, she could quip, “omg totes exhausted.” Best of all, Tay was supposed to get better at speaking and responding as more people engaged with her. As her promotional material said, “The more you chat with Tay the smarter she gets, so the experience can be more personalized for you.” In low-stakes form, Tay was supposed to exhibit one of the most important features of true A.I.—the ability to get smarter, more effective, and more helpful over time. But nobody predicted the attack of the trolls. Realizing that Tay would learn and mimic speech from the people she engaged with, malicious pranksters across the web deluged her Twitter feed with racist, homophobic, and otherwise offensive comments. Within hours, Tay began spitting out her own vile lines on Twitter, in full public view. “Ricky gervais learned totalitarianism from adolf hitler, the inventor of atheism,” Tay said, in one tweet that convincingly imitated the defamatory, fake-news spirit of Twitter at its worst. Quiz her about then-president Obama, and she’d compare him to a monkey. Ask her about the Holocaust, and she’d deny it occurred. 
In less than a day, Tay’s rhetoric went from family-friendly to foulmouthed; fewer than 24 hours after her debut, Microsoft took her offline and apologized for the public debacle. What was just as striking was that the wrong turn caught Microsoft’s research arm off guard. “When the system went out there, we didn’t plan for how it was going to perform in the open world,” Microsoft’s managing director of research and artificial intelligence, Eric Horvitz, told Fortune in a recent interview. After Tay’s meltdown, Horvitz immediately asked his senior team working on “natural language processing”—the function central to Tay’s conversations—to figure out what went wrong. The staff quickly determined that basic best practices related to chatbots were overlooked. In programs that were more rudimentary than Tay, there were usually protocols that blacklisted offensive words, but there were no safeguards to limit the type of data Tay would absorb and build on. Today, Horvitz contends, he can “love the example” of Tay—a humbling moment that Microsoft could learn from. Microsoft now deploys far more sophisticated social chatbots around the world, including Ruuh in India, and Rinna in Japan and Indonesia. In the U.S., Tay has been succeeded by a social-bot sister, Zo. Some are now voice-based, the way Apple’s Siri or Amazon’s Alexa are. In China, a chatbot called Xiaoice is already “hosting” TV shows and sending chatty shopping tips to convenience store customers. Still, the company is treading carefully. It rolls the bots out slowly, Horvitz explains, and closely monitors how they are behaving with the public as they scale. But it’s sobering to realize that, even though A.I. tech has improved exponentially in the intervening two years, the work of policing the bots’ behavior never ends. The company’s staff constantly monitors the dialogue for any changes in its behavior. And those changes keep coming. 
In its early months, for example, Zo had to be tweaked and tweaked again after separate incidents in which it referred to Microsoft’s flagship Windows software as “spyware” and called the Koran, Islam’s foundational text, “very violent.” To be sure, Tay and Zo are not our future robot overlords. They’re relatively primitive programs occupying the parlor-trick end of the research spectrum, cartoon shadows of what A.I. can accomplish. But their flaws highlight both the power and the potential pitfalls of software imbued with even a sliver of artificial intelligence. And they exemplify more insidious dangers that are keeping technologists awake at night, even as the business world prepares to entrust ever more of its future to this revolutionary new technology. “You get your best practices in place, and hopefully those things will get more and more rare,” Horvitz says. With A.I. rising to the top of every company’s tech wish list, figuring out those practices has never been more urgent. FEW DISPUTE that we’re on the verge of a corporate A.I. gold rush. By 2021, research firm IDC predicts, organizations will spend $52.2 billion annually on A.I.-related products—and economists and analysts believe they’ll realize many billions more in savings and gains from that investment. Some of that bounty will come from the reduction in human headcount, but far more will come from enormous efficiencies in matching product to customer, drug to patient, solution to problem. Consultancy PwC estimates that A.I. could contribute up to $15.7 trillion to the global economy in 2030, more than the combined output of China and India today. The A.I. renaissance has been driven in part by advances in “deep-learning” technology. With deep learning, companies feed their computer networks enormous amounts of information so that they recognize patterns more quickly, and with less coaching (and eventually, perhaps, no coaching) from humans. 
Facebook, Google, Microsoft, Amazon, and IBM are among the giants already using deep-learning tech in their products. Apple’s Siri and Google Assistant, for example, recognize and respond to your voice because of deep learning. Amazon uses deep learning to help it visually screen tons of produce that it delivers via its grocery service. And in the near future, companies of every size hope to use deep-learning-powered software to mine their data and find gems buried too deep for meager human eyes to spot. They envision A.I.-driven systems that can scan thousands of radiology images to more quickly detect illnesses, or screen multitudes of résumés to save time for beleaguered human resources staff. In a technologist’s utopia, businesses could use A.I. to sift through years of data to better predict their next big sale, a pharmaceutical giant could cut down the time it takes to discover a blockbuster drug, or auto insurers could scan terabytes of car accidents and automate claims. But for all their enormous potential, A.I.-powered systems have a dark side. Their decisions are only as good as the data that humans feed them. As their builders are learning, the data used to train deep-learning systems isn’t neutral. It can easily reflect the biases—conscious and unconscious—of the people who assemble it. And sometimes data can be slanted by history, encoding trends and patterns that reflect centuries-old discrimination. A sophisticated algorithm can scan a historical database and conclude that white men are the most likely to succeed as CEOs; it can’t be programmed (yet) to recognize that, until very recently, people who weren’t white men seldom got the chance to be CEOs. Blindness to bias is a fundamental flaw in this technology, and while executives and engineers speak about it only in the most careful and diplomatic terms, there’s no doubt it’s high on their agenda. 
The most powerful algorithms being used today “haven’t been optimized for any definition of fairness,” says Deirdre Mulligan, an associate professor at the University of California at Berkeley who studies ethics in technology. “They have been optimized to do a task.” A.I. converts data into decisions with unprecedented speed—but what scientists and ethicists are learning, Mulligan says, is that in many cases “the data isn’t fair.” Adding to the conundrum is that deep learning is much more complex than the conventional algorithms that are its predecessors—making it trickier for even the most sophisticated programmers to understand exactly how an A.I. system makes any given choice. Like Tay, A.I. products can morph to behave in ways that its creators don’t intend and can’t anticipate. And because the creators and users of these systems religiously guard the privacy of their data and algorithms, citing competitive concerns about proprietary technology, it’s hard for external watchdogs to determine what problems could be embedded in any given system. The fact that tech that includes these black-box mysteries is being productized and pitched to companies and governments has more than a few researchers and activists deeply concerned. “These systems are not just off-the-shelf software that you can buy and say, ‘Oh, now I can do accounting at home,’ ” says Kate Crawford, principal researcher at Microsoft and codirector of the AI Now Institute at New York University. “These are very advanced systems that are going to be influencing our core social institutions.” THOUGH THEY MAY not think of it as such, most people are familiar with at least one A.I. breakdown: the spread of fake news on Facebook’s ubiquitous News Feed in the run-up to the 2016 U.S. presidential election. The social media giant and its data scientists didn’t create flat-out false stories. 
But the algorithms powering the News Feed weren’t designed to filter “false” from “true”; they were intended to promote content personalized to a user’s individual taste. While the company doesn’t disclose much about its algorithms (again, they’re proprietary), it has acknowledged that the calculus involves identifying stories that other users of similar tastes are reading and sharing. The result: Thanks to an endless series of what were essentially popularity contests, millions of people’s personal News Feeds were populated with fake news primarily because their peers liked it. While Facebook offers an example of how individual choices can interact toxically with A.I., researchers worry more about how deep learning could read, and misread, collective data. Timnit Gebru, a postdoctoral researcher who has studied the ethics of algorithms at Microsoft and elsewhere, says she’s concerned about how deep learning might affect the insurance market—a place where the interaction of A.I. and data could put minority groups at a disadvantage. Imagine, for example, a data set about auto accident claims. The data shows that accidents are more likely to take place in inner cities, where densely packed populations create more opportunities for fender benders. Inner cities also tend to have disproportionately high numbers of minorities among their residents. A deep-learning program, sifting through data in which these correlations were embedded, could “learn” that there was a relationship between belonging to a minority and having car accidents, and could build that lesson into its assumptions about all drivers of color. In essence, that insurance A.I. would develop a racial bias. And that bias could get stronger if, for example, the system were to be further “trained” by reviewing photos and video from accidents in inner-city neighborhoods. In theory, the A.I. would become more likely to conclude that a minority driver is at fault in a crash involving multiple drivers. 
And it’s more likely to recommend charging a minority driver higher premiums, regardless of her record. It should be noted that insurers say they do not discriminate or assign rates based on race. But the inner-city hypothetical shows how data that seems neutral (facts about where car accidents happen) can be absorbed and interpreted by an A.I. system in ways that create new disadvantages (algorithms that charge higher prices to minorities, regardless of where they live, based on their race). What’s more, Gebru notes, given the layers upon layers of data that go into a deep-learning system’s decision-making, A.I.-enabled software could make decisions like this without engineers realizing how or why. “These are things we haven’t even thought about, because we are just starting to uncover biases in the most rudimentary algorithms,” she says. What distinguishes modern A.I.-powered software from earlier generations is that today’s systems “have the ability to make legally significant decisions on their own,” says Matt Scherer, a labor and employment lawyer at Littler Mendelson who specializes in A.I. The idea of not having a human in the loop to make the call about key outcomes alarmed Scherer when he started studying the field. If flawed data leads a deep-learning-powered X-ray to miss an overweight man’s tumor, is anyone responsible? “Is anyone looking at the legal implications of these things?” Scherer asks himself. AS BIG TECH PREPARES to embed deep-learning technology in commercial software for customers, questions like this are moving from the academic “what if?” realm to the front burner. In 2016, the year of the Tay misadventure, Microsoft created an internal group called Aether, which stands for AI and Ethics in Engineering and Research, chaired by Eric Horvitz. It’s a cross-disciplinary group, drawing representatives from engineering, research, policy, and legal teams, and machine-learning bias is one of its top areas of discussion. 
“Does Microsoft have a viewpoint on whether, for example, face-recognition software should be applied in sensitive areas like criminal justice and policing?” Horvitz muses, describing some of the topics the group is discussing. “Is the A.I. technology good enough to be used in this area, or will the failure rates be high enough where there has to be a sensitive, deep consideration for the costs of the failures?” Joaquin Quiñonero Candela leads Facebook’s Applied Machine Learning group, which is responsible for creating the company’s A.I. technologies. Among many other functions, Facebook uses A.I. to weed spam out of people’s News Feeds. It also uses the technology to help serve stories and posts tailored to their interests—putting Candela’s team adjacent to the fake-news crisis. Candela calls A.I. “an accelerator of history,” in that the technology is “allowing us to build amazing tools that augment our ability to make decisions.” But as he acknowledges, “It is in decision-making that a lot of ethical questions come into play.” Facebook’s struggles with its News Feed show how difficult it can be to address ethical questions once an A.I. system is already powering a product. Microsoft was able to tweak a relatively simple system like Tay by adding profanities or racial epithets to a blacklist of terms that its algorithm should ignore. But such an approach wouldn’t work when trying to separate “false” from “true”—there are too many judgment calls involved. Facebook’s efforts to bring in human moderators to vet news stories—by, say, excluding articles from sources that frequently published verifiable falsehoods—exposed the company to charges of censorship. Today, one of Facebook’s proposed remedies is to simply show less news in the News Feed and instead highlight baby pictures and graduation photos—a winning-by-retreating approach. 
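For readers who want to see what a term blacklist amounts to in practice, here is a minimal sketch in Python. It is not Microsoft's implementation, which the article does not describe; the function names and the placeholder terms are hypothetical and chosen only for illustration. It drops any message containing a blocked term before a bot could learn from it, and it also shows why the same trick cannot separate "false" from "true."

```python
import re

# Hypothetical placeholder terms; a real list would hold actual profanities and slurs.
BLOCKED_TERMS = {"slur1", "slur2"}


def contains_blocked_term(text, blocked=BLOCKED_TERMS):
    """Return True if any word in the text is on the blocklist (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return any(word in blocked for word in words)


def filter_training_messages(messages, blocked=BLOCKED_TERMS):
    """Keep only the messages that contain no blocked term."""
    return [m for m in messages if not contains_blocked_term(m, blocked)]


incoming = [
    "hello there",
    "you are a SLUR1!",
    "the moon is made of cheese",  # false, but contains no blocked word
]
kept = filter_training_messages(incoming)
print(kept)
```

The offensive message is dropped, but the false statement about the moon passes straight through: a word list can only match vocabulary, and has no way to judge whether a claim is true.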
Therein lies the heart of the challenge: The dilemma for tech companies isn’t so much a matter of tweaking an algorithm or hiring humans to babysit it; rather, it’s about human nature itself. The real issue isn’t technical or even managerial—it’s philosophical. Deirdre Mulligan, the Berkeley ethics professor, notes that it’s difficult for computer scientists to codify fairness into software, given that fairness can mean different things to different people. Mulligan also points out that society’s conception of fairness can change over time. And when it comes to one widely shared ideal of fairness—namely, that everybody in a society ought to be represented in that society’s decisions—historical data is particularly likely to be flawed and incomplete. One of the Microsoft Aether group’s thought experiments illustrates the conundrum. It involves A.I. tech that sifts through a big corpus of job applicants to pick out the perfect candidate for a top executive position. Programmers could instruct the A.I. software to scan the characteristics of a company’s best performers. Depending on the company’s history, it might well turn out that all of the best performers—and certainly all the highest ranking executives—were white males. This might overlook the possibility that the company had a history of promoting only white men (for generations, most companies did), or has a culture in which minorities or women feel unwelcome and leave before they rise. Anyone who knows anything about corporate history would recognize these flaws—but most algorithms wouldn’t. If A.I. were to automate job recommendations, Horvitz says, there’s always a chance that it can “amplify biases in society that we may not be proud of.” FEI-FEI LI, the chief scientist for A.I. for Google’s cloud-computing unit, says that bias in technology “is as old as human civilization”—and can be found in a lowly pair of scissors. 
“For centuries, scissors were designed by right-handed people, used by mostly right-handed people,” she explains. “It took someone to recognize that bias and recognize the need to create scissors for left-handed people.” Only about 10% of the world’s people are left-handed—and it’s human nature for members of the dominant majority to be oblivious to the experiences of other groups. That same dynamic, it turns out, is present in some of A.I.’s other most notable recent blunders. Consider the A.I.-powered beauty contest that Russian scientists conducted in 2016. Thousands of people worldwide submitted selfies for a contest in which computers would judge their beauty based on factors like the symmetry of their faces. But of the 44 winners the machines chose, only one had dark skin. An international ruckus ensued, and the contest’s operators later attributed the apparent bigotry of the computers to the fact that the data sets they used to train them did not contain many photos of people of color. The computers essentially ignored photos of people with dark skin and deemed those with lighter skin more “beautiful” because they represented the majority. This bias-through-omission turns out to be particularly pervasive in deep-learning systems in which image recognition is a major part of the training process. Joy Buolamwini, a researcher at the MIT Media Lab, recently collaborated with Gebru, the Microsoft researcher, on a paper studying gender-recognition technologies from Microsoft, IBM, and China’s Megvii. They found that the tech consistently made more accurate identifications of subjects with photos of lighter-skinned men than with those of darker-skinned women. Such algorithmic gaps may seem trivial in an online beauty contest, but Gebru points out that such technology can be used in much more high-stakes situations. “Imagine a self-driving car that doesn’t recognize when it ‘sees’ black people,” Gebru says. 
“That could have dire consequences.” The Gebru-Buolamwini paper is making waves. Both Microsoft and IBM have said they have taken actions to improve their image-recognition technologies in response to the audit. While those two companies declined to be specific about the steps they were taking, other companies that are tackling the problem offer a glimpse of what tech can do to mitigate bias. When Amazon started deploying algorithms to weed out rotten fruit, it needed to work around a sampling-bias problem. Visual-recognition algorithms are typically trained to figure out what, say, strawberries are “supposed” to look like by studying a huge database of images. But pictures of rotten berries, as you might expect, are relatively rare compared with glamour shots of the good stuff. And unlike humans, whose brains tend to notice and react strongly to “outliers,” machine-learning algorithms tend to discount or ignore them. To adjust, explains Ralf Herbrich, Amazon’s director of artificial intelligence, the online retail giant is testing a computer science technique called oversampling. Machine-learning engineers can direct how the algorithm learns by assigning heavier statistical “weights” to underrepresented data, in this case the pictures of the rotting fruit. The result is that the algorithm ends up being trained to pay more attention to spoiled food than that food’s prevalence in the data library might suggest. Herbrich points out that oversampling can be applied to algorithms that study humans too (though he declined to cite specific examples of how Amazon does so). “Age, gender, race, nationality—they are all dimensions that you specifically have to test the sampling biases for in order to inform the algorithm over time,” Herbrich says. 
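The rebalancing idea Herbrich describes can be illustrated with a short Python sketch. This is not Amazon's code, which the article does not show; the file names, labels, and function names below are hypothetical. It shows two common variants: duplicating examples of the rare class until the classes are the same size (oversampling in the strict sense), and computing inverse-frequency weights so a training procedure could count each rare example more heavily.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=0):
    """Randomly duplicate rare-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies (with replacement) to fill the gap for this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Give each class a weight proportional to 1 / (its share of the data)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Hypothetical data set: 95 photos of fresh berries, 5 of rotten ones.
images = [f"img_{i:03d}.jpg" for i in range(100)]
labels = ["fresh"] * 95 + ["rotten"] * 5

balanced_images, balanced_labels = oversample(images, labels)
weights = inverse_frequency_weights(labels)
print(Counter(balanced_labels))
print(weights)
```

In this made-up example the five rotten-berry photos end up either repeated until they match the 95 fresh ones, or weighted so that each counts roughly 19 times as much as a fresh photo. Neither variant adds new information; both only stop the rare class from being drowned out.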
To make sure that an algorithm used to recognize faces in photos didn’t discriminate against or ignore people of color, or older people, or overweight people, you could add weight to photos of such individuals to make up for the shortage in your data set. Other engineers are focusing further “upstream”—making sure that the underlying data used to train algorithms is inclusive and free of bias, before it’s even deployed. In image recognition, for example, the millions of images used to train deep-learning systems need to be examined and labeled before they are fed to computers. Radha Basu, the CEO of data-training startup iMerit, whose clients include Getty Images and eBay, explains that the company’s staff of over 1,400 worldwide is trained to label photos on behalf of its customers in ways that can mitigate bias. Basu declined to discuss how that might play out when labeling people, but she offered other analogies. iMerit staff in India may consider a curry dish to be “mild,” while the company’s staff in New Orleans may describe the same meal as “spicy.” iMerit would make sure both terms appear in the label for a photo of that dish, because to label it as only one or the other would be to build an inaccuracy into the data. Assembling a data set about weddings, iMerit would include traditional Western white-dress-and-layer-cake images—but also shots from elaborate, more colorful weddings in India or Africa. iMerit’s staff stands out in a different way, Basu notes: It includes people with Ph.D.s, but also less-educated people who struggled with poverty, and 53% of the staff are women. The mix ensures that as many viewpoints as possible are involved in the data labeling process. “Good ethics does not just involve privacy and security,” Basu says. “It’s about bias, it’s about, Are we missing a viewpoint?” Tracking down that viewpoint is becoming part of more tech companies’ strategic agendas. Google, for example, announced in June that it would open an A.I. 
research center later this year in Accra, Ghana. “A.I. has great potential to positively impact the world, and more so if the world is well represented in the development of new A.I. technologies,” two Google engineers wrote in a blog post. A.I. insiders also believe they can fight bias by making their workforces in the U.S. more diverse—always a hurdle for Big Tech. Fei-Fei Li, the Google executive, recently cofounded the nonprofit AI4ALL to promote A.I. technologies and education among girls and women and in minority communities. The group’s activities include a summer program in which campers visit top university A.I. departments to develop relationships with mentors and role models. The bottom line, says AI4ALL executive director Tess Posner: “You are going to mitigate risks of bias if you have more diversity.” YEARS BEFORE this more diverse generation of A.I. researchers reaches the job market, however, big tech companies will have further imbued their products with deep-learning capabilities. And even as top researchers increasingly recognize the technology’s flaws—and acknowledge that they can’t predict how those flaws will play out—they argue that the potential benefits, social and financial, justify moving forward. “I think there’s a natural optimism about what technology can do,” says Candela, the Facebook executive. Almost any digital tech can be abused, he says, but adds, “I wouldn’t want to go back to the technology state we had in the 1950s and say, ‘No, let’s not deploy these things because they can be used wrong.’ ” Horvitz, the Microsoft research chief, says he’s confident that groups like his Aether team will help companies solve potential bias problems before they cause trouble in public. “I don’t think anybody’s rushing to ship things that aren’t ready to be used,” he says. If anything, he adds, he’s more concerned about “the ethical implications of not doing something.” He invokes the possibility that A.I. 
could reduce preventable medical error in hospitals. “You’re telling me you’d be worried that my system [showed] a little bit of bias once in a while?” Horvitz asks. “What are the ethics of not doing X when you could’ve solved a problem with X and saved many, many lives?” The watchdogs’ response boils down to: Show us your work. More transparency and openness about the data that goes into A.I.’s black-box systems will help researchers spot bias faster and solve problems more quickly. When an opaque algorithm could determine whether a person can get insurance, or whether that person goes to prison, says Buolamwini, the MIT researcher, “it’s really important that we are testing these systems rigorously, that there are some levels of transparency.” Indeed, it’s a sign of progress that few people still buy the idea that A.I. will be infallible. In the web’s early days, notes Tim Hwang, a former Google public policy executive for A.I. who now directs the Harvard-MIT Ethics and Governance of Artificial Intelligence initiative, technology companies could say they are “just a platform that represents the data.” Today, “society is no longer willing to accept that.” This article originally appeared in the July 1, 2018 issue of Fortune.