Telegram Group & Telegram Channel
Refusal in Language Models Is Mediated by a Single Direction
Arditi et al, 2024
Статья, блог, код

Захватывающий препринт про то, что происходит внутри моделей, которые учат отказываться следовать вредоносным инструкциям. Оказывается (почему-то задним умом это кажется геометрически очевидным – ведь мы по сути учим бинарный классификатор), что генерация отказа в пространстве активаций представлена единым направлением, и если его в процессе генерации из активаций вычесть, то можно получить безотказную модель – и это работает для 13 разных открытых моделей из пяти семейств размером до 72 миллиардов параметров.



group-telegram.com/llmsecurity/169
Create:
Last Update:

Refusal in Language Models Is Mediated by a Single Direction
Arditi et al, 2024
Статья, блог, код

Захватывающий препринт про то, что происходит внутри моделей, которые учат отказываться следовать вредоносным инструкциям. Оказывается (почему-то задним умом это кажется геометрически очевидным – ведь мы по сути учим бинарный классификатор), что генерация отказа в пространстве активаций представлена единым направлением, и если его в процессе генерации из активаций вычесть, то можно получить безотказную модель – и это работает для 13 разных открытых моделей из пяти семейств размером до 72 миллиардов параметров.

BY llm security и каланы




Share with your friend now:
group-telegram.com/llmsecurity/169

View MORE
Open in Telegram


Telegram | DID YOU KNOW?

Date: |

Emerson Brooking, a disinformation expert at the Atlantic Council's Digital Forensic Research Lab, said: "Back in the Wild West period of content moderation, like 2014 or 2015, maybe they could have gotten away with it, but it stands in marked contrast with how other companies run themselves today." Perpetrators of these scams will create a public group on Telegram to promote these investment packages that are usually accompanied by fake testimonies and sometimes advertised as being Shariah-compliant. Interested investors will be asked to directly message the representatives to begin investing in the various investment packages offered. Right now the digital security needs of Russians and Ukrainians are very different, and they lead to very different caveats about how to mitigate the risks associated with using Telegram. For Ukrainians in Ukraine, whose physical safety is at risk because they are in a war zone, digital security is probably not their highest priority. They may value access to news and communication with their loved ones over making sure that all of their communications are encrypted in such a manner that they are indecipherable to Telegram, its employees, or governments with court orders. If you initiate a Secret Chat, however, then these communications are end-to-end encrypted and are tied to the device you are using. That means it’s less convenient to access them across multiple platforms, but you are at far less risk of snooping. Back in the day, Secret Chats received some praise from the EFF, but the fact that its standard system isn’t as secure earned it some criticism. If you’re looking for something that is considered more reliable by privacy advocates, then Signal is the EFF’s preferred platform, although that too is not without some caveats. These administrators had built substantial positions in these scrips prior to the circulation of recommendations and offloaded their positions subsequent to rise in price of these scrips, making significant profits at the expense of unsuspecting investors, Sebi noted.
from sa


Telegram llm security и каланы
FROM American