AT Moderation Architecture

@bnewbold.net

The AT network is becoming more heterogeneous in practice, with independent PDS hosts, apps, and alternative bsky AppViews establishing themselves. This means that more complex inter-service moderation scenarios are starting to get attention. While the labeling system was released all the way back in April 2023, there are some parts of the atproto moderation architecture and design philosophy that have never really been written up. There are also parts which Bluesky Social hasn't had occasion to fully exercise or include in our moderation policies, which can lead to confusion.

This blog post gives an overview of how all the various moderation components and mechanisms are expected to fit together. The audience for this post is protocol and application developers who are already familiar with the AT network design and terminology.

Design Goals

What were we trying to achieve with the moderation system design? A few top-level principles:

  • Governance: there should not be a single organization or party who has unique control of moderation policy and enforcement across the entire global network.
  • Safety: users should have a safe, harassment-free, well-moderated experience out of the box.
  • Externalities: Bluesky-operated infrastructure should not be a haven for bad actors and abuse on the internet, or in society more broadly.

And some additional goals and requirements:

  • It should be possible for new entrants to participate incrementally in moderation of subsets/slices of the existing network.
  • The protocol should unbundle moderation authority from infrastructure operation, when possible.
  • Folks unhappy with moderation enforcement should have more effective options than lobbying large service providers ("voice") or exiting the entire network.
  • There are some forms of harmful content which are forbidden from all Bluesky infrastructure, even if they are not directly impacting our product or users.
  • Adults in the network should be able to control their experience and view some filtered content (e.g. graphic, erotic, spammy, or unsettling) if they choose to do so.

Network Roles

Different parties play different roles:

Infrastructure Providers (PDS, Relay, AppView) have an unavoidable role to play when it comes to the most harmful content, and must abide by the law in their physical jurisdiction. Apart from these base-level responsibilities, it is a goal to unbundle moderation authority from infrastructure ownership, and to make network infrastructure commodity and un-opinionated.

Moderation Services (Labelers) are composable, flexible in scope, and heterogeneous in governance and operational model. They might provide high-touch dispute resolution for a specific community; focus on a narrow pattern of harm in a global context; or provide relatively broad and global moderation coverage, as Bluesky does for users of our service.

Branded Client Apps are in a privileged position, with economic leverage (eg, facilitating payments, or displaying advertisements) and control over product features. They have a responsibility to implement baseline moderation features, and an expectation that they contribute to baseline moderation services for the content which is created, viewed, or reported by their users.

Why involve client apps explicitly, and how will that work out?

  • One scenario we want to avoid is the emergence of popular and profitable commercial apps that extract value from the ecosystem but do not contribute proportionately to community safety. Or worse, introduce risky or harmful product features that have a negative impact. In other words, build a Torment Nexus app and expect others to clean up the mess.
  • We expect client apps to be the easiest to commercialize, and the most socially accountable in terms of brand and community loyalty. We expect operating moderation services to be the most difficult to sustain independently, in both human and financial terms. The natural partnership is for branded client apps to support and sustain moderation services.
  • While clients have responsibilities, we also think open source and alternative user agents are good. We don't want to restrict access to moderation decisions to specific client software. Volunteer and non-profit client app developers are not expected to contribute directly to moderation work, though they are expected to make responsible design decisions and set sensible default configurations.

Moderation Interventions

Let's look at each specific service type, and what moderation interventions are available to them.

Key considerations for each intervention are whether the action is "observable" (meaning any party can tell what happened), and "enumerable" (whether a paper trail of all such actions is easy to create). Because the AT network is an open distributed system requiring coordination and interoperation between many parties, almost all interventions are observable, even if that transparency is undesirable. The labeling system is enumerable by default, while some infrastructure moderation actions are designed to not be enumerable.

PDS

PDS infrastructure operators have two primary actions at their disposal:

  • "PDS Account Suspension" (or Takedown): the account's public repository and blobs become unavailable, and the account can not make authenticated requests to external services. A public "account" event is broadcast, and downstream services are expected to mirror the suspension behavior. All application types and network services are impacted. Account status can be queried at the PDS at any time. The account's network identity (handle and DID) are not impacted.
  • "PDS Blob Purge" of individual blobs (media files) for an account: makes a specific blob unavailable publicly, and also prevents the account from accessing the blob. This action is intentionally not broadcast publicly, and there is not an API to enumerate all purged blobs.

Note that the PDS does not have content-level (eg, per-record) interventions, such as taking down an individual post.

Media files are considered a greater abuse liability than repo records. Being able to rapidly action individual media files without taking down the overall account is important while investigating abuse and legal requests. If content is found to be abusive, the presumption is that the entire account would be taken down as a follow-up. It is also important to have mechanisms for actioning abusive media files without creating public hash lists of such content (this is why blob purges are not broadcast or publicly enumerable). Note that most media in the network is accessed via CDN, and PDS blob purges do not automatically result in CDN purges.

Abuse notifications can currently be sent to individual PDS instances via the admin email declared in the server description API, or by other existing internet service abuse reporting mechanisms. Additional mechanisms might be added in the future to support opt-in inter-service abuse reporting and configurable auto-actioning.
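
As a small illustration, a reporting tool might look up that declared admin contact roughly as follows. This is a sketch assuming the optional contact field in the com.atproto.server.describeServer response; the hostname is a placeholder.

```typescript
// Sketch: fetch a PDS host's declared admin contact for abuse reports.
// Assumes the (optional) `contact.email` field in the describeServer response.
async function getAbuseContact(pdsHost: string): Promise<string | undefined> {
  const res = await fetch(`https://${pdsHost}/xrpc/com.atproto.server.describeServer`)
  if (!res.ok) throw new Error(`describeServer failed: HTTP ${res.status}`)
  const desc: { contact?: { email?: string } } = await res.json()
  return desc.contact?.email
}

// Usage (placeholder hostname):
// const email = await getAbuseContact('pds.example.com')
```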

PDS operators are generally not expected to proactively moderate application-specific content or interactions. They should implement reasonable rate-limits to prevent bulk spam and resource consumption attacks.

Relay

Relay infrastructure operators have two primary actions:

  • "Relay Account Suspension" (or Takedown): blocks re-distribution of public repository content from the account. Not required to impact the firehose backfill window. Broadcast as a public "account" event, and downstream services are expected to mirror the behavior. Could be mirroring an upstream PDS action, or could be an intervention at the relay itself.
  • "PDS Ban": refusing to connect to a specific PDS host, or potentially a pattern of hostnames or IP addresses. No further events from the PDS are transmitted. PDS host status (including banned servers) might be publicly enumerable, depending on implementation. Not currently broadcast as a public event.

Note that relay instances do not access or distribute blobs (media files).

Relay operators are also not expected to proactively moderate application-specific content or interactions. They primarily intervene to prevent network abuse, such as spamming, resource consumption attacks, and bulk registration of inauthentic accounts.

Ideally all relay-level interventions would be transparent and open to assessment by downstream services.

AppView

AppViews facilitate labeling in a general sense, by subscribing to moderation services, and including labels in API responses based on configuration provided in client requests. The AppView might have a default labeler configuration if none is provided, but in general these mechanisms are driven by client apps, not the AppView operators. There are three forms:

  • "Labeling": Labels are included in API responses for individual pieces of content and for overall accounts. The content itself is still included in the API response, and it is up to the client to redact or annotate appropriately. This requires API schema support, but is relatively application-generic. There are broadly two categories of labels:
    • "Annotation Labels": act as badges or content warnings in client apps. These labels have semantics ("why" the label was applied). Usually configurable by end users.
    • "Action Labels": mandate a specific app behavior, without reasoning (eg, "warn" or "hide"). These start with a ! character.
  • "Label Takedown", aka "Redaction Labeling": the client request requests "redaction" for a specific labeler, and that labeler has a !takedown label on an account or piece of content, then instead of including that label as usual, the AppView will fully redact the content from the API response, leaving a tombstone indicator or clarifying error code.

Label-based redaction is not particularly different from regular labeling. The main advantages are simplicity for application developers: a minimal client app might not (yet) have schema support for label hydration, or might have client-side bugs with redacting content based on label metadata in API responses. The redaction flag shifts some of that work up to the server.

One corner-case is a "Label Takedown" on the account making the API request itself. In this case the AppView should probably refuse to service any requests at all.

AppView infrastructure operators have some additional actions at their disposal:

  • "API Content Takedown": access to specific pieces of content are blocked for all clients, regardless of labeler configuration. Is not broadcast publicly or enumerable, but is usually indicated with a tombstone or specific error code in API responses.
  • "API Account Suspenion" (or Takedown): access to all content published by the account is blocked for all clients, regardless of labeler configuration. Also not broadcast publicly or enumerable, and is usually indicated with a tombstone or specific error code in API responses. All API requests from the account are rejected. Mirrors upstream account status (eg, at the PDS and/or Relay), in addition to any AppView-level action.
  • "Labeler Block": halts ingest of new labels from a service, and ignores the labeler when included in client requests (the labelers actually applied are indicated in reponse headers).

AppView instances usually run media CDNs. In that context, there is one more action:

  • "CDN Blob Purge": distribution of a specific media file, in the context of a specific account, is blocked. Not publicly broadcast or enumerable.

Client App

Moderation interventions don't generally happen in the client app itself, but the client app does have a large role to play in the overall moderation architecture.

They set labeler configuration in API requests, control default configurations, and can specify "mandatory" labelers and label values that end users can not opt-out of. They also implement and route moderation reports to specific moderation services.
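
A rough sketch of that report routing, assuming the com.atproto.moderation.createReport endpoint and the service-proxying request header for addressing a specific labeler; the hostname, DIDs, token handling, and report subject are all placeholders.

```typescript
// Sketch: a client routing a spam report to a chosen moderation service.
// The PDS proxies the call to the labeler identified in the atproto-proxy
// header. Hostname, DIDs, token handling, and subject are placeholders.
async function reportPost(accessJwt: string, postUri: string, postCid: string) {
  const body = {
    reasonType: 'com.atproto.moderation.defs#reasonSpam',
    reason: 'bulk unsolicited promotional replies', // free-text context for moderators
    subject: {
      $type: 'com.atproto.repo.strongRef',
      uri: postUri,
      cid: postCid,
    },
  }
  return fetch('https://pds.example.com/xrpc/com.atproto.moderation.createReport', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${accessJwt}`,
      // address the report to a specific labeler (placeholder DID)
      'atproto-proxy': 'did:plc:labeler111#atproto_labeler',
    },
    body: JSON.stringify(body),
  })
}
```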

They determine which AppView service is used by default, and whether that service is configurable by end users.

Client Apps are expected to function smoothly with any PDS instance and hosting provider.

Client Apps may be open source and collaboratively developed. It is the party which hosts a specific app instance on the web, or publishes in app stores, who is responsible for decisions about service configuration.

Moderation Service

Moderation services are generally expected to fulfill the following functions, though they may be limited in scope to specific application content types, categories of reports, and sub-communities of the overall network:

  • Receive and review moderation reports from client apps. These reports are private.
  • Publish labels, including action labels like !takedown, which can be consumed by AppViews.

Labels are usually public and enumerable, though there is an intentional possibility for gating access to labels. Having labels be public is good for transparency and public assessment. But moderation labor is difficult and expensive, and there may be situations where a moderation service needs to restrict access to service providers who are contributing resources to that effort. There may also be scenarios where community labelers do not want to make all actions easily enumerable. However, the design of the labeling system does not support strong privacy, and the expectation is that most labels will be public.
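
For reference, an individual label is a small, optionally signed data object. The shape below loosely follows com.atproto.label.defs#label; the example values are illustrative.

```typescript
// Rough shape of a published label, loosely following com.atproto.label.defs#label.
interface Label {
  src: string        // DID of the moderation service that emitted the label
  uri: string        // subject: an at:// record URI, or a DID for account-level labels
  cid?: string       // optionally pins the label to a specific record version
  val: string        // the label value, e.g. 'spam', 'graphic-media', or '!takedown'
  neg?: boolean      // true when retracting a previously published label
  cts: string        // creation timestamp (ISO 8601)
  exp?: string       // optional expiration timestamp
  sig?: Uint8Array   // signature by the labeler's signing key
}

// Illustrative example: an account-level spam label
const example: Label = {
  src: 'did:plc:labeler111',
  uri: 'did:plc:someuser',
  val: 'spam',
  cts: '2024-01-01T00:00:00.000Z',
}
```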

Some moderation services might be granted additional powers by infrastructure service operators. In that case the moderation service might also be able to directly effect actions on PDS instances, relays, or AppViews (eg, API Takedowns).

Ladder of Interventions

The number of possible moderation interventions is intentionally large, to allow an escalating ladder with increasingly large impact through the network. Violations can be addressed proportionate to the degree and scope of harm.

As one example, consider images:

  • graphic or disturbing images can have labels applied, which end users can click-through or disable in the app
  • more offensive or intolerant images might have a label takedown applied. This could prevent the image from being shown in some client apps regardless of how they are configured. Users who want to access the content would need to switch apps.
  • more harmful (but not abusive or illegal) images might result in an API Content Takedown, either in isolation or combined with an account-level action.
  • content which is illegal in specific jurisdictions might be taken down at the infrastructure level in those regions (eg, PDS Blob Purge, CDN Blob Purge). For example, a local trademark violation.
  • extreme abuse content would have the file blocked on all infrastructure (PDS Blob Purge, CDN Blob Purge), combined with other content-level and account-level interventions

As another example, spam behaviors:

  • small quantities of unwanted promotional content might be labeled at the post or account level. This might be done using automation; if the automation was incorrect, it would be easy for end users to click-through or reconfigure the labels
  • larger quantities of spam could result in account-level actions: PDS Account Takedown, or API Account Takedown. These actions can be appealed using regular moderation flows
  • very high rates of spam become a network abuse problem, consuming bandwidth and impacting regular service. A PDS Account Takedown, or even a Relay Account Takedown, becomes more important.
  • if an entire PDS instance is dedicated to persistent network abuse, then a PDS Ban at relays may be appropriate

The ladder of interventions is mirrored by the effort an account or community would need to invest to circumvent different moderation interventions. This is low in some situations, and higher in others. This includes both infrastructural effort (building alternative infrastructure) and social effort (convincing communities to adopt that alternative infrastructure). Some categories of content are so harmful that they should be excluded from the entire internet, to the greatest degree possible.

More to Discuss

There is a lot more to consider and explore about this architecture. How does this all go wrong? How might these un-bundled roles end up being re-bundled? What are the incentives and economics that support each role? Do moderation services specialize by application type, user community, geo-region, or category of harm? Or do they generalize and cover all of those aspects? What are the opportunities for collaboration, and when are independence and differentiation important? What are the actual interventions that the Bluesky Moderation Service uses today?

Hope to get into those questions more soon, but this is already a long post.
