---
status: ongoing
creation-date: "2023-07-14"
authors: [ "@reprazent" ]
coach: [ "@andrewn", "@stanhu" ]
approvers: [ "@m_gill", "@mksionek", "@marin" ]
owning-stage: "~devops::modelops"
participating-stages: []
---

<!-- vale gitlab.FutureTense = NO -->

# AI-gateway

## Summary

The AI-gateway is a standalone service that will give access to AI
features to all users of GitLab, no matter which instance they are
using: self-managed, dedicated or GitLab.com.

Initially, all AI-gateway deployments will be managed by GitLab (the
organization), and GitLab.com and all GitLab self-managed instances
will use the same gateway. However, in the future we could also deploy
regional gateways, or even customer-specific gateways if the need
arises.

The AI-gateway is an API gateway that takes traffic from clients, in
this case GitLab installations, and directs it to different services,
in this case AI providers and their models. This North/South traffic
pattern allows us to control which requests go where, and to translate
the content of the redirected requests where needed.

![architecture diagram](img/architecture.png)

[Source of the architecture diagram](https://docs.google.com/drawings/d/1PYl5Q5oWHnQAuxM-Jcw0C3eYoGw8a9w8atFpoLhhEas/edit)

By using a hosted service under the control of GitLab, we can ensure
that we provide all GitLab instances with AI features in a scalable
way. It is easier to scale this small stateless service than to scale
GitLab-rails with its dependencies (database, Redis).

It also allows users of self-managed installations to access AI
features without having to host their own models or connect to
third-party providers.

## Language: Python

The AI-gateway started out as the "model-gateway" that handled
requests from IDEs to provide code suggestions. It was written in
Python.

Python is an object-oriented language that is familiar enough for
Rubyists to pick up in a young codebase like the AI-gateway. It also
makes it easy for data and ML engineers who already have Python
experience to contribute.

## API

### Basic stable API for the AI-gateway

Because the API of the AI-gateway will be consumed by a wide variety
of GitLab instances, it is important that we design a stable, yet
flexible API.

To do this, we can implement an API endpoint per use case we
build. This means that the interface between GitLab and the AI-gateway
is one that we build and own, which ensures future scalability,
composability and security.

The API is not versioned, but is backward compatible. See [cross version compatibility](#cross-version-compatibility)
for details. The AI-gateway will support the last 2 major
versions. For example, when working on GitLab 17.2, we would support
both GitLab 17 and GitLab 16.

We can add common functionality like rate limiting, circuit breakers
and secret redaction at this level of the stack, as well as in
GitLab-rails.
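As an illustration, gateway-level rate limiting could be added as
FastAPI middleware. The sketch below is only that: the header name,
the limits and the in-memory store are all assumptions, and a
production version would use a shared store such as Redis.

```python
# Hypothetical sketch: per-instance rate limiting as FastAPI middleware.
import time
from collections import defaultdict

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100  # assumed limit, for illustration only

_request_log: dict[str, list[float]] = defaultdict(list)

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    # "X-Gitlab-Instance" is a hypothetical header identifying the caller.
    instance = request.headers.get("X-Gitlab-Instance", "unknown")
    now = time.monotonic()
    # Keep only the timestamps that are still inside the sliding window.
    _request_log[instance] = [t for t in _request_log[instance]
                              if now - t < WINDOW_SECONDS]
    if len(_request_log[instance]) >= MAX_REQUESTS_PER_WINDOW:
        return JSONResponse(status_code=429,
                            content={"error": "rate limit exceeded"})
    _request_log[instance].append(now)
    return await call_next(request)
```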
#### Protocol

We're choosing to use a simple JSON API for the AI-gateway
service. This allows us to re-use a lot of what is already in place in
the current model-gateway, and it allows us to make the endpoints
version agnostic. We could have an API that expects only a rudimentary
envelope that can contain dynamic information. We should make sure
that these APIs are compatible with multiple versions of GitLab, and
with other clients that use the gateway through GitLab. **This means
that all client versions talk to the same API endpoint. The AI-gateway
needs to support this, but we don't need to support different
endpoints per version.**

We also considered gRPC as the protocol for communication between
GitLab instances and the AI-gateway. The two options differ on these
items:

| gRPC | REST + JSON |
|------|-------------|
| + Strict protocol definition that is easier to evolve versionless | - No strict schema, so the implementation needs to take good care of supporting multiple versions |
| + A new Ruby-gRPC server for VS Code: likely faster because we can limit the dependencies to load ([modular monolith](https://gitlab.com/gitlab-org/gitlab/-/issues/365293)) | - Existing Grape API for VS Code: meaning slow boot time and unneeded resources loaded |
| + Bi-directional streaming | - No straightforward way to stream requests and responses (could still be added) |
| - A new Python-gRPC server: we don't have experience running gRPC-Python servers | + Existing Python FastAPI server, already running for code suggestions, to extend |
| - Hard to pass on unknown messages from VS Code through GitLab to the AI-gateway | + Easier support for newer VS Code + newer AI-gateway through an old GitLab instance |
| - Unknown support for gRPC in other clients (VS Code, JetBrains, other editors) | + Support in all external clients |
| - Possible protocol mismatch (VS Code --REST--> Rails --gRPC--> AI-gateway) | + Same protocol across the stack |

**Discussion:** Choosing REST+JSON in this iteration to port features
that already partially exist does not mean we need to exclude gRPC or
WebSockets for new features. For example, chat features might be
better served by streaming requests and responses. Since we are
suggesting an endpoint per use case, different features could also opt
for different protocols, as long as we keep cross-version
compatibility in mind.

#### Single purpose endpoints

For features using AI, we prefer building a single-purpose endpoint
with a stable API over the [provider API we expose](#exposing-ai-providers)
as a direct proxy.

Some features will have specific endpoints, while others can share
endpoints. For example, code suggestions or chat could each have their
own endpoint, while several features that summarize issues or merge
requests could use the same endpoint and distinguish between use cases
by the information provided in the payload.

The end goal is an API that exposes AI capabilities so that features
can be built without having to touch the AI-gateway. This is analogous
to how we built Gitaly: adding features to Gitaly where needed, and
reusing existing endpoints when possible. We paid some cost up-front
when a new endpoint (RPC) needed to be implemented, but this pays off
in the long run once most of the required functionality is in place.
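As a sketch of this layout, the gateway could expose a dedicated route
for code suggestions next to a route shared by several summarization
features. The route paths and the request model below are illustrative
assumptions, not decided API:

```python
# Illustrative sketch only: single-purpose endpoints in the existing
# FastAPI app. Route paths and the request model are assumptions.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/internal")

class SummarizeRequest(BaseModel):
    resource_type: str  # for example "issue" or "merge_request"
    prompt_components: list[dict]

@router.post("/code-suggestions/completions")
async def code_suggestions(payload: dict):
    """Feature-specific endpoint with its own stable contract."""
    ...

@router.post("/summarize")
async def summarize(payload: SummarizeRequest):
    """One endpoint shared by several summarization features; the
    payload says which resource is being summarized."""
    ...
```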
**This does not mean that prompts need to be built inside the
AI-gateway.** But if prompts are part of the payload to a
single-purpose endpoint, the payload needs to specify which model they
were built for, along with other metadata about the prompts. By doing
this, we can gracefully degrade or otherwise try to support the
request if one of the prompt payloads is no longer supported by the
AI-gateway. It allows us to potentially avoid breaking features in
older GitLab installations as the AI landscape changes.

#### Cross version compatibility

When building single purpose endpoints, we should be mindful that
these endpoints will be used by different GitLab instances, and
indirectly by external clients. To achieve this, we have a very simple
envelope to provide information. It has to have a series of
`prompt_components` that contain information the AI-gateway can use to
build prompts and query the model it selects.

Each prompt component contains 3 elements:

- `type`: The kind of information that is being presented in
  `payload`. The AI-gateway should ignore any types it does not know
  about.
- `payload`: The actual information the AI-gateway can use to build
  the payload that is going to go out to AI providers. The payload
  will be different depending on the type and the version of the
  client providing it. This means that the AI-gateway needs to
  consider all fields optional.
- `metadata`: Information about the client that built this part of the
  prompt. GitLab may or may not use this for telemetry. Nothing inside
  this field should be required.

The only element that is likely to change often is `payload`. There we
need to make sure that all fields are optional, and avoid renaming,
removing or repurposing fields.

When such a change is needed anyway, we have to build support for the
old version of a field in the gateway, and keep it around for at least
2 major versions of GitLab. For example, we could consider adding 2
versions of a prompt to the `prompt_components` payload:

```json
{
  "prompt_components": [
    {
      "type": "prompt",
      "metadata": {
        "source": "GitLab EE",
        "version": "16.3"
      },
      "payload": {
        "content": "You can fetch information about a resource called an issue...",
        "params": {
          "temperature": 0.2,
          "maxOutputTokens": 1024
        },
        "model": "text-bison",
        "provider": "vertex-ai"
      }
    },
    {
      "type": "prompt",
      "metadata": {
        "source": "GitLab EE",
        "version": "16.3"
      },
      "payload": {
        "content": "System: You can fetch information about a resource called an issue...\n\nHuman:",
        "params": {
          "temperature": 0.2
        },
        "model": "claude-2",
        "provider": "anthropic"
      }
    }
  ]
}
```

This allows the API to direct the prompt to either provider, based on
what is in the payload.

To document and validate the content of `payload`, we can specify its
format using [JSON-schema](https://json-schema.org/).
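As an illustration, gateway-side validation could combine "ignore
unknown types" with a JSON-schema check on the payload. The schema
below is a minimal sketch; real schemas would live with the AI-gateway
and evolve alongside it:

```python
# Sketch: validating a single prompt component. The schema is
# illustrative; all fields are optional on purpose, because older and
# newer clients may send different subsets.
from jsonschema import ValidationError, validate

PROMPT_PAYLOAD_SCHEMA = {
    "type": "object",
    "properties": {
        "content": {"type": "string"},
        "model": {"type": "string"},
        "provider": {"type": "string"},
        "params": {"type": "object"},
    },
    "additionalProperties": True,
}

def accept_component(component: dict) -> bool:
    """Return True if this component can be used to build a provider request."""
    if component.get("type") != "prompt":
        return False  # unknown types are ignored, not rejected
    try:
        validate(instance=component.get("payload", {}), schema=PROMPT_PAYLOAD_SCHEMA)
    except ValidationError:
        return False
    return True
```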
#### Example feature: code suggestions

For example, a rough code suggestions service could look like this:

```plaintext
POST /internal/code-suggestions/completions
```

```json
{
  "prompt_components": [
    {
      "type": "prompt",
      "metadata": {
        "source": "GitLab EE",
        "version": "16.3"
      },
      "payload": {
        "content": "...",
        "params": {
          "temperature": 0.2,
          "maxOutputTokens": 1024
        },
        "model": "code-gecko",
        "provider": "vertex-ai"
      }
    },
    {
      "type": "editor_content",
      "metadata": {
        "source": "vscode",
        "version": "1.1.1"
      },
      "payload": {
        "filename": "application.rb",
        "before_cursor": "require 'active_record/railtie'",
        "after_cursor": "\nrequire 'action_controller/railtie'",
        "open_files": [
          {
            "filename": "app/controllers/application_controller.rb",
            "content": "class ApplicationController < ActionController::Base..."
          }
        ]
      }
    }
  ]
}
```

A response could look like this:

```json
{
  "response": "require 'something/else'",
  "metadata": {
    "identifier": "deadbeef",
    "model": "code-gecko",
    "timestamp": 1688118443
  }
}
```

The `metadata` field contains information that could be used in a
telemetry endpoint on the AI-gateway, where we could count suggestion
acceptance rates, among other things.

The way we will receive telemetry for code suggestions is being
discussed in [#415745](https://gitlab.com/gitlab-org/gitlab/-/issues/415745),
where we will try to come up with an architecture for all AI-related
features.

#### Exposing AI providers

A lot of AI functionality has already been built into GitLab-rails
that builds prompts and submits them directly to different AI
providers. At the time of writing, GitLab has API clients for the
following providers:

- [Anthropic](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/anthropic/client.rb#L6)
- [Vertex](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/vertex_ai/client.rb#L6)
- [OpenAI](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/open_ai/client.rb#L8)

To make these features available to self-managed instances, we should
provide endpoints for each of these that GitLab.com, self-managed or
dedicated installations can use to give their users access to these
features.

In a first iteration we could build endpoints that proxy the request
to the AI provider. This should make it easier to later migrate to
routing these requests through the AI-gateway. As an example, the
endpoint for Anthropic could look like this:

```plaintext
POST /internal/proxy/anthropic/(*endpoint)
```

The `*endpoint` means that the client specifies what is going to be
called, for example `/v1/complete`. The request body is forwarded to
the AI provider in its entirety. The AI-gateway makes sure the request
is correctly authenticated.

Having the proxy in between GitLab and the AI provider means that we
still have control over what goes through to the AI provider, and if
the need arises, we can manipulate or reroute the request to a
different provider. This means that we could keep supporting the
features of older GitLab installations even if a provider's API
changes, or if we decide not to work with a certain provider anymore.
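A minimal sketch of such a pass-through proxy, assuming httpx as the
HTTP client; the upstream URL, the header handling and the credential
helper are assumptions:

```python
# Sketch of the pass-through proxy. Only the authentication is replaced
# with the gateway's own provider credentials; the body is forwarded
# untouched.
import httpx
from fastapi import APIRouter, Request, Response

router = APIRouter(prefix="/internal/proxy")

ANTHROPIC_BASE_URL = "https://api.anthropic.com"

@router.post("/anthropic/{endpoint:path}")
async def proxy_anthropic(endpoint: str, request: Request) -> Response:
    body = await request.body()
    async with httpx.AsyncClient() as client:
        upstream = await client.post(
            f"{ANTHROPIC_BASE_URL}/{endpoint}",
            content=body,
            headers={
                "content-type": request.headers.get("content-type", "application/json"),
                "x-api-key": load_anthropic_api_key(),  # hypothetical helper
            },
        )
    return Response(content=upstream.content, status_code=upstream.status_code)
```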
There is value in moving features that use AI providers directly to a
feature-specific, purpose-built API. Doing so means that we can
improve these features as AI providers evolve, by changing the
AI-gateway that is under our control. Customers using self-managed or
dedicated installations could then start getting better AI-supported
features without having to upgrade their GitLab instance.

Features that are currently
[experimental](../../../policy/experiment-beta-support.md#experiment)
can use these generic APIs, but we should aim to convert them to a
single-purpose API endpoint before we make the feature [generally available](../../../policy/experiment-beta-support.md#generally-available-ga)
for self-managed installations. This makes it easier for us to support
features long-term, even if the landscape of AI providers changes.

The [Experimental REST API](../../../development/ai_features.md#experimental-rest-api)
available to GitLab team members should also use this proxy in the
short term. In the longer term, we should provide developers access to
a separate proxy that allows them to use GitLab-owned authentication
to several AI providers for experimentation. This will separate the
traffic of developers trying out new things from the fleet that is
serving paying customers.

### API in GitLab instances

This is the API that external clients can consume on their local
GitLab instance, for example VS Code talking to a self-managed
instance.

These versions could also differ widely: the VS Code extension may be
kept up-to-date by developers, while the GitLab instance they use for
work is kept a minor version behind. So the same requirements in terms
of stability and flexibility apply for the clients as for the
AI-gateway.

In a first iteration we could consider keeping the current REST
payloads that the VS Code extension and the Web IDE send, but direct
them to the appropriate GitLab installation. GitLab-rails can wrap the
payload in an envelope for the AI-gateway without having to interpret
it.

When we do this, the GitLab instance that receives the request from
the extension doesn't need to understand it in order to enrich it and
pass it on to the AI-gateway. GitLab can add information to the
`prompt_components` and pass everything that was already there
straight through to the AI-gateway.

If a request is initiated from another client (for example VS Code),
GitLab-rails needs to forward the entire payload, in addition to any
other enhancements and prompts. This is required so we can potentially
support changes from a newer version of the client traveling through
an outdated GitLab installation to a recent AI-gateway.

**Discussion:** This first iteration also uses a REST+JSON approach,
because this is how the VS Code extension currently communicates with
the model gateway. This means that it's a smaller iteration to go from
that to wrapping the existing payload in an envelope, with the added
advantage of cross-version compatibility. But it does not mean that
future iterations need to use REST+JSON as well: as each feature would
have its own endpoint, the protocol could also be different.
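Sketched in Python for brevity (the real implementation would live in
GitLab-rails), the pass-through enrichment could look like this; the
appended payload contents are illustrative:

```python
# Sketch of the pass-through behaviour. GitLab appends its own prompt
# components without interpreting anything the client sent, so fields
# from a newer client survive the trip through an older GitLab instance.
def build_gateway_request(client_payload: dict, gitlab_version: str) -> dict:
    components = list(client_payload.get("prompt_components", []))
    components.append({
        "type": "prompt",
        "metadata": {"source": "GitLab EE", "version": gitlab_version},
        # Hypothetical payload built by GitLab for this feature.
        "payload": {"content": "...", "model": "text-bison", "provider": "vertex-ai"},
    })
    return {"prompt_components": components}
```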
## Authentication & Authorization

GitLab will provide the first layer of authorization: it authenticates
the user and checks whether the license allows use of the feature the
user is trying to use. This can be done using the authentication and
license checks that are already built into GitLab.

Authenticating the GitLab instance on the AI-gateway will be discussed
in [#177](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/177).
Because the AI-gateway exposes proxied endpoints to AI providers, it
is important that the authentication tokens have limited validity.

## Embeddings

Embeddings can be requested for all features through a single
endpoint, for example with a request like this:

```plaintext
POST /internal/embeddings
```

```json
{
  "content": "The lazy fox and the jumping dog",
  "content_type": "issue_title",
  "metadata": {
    "source": "GitLab EE",
    "version": "16.3"
  }
}
```

The `content_type` and `content` properties could in the future be
used to create embeddings from different models, based on what is
appropriate.

The response will include the embedding vector, as well as the model
and provider that were used. For example:

```json
{
  "response": [0.2, -1, ...],
  "metadata": {
    "identifier": "8badf00d",
    "model": "text-embedding-ada-002",
    "provider": "open_ai"
  }
}
```

When storing the embedding, we should make sure we include the model
and provider data. When embeddings are used to generate a prompt, we
could include that metadata in the payload so we can judge the quality
of the embedding.

## Deployment

Currently, the model-gateway that will become the AI-gateway is
deployed from the project repository in
[`gitlab-org/modelops/applied-ml/code-suggestions/ai-assist`](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist).

It is deployed to a Kubernetes cluster in its own project. There is a
staging environment that is currently used directly by engineers for
testing.

In the future, it will be deployed using
[Runway](https://gitlab.com/gitlab-com/gl-infra/platform/runway/). At
that point, there will be a production and a staging deployment. The
staging deployment can be used for automated QA runs that will have
the potential to stop a deployment from reaching production.

Further testing strategy is being discussed in
[&10563](https://gitlab.com/groups/gitlab-org/-/epics/10563).

## Alternative solutions

Alternative solutions were discussed in
[applied-ml/code-suggestions/ai-assist#161](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/161#what-are-the-alternatives).