At Airbyte, we orchestrate Docker containers to sync data from files, APIs, and databases to data warehouses and data lakes. Each sync uses two types of containers: containers that read data (sources) and containers that write data (destinations). These containers implement the Airbyte Protocol, which specifies a command line interface for sources and destinations as well as the structure of stdout messages to pass between a source and a destination. Because Airbyte is an open-source project, we need to support any type of source and destination image that supports our protocol. Moreover, sources and destinations can be written in any language.
When we first started orchestrating third-party containers in Kubernetes, we found that we needed to extend container entrypoints to perform syncs. Since Kubernetes does not allow inspecting Docker entrypoints, we needed a strategy to identify this entrypoint when launching the container.
In this article, we'll discuss the options we considered, the one we chose, and go through some examples of how to extend Docker entrypoints in a Kubernetes pod.
Initially Airbyte was offered on Docker Compose, but we rapidly encountered use cases that required additional horizontal scaling. Naturally, we turned to Kubernetes for this.
Overriding the entrypoint when using the docker run command is trivial with the <span class='text-style-code'>--entrypoint</span> flag. Similarly, with kubectl run you can specify the entrypoint with the <span class='text-style-code'>--command</span> flag.
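For example, both overrides look like this on the command line (using a stock <span class='text-style-code'>alpine</span> image purely for illustration):

```
# docker: --entrypoint replaces the image's ENTRYPOINT
docker run --entrypoint echo alpine "overridden"

# kubectl: with --command, everything after `--` becomes the
# container command, replacing the image's entrypoint
kubectl run override-demo --image=alpine --restart=Never --command -- echo "overridden"
```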
Our unique challenge in performing syncs between independent pods is passing data between them. If each container exposed a service, this would be easy, but since they implement a command line interface, we need to capture stdout from the source pod, send it to the destination pod, and accept the input on the destination pod. Another challenge is that we don't want to send all stdout through Kubernetes logging. We could be processing billions of records from many sources, and we don't want to be constrained by the scaling or latency limits of Kubernetes logging. If we dump all source data into logs, we also have to worry about costs, data retention, scalability, security/access controls, etc.
Our purpose for overriding the entrypoint is to intercept data coming from the pod, but it's easier to illustrate this with an example that focuses solely on overriding an entrypoint.
At Airbyte, we commonly have two Docker images we want to orchestrate in a similar way even though they have quite different entrypoints. For example, the Dockerfile of our exchange rates source uses a call to a Python script as entrypoint. The Dockerfile of our MySQL destination inherits its entrypoint from a parent image.
Let's look at a simple example with two Docker images built by the following Dockerfiles. They have similar behavior (they just output a string) but they use quite different entrypoints.
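A minimal sketch of what such a pair could look like (these are illustrative images, not Airbyte's actual connector Dockerfiles). The first image declares its entrypoint directly:

```dockerfile
# Dockerfile for airbyte-hello: the entrypoint is declared explicitly
FROM alpine
ENTRYPOINT ["echo", "Hello"]
```

The second image says nothing about its entrypoint and inherits it from a (hypothetical) parent image, mirroring the MySQL destination case above:

```dockerfile
# Dockerfile for a hypothetical bonjour-base parent image
FROM alpine
ENTRYPOINT ["echo", "Bonjour"]
```

```dockerfile
# Dockerfile for airbyte-bonjour: no ENTRYPOINT line;
# it is inherited from the parent image
FROM bonjour-base
```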
Let's say we want to configure a Pod in a way that the output of the pod says "Hello World" or "Bonjour World", depending on which Docker image is used. In Kubernetes, the default behavior for Docker is to use the entrypoint of the image. However, you can override this entrypoint by specifying a command.
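In a pod spec, <span class='text-style-code'>command</span> overrides the image's Docker entrypoint (and <span class='text-style-code'>args</span> overrides its CMD). A minimal sketch, with illustrative names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-demo
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: airbyte-hello
      # `command` replaces the image's ENTRYPOINT entirely;
      # the original entrypoint is not accessible from here.
      command: ["echo", "Hello World"]
```

Note that the override is total: the original entrypoint is discarded, which is exactly the problem discussed next.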
In an ideal world, we'd hope that we could specify this command in terms of the Docker entrypoint. However, this isn't possible. Since Kubernetes allows running non-Docker containers, it doesn't have any concept of retrieving the entrypoint for a Docker container. While you could read the entrypoint from an image (by simply running <span class='text-style-code'>docker inspect [image]</span> or reading it from the Docker API), this greatly complicates an orchestrator running in Kubernetes, because it would need to authenticate against the relevant Docker registry to extract the entrypoint. The orchestrator doesn't know anything about how to access the Docker registry by default: Kubernetes keeps image pulling isolated from the pods themselves.
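For reference, reading the entrypoint outside of Kubernetes is a one-liner, but it requires direct access to the image — exactly the access an in-cluster orchestrator lacks by default:

```
# Prints the image's ENTRYPOINT as a JSON array, e.g. ["echo","Hello"].
# Requires the image to be present locally or pullable.
docker inspect --format '{{json .Config.Entrypoint}}' airbyte-hello
```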
Since we can't access this entrypoint directly, we needed to resort to an alternative. Here are the options we considered.
Since our application is running on Kubernetes and launching pods on behalf of the user, we could require them to configure our application with credentials that allow it to interact with the Docker registry their image is hosted on. Kubernetes image pull credentials are insufficient; the application needs to talk to the Docker registry, not just pull its image. We could then query the Docker registry API to identify the entrypoint of the image in question.
There are a few problems with this approach. First of all, we would need to update our application's data model to store the registry for each image and the credentials for each unique registry. It would also require additional configuration for all of our users installing Airbyte on their Kubernetes clusters. Any additional hurdle for configuring and running Airbyte decreases the likelihood that a new user will actually start using Airbyte. This is something we want to avoid whenever possible.
Since our users are implementing a CLI that matches our protocol, we could require users developing Airbyte connectors to use a fixed entrypoint. For example, this could be a shell script at a specific path.
If we update our previous examples we would have:
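A sketch of what Option 2 could look like, assuming a fixed path such as <span class='text-style-code'>/airbyte/entrypoint.sh</span> (the path and script contents are illustrative):

```dockerfile
FROM alpine
# Every connector image would have to ship a script at this exact path,
# wrapping whatever its real entrypoint is:
COPY entrypoint.sh /airbyte/entrypoint.sh
RUN chmod +x /airbyte/entrypoint.sh
ENTRYPOINT ["/airbyte/entrypoint.sh"]
```

where <span class='text-style-code'>entrypoint.sh</span> simply forwards to the image's real command:

```
#!/bin/sh
# Forwards to the image's real entrypoint (here, the "Hello" example)
exec echo "Hello" "$@"
```

The orchestrator could then safely set <span class='text-style-code'>command</span> to invoke <span class='text-style-code'>/airbyte/entrypoint.sh</span> on any conforming image.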
This would require developers to always use this script. Only developers (not all users) would need to interact with it, which would be a major improvement over Option 1. This seems like a decent option with one downside: it requires shipping an additional script. Developers would have to add extra steps to their Dockerfiles to load the script, or extend a base image where the script was already installed.
We ultimately went with something very similar to Option 2:
We require developers to add an environment variable <span class='text-style-code'>AIRBYTE_ENTRYPOINT</span> to specify the entrypoint that they want eval-ed to run their CLI. This has all of the benefits of Option 2 (only developers are impacted, not all users) and it also offers flexibility in entrypoint definition. It's also one of the most lightweight options. It's trivial to add to a new image and doesn't require any additional files on the image or base images.
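The convention is easy to demonstrate in plain shell: the entrypoint lives in an environment variable, and a wrapper can <span class='text-style-code'>eval</span> it and extend its output however it likes (the variable's value here is illustrative):

```shell
# A connector image would declare something like:
#   ENV AIRBYTE_ENTRYPOINT "echo Hello"
AIRBYTE_ENTRYPOINT='echo Hello'

# The wrapper evals the declared entrypoint, captures its stdout,
# and appends to it:
echo "$(eval "$AIRBYTE_ENTRYPOINT") World"
```

Running this prints "Hello World" — the image's own command ran, and the wrapper extended its output.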
Next, we need to dynamically create pods that only vary with the image name. To extend the previous examples, we want to be able to create pods that run any image and append " World" to its output.
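A sketch of such a pod template, where only the <span class='text-style-code'>image</span> field varies per connector (assuming each image sets <span class='text-style-code'>AIRBYTE_ENTRYPOINT</span>, e.g. to <span class='text-style-code'>echo Hello</span>):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: entrypoint-wrapper-demo
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: airbyte-hello   # the only line that changes per image
      # Override the entrypoint with a shell that evals the
      # image-declared AIRBYTE_ENTRYPOINT and appends to its output:
      command: ["/bin/sh", "-c"]
      args: ['echo "$(eval "$AIRBYTE_ENTRYPOINT") World"']
```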
Now we can create pods for both the <span class='text-style-code'>airbyte-hello</span> and <span class='text-style-code'>airbyte-bonjour</span> images, changing only the image specified. These pods will output "Hello World" and "Bonjour World", respectively.
We've just wrapped the entrypoint for the Docker image and extended its behavior in a way that's standardized across a set of images.
In practice, what we are doing to wrap the entrypoint is more complicated than this (see one of our templates). Here we add a heartbeat mechanism to kill the running node if a different server is not online. We also inject the ability to route stdin/stdout/stderr from/to sidecar containers that relay that information over the network to other pods. We'll dive into how we're routing this information in a later blog post!
In order to work around a (necessary) limitation of the Kubernetes API, we added a requirement for developers implementing our connectors to add an environment variable that contained their entrypoint as a string that could be executed with eval. This allows us to wrap Docker entrypoints on the fly to inject complex logic for reading in stdin from a network connection, relaying stdout/stderr over the network, and performing other operations.
Get all your ELT data pipelines running in minutes with Airbyte.