diff --git a/_posts/2024-07-22-the-curious-case-of-a-service-level-objective.md b/_posts/2024-07-22-the-curious-case-of-a-service-level-objective.md new file mode 100644 index 00000000..78864efc --- /dev/null +++ b/_posts/2024-07-22-the-curious-case-of-a-service-level-objective.md @@ -0,0 +1,111 @@ +--- +layout: post + +title: "The Curious Case of a Service-level Objective" + +tags: [SLO, Service-level objective, SLI, Service-level indicator, SRE, Site reliability engineering, Prometheus, AWS Cloudwatch] + +author: + name: Jean-François Smith + bio: Senior Software Developer + image: jfsmith.jpeg +--- + +## The context + +The site reliability engineering (SRE) team at Coveo is currently hard at work implementing tools and processes with a lofty goal in mind: moving our existing monitoring culture in R&D toward the systematic use of service-level objectives (SLO). Writing blogs about SLOs or announcing products making use of them is pretty common nowadays, and understandably so. Yet I’m finding that most of the discourse around this topic is limited to the same kind of examples and use cases. In this blog post, I will tell the convoluted story of a decidedly unconventional SLO. + + +But first, a refresher. What is an SLO? A service-level objective is an acceptability threshold relating to the performance of a service. This concept also has an inseparable twin, the service-level indicator or SLI. The SLI is a measurement of a service’s behavior expressed as the frequency of some successful state or result, for example, the number of requests that return HTTP 200 OK responses, or the number of jobs that complete within 5 minutes. A simple way to ensure that your SLI is expressed in this conventional form is to check that your measurement is an unequivocal yes/no or true/false proposition. Did a response return a 200 OK? Did a job complete within 5 minutes? This is so that you can calculate the ratio of good versus bad events. This is your SLI measurement: + +![SLI as the ratio of good / bad events](/images/2024-07-22-the-curious-case-of-a-service-level-objective/sli.png){:style="display:block; margin-left:auto; margin-right:auto; width:35%"} + +An *acceptable* value of this SLI, within a predetermined time window, is your SLO. The intent behind the time window is to calculate an error budget (EB) and a burn rate (EBBR). The main purpose of the EB is to represent the margin of error within which you are allowing your service to operate. An empty budget should always represent the moment when your customers begin to feel unhappy; a non-empty budget means you can allow yourself to deploy, or even experiment with, application changes. The related EBBR will then be used for alerting when the budget is ailing because your service is going south (or you are messing dangerously with it!). In short: + +![Error budget and budget burn rate](/images/2024-07-22-the-curious-case-of-a-service-level-objective/eb-ebbr.png){:style="display:block; margin-left:auto; margin-right:auto; width:25%"} + +In concrete terms, you could declare that 99.9% of your requests in the last 24h should return 200 OK, or that 95% of the jobs within the last 28 days should complete within 5 minutes. SLOs such as these are much more than mere monitoring redlines on a dashboard. They are, in essence, quality pledges to your customers.
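+To make the arithmetic behind these terms concrete, here is a minimal sketch in Python. The numbers and helper names are purely illustrative; they are not taken from any of the tools discussed below:
+
+```python
+def sli(good_events: int, bad_events: int) -> float:
+    """SLI: the proportion of good events among all measured events."""
+    total = good_events + bad_events
+    return good_events / total if total else 1.0
+
+
+def error_budget_remaining(measured_sli: float, slo_target: float) -> float:
+    """Fraction of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
+    allowed_bad = 1.0 - slo_target      # e.g. 0.1% of events for a 99.9% SLO
+    observed_bad = 1.0 - measured_sli
+    return 1.0 - (observed_bad / allowed_bad)
+
+
+def burn_rate(measured_sli: float, slo_target: float) -> float:
+    """How fast the budget burns: 1.0 = exactly on budget, above 1.0 = too fast."""
+    return (1.0 - measured_sli) / (1.0 - slo_target)
+
+
+# "99.9% of requests in the last 24h should return 200 OK"
+print(sli(998_500, 500))                      # ~0.9995
+print(error_budget_remaining(0.9995, 0.999))  # ~0.5 -> half the budget is left
+print(burn_rate(0.9995, 0.999))               # ~0.5 -> burning at half the allowed rate
+```
+
+A sustained burn rate above 1.0 is the usual alerting signal: if nothing changes, the budget will be exhausted before the window ends.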
+ +Up until now at Coveo, our implementation of SLOs has leveraged Honeycomb, which uses [distributed tracing](https://docs.honeycomb.io/get-started/basics/observability/concepts/distributed-tracing/#what-is-a-trace) to propel request observability to impressive heights. Using this data, setting up availability and latency SLOs is not only easy, but also quite appropriate. Thanks to its almost limitless cardinality, drilling down into traces and cross-associating multiple properties allow for very deep investigations. + +It turns out, however, that the SRE team has a very different kind of SLO on its hands, the implementation of which has been the opposite of straightforward. Here is why. + +## The problem + +Since around the year 2 BC (Before Covid), I have been maintaining a metric that tracks how long it takes for a simple document to go through our indexing pipeline after being either pushed by API or pulled by what we call a crawler. The idea behind this is to observe the health of the pipeline at a higher level. When this simple document takes too long to index and become available for querying, chances are that this is indicative of a problem for everyone else too. In theory, this metric is nothing less than perfect for an SLO. In practice, however, reality begged to differ. + +This metric is the result of an automated operation (using an AWS Lambda function) that evaluates given states, computes a result, and sends it to an external metric backend, [HostedGraphite](https://www.hostedgraphite.com/). This service does its job very well, but only that – hosting the data. There are no SLO features on top of it that we can take advantage of. + +Since our metric is generated by an automated job that performs an end-to-end test, Honeycomb is not particularly relevant to our problem. The value we are tracking (a delay) does not represent a request, and there is no tracing involved. There *are* versions of our universe in which we can indeed push custom metrics into Honeycomb, but our current implementation of this service is not meant for that and it would amount to the usual square-peg-in-a-round-hole problem. + +We thought briefly about the [Grafana Cloud](https://grafana.com/products/cloud/) observability platform, as it does support both SLOs and Graphite data, but this application is such a hugely complex offering that we cannot just cherry-pick a single functionality for one use case, however important it may be. We’re not going to buy a [73-function swiss-army knife](https://www.victorinox.com/en-CA/Products/Swiss-Army-Knives/Medium-Pocket-Knives/Swiss-Champ-XXL/p/1.6795.XXL) just because we need a can opener. Well, *I* would, but the point still stands that this is the wrong way to go about it. + +In short, we need to find the right backend for our data, one that properly supports SLOs. But this requirement is only half the problem. + +In Google’s [Site Reliability Workbook](https://sre.google/books/), there is a table listing [seven different types of SLIs](https://sre.google/workbook/implementing-slos/#slis-for-different-types-of-services) depending on the types of components in play. The one that applies to our use case here is the *freshness* SLI for a pipeline-type component. While this sounds straightforward enough, when you go looking for a backend that supports this particular type of SLI/SLO, it turns out that *nobody is ever talking about it*.
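+To pin down what such a freshness SLI means in our case, here is the yes/no proposition behind it, sketched in Python. The 7-minute threshold is the one you will see again later in this post; the function itself is illustrative, not actual Coveo code:
+
+```python
+from datetime import timedelta
+
+# Illustrative threshold: a pushed document should be queryable within 7 minutes.
+FRESHNESS_THRESHOLD = timedelta(minutes=7)
+
+
+def is_fresh(indexing_delay: timedelta) -> bool:
+    """One run of the automated job yields one good or bad event."""
+    return indexing_delay <= FRESHNESS_THRESHOLD
+
+
+print(is_fresh(timedelta(minutes=3)))   # True  -> good event
+print(is_fresh(timedelta(minutes=12)))  # False -> bad event
+```
+
+The SLI is then simply the ratio of good events over a time window, exactly like any other SLI. With that proposition in hand, the search for a backend that could host it began.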
+ +The vast majority of platforms range from exclusively documenting request-driven availability and latency SLOs to exclusively supporting them. For instance, Sloth, an SLO generator for Prometheus, only mentions these common SLIs in its documentation [examples](https://sloth.dev/examples/default/getting-started/). Its list of [plugins](https://github.com/slok/sloth-common-sli-plugins) is even more telling. It is no surprise then that [Pyrra](https://github.com/pyrra-dev/pyrra), another tool built on top of Prometheus, only supports availability and latency. I will come back to Prometheus later in this blog post, but in the meantime let’s appreciate how even Google Cloud’s [documentation](https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring) only mentions the same two SLIs/SLOs. Which kinds of SLOs does AWS Cloudwatch’s new Application Signals support out of the box? You may take a wild guess, and then see the answer down at point 5d in the relevant [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-ServiceLevelObjectives.html#CloudWatch-ServiceLevelObjectives-Create). + +While I do understand the gravitational pull that application availability and latency SLOs may have, a cynical side of me fears that we are witnessing a bit of a bandwagon effect here. Fair enough. Our requirement for a freshness SLO is clear, and if we need to cook up our own recipe for it, then we shall do so. + +## The solution + +The first part of the solution was to move the raw data closer to our infrastructure – to AWS Cloudwatch. The automated job, as well as many other related ones, already runs on AWS Lambda functions. It made sense to start from there. + +| ![Push indexing delay in HostedGraphite](/images/2024-07-22-the-curious-case-of-a-service-level-objective/01_hg.png) | +|:--:| +| _Original data in HostedGraphite_ | + +| ![Push indexing delay in AWS Cloudwatch](/images/2024-07-22-the-curious-case-of-a-service-level-objective/02_cw.png) | +|:--:| +| _Same data but in AWS Cloudwatch_ | + +I mentioned above that Cloudwatch recently added a new SLO feature, through Application Signals. This new service automatically collects your application metrics and allows you to define SLIs and SLOs on top of that. This is not our use case but, thankfully, it also supports SLOs based on any custom metric! The move to Cloudwatch thus felt quite timely. However, this feature is so fresh from the oven that it is not particularly versatile. For example, it does not track burn rate (which is a very valuable target for alerting, a strategy that Google is quite [keen](https://sre.google/workbook/alerting-on-slos/) on), nor can we easily set multiple alerting thresholds or windows. To achieve the latter, we would have to create multiple SLOs on top of the same metric (our SLI), each with its own single window and alert. This is impractical, without even going into the kind of virtuoso implementations involving proper [multi-window, multi-burn-rate](https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts) alerting. + +A reasonable requirement is that we can enjoy alerting features on par with our SLOs in Honeycomb: at least one burn rate alert (i.e. when the error budget is being drained too fast) and at least one budget exhaustion alert (i.e. the remaining error budget is too low). What can we do then, short of calculating the SLOs ourselves? + +We turn to implementing Google’s very own [slo-generator](https://github.com/google/slo-generator).
This Python tool does exactly what we need: measuring SLI compliance and computing error budget and burn rate. I bundled this tool in a new AWS Lambda function, alongside a custom backend class for pulling our data from Cloudwatch. It then did its magic by pushing its results to our Prometheus stack, as Prometheus is one of slo-generator’s default exporters. Witnessing our first SLI measurement live was quite satisfying: + +| ![An SLI calculation in Prometheus Pushgateway](/images/2024-07-22-the-curious-case-of-a-service-level-objective/03_pushgateway.png) | +|:--:| +| _An SLI calculation sent to Prometheus. 99.96% is not bad at all!_ | + +As a famished philosopher once said, however, there’s no such thing as a free lunch. This solution requires us to use [Prometheus Pushgateway](https://github.com/prometheus/pushgateway), which was kindly installed by our infrastructure team for the sake of this proof of concept. The one important thing to know about Pushgateway is that its documentation begins by telling us when [not to use it](https://github.com/prometheus/pushgateway?tab=readme-ov-file#non-goals) (see also [here](https://prometheus.io/docs/practices/pushing/)). This literal warning sign is not trivial. Prometheus works best by pulling (or scraping) data. This is not surprising, since the application metrics it collects are in effect bound to the instances that run that application. Our indexing metric here is independent of that, though, and in fact this is precisely the only acceptable use case for Pushgateway. Yet the fact remains that Pushgateway is not a metrics backend – it is merely a metrics cache. This comes with its own set of caveats and challenges. Did we really need to burden ourselves with them? + +We did not! Just as I could add a custom backend to slo-generator, I could also add a custom exporter redirecting all its calculation results to Cloudwatch itself instead. Thus the same AWS Lambda function I created simply pushes its results back to the same backend as its source data. + +| ![Push freshness SLO in AWS Cloudwatch](/images/2024-07-22-the-curious-case-of-a-service-level-objective/04_cw.png) | +|:--:| +| _Indexing a Push document should take less than 7 minutes, 99% of the time within 24 hours. Six bad events affected our error budget somewhat, but thankfully our compliance is still above 99%! (This data is from our development environment only)_ | + +The benefit of using Cloudwatch as a backend for our custom SLO – let’s not be shy about it – is that we can potentially reuse this data in many other ways, not just within AWS Cloudwatch. This is why I was able to add one last column to the edifice: a custom Prometheus collector/exporter that pulls our SLO data (the way Prometheus is meant to work), so that in the end we get the same result as if we were using Pushgateway, without the hassle of maintaining it. This way, we can enjoy Grafana’s powerful visualization tools, though of course the actual graphs shown below remain quite simple for the time being: + +| ![Push freshness SLO in Grafana](/images/2024-07-22-the-curious-case-of-a-service-level-objective/05_grafana.png) | +|:--:| +| _Our now familiar Push freshness SLO, here shown in Grafana, collected through Prometheus (again, all data is from our development environment only)_ | + +And so here we are! A fully functional freshness SLO, built up from several individual smaller pieces.
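+For the curious, here is a rough sketch of what that last piece can look like: a small collector that queries Cloudwatch at scrape time and exposes the results as Prometheus gauges. The namespace, metric names, and port below are hypothetical placeholders, not our actual configuration:
+
+```python
+import time
+from datetime import datetime, timedelta, timezone
+
+import boto3
+from prometheus_client import REGISTRY, start_http_server
+from prometheus_client.core import GaugeMetricFamily
+
+
+class SloCloudwatchCollector:
+    """Expose SLO results stored in Cloudwatch as Prometheus gauges."""
+
+    # Hypothetical names; use whatever the slo-generator exporter actually writes.
+    NAMESPACE = "SLO/PushFreshness"
+    METRICS = {
+        "sli_measurement": "Latest SLI compliance ratio",
+        "error_budget_burn_rate": "Latest error budget burn rate",
+    }
+
+    def __init__(self):
+        self.cloudwatch = boto3.client("cloudwatch")
+
+    def collect(self):
+        now = datetime.now(timezone.utc)
+        for name, help_text in self.METRICS.items():
+            response = self.cloudwatch.get_metric_statistics(
+                Namespace=self.NAMESPACE,
+                MetricName=name,
+                StartTime=now - timedelta(minutes=30),
+                EndTime=now,
+                Period=300,
+                Statistics=["Average"],
+            )
+            datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
+            if datapoints:  # expose only the most recent value
+                gauge = GaugeMetricFamily(f"slo_{name}", help_text)
+                gauge.add_metric([], datapoints[-1]["Average"])
+                yield gauge
+
+
+if __name__ == "__main__":
+    REGISTRY.register(SloCloudwatchCollector())
+    start_http_server(9100)  # collect() runs on every Prometheus scrape
+    while True:
+        time.sleep(60)
+```
+
+Because the Cloudwatch query happens inside `collect()`, Prometheus keeps pulling the data the way it prefers, and nothing needs to be pushed or cached in between.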
+ +## The upshot + +Granted, our end result requires jumping through quite a few hoops, but let’s revisit our requirements: + +- Ability to push custom metrics to a backend +- Ability to compute SLI compliance, error budget, and burn rate on which we can alert +- Ability to represent SLOs that are not of availability or latency types +- Ability to store this SLO data in a reliable backend + +Using an efficient but generic tool, Google’s slo-generator, alongside AWS Cloudwatch and Lambda functions, is all it took in the end. The road to get there was certainly not a straightforward one -- this post only describes the result, sparing you the many different iterations of this proof of concept. But I do hope the solution we settled on offers a proper way forward for all kinds of unconventional (but legitimate) SLOs we can come up with here at Coveo. + +One of my favorite benefits of using Google’s slo-generator is how SLOs are defined through a YAML spec. I did not have the space to dwell on that here, but this is one of the areas I really want to explore further down the line. As we already support Honeycomb SLOs as code (in this case, Terraform), I am hoping that eventually we can make all our SLOs uniform through a shared specification language, such as [OpenSLO](https://github.com/openslo/openslo). I firmly believe this will be of great help not only to drive our adoption of SLOs, but also to scale it up. So until then, [may your queries flow and the pagers stay silent](https://sre.google/workbook/conclusion/)! + +| ![SLO Backend architecture](/images/2024-07-22-the-curious-case-of-a-service-level-objective/06_graph.jpg) | +|:--:| +| _Architecture diagram of the chosen solution_ | + + +*If you're passionate about software engineering, and you would like to work with other developers who are passionate about their work, make sure to check out our [careers](https://www.coveo.com/en/company/careers/open-positions?utm_source=tech-blog&utm_medium=blog-post&utm_campaign=organic#t=career-search&numberOfResults=9) page and apply to join the team!* + + + diff --git a/_posts/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-using-aws-serverless-technilogies.md b/_posts/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-using-aws-serverless-technilogies.md new file mode 100644 index 00000000..d587519a --- /dev/null +++ b/_posts/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-using-aws-serverless-technilogies.md @@ -0,0 +1,80 @@ +--- +layout: post + +title: "Building a resilient and high-performance real-time data pipeline using AWS Serverless Technologies Part 1" + +tags: [Streaming Data, AWS, Coveo, Data Platform] + +author: + name: Lucy Lu, Marie Payne + bio: Senior Software Developers on the Data Platform + image: llu_mpayne.png +--- + +At Coveo, we track how end-users interact with search interfaces by capturing client-side and server-side signals from our customers' implementations. Initially, we only collected client-side events through the [Usage Analytics Write API](https://docs.coveo.com/en/1430/build-a-search-ui/use-the-usage-analytics-write-api), which implementers can use to log Click, View, Search, and Custom events. Coveo Machine Learning models use these events to provide relevant and personalized experiences for end-users. Implementers also use them to build reports and dashboards where they can gain insights into user behavior and make informed decisions to optimize their Coveo solutions.
The diagram below shows the real-time data pipeline that receives and processes client-side events. + +![Original real-time data pipeline](/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/old_pipeline.jpg) +*Original real-time data pipeline* + + + +Over the last few years, there has been a growing demand for real-time data analytics and applications. In addition to tracking events submitted explicitly through the Usage Analytics Write API (client-side events), we also wanted to capture events from internal services (server-side events). However, it is challenging to expand the above architecture to include server-side events for the following reasons: + +1. The original design of the Write Service did not account for additional data sources or integrations with new real-time consumers. Expanding its existing capabilities requires a significant amount of redesign effort. +2. Adding a data source or a consumer involves specific validation and transformation logic for incoming events. That logic can differ depending on the data source or consumer. As the number of data sources or consumers grows, the complexity of the Write Service increases, making it harder to manage and maintain, which ultimately leads to increased chances of errors and failures. +3. The additional transformation logic would potentially introduce more processing time, leading to performance degradation of the Write API. + +This motivated us to build a new real-time streaming pipeline that can be easily extended to adapt to new data sources from the client side or the server side, as well as accommodate new real-time data consumers. Beyond extensibility, there are other factors that we prioritized when designing the new real-time data pipeline, particularly: + +- **Data Quality**. The quality of data is crucial to the success of any AI or ML model. Poor data quality affects the outcomes of downstream applications. For example, events that collect incomplete product information affect the accuracy of the product recommendation models, which eventually results in a degraded personalization experience for end-users. Adding more data sources carries the risk of compromising data quality, particularly data consistency, when different formats and standards are used. The new real-time data pipeline should have the capability to enforce standards for all events ingested, ensuring accuracy, completeness, validity, and consistency of the data delivered downstream. +- **Scalability**. The volume and velocity of events vary depending on multiple factors like time of day, seasonal events (e.g. Black Friday), customer load testing, etc. The data pipeline should be easily scalable to handle larger volumes and to avoid data delays or any performance issues. Additionally, it should be able to scale down to save costs when data traffic is low. +- **Resilience**. Unexpected conditions, e.g. failures of third-party services, software bugs, or disruptions from data sources, can happen occasionally. Given that we receive thousands of events per second, failure to recover from these unexpected scenarios can lead to the loss of a significant number of events. + +With these requirements identified, we have built a new real-time streaming data pipeline, and we are continuously improving it. In this blog post, we will introduce the current state of the real-time data pipeline at Coveo, and discuss the benefits and challenges we faced.
In subsequent blog posts, we will detail the strategies we’ve implemented to overcome these challenges. + +# New Architecture Overview + +The diagram below shows the newly built real-time data pipeline architecture at Coveo. The event service, a Kubernetes service running on EKS, acts as the entry point for all analytics events in our platform. It is responsible for forwarding events to a Kinesis Stream (Raw Event Stream) as quickly as possible without performing any data transformations. Raw events are then processed by a Lambda function (Enrichment Lambda) that augments them with additional information, validates them against predefined schemas, and masks sensitive data. These enriched events are forwarded to another Kinesis Stream (Enriched Event Stream). The enriched events have two consumers: a Kinesis Data Firehose, which loads these enriched events into S3, and a Lambda function that routes events to multiple streams used by different applications. After events are loaded into S3, we use Snowpipe, a service provided by Snowflake, to ingest data from S3 into our centralized data lake in Snowflake. + +![New real-time data pipeline](/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/new_data_pipline.jpg) +*New real-time data pipeline* + +# How does this architecture benefit us? + +## Data Quality + +The Enrichment Lambda validates each event against predefined data schemas, and adds the validation results to the original event. We use JSON Schema to specify constraints such as allowed values and ranges for all events ingested through the event service. Common fields (e.g. URL, userAgent, etc.) that exist in all events share the same constraints. Event-type-specific data fields have their own rules. This ensures that events in the pipeline adhere to the same standard and prevents invalid, incomplete, or inconsistent data. + +Enriched events are also delivered to the Snowflake data lake, the single source of truth that both internal users and batch processing jobs (e.g. e-commerce metrics calculation and ML model building) access. This further enhances data consistency across different use cases. + +## Extensibility + +Enriched events are routed to different streams through a Lambda function (Router Lambda). This Lambda is responsible for determining which events should be routed to which downstream Kinesis Data Streams. The criteria for delivering events to a specific stream are configured through a JSON document. This document can be extended or modified if a new application needs to consume real-time events, or if the criteria need to be updated for an existing application. This flexibility allows teams to easily experiment with or build additional features on top of our real-time data. Similarly, adding a new data source simply involves adding JSON Schemas that specify constraints, eliminating the need for code changes. Compared to the original real-time pipeline, this new architecture greatly enhances extensibility, making it much easier and safer to integrate with new data sources or consumers. + +## Scalability + +We chose Kinesis Data Streams, Lambda, and Kinesis Data Firehose, all provided by AWS, to collect, process, and deliver events in the pipeline. Kinesis Data Streams is a streaming service that can be used to take in and process large amounts of data in real time. A Kinesis Data Stream is composed of one or more shards. A shard is the base throughput unit of a stream.
As each shard has a fixed capacity, it is easy to predict the performance of the pipeline. The number of shards can be increased or decreased depending on the data rate and requirements. + +A Lambda function can be mapped to a Kinesis Data Stream through an event source mapping, which automatically invokes the Lambda function. We can configure the number of concurrent Lambda invocations that process one shard of a Kinesis Data Stream. Combined with the configurable shard count, this lets us achieve high scalability of the data pipeline. When the volume of events increases, we can add more shards to a Kinesis Data Stream or increase the number of concurrent Lambda invocations to prevent throttling or performance issues. + +## Resilience + +A Kinesis Data Stream provides the capability to retain data records for up to 365 days. This ensures that if there is an issue preventing the invocation of Lambdas, data records remain stored in the Kinesis Stream until they expire. This mechanism guarantees that no events are lost, provided that any downstream issues are resolved within the data retention period. In addition, AWS Lambda offers robust error handling mechanisms (e.g. retry with exponential backoff) to handle runtime errors that may occur during function executions. This helps Lambda functions recover from transient errors and ensures reliable processing of events over time. Together, these capabilities offered by AWS contribute to the overall resilience and reliability of real-time data processing pipelines, minimizing the risk of data loss and maintaining system availability. + +# What challenges did we have? + +## Cold Starts in Lambda + +Lambda cold starts occur when AWS Lambda has to initialize a new instance to process requests. During the initialization, Lambda downloads the code, prepares the execution environment, and executes any code outside of the request handler. This adds significant latency to our data pipeline, where latency is critical. Although cold starts account for under 1% of requests, they disproportionately affect the overall latency of the pipeline. + +## Partial Failures When Sending a Batch of Events to a Kinesis Stream + +In the Lambda that writes records to a Kinesis Stream, we can write multiple records in a single call, which achieves much higher throughput compared to writing records individually. However, batching records introduces the risk of partial failures, where some records in the batch succeed while others fail. When a partial failure occurs and the Lambda function retries the batch, it rewrites the entire batch or a contiguous subset of the batch, including records that were successfully written previously. This redundancy results in duplicate records being sent downstream, which can impact the accuracy and performance of the real-time applications that consume these records. + +## Limitations with Observability + +CloudWatch is AWS's native observability tool, offering features like metrics, statistics, dashboards, and logs for in-depth analysis and visualization. All the services used in our pipeline automatically publish predefined metrics to CloudWatch. For instance, Lambda offers different types of metrics to measure function invocations, performance, and concurrency. However, the default metrics provided by AWS give limited insight, and gaining comprehensive insight into the pipeline can be cost- and performance-prohibitive.
For example, when we auto-instrumented our AWS Lambda functions with OpenTelemetry, a tool widely used across other services at Coveo, we experienced an average increase of 30s in cold starts. + +These challenges had long been obstacles preventing us from enhancing the performance of our real-time data pipeline. Over the past six months, the Data Platform team at Coveo conducted an extensive review of both the code and infrastructure, targeting the challenges above. In the next post, we will share our solutions for dealing with cold starts in Lambda and the improvements we achieved. Please stay tuned! + +*If you're passionate about software engineering, and you would like to work with other developers who are passionate about their work, make sure to check out our [careers](https://www.coveo.com/en/company/careers/open-positions?utm_source=tech-blog&utm_medium=blog-post&utm_campaign=organic#t=career-search&numberOfResults=9) page and apply to join the team!* \ No newline at end of file diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/01_hg.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/01_hg.png new file mode 100644 index 00000000..1356e79a Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/01_hg.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/02_cw.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/02_cw.png new file mode 100644 index 00000000..9db420a3 Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/02_cw.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/03_pushgateway.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/03_pushgateway.png new file mode 100644 index 00000000..07d75acd Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/03_pushgateway.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/04_cw.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/04_cw.png new file mode 100644 index 00000000..2b05d263 Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/04_cw.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/05_grafana.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/05_grafana.png new file mode 100644 index 00000000..f2b05283 Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/05_grafana.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/06_graph.jpg b/images/2024-07-22-the-curious-case-of-a-service-level-objective/06_graph.jpg new file mode 100644 index 00000000..ba80b43f Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/06_graph.jpg differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/eb-ebbr.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/eb-ebbr.png new file mode 100644 index 00000000..91428bfe Binary files /dev/null and b/images/2024-07-22-the-curious-case-of-a-service-level-objective/eb-ebbr.png differ diff --git a/images/2024-07-22-the-curious-case-of-a-service-level-objective/sli.png b/images/2024-07-22-the-curious-case-of-a-service-level-objective/sli.png new file mode 100644 index 00000000..37599c4d Binary files /dev/null and
b/images/2024-07-22-the-curious-case-of-a-service-level-objective/sli.png differ diff --git a/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/new_data_pipline.jpg b/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/new_data_pipline.jpg new file mode 100644 index 00000000..d4eef57e Binary files /dev/null and b/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/new_data_pipline.jpg differ diff --git a/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/old_pipeline.jpg b/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/old_pipeline.jpg new file mode 100644 index 00000000..14bfa13d Binary files /dev/null and b/images/2024-07-29-building-a-resilient-and-high-performance-real-time-data-pipeline-part-1/old_pipeline.jpg differ diff --git a/images/jfsmith.jpeg b/images/jfsmith.jpeg new file mode 100644 index 00000000..b0030aea Binary files /dev/null and b/images/jfsmith.jpeg differ