An open-source, easy-to-set-up observability solution for GraphQL APIs, enabling seamless analysis of slow requests, errors, and resolver performance, and identification of backend bottlenecks.
Metrics
Monitor your GraphQL API request rate, latency, and error rate.
Traces
Analyze traces for every request made to the
/graphql endpoint.
Case Study
1. Introduction
GraphQL is a query language for APIs and a server-side runtime that
gives clients the flexibility to send customized queries to the API.
While GraphQL provides the benefits of speed and flexibility, it can also make it more complex for backend engineers to monitor the health of an API. Say an engineering team receives complaints that their GraphQL API is slow. How do they troubleshoot? Is it because a request was very complex? Is there a backend bottleneck? Is there an inefficient database operation, and how do they find it?
If one request executes multiple database queries, how do they find
which one is causing increased latency? Given all these questions,
the team needs detailed information to find the root cause of the
problem.
1.1 QMantis
QMantis, an open-source observability solution for GraphQL APIs,
intends to address these questions by abstracting away the
complexity of generating, processing, exporting, storing, and
visualizing metrics and traces data, making it easy to track errors
and analyze an API's performance at the resolver level.
QMantis helps teams spot potential problems and quickly identify
what is causing these issues. QMantis uses telemetry data,
specifically metrics and GraphQL-specific traces, to help teams
monitor a GraphQL API's health. With QMantis, engineering teams can
see exactly which requests are performing poorly, even in complex
operations.
Before diving into the details of the creation of QMantis, let us
introduce Novels and Barns, a hypothetical company that can benefit
from adding a GraphQL-specific observability service to their API.
1.2 Novels and Barns
Novels and Barns is a small, independent bookstore with a few
brick-and-mortar locations, as well as an online store. They have a
small engineering team of only two people, who are responsible for maintaining Novels and Barns' monolithic, single-server application.
Over the past year, Novels and Barns' sales have increased
significantly. Notably, mobile sales are skyrocketing, now
accounting for over a third of the company's revenue, and are projected to keep growing.
Recently, with this surge in sales, customers' complaints have also
increased. There have been many complaints about the slow speed of
the mobile web application. Maintaining customer satisfaction is
critical, especially for a small retailer like Novels and Barns: when an application is slow, customers may buy fewer books, or simply give up and not purchase anything at all. Knowing the negative impact that high latency has on sales
[1], the team is working hard to improve the performance of their
mobile web application.
2. REST vs GraphQL
REST is a common architectural style for building APIs, in which
clients perform a pre-defined set of actions on database objects,
and the server returns a structured representation of those objects.
For each HTTP request, clients choose one endpoint and an
appropriate HTTP method for the purpose of obtaining data or sending
data, among other operations, and each response returns a fixed
resource representation.
As mentioned previously, GraphQL is a query language for APIs and a
server-side runtime. GraphQL APIs must define a schema that
specifies both the operations that can be executed, as well as the
type of data that can be requested. When a request is received by a
GraphQL server, it parses the request, validates it against the
schema, and then executes it by calling functions, known as
resolvers, containing the business logic for communicating with the
data source and carrying out the request. GraphQL APIs give clients
the flexibility to customize their requests, but on the other hand,
requests vary in their complexity and structure, and server
responses are not always consistent.
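To make the schema/resolver relationship concrete, here is a minimal sketch using graphql-js; the bookstore types and the db helper are illustrative assumptions, not Novels and Barns' actual code.

  const { buildSchema } = require('graphql');

  // The schema: the data types and operations clients may request.
  const schema = buildSchema(`
    type Book {
      title: String
      genre: String
    }
    type Author {
      firstName: String
      lastName: String
      books(genre: String): [Book]
    }
    type Query {
      author(id: ID!): Author
    }
  `);

  // A resolver: the business logic that fulfills the "author" query.
  const rootValue = {
    author: ({ id }) => db.getAuthorById(id), // db is an assumed data-access layer
  };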
2.1 Mobile Experience
There are two key differences between REST and GraphQL APIs that
could significantly impact a user's mobile experience.
2.1.1 Only One Endpoint
With a REST API, to load a page, a client may need to make multiple
requests to get the information they need. For instance, if a Novels
and Barns client wants to get a list of all the romance novels
written by a specific author, they need to make three different GET
requests: one to the /authors endpoint, another to the
/genres endpoint, and one to the
/books endpoint. Even in this simple example, there are
already three network calls made between the client and the server.
As mobile devices are designed for mobility, they do not always have
access to a good network connection, so these multiple network calls
can take time to complete, impacting the amount of time it takes for
a page to load.
With a GraphQL API, clients can customize requests in order to fetch
exactly the data they need, reducing the overall number of requests
made. Using this architecture, the heavy lifting is done by the
backend server. When a request comes in, the GraphQL server parses
it, validates it, and then executes the necessary resolver
functions. Then, the server aggregates the requested data, which is
sent back to the client in one response. This process can improve an
application's performance and reduce latency, especially on mobile
devices, since there are fewer network calls made between the client
and the server.
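Assuming the hypothetical bookstore schema sketched earlier, a single request like the following could replace all three REST calls, with the server aggregating the data:

  query RomanceNovelsByAuthor {
    author(id: "42") {
      firstName
      lastName
      books(genre: "romance") {
        title
      }
    }
  }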
2.1.2 Customizable Queries
REST APIs often result in an issue known as data over-fetching. In
the example above, the client wants to load a book page, and to do
that, the client needs the author's first and last name. However,
because a REST API always returns the same resource representation,
when a client makes a GET request to the
/authors endpoint, the client receives a lot of
unnecessary information. Mobile devices may have limited bandwidth
and computing power, making it harder to handle a large amount of
data. In this scenario, using GraphQL minimizes bandwidth usage, as
requests only fetch exactly the data that is needed, eliminating the
issue of over-fetching.
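Continuing the same hypothetical example, a client that only needs the author's name can ask for exactly those two fields and nothing else:

  query AuthorName {
    author(id: "42") {
      firstName
      lastName
    }
  }

The response carries only these fields, whereas a fixed /authors representation would return the full author record on every request.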
2.2 Novels and Barns
Up until recently, the Novels and Barns mobile application used a
REST API. Say a customer wanted to view a specific category of books
on their device—to do so, the client application had to make many
requests, and each response came with extra unneeded information; in
order to display only the requested information, the client had to
perform the additional work of filtering for relevant information.
This process worked well for many years and still works well for the
desktop browser application. However, the recent exponential
increase in their mobile traffic and sales put an unnecessary strain
on the client-side of their architecture, and negatively impacted
the shopping experience for customers. In order to improve the
performance of their mobile application, the Novels and Barns
engineering team decided to implement a GraphQL API.
3. Observability
3.1 What is Observability?
Observability means measuring and analyzing a system's current state
based on the data it generates. The more observable a system is, the
more quickly a developer can go from spotting a potential problem to
identifying its root cause, without needing to write any code. The
goal of observability is to help developers understand what is
happening so that they can troubleshoot and debug applications
effectively. Observability relies on telemetry data to analyze
systems and applications.
3.2 Telemetry
Telemetry is the automated process of remotely collecting and
transmitting measurements and other types of data
[2]. This process is used in many industries, such as healthcare,
transportation, and meteorology
[3]. In the context of software engineering, telemetry data allows
teams to measure and track the performance and the health of an
application or system.
The three main telemetry data types used in software engineering are
metrics, traces, and logs. For the purpose of this case study, only
metrics and traces are discussed in detail, as these are the data
types generated by QMantis.
3.2.1 Metrics
Metrics are numerical values calculated over a period of time; they
provide big-picture information about an application or system.
The Novels and Barns sales department, for example, collects
metrics, such as sales volume per day, number of customers in a brick-and-mortar store per hour, and average total sales per customer per hour, to name a few. These metrics provide information that is
essential to guide their business decisions.
Similarly, collecting metrics helps engineering teams maintain and
improve an API. Metrics help answer questions such as: is the
application performance degrading? Is latency higher than usual? Are
users encountering errors? When are the downtimes? Metrics help
engineering teams spot potential problems and stay proactive. For
example, developers can use metric data to identify usage trends.
This information can be utilized in various ways, such as scheduling
maintenance during known downtimes, when there is less traffic.
QMantis tracks what is known as RED metrics: request rate, error
rate, and duration (latency) [4].
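As an illustration, RED instruments could be created with the OpenTelemetry metrics API roughly as follows; the instrument names are illustrative, not QMantis' exact ones.

  const { metrics } = require('@opentelemetry/api');
  const meter = metrics.getMeter('qmantis-example');

  const requestCounter = meter.createCounter('graphql.requests');        // Rate
  const errorCounter = meter.createCounter('graphql.errors');            // Errors
  const latencyHistogram = meter.createHistogram('graphql.duration_ms'); // Duration

  // On each request:
  requestCounter.add(1);
  latencyHistogram.record(42); // elapsed milliseconds for this request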
3.2.2 Traces
The other data type QMantis generates is traces, which provide detailed information about a request's journey. Traces help
developers identify where failures occur, as well as the root causes
of poor performance. Traces are made up of spans, and each span
represents a piece of the request workflow. The top-level root span
encapsulates the end-to-end latency of an entire request, while all
the subsequent spans, known as child spans, represent operations
that take place as part of the request.
When a customer visits a Novels and Barns location, the whole visit, from the moment they walk through the door until they leave the store, can be considered a trace. Using this scenario, a span could be, for instance, their walk from the door to the non-fiction section.
Another span is when the customer picks up a book and skims it. When
the customer pays for their book, this is yet another span. Each
trace provides all of this detailed information about a request's
(or customer's) journey. Traces contain the answer to questions
like: What sections did the customer visit? What books did they look
at? If there was a purchase, how much was it?
Above is an example of a GraphQL-specific trace that labels the
operation (or trace) name, the trace ID, the total duration of the
request, the root span, and other key pieces of information.
Every GraphQL request goes through parsing, validation, and
execution phases (labelled on the left side of the diagram); each of
these phases and their respective spans are visualized above. The
execution phase contains the operation type, the parent and child
resolver names, and their associated spans.
Analyzing these GraphQL-specific spans helps developers gain a deeper understanding of the journey of a GraphQL request, and identify
where any bottlenecks or errors occur. Traces are often discussed in
the context of microservices. However, with GraphQL, traces are
useful to observe, even when using a monolithic architecture.
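To make the span hierarchy concrete, here is a minimal sketch (not QMantis code) using the OpenTelemetry tracing API, with the bookstore analogy as span names:

  const { trace } = require('@opentelemetry/api');
  const tracer = trace.getTracer('bookstore-example');

  // The root span covers the whole visit (the end-to-end request).
  tracer.startActiveSpan('customer-visit', (rootSpan) => {
    // Child spans represent individual steps of the workflow.
    tracer.startActiveSpan('walk-to-nonfiction', (span) => span.end());
    tracer.startActiveSpan('pay-for-book', (span) => span.end());
    rootSpan.end();
  });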
3.3 Observability for GraphQL APIs
Why must observability for GraphQL APIs be approached differently
compared to REST APIs?
With REST APIs, collecting metrics provides good insight into the
health of an application. For instance, error rates can be filtered
by endpoint, HTTP status code, and/or HTTP method. If there is an
increase in the error rate, having these three pieces of information
makes it easier to narrow down where the problem is.
With GraphQL, on the other hand, an increase in latency may be
caused by an increase in operation complexity. As a client can
perform different requests that differ in structure and complexity,
the server response is influenced by the complexity of the request.
Therefore, when it comes to GraphQL APIs, metrics alone do not
provide enough information or context to find the root cause of a
problem, even within a monolithic architecture. This can be
explained by three features unique to GraphQL, which are not found
in REST: firstly, GraphQL only has one endpoint
(/graphql), secondly, most GraphQL requests default to
POST requests, and finally, most GraphQL responses return a status
code of 200, even when errors occur. Let us discuss each of these
points in more detail.
3.3.1 Only One Endpoint
As GraphQL only has one endpoint, which is /graphql,
developers need more information to investigate application issues.
For example, to find the root causes of high latency, developers
need to answer questions such as: Is the latency because of the
complexity of the request? Is there a performance bottleneck? If
yes, where is this happening? There is a potential problem, but that
is all the information known at this point.
Visualizing latency points the developer in the right direction. It shows that there is something worth investigating, but the first diagram alone does not provide enough information to figure out where the problem is happening. With only the information from this diagram, the developer cannot distinguish whether the higher latency is caused by a complex operation, a backend bottleneck, or an inefficient resolver.
In contrast, as seen in the second diagram, with REST APIs, it is
easier to quickly narrow down where the issue is coming from: the
/authors endpoint. Also, because REST always returns
the same resource representation, an increase in latency is often
related to a backend bottleneck. To get more information about the
source of high latency, developers can just filter by the HTTP
response status code and HTTP request method.
3.3.2 Most GraphQL Requests are POST Requests
As the vast majority of GraphQL requests are POST requests,
filtering the error rate by HTTP method does not give any insight
into a potential issue. On top of that, GraphQL does not accept PUT
or DELETE requests. Instead, GraphQL requests are categorized as
either queries, mutations, or subscriptions; for further
differentiation, users have the option to name each request.
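For example, a named request arrives as a POST body in this standard shape, where the user-supplied operation name distinguishes it from other requests:

  POST /graphql
  {
    "query": "query AllBooks { allBooks { title } }",
    "operationName": "AllBooks"
  }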
With REST, filtering by HTTP method is actually helpful, as each request uses a GET, POST, PUT, or DELETE method. In the example
above, it is evident that potential latency issues are coming from
PUT and DELETE requests.
3.3.3 Most GraphQL responses have a status code 200
Finally, with REST, errors can be tracked by the response status
code. For example, if the request has a 4xx or 5xx status code, it
can safely be counted as an error.
With GraphQL, it is common for responses with errors to return a
status code of 200. Therefore, relying on the response status code
to count errors will underestimate the "real" error rate of GraphQL
APIs. The diagram above tracks the error rate of a GraphQL API—the
dashed line represents the API's actual error rate, while the solid
line below tracks the API's error rate using 4xx and 5xx request
status codes. It is evident that tracking status codes is not a
reliable way to track the error rate of a GraphQL API.
Rather than using status codes, GraphQL handles errors differently.
With GraphQL, all errors, regardless of status code, are handled as
part of the response body under a special errors object, as seen
above. To have a clear picture of the GraphQL API's health, it is
important to track this object, instead of the response status code
alone.
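For instance, a response for a failed request typically comes back with HTTP 200 and a body shaped like this (the message is illustrative):

  {
    "errors": [
      {
        "message": "Cannot query field \"titel\" on type \"Book\".",
        "locations": [{ "line": 2, "column": 3 }]
      }
    ]
  }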
3.4 Observability for the Novels and Barns GraphQL API
Why is all of this information relevant for Novels and Barns, and
small businesses like it? Novels and Barns has already invested a
lot of time and money to develop a GraphQL API, and they want to
ensure that it is performant. The company's engineers are not
GraphQL experts, and they want to effectively monitor the health of
their newly implemented GraphQL API. Mainly, they want to: track and
reduce latency, monitor the overall performance of requests and
their resolvers, know how requests are hitting the database, track
errors, and respond to issues quickly and proactively. Therefore, to
achieve these goals, having an observability service for their
GraphQL API is critical for the Novels and Barns team.
4. Existing Solutions
Novels and Barns understands that using a GraphQL-specific observability tool is important for their new GraphQL API, and is now considering and comparing some existing GraphQL observability solutions. They have narrowed down their choice of solutions to two
broad categories: they could use fully managed services like Apollo
Studio or Hasura Cloud, or, they could take a DIY approach that
pieces together open source tools.
4.1 Fully Managed Cloud Services
Apollo Studio and Hasura Cloud are two established, fully managed
observability and monitoring cloud services for GraphQL. These
platforms provide meaningful GraphQL-specific information, such as
each operation name and type, as well as an abundance of features.
However, only some of their services are free.
4.1.1 Apollo Studio
Apollo Studio is a cloud platform that offers monitoring of metrics
and traces. It has many different features, as well as a relatively
easy-to-use UI.
Above is a view of Apollo Studio's monitoring dashboard. It displays
metrics information such as request rate, latency, and error rate.
It is also possible to see the most frequently used operations, the slowest operations, and the operations resulting in the
most errors. On the bottom is a graph that shows the request rate
over time, making it easy to identify spikes in traffic.
This is another view of Apollo Studio's dashboard. The top of
the screenshot shows the request latency over time, which helps show
when slow requests happened. On the bottom is a visualization of the
request latency distribution, which makes it easy to spot outliers
and slow requests.
4.1.2 Hasura Cloud
Hasura Cloud is another fully-managed cloud service that handles the
infrastructure of GraphQL applications, and offers real-time metrics
and tracing. Like Apollo Studio, Hasura Cloud offers many features
and has an easy-to-use user interface.
Above is a view of Hasura Cloud's dashboard. It is somewhat similar
to the Apollo Studio dashboard, in that it shows the same three main
metrics: request rate, error rate, and latency. It also displays
graphs for the most frequently used operations, the slowest
operations, and operations resulting in the most errors.
Novels and Barns could choose to use Apollo Studio or Hasura Cloud
as their observability tool. However, Apollo Studio requires users
to pay to access traces. The free version of Hasura Cloud limits
users to 60 requests per minute. Novels and Barns is getting more
traffic with new customers, so they would rather not have this
limit. Also, Apollo Studio and Hasura Cloud both own any collected data, meaning the team would not be able to download or visualize data past a certain retention period.
While Apollo Studio and Hasura Cloud are great options for large
enterprise companies with many engineers and big budgets, a small
company like Novels and Barns prefers to minimize costs as much as
possible. Also, although Apollo Studio and Hasura Cloud offer fully managed, easy-to-set-up observability solutions for GraphQL APIs, data ownership is a must-have for the Novels and Barns team.
So, in this case, the Novels and Barns team decides that neither
Apollo Studio nor Hasura Cloud would be the best fit for them.
4.2 DIY
Aside from fully managed services, Novels and Barns could go the DIY route. They would be able to do this without spending a lot of
money while maintaining ownership of their telemetry data, which are
the two most important features they are looking for in an
observability tool. This approach would also allow them to have full
control over the features and how the tool works.
However, a DIY approach has a steep learning curve which involves
researching many different components and stitching them together.
This approach is research-intensive, resource-demanding, and
time-consuming. If Novels and Barns had a big engineering team, they
would probably choose the DIY approach. However, with a team of only
two engineers, they would rather spend their time and effort
focusing on the business logic of their application. Therefore,
after careful consideration, the team decides that the DIY approach
is not the best option for them.
5. Our Solution: QMantis
There is a third solution that Novels and Barns could use: QMantis.
As discussed earlier, QMantis is an open-source observability tool
for GraphQL APIs, designed for small applications. With QMantis,
Novels and Barns gets to maintain control over their data without
needing to export it to a third party.
QMantis is well-documented and designed for easy setup, minimizing the time investment required. Users are able to see both metrics and traces in one dashboard, even though QMantis has fewer features than the fully managed cloud solutions.
Finally, QMantis is free and open-source. There are no time or
storage limitations, so Novels and Barns can invest their funds into
other business needs.
5.1 How Novels and Barns Can Use QMantis
After discussing the pros and cons of the various existing GraphQL
observability solutions, Novels and Barns has found QMantis to be
the best fit for their business.
Now, with QMantis, Novels and Barns' small development team can get
free metric and tracing coverage made specifically for GraphQL APIs,
without having to learn or implement any other observability
technologies. They can check their mobile application's usage by
monitoring its request rate, error rate, and overall latency. They
can also analyze and distinguish every request made to the GraphQL
endpoint.
Further, Novels and Barns can determine their application's usage trends by analyzing each operation's and resolver's runtime. With the
growing success of their business and their recent surge in users,
they can more quickly investigate and address bottlenecks in their
mobile application, preventing high latency and ensuring customer
satisfaction.
Here is an overview of what QMantis offers.
5.2 An Overview of QMantis
5.2.1 Main Dashboard
The screenshot above shows what a QMantis dashboard may look like
for a user like Novels and Barns. The dashboard shows metrics and
traces side by side. Being able to view metrics and traces in the
same place provides insight into what is going on with the GraphQL
API. For example, on the top row are some metrics panels; here, the
request rate and error rate move together, as expected. However, in
the panels below the first row, it is evident that the average
latency of requests is spiking outside of any increase in traffic.
When metrics are seen alongside traces, it is clear that some
outliers are driving up the average latency time, and should be
investigated.
5.2.2 Analyzing Traces
With QMantis, the user can easily analyze traces for every request
made to the /graphql endpoint.
Below, we can see what a typical trace looks like for a simple query when viewed in Grafana (QMantis' visualization layer).
In this query, books from a particular author are fetched, along
with the genres of those books. On the far right-hand side of the image, in the condensed span section, there are multiple spans executing at the
the same time. This is exactly what an efficient query looks like
when visualized.
Once the author is fetched, all the books related to that author,
and the genres related to that book, are fetched asynchronously. On
the left-hand side, it is possible to see exactly what is fetched
while each span executes. This visualization gives valuable insight
into the GraphQL resolver that was used to fetch this data and makes
it easy to see where any inefficiencies are negatively impacting a
server's response time.
5.2.3 Investigating Bottlenecks
When viewing traces from the QMantis dashboard, it is easy to spot
slow requests. With the growing success of the Novels and Barns'
business and their recent surge in users, they can use QMantis to
quickly investigate and address bottlenecks in their application. By
doing so, their team can efficiently respond to issues such as high
latency, contributing to customer satisfaction.
In the above image, there are some outlying traces. Sometimes, an
outlying trace is related to an operation that usually runs in line
with the rest of the GraphQL operations. If this is the case, then
the issue is usually related to something outside of the GraphQL
API, such as a temporarily overloaded database. In this case,
however, each of the outliers is related to the same GraphQL query
named allBooks (the operation name, as well as other relevant
information, can be seen when hovering over a trace), so it is time
to investigate that resolver.
5.2.4 Visualizing Resolver Runtime
By clicking on a trace, it is possible to get more detailed
information about that trace and its spans. Above is an individual
view of one of the outlying traces that was shown in the previous
image. Here, spans are executed synchronously, and information for
only 20 books is fetched in nearly half a second. What happens when
making much larger requests that fetch information for 100, or 1,000
books? This inefficient query defeats the purpose of switching to
GraphQL in order to improve response times and may increase response
time when compared to a REST API.
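The resolver screenshot is not reproduced here, but a loose reconstruction of that kind of resolver might look like the following; the db helpers are assumptions.

  // Inefficient: for each book, the genres query only starts after the
  // authors query resolves, so spans line up one after another.
  const rootValue = {
    allBooks: (args) =>
      db.getBooks(args.limit).then((books) =>
        Promise.all(
          books.map((book) =>
            db.getAuthorsForBook(book.id).then((authors) => {
              book.authors = authors;
              // Waits on the authors query above before fetching genres.
              return db.getGenresForBook(book.id).then((genres) => {
                book.genres = genres;
                return book;
              });
            })
          )
        )
      ),
  };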
By examining the allBooks resolver function above, it is obvious that the function can be optimized. This particular resolver returns a promise, which in turn returns a map of the original result, which in turn returns a promise that populates the genres field only after the authors field is populated for each book. There is a lot of waiting, and it makes the response to this query unnecessarily slow.
Now, compare the above inefficient resolver to the resolver below,
which is much more efficient:
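Again as a reconstruction in the same hypothetical style:

  // Efficient: once the books are known, authors and genres for every
  // book are fetched concurrently, so spans overlap in the trace.
  const rootValue = {
    allBooks: async (args) => {
      const books = await db.getBooks(args.limit);
      await Promise.all(
        books.map(async (book) => {
          // Both queries for this book start immediately, in parallel.
          const [authors, genres] = await Promise.all([
            db.getAuthorsForBook(book.id),
            db.getGenresForBook(book.id),
          ]);
          book.authors = authors;
          book.genres = genres;
        })
      );
      return books;
    },
  };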
Trace information for an operation run with this efficient resolver looks quite different: spans execute asynchronously, there are no significant gaps between span executions, and latency is now under 200 ms. In an expanded
view of this trace that shows all of its 156 spans, the first group
of asynchronous spans retrieves books, while the second group
retrieves authors and genres related to those books. Since it is
impossible to find the author or the genre of a book before knowing
what the book is, this is as efficient as this query can get. In
addition, this trace gives us confidence that this resolver could
return a much larger dataset without greatly increasing the response
time of the server due to its asynchronous nature.
This example may seem contrived and a little unrealistic, and it
probably is for such a simple query, but for people new to GraphQL,
or when dealing with more complicated resolvers, this is a big
issue. If a company like Novels and Barns did not have a GraphQL
observability tool set up, an inefficient resolver could easily be
pushed into a production environment and go unnoticed for weeks,
months, or even years.
QMantis provides insight into many areas of a GraphQL API, not just
resolver performance, but highlighting an inefficient resolver is
important for two reasons. First, inefficient resolvers can be very
hard to spot in a working API without proper observability, and
second, this demonstrates how QMantis can be used to track down an
issue all the way from the dashboard to a resolver execution.
Now that we have a better understanding of the features of QMantis,
let us discuss its architecture.
5.3 Architecture
This diagram gives an overview of the QMantis architecture. There
are two main overarching and interconnected components:
qmantis-express and qmantis-compose. On
the left, qmantis-express is an npm package with a
configured express-graphql server that is responsible
for generating the metrics and traces. On the right,
qmantis-compose lets the user deploy the QMantis
architecture on their own server using Docker in order to process,
export, store, and visualize incoming metrics and traces.
Now, to provide a better understanding, let us break up this
architecture into key phases.
The QMantis architecture is split up into four major conceptual phases:
Phase 1: Generate Traces and Metrics with a GraphQL server and OpenTelemetry SDKs
Phase 2: Export Data for Processing and Storage
Phase 3: Store Data with Promscale / TimescaleDB
Phase 4: Visualize / Analyze Data with Grafana
5.3.1 Phase 1: Generate Data with the qmantis-express GraphQL server
and OpenTelemetry SDKs
The first phase of the QMantis architecture, which is part of the
qmantis-express component, is to generate the metrics
and traces. In order to eventually visualize the data on the
dashboard, the data has to be created first. QMantis uses
OpenTelemetry to generate and collect telemetry data such as metrics
and traces.
OpenTelemetry is an open-source project that is composed of a
collection of tools, including APIs and SDKs. It is the current
leading developer tool kit for generating, collecting, and exporting
telemetry data [5]. In QMantis' use case, we use OpenTelemetry to instrument a
GraphQL API.
To create the telemetry data, we configured an existing GraphQL
server with observability extensions, added custom express
middleware functions, and added two pre-configured OpenTelemetry
Node SDKs (one for tracing, the other for metrics).
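A minimal sketch of that wiring, assuming current OpenTelemetry JavaScript package names (illustrative, not QMantis' exact configuration):

  const { NodeSDK } = require('@opentelemetry/sdk-node');
  const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
  const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
  const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

  // Start the tracing SDK before the GraphQL server is loaded so the
  // parse, validate, and execute phases are captured as spans.
  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
    instrumentations: [new HttpInstrumentation(), new GraphQLInstrumentation()],
  });
  sdk.start();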
This is a crucial phase of the QMantis architecture because it is
when we ensure the data we generate is specific to GraphQL. For
example, the qmantis-express component checks the
response object to see if there is an errors property. If there is
one, it updates the error rate counter. Also, one of the panels in the QMantis dashboard shows all traces that contain an error. As
mentioned earlier, developers cannot rely on a response status code
to filter traces for errors, as most GraphQL responses, regardless
of their success, return a status code of 200. To address this, we
added custom error tags as needed, making it easy to filter and
visualize all traces with errors.
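A hypothetical Express middleware in that spirit; the error counter is an assumed OpenTelemetry instrument:

  // Wrap res.send so the response body can be inspected before it is
  // sent; any "errors" key counts as an error, whatever the status code.
  function countGraphQLErrors(req, res, next) {
    const send = res.send.bind(res);
    res.send = (body) => {
      try {
        const payload = typeof body === 'string' ? JSON.parse(body) : body;
        if (payload && payload.errors) {
          errorCounter.add(1); // assumed counter from the metrics SDK
        }
      } catch (e) {
        // Non-JSON body: nothing to inspect.
      }
      return send(body);
    };
    next();
  }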
Additionally, the majority of GraphQL requests are POST requests.
The default name of a trace is the corresponding HTTP request method; however, this is not meaningful to the user, since the majority of traces would be named “HTTP POST”. Therefore, our team
altered each trace so that the user could distinguish between traces
more easily. Our custom middleware fetches the operation name from
the request body. If the user does not provide a name, then the
operation name becomes the operation type.
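A sketch of that renaming logic; the regex and fallback details are assumptions about the implementation:

  const { trace, context } = require('@opentelemetry/api');

  // Rename the active span from the default "HTTP POST" to the operation
  // name in the request body, falling back to the operation type.
  function nameTraceAfterOperation(req, res, next) {
    const body = req.body || {};
    const match = /^\s*(query|mutation|subscription)\b\s*(\w+)?/.exec(body.query || '');
    const type = match ? match[1] : 'query'; // shorthand queries default to "query"
    const name = body.operationName || (match && match[2]) || type;
    const activeSpan = trace.getSpan(context.active());
    if (activeSpan) activeSpan.updateName(name);
    next();
  }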
5.3.2 Phase 2: Export to the OpenTelemetry Collector and Prometheus
The Export phase, the second phase of the QMantis architecture, is part of the qmantis-compose component. Once telemetry data is created, it needs to be exported before anything useful can be done with it.
In the case of traces, the data is exported to the OpenTelemetry
Collector. The OpenTelemetry Collector is a central component that
handles additional steps such as batching and processing of traces
[6]. It consists of three main elements: receivers, processors, and
exporters, as seen below:
A receiver is how data gets into the Collector, and an exporter is how data is sent to a specified destination. Processors run on data between reception and export, transforming it into a common format or adding metadata.
For the purposes of the QMantis architecture, the Collector takes over the responsibility of managing collected data from the application, which reduces overhead. Using the Collector, our custom processors
add useful GraphQL information to each trace. Some of the spans in a
trace are given customized names, so, for example, users can
distinguish one request's field resolver from another without having
to click into each span for more information.
In the case of metrics, they are exported to Prometheus, which is an
open-source systems monitoring service that collects and stores
metrics as time-series data, meaning metrics information is stored
with the timestamp at which it was recorded. The purpose of
exporting metrics to Prometheus is to make the data compatible for
the next phase.
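One way to do this with the OpenTelemetry JavaScript packages is to expose a scrape endpoint for Prometheus; the port and package choices here are assumptions:

  const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
  const { MeterProvider } = require('@opentelemetry/sdk-metrics');

  // Expose collected metrics at http://localhost:9464/metrics, where
  // Prometheus scrapes them and stores them as time-series data.
  const exporter = new PrometheusExporter({ port: 9464 });
  const meterProvider = new MeterProvider();
  meterProvider.addMetricReader(exporter);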
Now that the data has been generated and processed into a readable
and more meaningful format, making it easier to work with for
observability purposes, this data can be exported to long-term
storage, which is the following phase.
5.3.3 Phase 3: Storage with Promscale / TimescaleDB
The third phase of the QMantis architecture, which is part of the
qmantis-compose component, is storage. Before we can
visualize and analyze telemetry data on Grafana, data needs to be
stored somewhere. Long-term storage is useful for users because they
can access not only current data, but also older data, which will be
very helpful if they want to analyze trends in their GraphQL API.
As time is a key component of how QMantis generates and reads data,
we chose to use a time-series database, as it is optimized for
time-stamped data. Time matters: if there is a potential problem, it is important to know whether it happened at 1 pm or 8 pm, and if a user wants to do maintenance during downtime, they need to know when the downtimes occur. Also, time-series databases support a high rate of
inserts and a low rate of modifications and deletions. This is a
great fit for QMantis' use case because when a metric or trace is
created, there is no need to ever change it—the user only needs to
insert it and read it.
Specifically, QMantis uses TimescaleDB, an open-source time-series
database powered by SQL, with Promscale, a backend extension that
connects telemetry data to TimescaleDB. This database has no time
limit or cost involved in storing telemetry data for users.
Moreover, it allows us to keep everything in one place, therefore
simplifying our architecture, since we can store both metrics and
traces data on the same database.
Another main benefit of using TimescaleDB is that it is built on PostgreSQL, an established data store that is widely used among developers. This choice makes it easier for users to understand how their data is stored. Also,
using Promscale and Timescale for data storage allows users to
easily connect that data to Grafana, giving them the ability to
fully leverage Grafana's built-in queries.
5.3.4 Phase 4: Visualize / Analyze with Grafana
Finally, the last phase of the
qmantis-compose component involves visualizing and
analyzing data. To visualize the data, our team chose to use
Grafana, a popular open-source tool for observability dashboards
that comes with many options and functionalities. Our team
pre-configured an observability dashboard so that users can view and
analyze their application's request rate, request latency, error
rate, all their traces, and error traces. Grafana connects directly
to the Promscale connector and TimescaleDB in order to obtain
metrics data, as well as Jaeger, an open-source tool for end-to-end
distributed tracing, to query the traces data. In the design
decisions section, we will discuss in detail why we chose Grafana as
our visualization tool.
5.3.5 Overall View of QMantis Architecture's Components
To recap, the above illustrates the QMantis architecture and its two
main components: qmantis-express (which consists of the
Generate phase) and qmantis-compose (which consists of
the Export, Store, and Visualize phases).
In summary, qmantis-express is an npm package, composed of OpenTelemetry SDKs and a GraphQL server observability configuration. It is responsible for generating the metrics and traces data. qmantis-compose lets the user quickly and
seamlessly deploy the QMantis architecture on their own server using
Docker. It contains all the tools necessary to export, store, and
visualize GraphQL-specific observability data.
6. Design Decisions
Let us now go over some of the important design decisions we made
while developing QMantis.
6.1 Grafana vs Custom UI
When it came to visualizing data, our team considered two
options—building a custom user interface or using Grafana. A custom
UI would have provided two main benefits: we could have branded the
UI to add a QMantis-specific design, and we could have displayed
data more intuitively, especially for a first-time user. On the other hand, we would have had to maintain a whole extra code base, and any updates to the UI would have demanded a lot of time.
Grafana, an industry-standard observability tool that many
developers are already familiar with, is extremely customizable.
Once a user has QMantis running, they can change the dashboard and
query settings to best fit their needs. To give an example, projects
at different points in the development cycle may want to query data
by different lengths of time, and Grafana makes that very easy.
Moreover, if our team decides to update the dashboard or panels in
the future, reconfiguration would be relatively simple.
Ultimately, the pros of building a custom UI did not outweigh the
pros of using Grafana, so we chose to go with Grafana.
6.2 GraphQL Server Choice
Early on in the process of building QMantis, our team considered
building our own GraphQL server. This would have given us control
over the implementation of the server and made it simple to generate
and collect data exactly how we wanted. However, if a developer had
an already existing GraphQL API, they would have had to change their
code to work with our less popular server in order to gain
observability. Taking this into consideration, we decided it would
be more useful to build a tool that could work with an already
widely used GraphQL server.
Another option was to use apollo-server, a popular
GraphQL server package. Note that this is different from the fully
managed Apollo Studio observability service. Our team considered
using apollo-server to give users the option to use a
GraphQL server that was framework-agnostic for Node.js. However,
this option had a major downside—the way
apollo-server is implemented makes it very difficult to
extract meaningful trace data, even when creating custom plugins to
extract this information.
In the end, we landed on express-graphql, a popular
JavaScript server for GraphQL, used by millions of developers around
the world. express-graphql does not currently have any out-of-the-box observability options, paid or unpaid, and it is highly customizable, allowing developers to create extensions
with observability features. With this in mind, we built the first
iteration of QMantis using an extended
express-graphql server, providing an observability
service packaged with an already popular GraphQL server.
7. Implementation Challenges
7.1 Generating Metrics Data from express-graphql
Using express-graphql as our GraphQL server came with
significant challenges. As we did not have control over the
implementation of the server, our team had to be creative in order
to collect the needed data from request and response
objects—information that is essential to generating accurate and
GraphQL-specific telemetry data, a key feature of QMantis.
The two most difficult pieces of observability metrics data to
obtain were: the error count and the latency data for requests that
had 4xx or 5xx status codes. To get accurate information, the server
needed to access the response object before it was sent back to the
client. To do that, we relied on a function from the
express-graphql server to access the response object.
However, when the status code was 4xx or 5xx, the
express-graphql server did not execute that function,
so it no longer checked for errors or latency information.
At this point, we considered modifying the QMantis architecture and
even prototyped a custom GraphQL server to get this data. This
approach worked, and we were able to obtain the needed data, but it
came with the limitations mentioned above. Ultimately, we found
better solutions to these challenges: using an existing npm package
called response-time to obtain latency data, and
passing trace data to a custom function to accurately count errors.
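As a sketch, the response-time package invokes a callback with the elapsed milliseconds even for 4xx and 5xx responses; the histogram is an assumed instrument, and app is the Express application:

  const responseTime = require('response-time');

  // response-time measures from request arrival until the response
  // headers are written, including error responses.
  app.use(responseTime((req, res, time) => {
    latencyHistogram.record(time, { status: res.statusCode });
  }));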
7.2 Connecting Metrics with Traces
Another significant challenge was connecting metrics with traces.
Initially, we wanted to connect metrics with traces to make it easy
for the developer to go from spotting a potential problem to getting
detailed trace information. How would that work? When the developer
spots something worth investigating, such as an outlier metric data
point, or an increase in the error rate, they would be able to click
on the datapoint, taking them to a list of traces that occurred
within that time frame.
After a lot of research, we found that the only way to have this
feature in Grafana was to use Grafana Tempo, Grafana's solution for
a traces backend. Currently, Grafana Tempo is only compatible with
an object storage database, such as Google Cloud Storage, Amazon S3,
or Azure Blob Storage. That meant we would need a different database
to store traces data, adding extra complexity to the QMantis
architecture. Another option was to build a custom UI, which was not
ideal, as discussed previously.
This left us with the challenge of making it easy to go from metrics
to traces in our dashboard. To overcome this obstacle, we created
two trace panels, one that shows traces by latency, making it easy
to spot the outliers, while the other only shows error traces. As
both metrics and traces have a time-stamp, it is easy for the user
to filter the traces panels to look for the information they need.
It is also worth noting that, as of early 2022, the leading paid
observability solutions for GraphQL do not currently support linking
metrics and traces together.
8. Installing and Using QMantis
Starting up QMantis is simple and can be done in a few steps:
1. Install the qmantis-express npm package. If express-graphql is not already installed, that needs to be installed as well.
2. Clone the
qmantis-compose
repository. This repository can be cloned anywhere on your machine.
3. Import the QMantis variables
into your “main” file. This is the file that imports the schema and
starts up the express-graphql server for your API. It
is crucial to note that all of the QMantis variables need to be
imported before the express-graphql server, or there
may be some issues with collecting traces.
4. Set up the route handlers for your /graphql endpoint. Note that the QMantis registerLatency function needs to be passed to the responseTime function as an argument, and the main qMantis function needs to take the GraphQL schema as an argument and be passed to the express-graphql server (see the sketch after these steps).
5. Initialize the QMantis Docker container. cd into the qmantis-compose repository and run docker compose up. The Docker images will start to show in the terminal, and the qmantis-compose container with all the running images can be viewed in Docker Desktop.
6. That is it for setup! Now, you
will need some data before you can visualize it. If you have a
production server and are using QMantis with it, you can simply
deploy it and wait for requests to come in. If you are still in a
development environment, you can use GraphiQL or another tool of
your choice to make requests.
7. Once requests start to come in,
navigate to localhost:3000 and view the prebuilt
QMantis dashboard in Grafana.
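Putting steps 3 and 4 together, a "main" file might look like the sketch below. Only qMantis, registerLatency, and the import-order rule come from the steps above; the rest is an assumed minimal Express app.

  // Import QMantis before express-graphql so trace collection hooks in early.
  const { qMantis, registerLatency } = require('qmantis-express');
  const responseTime = require('response-time');
  const express = require('express');
  const { graphqlHTTP } = require('express-graphql');
  const schema = require('./schema');

  const app = express();

  // Step 4: register latency tracking, then hand the schema to qMantis
  // and pass the result to the express-graphql server.
  app.use('/graphql', responseTime(registerLatency));
  app.use('/graphql', graphqlHTTP(qMantis(schema)));

  app.listen(4000);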
9. Future Work
9.1 Expanding Metrics
Our team believes adding certain datasets to the dashboard could
improve the usefulness of QMantis. Namely, we would like to be able
to see how often certain operations are requested, and the average
latency of specific operations. These features would allow
developers to get a quick read on the efficiency of individual
operations, and prioritize potential problems based on how often an
operation is requested.
9.2 Connecting Metrics and Traces
As discussed earlier, metrics and traces are not directly linked in
the current iteration of the QMantis dashboard. This will not be a
simple issue to solve, but our team is interested in continuing to
explore ways to offer this feature. In the long run, linking metrics
and traces together would be incredibly useful for quickly and
correctly diagnosing application issues.
9.3 Adding More GraphQL Server Types
Our team built the first iteration of QMantis using a customized
express-graphql server but would like to continue
working on the aforementioned customized
apollo-server prototype to give it fully functional
observability features, as this would give developers the
flexibility to work with various Node.js frameworks, aside from just
Express. Another option would be to add servers that work for other
languages, such as Ruby and Go.
9.4 Incorporating Alerts
Many observability platforms offer alerts via text, Slack, or email
when an issue arises. This is an interesting feature that could be
added to QMantis in the future, so users can add automated alerts
that are triggered by specific problems. Alerts would offer a nice
alternative to constantly having to check the Grafana dashboard for
issues.