The observability team responsible for the health of Marcus, Goldman Sachs' new digital saving and lending platform, had to stand up a Splunk instance quickly to monitor its logs and establish key health metrics for its systems as the service scaled rapidly.
Since launching in the UK in August last year, Marcus has amassed more than 250,000 UK users and £8 billion in deposits. The digital-only platform is often referred to as a 'fintech' within the bank, as it was built in the cloud from scratch in just 11 months. The business was given the freedom to build key applications using microservices, operate a DevOps model and adopt best-of-breed cloud-native tools, predominantly from the Amazon Web Services (AWS) stack.
Speaking at Splunk's .conf event in Las Vegas last week, Maria Loginova, VP of observability at Goldman Sachs, talked about how the financial services giant quickly implemented Splunk monitoring tools to maintain the health of its services ahead of launching Marcus.
Loginova heads up a six-person observability team which is responsible for the health of all systems at Marcus, including the all-important customer-impacting functions like account onboarding and loan origination.
So why did she choose Splunk over other logging tools like Sumo Logic or Loggly? The short answer is familiarity.
"I do have a long history with Splunk," she told Computerworld. "When I joined Marcus three years ago I ended up suggesting Splunk for logs as I knew it from previous roles and that it would answer the requirements."
Now, "Splunk is our central logging repository for investigations, SLA reporting, fraud detection and business intelligence, capacity planning, all those functions," she added.
Setting up their instance
The team at Marcus opted to quickly launch a Splunk instance for its AWS environment using a handy CloudFormation template.
"Me and my team needed to be up and running literally day one, because the developers started writing code and they needed Splunk for their logs," Loginova said. "It took us less than a day to set this up, which was super surprising. As we entered this whole new world of AWS we were able to stand up a working instance of Splunk in less than one day."
Naturally the implementation wasn't literally plug and play, but it was certainly pretty straightforward.
"Of course we took our time designing our production Splunk, we were following all of the best practices for infrastructure-as-code," she added. "We needed this to be resilient and fault tolerant, so we use auto scaling groups so that new components can come up if the old one's fail. We did the installation across the different availability zones."
Once everything had been defined, it was simply a case of migrating the data from the CloudFormation-launched instance into one large Splunk cluster and shutting down the original CloudFormation indexers.
This approach didn't last very long though. "Since we're putting both production and non-production data through the same pipes into the same cluster, we found that stuff done in development can actually negatively impact production," Loginova said.
As a result the team built a separate indexer cluster for non-production data, and a small dev replica of its production Splunk instance to run experiments on. Everything can still be viewed in one place, but the data is flowing through separate pipelines and stored in different Splunk indexers.
Building a KPI dashboard
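One of the key projects for Loginova and her team was to build a master KPI-based dashboard for tracking the health of key services, such as user login or loan applications. The queries behind each KPI are held in a data store rather than being run by the dashboard itself.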
"We have a Python script that is scheduled to run via Splunk that pulls out those queries, checks if it's time for them to run or not – because some queries you might want to run them every minute, others you might want to run every hour – and kicks off those queries as they need to be run, puts the results back into the data store," she explained.
"So the dashboard is really quick to load, refreshes every minute and is super fast because it's not actually running any queries, just pulling the results out of the store. As a result we have this consolidated view of the high level functions and the KPIs that define its health."
Then if the team wants to investigate an issue it can dig in directly from this dashboard.
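The scheduling pattern Loginova describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Marcus team's script: the `KpiQuery` class, the `run_search` callable and the dictionary standing in for the data store are all hypothetical names, and a real version would call Splunk's search interface rather than a stub.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class KpiQuery:
    """One stored KPI query (hypothetical structure, for illustration only)."""
    name: str
    search: str                       # placeholder query text
    interval_s: int                   # how often the query should run, in seconds
    last_run: Optional[float] = None  # timestamp of the previous run, if any


def is_due(query: KpiQuery, now: float) -> bool:
    """A query is due if it has never run or its interval has elapsed."""
    return query.last_run is None or now - query.last_run >= query.interval_s


def run_due_queries(
    queries: List[KpiQuery],
    run_search: Callable[[str], Any],  # assumed stand-in for a real search call
    store: Dict[str, Dict[str, Any]],  # assumed stand-in for the results store
    now: float,
) -> List[str]:
    """Run every due query, write its result to the store, return the names run."""
    ran = []
    for query in queries:
        if is_due(query, now):
            store[query.name] = {"value": run_search(query.search), "updated_at": now}
            query.last_run = now
            ran.append(query.name)
    return ran
```

Called once a minute, `run_due_queries` would refresh a one-minute KPI on every pass and an hourly KPI only once an hour, while the dashboard reads nothing but the stored results.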
Investigating 503 errors
Early in the firm's use of Splunk, the observability team saw a sudden spike in 503 errors during periods of peak volume.
After investigating the problem, Loginova found a bottleneck at ingestion caused by a default limit. To get around it, she set up persistent queuing using spare disk space, which stopped the errors.
However, this drastically slowed down data ingestion, with data taking as long as 13 minutes to appear for developers whose service level agreement (SLA) called for less than one minute of latency.
By switching the queue to run in memory and adjusting the default ingest limits, Loginova was finally able to resolve the issue.
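Why a fixed ingest limit turns a traffic burst into 503 responses can be shown with a toy model. The sketch below is not Splunk's implementation and uses no Splunk settings: `BoundedIngestQueue`, its `max_events` limit and the `simulate_burst` helper are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Any, List


class BoundedIngestQueue:
    """Toy in-memory ingest queue that rejects events once it is full."""

    def __init__(self, max_events: int) -> None:
        self.max_events = max_events
        self._events: deque = deque()

    def offer(self, event: Any) -> int:
        """Return an HTTP-style status: 200 if queued, 503 if the queue is full."""
        if len(self._events) >= self.max_events:
            return 503
        self._events.append(event)
        return 200

    def drain(self, n: int) -> List[Any]:
        """Remove and return up to n events, oldest first (the indexing side)."""
        drained = []
        while self._events and len(drained) < n:
            drained.append(self._events.popleft())
        return drained


def simulate_burst(queue: BoundedIngestQueue, burst_size: int) -> int:
    """Send burst_size events with no draining and count the 503 responses."""
    statuses = [queue.offer(i) for i in range(burst_size)]
    return statuses.count(503)
```

In this model a burst of eight events against a limit of five produces three rejections, while the same burst against a limit of ten produces none. Keeping the larger queue in memory, as the team ended up doing, avoids the extra delay that disk-backed queuing introduced.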
In terms of key takeaways from the experiences of the Marcus team, Loginova spoke at length about the importance of establishing a "monitor of monitors" to track the Splunk environment.
"Anything on the ingestion side, search side, disk space coming up high, when the ingestion queues are starting to block. Monitoring this ourselves so that we can not only react to issues when they happen, but have an idea that we're breaching the capacity and react before it becomes an issue," she said.
Running Splunk also requires a broad skill set from everyone on the team. "As a Splunk admin you really have a large range of things that you're involved with, you are sysadmin, you are a network admin, a developer – there are a lot of roles that you get to play and a lot of room for creativity," she said.
Lastly, Loginova said that although the organisation is happy with its monitoring setup, "there is no end to data analysis. Once you have the data in there there's so many things you want to do with it."
Next she wants to define a multi-region AWS clustering strategy and is also considering a switch from Splunk's on-premise Enterprise tool to the more managed Splunk Cloud service.