This document's source location: References & Additional info for .conf23 PLA1335B "Getting Smarter about Splunk SmartStore"
Author: Nadine Miller, aka vraptor on Splunk Community Slack (join here: https://splk.it/slack) and redvelociraptor on github.com.
- SmartStore Architecture Overview
- The SmartStore Cache Manager
- How Indexing Works in SmartStore
- How Search Works in SmartStore
- Indexer Cluster Operations and SmartStore
The restrictions on SmartStore use as of July 2023 (refer to the Splunk docs topic About SmartStore for more details; a minimal indexes.conf sketch follows this list):
- Replication factor and search factor must be equal (for example, 3/3 or 2/2) if using indexer clustering
- Each index's home path and cold path must point to the same partition
- Some indexes.conf settings are restricted or incompatible; refer to Settings in indexes.conf that are incompatible with SmartStore or otherwise restricted
- Converting to SmartStore has no roll-back capability.
- In a multisite indexer cluster, any SmartStore-enabled index must have search affinity disabled if report acceleration or data model acceleration is used (for example, set all search heads to site0).
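For orientation, here is a minimal indexes.conf sketch of a SmartStore-enabled index using an S3 remote volume. The bucket, endpoint, and index names are placeholders, and in production you would typically grant access via an IAM role rather than static keys:

[volume:remote_store]
storageType = remote
path = s3://my-smartstore-bucket/indexes
# Credentials are usually supplied via an IAM role; remote.s3.access_key and
# remote.s3.secret_key can be set here if you must use static keys.
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com

[my_index]
remotePath = volume:remote_store/$_index_name
homePath = $SPLUNK_DB/my_index/db
# coldPath must still be defined and, per the restriction above, must sit on
# the same partition as homePath, even though SmartStore keeps warm data in
# the remote store.
coldPath = $SPLUNK_DB/my_index/colddb
thawedPath = $SPLUNK_DB/my_index/thaweddb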
Properly setting the "great eight" is key to getting your data onboarded correctly. Explicitly setting these has the additional benefit of lowering ingestion overhead on your indexers, because Splunk doesn't have to run every single event through its internal analysis routines.
IMO, your priorities for properly onboarding data, in order:
- Time stamps properly extracted
- Correct line breaking
- Correct line merging
- Correct truncation
- Correct sourcetype
- Correct index
Recall the Great Eight (a sample props.conf sketch follows this list):
TIME_PREFIX
MAX_TIMESTAMP_LOOKAHEAD
TIME_FORMAT
SHOULD_LINEMERGE
LINE_BREAKER
TRUNCATE
TZ
EVENT_BREAKER_ENABLE
EVENT_BREAKER
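A minimal props.conf sketch applying these settings, assuming a hypothetical sourcetype my_app:log whose events start with an ISO 8601 timestamp; the exact values depend on your data:

[my_app:log]
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z
TZ = UTC
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
# EVENT_BREAKER settings apply on the Universal Forwarder and keep whole
# events together when the output stream switches indexers.
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)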
Create a test instance so as not to disrupt production servers while testing data onboarding configurations. This can be as simple as a VM on your workstation, but a VM or server that can handle all of the ingestion configurations from your production environment will speed up the process, since it allows you to identify configuration layering issues more quickly.
Reference: https://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf
Aplura has a handy reference for GDI (getting data in) available: Data Onboarding Cheat Sheet
Many still use forceTimebasedAutoLB on Universal Forwarders to send data to indexers, as many versions ago it was the only way to prevent UFs from "pinning" to indexers. Pinning resulted in poor distribution of data or, in some cases, overwhelmed indexer queues. forceTimebasedAutoLB should no longer be used. With the introduction of EVENT_BREAKER and EVENT_BREAKER_ENABLE, combined with either autoLBFrequency or autoLBVolume, data will be properly balanced across indexers without risk of overwhelming a single indexer.
forceTimebasedAutoLB can cause data to be dropped, even with useACK enabled. And when applied to high-volume data, it nearly always causes some events to be broken mid-event, with pieces of the same event landing on two different indexers. A Splunk blog post dives into a detailed technical discussion of forceTimebasedAutoLB and the risks: Splunk Forwarders and Forced Time Based Load Balancing
Also note: never set autoLBFrequency lower than 30 seconds. Doing so will cause your indexers to spend more time creating and tearing down network connections than indexing data. It's more effective to increase autoLBFrequency to 60 or even 90 seconds, or to switch to autoLBVolume.
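As a sketch of the forwarder-side settings described above (hypothetical indexer names and sourcetype; adjust for your environment), on the Universal Forwarder:

outputs.conf:
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
autoLBFrequency = 60
# or balance on volume instead, e.g. autoLBVolume = 1048576
# note: forceTimebasedAutoLB is deliberately not set

props.conf:
[my_app:log]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)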
If you use the HEC indexer acknowledgement feature, ensure the following:
- useACK is enabled at your indexers for the HEC endpoint you are using. (For example, in Splunk Cloud Platform, useACK is only enabled on the Firehose endpoint.)
- Any load balancer in front of your indexers has sticky sessions enabled.
Failure to have the indexers and load balancers properly configured will cause:
- Duplicated data. Since the forwarder doesn't receive an ack from the indexer, it resends the data.
- In an environment where health checks are used to automate server reprovisioning or to temporarily quarantine servers for recovery, indexing may stop. As the ack queue fills on an indexer, it will fail to respond to health check requests from the environment's load balancer (e.g., ELB in AWS) or other health monitoring system. In the worst case, if the misconfiguration is not resolved, the entire indexer cluster may stop indexing data. Rolling restarts temporarily resolve the issue until the ack queues fill again.
Refer to About HTTP Event Collector Indexer Acknowledgment in Splunk docs to learn how it works, how it differs from forwarder useACK, and the complete details for implementing it.
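A minimal inputs.conf sketch of the indexer-side HEC configuration described above, with acknowledgment enabled; the token name and value are placeholders:

[http]
disabled = 0

[http://my_hec_token]
token = <generated-guid>
useACK = 1
index = main
# When useACK is enabled, clients must send an X-Splunk-Request-Channel header
# and poll /services/collector/ack to confirm delivery.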
See the dashboards in this document's repo for methods to identify bucket rolls, bucket sizes, and HEC issues. To use the HEC issues dashboard you must also download hec_reply_codes.csv and set it up as a lookup table.
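One way to set that up, as a sketch: upload hec_reply_codes.csv into an app's lookups directory and define it in that app's transforms.conf (the lookup name here is an assumption; use whatever name the dashboard's searches expect):

[hec_reply_codes]
filename = hec_reply_codes.csv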
Create a "data dictionary," which in a minimal form outlines the data each index contains. Make it easy for end users to find this information by either creating a custom landing app for each role, or a generic one for all users, with a link to this document, or if your index count is smallish embed in the landing page itself. Bonus points: create a lookup, kvstore, or CI/CD pipeline which present individualized information based on roles automatically.
In a more complete form, a data dictionary could go much further, providing information on the fields in the records contained in the index, the relationship of these fields to the servers and applications creating the data, and who is responsible for ownership of the data.
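As a sketch of the KV Store approach, with hypothetical collection and field names (shape the schema to whatever your teams actually track):

collections.conf:
[data_dictionary]
field.index_name = string
field.description = string
field.data_owner = string

transforms.conf:
[data_dictionary_lookup]
external_type = kvstore
collection = data_dictionary
fields_list = _key, index_name, description, data_owner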
Resources for building data dictionaries:
- Splunk Blog: Data Dictionary
- LAME Creations: Splunk Using a Custom Dashboard and KVs to Create a Data Dictionary (This channel also has a good series on bridging the gap from Splunk Search training to real-world use for security analysts, which could be helpful for end users as well. The channel owner won "The Guide" Splunkie Award.)