Build a Low Cost Serverless Linux SIEM with Vector, S3, and ClickHouse (Local)

It has been over five years since I first used Vector, but only in the last several months have I started to take advantage of ClickHouse with S3 as I mentioned on LinkedIn last month.

I also recently stood up a website with Ghost CMS for a non-profit and wanted something cheap and easy (off box) to analyze Nginx and Linux logs.

Since I knew that Vector automatically parses Nginx and journald logs and can easily write to S3, I gave it a shot.

Create your S3 Bucket & IAM Keypair

Use my CloudFormation template to create the S3 Bucket and IAM credentials that you'll need to pass to Vector to be able to write JSON events to the bucket.

NOTE: Since CloudFormation outputs are unencrypted, I've gotten into the habit of writing the secret key to an SSM Parameter so you can delete it later.

  S3UserAccessKey:
    Type: AWS::IAM::AccessKey
    Properties:
      UserName: !Ref S3User

  BasicParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub ${S3Bucket}-${S3UserAccessKey}
      Value: !GetAtt S3UserAccessKey.SecretAccessKey
      Type: String
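
For reference, the bucket and user that those resources point at look roughly like the following. This is a minimal sketch rather than the full template: the policy name and exact permission scope are my assumptions, so treat the linked template as the source of truth.

  S3Bucket:
    Type: AWS::S3::Bucket

  S3User:
    Type: AWS::IAM::User
    Properties:
      Policies:
        - PolicyName: vector-s3-write
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:ListBucket
                Resource:
                  - !GetAtt S3Bucket.Arn
                  - !Sub ${S3Bucket.Arn}/*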

Install Vector & Set your Environment Variables

Install the latest Vector using your method of choice. I use the APT install script, and it has worked fine across both ARM and Intel systems.
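
From memory, the APT route looks something like the snippet below; the setup script URL and package name can change, so double-check the Vector installation docs before running it.

# Add the Vector APT repository (verify the script URL against the official docs)
bash -c "$(curl -L https://setup.vector.dev)"

# Install and enable the service
sudo apt-get install vector
sudo systemctl enable --now vector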

You'll put these in /etc/default/vector so you can avoid putting credentials directly in your Vector configuration.

AWS_ACCESS_KEY_ID=AKIAXXX
AWS_SECRET_ACCESS_KEY=YYYYY
VECTOR_S3_BUCKET=my-log-bucket

/etc/default/vector
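
The Debian/Ubuntu package's systemd unit typically reads this file via EnvironmentFile, so restarting the service picks up changes; if your unit doesn't reference /etc/default/vector, wire that up first. A quick sanity check after editing:

sudo systemctl restart vector
sudo systemctl status vector      # confirm the service came back up cleanly
sudo journalctl -u vector -f      # tail Vector's own logs while you iterate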

Define your Bucket "Path Schema"

We'll start at the end, focusing on what is most critical to how you will search logs with ClickHouse Local: the object path.

We'll get into that later. Trust me. It matters.

sinks:
    s3_bucket:
        type: aws_s3
        inputs:
            - common
        bucket: "${VECTOR_S3_BUCKET}"
        key_prefix: "{{hostname}}/{{event_type}}/%D/"
        region: "us-east-1"
        compression: gzip
        encoding:
           codec: json

This will create a directory structure like this once we have all the event_types defined. These are essentially your "tables."

Events on each host

The %D populates the date portion of the path (month/day/year), so you can choose what to search over, such as:

  • asus-pn50/auditd/03/12/25/1741812841-d797e698-3302-4e63-a863-6f4292d04d16.log.gz
  • opti3070/journald/03/04/25/1741798470-5f7b4bd9-abb4-42a6-9816-b82c4e7bbe96.log.gz
  • ubuntu-s-1vcpu-2gb-nyc3-01/nginx-access/03/13/25/1741824471-f39b039f-be2f-4723-b74f-f5b7e3a495ec.log.gz

(This is where it helps to have hostname conventions!)

When you form your SELECT or DESCRIBE in ClickHouse, you'll use wildcards to retrieve only a subset of objects.
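
For example, pinning the first two path segments to a single host and event type (the hostname here is just one from the listing above) keeps ClickHouse from pulling every object in the bucket:

SELECT count(*)
FROM s3('https://bucket_name.s3.us-east-1.amazonaws.com/asus-pn50/auditd/*/*/*/*.log.gz')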

Define your Data Sources

For basic file monitoring I use the file source and journald to avoid parsing syslog-style messages.

sources:
    journald:
        type: journald

    nginx_access:
        type: file
        include:
            - /var/log/nginx/access.log*
    nginx_error:
        type: file
        include:
            - /var/log/nginx/error.log*
    auditd:
        type: file
        include:
            - /var/log/audit/audit.log
    auth_log:
        type: file
        include:
            - /var/log/auth.log

This will create state files in /var/lib/vector like the following

root@ubuntu-s-1vcpu-2gb-nyc3-01:/var/lib/vector# ls
auditd  auth_log  journald  nginx_access  nginx_error  var_log
root@ubuntu-s-1vcpu-2gb-nyc3-01:/var/lib/vector# ls -al
total 32
drwxr-xr-x  8 vector vector 4096 Mar 13 01:35 .
drwxr-xr-x 46 root   root   4096 Mar 12 13:09 ..
drwxr-xr-x  2 vector vector 4096 Mar 13 01:35 auditd
drwxr-xr-x  2 vector vector 4096 Mar 13 01:35 auth_log
drwxr-xr-x  2 vector vector 4096 Mar 12 16:23 journald
drwxr-xr-x  2 vector vector 4096 Mar 13 01:37 nginx_access
drwxr-xr-x  2 vector vector 4096 Mar 13 01:35 nginx_error

As I was getting the parsers (and the path structure) just right, I would stop Vector, delete the objects from the bucket, then delete whichever state directory applied, and the logs would be re-ingested on the next start. S3 writes are not immediate, and it can take up to 5-10 minutes for objects to appear during your initial ingestion.
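
Most of that delay is the sink's batching: as I recall, the aws_s3 sink flushes a batch only after a timeout (roughly five minutes by default) or a size threshold. While iterating you can shrink those knobs so objects land faster; the values below are illustrative, not recommendations.

sinks:
    s3_bucket:
        type: aws_s3
        # ...bucket, key_prefix, region, compression, encoding as shown earlier...
        batch:
            timeout_secs: 60      # flush at least once a minute while testing
            max_bytes: 1000000    # or after ~1 MB, whichever comes first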

Parse and Transform

See the end of the post for links to the full code. Transforms do the heavy lifting in VRL, and I only used some very simple remaps to parse key-value pairs in auditd and add an event_type field to each message. In the case of auditd_parse I deleted the "msg" field because it was not ingesting cleanly in ClickHouse, but I'll need to address that later because that field is the only part of the event that contains a timestamp.

transforms:
    auditd_parse:
        type: remap
        inputs:
            - auditd
        source: |
            . = parse_key_value!(.message)
            .event_type = "auditd"
            del(.msg)
            .hostname = get_hostname!()

    auth_parse:
        type: remap
        inputs:
            - auth_log
        source: |
            . = parse_linux_authorization!(.message)
            .event_type = "auth"

    journald_parse:
        type: remap
        inputs:
            - journald
        source: |
            .event_type = "journald"

    common:
        type: remap
        inputs:
            - auth_parse
            - auditd_parse
            - journald_parse
        source: |
            .hostname = get_hostname!()

The common transform at the end allows me to add the common hostname field onto each message.

Parser Development & Testing

Although you can test your VRL on a live pipeline, the best way to develop it is to run vector vrl with an input JSON stream and the VRL program, and inspect the output, like this. See the scripts and sample data in my vrl-devtest repo to test this out. This lets you run across a larger dataset to ensure you are properly handling errors.

This is just the VRL code used by the remap.

.event_type = "auditd"
audit_msg = .msg[0]
audit_regex, err = parse_regex(audit_msg, r'^audit\((?P<time>\d+).*\):')
if err == null {
  .event_timestamp = audit_regex.time
}
del(.msg)
.hostname = get_hostname!()

You then execute with the following command.

vector vrl -i auditd.json -p auditd.vrl -o --print-warnings 
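
The input file is newline-delimited JSON, one event per line. A purely hypothetical event for this snippet (which assumes the key-value parsing has already happened, hence the .msg array) might look like:

{"type": "USER_LOGIN", "msg": ["audit(1741812841.123:456):", "'op=login acct=\"franz\" exe=\"/usr/sbin/sshd\" res=success'"], "pid": "1234", "uid": "0"}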

When developing the initial parser you can also use the VRL Playground to test your code, but it does not include all the functions, such as get_hostname().

LLMs were not very helpful for VRL code examples, but there are lots of GitHub projects, so searching for parse_regex vector language:yaml in GitHub code search is the best way to find examples.

Before you wire up S3, it does make sense to test locally with just file sources and a local sink (see the console sink sketch below). The run.sh is just a simple wrapper for vector -vv -c vector.yaml, and clean.sh removes the watcher files that are created so you can rapidly retest.
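
A minimal sketch of that local-only setup: swap the aws_s3 sink for a console sink (the sink name here is arbitrary) and watch the transformed events on stdout.

sinks:
    debug_console:
        type: console
        inputs:
            - common
        encoding:
            codec: json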

Assessing the Data in S3

Assuming you've set your AWS environment variables, ClickHouse will honor them.
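
A minimal way to get an interactive session, assuming you already have the clickhouse binary installed on your workstation:

export AWS_ACCESS_KEY_ID=AKIAXXX
export AWS_SECRET_ACCESS_KEY=YYYYY
clickhouse local    # interactive clickhouse-local session; the s3() queries below run from here

A first query to confirm everything landed and see which event types exist: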


SELECT DISTINCT event_type
FROM s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/*/*/*/*/*.log.gz')

Query id: cd07ac4e-9640-414b-be84-459fc84b8c3d

   ┌─event_type─┐
1. │ auditd     │
   └────────────┘
   ┌─event_type─┐
2. │ auth       │
   └────────────┘
   ┌─event_type─┐
3. │ journald   │
   └────────────┘

3 rows in set. Elapsed: 57.546 sec. Processed 6.20 million rows, 162.91 MB (107.76 thousand rows/s., 2.83 MB/s.)
Peak memory usage: 1.33 GiB.

Remember, the path structure allows you to search only a subset of the data. For example, this counts the various event types in March 2025:

SELECT
    count(*) AS cnt,
    event_type
FROM s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/*/03/*/25/*.log.gz')
GROUP BY event_type
ORDER BY cnt ASC

Query id: d1de464f-b6f0-406e-9fbd-20a7dcf0f80c

   ┌─────cnt─┬─event_type─┐
1. │    7717 │ auth       │
2. │   17697 │ auditd     │
3. │ 2181709 │ journald   │
   └─────────┴────────────┘

3 rows in set. Elapsed: 28.226 sec. Processed 2.21 million rows, 58.29 MB (78.19 thousand rows/s., 2.07 MB/s.)
Peak memory usage: 1.33 GiB.

Getting at the Schema

Simply use DESCRIBE to have ClickHouse infer the schema:

DESCRIBE TABLE s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/journald/*/*/*/*.log.gz')

The same works for auditd, and for auth, which is pretty sparse in comparison.
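
The corresponding queries just swap the event_type segment of the path:

DESCRIBE TABLE s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/auditd/*/*/*/*.log.gz')

DESCRIBE TABLE s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/auth/*/*/*/*.log.gz')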

Digging into the Data

Let's start with auth... Remember, DISTINCT is your friend when you start characterizing the data.

SELECT DISTINCT (appname, hostname)
FROM s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/auth/*/*/*/*.log.gz')
ORDER BY appname ASC

Query id: 0ad7a767-034d-4728-bfb7-731a0db313a0

    ┌─(appname, hostname)────────────────────────────┐
 1. │ ('CRON','asus-pn50')                           │
 2. │ ('CRON','franz-OptiPlex-7050')                 │
 3. │ ('CRON','opti3070')                            │
 4. │ ('chfn','opti3070')                            │
 5. │ ('dbus-daemon','franz-OptiPlex-7050')          │
 6. │ ('gdm-password]','franz-OptiPlex-7050')        │
 7. │ ('gnome-keyring-daemon','franz-OptiPlex-7050') │
 8. │ ('gpasswd','opti3070')                         │
 9. │ ('groupadd','opti3070')                        │
10. │ ('passwd','opti3070')                          │
11. │ ('pkexec','franz-OptiPlex-7050')               │
12. │ ('polkitd','opti3070')                         │
13. │ ('polkitd','asus-pn50')                        │
14. │ ('sshd','asus-pn50')                           │
15. │ ('sshd','opti3070')                            │
16. │ ('sshd','franz-OptiPlex-7050')                 │
17. │ ('sudo','opti3070')                            │
18. │ ('sudo','asus-pn50')                           │
19. │ ('sudo','franz-OptiPlex-7050')                 │
20. │ ('systemd-logind','asus-pn50')                 │
21. │ ('systemd-logind','opti3070')                  │
22. │ ('systemd-logind','franz-OptiPlex-7050')       │
23. │ ('systemd-machined','asus-pn50')               │
24. │ ('useradd','opti3070')                         │
25. │ ('userdel','opti3070')                         │
26. │ ('usermod','opti3070')                         │
    └────────────────────────────────────────────────┘

What we can already tell from just this table:

  • franz-OptiPlex-7050 is the only system running Ubuntu desktop
  • Users were added and deleted on opti3070

pkexec was only run on one system, so let's take a peek:

SELECT (timestamp, message)
FROM s3('https://bucket.s3.us-east-1.amazonaws.com/*/auth/*/*/*/*.log.gz')
WHERE appname = 'pkexec'
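
As one more sketch, here's a hypothetical hunt for failed SSH logins per host. The LIKE pattern is an assumption about what sshd writes into auth.log on these systems, so adjust it to your data:

SELECT
    count(*) AS attempts,
    hostname
FROM s3('https://bucket_name.s3.us-east-1.amazonaws.com/*/auth/*/*/*/*.log.gz')
WHERE (appname = 'sshd') AND (message LIKE '%Failed password%')
GROUP BY hostname
ORDER BY attempts DESC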

There is obviously a lot more you can do with the data depending on your SQL skills, but this shows you how to get started!

Putting it all together

data_dir: "/var/lib/vector"
api:
    enabled: true
sources:
    journald:
        type: journald

    auditd:
        type: file
        include:
            - /var/log/audit/audit.log

    auth_log:
        type: file
        include:
            - /var/log/auth.log

transforms:
    auditd_parse:
        type: remap
        inputs:
            - auditd
        source: |
            . = parse_key_value!(.message)
            .event_type = "auditd"
            audit_msg = .msg[0]
            audit_regex, err = parse_regex(audit_msg, r'^audit\((?P<time>\d+).*\):')
            if err == null {
              .event_timestamp = audit_regex.time
            }
            del(.msg)
            .hostname = get_hostname!()

    auth_parse:
        type: remap
        inputs:
            - auth_log
        source: |
            . = parse_linux_authorization!(.message)
            .event_type = "auth"

    journald_parse:
        type: remap
        inputs:
            - journald
        source: |
            .event_type = "journald"

    common:
        type: remap
        inputs:
            - auth_parse
            - auditd_parse
            - journald_parse
        source: |
            .hostname = get_hostname!()

sinks:
    s3_bucket:
        type: aws_s3
        inputs:
            - common
        bucket: "${VECTOR_S3_BUCKET}"
        key_prefix: "{{hostname}}/{{event_type}}/%D/"
        region: "us-east-1"
        compression: gzip
        encoding:
           codec: json