Friday, 27 December 2024

Exploring Sysbench: Installation and CPU Benchmarking

What is Sysbench?
Sysbench is designed for evaluating the performance of systems by executing CPU, memory, file I/O, and database benchmarks. It is highly customizable, and the results can be used for performance tuning, comparing different configurations, or validating hardware upgrades.
Installing Sysbench on CentOS and Ubuntu
Step 1: Install Dependencies
On CentOS/RHEL, run the following command to install the required build dependencies:
sudo yum install mysql-devel libtool automake autoconf
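On Ubuntu or Debian, the equivalent packages are typically installed with apt; the package names below are the commonly used ones and may vary slightly between releases (for example, libmysqlclient-dev may be provided as default-libmysqlclient-dev):
sudo apt-get install make automake libtool pkg-config libaio-dev libmysqlclient-dev libssl-dev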
Step 2: Download Sysbench
To download the Sysbench 1.0.20 release tarball from GitHub, you can use curl:
curl -LO https://github.com/akopytov/sysbench/releases/download/1.0.20/sysbench-1.0.20.tar.gz
Once the .tar.gz file has downloaded, extract it and change into the source directory:
tar -xzvf sysbench-1.0.20.tar.gz
cd sysbench-1.0.20/

Step 3: Build Sysbench from Source
./autogen.sh && ./configure && make -j && sudo make install
Step 4: Verify the Installation
sysbench --version
sysbench 1.0.20

Running the CPU Benchmark Test:
Sysbench provides a number of tests, but here we will focus on CPU performance. Sysbench's CPU test calculates prime numbers up to a specified limit (10,000 by default, controlled by --cpu-max-prime) and reports the number of events processed during the benchmark run.
Here’s how you can run the benchmark on the CPU:
Running Sysbench CPU Test with 1 Thread:
sysbench --test=cpu --threads=1 --time=60 run
Sample Output:
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
    events per second:  2955.46
General statistics:
    total time:                          60.0002s
    total number of events:              177333
Latency (ms):
        min:                                    0.33
        avg:                                    0.34
        max:                                    0.39
        95th percentile:                        0.35
        sum:                                59978.41
Threads fairness:
    events (avg/stddev):           177333.0000/0.00
    execution time (avg/stddev):   59.9784/0.00
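The warning at the top of the output appears because the --test option is deprecated. The same benchmark can be started by passing the test name directly on the command line, and the prime-number limit (10000 above) can be changed with --cpu-max-prime, for example:
sysbench cpu --threads=1 --time=60 --cpu-max-prime=10000 run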

Running Sysbench CPU Test with 2 Threads:
sysbench --test=cpu --threads=2 --time=60 run
Sample Output:
CPU speed:
    events per second:  5903.08
General statistics:
    total time:                          60.0002s
    total number of events:              354195
Latency (ms):
        min:                                    0.33
        avg:                                    0.34
        max:                                    1.29
        95th percentile:                        0.35
        sum:                               119915.68
Threads fairness:
    events (avg/stddev):           177097.5000/89.50
    execution time (avg/stddev):   59.9578/0.00

As you can see, doubling the thread count roughly doubles the events per second, which indicates the additional threads are being used effectively.
Running Sysbench CPU Test with 4 Threads:
Increasing the thread count further tests how well the system scales:
sysbench --test=cpu --threads=4 --time=60 run
Sample Output:
CPU speed:
    events per second: 11819.27
General statistics:
    total time:                          60.0003s
    total number of events:              709178
Latency (ms):
        min:                                    0.33
        avg:                                    0.34
        max:                                    1.28
        95th percentile:                        0.35
        sum:                               239860.45
Threads fairness:
    events (avg/stddev):           177294.5000/122.97
    execution time (avg/stddev):   59.9651/0.00


Running Sysbench CPU Test with 8 Threads:
sysbench --test=cpu --threads=8 --time=60 run
Sample Output:
CPU speed:
    events per second: 23637.61
General statistics:
    total time:                          60.0004s
    total number of events:              1418301
Latency (ms):
        min:                                    0.33
        avg:                                    0.34
        max:                                    1.28
        95th percentile:                        0.35
        sum:                               479730.48
Threads fairness:
    events (avg/stddev):           177287.6250/42.17
    execution time (avg/stddev):   59.9663/0.00


Running Sysbench CPU Test with 16 Threads:
sysbench --test=cpu --threads=16 --time=60 run
Sample Output:
CPU speed:
    events per second: 47267.52
General statistics:
    total time:                          60.0004s
    total number of events:              2836140
Latency (ms):
        min:                                    0.33
        avg:                                    0.34
        max:                                    1.42
        95th percentile:                        0.35
        sum:                               959459.02
Threads fairness:
    events (avg/stddev):           177287.6250/42.17
    execution time (avg/stddev):   59.9662/0.00

Observations:
    Threads and Performance: As the number of threads increases, the events per second increase proportionally. This shows how well your CPU can handle multi-threaded workloads.
    Latency: The latency remains fairly constant across different thread counts, which suggests that the CPU's ability to process each event is relatively consistent.
    Thread Fairness: The fairness remains stable, indicating that Sysbench is distributing tasks evenly across the threads.
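If you want to reproduce the whole sweep above in one go, the runs can be scripted with a small shell loop; this sketch simply prints the events-per-second line for each thread count:
for t in 1 2 4 8 16; do
    echo "=== ${t} thread(s) ==="
    sysbench cpu --threads="${t}" --time=60 run | grep "events per second"
done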

Thursday, 12 December 2024

PostgreSQL work_mem: What is work_mem? When to Consider Increasing work_mem?

PostgreSQL is renowned for its robust query optimization and memory management capabilities. One of the most critical yet often misunderstood parameters is work_mem. This memory setting can dramatically impact query performance, but it's not always clear when or how to tune it effectively.
In this post, we'll explore what work_mem is, how PostgreSQL uses it, and when you should consider increasing it to optimize your database performance.

What is work_mem?
work_mem is a PostgreSQL configuration parameter that sets the maximum amount of memory a single query operation (such as a sort or a hash table) may use before it spills to temporary disk files. Unlike shared_buffers, which is shared across all connections, work_mem is allocated per operation, per connection.

Key Characteristics:
- Per-operation allocation: Each sort, hash join, or merge operation gets its own work_mem
- Temporary memory: Used only during query execution
- Automatic cleanup: Memory is released when the operation completes
- Default value: 4MB (since PostgreSQL 9.4); 1MB in older versions

How PostgreSQL Uses work_mem:
PostgreSQL allocates work_mem for several types of operations:
1. Sort Operations
-- This query will use work_mem for sorting
SELECT * FROM large_table  ORDER BY created_at DESC LIMIT 1000;
2. Hash Joins
-- Hash joins use work_mem for building hash tables
SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id;
3. Merge Joins
-- Merge joins use work_mem for sorting inputs
SELECT * FROM table1 t1 JOIN table2 t2 ON t1.id = t2.ref_id;
4. Window Functions

-- Window functions may use work_mem for sorting
SELECT name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) as rank FROM employees;
5. Index Creation
-- Index builds also sort data, but they are capped by maintenance_work_mem rather than work_mem
CREATE INDEX idx_created_at ON large_table(created_at);
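Because index builds are capped by maintenance_work_mem, that is the parameter to raise before building an index on a large table; a minimal sketch for a single session:
-- Raise the limit for this session only, then build the index
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_created_at ON large_table(created_at);
RESET maintenance_work_mem;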

Memory Allocation in Practice:
-- This query might use work_mem multiple times:
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2023-01-01'
GROUP BY u.name
ORDER BY order_count DESC;

Memory allocation (each step may use up to work_mem):
 1. Hash join: 1 × work_mem
 2. GROUP BY aggregation (hash or sort): 1 × work_mem
 3. ORDER BY sorting: 1 × work_mem
    Total: up to 3 × work_mem for a single execution of this query


When to Consider Increasing work_mem:
1. Disk Spills (Temporary Files)
The most obvious sign that you need more work_mem is when PostgreSQL writes temporary files to disk:
-- Check for disk spills (temp_files and temp_bytes are tracked per database in pg_stat_database)
SELECT datname, temp_files, temp_bytes,
       temp_bytes / temp_files AS avg_temp_file_size
FROM pg_stat_database
WHERE temp_files > 0
ORDER BY temp_bytes DESC;

What to look for:
- temp_files > 0: operations have spilled to disk
- High temp_bytes: significant temporary disk usage
- Steadily growing counters: operations are consistently exceeding work_mem
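If you need the spill information per statement rather than per database, the pg_stat_statements extension (it must be installed and loaded) tracks temporary block I/O; a hedged example:
-- Statements that wrote temporary blocks, worst offenders first
SELECT query, calls, temp_blks_read, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 20;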

2. Slow Query Performance 
Monitor query execution plans for external sorts:
-- Check if sorts are using external files
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM large_table  ORDER BY created_at DESC;

Look for:
- An external sort in the execution plan (the Sort node's "Sort Method" line)
- High I/O time in the EXPLAIN ANALYZE output
- Temporary buffer reads and writes in the plan (shown when the BUFFERS option is enabled)
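As a rough illustration (the sizes below are made up), a sort that spills to disk is reported in the plan as an external merge, while the same sort with enough work_mem is reported as an in-memory quicksort:
Sort Method: external merge  Disk: 102400kB
Sort Method: quicksort  Memory: 131072kB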

3. Large Data Processing
Consider increasing work_mem when:
- Processing large datasets (>1GB)
- Running complex analytical queries
- Creating indexes on large tables
- Performing bulk data operations

4. Memory-Intensive Operations

Operations that typically benefit from more work_mem:
- Large sorts: ORDER BY on millions of rows
- Hash joins: Joining large tables
- Window functions: Complex analytical queries
- Aggregations: GROUP BY on large datasets

How to Configure work_mem
1. Check Current Setting
-- View current work_mem setting
SHOW work_mem;
-- View all memory-related settings
SELECT name, setting, unit, context FROM pg_settings WHERE name LIKE '%mem%';

2. Set work_mem for Session
-- Temporary setting for current session
SET work_mem = '256MB';
-- Verify the change
SHOW work_mem;

3. Set work_mem Globally
-- Set globally (requires superuser privileges)
ALTER SYSTEM SET work_mem = '256MB';

-- Reload configuration
SELECT pg_reload_conf();
-- work_mem is a user-level (reloadable) setting, so no restart is needed for this change
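To confirm the new value has been picked up and to see where it came from, you can query pg_settings:
-- Verify the current value and its source
SELECT name, setting, unit, context, source
FROM pg_settings
WHERE name = 'work_mem';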

4. Set work_mem in postgresql.conf
# In postgresql.conf
work_mem = 256MB
Reload the configuration after editing the file (for example with pg_ctl reload or SELECT pg_reload_conf()); a full restart is not required for work_mem.

Memory Calculation and Planning

Total Memory Usage
Worst-case total work_mem ≈ work_mem × max_connections × concurrent operations per connection
Example calculation:
- work_mem = 256MB
- max_connections = 100
- Average concurrent operations per connection = 2
- Total work_mem = 256MB × 100 × 2 = 51,200MB (roughly 50GB)
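The same arithmetic can be run directly against pg_settings (the setting column for work_mem is stored in kB); the factor of 2 for concurrent operations is just the assumption from the example above:
-- Rough worst-case work_mem demand, following the formula above
SELECT pg_size_pretty(
           (SELECT setting::bigint * 1024 FROM pg_settings WHERE name = 'work_mem')
         * (SELECT setting::bigint FROM pg_settings WHERE name = 'max_connections')
         * 2
       ) AS worst_case_work_mem;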

Check your memory configuration:
SELECT name, setting, unit,
       CASE
           WHEN name = 'shared_buffers' THEN 'Shared across all connections'
           WHEN name = 'work_mem' THEN 'Per operation, per connection'
           WHEN name = 'maintenance_work_mem' THEN 'Per maintenance operation'
       END AS description
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem');


Common Pitfalls to Avoid:
1. Setting Too High
- Can cause memory pressure
- May lead to OOM (Out of Memory) errors
- Reduces available memory for other operations
2. Ignoring Concurrent Operations
- Multiple operations per connection multiply memory usage
- Consider peak concurrent load, not just max_connections
3. Not Monitoring Long-term
- Performance can change as data grows
- Regular monitoring is essential
4. One-Size-Fits-All Approach
- Different workloads may need different settings
- Consider per-query tuning for critical operations (see the example below)
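A common pattern for per-query tuning is to raise work_mem only for the duration of one transaction with SET LOCAL, so the higher limit never applies to the rest of the workload; a minimal sketch using the earlier users/orders example:
BEGIN;
-- SET LOCAL lasts only until the end of this transaction
SET LOCAL work_mem = '512MB';
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name
ORDER BY order_count DESC;
COMMIT;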

Monitoring Queries:
-- Sessions currently executing queries (note: pg_stat_activity has no temp-file columns;
-- use pg_stat_database or pg_stat_statements for spill statistics, as shown earlier)
SELECT pid, usename, application_name, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active';


-- Review the key memory-related settings
SELECT name, setting, unit FROM pg_settings WHERE name IN ('work_mem', 'shared_buffers', 'effective_cache_size');

Friday, 6 December 2024

SLAs: The "Nines" of Uptime

Ever wondered how your favorite online services stay up and running almost all the time? A lot of it comes down to a Service Level Agreement (SLA). An SLA is a contract between a service provider and a customer that defines the level of service to be expected. It's a key document that sets expectations and provides recourse if those expectations aren't met. For an IT support person, understanding SLAs is crucial because it helps you know what to prioritize and what's at stake when something goes down.

What's the Big Deal with "Uptime"?
Uptime is the most common metric used in SLAs. It refers to the percentage of time a service is operational and available for use. The higher the percentage, the less downtime a service experiences. This is often expressed in "nines"—99%, 99.9%, and so on.

The difference between a few "nines" might seem insignificant, but it has a huge impact on real-world availability. A service with an uptime of 99% sounds good on paper, but when you break it down, it means the service can be down for over three and a half days a year. For a business that relies on a critical application, that amount of downtime can be catastrophic.

Decoding the "Nines":
To truly grasp the impact of each percentage, let's look at a breakdown of the downtime allowed for different uptime levels:
SLA Uptime              Daily Downtime    Weekly Downtime    Yearly Downtime
99% (Two Nines)         14.4 minutes      1.68 hours         3.65 days
99.9% (Three Nines)     1.44 minutes      10.08 minutes      8.77 hours
99.99% (Four Nines)     8.64 seconds      1.01 minutes       52.56 minutes
99.999% (Five Nines)    0.86 seconds      6.05 seconds       5.26 minutes
99.9999% (Six Nines)    0.086 seconds     0.61 seconds       31.54 seconds
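If you want to verify these figures yourself, the downtime budget for any SLA percentage can be computed with a quick awk one-liner (this sketch assumes a 365-day year, so it rounds slightly differently than the table):
sla=99.9
awk -v sla="$sla" 'BEGIN { secs = (100 - sla) / 100 * 365 * 24 * 3600;
    printf "%.2f hours (%.0f seconds) of downtime per year\n", secs / 3600, secs }'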

As an IT support professional, these numbers should be your north star. If you're managing a system with a 99.9% SLA, you know that every minute of downtime counts. A service outage that lasts just a few minutes could put you in breach of the SLA, potentially leading to financial penalties for your company. This is why you'll often hear about the concept of "five nines" (99.999%) in enterprise-level services. It represents a level of reliability that is almost perfect.

Why It Matters to You, the IT Pro
Understanding SLAs isn't just about memorizing a table; it's about shifting your mindset. It helps you:
    Prioritize incidents: A critical system with a strict SLA must be addressed immediately.
    Manage expectations: You can communicate realistic recovery times to stakeholders based on the SLA.
    Advocate for resources: If a service with a high-stakes SLA is struggling, you can use the numbers to justify the need for better infrastructure or tools.