Neosync Complete Tutorial: Data Anonymization and Synthetic Data Generation
⏱️ Estimated Reading Time: 15 minutes
Introduction to Neosync
Neosync is an open-source, developer-first platform for working safely with sensitive data. It provides comprehensive solutions for data anonymization, synthetic data generation, and environment synchronization, so teams can test against production-like data while maintaining compliance with privacy regulations like GDPR, HIPAA, and FERPA.
Why Neosync Matters
Developers need realistic data for testing, debugging, and development, but using actual production data poses significant security and compliance risks. Neosync bridges this gap by providing:
- Safe Production Data Testing - Anonymize sensitive production data for local development
- Production Bug Reproduction - Create safe, representative datasets for debugging
- High-Quality Test Data - Generate production-like data for staging and QA environments
- Compliance Solution - Reduce compliance scope for GDPR, HIPAA, FERPA regulations
- Development Database Seeding - Create synthetic data for unit testing and demos
Key Features Overview
- Synthetic Data Generation based on your existing schema
- Production Data Anonymization with referential integrity preservation
- Database Subsetting using SQL queries for focused testing
- Async Pipeline Architecture with automatic retries and failure handling
- GitOps Integration for declarative configuration management
- Built-in Transformers for major data types (emails, names, addresses, etc.)
- Custom Transformers using JavaScript or LLMs
- Multiple Database Support - PostgreSQL, MySQL, and S3 integration
Prerequisites and Environment Setup
System Requirements
Before starting this tutorial, ensure you have:
- Docker & Docker Compose (latest version)
- Git for repository cloning
- PostgreSQL client (optional, for testing connections)
- Web browser for accessing the Neosync UI
- macOS, Linux, or Windows with WSL2
Installation Steps
Let’s begin by setting up Neosync on your local machine:
Step 1: Clone the Repository
# Clone Neosync repository
git clone https://github.com/nucleuscloud/neosync.git
cd neosync
# Check repository structure
ls -la
Step 2: Start Neosync Services
Neosync ships with a Docker Compose setup that brings up the full stack:
# Start all Neosync services
make compose/up
# Alternatively, you can use Docker Compose directly
docker compose up -d
This command will:
- Download and start all required containers
- Set up PostgreSQL database for Neosync metadata
- Launch the Neosync backend API
- Start the web frontend interface
- Initialize sample connections and jobs
Step 3: Verify Installation
# Check running containers
docker compose ps
# View logs if needed
docker compose logs -f neosync-app
Access Neosync at http://localhost:3000 in your web browser.
Understanding Neosync Architecture
Core Components
Neosync consists of several interconnected components:
- Frontend (Next.js) - Web interface for configuration and monitoring
- Backend API (Go) - Core business logic and job orchestration
- Worker Service - Handles data processing and transformation jobs
- PostgreSQL Database - Stores metadata, configurations, and job state
- Temporal - Workflow orchestration for reliable job execution
Data Flow Architecture
graph TD
A[Source Database] --> B[Neosync Worker]
B --> C[Data Transformers]
C --> D[Anonymized/Synthetic Data]
D --> E[Target Database]
F[Neosync UI] --> G[Backend API]
G --> H[Job Scheduler]
H --> B
I[Configuration] --> G
J[Temporal] --> H
Initial Configuration and Setup
Accessing the Dashboard
- Open your browser and navigate to http://localhost:3000
- You’ll see the Neosync welcome dashboard
- The system comes pre-configured with sample connections for demonstration
Understanding Connections
Connections in Neosync represent database or storage endpoints. The default setup includes:
- Source Connection - PostgreSQL database with sample data
- Destination Connection - Target database for anonymized data
Sample Data Overview
Neosync includes pre-populated sample data to demonstrate its capabilities:
-- Sample schema structure
CREATE TABLE users (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100) UNIQUE,
phone VARCHAR(20),
birth_date DATE,
salary DECIMAL(10,2)
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
order_date TIMESTAMP,
total_amount DECIMAL(10,2),
status VARCHAR(20)
);
Creating Your First Anonymization Job
Job Configuration Wizard
Let’s create a data anonymization job that transforms sensitive information while preserving data relationships:
Step 1: Create New Job
- Click “Jobs” in the navigation menu
- Select “Create Job”
- Choose “Data Anonymization” job type
- Set job name: user-data-anonymization
Step 2: Configure Source Connection
# Source connection settings
Connection Type: PostgreSQL
Host: localhost
Port: 5432
Database: sample_db
Username: postgres
Password: [provided in compose]
Step 3: Define Transformation Rules
For the users table, configure these transformations:
| Column | Transformer | Configuration |
|---|---|---|
| first_name | Generate First Name | Random generation |
| last_name | Generate Last Name | Random generation |
| email | Transform Email | Preserve domain structure |
| phone | Generate Phone | Format: +1-XXX-XXX-XXXX |
| birth_date | Transform Date | Randomize ±5 years |
| salary | Transform Numeric | Randomize ±20% |
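Conceptually, a domain-preserving email transform and a formatted phone generator look like the sketch below. These helper functions are illustrative, not Neosync's actual transformer API:
// Illustrative sketch: domain-preserving email transform and phone generator
function randomString(length) {
  const chars = 'abcdefghijklmnopqrstuvwxyz';
  let out = '';
  for (let i = 0; i < length; i++) {
    out += chars[Math.floor(Math.random() * chars.length)];
  }
  return out;
}

function transformEmail(email) {
  // Replace the local part but keep the domain, so per-domain analytics still work
  const [local, domain] = email.split('@');
  return `${randomString(local.length)}@${domain}`;
}

function generatePhone() {
  const block = () => String(Math.floor(Math.random() * 1000)).padStart(3, '0');
  const lastFour = String(Math.floor(Math.random() * 10000)).padStart(4, '0');
  return `+1-${block()}-${block()}-${lastFour}`;
}

console.log(transformEmail('jane.doe@example.com')); // e.g. "qkzvwabc@example.com"
console.log(generatePhone());                        // e.g. "+1-415-555-0123"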
Step 4: Preserve Referential Integrity
Configure foreign key relationships:
# Maintain user_id relationships in orders table
Foreign Keys:
- Source Table: orders
Source Column: user_id
Reference Table: users
Reference Column: id
Action: preserve_relationship
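Preserving a relationship boils down to applying one consistent ID mapping across every table that references it. A minimal sketch of the idea (not Neosync internals):
// Sketch: keep orders.user_id pointing at the right anonymized user by
// reusing a single id mapping across tables
const idMap = new Map();
let nextId = 1;

function mapId(originalId) {
  if (!idMap.has(originalId)) {
    idMap.set(originalId, nextId++);
  }
  return idMap.get(originalId);
}

// Apply the same mapping when anonymizing both tables:
const users = [{ id: 101 }, { id: 205 }];
const orders = [{ id: 1, user_id: 205 }, { id: 2, user_id: 101 }];

const anonUsers = users.map(u => ({ ...u, id: mapId(u.id) }));
const anonOrders = orders.map(o => ({ ...o, user_id: mapId(o.user_id) }));
// Every order still references an existing anonymized user.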
Step 5: Execute the Job
# Run and monitor the job via the CLI (optional)
docker compose exec neosync-worker neosync jobs run --job-id=user-data-anonymization
# Or use the web interface
# Click "Run Job" in the dashboard
Synthetic Data Generation
Creating Synthetic Datasets
Neosync can generate completely synthetic data that matches your schema constraints:
Step 1: Schema Analysis
-- Analyze existing schema
SELECT
column_name,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'users';
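Schema metadata like this can drive transformer selection automatically. A hypothetical sketch that maps column names and types to the transformers used in this tutorial:
// Sketch: suggest a transformer per column based on information_schema output.
// The mapping rules are illustrative, not Neosync's built-in heuristics.
function suggestTransformer(column) {
  const { column_name, data_type } = column;
  if (/email/i.test(column_name)) return 'transform_email';
  if (/name/i.test(column_name)) return 'generate_full_name';
  if (/phone/i.test(column_name)) return 'generate_phone';
  if (data_type === 'date' || data_type.startsWith('timestamp')) return 'transform_date';
  if (data_type === 'numeric' || data_type === 'integer') return 'transform_numeric';
  return 'passthrough';
}

const schema = [
  { column_name: 'email', data_type: 'character varying' },
  { column_name: 'birth_date', data_type: 'date' },
];
schema.forEach(c => console.log(c.column_name, '->', suggestTransformer(c)));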
Step 2: Configure Synthetic Generation
Create a new job with these settings:
Job Type: Generate Synthetic Data
Target Rows: 10000
Data Distribution:
users:
- first_name: weighted_random([common_names])
- last_name: weighted_random([surnames])
- email: generate_email(first_name, last_name)
- age_distribution: normal(mean=35, std=12)
- salary_distribution: lognormal(mean=75000, std=25000)
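The normal and lognormal distributions referenced above can be sampled with the Box-Muller transform. A sketch, assuming the config's mean/std describe the resulting values rather than the underlying normal parameters:
// Sketch of the distributions above, using the Box-Muller transform
function sampleNormal(mean, std) {
  const u1 = 1 - Math.random(); // avoid log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + std * z;
}

// Interpret lognormal(mean=75000, std=25000) as the mean/std of the salaries
// themselves, then convert to the underlying normal parameters
function sampleLogNormal(mean, std) {
  const variance = std * std;
  const mu = Math.log(mean * mean / Math.sqrt(variance + mean * mean));
  const sigma = Math.sqrt(Math.log(1 + variance / (mean * mean)));
  return Math.exp(sampleNormal(mu, sigma));
}

console.log(Math.round(sampleNormal(35, 12)));          // e.g. 29
console.log(Math.round(sampleLogNormal(75000, 25000))); // e.g. 68214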
Step 3: Advanced Synthetic Patterns
// Custom transformer for realistic email generation
function generateEmail(firstName, lastName) {
const domains = ['gmail.com', 'yahoo.com', 'company.com'];
const domain = domains[Math.floor(Math.random() * domains.length)];
const username = `${firstName.toLowerCase()}.${lastName.toLowerCase()}`;
return `${username}@${domain}`;
}
// Generate correlated data
function generateSalary(experience, education) {
const baseSalary = 50000;
const experienceMultiplier = experience * 2000;
const educationBonus = education === 'masters' ? 15000 :
education === 'phd' ? 25000 : 0;
return baseSalary + experienceMultiplier + educationBonus;
}
Advanced Data Transformations
Custom JavaScript Transformers
Neosync supports custom transformations using JavaScript:
// Credit card number anonymization
function anonymizeCreditCard(value) {
if (!value || value.length < 4) return value;
const lastFour = value.slice(-4);
const masked = '*'.repeat(value.length - 4);
return masked + lastFour;
}
// Address anonymization while preserving geographic region
function anonymizeAddress(address, city, state) {
  return {
    street: generateRandomStreet(),
    city: city, // Preserve city for geographic analysis
    state: state,
    zipCode: generateRandomZipInState(state)
  };
}

// Minimal stand-ins so the sketch runs; replace with real generators as needed
function generateRandomStreet() {
  return `${Math.floor(Math.random() * 9000) + 100} Oak St`;
}
function generateRandomZipInState(state) {
  return String(Math.floor(Math.random() * 90000) + 10000); // illustrative; not state-aware
}
// Timestamp anonymization with time pattern preservation
function anonymizeTimestamp(timestamp) {
const date = new Date(timestamp);
const randomDays = Math.floor(Math.random() * 365) - 182; // ±6 months
date.setDate(date.getDate() + randomDays);
return date.toISOString();
}
LLM-Powered Transformations
For more sophisticated transformations, Neosync can integrate with Large Language Models:
# LLM transformer configuration
Transformer: LLM_Transform
Model: gpt-3.5-turbo
Prompt: |
Transform this customer review to remove personal information
while preserving sentiment and key product feedback:
Original: "{review_text}"
Requirements:
- Remove specific names, locations, dates
- Preserve product features mentioned
- Maintain emotional tone
- Keep review length similar
Temperature: 0.3
Max_Tokens: 300
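Under the hood, a template like this is filled per row before being sent to the model. A minimal sketch of the templating step (the placeholder syntax and helper are illustrative, and the model call itself is omitted):
// Sketch: fill a prompt template with row values before the LLM call
function buildPrompt(template, row) {
  return template.replace(/\{(\w+)\}/g, (_, key) =>
    key in row ? String(row[key]) : `{${key}}`);
}

const template =
  'Transform this customer review to remove personal information ' +
  'while preserving sentiment and key product feedback:\n' +
  'Original: "{review_text}"';

const row = { review_text: 'John from Austin loved the battery life.' };
console.log(buildPrompt(template, row));
// The filled prompt is then sent to the configured model with
// temperature 0.3 and a 300-token cap.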
Database Integration and Subsetting
PostgreSQL Integration
Configure PostgreSQL connection for production data:
# Production PostgreSQL setup
Connection:
type: postgresql
host: prod-db.company.com
port: 5432
database: production_db
username: neosync_reader
password: ${NEOSYNC_DB_PASSWORD}
ssl_mode: require
# Read-only permissions for safety
Permissions:
- SELECT on public.*
- No write permissions
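It is worth verifying that the account really is read-only before pointing Neosync at production. A quick sanity check using the node-postgres (pg) package, with the connection values from the config above; this script is not part of Neosync:
// Sketch: verify the neosync_reader account cannot write
const { Client } = require('pg');

async function verifyReadOnly() {
  const client = new Client({
    host: 'prod-db.company.com',
    port: 5432,
    database: 'production_db',
    user: 'neosync_reader',
    password: process.env.NEOSYNC_DB_PASSWORD,
    ssl: true, // matches ssl_mode: require
  });
  await client.connect();
  try {
    await client.query('SELECT 1');                   // reads should work
    await client.query('CREATE TABLE _probe (i int)'); // writes should fail
    console.warn('WARNING: account has write permissions!');
  } catch (err) {
    console.log('Write rejected as expected:', err.message);
  } finally {
    await client.end();
  }
}

verifyReadOnly().catch(console.error);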
Data Subsetting Strategies
Create focused datasets for testing:
-- User-based subsetting
SELECT * FROM users
WHERE created_at >= '2024-01-01'
AND account_type = 'premium'
LIMIT 1000;
-- Relationship-aware subsetting
WITH sample_users AS (
SELECT id FROM users
WHERE region = 'US-WEST'
LIMIT 500
)
SELECT o.* FROM orders o
JOIN sample_users su ON o.user_id = su.id
WHERE o.order_date >= '2024-01-01';
-- Time-based subsetting with referential integrity
SELECT * FROM events
WHERE event_date BETWEEN '2024-07-01' AND '2024-07-31'
AND user_id IN (
SELECT id FROM users
WHERE last_active >= '2024-06-01'
);
MySQL Integration
# MySQL connection configuration
Connection:
type: mysql
host: mysql-server.internal
port: 3306
database: app_database
username: neosync_user
password: ${MYSQL_PASSWORD}
charset: utf8mb4
# MySQL-specific settings
Options:
sql_mode: STRICT_TRANS_TABLES
time_zone: UTC
max_connections: 10
Workflow Automation and GitOps
Declarative Configuration
Create reusable job configurations:
# .neosync/jobs/user-anonymization.yaml
apiVersion: neosync.dev/v1
kind: Job
metadata:
name: user-data-anonymization
namespace: development
spec:
source:
connection: prod-postgres
tables:
- users
- user_profiles
- user_preferences
destination:
connection: dev-postgres
transformations:
users:
first_name:
type: generate_first_name
last_name:
type: generate_last_name
email:
type: transform_email
preserve_domain: true
ssn:
type: hash_value
algorithm: sha256
user_profiles:
bio:
type: llm_transform
model: gpt-3.5-turbo
prompt: "Anonymize personal details while preserving professional information"
schedule:
cron: "0 2 * * *" # Daily at 2 AM
timezone: UTC
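Because these manifests live in Git, they can be linted in CI before they ever reach Neosync. A hypothetical validation script using the js-yaml package, checking the fields the manifest above uses:
// Sketch: pre-commit/CI check for Neosync job manifests (field checks are illustrative)
const fs = require('fs');
const yaml = require('js-yaml');

function validateJob(path) {
  const doc = yaml.load(fs.readFileSync(path, 'utf8'));
  const errors = [];
  if (doc.kind !== 'Job') errors.push('kind must be "Job"');
  if (!doc.metadata?.name) errors.push('metadata.name is required');
  if (!doc.spec?.source?.connection) errors.push('spec.source.connection is required');
  if (!doc.spec?.destination?.connection) errors.push('spec.destination.connection is required');
  return errors;
}

const errors = validateJob('.neosync/jobs/user-anonymization.yaml');
if (errors.length) {
  console.error('Invalid job config:\n- ' + errors.join('\n- '));
  process.exit(1);
}
console.log('Job config looks valid');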
CI/CD Integration
# .github/workflows/data-sync.yml
name: Neosync Data Synchronization
on:
schedule:
- cron: '0 6 * * 1' # Every Monday at 6 AM
workflow_dispatch:
jobs:
sync-development-data:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Neosync CLI
run: |
curl -sSL https://install.neosync.dev | sh
echo "$HOME/.neosync/bin" >> $GITHUB_PATH
- name: Run Anonymization Job
env:
NEOSYNC_API_TOKEN: ${{ secrets.NEOSYNC_API_TOKEN }}
NEOSYNC_API_URL: ${{ secrets.NEOSYNC_API_URL }}
run: |
neosync jobs run \
--job-config .neosync/jobs/user-anonymization.yaml \
--wait-for-completion \
--timeout 30m
- name: Verify Data Quality
run: |
neosync validate \
--connection dev-postgres \
--check referential-integrity \
--check data-quality
Monitoring and Observability
Job Monitoring Dashboard
Neosync provides comprehensive monitoring capabilities:
- Job Execution Status - Real-time progress tracking
- Data Transformation Metrics - Row counts, transformation rates
- Error Tracking - Failed transformations and retry logic
- Performance Metrics - Execution time, throughput analysis
- Data Quality Checks - Validation results and anomaly detection
Metrics and Alerting
# Monitoring configuration
Monitoring:
metrics:
- job_duration_seconds
- rows_processed_total
- transformation_errors_total
- data_quality_score
alerts:
- name: job_failure
condition: job_status == "failed"
notification: slack_webhook
- name: data_quality_degradation
condition: data_quality_score < 0.95
notification: email
- name: long_running_job
condition: job_duration_seconds > 3600
notification: pagerduty
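The data_quality_score alert assumes some score is being computed. One illustrative way to derive such a 0-to-1 score, comparing row counts and null rates between source and anonymized tables (not Neosync's actual formula):
// Sketch: a simple 0-1 quality score from row counts and null-rate drift
function qualityScore(source, target) {
  const rowRatio = Math.min(target.rowCount / source.rowCount, 1);
  const nullDrift = Math.abs(target.nullRate - source.nullRate);
  return Math.max(0, rowRatio - nullDrift);
}

const score = qualityScore(
  { rowCount: 10000, nullRate: 0.02 }, // source stats
  { rowCount: 9990, nullRate: 0.03 }   // anonymized stats
);
console.log(score.toFixed(3)); // 0.989 -- above the 0.95 alert threshold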
Log Analysis
# View job execution logs
docker compose logs neosync-worker | grep "job_id=user-anonymization"
# Monitor transformation performance
docker compose logs neosync-worker | grep "transformation_stats"
# Check for errors
docker compose logs neosync-worker | grep "ERROR"
Security and Compliance
Data Privacy Best Practices
- Principle of Least Privilege - Grant minimal necessary permissions
- Data Retention Policies - Automatically purge old anonymized data
- Audit Logging - Track all data access and transformations
- Encryption - Encrypt data in transit and at rest
- Access Controls - Role-based access to different data sensitivity levels
GDPR Compliance Features
# GDPR compliance configuration
GDPR:
data_subject_rights:
right_to_be_forgotten:
enabled: true
retention_days: 90
right_of_access:
enabled: true
response_time_days: 30
data_portability:
enabled: true
export_formats: [json, csv, xml]
consent_management:
track_consent_changes: true
consent_expiry_days: 365
breach_notification:
enabled: true
notification_time_hours: 72
HIPAA Compliance
# HIPAA compliance for healthcare data
HIPAA:
phi_identification:
automatic_detection: true
custom_patterns:
- medical_record_number: '\d{8,12}'
- patient_id: 'P\d{6,10}'
safe_harbor_method:
remove_direct_identifiers: true
statistical_disclosure_control: true
audit_controls:
log_all_access: true
log_retention_years: 6
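The custom patterns above are plain regular expressions, so a PHI scan over free-text fields can be sketched directly. The scanner itself is illustrative:
// Sketch: scan text for the custom PHI patterns defined in the config above
const phiPatterns = {
  medical_record_number: /\d{8,12}/g,
  patient_id: /P\d{6,10}/g,
};

function findPhi(text) {
  const hits = [];
  for (const [name, pattern] of Object.entries(phiPatterns)) {
    for (const match of text.matchAll(pattern)) {
      hits.push({ type: name, value: match[0], index: match.index });
    }
  }
  return hits;
}

console.log(findPhi('Patient P1234567 (MRN 123456789) was admitted.'));
// [ { type: 'medical_record_number', value: '123456789', ... },
//   { type: 'patient_id', value: 'P1234567', ... } ]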
Performance Optimization
Parallel Processing Configuration
# Performance optimization settings
Performance:
worker_concurrency: 8
batch_size: 1000
memory_limit: "4Gi"
database_connections:
max_open: 25
max_idle: 5
connection_lifetime: "5m"
transformation_cache:
enabled: true
size: "1Gi"
ttl: "1h"
Large Dataset Handling
-- Chunked processing for large tables
SELECT * FROM large_table
WHERE id BETWEEN ? AND ?
ORDER BY id
LIMIT 10000;
-- Memory-efficient streaming
SET work_mem = '256MB';
SET maintenance_work_mem = '1GB';
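From application code, chunked processing is usually implemented as keyset pagination: read a batch, remember the last ID, repeat. A sketch with node-postgres, using the hypothetical large_table from the SQL above:
// Sketch: keyset pagination over a large table, one chunk at a time
const { Client } = require('pg');

async function processInChunks(client, chunkSize = 10000) {
  let lastId = 0;
  for (;;) {
    const { rows } = await client.query(
      'SELECT * FROM large_table WHERE id > $1 ORDER BY id LIMIT $2',
      [lastId, chunkSize]
    );
    if (rows.length === 0) break;
    // ... transform and write the chunk here ...
    lastId = rows[rows.length - 1].id; // resume after the last row seen
  }
}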
Troubleshooting Guide
Common Issues and Solutions
Issue 1: Job Timeout
# Solution: Increase timeout and optimize batch size
Job:
timeout: 3600s # 1 hour
batch_size: 500 # Smaller batches
retry_attempts: 3
Issue 2: Memory Issues
# Monitor memory usage
docker stats neosync-worker
# Scale out workers to spread the load (raise per-container memory
# limits in docker-compose.yml if a single worker hits its ceiling)
docker compose up -d --scale neosync-worker=2
Issue 3: Connection Failures
# Robust connection configuration
Connection:
retry_attempts: 5
retry_delay: 30s
connection_timeout: 60s
read_timeout: 300s
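The retry settings above imply wait-and-retry semantics like the following sketch (the names mirror the config fields; this is not Neosync's implementation):
// Sketch: retry a flaky operation with a fixed delay between attempts
async function withRetries(fn, retryAttempts = 5, retryDelayMs = 30000) {
  let lastError;
  for (let attempt = 1; attempt <= retryAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      console.warn(`Attempt ${attempt}/${retryAttempts} failed: ${err.message}`);
      if (attempt < retryAttempts) {
        await new Promise(resolve => setTimeout(resolve, retryDelayMs));
      }
    }
  }
  throw lastError;
}

// Usage: withRetries(() => client.connect());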
Debug Mode
# Enable debug logging
export NEOSYNC_LOG_LEVEL=debug
docker compose up -d
# View detailed logs
docker compose logs -f neosync-worker | grep DEBUG
Testing and Validation
Let’s create a comprehensive test script to validate our Neosync setup:
#!/bin/bash
# File: test-neosync-setup.sh
echo "🚀 Testing Neosync Setup..."
# Test 1: Check if services are running
echo "📡 Checking Neosync services..."
if curl -f http://localhost:3000/health > /dev/null 2>&1; then
echo "✅ Neosync UI is accessible"
else
echo "❌ Neosync UI is not accessible"
exit 1
fi
# Test 2: Verify database connectivity
echo "🗄️ Testing database connectivity..."
docker compose exec neosync-app neosync connections test --connection-id=sample-postgres
if [ $? -eq 0 ]; then
echo "✅ Database connection successful"
else
echo "❌ Database connection failed"
fi
# Test 3: Run sample anonymization job
echo "🔄 Running sample anonymization job..."
JOB_ID=$(docker compose exec neosync-app neosync jobs create \
--name "test-anonymization" \
--source-connection sample-postgres \
--destination-connection sample-postgres-dest)
docker compose exec neosync-app neosync jobs run --job-id=$JOB_ID --wait
# Test 4: Validate anonymized data
echo "🔍 Validating anonymized data..."
docker compose exec postgres psql -U postgres -d neosync -c \
"SELECT COUNT(*) as anonymized_records FROM users_anonymized;"
echo "✅ Neosync setup test completed successfully!"
Next Steps and Advanced Usage
Production Deployment
For production deployment, consider:
- Kubernetes Deployment - Use the provided Helm charts
- High Availability - Deploy multiple worker instances
- External Database - Use managed PostgreSQL for metadata
- Secrets Management - Integrate with HashiCorp Vault or AWS Secrets Manager
- Load Balancing - Distribute API requests across multiple instances
Integration Patterns
# Microservices integration
Services:
user-service:
anonymization_job: user-data-anonymization
schedule: "0 3 * * *"
order-service:
anonymization_job: order-data-anonymization
depends_on: [user-service]
analytics-service:
synthetic_data_job: analytics-synthetic-data
schema_source: production_analytics
Custom Extensions
// Custom transformer in Go (sketch; TransformerConfig is a placeholder type)
package transformers

type TransformerConfig struct {
	// transformer-specific settings go here
}

type CustomTransformer struct {
	config TransformerConfig
}

func (t *CustomTransformer) Transform(value interface{}) (interface{}, error) {
	// Implement custom transformation logic here; this sketch passes data through
	return value, nil
}
Conclusion
Neosync provides a comprehensive solution for modern data privacy and testing challenges. By implementing proper data anonymization and synthetic data generation, organizations can:
- Accelerate Development - Safe access to production-like data
- Improve Data Quality - Realistic test scenarios and edge cases
- Ensure Compliance - Automated privacy protection for regulated industries
- Reduce Risk - Eliminate exposure of sensitive production data
- Scale Testing - Generate unlimited synthetic datasets for various scenarios
The platform’s declarative configuration, GitOps integration, and extensive customization options make it suitable for organizations of all sizes, from startups to enterprise deployments.
Key Takeaways
- Start Simple - Begin with basic anonymization jobs and gradually add complexity
- Preserve Relationships - Always maintain referential integrity in your transformations
- Monitor Quality - Implement data quality checks to ensure transformation effectiveness
- Automate Everything - Use GitOps and CI/CD integration for consistent data provisioning
- Plan for Scale - Design your transformation pipelines with production volume in mind
Resources for Further Learning
- Neosync Documentation - Comprehensive guides and API reference
- Community Discord - Connect with other users and get support
- GitHub Repository - Source code and issue tracking
- Blog and Tutorials - Latest features and use cases
Need Help? Join the Neosync community on Discord or open an issue on GitHub for technical support and feature requests.