
How to Return a Random Sample of Documents from a MongoDB Collection


Tutorialsrack 04/02/2025 MongoDB

Introduction

When working with large MongoDB collections, retrieving a random sample of documents is a common requirement. Whether the goal is A/B testing, data analysis, or load testing, an efficient way to fetch random documents saves both time and server resources.

This article explores different techniques to return random samples in MongoDB, covering both beginner-friendly and advanced methods.

Why Use Random Sampling in MongoDB?

Random sampling helps with:

  • Data Analysis: Extracting a small but representative dataset for statistical analysis.
  • Machine Learning: Training models with randomized datasets.
  • Load Testing: Fetching random records for testing API performance.
  • A/B Testing: Selecting random users for controlled experiments.

Understanding how to efficiently retrieve random records is crucial when working with large datasets.

Methods to Retrieve a Random Sample of Documents

1. Using $sample Aggregation Pipeline

MongoDB provides the $sample stage in the aggregation pipeline, which is the most efficient and recommended approach for random sampling.

Example:

db.collection.aggregate([
   { $sample: { size: 5 } }
 ])

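This returns 5 randomly selected documents, or fewer if the collection holds fewer than 5. When $sample is the first pipeline stage, the requested size is under 5% of the collection's documents, and the collection contains more than 100 documents, MongoDB picks documents with a pseudo-random cursor. Otherwise it scans the collection and performs a random sort, which is slower and subject to the usual sort memory limits.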

Pros:

  • Simple and efficient.
  • Works well for large datasets.

Cons:

  • Can be slower on sharded collections as it requires merging sampled documents from each shard.
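A common variation is sampling only from documents that match a condition. The sketch below builds such a pipeline as a plain array; the helper name buildSamplePipeline and the status field are assumptions made for illustration, not part of MongoDB. Note that once $match comes first, $sample is no longer the first stage, so the pseudo-random cursor optimization would not be expected to apply.

```javascript
// Hypothetical helper: builds a pipeline that samples `size` documents,
// optionally restricted to those matching `filter`.
function buildSamplePipeline(size, filter = null) {
  // $sample expects a positive integer for `size`.
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const pipeline = [];
  if (filter && Object.keys(filter).length > 0) {
    pipeline.push({ $match: filter }); // narrow the candidates first
  }
  pipeline.push({ $sample: { size } }); // then draw the random sample
  return pipeline;
}

// In mongosh, assuming documents carry a `status` field:
// db.collection.aggregate(buildSamplePipeline(5, { status: "active" }));
```

Building the pipeline as data keeps the sampling logic easy to inspect and reuse before it is sent to the server.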

2. Using $sampleRate for Approximate Sampling

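Available since MongoDB 4.4.2, $sampleRate is a query operator used inside a $match stage rather than a stage of its own. It keeps each document independently with the given probability, so it returns roughly that fraction of the documents.

Example:

db.collection.aggregate([
   { $match: { $sampleRate: 0.1 } }
 ])

This returns approximately 10% of the documents in the collection. Because every document is accepted or rejected on its own, the result size varies from run to run, and no sort or cross-document bookkeeping is needed. The operator is still evaluated for each document that reaches the $match stage.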

Minimum and Maximum Values

  • Minimum fraction: 0.0 (returns no documents)
  • Maximum fraction: 1.0 (returns all documents, equivalent to a full collection scan)

Pros:

  • Needs no sort step and no document count; each document is accepted or rejected independently.
  • Works well for approximate sampling on sharded clusters.
  • The sample size scales automatically with the size of the collection.

Cons:

  • Cannot specify an exact number of documents; the result size varies between runs.
  • Still evaluates every document that reaches the $match stage.
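To see why the result size is only approximate, it can help to imitate the per-document decision in plain JavaScript. In the sketch below, buildSampleRateStage and bernoulliSample are hypothetical helper names; the first wraps $sampleRate in a $match stage (the form in which the operator is documented), and the second is an in-memory imitation for illustration only, not how the server is implemented.

```javascript
// Hypothetical helper: wraps $sampleRate in a $match stage.
// The rate must lie between 0 and 1 inclusive.
function buildSampleRateStage(rate) {
  if (typeof rate !== "number" || Number.isNaN(rate) || rate < 0 || rate > 1) {
    throw new RangeError("rate must be a number between 0 and 1");
  }
  return { $match: { $sampleRate: rate } };
}

// In-memory imitation: each document gets its own "coin flip".
// `random` is injectable so the behavior can be checked deterministically.
function bernoulliSample(docs, rate, random = Math.random) {
  return docs.filter(() => random() < rate);
}

// In mongosh:
// db.collection.aggregate([buildSampleRateStage(0.1)]);
```

With a rate of 0 nothing passes the filter and with a rate of 1 everything does, which matches the minimum and maximum values listed above. For anything in between, the count fluctuates around rate times the number of documents.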

3. Using find() with Random Sorting

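Another approach is to use find() and sort on a field that holds a random value. This assumes every document already stores one (called random_field here). A fixed sort like this returns the same 5 documents on every call until the stored values change.

Example:

db.collection.find().sort({ random_field: 1 }).limit(5)

If documents don't have a precomputed random field, find() alone cannot produce a random order. Sorting by $natural, as below, returns documents in reverse natural (storage) order. That is a cheap way to grab a few documents, but it is not a random sample:

db.collection.find().sort({ $natural: -1 }).limit(5)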

Pros:

  • Works without aggregation.
  • Good for small collections.

Cons:

  • Inefficient for large collections.
  • Requires an indexed field for better performance.

4. Using a Random Skip Approach

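For smaller collections, you can skip a random number of documents and read the next few. The result is a block of neighboring documents starting at a random position, not 5 independently chosen ones.

Example:

// Pick an offset that still leaves 5 documents after it.
let count = db.collection.countDocuments({});
let randomSkip = Math.floor(Math.random() * Math.max(1, count - 4));
db.collection.find().skip(randomSkip).limit(5);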

Pros:

  • Works for smaller datasets.

Cons:

  • Becomes slow on large datasets, because skip() must walk past every skipped document.
  • Returns adjacent documents rather than an independent random draw.
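If neighboring documents are not acceptable, one option is to draw several distinct offsets and fetch one document per offset. The sketch below is a minimal illustration under that assumption; distinctRandomOffsets is a hypothetical helper, and the approach costs one query per document, so it only suits small sample sizes.

```javascript
// Hypothetical helper: returns up to `n` distinct offsets in [0, count).
// Uses simple rejection via a Set, which is fine when n is much smaller than count.
function distinctRandomOffsets(count, n) {
  const wanted = Math.min(n, count); // cannot draw more offsets than documents
  const offsets = new Set();
  while (offsets.size < wanted) {
    offsets.add(Math.floor(Math.random() * count));
  }
  return [...offsets];
}

// In mongosh, one single-document query per offset:
// const count = db.collection.countDocuments({});
// const docs = distinctRandomOffsets(count, 5).flatMap(
//   (offset) => db.collection.find().skip(offset).limit(1).toArray()
// );
```

Each query still pays the skip() cost for its own offset, so this trades speed for independence between the selected documents.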

5. Using a Precomputed Random Field

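To improve performance, you can store a random value in each document, index it, and query against it.

Example:

// $rand is evaluated per document; a client-side Math.random() would give every document the same value.
db.collection.updateMany({}, [ { $set: { random_field: { $rand: {} } } } ]);

This pipeline-style update needs MongoDB 4.4.2 or later for $rand. An index on the field, created with db.collection.createIndex({ random_field: 1 }), lets the query avoid sorting the whole collection.

Querying:

// Start at a random pivot; a plain ascending sort would return the same 5 documents every time.
db.collection.find({ random_field: { $gte: Math.random() } }).sort({ random_field: 1 }).limit(5);

If the pivot lands near 1, fewer than 5 documents may match.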

Pros:

  • Optimized for repeated queries.
  • Works well with indexes.

Cons:

  • Requires extra maintenance: new documents need a random_field value, and stored values may need periodic refreshing.
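One way to query a precomputed random field is to pick a random pivot and read documents whose value is at or above it, wrapping around to the smallest values if too few remain. The sketch below imitates that logic on an in-memory array so the behavior is easy to follow; sampleByPivot is a hypothetical helper, and the field name random_field follows the example above.

```javascript
// In-memory imitation of a pivot query with wrap-around.
// Each doc is assumed to carry a numeric `random_field` in [0, 1).
function sampleByPivot(docs, n, pivot = Math.random()) {
  const sorted = [...docs].sort((a, b) => a.random_field - b.random_field);
  const atOrAfter = sorted.filter((d) => d.random_field >= pivot);
  const before = sorted.filter((d) => d.random_field < pivot);
  // Wrap around to the smallest values when the pivot lands near the top.
  return [...atOrAfter, ...before].slice(0, n);
}

// Rough mongosh equivalent (the second query runs only when needed,
// because limit(0) means "no limit" in MongoDB):
// const pivot = Math.random();
// let docs = db.collection.find({ random_field: { $gte: pivot } })
//   .sort({ random_field: 1 }).limit(5).toArray();
// if (docs.length < 5) {
//   docs = docs.concat(db.collection.find({ random_field: { $lt: pivot } })
//     .sort({ random_field: 1 }).limit(5 - docs.length).toArray());
// }
```

The returned documents are neighbors in random_field order, so two calls with nearby pivots overlap heavily; refreshing the stored values occasionally reduces that effect.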

6. Using $rand Operator for Random Selection

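The $rand operator (available since MongoDB 4.4.2) generates a random number between 0 and 1 each time it is evaluated. Inside $expr it produces a fresh value for every document, which can be used to filter results.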

Example:

db.collection.find({ $expr: { $lt: [ { $rand: {} }, 0.1 ] } }).limit(5);

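The filter keeps each document with probability 0.1, so roughly 10% of the documents match; limit(5) then returns only the first 5 matches in scan order. Because of the limit, documents near the start of the scan are favored. Drop limit() if you want the full approximate 10% sample.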

Minimum and Maximum Values

  • Minimum fraction: 0.0 (returns no documents)
  • Maximum fraction: 1.0 (returns all documents, equivalent to a full collection scan)

Pros:

  • Lightweight and efficient.
  • Works without requiring a separate aggregation pipeline.

Cons:

  • Selection is approximate, not exact.
  • Evaluates the expression for every scanned document, so cost grows with collection size.
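The same $rand filter can be combined with an ordinary query condition so that only a subset of documents is eligible. The sketch below builds such a filter object; buildRandFilter is a hypothetical helper, the 0-to-1 range check is the helper's own rule, and the country field is an assumed example.

```javascript
// Hypothetical helper: keeps each matching document with probability `rate`.
function buildRandFilter(rate, baseFilter = {}) {
  if (typeof rate !== "number" || Number.isNaN(rate) || rate < 0 || rate > 1) {
    throw new RangeError("rate must be a number between 0 and 1");
  }
  const randomPart = { $expr: { $lt: [{ $rand: {} }, rate] } };
  // Use $and so an existing $expr in baseFilter is not overwritten.
  return Object.keys(baseFilter).length === 0
    ? randomPart
    : { $and: [baseFilter, randomPart] };
}

// In mongosh, assuming documents carry a `country` field:
// db.collection.find(buildRandFilter(0.1, { country: "DE" }));
```

Putting the selective condition in baseFilter lets the server use an index for it where one exists, leaving the random test to run only on the documents that remain.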

Real-World Examples and Use Cases

Example 1: A/B Testing

E-commerce websites can use random sampling to test features on a subset of users.

db.users.aggregate([{ $sample: { size: 100 } }]);

Example 2: Data Analysis

Analyzing social media trends by sampling random posts.

db.posts.aggregate([{ $sample: { size: 50 } }]);

Key Takeaways

  • The $sample stage is the recommended way to retrieve a fixed number of random documents.
  • $sampleRate, used inside $match, provides approximate fraction-based sampling without a sort step.
  • The minimum fraction for $sampleRate and $rand is 0.0, and the maximum is 1.0.
  • Sorting with a random field is a practical approach for small datasets.
  • The skip() method works but is inefficient for large collections.
  • Precomputing a random field improves performance for repeated queries.
  • The $rand operator allows lightweight, approximate random selection.

Summary

Returning random samples in MongoDB can be achieved in multiple ways, each with pros and cons. The $sample stage is the recommended method when you need a fixed number of documents. Alternative approaches such as $sampleRate, $rand, random sorting, skipping, and precomputed random fields offer flexibility depending on dataset size and use case.

