Java Collection Object Deduplication Techniques Explained

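Deduplicating objects is a common requirement in Java development. Whether you are processing user data, business records, or system logs, removing duplicates improves data quality, saves storage space, and speeds up queries. This article walks through the main deduplication techniques and the scenarios each one suits.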

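1. Overview of object deduplication

1.1 What is object deduplication?

Core concept

Object deduplication is the process of removing repeated elements from a collection so that only unique elements remain. The key question is how "duplicate" is defined:

🔍 By object reference: two references point to the same object in memory, compared with ==
📦 By object content: two objects are considered the same in business terms, decided by equals() and hashCode()
🏷️ By specific fields: two objects have equal values in the chosen fields, decided by custom comparison logic
🔄 By combined conditions: several fields or more complex business rules are evaluated together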
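
The difference between the first two definitions can be sketched as follows. This is a minimal illustration, not code from the article: the class name DuplicateDefinitions and both method names are hypothetical, and the reference-based variant assumes that an identity set built on IdentityHashMap is an acceptable way to compare with ==.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the article's own examples.
class DuplicateDefinitions {

    // Content-based: two elements are duplicates when equals()/hashCode() agree.
    static <T> List<T> byContent(List<T> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    // Reference-based: two elements are duplicates only when they are the same instance (==).
    static <T> List<T> byReference(List<T> items) {
        Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<T> result = new ArrayList<>();
        for (T item : items) {
            if (seen.add(item)) {
                result.add(item);
            }
        }
        return result;
    }
}
```

For two separately constructed but equal strings, byContent keeps one element while byReference keeps both, because the two instances are distinct objects.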

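1.2 Why deduplication matters

Aspect	What it means	Business value
Data quality	Duplicate records no longer skew analysis results	More accurate decisions
Storage optimization	Less space is taken up by redundant data	Lower storage cost
Performance	Fewer repeated queries and less repeated processing	Faster system response
Business logic	Business rules are applied consistently	Data integrity is maintained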

1.3 Categories of deduplication techniques

Example: deduplication technique categories

java

public class DeduplicationTechniques {

    /**
     * Collection-based deduplication
     */
    public enum CollectionBased {
        HASH_SET,       // backed by HashSet
        TREE_SET,       // backed by TreeSet
        LINKED_HASH_SET // backed by LinkedHashSet
    }

    /**
     * Stream-based deduplication
     */
    public enum StreamBased {
        DISTINCT,           // uses the distinct() method
        TO_MAP,             // uses the toMap() collector
        COLLECTING_AND_THEN // uses collectingAndThen
    }

    /**
     * Deduplication based on custom logic
     */
    public enum CustomBased {
        COMPARATOR,    // custom comparator
        MULTI_FIELD,   // combination of several fields
        TIMESTAMP,     // based on timestamps
        BUSINESS_RULE  // based on business rules
    }
}

2. Basic deduplication methods

2.1 Deduplicating with HashSet

HashSet is the most common way to deduplicate. It relies on the element's hashCode() and equals() methods, so both must be overridden consistently:

Complete HashSet deduplication example

java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// User.java
public class User {
    private String name;
    private int age;
    private String email;

    // Constructor
    public User(String name, int age, String email) {
        this.name = name;
        this.age = age;
        this.email = email;
    }

    // Getters
    public String getName() { return name; }
    public int getAge() { return age; }
    public String getEmail() { return email; }

    // Two users are equal when name, age, and email all match
    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        User user = (User) obj;
        return age == user.age &&
               Objects.equals(name, user.name) &&
               Objects.equals(email, user.email);
    }

    // Uses exactly the fields compared in equals()
    @Override
    public int hashCode() {
        return Objects.hash(name, age, email);
    }

    @Override
    public String toString() {
        return "User{name='" + name + "', age=" + age + ", email='" + email + "'}";
    }
}

// HashSetDeduplicationExample.java
public class HashSetDeduplicationExample {
    public static void main(String[] args) {
        // Build a user list that contains duplicates
        List<User> users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"),
            new User("Bob", 30, "bob@example.com"),
            new User("Alice", 25, "alice@example.com"),  // duplicate
            new User("Charlie", 35, "charlie@example.com"),
            new User("Bob", 30, "bob@example.com")       // duplicate
        );

        System.out.println("=== HashSet deduplication example ===");
        System.out.println("Original list size: " + users.size());
        System.out.println("Original list: " + users);

        // Deduplicate with a HashSet (iteration order is not guaranteed)
        Set<User> uniqueUsers = new HashSet<>(users);
        List<User> deduplicatedList = new ArrayList<>(uniqueUsers);

        System.out.println("Deduplicated list size: " + deduplicatedList.size());
        System.out.println("Deduplicated list: " + deduplicatedList);

        // Report how many duplicates were removed
        System.out.println("Removed " + (users.size() - deduplicatedList.size()) + " duplicate element(s)");
    }
}

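HashSet deduplication characteristics

Aspect	Strength	Limitation	Suitable for
Time complexity	O(1) average per add or lookup	Worst case O(n) when many hash codes collide	General-purpose data volumes
Space complexity	Only one extra Set is needed	Requires O(n) additional memory	Data that fits comfortably in memory
Ordering	No ordering bookkeeping	Original order is not preserved; iteration order is unspecified	Cases where order does not matter
null handling	Accepts a null element	Custom equals() or key logic must handle null explicitly	Collections that may contain null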

2.2 Deduplicating with the Stream API

The Stream API introduced in Java 8 offers a more declarative way to deduplicate:

Stream API deduplication example

java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.stream.Collectors;

public class StreamDeduplicationExample {
    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"),
            new User("Bob", 30, "bob@example.com"),
            new User("Alice", 25, "alice@example.com"),
            new User("Charlie", 35, "charlie@example.com"),
            new User("Bob", 30, "bob@example.com")
        );

        System.out.println("=== Stream API deduplication example ===");

        // 1. Deduplicate by equals()/hashCode(); encounter order is preserved
        List<User> uniqueUsers = users.stream()
            .distinct()
            .collect(Collectors.toList());
        System.out.println("distinct() result: " + uniqueUsers);

        // 2. Deduplicate by a single field, keeping the first occurrence.
        //    LinkedHashMap::new keeps the result in encounter order;
        //    the default HashMap would not.
        List<User> uniqueByName = users.stream()
            .collect(Collectors.toMap(
                User::getName,
                user -> user,
                (existing, replacement) -> existing,
                LinkedHashMap::new
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
        System.out.println("Deduplicated by name: " + uniqueByName);

        // 3. Deduplicate by several fields.
        //    A "|"-joined key can collide if a field itself contains "|".
        List<User> uniqueByNameAndAge = users.stream()
            .collect(Collectors.toMap(
                user -> user.getName() + "|" + user.getAge(),
                user -> user,
                (existing, replacement) -> existing,
                LinkedHashMap::new
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
        System.out.println("Deduplicated by name and age: " + uniqueByNameAndAge);

        // 4. Use collectingAndThen to fold the map-to-list conversion into the collector
        List<User> viaCollectingAndThen = users.stream()
            .collect(Collectors.collectingAndThen(
                Collectors.toMap(
                    User::getName,
                    user -> user,
                    (existing, replacement) -> existing,
                    LinkedHashMap::new
                ),
                map -> new ArrayList<>(map.values())
            ));
        System.out.println("collectingAndThen result: " + viaCollectingAndThen);
    }
}

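Comparison of Stream API deduplication methods

Method	What it does	Cost	Suitable for
`distinct()`	Deduplicates by equals()/hashCode()	One hash lookup per element	Content-based deduplication that keeps encounter order
`toMap()`	Deduplicates by a key field	One map insert per element	Field-based deduplication with a custom merge rule
`collectingAndThen`	Applies a finishing step to a collector's result	Same as the wrapped collector	Cases where the collected result needs post-processing
`groupingBy`	Groups elements, then picks from each group	One map insert per element plus one list per group	Cases that also need per-group statistics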
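
The table lists groupingBy, which the example above does not demonstrate. A possible sketch follows, under the assumption that keeping the first element of each group is the desired rule; the class GroupingDedup and its method names are made up for illustration. Note that groupingBy rejects null keys.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class GroupingDedup {

    // Groups elements by key in encounter order; each list holds every element sharing that key.
    static <T, K> Map<K, List<T>> groupByKey(List<T> items, Function<T, K> keyFn) {
        return items.stream()
            .collect(Collectors.groupingBy(keyFn, LinkedHashMap::new, Collectors.toList()));
    }

    // Deduplicates by taking the first element of every group.
    static <T, K> List<T> firstOfEachGroup(List<T> items, Function<T, K> keyFn) {
        return groupByKey(items, keyFn).values().stream()
            .map(group -> group.get(0))
            .collect(Collectors.toList());
    }
}
```

Compared with toMap, this keeps the whole group available, so the size of each group tells you how many duplicates a key had, at the cost of holding every element in memory.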

2.3 Preserving order with LinkedHashSet

When the original insertion order must be kept, use LinkedHashSet:

LinkedHashSet order-preserving deduplication example

java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkedHashSetDeduplicationExample {
    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "Alice", "Bob", "Charlie", "Alice", "David", "Bob"
        );

        System.out.println("=== LinkedHashSet order-preserving deduplication example ===");
        System.out.println("Original order: " + names);

        // LinkedHashSet keeps the first occurrence of each element in insertion order
        Set<String> uniqueNames = new LinkedHashSet<>(names);
        List<String> orderedUniqueList = new ArrayList<>(uniqueNames);

        System.out.println("Order after deduplication: " + orderedUniqueList);

        // For comparison: HashSet gives no ordering guarantee
        Set<String> hashSetNames = new HashSet<>(names);
        List<String> unorderedList = new ArrayList<>(hashSetNames);
        System.out.println("HashSet deduplication (order not guaranteed): " + unorderedList);

        // Rough timing comparison. With six elements and no JIT warm-up these numbers
        // are mostly noise; use a benchmark harness such as JMH for real measurements.
        long startTime = System.nanoTime();
        Set<String> linkedHashSet = new LinkedHashSet<>(names);
        long linkedHashSetTime = System.nanoTime() - startTime;

        startTime = System.nanoTime();
        Set<String> hashSet = new HashSet<>(names);
        long hashSetTime = System.nanoTime() - startTime;

        System.out.println("LinkedHashSet time: " + linkedHashSetTime + " ns");
        System.out.println("HashSet time: " + hashSetTime + " ns");
        System.out.println("Difference: " + (linkedHashSetTime - hashSetTime) + " ns");
    }
}

2.4 Sorted deduplication with TreeSet

TreeSet is backed by a red-black tree, so it sorts elements in a chosen order while deduplicating. It decides that two elements are duplicates when the comparator (or compareTo) returns 0, not by calling equals():

TreeSet deduplication example

java

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class TreeSetDeduplicationExample {
    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"),
            new User("Bob", 30, "bob@example.com"),
            new User("Charlie", 35, "charlie@example.com"),
            new User("Bob", 30, "bob@example.com"),       // duplicate
            new User("Alice", 25, "alice@example.com")    // duplicate
        );

        System.out.println("=== TreeSet deduplication example ===");
        System.out.println("Original user list: " + users);

        // Deduplicate in name order, then age order.
        // User does not implement Comparable, so an explicit Comparator is required.
        // Users with the same name and age count as duplicates even if their emails differ.
        TreeSet<User> nameOrderSet = new TreeSet<>(new Comparator<User>() {
            @Override
            public int compare(User u1, User u2) {
                int nameCompare = u1.getName().compareTo(u2.getName());
                if (nameCompare != 0) return nameCompare;
                return Integer.compare(u1.getAge(), u2.getAge());
            }
        });
        nameOrderSet.addAll(users);

        System.out.println("Deduplicated, ordered by name: " + nameOrderSet);

        // Deduplicate in age order, then name order
        TreeSet<User> ageOrderSet = new TreeSet<>(
            Comparator.comparingInt(User::getAge).thenComparing(User::getName)
        );
        ageOrderSet.addAll(users);

        System.out.println("Deduplicated, ordered by age: " + ageOrderSet);

        // TreeSet navigation methods
        User first = ageOrderSet.first();
        User last = ageOrderSet.last();

        System.out.println("Youngest user: " + first);
        System.out.println("Oldest user: " + last);

        // Neighbor query
        User target = new User("Bob", 30, "bob@example.com");
        User higher = ageOrderSet.higher(target); // next element strictly after target in (age, name) order

        if (higher != null) {
            System.out.println("Next user after Bob in age order: " + higher);
        }
    }
}

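Aspect	TreeSet	HashSet	LinkedHashSet
Deduplication basis	Red-black tree and comparator	Hash code and equals()	Hash code and equals()
Ordered?	Yes (natural order or comparator order)	No	Yes (insertion order)
Time complexity per add	O(log n)	O(1) average	O(1) average
Space complexity	O(n)	O(n)	O(n)
Special features	Range queries and navigation	None	Keeps insertion order
Suitable for	Deduplication with sorting	General deduplication	Order-preserving deduplication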
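
Because TreeSet compares elements with its comparator rather than equals(), it can treat objects as duplicates that a HashSet would keep apart. A small sketch of that effect, using the JDK's String.CASE_INSENSITIVE_ORDER comparator (the class and method names are hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class for illustration only.
class TreeSetEquality {

    // Strings that differ only in letter case compare as 0, so only the first one inserted is kept.
    static Set<String> dedupIgnoringCase(List<String> names) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(names);
        return set;
    }
}
```

For the input "Alice", "alice", "Bob" this yields two elements, while a HashSet of the same input holds three. Choose the comparator so that "compares as 0" matches the business meaning of "duplicate".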

3. Advanced deduplication techniques

3.1 Deduplicating by multiple fields

Multi-field deduplication example

java

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ComplexUser {
    private String name;
    private int age;
    private String department;
    private String location;

    // Constructor and getters omitted...

    /**
     * Deduplicate by name and department.
     * A "|"-joined key can collide if a field itself contains "|".
     */
    public static List<ComplexUser> deduplicateByNameAndDept(List<ComplexUser> users) {
        return users.stream()
            .collect(Collectors.toMap(
                user -> user.getName() + "|" + user.getDepartment(),
                user -> user,
                (existing, replacement) -> existing,
                LinkedHashMap::new // keep encounter order
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
    }

    /**
     * Deduplicate by any combination of fields.
     */
    @SafeVarargs
    public static List<ComplexUser> deduplicateByMultipleFields(
            List<ComplexUser> users,
            Function<ComplexUser, String>... fieldExtractors) {

        return users.stream()
            .collect(Collectors.toMap(
                user -> Arrays.stream(fieldExtractors)
                    // String.valueOf keeps a placeholder for null fields; dropping them
                    // would let ("a", null, "b") and ("a", "b", null) share a key
                    .map(extractor -> String.valueOf(extractor.apply(user)))
                    .collect(Collectors.joining("|")),
                user -> user,
                (existing, replacement) -> existing,
                LinkedHashMap::new
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
    }

    /**
     * Composite key as a small immutable value class
     */
    public static class CompositeKey {
        private final String name;
        private final String department;
        private final String location;

        public CompositeKey(String name, String department, String location) {
            this.name = name;
            this.department = department;
            this.location = location;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (obj == null || getClass() != obj.getClass()) return false;
            CompositeKey that = (CompositeKey) obj;
            return Objects.equals(name, that.name) &&
                   Objects.equals(department, that.department) &&
                   Objects.equals(location, that.location);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, department, location);
        }
    }

    /**
     * Deduplicate with the composite key
     */
    public static List<ComplexUser> deduplicateByCompositeKey(List<ComplexUser> users) {
        return users.stream()
            .collect(Collectors.toMap(
                user -> new CompositeKey(user.getName(), user.getDepartment(), user.getLocation()),
                user -> user,
                (existing, replacement) -> existing,
                LinkedHashMap::new
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
    }
}
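
When writing a dedicated key class for every field combination feels heavy, a List of the extracted field values can serve as the key, since lists implement equals() and hashCode() element by element. The sketch below is one possible approach; ListKeyDedup and distinctBy are hypothetical names, not an API from the article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper class for illustration only.
class ListKeyDedup {

    // Keeps the first element for each distinct combination of extracted field values.
    @SafeVarargs
    static <T> List<T> distinctBy(List<T> items, Function<? super T, ?>... extractors) {
        Map<List<Object>, T> byKey = new LinkedHashMap<>();
        for (T item : items) {
            List<Object> key = new ArrayList<>(extractors.length);
            for (Function<? super T, ?> extractor : extractors) {
                key.add(extractor.apply(item)); // null field values are allowed in the key
            }
            byKey.putIfAbsent(key, item);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

Unlike a "|"-joined string, a list key cannot confuse the pair ("a|b", "c") with ("a", "b|c"), and it keeps null distinct from the text "null".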

3.2 Deduplicating with a custom comparator

Custom comparator deduplication example

java

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CustomComparatorDeduplicationExample {
    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"),
            new User("Alice", 30, "alice2@example.com"),
            new User("Bob", 25, "bob@example.com"),
            new User("Bob", 30, "bob2@example.com")
        );

        System.out.println("=== Custom comparator deduplication example ===");

        // 1. TreeSet with a custom comparator.
        //    Elements are duplicates only when the comparator returns 0,
        //    here when both name and age match.
        List<User> uniqueByNameAndAge = users.stream()
            .collect(Collectors.toCollection(() ->
                new TreeSet<>(Comparator.comparing(User::getName)
                    .thenComparing(User::getAge))
            ))
            .stream()
            .collect(Collectors.toList());
        System.out.println("Sorted and deduplicated by name and age: " + uniqueByNameAndAge);

        // 2. Comparator that encodes a business ordering
        Comparator<User> businessComparator = (u1, u2) -> {
            // Compare by name first
            int nameCompare = u1.getName().compareTo(u2.getName());
            if (nameCompare != 0) return nameCompare;

            // Same name: order by age, descending. Users with the same name but
            // different ages are still distinct here; to keep only the oldest
            // user per name, use toMap with a merge function instead.
            return Integer.compare(u2.getAge(), u1.getAge());
        };

        List<User> uniqueByBusinessRule = users.stream()
            .collect(Collectors.toCollection(() ->
                new TreeSet<>(businessComparator)
            ))
            .stream()
            .collect(Collectors.toList());
        System.out.println("Deduplicated by business rule: " + uniqueByBusinessRule);

        // 3. Chained comparator over name, age, and email
        Comparator<User> chainedComparator = Comparator
            .comparing(User::getName)
            .thenComparing(User::getAge)
            .thenComparing(User::getEmail);

        List<User> uniqueByChained = users.stream()
            .collect(Collectors.toCollection(() ->
                new TreeSet<>(chainedComparator)
            ))
            .stream()
            .collect(Collectors.toList());
        System.out.println("Deduplicated by chained comparator: " + uniqueByChained);
    }
}

3.3 Deduplicating by timestamp

Timestamp-based deduplication example

java

import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TimestampedUser {
    private String name;
    private LocalDateTime timestamp;
    private String data;

    // Constructor and getters omitted...

    /**
     * Keep the latest record for each name
     */
    public static List<TimestampedUser> deduplicateKeepLatest(List<TimestampedUser> users) {
        return users.stream()
            .collect(Collectors.toMap(
                TimestampedUser::getName,
                user -> user,
                (existing, replacement) ->
                    existing.getTimestamp().isAfter(replacement.getTimestamp()) ? existing : replacement
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
    }

    /**
     * Keep the earliest record for each name
     */
    public static List<TimestampedUser> deduplicateKeepEarliest(List<TimestampedUser> users) {
        return users.stream()
            .collect(Collectors.toMap(
                TimestampedUser::getName,
                user -> user,
                (existing, replacement) ->
                    existing.getTimestamp().isBefore(replacement.getTimestamp()) ? existing : replacement
            ))
            .values()
            .stream()
            .collect(Collectors.toList());
    }

    /**
     * Time-window deduplication: for each name, drop a record that arrives
     * within the window after the last kept record. Records further apart
     * than the window are all kept.
     */
    public static List<TimestampedUser> deduplicateByTimeWindow(
            List<TimestampedUser> users,
            Duration window) {

        List<TimestampedUser> sorted = new ArrayList<>(users);
        sorted.sort(Comparator.comparing(TimestampedUser::getTimestamp));

        // Timestamp of the last kept record per name
        Map<String, LocalDateTime> lastKept = new HashMap<>();
        List<TimestampedUser> result = new ArrayList<>();
        for (TimestampedUser user : sorted) {
            LocalDateTime previous = lastKept.get(user.getName());
            if (previous == null
                    || Duration.between(previous, user.getTimestamp()).compareTo(window) > 0) {
                lastKept.put(user.getName(), user.getTimestamp());
                result.add(user);
            }
        }
        return result;
    }
}
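
The keep-latest rule can also be written generically with BinaryOperator.maxBy, so the same helper works for any key and any comparable timestamp. This is a sketch under the assumption that ties should keep the earlier-seen element; LatestPerKey and keepLatest are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class LatestPerKey {

    // For each key, keeps the element with the greatest timeFn value.
    // Result order follows the first appearance of each key.
    static <T, K, C extends Comparable<? super C>> List<T> keepLatest(
            List<T> items, Function<T, K> keyFn, Function<T, C> timeFn) {
        BinaryOperator<T> newer = BinaryOperator.maxBy(Comparator.comparing(timeFn));
        Map<K, T> latest = items.stream()
            .collect(Collectors.toMap(keyFn, Function.identity(), newer, LinkedHashMap::new));
        return new ArrayList<>(latest.values());
    }
}
```

Swapping maxBy for BinaryOperator.minBy gives the keep-earliest variant without touching the rest of the pipeline.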

4. Performance optimization tips

4.1 Preallocating capacity

Capacity preallocation example

java

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class CapacityOptimizationExample {
    public static void main(String[] args) {
        List<User> users = generateLargeUserList(10000);

        System.out.println("=== Capacity optimization example ===");

        // Single-shot System.nanoTime() timings are distorted by JIT warm-up and GC.
        // Treat the numbers as rough indications and use JMH for real measurements.

        // 1. Presize the HashSet so that it never resizes. With the default load
        //    factor of 0.75 the capacity must be at least size / 0.75, not just size.
        long startTime = System.nanoTime();
        Set<User> uniqueUsers = new HashSet<>((int) (users.size() / 0.75f) + 1);
        uniqueUsers.addAll(users);
        long optimizedTime = System.nanoTime() - startTime;

        // 2. Default capacity (16), which grows by rehashing as elements are added
        startTime = System.nanoTime();
        Set<User> uniqueUsers2 = new HashSet<>();
        uniqueUsers2.addAll(users);
        long defaultTime = System.nanoTime() - startTime;

        System.out.println("Presized time: " + optimizedTime + " ns");
        System.out.println("Default capacity time: " + defaultTime + " ns");
        System.out.println("Improvement: " + ((defaultTime - optimizedTime) * 100.0 / defaultTime) + "%");

        // 3. Compare several initial capacities
        testDifferentCapacities(users);
    }

    private static void testDifferentCapacities(List<User> users) {
        System.out.println("\n=== Timing for different initial capacities ===");

        int[] capacities = {16, 100, 1000, 10000, 20000};

        for (int capacity : capacities) {
            long startTime = System.nanoTime();
            Set<User> set = new HashSet<>(capacity);
            set.addAll(users);
            long time = System.nanoTime() - startTime;

            System.out.println("Initial capacity " + capacity + ": " + time + " ns");
        }
    }

    private static List<User> generateLargeUserList(int size) {
        List<User> users = new ArrayList<>(size);
        Random random = new Random();

        for (int i = 0; i < size; i++) {
            users.add(new User(
                "User" + random.nextInt(1000),
                random.nextInt(100),
                "user" + random.nextInt(1000) + "@example.com"
            ));
        }

        return users;
    }
}
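
The capacity arithmetic is easy to get wrong, because HashSet resizes once its size exceeds capacity times load factor (0.75 by default). A small sketch of a sizing helper, with hypothetical names HashSetSizing, capacityFor, and presized:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only.
class HashSetSizing {

    // An initial capacity large enough to hold expectedSize elements
    // without a resize at the default load factor of 0.75.
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / 0.75f) + 1;
    }

    // Deduplicates into a set that is sized up front for the worst case (no duplicates).
    static <T> Set<T> presized(Collection<T> items) {
        Set<T> set = new HashSet<>(capacityFor(items.size()));
        set.addAll(items);
        return set;
    }
}
```

For example, capacityFor(12) returns 17, whereas new HashSet<>(12) would start with a table that resizes before the twelfth distinct element fits comfortably. The copy constructor new HashSet<>(collection) already applies a similar calculation internally.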

4.2 Using parallel streams for large data volumes

Parallel stream deduplication example

java

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class ParallelStreamDeduplicationExample {
    public static void main(String[] args) {
        List<User> users = generateLargeUserList(100000);

        System.out.println("=== Sequential vs. parallel stream deduplication ===");

        // Single-shot timings are only rough indications; the parallel version is not
        // guaranteed to win, especially for small inputs or ordered streams.

        // 1. Sequential stream
        long startTime = System.nanoTime();
        List<User> uniqueUsers = users.stream()
            .distinct()
            .collect(Collectors.toList());
        long sequentialTime = System.nanoTime() - startTime;

        // 2. Parallel stream
        startTime = System.nanoTime();
        List<User> uniqueUsersParallel = users.parallelStream()
            .distinct()
            .collect(Collectors.toList());
        long parallelTime = System.nanoTime() - startTime;

        System.out.println("Sequential time: " + sequentialTime + " ns");
        System.out.println("Parallel time: " + parallelTime + " ns");
        System.out.println("Improvement: " + ((sequentialTime - parallelTime) * 100.0 / sequentialTime) + "%");

        // 3. Compare across data sizes
        testDifferentDataSizes();
    }

    private static void testDifferentDataSizes() {
        System.out.println("\n=== Timing for different data sizes ===");

        int[] sizes = {1000, 10000, 100000, 1000000};

        for (int size : sizes) {
            List<User> users = generateLargeUserList(size);

            long startTime = System.nanoTime();
            users.stream().distinct().collect(Collectors.toList());
            long sequentialTime = System.nanoTime() - startTime;

            startTime = System.nanoTime();
            users.parallelStream().distinct().collect(Collectors.toList());
            long parallelTime = System.nanoTime() - startTime;

            System.out.println("Data size " + size + ":");
            System.out.println("  Sequential: " + sequentialTime + " ns");
            System.out.println("  Parallel: " + parallelTime + " ns");
            System.out.println("  Improvement: " + ((sequentialTime - parallelTime) * 100.0 / sequentialTime) + "%");
        }
    }

    private static List<User> generateLargeUserList(int size) {
        List<User> users = new ArrayList<>(size);
        Random random = new Random();

        for (int i = 0; i < size; i++) {
            users.add(new User(
                "User" + random.nextInt(size / 10), // controls the duplicate rate
                random.nextInt(100),
                "user" + random.nextInt(size / 10) + "@example.com"
            ));
        }

        return users;
    }
}
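
On an ordered parallel stream, distinct() has to preserve which duplicate came first, which requires extra buffering. If the order of the result does not matter, dropping the ordering constraint with unordered() may let the pipeline run more efficiently. A minimal sketch, with the hypothetical names ParallelDistinct and distinctUnordered:

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class ParallelDistinct {

    // Parallel deduplication that gives up encounter order in exchange for cheaper distinct().
    // The returned list contains each distinct element once, in no particular order.
    static <T> List<T> distinctUnordered(Collection<T> items) {
        return items.parallelStream()
            .unordered()
            .distinct()
            .collect(Collectors.toList());
    }
}
```

Whether this is actually faster depends on data size, core count, and the cost of equals() and hashCode(), so it is worth measuring before adopting it.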

4.3 Processing very large data sets in batches

Batch processing example for a large data set

java

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

public class BatchProcessingDeduplicationExample {
    public static void main(String[] args) {
        List<User> users = generateLargeUserList(1000000);

        System.out.println("=== Batch processing example for a large data set ===");
        System.out.println("Data set size: " + users.size());

        // 1. Deduplicate in batches
        int batchSize = 100000;
        long startTime = System.nanoTime();
        List<User> result = deduplicateLargeDataset(users, batchSize);
        long batchTime = System.nanoTime() - startTime;

        // 2. Deduplicate in one pass
        startTime = System.nanoTime();
        List<User> directResult = users.stream().distinct().collect(Collectors.toList());
        long directTime = System.nanoTime() - startTime;

        System.out.println("Batch processing time: " + batchTime + " ns");
        System.out.println("Direct processing time: " + directTime + " ns");
        System.out.println("Batch result count: " + result.size());
        System.out.println("Direct result count: " + directResult.size());

        // 3. Compare several batch sizes
        testDifferentBatchSizes(users);
    }

    /**
     * Deduplicates batch by batch, preserving first occurrences in order.
     * The set of seen elements still grows with the number of unique elements,
     * so batching bounds the work per step and enables progress reporting,
     * but does not by itself reduce peak memory.
     */
    public static <T> List<T> deduplicateLargeDataset(List<T> items, int batchSize) {
        Set<T> uniqueItems = new HashSet<>();
        List<T> result = new ArrayList<>();

        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            List<T> batch = items.subList(i, end);

            for (T item : batch) {
                if (uniqueItems.add(item)) {
                    result.add(item);
                }
            }

            // Report progress every ten batches
            if (i % (batchSize * 10) == 0) {
                System.out.println("Processed: " + end + "/" + items.size());
            }
        }

        return result;
    }

    private static void testDifferentBatchSizes(List<User> users) {
        System.out.println("\n=== Timing for different batch sizes ===");

        int[] batchSizes = {10000, 50000, 100000, 200000};

        for (int batchSize : batchSizes) {
            long startTime = System.nanoTime();
            List<User> result = deduplicateLargeDataset(users, batchSize);
            long time = System.nanoTime() - startTime;

            System.out.println("Batch size " + batchSize + ": " + time + " ns");
        }
    }

    private static List<User> generateLargeUserList(int size) {
        List<User> users = new ArrayList<>(size);
        Random random = new Random();

        for (int i = 0; i < size; i++) {
            users.add(new User(
                "User" + random.nextInt(size / 100), // controls the duplicate rate
                random.nextInt(100),
                "user" + random.nextInt(size / 100) + "@example.com"
            ));
        }

        return users;
    }
}
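
When memory is the real constraint, one option is to avoid materializing the input and the output at all: read elements one at a time, remember only a compact key per element, and hand each first occurrence to a consumer (for example a writer). The following is a sketch of that idea rather than a drop-in replacement; StreamingDedup, forEachDistinct, and the choice of key function are assumptions.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper class for illustration only.
class StreamingDedup {

    // Passes the first element seen for each key to the sink and returns how many were emitted.
    // Only the keys are retained in memory, not the elements themselves.
    static <T, K> int forEachDistinct(Iterator<T> source, Function<T, K> keyFn, Consumer<T> sink) {
        Set<K> seenKeys = new HashSet<>();
        int emitted = 0;
        while (source.hasNext()) {
            T item = source.next();
            if (seenKeys.add(keyFn.apply(item))) {
                sink.accept(item);
                emitted++;
            }
        }
        return emitted;
    }
}
```

Memory use then scales with the number and size of distinct keys, so this helps most when the key (an ID or an email, say) is much smaller than the full record. If even the keys do not fit in memory, external storage or sorting on disk is needed.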

5. Practical application scenarios

5.1 Deduplicating user data

User data deduplication example

java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UserDataDeduplicationExample {
    public static void main(String[] args) {
        // Simulate user data fetched from different sources
        List<User> source1Users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"),
            new User("Bob", 30, "bob@example.com"),
            new User("Charlie", 35, "charlie@example.com")
        );

        List<User> source2Users = Arrays.asList(
            new User("Alice", 25, "alice@example.com"), // duplicate
            new User("David", 28, "david@example.com"),
            new User("Eve", 32, "eve@example.com")
        );

        List<User> source3Users = Arrays.asList(
            new User("Bob", 30, "bob@example.com"), // duplicate
            new User("Frank", 40, "frank@example.com")
        );

        System.out.println("=== Multi-source user deduplication example ===");

        // Merge all sources
        List<User> allUsers = new ArrayList<>();
        allUsers.addAll(source1Users);
        allUsers.addAll(source2Users);
        allUsers.addAll(source3Users);

        System.out.println("Total users before deduplication: " + allUsers.size());

        // Deduplicate by email (emails are usually unique)
        List<User> uniqueByEmail = allUsers.stream()
            .collect(Collectors.toMap(
                User::getEmail,
                user -> user,
                (existing, replacement) -> existing
            ))
            .values()
            .stream()
            .collect(Collectors.toList());

        System.out.println("Users after deduplicating by email: " + uniqueByEmail.size());
        System.out.println("Removed " + (allUsers.size() - uniqueByEmail.size()) + " duplicate user(s)");

        // Deduplicate by name and age (a business rule)
        List<User> uniqueByNameAndAge = allUsers.stream()
            .collect(Collectors.toMap(
                user -> user.getName() + "|" + user.getAge(),
                user -> user,
                (existing, replacement) -> existing
            ))
            .values()
            .stream()
            .collect(Collectors.toList());

        System.out.println("Users after deduplicating by name and age: " + uniqueByNameAndAge.size());

        // Print the deduplicated users
        System.out.println("\nDeduplicated user list:");
        uniqueByEmail.forEach(user ->
            System.out.println("  " + user.getName() + " (" + user.getAge() + ") - " + user.getEmail())
        );
    }
}

5.2 Deduplicating log data

Log data deduplication example

java

import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LogDeduplicationExample {
    public static void main(String[] args) {
        // Simulated log data
        List<LogEntry> logs = Arrays.asList(
            new LogEntry("ERROR", "Database connection failed", "2024-01-07 10:00:00", "user-service"),
            new LogEntry("ERROR", "Database connection failed", "2024-01-07 10:00:01", "user-service"),
            new LogEntry("ERROR", "Database connection failed", "2024-01-07 10:00:02", "user-service"),
            new LogEntry("INFO", "User login successful", "2024-01-07 10:01:00", "auth-service"),
            new LogEntry("WARN", "High memory usage", "2024-01-07 10:02:00", "system-monitor"),
            new LogEntry("WARN", "High memory usage", "2024-01-07 10:03:00", "system-monitor")
        );

        System.out.println("=== Log data deduplication example ===");
        System.out.println("Original log count: " + logs.size());

        // 1. Deduplicate by log content, keeping the latest entry
        List<LogEntry> uniqueByContent = logs.stream()
            .collect(Collectors.toMap(
                log -> log.getLevel() + "|" + log.getMessage() + "|" + log.getService(),
                log -> log,
                (existing, replacement) ->
                    existing.getTimestamp().compareTo(replacement.getTimestamp()) > 0 ? existing : replacement
            ))
            .values()
            .stream()
            .collect(Collectors.toList());

        System.out.println("Log count after content deduplication: " + uniqueByContent.size());

        // 2. Deduplicate by time window (identical logs within 5 minutes count as duplicates)
        List<LogEntry> uniqueByTimeWindow = deduplicateLogsByTimeWindow(logs, Duration.ofMinutes(5));
        System.out.println("Log count after time-window deduplication: " + uniqueByTimeWindow.size());

        // 3. Print the deduplicated logs
        System.out.println("\nDeduplicated logs:");
        uniqueByContent.forEach(log ->
            System.out.println("  [" + log.getTimestamp() + "] " + log.getLevel() +
                             " - " + log.getMessage() + " (" + log.getService() + ")")
        );
    }

    // Keeps an entry only if no identical entry was kept within the preceding window
    private static List<LogEntry> deduplicateLogsByTimeWindow(List<LogEntry> logs, Duration window) {
        List<LogEntry> sorted = new ArrayList<>(logs);
        sorted.sort(Comparator.comparing(LogEntry::getTimestamp));

        // Timestamp of the last kept entry per (level, message, service) key
        Map<String, LocalDateTime> lastKept = new HashMap<>();
        List<LogEntry> result = new ArrayList<>();
        for (LogEntry log : sorted) {
            String key = log.getLevel() + "|" + log.getMessage() + "|" + log.getService();
            LocalDateTime previous = lastKept.get(key);
            if (previous == null
                    || Duration.between(previous, log.getTimestamp()).compareTo(window) > 0) {
                lastKept.put(key, log.getTimestamp());
                result.add(log);
            }
        }
        return result;
    }
}

class LogEntry {
    private String level;
    private String message;
    private LocalDateTime timestamp;
    private String service;

    // Constructor and getters omitted...
    // The constructor is assumed to parse the timestamp string
    // (pattern "yyyy-MM-dd HH:mm:ss") into a LocalDateTime.
}

5.3 Deduplicating business data

Business data deduplication example

java

import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BusinessDataDeduplicationExample {
    public static void main(String[] args) {
        // Simulated order data
        List<Order> orders = Arrays.asList(
            new Order("ORD001", "Alice", 100.0, "2024-01-07 09:00:00", "PENDING"),
            new Order("ORD001", "Alice", 100.0, "2024-01-07 09:01:00", "CONFIRMED"), // repeated order ID
            new Order("ORD002", "Bob", 200.0, "2024-01-07 10:00:00", "PENDING"),
            new Order("ORD003", "Charlie", 150.0, "2024-01-07 11:00:00", "PENDING"),
            new Order("ORD003", "Charlie", 150.0, "2024-01-07 11:05:00", "CANCELLED") // repeated order ID
        );

        System.out.println("=== Business data deduplication example ===");
        System.out.println("Original order count: " + orders.size());

        // 1. Deduplicate by order ID, keeping the most recent status
        List<Order> uniqueByOrderId = orders.stream()
            .collect(Collectors.toMap(
                Order::getOrderId,
                order -> order,
                (existing, replacement) ->
                    existing.getTimestamp().compareTo(replacement.getTimestamp()) > 0 ? existing : replacement
            ))
            .values()
            .stream()
            .collect(Collectors.toList());

        System.out.println("Order count after deduplicating by order ID: " + uniqueByOrderId.size());

        // 2. Deduplicate by customer and amount (guards against repeated submissions)
        List<Order> uniqueByCustomerAndAmount = orders.stream()
            .collect(Collectors.toMap(
                order -> order.getCustomerName() + "|" + order.getAmount(),
                order -> order,
                (existing, replacement) ->
                    existing.getTimestamp().compareTo(replacement.getTimestamp()) > 0 ? existing : replacement
            ))
            .values()
            .stream()
            .collect(Collectors.toList());

        System.out.println("Order count after deduplicating by customer and amount: " + uniqueByCustomerAndAmount.size());

        // 3. Print the deduplicated orders
        System.out.println("\nDeduplicated orders:");
        uniqueByOrderId.forEach(order ->
            System.out.println("  " + order.getOrderId() + " - " + order.getCustomerName() +
                             " ($" + order.getAmount() + ") - " + order.getStatus() +
                             " [" + order.getTimestamp() + "]")
        );
    }
}

class Order {
    private String orderId;
    private String customerName;
    private double amount;
    private LocalDateTime timestamp;
    private String status;

    // Constructor and getters omitted...
    // The constructor is assumed to parse the timestamp string
    // (pattern "yyyy-MM-dd HH:mm:ss") into a LocalDateTime.
}

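6. Best practices summary

6.1 Choosing a deduplication strategy

Core principles

Consider the following factors when choosing a deduplication strategy:

Data size: use HashSet for small data sets; consider batch processing for large ones
Performance requirements: where throughput matters and the data is large, try parallel streams and measure the result
Memory limits: when memory is tight, process the data incrementally, keep only compact keys, or use external storage
Business logic: choose the deduplication fields according to the business requirement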

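6.2 Performance optimization strategies

Strategy	How	When to use	Expected effect
Preallocate capacity	Use `new HashSet<>((int) (expectedSize / 0.75f) + 1)`	Data volume is known in advance	Avoids rehashing; roughly 20-30% faster as an estimate, to be confirmed by benchmarking
Choose the right collection	HashSet for lookups, TreeSet for sorted output	Depends on the use case	Faster lookups
Parallel processing	Use `parallelStream()`	Large data volumes	Up to roughly 2-4x on multi-core machines as an estimate; results vary with workload
Batch processing	Split a large data set into smaller batches	Very large data sets	Bounded work per step; combine with compact keys or external storage to limit memory
Cache results	Cache the deduplicated result	The same data is deduplicated repeatedly	Avoids recomputation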

6.3 Common pitfalls and solutions

Things to watch for

Be aware of the following common pitfalls when deduplicating:

Inconsistent equals() and hashCode()

java

// Wrong: equals and hashCode are inconsistent
@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null || getClass() != obj.getClass()) return false;
    User user = (User) obj;
    return Objects.equals(name, user.name); // compares name only
}

@Override
public int hashCode() {
    return Objects.hash(name, age, email); // but hashCode uses every field
}

// Right: keep the two consistent
@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null || getClass() != obj.getClass()) return false;
    User user = (User) obj;
    return Objects.equals(name, user.name);
}

@Override
public int hashCode() {
    return Objects.hash(name); // uses only the fields compared in equals
}
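
The practical effect of this pitfall can be observed directly. The sketch below uses two hypothetical key classes, BrokenKey and FixedKey, which are not the article's User class: both compare by name only, but only FixedKey hashes by name only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;

// Hypothetical key classes for illustration only.
class BrokenKey {
    final String name;
    final int age;
    BrokenKey(String name, int age) { this.name = name; this.age = age; }

    @Override public boolean equals(Object o) {
        return o instanceof BrokenKey && Objects.equals(name, ((BrokenKey) o).name);
    }
    // Inconsistent: hashes a field that equals() ignores.
    @Override public int hashCode() { return Objects.hash(name, age); }
}

class FixedKey {
    final String name;
    final int age;
    FixedKey(String name, int age) { this.name = name; this.age = age; }

    @Override public boolean equals(Object o) {
        return o instanceof FixedKey && Objects.equals(name, ((FixedKey) o).name);
    }
    // Consistent: hashes exactly what equals() compares.
    @Override public int hashCode() { return Objects.hash(name); }
}

class ContractDemo {
    static int brokenCount() {
        return new HashSet<>(Arrays.asList(new BrokenKey("Alice", 25), new BrokenKey("Alice", 30))).size();
    }
    static int fixedCount() {
        return new HashSet<>(Arrays.asList(new FixedKey("Alice", 25), new FixedKey("Alice", 30))).size();
    }
}
```

With BrokenKey the two equal objects have different hash codes, so the HashSet never compares them and keeps both; with FixedKey the set keeps one.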

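Lazy evaluation in the Stream API

java

// Wrong: the stream is lazy, so nothing has run yet when the list changes
// (assumes users is a modifiable list such as an ArrayList)
Stream<User> stream = users.stream().distinct();
users.add(new User("New", 25, "new@example.com")); // this element ends up in the stream's result
List<User> lateResult = stream.collect(Collectors.toList());

// Right: collect immediately
List<User> result = users.stream().distinct().collect(Collectors.toList());
users.add(new User("New", 25, "new@example.com")); // does not affect the already collected result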
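
A self-contained way to see this behavior, using strings instead of the article's User class (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class for illustration only.
class LazyDistinctDemo {

    // Builds a pipeline, mutates the source, and only then runs the terminal operation.
    static List<String> collectAfterMutation() {
        List<String> source = new ArrayList<>(Arrays.asList("a", "b", "a"));
        Stream<String> pending = source.stream().distinct(); // nothing runs yet
        source.add("c");                                      // mutate before the terminal operation
        return pending.collect(Collectors.toList());          // traversal starts here
    }
}
```

The returned list is [a, b, c]: the element added after the pipeline was built is included, because an ArrayList-backed stream binds to its source only when the terminal operation starts.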

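Running out of memory

java

// Risky: loading and deduplicating a huge data set in one step
List<User> hugeList = generateHugeList(10000000);
Set<User> uniqueUsers = new HashSet<>(hugeList); // may throw OutOfMemoryError

// Better: process in batches. The set of seen elements still grows with the
// number of unique elements, so for data that exceeds the heap, also read the
// input incrementally and track compact keys or use external storage.
List<User> uniqueUserList = deduplicateLargeDataset(hugeList, 100000);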

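6.4 Testing and debugging advice

Unit test coverage
- Test boundary conditions (empty collection, null values, a single element)
- Test how duplicate elements are handled
- Test deduplication across different data types
Performance testing
- Use JMH for benchmark measurements
- Test performance at different data volumes
- Monitor memory usage
Debugging tips
- Log the deduplication process
- Use the Stream API's peek() method to inspect stream operations
- Verify that the deduplicated result is correct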

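7. Summary

Java's collection deduplication techniques are flexible and powerful tools for data processing. Used appropriately, they let us:

Improve data quality: remove duplicate data and keep data consistent
Optimize storage: reduce redundant data and lower storage cost
Improve processing performance: avoid repeated computation and speed up system response
Support business needs: choose a deduplication strategy that fits each business scenario

In day-to-day development, we should:

Understand the business requirement: be clear about what "duplicate" means and which rules apply
Choose a suitable algorithm: pick the method that matches the data size and performance requirements
Pay attention to performance: use preallocated capacity, parallel processing, and similar techniques where they are justified
Ensure code quality: implement equals() and hashCode() correctly and handle boundary cases

With a solid understanding of these techniques and practice in applying them, we can build Java applications that are more efficient, robust, and maintainable.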
