[Report] Find My Password For NetEase Mail
2016-06-21(二) by chenjia.meFind My Password For NetEase Mail
背景
去年黑产界流传着一个威胁十分巨大的数据库,那就是网易邮箱的数据库。这个数据库包含了众多邮箱的明文密码和密保答案。导致了去年APPLE ID被盗事件,战网账号被盗等事件高发。特别是Apple ID 被盗导致苹果设备被锁定敲诈事件引起了众多人民的愤怒。 我们在今年4月拿到了已经被黑产多次洗过的数据库。虽然里面的大部分数据都已经被人利用了。为了让更多无知群众知道自己的网易邮箱是不安全的。我们发起了这个项目,方便大家查找自己邮箱是否在泄露列表中,而不是仅仅相信网易官方发布的撞库公告。
课程背景
本学期高级数据库系统介绍了一些数据库的底层的基本架构。并且介绍了众多前沿科技使用的论文。其中也介绍了非关系型数据。也就是现在比较流行的nosql。论文中使用了Redsi和Memcache两种进行对比。
项目说明
我们使用了Redis来作为我们的数据库,一共处理了2.6亿条不整齐数据,将他们以<Emali_address,password>
作为<key,value>
来存储,并且在前端界面通过EMAIL_ADDRESS
进行查询,返回对应value
的结果
项目地址
Find My Password For NetEase Mail
数据说明
我们获取到数据是纯文本文件(txt),解压后一共约有72G的文本文件。在文本文件中,每行文本文件是一个邮箱及其相关信息,这些信息格式并不整齐,例如
mjgj8735@163.com----15969360262yaya
通过整理我们一共用了9种正则表达式来识别数据,通过数据库去重操作获取了2.6亿条数据,并且筛选出了172万行无效数据和未知格式错误。
数据库选择
面对这么大规模的数据,使用传统的关系型数据库就需要很多的优化才能有比较好的性能,比如最经常使用的Mysql在千万级别数据后效率会急剧下降。虽然分表能解决问题,然而结合memchache后命中率等问题也会困恼我们。而在nosql中,redis和memchache的对比中,我们偏向于选择redis,因为
- 他是在内存中进行查找的
- 支持数据在硬盘中备份。提供持久化功能。
- 其支持的数据结构比较多,
- 他在内存中只缓存了key而不是都缓存。
- 会自适应内存
在实际的使用中我们也发现在插入的过程中,执行瓶颈在Java读取硬盘文本文件的I/O上,说明Redis在数据量在千万级别以上的时候,其插入的时候去重依旧能达到每秒几万的水平,这是一个很高的状态。
数据统计
在我们完成所有文本数据处理到redis中后,我们对一些技术数据进行了统计:
-
对于邮箱列表 >不重复数据有 266,202,509 无效数据有5,811,079 ,全部的文本数据忘记统计了,不过根据加入的时候观察重复和多次处理后的结果。应该在5亿条左右(因为这些流出数据库是很多渠道,所以会有比较高的重复率)
-
对于Redis >初始化Redis,查看其info:
# Memory used_memory:28783857752 used_memory_human:26.81G used_memory_rss:29373415424 used_memory_peak:28783856696 used_memory_peak_human:26.81G used_memory_lua:33792 mem_fragmentation_ratio:1.02 mem_allocator:jemalloc-3.4.1 # Persistence loading:0 rdb_changes_since_last_save:0 rdb_bgsave_in_progress:0 rdb_last_save_time:1466519188 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:-1 rdb_current_bgsave_time_sec:-1 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok # Stats total_connections_received:5 total_commands_processed:7 instantaneous_ops_per_sec:0 rejected_connections:0 sync_full:0 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 evicted_keys:0 keyspace_hits:0 keyspace_misses:0 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:0 # Replication role:master connected_slaves:0 master_repl_offset:0 repl_backlog_active:0 repl_backlog_size:1048576 repl_backlog_first_byte_offset:0 repl_backlog_histlen:0 # CPU used_cpu_sys:7.54 used_cpu_user:248.08 used_cpu_sys_children:0.00 used_cpu_user_children:0.00 # Keyspace db0:keys=266202509,expires=0,avg_ttl=0
对于进行基准测试后的Redis数据库,查看其info:
# Memory used_memory:421019072 used_memory_human:401.52M used_memory_rss:439119872 used_memory_peak:28783877568 used_memory_peak_human:26.81G used_memory_lua:33792 mem_fragmentation_ratio:1.04 mem_allocator:jemalloc-3.4.1 # Persistence loading:0 rdb_changes_since_last_save:0 rdb_bgsave_in_progress:0 rdb_last_save_time:1466516964 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:1 rdb_current_bgsave_time_sec:-1 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok # Stats total_connections_received:3406 total_commands_processed:62720788 instantaneous_ops_per_sec:0 rejected_connections:0 sync_full:0 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 evicted_keys:0 keyspace_hits:18447376 keyspace_misses:244016 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:38959 # Replication role:master connected_slaves:0 master_repl_offset:0 repl_backlog_active:0 repl_backlog_size:1048576 repl_backlog_first_byte_offset:0 repl_backlog_histlen:0 # CPU used_cpu_sys:199.15 used_cpu_user:1058.54 used_cpu_sys_children:10.44 used_cpu_user_children:153.48
服务器配置
我们使用DELL T630塔式服务器作为测试,其基本的配置如下
- CPU 2 x Intel® Xeon(R) CPU E5-2650 v3 @ 2.30GHz × 16
- Memory 128Gb
- Network 千兆以太网卡
- HDD 2T HDD
Redis测试
我们首先对写入数据后的Redis数据库进行了一些基本测试。
-
对于Redis服务器 10万条基准测试
PING_INLINE: 180831.83 requests per second PING_BULK: 203252.03 requests per second SET: 168067.22 requests per second GET: 192307.70 requests per second INCR: 177619.89 requests per second LPUSH: 195694.72 requests per second LPOP: 192307.70 requests per second SADD: 147492.62 requests per second SPOP: 131926.12 requests per second LPUSH (needed to benchmark LRANGE): 129198.97 requests per second LRANGE_100 (first 100 elements): 70821.53 requests per second LRANGE_300 (first 300 elements): 26788.11 requests per second LRANGE_500 (first 450 elements): 18261.51 requests per second LRANGE_600 (first 600 elements): 13213.53 requests per second MSET (10 keys): 96993.21 requests per second
-
读写操作,使用命令
redis-benchmark -t set,get -r 100000 -n 1000000 -P (1-32)
2.1 读写测试(1线程)
====== SET ====== 1000000 requests completed in 6.95 seconds 50 parallel clients 3 bytes payload keep alive: 1 99.94% <= 1 milliseconds 99.97% <= 2 milliseconds 99.99% <= 5 milliseconds 100.00% <= 40 milliseconds 100.00% <= 41 milliseconds 100.00% <= 42 milliseconds 100.00% <= 42 milliseconds 143864.19 requests per second ====== GET ====== 1000000 requests completed in 7.64 seconds 50 parallel clients 3 bytes payload keep alive: 1 99.98% <= 1 milliseconds 99.99% <= 3 milliseconds 99.99% <= 4 milliseconds 100.00% <= 5 milliseconds 100.00% <= 5 milliseconds 130890.05 requests per second
2.2读写测试(4线程)
====== SET ====== 1000000 requests completed in 2.69 seconds 50 parallel clients 3 bytes payload keep alive: 1 99.49% <= 1 milliseconds 99.97% <= 2 milliseconds 99.98% <= 3 milliseconds 99.98% <= 31 milliseconds 99.98% <= 32 milliseconds 99.98% <= 33 milliseconds 99.99% <= 34 milliseconds 100.00% <= 34 milliseconds 371471.03 requests per second GET: -nan GET: 356176.00 GET: 393656.00 GET: 407546.66 GET: 416996.00 GET: 419772.81 GET: 423405.34 GET: 423568.00 GET: 424560.00 GET: 426416.00 ====== GET ====== 1000000 requests completed in 2.35 seconds 50 parallel clients 3 bytes payload keep alive: 1 99.97% <= 1 milliseconds 100.00% <= 1 milliseconds 426439.22 requests per second INCR: 222494.11 INCR: 354997.00 INCR: 380929.94 INCR: 388378.44 INCR: 398009.19 INCR: 404931.81 INCR: 410682.62 INCR: 413861.56 INCR: 416932.38 INCR: 419282.22
2.3 读写测试(8线程) ====== SET ====== 1000000 requests completed in 1.73 seconds 50 parallel clients 3 bytes payload keep alive: 1
97.40% <= 1 milliseconds 99.63% <= 2 milliseconds 99.84% <= 3 milliseconds 99.92% <= 4 milliseconds 99.99% <= 7 milliseconds 99.99% <= 8 milliseconds 100.00% <= 8 milliseconds 577034.06 requests per second ====== GET ====== 1000000 requests completed in 1.50 seconds 50 parallel clients 3 bytes payload keep alive: 1 98.99% <= 1 milliseconds 99.96% <= 4 milliseconds 99.99% <= 5 milliseconds 100.00% <= 5 milliseconds 668896.31 requests per second
2.4读写操作(16线程) ====== SET ====== 1000000 requests completed in 1.33 seconds 50 parallel clients 3 bytes payload keep alive: 1
72.78% <= 1 milliseconds 98.62% <= 2 milliseconds 99.53% <= 3 milliseconds 99.66% <= 4 milliseconds 99.84% <= 5 milliseconds 99.90% <= 6 milliseconds 99.95% <= 7 milliseconds 99.95% <= 10 milliseconds 100.00% <= 10 milliseconds 753012.06 requests per second ====== GET ====== 1000000 requests completed in 1.11 seconds 50 parallel clients 3 bytes payload keep alive: 1 86.16% <= 1 milliseconds 99.83% <= 2 milliseconds 100.00% <= 2 milliseconds 898472.56 requests per second
2.5读写操作(32线程)
====== SET ====== 1000000 requests completed in 1.13 seconds 50 parallel clients 3 bytes payload keep alive: 1 3.57% <= 1 milliseconds 79.79% <= 2 milliseconds 97.69% <= 3 milliseconds 99.55% <= 4 milliseconds 99.82% <= 5 milliseconds 99.84% <= 10 milliseconds 99.93% <= 11 milliseconds 99.93% <= 13 milliseconds 99.94% <= 14 milliseconds 99.94% <= 16 milliseconds 99.95% <= 18 milliseconds 99.95% <= 20 milliseconds 99.96% <= 22 milliseconds 99.96% <= 23 milliseconds 99.96% <= 25 milliseconds 99.96% <= 26 milliseconds 99.97% <= 27 milliseconds 99.97% <= 29 milliseconds 99.97% <= 30 milliseconds 99.98% <= 32 milliseconds 99.98% <= 34 milliseconds 99.98% <= 35 milliseconds 99.99% <= 37 milliseconds 99.99% <= 39 milliseconds 99.99% <= 40 milliseconds 100.00% <= 42 milliseconds 100.00% <= 42 milliseconds 887311.44 requests per second ====== GET ====== 1000000 requests completed in 1.04 seconds 50 parallel clients 3 bytes payload keep alive: 1 6.38% <= 1 milliseconds 87.71% <= 2 milliseconds 99.22% <= 3 milliseconds 99.79% <= 4 milliseconds 99.84% <= 40 milliseconds 99.86% <= 41 milliseconds 99.89% <= 42 milliseconds 99.98% <= 43 milliseconds 100.00% <= 43 milliseconds 960614.81 requests per second
从读写测试我们可以看到Redis在有2.6亿条数据后其效率还是很惊人的,但是我们也发现并发的线程越多,其速度并不是线性增加的。在16个线程到32个线程之间,虽然线程数目增加了一倍,但是效率提高的并不多。说明这一句到了Redis的一个瓶颈
Redis安全说明
Redis开放端口应该放在内网,然后让网页服务器通过内网进行连接,这样可以防止Redis绕过授权登录(或者及时更新最新版本Redis)
总结
Redis真的好用而且快,支持Java支持PHP,一键连接快速读写。