A Few Things About DistCp


A soul-searching question: do you really understand distcp? This post covers a few things about it.

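Background

While tidying my notes today, I found several ad hoc entries that all recorded things to watch out for when copying files between clusters. Their details and emphasis differ, but distcp is at the core of every one of them, so it seemed worth consolidating them. This post mostly covers small details, with the emphasis on how to find your own answer quickly when a question comes up.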

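Overview

First, what is distcp? Literally, it stands for distributed copy: work that one worker used to do alone is split among many workers that run in parallel. The work is divided at file granularity, so if there is only one file, at most one worker can copy it.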

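Basic usage

# Use the hdfs protocol (same Hadoop version on both sides) or the hftp protocol (usable across different Hadoop versions)
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
# Several sources can be given in one copy; the last path is the target
hadoop distcp hdfs://nn1:8020/foo/bar1 hdfs://nn2:8020/foo/bar2 hdfs://nn3:8020/bar/foo
# Or use -f (think "-file"): the sources are listed, as absolute paths, in a file
# (srclist is a placeholder name for that list file)
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo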
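For the -f form, the list file has to exist before the job starts. Here is a minimal sketch of preparing one, assuming one fully qualified source path per line; `write_srclist`, `/tmp/srclist`, and the HDFS paths are placeholders of mine, not anything distcp defines.

```shell
# Hypothetical helper: write each source URI on its own line of a local file.
write_srclist() {
    local out=$1
    shift
    printf '%s\n' "$@" > "$out"
}

# Example (all paths are placeholders):
#   write_srclist /tmp/srclist hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b
#   hadoop fs -put -f /tmp/srclist hdfs://nn1:8020/srclist
# The uploaded file is then what gets passed to -f.
```

Keeping the list in a file makes it easy to review or version the exact set of sources before a large copy.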

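Parameters

I won't explain every parameter one by one, since the help text is mostly self-explanatory. Instead, this section covers the details I think need attention in practice.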

[yourFather@hadoop-onedata ~]$ hadoop distcp --help
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                Reuse existing data in target files and append new
                        data to them if possible
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -diff <arg>            Use snapshot diff report to identify the
                        difference between source and target
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugpcaxt)(replication,
                        block-size, user, group, permission,
                        checksum-type, ACL, XATTR, timestamps). If -p is
                        specified with no <arg>, then preserves
                        replication, block size, user, group, permission,
                        checksum type and timestamps. raw.* xattrs are
                        preserved when both the source and destination
                        paths are in the /.reserved/raw hierarchy (HDFS
                        only). raw.* xattrpreservation is independent of
                        the -p flag. Refer to the DistCp documentation for
                        more details.
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missingfiles or
                        directories

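Relationship between -append, -overwrite, and -update

| Option | Meaning | Notes |
|---|---|---|
| -append | Append: reuse the data already in the sink file and try to append the new data to it | Decision criterion: TODO |
| -overwrite | Overwrite: regenerate the file whether or not it already exists | |
| -update | Update: copy only what is missing or different | The criterion is whether the source and sink file sizes match (checksums are also compared unless -skipcrccheck is set) |

-m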

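This one is easy to explain. distcp uses the MapReduce model, much like sqoop, so -m is short for -map and sets the parallelism: the maximum number of maps copying at the same time. Why "maximum"? Because the copy is file-based (HDFS stores each file as blocks, but a file is still assigned to a single map here), and splitting one file into several pieces is harder. So if the source contains only one file, only one map task does the copy no matter what -m is set to.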
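As a rough sketch of that cap: the number of maps that can do useful work is bounded by both -m and the number of source files. `effective_maps` is a hypothetical helper of mine, not part of distcp, and it ignores how the copy strategy groups files.

```shell
# Hypothetical helper: upper bound on busy maps = min(-m value, number of files).
effective_maps() {
    local m=$1 nfiles=$2
    if [ "$nfiles" -lt "$m" ]; then
        echo "$nfiles"
    else
        echo "$m"
    fi
}

effective_maps 20 1      # a single source file: prints 1
effective_maps 20 5000   # plenty of files: prints 20
```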

-i

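Ignore failures. When a copy job is heavy and resources are tight, it may well fail partway through, and you don't want to restart a full copy every time. In that case, consider ignoring failures and catching up with an incremental copy on a later run.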
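One way to arrange those later incremental runs is a small retry wrapper. This is only a sketch: `run_with_retries` is a hypothetical helper, and it assumes the wrapped command's exit status tells you whether another pass is needed, which you should verify for your distcp version.

```shell
# Hypothetical helper: rerun a command until it exits 0, at most $1 times.
run_with_retries() {
    local max=$1
    shift
    local attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (placeholder paths): each pass copies only what is still missing.
#   run_with_retries 3 hadoop distcp -i -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```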

-strategy

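The copy strategy. By default, work is split by file size. The accepted values are dynamic and uniformsize; the default, uniformsize, has every copy task copy roughly the same number of bytes.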

-p

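Literally, preserve the file status in the target system, including replication, block size, user, group, permissions, and so on. Without it, copied files take the target system's defaults.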

-bandwidth

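Clearly, this is the bandwidth limit. distcp has no compute logic and is an I/O-intensive job, so bandwidth use has to be controlled strictly during a cluster migration. This parameter caps the bandwidth each map may use (in MB/s), so limiting the number of distcp jobs and the number of maps per job bounds the total bandwidth of the migration.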
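Since the limit is per map, one job can use up to roughly (number of maps) x (-bandwidth) in total. The sketch below works backwards from a total budget to a per-map value. `per_map_bandwidth` and the numbers are illustrative assumptions of mine, not distcp features.

```shell
# Hypothetical helper: split a total budget (MB/s) evenly over the maps.
# Integer division rounds down, and the result is never below 1;
# if it hits that floor, lower -m instead of relying on this value.
per_map_bandwidth() {
    local total_mb=$1 maps=$2
    local per_map=$(( total_mb / maps ))
    [ "$per_map" -lt 1 ] && per_map=1
    echo "$per_map"
}

maps=20
bw=$(per_map_bandwidth 400 "$maps")   # 400 MB/s budget over 20 maps
# Dry run: print the command rather than executing it.
echo "hadoop distcp -m $maps -bandwidth $bw <source> <target>"
```

If several distcp jobs run at once, the same budget has to be divided across the jobs first.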

QA

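These are small problems I ran into in practice. They are not necessarily about internals or tuning, just places where usage can raise questions or be ambiguous.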

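Q1: What happens if data conflicts during a copy?
A1: If two sources collide on the same file name, the distcp job fails and logs an error. If a file to be copied already exists in the target directory, the source file is skipped by default, though this can be configured to raise an error instead. If another process is writing to the target file, the copy also fails and logs an error.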

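Q2: Are there requirements on where a distcp job runs?
A2: The only requirement is that the node running the distcp job, or more precisely its tasks, can reach both the upstream and the downstream cluster. Placement is otherwise unrestricted; in practice the job usually runs on a node of the target cluster.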

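Q3: What should I watch for when using distcp for a large data migration?
A3: distcp is an I/O-heavy job, so bandwidth is the limiting factor. You can write a script (shell or Python works best) that monitors the bandwidth of the cluster machines and starts the migration during idle periods.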
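A minimal sketch of such a monitor, assuming a Linux host where interface counters are readable from /proc/net/dev. The function names, the interface `eth0`, and the threshold are placeholders of mine; a real setup would probably watch several machines, not just the local one.

```shell
# Received + transmitted bytes for one interface, read from a
# /proc/net/dev style file (second argument overrides the file).
iface_bytes() {
    local iface=$1 file=${2:-/proc/net/dev}
    awk -v ifc="$iface" '{ sub(/:/, " ") } $1 == ifc { printf "%.0f\n", $2 + $10 }' "$file"
}

# Average throughput in whole MB/s between two counter readings taken $3 seconds apart.
rate_mb_per_s() {
    echo $(( ($2 - $1) / $3 / 1024 / 1024 ))
}

# Hypothetical gate: poll until traffic drops below max_mb, then run the command.
# interval must be a whole number of seconds.
run_when_idle() {
    local iface=$1 max_mb=$2 interval=$3
    shift 3
    local before after
    while :; do
        before=$(iface_bytes "$iface")
        sleep "$interval"
        after=$(iface_bytes "$iface")
        if [ "$(rate_mb_per_s "$before" "$after" "$interval")" -lt "$max_mb" ]; then
            "$@"
            return
        fi
    done
}

# Example (placeholder values): start once eth0 carries less than 50 MB/s.
#   run_when_idle eth0 50 10 hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

The gate only decides when to start; keeping the job within budget while it runs is still the job of -m and -bandwidth.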

Appendix

# Sample 1: when copying a directory, the target directory is created automatically; there is no need to create it by hand
time hadoop distcp hdfs://nn1:8020/user/hive/warehouse/${database}.db/${table}/dt=${partition}  hdfs://nn2:8020/user/hive/warehouse/${database}.db/${table} >> /logs/distcp/${database}.log

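# Sample 2: a copy with many parameters
# (options go in a bash array so that each one can carry its own comment;
#  a comment after a trailing backslash would break the line continuation)
distcp_opts=(
    -Dmapred.jobtracker.maxtasks.per.job=1800000   # max map tasks per job (data is split across many map tasks)
    -Dmapred.job.max.map.running=4000              # max concurrent maps
    -Ddistcp.bandwidth=150000000                   # bandwidth
    -Ddfs.replication=2                            # replication factor: two replicas
    -Ddistcp.skip.dir="$skipPath"                  # directories to filter out (not copied)
    -Dmapred.map.max.attempts=9                    # max attempts per task
    -Dmapred.fairscheduler.pool=distcp             # pool the job runs in
    -pugp                                          # preserve attributes (user, group, permission)
    -i                                             # ignore failed tasks
    -skipcrccheck                                  # skip CRC checks (avoids failures when the source and target HDFS versions differ)
)
# source address first, then target address
hadoop distcp "${distcp_opts[@]}" \
    hdfs://clusterA:9000/AAA/data \
    hdfs://clusterB:9000/BBB/data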
    
# Sample 3: copy across versions; the hftp address must match dfs.http.address
hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hftp://nn1:50070/source hdfs://cluster2/target


These notes were written in some haste; if you spot any mistakes, corrections are welcome.