Caffe source code: VideoDataLayer

    xiaoxiao · 2021-08-15

    Analysis

    I'm writing this article to understand how Caffe's data layer is designed when the input is not still images but video.

    First intuition

    For a Caffe input layer that reads video frame by frame rather than single images, my intuition was that the design shouldn't differ much, since a video is in essence a sequence of frames. But I genuinely didn't know how to handle it. The most brute-force approach might be to save each frame as a single-channel image and then merge many single-channel images into one multi-channel image, so that a video effectively becomes one multi-channel image. The problem is that this isn't as simple as it sounds; and besides, if the network takes every frame as input, the computational cost is staggering. That is indeed a real issue, but honestly the bigger reason is: I couldn't write it myself.

    So let's lean on prior work and see how others implemented it.

    1. Open caffe.proto and look at the VideoDataLayer's parameters:

    message VideoDataParameter {
      // Specify the data source.
      optional string source = 1;
      // Specify the batch size.
      optional uint32 batch_size = 4;
      // The rand_skip variable is for the data layer to skip a few data points
      // to avoid all asynchronous sgd clients to start at the same point. The skip
      // point would be set as rand_skip * rand(0,1). Note that rand_skip should not
      // be larger than the number of keys in the leveldb.
      optional uint32 rand_skip = 7 [default = 0];
      // Whether or not ImageLayer should shuffle the list of files at every epoch.
      optional bool shuffle = 8 [default = false];
      // It will also resize images if new_height or new_width are not zero.
      optional uint32 new_height = 9 [default = 0];
      optional uint32 new_width = 10 [default = 0];
      optional uint32 new_length = 11 [default = 1];
      optional uint32 num_segments = 12 [default = 1];
      // DEPRECATED. See TransformationParameter. For data pre-processing, we can do
      // simple scaling and subtracting the data mean, if provided. Note that the
      // mean subtraction is always carried out before scaling.
      optional float scale = 2 [default = 1];
      optional string mean_file = 3;
      // DEPRECATED. See TransformationParameter. Specify if we would like to randomly
      // crop an image.
      optional uint32 crop_size = 5 [default = 0];
      // DEPRECATED. See TransformationParameter. Specify if we want to randomly mirror
      // data.
      optional bool mirror = 6 [default = false];
      enum Modality {
        RGB = 0;
        FLOW = 1;
      }
      optional Modality modality = 13 [default = FLOW];
      // the name pattern for frame images,
      // for RGB modality it defaults to "img_%d.jpg", for FLOW "flow_x_%d" and "flow_y_%d"
      optional string name_pattern = 14;
      // The type of input
      optional bool encoded = 15 [default = false];
    }

    Only the variables without comments need explanation. new_length is, for optical-flow input, the number of stacked flow frames that make up one input sample. num_segments, also used for flow, defaults to 1; it is the number of segments a video sample is split into. In practice we also use num_segments = 1, i.e. one video is one sample. name_pattern is the file-name format of the saved flow and RGB frames: the computed optical-flow images are saved as "flow_x_%d.jpg" and "flow_y_%d.jpg", while the RGB frames are saved as "img_%d.jpg". The data layer takes a different reading path for img and flow, as we'll see in the source, hence the two Modality options, RGB and FLOW.
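    To make the parameters above concrete, a VideoData layer in a train prototxt might look roughly like the sketch below. The source path and the specific values are made up for illustration; only the parameter names come from the message definition above.

```protobuf
layer {
  name: "data"
  type: "VideoData"
  top: "data"
  top: "label"
  video_data_param {
    source: "train_list.txt"  # hypothetical list file: "<frame_dir> <num_frames> <label>" per line
    batch_size: 32
    new_length: 5             # stack 5 consecutive flow frames per segment
    num_segments: 1           # one sampled window per video
    modality: FLOW
    shuffle: true
    new_height: 256
    new_width: 340
  }
}
```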


    2. Let's see how the VideoDataLayer's functionality is declared. Open VideoDataLayer.hpp:

    template <typename Dtype>
    class VideoDataLayer : public BasePrefetchingDataLayer<Dtype> {
     public:
      explicit VideoDataLayer(const LayerParameter& param)
          : BasePrefetchingDataLayer<Dtype>(param) {}
      virtual ~VideoDataLayer();
      virtual void DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                  const vector<Blob<Dtype>*>& top);
      virtual inline const char* type() const { return "VideoData"; }
      virtual inline int ExactNumBottomBlobs() const { return 0; }
      virtual inline int ExactNumTopBlobs() const { return 2; }

    Like ImageDataLayer, VideoDataLayer inherits from BasePrefetchingDataLayer. The most important part is the layer-specific behavior in DataLayerSetUp, analyzed below with the implementation. The header then declares several important members: the prefetch RNGs and the frame RNG. Random number generators are an important concept in Caffe, used mainly for weight initialization, and also needed when shuffling.

     protected:
      shared_ptr<Caffe::RNG> prefetch_rng_;
      shared_ptr<Caffe::RNG> prefetch_rng_2_;
      shared_ptr<Caffe::RNG> prefetch_rng_1_;
      shared_ptr<Caffe::RNG> frame_prefetch_rng_;
      virtual void ShuffleVideos();
      virtual void InternalThreadEntry();

    prefetch_rng_1_ and prefetch_rng_2_ are two prefetch RNGs that, as we'll see in DataLayerSetUp, are seeded with the same seed so that they produce identical random sequences; this is what keeps the two shuffled containers aligned with each other after ShuffleVideos. frame_prefetch_rng_ is the RNG for sampling frame positions within a video. The thread entry point also differs from ImageDataLayer's; more on that in the implementation.

    #ifdef USE_MPI
      inline virtual void advance_cursor() {
        lines_id_++;
        if (lines_id_ >= lines_.size()) {
          // We have reached the end. Restart from the first.
          DLOG(INFO) << "Restarting data prefetching from start.";
          lines_id_ = 0;
          if (this->layer_param_.video_data_param().shuffle()) {
            ShuffleVideos();
          }
        }
      }
    #endif

    I'm not entirely sure about that block (it's only compiled under MPI), but it's essentially a cursor: advance to the next sample, and when the end of the list is reached, wrap back to the start and, if shuffle is enabled, reshuffle, so each epoch sees a different order.
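    The cursor logic above can be sketched without any Caffe dependencies. This is a minimal illustration of the advance-and-wrap arithmetic only; the shuffling and logging are omitted.

```cpp
#include <cstddef>

// Sketch of advance_cursor: move to the next sample index, and wrap back to
// index 0 once the list of num_lines samples is exhausted (once per epoch).
// In the real layer, the wrap point is also where ShuffleVideos() runs.
std::size_t advance_cursor(std::size_t lines_id, std::size_t num_lines) {
    ++lines_id;
    if (lines_id >= num_lines) {
        lines_id = 0;  // reached the end; restart from the first sample
    }
    return lines_id;
}
```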

      vector<std::pair<std::string, int> > lines_;
      vector<int> lines_duration_;
      int lines_id_;
      string name_pattern_;

    lines_ is a list of (video path, label) pairs recording the sample-to-label mapping from the source (.txt) file; lines_.size() is the number of samples. lines_duration_ holds the frame count of each video, lines_id_ is the current sample index, and name_pattern_ is ... exactly what it sounds like.


    3. VideoDataLayer.cpp

    namespace caffe {

    template <typename Dtype>
    VideoDataLayer<Dtype>::~VideoDataLayer() {
      this->JoinPrefetchThread();
    }

    The destructor joins the prefetch thread, i.e. it waits for the data-prefetching thread to finish before the layer is destroyed.

    template <typename Dtype>
    void VideoDataLayer<Dtype>::DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                               const vector<Blob<Dtype>*>& top) {
      const int new_height = this->layer_param_.video_data_param().new_height();
      const int new_width = this->layer_param_.video_data_param().new_width();
      const int new_length = this->layer_param_.video_data_param().new_length();
      const int num_segments = this->layer_param_.video_data_param().num_segments();
      const string& source = this->layer_param_.video_data_param().source();
      LOG(INFO) << "Opening file: " << source;
      std::ifstream infile(source.c_str());
      string filename;
      int label;
      int length;
      while (infile >> filename >> length >> label) {
        lines_.push_back(std::make_pair(filename, label));
        lines_duration_.push_back(length);
      }

    new_length is the number of stacked optical-flow frames, corresponding to L in the paper. num_segments defaults to 1, i.e. one sampled window per video. std::ifstream is a C++ input file stream; infile is the stream object, constructed from the file name. Each record of the source file is parsed as "<filename> <length> <label>": the (filename, label) pair goes into lines_ and the frame count into lines_duration_.
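    The parsing loop can be reproduced in isolation. The sketch below mirrors `while (infile >> filename >> length >> label)` but reads from any std::istream, so it can be exercised with an in-memory string instead of a file; the video names in the example are made up.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Sketch of the source-list parsing in DataLayerSetUp: each record is
// "<frame_dir> <num_frames> <label>". The (path, label) pair goes into
// `lines` and the frame count into `lines_duration`, kept index-aligned.
void parse_video_list(std::istream& in,
                      std::vector<std::pair<std::string, int> >& lines,
                      std::vector<int>& lines_duration) {
    std::string filename;
    int length, label;
    while (in >> filename >> length >> label) {
        lines.push_back(std::make_pair(filename, label));
        lines_duration.push_back(length);
    }
}
```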

      if (this->layer_param_.video_data_param().shuffle()) {
        const unsigned int prefetch_rng_seed = caffe_rng_rand();
        prefetch_rng_1_.reset(new Caffe::RNG(prefetch_rng_seed));
        prefetch_rng_2_.reset(new Caffe::RNG(prefetch_rng_seed));
        ShuffleVideos();
      }
      LOG(INFO) << "A total of " << lines_.size() << " videos.";

    As expected, the RNG-related code here is for shuffling; note that both RNGs are seeded with the same seed, so they generate identical sequences.

      if (this->layer_param_.video_data_param().name_pattern() == "") {
        if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_RGB) {
          name_pattern_ = "image_%04d.jpg";
        } else if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW) {
          name_pattern_ = "flow_%c_%04d.jpg";
        }
      } else {
        name_pattern_ = this->layer_param_.video_data_param().name_pattern();
      }

    This part determines the file-name pattern: frame images default to image_%04d.jpg, and optical-flow images to flow_%c_%04d.jpg, where %c is filled with 'x' or 'y' and %04d with the zero-padded frame index.
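    The pattern is a printf-style format string, consumed later by sprintf when building each frame's path. Assuming a flow pattern of the form "flow_%c_%04d.jpg" (the '%' escapes are easily lost when code is pasted into a web page), the formatting works like this:

```cpp
#include <cstdio>
#include <string>

// Sketch of how the layer turns (name_pattern, axis, frame index) into a
// file name: '%c' takes the flow axis ('x' or 'y'), '%04d' the zero-padded
// frame number. snprintf is used here instead of sprintf for safety.
std::string format_frame_name(const char* pattern, char axis, int frame_id) {
    char tmp[64];
    std::snprintf(tmp, sizeof(tmp), pattern, axis, frame_id);
    return std::string(tmp);
}
```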

      Datum datum;
      const unsigned int frame_prefetch_rng_seed = caffe_rng_rand();
      frame_prefetch_rng_.reset(new Caffe::RNG(frame_prefetch_rng_seed));
      int average_duration = (int) lines_duration_[lines_id_] / num_segments;
      vector<int> offsets;
      for (int i = 0; i < num_segments; ++i) {
        caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
        int offset = (*frame_rng)() % (average_duration - new_length + 1);
        offsets.push_back(offset + i * average_duration);
      }

    offsets is a vector that took me a while to figure out; it's defined with no explanation at all. The video's frames are split into num_segments chunks of average_duration frames each, and for each chunk the loop picks a random window start offset in [0, average_duration - new_length] and stores offset + i * average_duration, i.e. an absolute start position inside the i-th chunk. Since we use num_segments = 1, average_duration is just the length read earlier (the video's frame count), the loop runs exactly once, and one window is sampled from the whole video. Note that here in DataLayerSetUp only a single sample is read; its purpose is to infer the blob shapes.
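    The offset arithmetic can be checked deterministically by passing in the raw random draw instead of calling an RNG. This sketch simplifies the real loop in one way: it reuses a single draw for all segments, whereas the layer draws a fresh random number per segment.

```cpp
#include <vector>

// Sketch of the segment-offset computation: split `duration` frames into
// `num_segments` chunks of average_duration frames, and pick a window start
// in [0, average_duration - new_length] inside each chunk. `raw` stands in
// for what (*frame_rng)() would return.
std::vector<int> segment_offsets(int duration, int num_segments,
                                 int new_length, unsigned raw) {
    int average_duration = duration / num_segments;
    std::vector<int> offsets;
    for (int i = 0; i < num_segments; ++i) {
        int offset = static_cast<int>(raw % (average_duration - new_length + 1));
        offsets.push_back(offset + i * average_duration);  // absolute position
    }
    return offsets;
}
```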

      if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW)
        CHECK(ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                     offsets, new_height, new_width, new_length,
                                     &datum, name_pattern_.c_str()));
      else
        CHECK(ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                    offsets, new_height, new_width, new_length,
                                    &datum, true, name_pattern_.c_str()));

    These are the two key reading functions. Open src/caffe/util/io.cpp; first ReadSegmentRGBToDatum:

    bool ReadSegmentRGBToDatum(const string& filename, const int label,
        const vector<int> offsets, const int height, const int width,
        const int length, Datum* datum, bool is_color,
        const char* name_pattern) {
      cv::Mat cv_img;
      string* datum_string;
      char tmp[30];
      int cv_read_flag = (is_color ? CV_LOAD_IMAGE_COLOR : CV_LOAD_IMAGE_GRAYSCALE);
      for (int i = 0; i < offsets.size(); ++i) {
        int offset = offsets[i];
        for (int file_id = 1; file_id < length + 1; ++file_id) {
          sprintf(tmp, name_pattern, int(file_id + offset));
          string filename_t = filename + "/" + tmp;
          cv::Mat cv_img_origin = cv::imread(filename_t, cv_read_flag);
          if (!cv_img_origin.data) {
            LOG(ERROR) << "Could not load file " << filename;
            return false;
          }
          if (height > 0 && width > 0) {
            cv::resize(cv_img_origin, cv_img, cv::Size(width, height));
          } else {
            cv_img = cv_img_origin;
          }
          int num_channels = (is_color ? 3 : 1);
          if (file_id == 1 && i == 0) {
            datum->set_channels(num_channels * length * offsets.size());
            datum->set_height(cv_img.rows);
            datum->set_width(cv_img.cols);
            datum->set_label(label);
            datum->clear_data();
            datum->clear_float_data();
            datum_string = datum->mutable_data();
          }
          if (is_color) {
            for (int c = 0; c < num_channels; ++c) {
              for (int h = 0; h < cv_img.rows; ++h) {
                for (int w = 0; w < cv_img.cols; ++w) {
                  datum_string->push_back(
                      static_cast<char>(cv_img.at<cv::Vec3b>(h, w)[c]));
                }
              }
            }
          } else {
            // Faster than repeatedly testing is_color for each pixel w/i loop
            for (int h = 0; h < cv_img.rows; ++h) {
              for (int w = 0; w < cv_img.cols; ++w) {
                datum_string->push_back(
                    static_cast<char>(cv_img.at<uchar>(h, w)));
              }
            }
          }
        }
      }
      return true;
    }

    Walk through it slowly: for each sampled offset, the function reads length consecutive frames (file_id from 1 to length), resizing them if new_height/new_width are set. On the very first frame it fixes the Datum shape: channels = 3 (or 1 for grayscale) × length × number of offsets. Each frame's pixels are then appended to the Datum's data string channel by channel, so every sampled frame ends up stacked along the channel axis.

    bool ReadSegmentFlowToDatum(const string& filename, const int label,
        const vector<int> offsets, const int height, const int width,
        const int length, Datum* datum, const char* name_pattern) {
      cv::Mat cv_img_x, cv_img_y;
      string* datum_string;
      char tmp[30];
      for (int i = 0; i < offsets.size(); ++i) {
        int offset = offsets[i];
        for (int file_id = 1; file_id < length + 1; ++file_id) {
          sprintf(tmp, name_pattern, 'x', int(file_id + offset));
          string filename_x = filename + "/" + tmp;
          cv::Mat cv_img_origin_x = cv::imread(filename_x, CV_LOAD_IMAGE_GRAYSCALE);
          sprintf(tmp, name_pattern, 'y', int(file_id + offset));
          string filename_y = filename + "/" + tmp;
          cv::Mat cv_img_origin_y = cv::imread(filename_y, CV_LOAD_IMAGE_GRAYSCALE);
          if (!cv_img_origin_x.data || !cv_img_origin_y.data) {
            LOG(ERROR) << "Could not load file " << filename_x << " or " << filename_y;
            return false;
          }
          if (height > 0 && width > 0) {
            cv::resize(cv_img_origin_x, cv_img_x, cv::Size(width, height));
            cv::resize(cv_img_origin_y, cv_img_y, cv::Size(width, height));
          } else {
            cv_img_x = cv_img_origin_x;
            cv_img_y = cv_img_origin_y;
          }
          if (file_id == 1 && i == 0) {
            int num_channels = 2;
            datum->set_channels(num_channels * length * offsets.size());
            datum->set_height(cv_img_x.rows);
            datum->set_width(cv_img_x.cols);
            datum->set_label(label);
            datum->clear_data();
            datum->clear_float_data();
            datum_string = datum->mutable_data();
          }
          for (int h = 0; h < cv_img_x.rows; ++h) {
            for (int w = 0; w < cv_img_x.cols; ++w) {
              datum_string->push_back(static_cast<char>(cv_img_x.at<uchar>(h, w)));
            }
          }
          for (int h = 0; h < cv_img_y.rows; ++h) {
            for (int w = 0; w < cv_img_y.cols; ++w) {
              datum_string->push_back(static_cast<char>(cv_img_y.at<uchar>(h, w)));
            }
          }
        }
      }
      return true;
    }

    Same idea for flow: for each offset it reads length pairs of x/y flow images (as grayscale), so the channel count is 2 × length × number of offsets, and the x plane followed by the y plane of each frame pair is appended in turn.
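    To make the channel bookkeeping of the two readers concrete, here is a small sketch of the shape arithmetic they perform (these helper names are mine, not Caffe's):

```cpp
// Channel count produced by ReadSegmentRGBToDatum: 3 channels per color
// frame (1 if grayscale), stacked over `length` frames per segment and
// `num_segments` segments (offsets.size() in the real code).
int rgb_datum_channels(bool is_color, int length, int num_segments) {
    int num_channels = is_color ? 3 : 1;
    return num_channels * length * num_segments;
}

// Channel count produced by ReadSegmentFlowToDatum: each frame contributes
// an x plane and a y plane, hence the fixed factor of 2.
int flow_datum_channels(int length, int num_segments) {
    return 2 * length * num_segments;
}

// Total number of bytes in the Datum's data string: one byte per pixel per
// channel, appended in channel-major (c, h, w) order.
int datum_size(int channels, int height, int width) {
    return channels * height * width;
}
```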

      const int crop_size = this->layer_param_.transform_param().crop_size();
      const int batch_size = this->layer_param_.video_data_param().batch_size();
      if (crop_size > 0) {
        top[0]->Reshape(batch_size, datum.channels(), crop_size, crop_size);
        this->prefetch_data_.Reshape(batch_size, datum.channels(), crop_size, crop_size);
      } else {
        top[0]->Reshape(batch_size, datum.channels(), datum.height(), datum.width());
        this->prefetch_data_.Reshape(batch_size, datum.channels(), datum.height(), datum.width());
      }
      LOG(INFO) << "output data size: " << top[0]->num() << ","
                << top[0]->channels() << "," << top[0]->height() << ","
                << top[0]->width();
      top[1]->Reshape(batch_size, 1, 1, 1);
      this->prefetch_label_.Reshape(batch_size, 1, 1, 1);
      vector<int> top_shape = this->data_transformer_->InferBlobShape(datum);
      this->transformed_data_.Reshape(top_shape);
    }

    template <typename Dtype>
    void VideoDataLayer<Dtype>::InternalThreadEntry() {
      Datum datum;
      CHECK(this->prefetch_data_.count());
      Dtype* top_data = this->prefetch_data_.mutable_cpu_data();
      Dtype* top_label = this->prefetch_label_.mutable_cpu_data();
      VideoDataParameter video_data_param = this->layer_param_.video_data_param();
      const int batch_size = video_data_param.batch_size();
      const int new_height = video_data_param.new_height();
      const int new_width = video_data_param.new_width();
      const int new_length = video_data_param.new_length();
      const int num_segments = video_data_param.num_segments();
      const int lines_size = lines_.size();
      for (int item_id = 0; item_id < batch_size; ++item_id) {
        CHECK_GT(lines_size, lines_id_);
        vector<int> offsets;
        int average_duration = (int) lines_duration_[lines_id_] / num_segments;
        for (int i = 0; i < num_segments; ++i) {
          if (this->phase_ == TRAIN) {
            if (average_duration >= new_length) {
              caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
              int offset = (*frame_rng)() % (average_duration - new_length + 1);
              offsets.push_back(offset + i * average_duration);
            } else {
              offsets.push_back(1);
            }
          } else {
            if (average_duration >= new_length)
              offsets.push_back(int((average_duration - new_length + 1) / 2 + i * average_duration));
            else
              offsets.push_back(1);
          }
        }
        if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW) {
          if (!ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                      offsets, new_height, new_width, new_length,
                                      &datum, name_pattern_.c_str())) {
            continue;
          }
        } else {
          if (!ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                     offsets, new_height, new_width, new_length,
                                     &datum, true, name_pattern_.c_str())) {
            continue;
          }
        }
        int offset1 = this->prefetch_data_.offset(item_id);
        this->transformed_data_.set_cpu_data(top_data + offset1);
        this->data_transformer_->Transform(datum, &(this->transformed_data_));
        top_label[item_id] = lines_[lines_id_].second;
        // next iteration
        lines_id_++;
        if (lines_id_ >= lines_size) {
          DLOG(INFO) << "Restarting data prefetching from start.";
          lines_id_ = 0;
          if (this->layer_param_.video_data_param().shuffle()) {
            ShuffleVideos();
          }
        }
      }
    }

    INSTANTIATE_CLASS(VideoDataLayer);
    REGISTER_LAYER_CLASS(VideoData);

    }  // namespace caffe

    This is basically the same logic as before, run once per batch item: InternalThreadEntry picks offsets for each sample (random within each segment at TRAIN time, centered in each segment at TEST time, and falling back to frame 1 when the segment is shorter than new_length), reads the frames into a Datum, runs the data transformer into the prefetch buffer, writes the label, and advances lines_id_ with wrap-around and optional reshuffling.
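    The one new piece relative to DataLayerSetUp is the deterministic test-time offset. Its arithmetic can be isolated like this:

```cpp
// Sketch of the TEST-phase branch in InternalThreadEntry: instead of a random
// start, the new_length-frame window is centered inside each segment of
// average_duration frames. If the segment is shorter than the window, the
// code falls back to offset 1, as in the original.
int test_phase_offset(int average_duration, int new_length, int segment_idx) {
    if (average_duration >= new_length) {
        return (average_duration - new_length + 1) / 2
               + segment_idx * average_duration;
    }
    return 1;  // segment shorter than the window
}
```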



    Summary: I'm honestly not that good at this. Lately I feel like I've just been coasting, with long stretches of no progress, which gives me a headache. I'm writing this down so I can look back at it later, and I hope my small bit of understanding can help someone else. That's it; I've rambled enough.

    When reprinting, please credit the original: https://ju.6miu.com/read-676372.html
