Caffe source code: VideoDataLayer

    xiaoxiao · 2021-08-15

    Analysis

    I'm writing this article to understand how Caffe's data layer is designed when the input is not still images but video.

    First intuition

    For a Caffe input layer that reads video frame by frame rather than single images, my intuition was that the design shouldn't differ much, since a video is in essence a sequence of frames. But I genuinely didn't know how to handle it. The most brute-force approach might be to save each frame as a single-channel image and then merge many single-channel images into one multi-channel image, so that a video effectively becomes one multi-channel image. The problem is that this isn't as simple as it sounds; and besides, if the network takes every frame as input, the computational cost is staggering. That is indeed a real issue, but honestly the bigger reason is: I couldn't write it myself.

    So let's lean on prior work and see how others implemented it.

    1. Open caffe.proto and look at the VideoDataLayer's parameters:

    message VideoDataParameter {
      // Specify the data source.
      optional string source = 1;
      // Specify the batch size.
      optional uint32 batch_size = 4;
      // The rand_skip variable is for the data layer to skip a few data points
      // to avoid all asynchronous sgd clients to start at the same point. The skip
      // point would be set as rand_skip * rand(0,1). Note that rand_skip should not
      // be larger than the number of keys in the leveldb.
      optional uint32 rand_skip = 7 [default = 0];
      // Whether or not ImageLayer should shuffle the list of files at every epoch.
      optional bool shuffle = 8 [default = false];
      // It will also resize images if new_height or new_width are not zero.
      optional uint32 new_height = 9 [default = 0];
      optional uint32 new_width = 10 [default = 0];
      optional uint32 new_length = 11 [default = 1];
      optional uint32 num_segments = 12 [default = 1];
      // DEPRECATED. See TransformationParameter. For data pre-processing, we can do
      // simple scaling and subtracting the data mean, if provided. Note that the
      // mean subtraction is always carried out before scaling.
      optional float scale = 2 [default = 1];
      optional string mean_file = 3;
      // DEPRECATED. See TransformationParameter. Specify if we would like to randomly
      // crop an image.
      optional uint32 crop_size = 5 [default = 0];
      // DEPRECATED. See TransformationParameter. Specify if we want to randomly mirror
      // data.
      optional bool mirror = 6 [default = false];
      enum Modality {
        RGB = 0;
        FLOW = 1;
      }
      optional Modality modality = 13 [default = FLOW];
      // the name pattern for frame images,
      // for RGB modality it defaults to "img_%d.jpg", for FLOW "flow_x_%d" and "flow_y_%d"
      optional string name_pattern = 14;
      // The type of input
      optional bool encoded = 15 [default = false];
    }

    Only the variables without comments need explanation. new_length is, for optical-flow input, the number of stacked flow frames that make up one input sample. num_segments, also used for flow, defaults to 1; it is the number of segments a video sample is split into. In practice we also use num_segments = 1, i.e. one video is one sample. name_pattern is the file-name format of the saved flow and RGB frames: the computed optical-flow images are saved as "flow_x_%d.jpg" and "flow_y_%d.jpg", while the RGB frames are saved as "img_%d.jpg". The data layer takes a different reading path for img and flow, as we'll see in the source, hence the two Modality options, RGB and FLOW.
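    To make the parameters above concrete, a VideoData layer in a train prototxt might look roughly like the sketch below. The source path and the specific values are made up for illustration; only the parameter names come from the message definition above.

```protobuf
layer {
  name: "data"
  type: "VideoData"
  top: "data"
  top: "label"
  video_data_param {
    source: "train_list.txt"  # hypothetical list file: "<frame_dir> <num_frames> <label>" per line
    batch_size: 32
    new_length: 5             # stack 5 consecutive flow frames per segment
    num_segments: 1           # one sampled window per video
    modality: FLOW
    shuffle: true
    new_height: 256
    new_width: 340
  }
}
```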


    2. Let's see how the VideoDataLayer's functionality is declared. Open VideoDataLayer.hpp:

    template <typename Dtype>
    class VideoDataLayer : public BasePrefetchingDataLayer<Dtype> {
     public:
      explicit VideoDataLayer(const LayerParameter& param)
          : BasePrefetchingDataLayer<Dtype>(param) {}
      virtual ~VideoDataLayer();
      virtual void DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                  const vector<Blob<Dtype>*>& top);
      virtual inline const char* type() const { return "VideoData"; }
      virtual inline int ExactNumBottomBlobs() const { return 0; }
      virtual inline int ExactNumTopBlobs() const { return 2; }

    Like ImageDataLayer, VideoDataLayer inherits from BasePrefetchingDataLayer. The most important part is the layer-specific behavior in DataLayerSetUp, analyzed below with the implementation. The header then declares several important members: the prefetch RNGs and the frame RNG. Random number generators are an important concept in Caffe, used mainly for weight initialization, and also needed when shuffling.

     protected:
      shared_ptr<Caffe::RNG> prefetch_rng_;
      shared_ptr<Caffe::RNG> prefetch_rng_2_;
      shared_ptr<Caffe::RNG> prefetch_rng_1_;
      shared_ptr<Caffe::RNG> frame_prefetch_rng_;
      virtual void ShuffleVideos();
      virtual void InternalThreadEntry();

    prefetch_rng_1_ and prefetch_rng_2_ are two prefetch RNGs that, as we'll see in DataLayerSetUp, are seeded with the same seed so that they produce identical random sequences; this is what keeps the two shuffled containers aligned with each other after ShuffleVideos. frame_prefetch_rng_ is the RNG for sampling frame positions within a video. The thread entry point also differs from ImageDataLayer's; more on that in the implementation.

    #ifdef USE_MPI
      inline virtual void advance_cursor() {
        lines_id_++;
        if (lines_id_ >= lines_.size()) {
          // We have reached the end. Restart from the first.
          DLOG(INFO) << "Restarting data prefetching from start.";
          lines_id_ = 0;
          if (this->layer_param_.video_data_param().shuffle()) {
            ShuffleVideos();
          }
        }
      }
    #endif

    I'm not entirely sure about that block (it's only compiled under MPI), but it's essentially a cursor: advance to the next sample, and when the end of the list is reached, wrap back to the start and, if shuffle is enabled, reshuffle, so each epoch sees a different order.
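    The cursor logic above can be sketched without any Caffe dependencies. This is a minimal illustration of the advance-and-wrap arithmetic only; the shuffling and logging are omitted.

```cpp
#include <cstddef>

// Sketch of advance_cursor: move to the next sample index, and wrap back to
// index 0 once the list of num_lines samples is exhausted (once per epoch).
// In the real layer, the wrap point is also where ShuffleVideos() runs.
std::size_t advance_cursor(std::size_t lines_id, std::size_t num_lines) {
    ++lines_id;
    if (lines_id >= num_lines) {
        lines_id = 0;  // reached the end; restart from the first sample
    }
    return lines_id;
}
```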

      vector<std::pair<std::string, int> > lines_;
      vector<int> lines_duration_;
      int lines_id_;
      string name_pattern_;

    lines_ is a list of (video path, label) pairs recording the sample-to-label mapping from the source (.txt) file; lines_.size() is the number of samples. lines_duration_ holds the frame count of each video, lines_id_ is the current sample index, and name_pattern_ is ... exactly what it sounds like.


    3. VideoDataLayer.cpp

    namespace caffe {

    template <typename Dtype>
    VideoDataLayer<Dtype>::~VideoDataLayer() {
      this->JoinPrefetchThread();
    }

    The destructor joins the prefetch thread, i.e. it waits for the data-prefetching thread to finish before the layer is destroyed.

    template <typename Dtype>
    void VideoDataLayer<Dtype>::DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                               const vector<Blob<Dtype>*>& top) {
      const int new_height = this->layer_param_.video_data_param().new_height();
      const int new_width = this->layer_param_.video_data_param().new_width();
      const int new_length = this->layer_param_.video_data_param().new_length();
      const int num_segments = this->layer_param_.video_data_param().num_segments();
      const string& source = this->layer_param_.video_data_param().source();
      LOG(INFO) << "Opening file: " << source;
      std::ifstream infile(source.c_str());
      string filename;
      int label;
      int length;
      while (infile >> filename >> length >> label) {
        lines_.push_back(std::make_pair(filename, label));
        lines_duration_.push_back(length);
      }

    new_length is the number of stacked optical-flow frames, corresponding to L in the paper. num_segments defaults to 1, i.e. one sampled window per video. std::ifstream is a C++ input file stream; infile is the stream object, constructed from the file name. Each record of the source file is parsed as "<filename> <length> <label>": the (filename, label) pair goes into lines_ and the frame count into lines_duration_.
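    The parsing loop can be reproduced in isolation. The sketch below mirrors `while (infile >> filename >> length >> label)` but reads from any std::istream, so it can be exercised with an in-memory string instead of a file; the video names in the example are made up.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Sketch of the source-list parsing in DataLayerSetUp: each record is
// "<frame_dir> <num_frames> <label>". The (path, label) pair goes into
// `lines` and the frame count into `lines_duration`, kept index-aligned.
void parse_video_list(std::istream& in,
                      std::vector<std::pair<std::string, int> >& lines,
                      std::vector<int>& lines_duration) {
    std::string filename;
    int length, label;
    while (in >> filename >> length >> label) {
        lines.push_back(std::make_pair(filename, label));
        lines_duration.push_back(length);
    }
}
```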

      if (this->layer_param_.video_data_param().shuffle()) {
        const unsigned int prefetch_rng_seed = caffe_rng_rand();
        prefetch_rng_1_.reset(new Caffe::RNG(prefetch_rng_seed));
        prefetch_rng_2_.reset(new Caffe::RNG(prefetch_rng_seed));
        ShuffleVideos();
      }
      LOG(INFO) << "A total of " << lines_.size() << " videos.";

    As expected, the RNG-related code here is for shuffling; note that both RNGs are seeded with the same seed, so they generate identical sequences.

      if (this->layer_param_.video_data_param().name_pattern() == "") {
        if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_RGB) {
          name_pattern_ = "image_%04d.jpg";
        } else if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW) {
          name_pattern_ = "flow_%c_%04d.jpg";
        }
      } else {
        name_pattern_ = this->layer_param_.video_data_param().name_pattern();
      }

    This part determines the file-name pattern: frame images default to image_%04d.jpg, and optical-flow images to flow_%c_%04d.jpg, where %c is filled with 'x' or 'y' and %04d with the zero-padded frame index.
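    The pattern is a printf-style format string, consumed later by sprintf when building each frame's path. Assuming a flow pattern of the form "flow_%c_%04d.jpg" (the '%' escapes are easily lost when code is pasted into a web page), the formatting works like this:

```cpp
#include <cstdio>
#include <string>

// Sketch of how the layer turns (name_pattern, axis, frame index) into a
// file name: '%c' takes the flow axis ('x' or 'y'), '%04d' the zero-padded
// frame number. snprintf is used here instead of sprintf for safety.
std::string format_frame_name(const char* pattern, char axis, int frame_id) {
    char tmp[64];
    std::snprintf(tmp, sizeof(tmp), pattern, axis, frame_id);
    return std::string(tmp);
}
```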

      Datum datum;
      const unsigned int frame_prefetch_rng_seed = caffe_rng_rand();
      frame_prefetch_rng_.reset(new Caffe::RNG(frame_prefetch_rng_seed));
      int average_duration = (int) lines_duration_[lines_id_] / num_segments;
      vector<int> offsets;
      for (int i = 0; i < num_segments; ++i) {
        caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
        int offset = (*frame_rng)() % (average_duration - new_length + 1);
        offsets.push_back(offset + i * average_duration);
      }

    offsets is a vector that took me a while to figure out; it's defined with no explanation at all. The video's frames are split into num_segments chunks of average_duration frames each, and for each chunk the loop picks a random window start offset in [0, average_duration - new_length] and stores offset + i * average_duration, i.e. an absolute start position inside the i-th chunk. Since we use num_segments = 1, average_duration is just the length read earlier (the video's frame count), the loop runs exactly once, and one window is sampled from the whole video. Note that here in DataLayerSetUp only a single sample is read; its purpose is to infer the blob shapes.
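    The offset arithmetic can be checked deterministically by passing in the raw random draw instead of calling an RNG. This sketch simplifies the real loop in one way: it reuses a single draw for all segments, whereas the layer draws a fresh random number per segment.

```cpp
#include <vector>

// Sketch of the segment-offset computation: split `duration` frames into
// `num_segments` chunks of average_duration frames, and pick a window start
// in [0, average_duration - new_length] inside each chunk. `raw` stands in
// for what (*frame_rng)() would return.
std::vector<int> segment_offsets(int duration, int num_segments,
                                 int new_length, unsigned raw) {
    int average_duration = duration / num_segments;
    std::vector<int> offsets;
    for (int i = 0; i < num_segments; ++i) {
        int offset = static_cast<int>(raw % (average_duration - new_length + 1));
        offsets.push_back(offset + i * average_duration);  // absolute position
    }
    return offsets;
}
```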

      if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW)
        CHECK(ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                     offsets, new_height, new_width, new_length,
                                     &datum, name_pattern_.c_str()));
      else
        CHECK(ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                    offsets, new_height, new_width, new_length,
                                    &datum, true, name_pattern_.c_str()));

    These are the two key reading functions. Open src/caffe/util/io.cpp; first ReadSegmentRGBToDatum:

    bool ReadSegmentRGBToDatum(const string& filename, const int label,
        const vector<int> offsets, const int height, const int width,
        const int length, Datum* datum, bool is_color,
        const char* name_pattern) {
      cv::Mat cv_img;
      string* datum_string;
      char tmp[30];
      int cv_read_flag = (is_color ? CV_LOAD_IMAGE_COLOR : CV_LOAD_IMAGE_GRAYSCALE);
      for (int i = 0; i < offsets.size(); ++i) {
        int offset = offsets[i];
        for (int file_id = 1; file_id < length + 1; ++file_id) {
          sprintf(tmp, name_pattern, int(file_id + offset));
          string filename_t = filename + "/" + tmp;
          cv::Mat cv_img_origin = cv::imread(filename_t, cv_read_flag);
          if (!cv_img_origin.data) {
            LOG(ERROR) << "Could not load file " << filename;
            return false;
          }
          if (height > 0 && width > 0) {
            cv::resize(cv_img_origin, cv_img, cv::Size(width, height));
          } else {
            cv_img = cv_img_origin;
          }
          int num_channels = (is_color ? 3 : 1);
          if (file_id == 1 && i == 0) {
            datum->set_channels(num_channels * length * offsets.size());
            datum->set_height(cv_img.rows);
            datum->set_width(cv_img.cols);
            datum->set_label(label);
            datum->clear_data();
            datum->clear_float_data();
            datum_string = datum->mutable_data();
          }
          if (is_color) {
            for (int c = 0; c < num_channels; ++c) {
              for (int h = 0; h < cv_img.rows; ++h) {
                for (int w = 0; w < cv_img.cols; ++w) {
                  datum_string->push_back(
                      static_cast<char>(cv_img.at<cv::Vec3b>(h, w)[c]));
                }
              }
            }
          } else {
            // Faster than repeatedly testing is_color for each pixel w/i loop
            for (int h = 0; h < cv_img.rows; ++h) {
              for (int w = 0; w < cv_img.cols; ++w) {
                datum_string->push_back(
                    static_cast<char>(cv_img.at<uchar>(h, w)));
              }
            }
          }
        }
      }
      return true;
    }

    Walk through it slowly: for each sampled offset, the function reads length consecutive frames (file_id from 1 to length), resizing them if new_height/new_width are set. On the very first frame it fixes the Datum shape: channels = 3 (or 1 for grayscale) × length × number of offsets. Each frame's pixels are then appended to the Datum's data string channel by channel, so every sampled frame ends up stacked along the channel axis.

    bool ReadSegmentFlowToDatum(const string& filename, const int label,
        const vector<int> offsets, const int height, const int width,
        const int length, Datum* datum, const char* name_pattern) {
      cv::Mat cv_img_x, cv_img_y;
      string* datum_string;
      char tmp[30];
      for (int i = 0; i < offsets.size(); ++i) {
        int offset = offsets[i];
        for (int file_id = 1; file_id < length + 1; ++file_id) {
          sprintf(tmp, name_pattern, 'x', int(file_id + offset));
          string filename_x = filename + "/" + tmp;
          cv::Mat cv_img_origin_x = cv::imread(filename_x, CV_LOAD_IMAGE_GRAYSCALE);
          sprintf(tmp, name_pattern, 'y', int(file_id + offset));
          string filename_y = filename + "/" + tmp;
          cv::Mat cv_img_origin_y = cv::imread(filename_y, CV_LOAD_IMAGE_GRAYSCALE);
          if (!cv_img_origin_x.data || !cv_img_origin_y.data) {
            LOG(ERROR) << "Could not load file " << filename_x << " or " << filename_y;
            return false;
          }
          if (height > 0 && width > 0) {
            cv::resize(cv_img_origin_x, cv_img_x, cv::Size(width, height));
            cv::resize(cv_img_origin_y, cv_img_y, cv::Size(width, height));
          } else {
            cv_img_x = cv_img_origin_x;
            cv_img_y = cv_img_origin_y;
          }
          if (file_id == 1 && i == 0) {
            int num_channels = 2;
            datum->set_channels(num_channels * length * offsets.size());
            datum->set_height(cv_img_x.rows);
            datum->set_width(cv_img_x.cols);
            datum->set_label(label);
            datum->clear_data();
            datum->clear_float_data();
            datum_string = datum->mutable_data();
          }
          for (int h = 0; h < cv_img_x.rows; ++h) {
            for (int w = 0; w < cv_img_x.cols; ++w) {
              datum_string->push_back(static_cast<char>(cv_img_x.at<uchar>(h, w)));
            }
          }
          for (int h = 0; h < cv_img_y.rows; ++h) {
            for (int w = 0; w < cv_img_y.cols; ++w) {
              datum_string->push_back(static_cast<char>(cv_img_y.at<uchar>(h, w)));
            }
          }
        }
      }
      return true;
    }

    Same idea for flow: for each offset it reads length pairs of x/y flow images (as grayscale), so the channel count is 2 × length × number of offsets, and the x plane followed by the y plane of each frame pair is appended in turn.
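    To make the channel bookkeeping of the two readers concrete, here is a small sketch of the shape arithmetic they perform (these helper names are mine, not Caffe's):

```cpp
// Channel count produced by ReadSegmentRGBToDatum: 3 channels per color
// frame (1 if grayscale), stacked over `length` frames per segment and
// `num_segments` segments (offsets.size() in the real code).
int rgb_datum_channels(bool is_color, int length, int num_segments) {
    int num_channels = is_color ? 3 : 1;
    return num_channels * length * num_segments;
}

// Channel count produced by ReadSegmentFlowToDatum: each frame contributes
// an x plane and a y plane, hence the fixed factor of 2.
int flow_datum_channels(int length, int num_segments) {
    return 2 * length * num_segments;
}

// Total number of bytes in the Datum's data string: one byte per pixel per
// channel, appended in channel-major (c, h, w) order.
int datum_size(int channels, int height, int width) {
    return channels * height * width;
}
```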

      const int crop_size = this->layer_param_.transform_param().crop_size();
      const int batch_size = this->layer_param_.video_data_param().batch_size();
      if (crop_size > 0) {
        top[0]->Reshape(batch_size, datum.channels(), crop_size, crop_size);
        this->prefetch_data_.Reshape(batch_size, datum.channels(), crop_size, crop_size);
      } else {
        top[0]->Reshape(batch_size, datum.channels(), datum.height(), datum.width());
        this->prefetch_data_.Reshape(batch_size, datum.channels(), datum.height(), datum.width());
      }
      LOG(INFO) << "output data size: " << top[0]->num() << ","
                << top[0]->channels() << "," << top[0]->height() << ","
                << top[0]->width();
      top[1]->Reshape(batch_size, 1, 1, 1);
      this->prefetch_label_.Reshape(batch_size, 1, 1, 1);
      vector<int> top_shape = this->data_transformer_->InferBlobShape(datum);
      this->transformed_data_.Reshape(top_shape);
    }

    template <typename Dtype>
    void VideoDataLayer<Dtype>::InternalThreadEntry() {
      Datum datum;
      CHECK(this->prefetch_data_.count());
      Dtype* top_data = this->prefetch_data_.mutable_cpu_data();
      Dtype* top_label = this->prefetch_label_.mutable_cpu_data();
      VideoDataParameter video_data_param = this->layer_param_.video_data_param();
      const int batch_size = video_data_param.batch_size();
      const int new_height = video_data_param.new_height();
      const int new_width = video_data_param.new_width();
      const int new_length = video_data_param.new_length();
      const int num_segments = video_data_param.num_segments();
      const int lines_size = lines_.size();
      for (int item_id = 0; item_id < batch_size; ++item_id) {
        CHECK_GT(lines_size, lines_id_);
        vector<int> offsets;
        int average_duration = (int) lines_duration_[lines_id_] / num_segments;
        for (int i = 0; i < num_segments; ++i) {
          if (this->phase_ == TRAIN) {
            if (average_duration >= new_length) {
              caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
              int offset = (*frame_rng)() % (average_duration - new_length + 1);
              offsets.push_back(offset + i * average_duration);
            } else {
              offsets.push_back(1);
            }
          } else {
            if (average_duration >= new_length)
              offsets.push_back(int((average_duration - new_length + 1) / 2 + i * average_duration));
            else
              offsets.push_back(1);
          }
        }
        if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW) {
          if (!ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                      offsets, new_height, new_width, new_length,
                                      &datum, name_pattern_.c_str())) {
            continue;
          }
        } else {
          if (!ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
                                     offsets, new_height, new_width, new_length,
                                     &datum, true, name_pattern_.c_str())) {
            continue;
          }
        }
        int offset1 = this->prefetch_data_.offset(item_id);
        this->transformed_data_.set_cpu_data(top_data + offset1);
        this->data_transformer_->Transform(datum, &(this->transformed_data_));
        top_label[item_id] = lines_[lines_id_].second;
        // next iteration
        lines_id_++;
        if (lines_id_ >= lines_size) {
          DLOG(INFO) << "Restarting data prefetching from start.";
          lines_id_ = 0;
          if (this->layer_param_.video_data_param().shuffle()) {
            ShuffleVideos();
          }
        }
      }
    }

    INSTANTIATE_CLASS(VideoDataLayer);
    REGISTER_LAYER_CLASS(VideoData);

    }  // namespace caffe

    This is basically the same logic as before, run once per batch item: InternalThreadEntry picks offsets for each sample (random within each segment at TRAIN time, centered in each segment at TEST time, and falling back to frame 1 when the segment is shorter than new_length), reads the frames into a Datum, runs the data transformer into the prefetch buffer, writes the label, and advances lines_id_ with wrap-around and optional reshuffling.
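    The one new piece relative to DataLayerSetUp is the deterministic test-time offset. Its arithmetic can be isolated like this:

```cpp
// Sketch of the TEST-phase branch in InternalThreadEntry: instead of a random
// start, the new_length-frame window is centered inside each segment of
// average_duration frames. If the segment is shorter than the window, the
// code falls back to offset 1, as in the original.
int test_phase_offset(int average_duration, int new_length, int segment_idx) {
    if (average_duration >= new_length) {
        return (average_duration - new_length + 1) / 2
               + segment_idx * average_duration;
    }
    return 1;  // segment shorter than the window
}
```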



    Summary: I'm honestly not that good at this. Lately I feel like I've just been coasting, with long stretches of no progress, which gives me a headache. I'm writing this down so I can look back at it later, and I hope my small bit of understanding can help someone else. That's it; I've rambled enough.

    When reprinting, please credit the original: https://ju.6miu.com/read-676372.html
