from: http://www.emergingstack.com/2016/01/10/Nvidia-GPU-plus-CoreOS-plus-Docker-plus-TensorFlow.html
Providing data scientists with the best tool for the job is hard. They need everything from their computers - raw performance, bleeding-edge software and free rein to experiment.
We’ve developed a solution that meets all those requirements and avoids creating a ‘one-off’ build that would haunt sysadmin and devops teams.
tl;dr - Code to provision your environment is on Github.
It is still experimental, but it works. And because we’re early in the lifecycle for many of these tools, it will only get better.
For compute-intensive tasks, public cloud hosting costs can become prohibitive. A high-spec GPU-powered VM on AWS is about 20x more expensive than our day-to-day instances, at roughly USD$25,000 annually.
On-premises virtual servers, whilst cheaper, are also not tuned for these scientific-computing use-cases and don’t make for good neighbours in a shared environment.
We must look elsewhere…
The ‘server under the desk’ is back. Better than ever.
Nvidia gave us the “Dev Box” in 2015, a data scientist’s dream-machine. But at USD$15,000, it’s still a bit pricey.
Andrej Karpathy built a home rig that is pretty much ideal, hardware-wise. It can scale up to about the same specification as Nvidia’s beast. Ours is very similar, and there’s a nice spot for it, right under the desk.
So you’ve got the right hardware. And you followed Nvidia’s instructions to get all your software installed and configured. Hours spent deploying the right packages and dependencies, all hand-crafted. And it works perfectly. But you’ve created a sysadmin’s nightmare - a completely bespoke build.
The longer you use this machine, the harder it will be to rebuild, if/when it dies. Or if/when you want to do a major version upgrade. Or if/when you created that once-off workaround…more than once.
Downtime is your enemy, but you’ve introduced an ugly single-point-of-failure. And gone are the benefits of fully-automated builds.
Assuming you’ve got the right hardware ready, follow one of the bullet-proof quick-start guides for a ‘bare-metal’ build. CoreOS supports PXE, iPXE and installing to disk. Choose whichever method suits your needs.
We use the PXE build approach to provision our CoreOS systems.
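As a rough sketch of the preparation (the TFTP root path is a placeholder and the CoreOS release URLs below are the standard ones at the time of writing - check the quick-start guide for the specifics of your PXE server), staging the boot images looks something like this:

$ cd /var/lib/tftpboot
$ wget http://stable.release.core-os.net/amd64-usr/current/coreos_production_pxe.vmlinuz
$ wget http://stable.release.core-os.net/amd64-usr/current/coreos_production_pxe_image.cpio.gz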
Clone the es-dev-stack repository on Github.
$ git clone http://github.com/emergingstack/es-dev-stack.git

Docker image build (takes about 30 minutes, downloads ~2.5GB of source files):
$ cd es-dev-stack/corenvidiadrivers
$ docker build -t cuda .

Once complete, you may want to push this image to a private Docker registry, if one is handy. It could be useful when it comes to rebuilding your host one day.
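For example, tagging and pushing might look like this (the registry host is a placeholder):

$ docker tag cuda registry.example.com:5000/cuda
$ docker push registry.example.com:5000/cuda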
Dockerfile below for reference.
FROM ubuntu:14.04
MAINTAINER Mike Orzel <mike.orzel@emergingstack.com>

RUN apt-get -y update && apt-get -y install git bc make dpkg-dev && mkdir -p /usr/src/kernels && mkdir -p /opt/nvidia/nvidia_installers
ADD http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run /opt/nvidia/

WORKDIR /usr/src/kernels
RUN git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux
WORKDIR linux
RUN git checkout -b stable v`uname -r` && zcat /proc/config.gz > .config && make modules_prepare
RUN sed -i -e "s/`uname -r`+/`uname -r`/" include/generated/utsrelease.h # In case a '+' was added

# Nvidia drivers setup
WORKDIR /opt/nvidia/
RUN chmod +x cuda_7.0.28_linux.run && ./cuda_7.0.28_linux.run -extract=`pwd`/nvidia_installers
WORKDIR /opt/nvidia/nvidia_installers
RUN ./NVIDIA-Linux-x86_64-346.46.run -a -x --ui=none
RUN sed -i "s/read_cr4/__read_cr4/g" NVIDIA-Linux-x86_64-346.46/kernel/nv-pat.c
RUN sed -i "s/write_cr4/__write_cr4/g" NVIDIA-Linux-x86_64-346.46/kernel/nv-pat.c
CMD ./NVIDIA-Linux-x86_64-346.46/nvidia-installer -q -a -n -s --kernel-source-path=/usr/src/kernels/linux/ && insmod /opt/nvidia/nvidia_installers/NVIDIA-Linux-x86_64-346.46/kernel/uvm/nvidia-uvm.ko

Once this image has been run on the host, you should see a few ‘nvidia’ kernel modules loaded.
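A sketch of that step, assuming the image built above is run in privileged mode so its CMD can install the driver and insert the uvm module into the CoreOS host kernel (the exact invocation may differ):

$ docker run -it --privileged cuda
$ lsmod | grep -i nvidia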
Run the “mkdevs” script to create the devices (as root)
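The script lives in the repository; in essence it creates the Nvidia device nodes with mknod, along the lines of the sketch below (the real script may differ):

#!/bin/bash
# One device node per GPU, plus the control device (major number 195)
for i in 0 1; do
  mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255
# nvidia-uvm is assigned a dynamic major number; look it up in /proc/devices
UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
mknod -m 666 /dev/nvidia-uvm c $UVM_MAJOR 0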
# ./mkdevs.sh

Confirm the devices are installed:
# cd /dev
# ls -al | grep -i "nvidia"

You should see a list of devices that are now available to be mapped into a Docker container.
crw-rw-rw- 1 root root 247,   0 Jan  4 05:54 nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Jan  4 05:54 nvidia0
crw-rw-rw- 1 root root 195,   1 Jan  4 05:54 nvidia1
crw-rw-rw- 1 root root 195, 255 Jan  4 05:54 nvidiactl

Now you’re ready to put your almost-immutable system to use and expose the power of containerisation. Note that we’re using ‘privileged’ mode here to map the GPU devices to a Docker container, which is not secure from a ‘shared host’ perspective.
WARNING: The Dockerfile below produces a Docker image that is over 10GB. It will take 30-40 minutes to build.
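The build step itself mirrors the earlier one. Assuming the Dockerfile sits in the repository’s tflowgpu directory (check the repo for the exact path), it would be something like:

$ cd es-dev-stack/tflowgpu
$ docker build -t tflowgpu .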
We’ve added a Jupyter notebook to the Docker image to validate that the GPUs are functioning - it’s a basic ConvNet built with TensorFlow, which is sufficient for verification.
Dockerfile below for reference.
FROM b.gcr.io/tensorflow/tensorflow:latest-gpu
MAINTAINER Mike Orzel <mike.orzel@emergingstack.com>

# Add some dependent packages we will need for the build process
RUN apt-get -y update && apt-get -y install git bc make dpkg-dev && mkdir -p /usr/src/kernels && mkdir -p /opt/nvidia/nvidia_installers

# Download the nvidia cuda package
ADD http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run /opt/nvidia/
RUN chmod +x /opt/nvidia/cuda_7.0.28_linux.run

# download the linux kernel source and prepare it for use
WORKDIR /usr/src/kernels
RUN git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux
WORKDIR linux
RUN git checkout -b stable v`uname -r` && zcat /proc/config.gz > .config && make modules_prepare
RUN sed -i -e "s/`uname -r`+/`uname -r`/" include/generated/utsrelease.h # In case a '+' was added
RUN sed -i -e "s/`uname -r`+/`uname -r`/" include/config/kernel.release # In case a '+' was added

# Nvidia drivers setup
WORKDIR /opt/nvidia/
RUN chmod +x cuda_7.0.28_linux.run && ./cuda_7.0.28_linux.run -extract=`pwd`/nvidia_installers
WORKDIR /opt/nvidia/nvidia_installers
RUN ./NVIDIA-Linux-x86_64-346.46.run -a -x --ui=none
RUN sed -i "s/read_cr4/__read_cr4/g" NVIDIA-Linux-x86_64-346.46/kernel/nv-pat.c
RUN sed -i "s/write_cr4/__write_cr4/g" NVIDIA-Linux-x86_64-346.46/kernel/nv-pat.c
RUN ./NVIDIA-Linux-x86_64-346.46/nvidia-installer -q -a -n -s --kernel-source-path=/usr/src/kernels/linux/ --no-kernel-module

# install modules to expected location, cuda will do modprobes in certain situations which require this
WORKDIR /usr/src/kernels/linux
RUN make modules && make modules_install
RUN mv /lib/modules/`uname -r`+ /lib/modules/`uname -r`
WORKDIR /opt/nvidia/nvidia_installers
RUN depmod

# Run jupyter notebook and create a folder for the notebooks
RUN chmod +x /run_jupyter.sh
RUN mkdir /examples
WORKDIR /examples
COPY CNN.ipynb /examples/CNN.ipynb
CMD /run_jupyter.sh

This ‘docker run’ command maps the freshly installed GPU devices to the TensorFlow container. This is where the magic happens…
$ docker run --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm -it -p 8888:8888 --privileged tflowgpu

Open a browser and navigate to http://{your dev box IP}:8888 to launch Jupyter and run through the example. With a single Nvidia TitanX, this test should be about 10x faster than running on an Intel i7 CPU.
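As an extra sanity check (our own, not part of the notebook), you can ask TensorFlow inside the running container to log device placement; creating a session should print a line for each GPU it can see (the container ID is a placeholder):

$ docker exec -it <container id> python -c "import tensorflow as tf; tf.Session(config=tf.ConfigProto(log_device_placement=True))"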
Start using the es-dev-stack code now. Contributions welcome.
Attributions

This solution takes inspiration from a few community sources. Thanks to:
Nvidia driver setup via Docker - Joshua Kolden <joshua@studiopyxis.com>
ConvNet demo notebook - Edward Banner <edward.banner@gmail.com>

In ‘Part 2’, we will demonstrate how to integrate this dockerized environment with Spark and potentially Kubernetes (depending on the status of issue 19049).