The R extension package dplyr: data cleaning and tidying

    xiaoxiao  2021-03-25

    <div id="article_content" class="article_content">
<p><span style="font-size:14px"><span style="font-family:SimSun">This package is mainly used for data cleaning and tidying. Coursera course link: </span><a target="_blank" href="https://class.coursera.org/getdata-017" style="font-family:SimSun">Getting and Cleaning Data</a></span></p>
<p><span style="font-family:SimSun; font-size:14px">You can also load the swirl package, install the course Getting and Cleaning Data, and follow along interactively.</span></p>
<p><span style="font-family:SimSun; font-size:14px">As follows:</span></p>
<pre name="code" class="html">library(swirl)
install_from_swirl("Getting and Cleaning Data")
swirl()</pre><br>
<p><span style="font-family:SimSun; font-size:14px">This post mainly follows the introductory vignette that ships with the package: <a target="_blank" href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">Introduction to dplyr</a></span></p>
<p><span style="font-family:SimSun; font-size:14px">1. Sample data</span></p>
<pre name="code" class="html">> library(dplyr)
> library(nycflights13)
> dim(flights)
[1] 336776     16
> head(flights, 3)
Source: local data frame [3 x 16]

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
3 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160
Variables not shown: distance (dbl), hour (dbl), minute (dbl)</pre><br>
<p><span style="font-family:SimSun; font-size:14px">2. Wrap a large data frame in a tbl_df, which prints only the rows and columns that fit on screen</span></p>
<pre name="code" class="html">> flights_df <- tbl_df(flights)
> flights_df</pre>
<p><span style="font-family:SimSun; font-size:14px">Note: in recent dplyr releases tbl_df() is deprecated in favour of as_tibble().</span></p>
<p><span style="font-family:SimSun; font-size:14px">3. Filter rows with filter()</span></p>
<pre name="code" class="html">> filter(flights_df, month == 1, day == 1)
Source: local data frame [842 x 16]

   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227</pre>
<p>This keeps the rows where month == 1 and day == 1.</p>
<p>The base R equivalent:</p>
<pre name="code" class="html">flights_df[flights_df$month == 1 & flights_df$day == 1, ]</pre><br>
<p><span style="font-family:SimSun; font-size:14px">4. Select rows by position with slice()</span></p>
<pre name="code" class="html">slice(flights_df, 1:10)</pre><br>
<p><span style="font-family:SimSun; font-size:14px">5. Sort rows with arrange()</span></p>
<pre name="code" class="html">> arrange(flights_df, year, month, day)</pre>
<p>This sorts flights_df by year, then month, then day, all in ascending order.</p>
<p><span style="font-family:SimSun; font-size:14px">Wrap a column in desc() to sort it in descending order:</span></p>
<pre name="code" class="html">> arrange(flights_df, year, desc(month), day)</pre>
<p>The same kind of sorting with base R functions:</p>
<pre name="code" class="html"># ascending by year, month, day
flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
# descending by arr_delay
flights_df[order(flights_df$arr_delay, decreasing = TRUE), ]</pre>
<p><span style="font-family:SimSun; font-size:14px">6. Select columns with select()</span></p>
<p><span style="font-family:SimSun; font-size:14px">Pick the columns you want by name:</span></p>
<pre name="code" class="html">select(flights_df, year, month, day)</pre>
<p>This selects three columns.</p>
<p>Use the : operator to select a contiguous range of columns:</p>
<pre name="code" class="html">select(flights_df, year:day)</pre>
<p>Use - to drop the columns you do not want:</p>
<pre name="code" class="html">select(flights_df, -(year:day))</pre><br>
<p><span style="font-family:SimSun; font-size:14px">7. Add columns with mutate()</span></p>
<p><span style="font-family:SimSun; font-size:14px">mutate() creates new columns computed from existing ones:</span></p>
<pre name="code" class="html">> mutate(flights_df,
+        gain = arr_delay - dep_delay,
+        speed = distance / air_time * 60)</pre>
<p><span style="font-family:SimSun; font-size:14px">8. Summarise with summarise() (also spelled summarize())</span></p>
<pre name="code" class="html">> summarise(flights,
+           delay = mean(dep_delay, na.rm = TRUE))</pre>
<p><span style="font-family:SimSun; font-size:14px">This computes the mean of dep_delay, ignoring missing values.</span></p>
<p><span style="font-family:SimSun; font-size:14px">9. Draw random samples</span></p>
<pre name="code" class="html">sample_n(flights_df, 10)</pre>
<p>This randomly selects 10 rows.</p>
<pre name="code" class="html">sample_frac(flights_df, 0.01)</pre>
<p><span style="font-family:SimSun; font-size:14px">This randomly selects 1% of the rows.</span></p>
<p><span style="font-family:SimSun; font-size:14px">10. Group with group_by()</span></p>
<pre name="code" class="html">library(ggplot2)  # needed for the plot at the end

# Group by tailnum and store the grouped data as by_tailnum
by_tailnum <- group_by(flights, tailnum)

# For each tailnum group: number of flights, mean distance, mean arr_delay
delay <- summarise(by_tailnum,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))

# Keep planes with more than 20 flights and a mean distance under 2000
delay <- filter(delay, count > 20, dist < 2000)

ggplot(delay, aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()</pre><br>
<img src="https://img-blog.csdn.net/20150122175820824?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvdTAxMTI1Mzg3NA==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt="Scatter plot of mean arrival delay against mean distance per plane, point size showing the number of flights"><br>
<p>Without chaining, every intermediate result has to be stored by assignment:</p>
<pre name="code" class="html">a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)</pre><br>
<p><span style="font-family:SimSun; font-size:14px">11. The pipe operator %>%</span></p>
<p>Start with the name of the data, then apply each operation to it in turn:</p>
<pre name="code" class="html">flights %>%
    group_by(year, month, day) %>%
    select(arr_delay, dep_delay) %>%
    summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
    ) %>%
    filter(arr > 30 | dep > 30)</pre>
<p>None of the steps repeats the data name, because %>% passes the result of the previous step as the first argument of the next call.</p>
<p>To learn more about the package, see its reference manual (60 pages): <a target="_blank" href="http://cran.rstudio.com/web/packages/dplyr/dplyr.pdf">dplyr</a></p>
    </div>
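    As a recap, the verbs covered in this post can be combined into a single pipeline. The following is only a minimal sketch, not part of the original post: it assumes dplyr is installed and uses R's built-in mtcars data in place of flights so that it runs without nycflights13. The names cars, model, wt_kg, mean_wt_kg and result are made up for illustration.

```r
library(dplyr)

# Assumption: mtcars (built into R) stands in for the flights data.
# The car names live in the row names, so copy them into a real column first.
cars <- mutate(mtcars, model = rownames(mtcars))

result <- cars %>%
  filter(cyl %in% c(4, 6)) %>%             # keep only 4- and 6-cylinder cars
  select(model, cyl, mpg, wt) %>%          # keep only the columns we need
  mutate(wt_kg = wt * 1000 * 0.4536) %>%   # wt is in units of 1000 lbs
  group_by(cyl) %>%                        # one group per cylinder count
  summarise(
    count = n(),                           # cars per group
    mean_mpg = mean(mpg, na.rm = TRUE),
    mean_wt_kg = mean(wt_kg, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_mpg))                  # most fuel-efficient group first

print(result)
```

    The pipeline yields one row per cylinder count that survives the filter; swapping mtcars for flights and adjusting the column names gives the same shape of analysis as sections 10 and 11.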
    When reposting, please cite the original URL: https://ju.6miu.com/read-300178.html
