Chapter 11 Introduction to data analysis with the shell

https://www.datacamp.com/courses/introduction-to-shell-for-data-science

The Unix command line lets you combine existing programs in different ways and automate repetitive tasks, among other things.

11.1 Manipulating files and directories

We start with an introduction to the shell by seeing how to create, modify, and delete files and directories.

The file system manages files and directories. Each file and directory is identified by an absolute path that shows how to reach it from the root directory. The pwd command, short for print working directory, prints the path of the current working directory.

pwd
## /home/mario/MEGA/LIBROS/Ciencia de Datos con R

To list the files and directories contained in the working directory, run the ls command, which is short for listing. As an argument, you can give the path of the directory you want to list, for example ls /home/mario.

ls
## 01-programar-r-basico.Rmd
## 02-programar-r-intermedio.Rmd
## 03-introduccion-tidyverse.Rmd
## 04-importar-datos.Rmd
## 05-tratamiento-datos.Rmd
## 06-manipular-datos-dplyr.Rmd
## 07-introducción-datos.Rmd
## 08-escribir-funciones.Rmd
## 09-analisis-exploratorio.Rmd
## 10-introducción-analisis-datos-shell.Rmd
## 11-visualizacion-ggplot2.Rmd
## _book
## book.bib
## _bookdown_files
## _bookdown.yml
## Ciencia_de_Datos_con_R_files
## Ciencia-de-Datos-con-R_files
## Ciencia de Datos con R.Rmd
## Ciencia-de-Datos-con-R.Rmd
## Ciencia de Datos con R.Rproj
## datacamp.png
## Datos
## figure
## gitbook.R
## google_analytics.html
## index.Rmd
## local_latitude.xls
## _output.yml
## packages.bib
## preamble.tex
## README.md
## style.css
## temporal
## wine_local.RData

The absolute path of a file or directory is always the same, no matter where you are in the file system. A relative path, by contrast, depends on the directory you are currently in. For example, if the working directory is /home/mario and you want to list a directory called trabajo inside it (whose absolute path is /home/mario/trabajo), you can simply run ls trabajo.
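The difference between the two kinds of path can be sketched in a scratch directory. This is a minimal illustration, not part of the original notes: the temporary directory created by mktemp and the file name notas.txt are assumptions made up for the example.

```shell
# Build a scratch directory so the example does not touch real files.
# (The directory and notas.txt are hypothetical, created only for this sketch.)
base=$(mktemp -d)
mkdir "$base/trabajo"
touch "$base/trabajo/notas.txt"

# Relative path: "trabajo" is resolved against the current working directory.
cd "$base"
relative_listing=$(ls trabajo)

# Absolute path: starts at / and gives the same result from anywhere.
cd /
absolute_listing=$(ls "$base/trabajo")

echo "$relative_listing"
echo "$absolute_listing"
```

Running ls trabajo from / would most likely fail, because the relative path is then looked up under / rather than under the scratch directory.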

11.1.1 Changing directories

How do we move from one directory to another in the terminal? Use the cd (change directory) command to enter a child directory, and cd .. to leave it and return to the directory one level above in the hierarchy (the parent directory).

The parent of a directory is the directory above it. For example, /home is the parent of /home/repl, and /home/repl is the parent of /home/repl/seasonal. You can always give the absolute path of your parent directory to commands like cd and ls. More often, though, you will take advantage of the fact that the special path .. (two dots with no spaces) means “the directory above the one I’m currently in”. If you are in /home/repl/seasonal, then cd .. moves you up to /home/repl. If you use cd .. once again, it puts you in /home. One more cd .. puts you in the root directory /, which is the very top of the filesystem. (Remember to put a space between cd and .. - it is a command and a path, not a single four-letter command.)

A single dot on its own, ., always means “the current directory”, so ls on its own and ls . do the same thing, while cd . has no effect (because it moves you into the directory you’re currently in).

One final special path is ~ (the tilde character), which means “your home directory”, such as /home/repl. No matter where you are, ls ~ will always list the contents of your home directory, and cd ~ will always take you home.
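The three special paths can be tried out with a short sketch. It assumes a scratch tree that imitates the /home/repl/seasonal layout, and it points HOME at the scratch repl directory so that ~ behaves the same on any machine; both are assumptions of this example, not something the text prescribes.

```shell
# Scratch tree that mirrors the repl/seasonal layout used in the text.
base=$(mktemp -d)
mkdir -p "$base/repl/seasonal"

# Assumption for reproducibility: treat the scratch "repl" directory as home.
HOME="$base/repl"

cd "$base/repl/seasonal"
cd ..                 # up one level, to the parent directory
after_up=$(pwd)

cd .                  # "." is the current directory, so nothing changes
after_dot=$(pwd)

cd seasonal
cd ~                  # "~" expands to $HOME, wherever you are
after_home=$(pwd)

# All three lines print the path of the scratch "repl" directory.
printf '%s\n' "$after_up" "$after_dot" "$after_home"
```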

11.1.2 Copying files

You will often want to copy files, move them into other directories to organize them, or rename them. One command to do this is cp, which is short for “copy”. If original.txt is an existing file, then:

cp original.txt duplicate.txt

creates a copy of original.txt called duplicate.txt. If there already was a file called duplicate.txt, it is overwritten. If the last parameter to cp is an existing directory, then a command like:

cp seasonal/autumn.csv seasonal/winter.csv backup

copies all of the files into that directory.
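Both forms of cp can be seen side by side in a small sketch. The scratch directory and the one-line file contents are placeholders invented for this example.

```shell
# Work in a scratch directory with two small placeholder files.
work=$(mktemp -d)
cd "$work"
mkdir seasonal backup
printf 'date,tooth\n' > seasonal/autumn.csv
printf 'date,tooth\n' > seasonal/winter.csv

# Two names, and the second is not a directory: make one copy under a new name.
cp seasonal/autumn.csv duplicate.csv

# The last argument is an existing directory: copy every listed file into it.
cp seasonal/autumn.csv seasonal/winter.csv backup

# The originals stay where they were; backup now holds a copy of each.
ls seasonal backup
```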

11.1.3 Moving files

While cp copies a file, mv moves it from one directory to another, just as if you had dragged it in a graphical file browser. It handles its parameters the same way as cp, so the command:

mv autumn.csv winter.csv ..

moves the files autumn.csv and winter.csv from the current working directory up one level to its parent directory (because .. always refers to the directory above your current location).
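As a quick check of that behavior, the sketch below runs the same command inside a scratch directory (an assumption of this example) and then looks at both locations.

```shell
# Scratch layout: a seasonal directory holding two empty files.
work=$(mktemp -d)
mkdir "$work/seasonal"
touch "$work/seasonal/autumn.csv" "$work/seasonal/winter.csv"

cd "$work/seasonal"
mv autumn.csv winter.csv ..     # move both files up into the parent directory

left_behind=$(ls)               # nothing remains in seasonal
moved=$(ls ..)                  # the parent now holds both files (and seasonal)

echo "in seasonal: [$left_behind]"
echo "$moved"
```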

11.1.4 Renaming files

mv can also be used to rename files. If you run:

mv course.txt old-course.txt

then the file course.txt in the current working directory is “moved” to the file old-course.txt. This is different from the way file browsers work, but is often handy.

One warning: just like cp, mv will overwrite existing files. If, for example, you already have a file called old-course.txt, then the command shown above will replace it with whatever is in course.txt.
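One cautious habit, sketched here as a suggestion rather than as something the course prescribes, is to test whether the destination exists before renaming. The file contents are placeholders for this example.

```shell
work=$(mktemp -d)
cd "$work"
echo "new notes" > course.txt
echo "old notes" > old-course.txt

# Guarded rename: only move when the destination name is still free.
if [ -e old-course.txt ]; then
    echo "old-course.txt already exists; leaving both files alone"
else
    mv course.txt old-course.txt
fi
kept=$(cat old-course.txt)       # still the old contents

# Unguarded rename: mv replaces the existing file without asking.
mv course.txt old-course.txt
replaced=$(cat old-course.txt)   # now the contents of the former course.txt

echo "$kept"
echo "$replaced"
```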

11.1.5 Deleting files

We can copy files and move them around; to delete them, we use rm, which stands for “remove”. As with cp and mv, you can give rm the names of as many files as you’d like, so:

rm thesis.txt backup/thesis-2017-08.txt

removes both thesis.txt and backup/thesis-2017-08.txt.

rm does exactly what its name says, and it does it right away: unlike graphical file browsers, the shell doesn’t have a trash can, so when you type the command above, your thesis is gone for good.
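Because there is no undo, it can help to list the files first and delete them only once the list looks right. The sketch below does this in a scratch directory; the directory and the empty files are stand-ins created for the example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir backup
touch thesis.txt backup/thesis-2017-08.txt

# Look before deleting: ls with the same names shows exactly what rm will remove.
ls thesis.txt backup/thesis-2017-08.txt

# Delete both files in one command; there is no trash can to restore them from.
rm thesis.txt backup/thesis-2017-08.txt

# The backup directory itself is untouched, but it is now empty.
ls -R
```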

11.1.6 Creating and deleting directories

mv treats directories the same way it treats files: if you are in your home directory and run mv seasonal by-season, for example, mv changes the name of the seasonal directory to by-season. However, rm works differently.

If you try to rm a directory, the shell prints an error message telling you it can’t do that, primarily to stop you from accidentally deleting an entire directory full of work. Instead, you can use a separate command called rmdir. For added safety, it only works when the directory is empty, so you must delete the files in a directory before you delete the directory. (Experienced users can use the -r option to rm to get the same effect; command options are discussed later in this chapter.)

To create a directory, use mkdir (make directory).

A leading slash in front of a directory name refers to a directory located at the root; for example, /tmp is the directory for temporary files. To refer to your home directory, use ~/.
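The life cycle of a directory can be sketched as follows. The names people and agarwal.txt are hypothetical, chosen only for this example.

```shell
work=$(mktemp -d)
cd "$work"

mkdir people                 # create a new, empty directory
touch people/agarwal.txt     # put one file in it

# rmdir refuses to remove a directory that still contains files.
if rmdir people 2>/dev/null; then
    first_try=removed
else
    first_try=refused
fi
echo "first attempt: $first_try"

# Empty the directory first; then rmdir succeeds.
rm people/agarwal.txt
rmdir people
```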

11.2 Manipulating data

11.2.1 Viewing the contents of a file

Use the cat command, whose name comes from concatenate.
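A minimal sketch of both uses of cat, printing one file and joining several, using two throwaway files created for the example:

```shell
work=$(mktemp -d)
cd "$work"
printf 'first file\n' > a.txt
printf 'second file\n' > b.txt

# With one file name, cat prints that file's contents.
cat a.txt

# With several, it prints them one after another (it concatenates them).
joined=$(cat a.txt b.txt)
echo "$joined"
```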

You can use cat to print large files and then scroll through the output, but it is usually more convenient to page the output. The original command for doing this was called more, but it has been superseded by a more powerful command called less. (This kind of naming is what passes for humor in the Unix world.) When you less a file, one page is displayed at a time; you can press spacebar to page down or type q to quit.

If you give less the names of several files, you can type :n (colon and a lower-case ‘n’) to move to the next file, :p to go back to the previous one, or :q to quit.

11.2.2 Viewing the start of a file

The first thing most data scientists do when given a new dataset to analyze is figure out what fields it contains and what values those fields have. If the dataset has been exported from a database or spreadsheet, it will often be stored as comma-separated values (CSV). A quick way to figure out what it contains is to look at the first few rows.

We can do this in the shell using a command called head. As its name suggests, it prints the first few lines of a file (where “a few” means 10).
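To see the default in action, the sketch below generates a small CSV file (a made-up header plus 24 rows, an assumption of this example) and counts how many lines head prints.

```shell
work=$(mktemp -d)
cd "$work"

# Generate a header line followed by 24 data rows: 25 lines in total.
awk 'BEGIN { print "id,value"; for (i = 1; i <= 24; i++) print i "," (i * i) }' > data.csv

# With no flags, head prints the first 10 lines: the header and rows 1 to 9.
head data.csv

# Count the lines head produced (tr strips the padding some wc versions add).
line_count=$(head data.csv | wc -l | tr -d ' ')
echo "lines shown: $line_count"
```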

11.2.3 Tab completion

One of the shell’s power tools is tab completion. If you start typing the name of a file and then press the tab key, the shell will do its best to auto-complete the path. For example, if you type sea and press tab, it will fill in the directory name seasonal/ (with a trailing slash). If you then type a and tab, it will complete the path as seasonal/autumn.csv.

If the path is ambiguous, such as seasonal/s, pressing tab a second time will display a list of possibilities. Typing another character or two to make your path more specific and then pressing tab will fill in the rest of the name.

11.2.4 Modifying or customizing commands

You won’t always want to look at the first 10 lines of a file, so the shell lets you change head’s behavior by giving it a command-line flag (or just “flag” for short). If you run the command:

head -n 3 seasonal/summer.csv

head will only display the first three lines of the file. If you run head -n 100, it will display the first 100 (assuming there are that many), and so on.

A flag’s name usually indicates its purpose (for example, -n is meant to signal “number of lines”). Command flags don’t have to be a - followed by a single letter, but it’s a widely-used convention.
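The effect of -n can be checked on a small stand-in for seasonal/summer.csv. The file contents below are invented for this sketch.

```shell
work=$(mktemp -d)
cd "$work"
mkdir seasonal

# A five-line placeholder file: one header and four rows.
printf 'date,tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-01,molar\n2017-02-02,bicuspid\n' > seasonal/summer.csv

# Ask for exactly three lines.
head -n 3 seasonal/summer.csv
three=$(head -n 3 seasonal/summer.csv | wc -l | tr -d ' ')

# Asking for more lines than the file has simply prints the whole file.
all=$(head -n 100 seasonal/summer.csv | wc -l | tr -d ' ')

echo "with -n 3: $three lines; with -n 100: $all lines"
```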

Note: it’s considered good style to put all flags before any filenames.

11.2.5 Listing everything below a directory

In order to see everything underneath a directory, no matter how deeply nested it is, you can give ls the flag -R (which means “recursive”). If you use ls -R in your home directory, you will see something like this:

ls -R
## .:
## 01-programar-r-basico.Rmd
## 02-programar-r-intermedio.Rmd
## 03-introduccion-tidyverse.Rmd
## 04-importar-datos.Rmd
## 05-tratamiento-datos.Rmd
## 06-manipular-datos-dplyr.Rmd
## 07-introducción-datos.Rmd
## 08-escribir-funciones.Rmd
## 09-analisis-exploratorio.Rmd
## 10-introducción-analisis-datos-shell.Rmd
## 11-visualizacion-ggplot2.Rmd
## _book
## book.bib
## _bookdown_files
## _bookdown.yml
## Ciencia_de_Datos_con_R_files
## Ciencia-de-Datos-con-R_files
## Ciencia de Datos con R.Rmd
## Ciencia-de-Datos-con-R.Rmd
## Ciencia de Datos con R.Rproj
## datacamp.png
## Datos
## figure
## gitbook.R
## google_analytics.html
## index.Rmd
## local_latitude.xls
## _output.yml
## packages.bib
## preamble.tex
## README.md
## style.css
## temporal
## wine_local.RData
## 
## ./_book:
## analisis-exploratorio-de-los-datos.html
## applications.html
## Ciencia_de_Datos_con_R_files
## Ciencia-de-Datos-con-R_files
## Ciencia_de_Datos_con_R.pdf
## Ciencia_de_Datos_con_R.tex
## datacamp.png
## el-mundo-tidyverse.html
## escribir-funciones-en-r.html
## figure
## final-words.html
## importar-datos-en-r.html
## index.html
## introduccion-al-analisis-de-datos-con-shell.html
## introduccion-a-los-datos.html
## intro.html
## libs
## literature.html
## manipulacion-de-datos-en-r-con-dplyr.html
## methods.html
## programacion-basica-en-r.html
## programacion-intermedia-en-r.html
## references.html
## search_index.json
## style.css
## tratamiento-de-datos-en-r.html
## visualizacion-de-datos-con-ggplot2.html
## 
## ./_book/Ciencia_de_Datos_con_R_files:
## figure-html
## 
## ./_book/Ciencia_de_Datos_con_R_files/figure-html:
## nice-fig-1.png
## unnamed-chunk-132-1.png
## unnamed-chunk-133-1.png
## unnamed-chunk-133-2.png
## unnamed-chunk-134-1.png
## unnamed-chunk-135-1.png
## unnamed-chunk-135-2.png
## unnamed-chunk-138-1.png
## unnamed-chunk-139-1.png
## unnamed-chunk-140-1.png
## unnamed-chunk-141-1.png
## unnamed-chunk-141-2.png
## unnamed-chunk-142-1.png
## unnamed-chunk-142-2.png
## unnamed-chunk-143-1.png
## unnamed-chunk-143-2.png
## unnamed-chunk-144-1.png
## unnamed-chunk-201-1.png
## unnamed-chunk-255-1.png
## unnamed-chunk-316-1.png
## unnamed-chunk-316-2.png
## unnamed-chunk-316-3.png
## unnamed-chunk-317-1.png
## unnamed-chunk-317-2.png
## unnamed-chunk-317-3.png
## unnamed-chunk-319-1.png
## unnamed-chunk-319-2.png
## unnamed-chunk-319-3.png
## unnamed-chunk-320-1.png
## unnamed-chunk-320-2.png
## unnamed-chunk-320-3.png
## unnamed-chunk-321-1.png
## unnamed-chunk-321-2.png
## unnamed-chunk-321-3.png
## unnamed-chunk-322-1.png
## unnamed-chunk-322-2.png
## unnamed-chunk-322-3.png
## unnamed-chunk-323-1.png
## unnamed-chunk-323-2.png
## unnamed-chunk-323-3.png
## unnamed-chunk-337-1.png
## unnamed-chunk-337-2.png
## unnamed-chunk-339-1.png
## unnamed-chunk-339-2.png
## unnamed-chunk-340-1.png
## unnamed-chunk-341-1.png
## unnamed-chunk-343-1.png
## unnamed-chunk-343-2.png
## unnamed-chunk-344-1.png
## unnamed-chunk-344-2.png
## unnamed-chunk-346-1.png
## unnamed-chunk-347-1.png
## unnamed-chunk-349-1.png
## unnamed-chunk-350-1.png
## unnamed-chunk-351-1.png
## unnamed-chunk-351-2.png
## unnamed-chunk-352-1.png
## unnamed-chunk-353-1.png
## unnamed-chunk-358-1.png
## unnamed-chunk-359-1.png
## unnamed-chunk-360-1.png
## unnamed-chunk-360-2.png
## unnamed-chunk-360-3.png
## unnamed-chunk-361-1.png
## unnamed-chunk-361-2.png
## unnamed-chunk-361-3.png
## unnamed-chunk-361-4.png
## unnamed-chunk-362-1.png
## unnamed-chunk-362-2.png
## unnamed-chunk-363-1.png
## unnamed-chunk-363-2.png
## 
## ./_book/Ciencia-de-Datos-con-R_files:
## figure-html
## 
## ./_book/Ciencia-de-Datos-con-R_files/figure-html:
## unnamed-chunk-1-1.png
## unnamed-chunk-132-1.png
## unnamed-chunk-133-1.png
## unnamed-chunk-133-2.png
## unnamed-chunk-134-1.png
## unnamed-chunk-135-1.png
## unnamed-chunk-135-2.png
## unnamed-chunk-138-1.png
## unnamed-chunk-139-1.png
## unnamed-chunk-140-1.png
## unnamed-chunk-141-1.png
## unnamed-chunk-141-2.png
## unnamed-chunk-142-1.png
## unnamed-chunk-142-2.png
## unnamed-chunk-143-1.png
## unnamed-chunk-143-2.png
## unnamed-chunk-144-1.png
## unnamed-chunk-201-1.png
## unnamed-chunk-2-1.png
## unnamed-chunk-255-1.png
## unnamed-chunk-316-1.png
## unnamed-chunk-316-2.png
## unnamed-chunk-316-3.png
## unnamed-chunk-317-1.png
## unnamed-chunk-317-2.png
## unnamed-chunk-317-3.png
## unnamed-chunk-319-1.png
## unnamed-chunk-319-2.png
## unnamed-chunk-319-3.png
## unnamed-chunk-3-1.png
## unnamed-chunk-320-1.png
## unnamed-chunk-320-2.png
## unnamed-chunk-320-3.png
## unnamed-chunk-321-1.png
## unnamed-chunk-321-2.png
## unnamed-chunk-321-3.png
## unnamed-chunk-322-1.png
## unnamed-chunk-322-2.png
## unnamed-chunk-322-3.png
## unnamed-chunk-323-1.png
## unnamed-chunk-323-2.png
## unnamed-chunk-323-3.png
## unnamed-chunk-3-2.png
## unnamed-chunk-337-1.png
## unnamed-chunk-337-2.png
## unnamed-chunk-339-1.png
## unnamed-chunk-339-2.png
## unnamed-chunk-3-3.png
## unnamed-chunk-340-1.png
## unnamed-chunk-341-1.png
## unnamed-chunk-343-1.png
## unnamed-chunk-343-2.png
## unnamed-chunk-344-1.png
## unnamed-chunk-344-2.png
## unnamed-chunk-346-1.png
## unnamed-chunk-347-1.png
## unnamed-chunk-349-1.png
## unnamed-chunk-350-1.png
## unnamed-chunk-351-1.png
## unnamed-chunk-351-2.png
## unnamed-chunk-352-1.png
## unnamed-chunk-353-1.png
## unnamed-chunk-358-1.png
## unnamed-chunk-359-1.png
## unnamed-chunk-360-1.png
## unnamed-chunk-360-2.png
## unnamed-chunk-360-3.png
## unnamed-chunk-361-1.png
## unnamed-chunk-361-2.png
## unnamed-chunk-361-3.png
## unnamed-chunk-361-4.png
## unnamed-chunk-362-1.png
## unnamed-chunk-362-2.png
## unnamed-chunk-363-1.png
## unnamed-chunk-363-2.png
## unnamed-chunk-364-1.png
## unnamed-chunk-364-2.png
## unnamed-chunk-365-1.png
## unnamed-chunk-365-2.png
## unnamed-chunk-365-3.png
## unnamed-chunk-365-4.png
## unnamed-chunk-367-1.png
## unnamed-chunk-368-1.png
## unnamed-chunk-369-1.png
## unnamed-chunk-371-10.png
## unnamed-chunk-371-1.png
## unnamed-chunk-371-2.png
## unnamed-chunk-371-3.png
## unnamed-chunk-371-4.png
## unnamed-chunk-371-5.png
## unnamed-chunk-371-6.png
## unnamed-chunk-371-7.png
## unnamed-chunk-371-8.png
## unnamed-chunk-371-9.png
## unnamed-chunk-372-1.png
## unnamed-chunk-372-2.png
## unnamed-chunk-372-3.png
## unnamed-chunk-372-4.png
## unnamed-chunk-372-5.png
## unnamed-chunk-372-6.png
## unnamed-chunk-373-1.png
## unnamed-chunk-373-2.png
## unnamed-chunk-373-3.png
## unnamed-chunk-375-1.png
## unnamed-chunk-375-2.png
## unnamed-chunk-376-1.png
## unnamed-chunk-376-2.png
## unnamed-chunk-378-1.png
## unnamed-chunk-379-1.png
## unnamed-chunk-380-1.png
## unnamed-chunk-380-2.png
## unnamed-chunk-380-3.png
## unnamed-chunk-380-4.png
## unnamed-chunk-381-1.png
## unnamed-chunk-381-2.png
## unnamed-chunk-382-1.png
## unnamed-chunk-383-1.png
## unnamed-chunk-384-1.png
## unnamed-chunk-385-1.png
## unnamed-chunk-387-1.png
## unnamed-chunk-388-1.png
## unnamed-chunk-389-1.png
## unnamed-chunk-389-2.png
## unnamed-chunk-389-3.png
## unnamed-chunk-390-1.png
## unnamed-chunk-391-1.png
## unnamed-chunk-391-2.png
## unnamed-chunk-391-3.png
## unnamed-chunk-392-1.png
## unnamed-chunk-392-2.png
## unnamed-chunk-392-3.png
## unnamed-chunk-393-1.png
## unnamed-chunk-393-2.png
## unnamed-chunk-395-1.png
## unnamed-chunk-395-2.png
## unnamed-chunk-396-1.png
## unnamed-chunk-396-2.png
## unnamed-chunk-398-1.png
## unnamed-chunk-402-1.png
## unnamed-chunk-403-1.png
## unnamed-chunk-403-2.png
## unnamed-chunk-403-3.png
## unnamed-chunk-404-1.png
## unnamed-chunk-404-2.png
## unnamed-chunk-405-1.png
## unnamed-chunk-405-2.png
## unnamed-chunk-406-1.png
## unnamed-chunk-406-2.png
## unnamed-chunk-406-3.png
## unnamed-chunk-406-4.png
## unnamed-chunk-406-5.png
## unnamed-chunk-407-1.png
## unnamed-chunk-407-2.png
## unnamed-chunk-407-3.png
## unnamed-chunk-408-1.png
## unnamed-chunk-408-2.png
## unnamed-chunk-408-3.png
## unnamed-chunk-409-1.png
## unnamed-chunk-409-2.png
## unnamed-chunk-409-3.png
## unnamed-chunk-410-1.png
## unnamed-chunk-410-2.png
## unnamed-chunk-411-1.png
## unnamed-chunk-411-2.png
## unnamed-chunk-411-3.png
## unnamed-chunk-412-1.png
## unnamed-chunk-412-2.png
## unnamed-chunk-413-1.png
## unnamed-chunk-413-2.png
## unnamed-chunk-413-3.png
## unnamed-chunk-413-4.png
## unnamed-chunk-4-1.png
## unnamed-chunk-4-2.png
## unnamed-chunk-4-3.png
## unnamed-chunk-4-4.png
## unnamed-chunk-5-1.png
## unnamed-chunk-5-2.png
## unnamed-chunk-6-1.png
## unnamed-chunk-6-2.png
## 
## ./_book/figure:
## Subsetting_listas.png
## 
## ./_book/libs:
## gitbook-2.6.7
## jquery-2.2.3
## 
## ./_book/libs/gitbook-2.6.7:
## css
## js
## 
## ./_book/libs/gitbook-2.6.7/css:
## fontawesome
## plugin-bookdown.css
## plugin-fontsettings.css
## plugin-highlight.css
## plugin-search.css
## plugin-table.css
## style.css
## 
## ./_book/libs/gitbook-2.6.7/css/fontawesome:
## fontawesome-webfont.ttf
## 
## ./_book/libs/gitbook-2.6.7/js:
## app.min.js
## jquery.highlight.js
## lunr.js
## plugin-bookdown.js
## plugin-fontsettings.js
## plugin-search.js
## plugin-sharing.js
## 
## ./_book/libs/jquery-2.2.3:
## jquery.min.js
## 
## ./_bookdown_files:
## 
## ./Ciencia_de_Datos_con_R_files:
## figure-html
## figure-latex
## 
## ./Ciencia_de_Datos_con_R_files/figure-html:
## nice-fig-1.png
## unnamed-chunk-132-1.png
## unnamed-chunk-133-1.png
## unnamed-chunk-133-2.png
## unnamed-chunk-134-1.png
## unnamed-chunk-135-1.png
## unnamed-chunk-135-2.png
## unnamed-chunk-138-1.png
## unnamed-chunk-139-1.png
## unnamed-chunk-140-1.png
## unnamed-chunk-141-1.png
## unnamed-chunk-141-2.png
## unnamed-chunk-142-1.png
## unnamed-chunk-142-2.png
## unnamed-chunk-143-1.png
## unnamed-chunk-143-2.png
## unnamed-chunk-144-1.png
## unnamed-chunk-201-1.png
## unnamed-chunk-255-1.png
## unnamed-chunk-316-1.png
## unnamed-chunk-316-2.png
## unnamed-chunk-316-3.png
## unnamed-chunk-317-1.png
## unnamed-chunk-317-2.png
## unnamed-chunk-317-3.png
## unnamed-chunk-319-1.png
## unnamed-chunk-319-2.png
## unnamed-chunk-319-3.png
## unnamed-chunk-320-1.png
## unnamed-chunk-320-2.png
## unnamed-chunk-320-3.png
## unnamed-chunk-321-1.png
## unnamed-chunk-321-2.png
## unnamed-chunk-321-3.png
## unnamed-chunk-322-1.png
## unnamed-chunk-322-2.png
## unnamed-chunk-322-3.png
## unnamed-chunk-323-1.png
## unnamed-chunk-323-2.png
## unnamed-chunk-323-3.png
## unnamed-chunk-337-1.png
## unnamed-chunk-337-2.png
## unnamed-chunk-339-1.png
## unnamed-chunk-339-2.png
## unnamed-chunk-340-1.png
## unnamed-chunk-341-1.png
## unnamed-chunk-343-1.png
## unnamed-chunk-343-2.png
## unnamed-chunk-344-1.png
## unnamed-chunk-344-2.png
## unnamed-chunk-346-1.png
## unnamed-chunk-347-1.png
## unnamed-chunk-349-1.png
## unnamed-chunk-350-1.png
## unnamed-chunk-351-1.png
## unnamed-chunk-351-2.png
## unnamed-chunk-352-1.png
## unnamed-chunk-353-1.png
## unnamed-chunk-358-1.png
## unnamed-chunk-359-1.png
## unnamed-chunk-360-1.png
## unnamed-chunk-360-2.png
## unnamed-chunk-360-3.png
## unnamed-chunk-361-1.png
## unnamed-chunk-361-2.png
## unnamed-chunk-361-3.png
## unnamed-chunk-361-4.png
## unnamed-chunk-362-1.png
## unnamed-chunk-362-2.png
## unnamed-chunk-363-1.png
## unnamed-chunk-363-2.png
## 
## ./Ciencia_de_Datos_con_R_files/figure-latex:
## cars-1.pdf
## unnamed-chunk-132-1.pdf
## unnamed-chunk-133-1.pdf
## unnamed-chunk-133-2.pdf
## unnamed-chunk-134-1.pdf
## unnamed-chunk-134-2.pdf
## unnamed-chunk-135-1.pdf
## unnamed-chunk-135-2.pdf
## unnamed-chunk-136-1.pdf
## unnamed-chunk-136-2.pdf
## unnamed-chunk-137-1.pdf
## unnamed-chunk-137-2.pdf
## unnamed-chunk-138-1.pdf
## unnamed-chunk-138-2.pdf
## unnamed-chunk-139-1.pdf
## unnamed-chunk-139-2.pdf
## unnamed-chunk-140-1.pdf
## unnamed-chunk-141-1.pdf
## unnamed-chunk-141-2.pdf
## unnamed-chunk-142-1.pdf
## unnamed-chunk-142-2.pdf
## unnamed-chunk-143-1.pdf
## unnamed-chunk-143-2.pdf
## unnamed-chunk-144-1.pdf
## unnamed-chunk-144-2.pdf
## unnamed-chunk-145-1.pdf
## unnamed-chunk-145-2.pdf
## unnamed-chunk-146-1.pdf
## unnamed-chunk-146-2.pdf
## unnamed-chunk-147-1.pdf
## unnamed-chunk-147-2.pdf
## unnamed-chunk-148-1.pdf
## unnamed-chunk-148-2.pdf
## unnamed-chunk-149-1.pdf
## unnamed-chunk-149-2.pdf
## unnamed-chunk-150-1.pdf
## unnamed-chunk-150-2.pdf
## unnamed-chunk-151-1.pdf
## unnamed-chunk-151-2.pdf
## unnamed-chunk-152-1.pdf
## unnamed-chunk-152-2.pdf
## unnamed-chunk-153-1.pdf
## unnamed-chunk-153-2.pdf
## unnamed-chunk-154-1.pdf
## unnamed-chunk-154-2.pdf
## unnamed-chunk-155-1.pdf
## unnamed-chunk-156-1.pdf
## unnamed-chunk-156-2.pdf
## unnamed-chunk-159-1.pdf
## unnamed-chunk-160-1.pdf
## unnamed-chunk-161-1.pdf
## unnamed-chunk-162-1.pdf
## unnamed-chunk-162-2.pdf
## unnamed-chunk-163-1.pdf
## unnamed-chunk-163-2.pdf
## unnamed-chunk-164-1.pdf
## unnamed-chunk-164-2.pdf
## unnamed-chunk-165-1.pdf
## unnamed-chunk-201-1.pdf
## unnamed-chunk-202-1.pdf
## unnamed-chunk-204-1.pdf
## unnamed-chunk-205-1.pdf
## unnamed-chunk-207-1.pdf
## unnamed-chunk-210-1.pdf
## unnamed-chunk-211-1.pdf
## unnamed-chunk-222-1.pdf
## unnamed-chunk-255-1.pdf
## unnamed-chunk-256-1.pdf
## unnamed-chunk-258-1.pdf
## unnamed-chunk-259-1.pdf
## unnamed-chunk-261-1.pdf
## unnamed-chunk-265-1.pdf
## unnamed-chunk-316-1.pdf
## unnamed-chunk-316-2.pdf
## unnamed-chunk-316-3.pdf
## unnamed-chunk-317-1.pdf
## unnamed-chunk-317-2.pdf
## unnamed-chunk-317-3.pdf
## unnamed-chunk-318-1.pdf
## unnamed-chunk-318-2.pdf
## unnamed-chunk-318-3.pdf
## unnamed-chunk-319-1.pdf
## unnamed-chunk-319-2.pdf
## unnamed-chunk-319-3.pdf
## unnamed-chunk-320-1.pdf
## unnamed-chunk-320-2.pdf
## unnamed-chunk-320-3.pdf
## unnamed-chunk-321-1.pdf
## unnamed-chunk-321-2.pdf
## unnamed-chunk-321-3.pdf
## unnamed-chunk-322-1.pdf
## unnamed-chunk-322-2.pdf
## unnamed-chunk-322-3.pdf
## unnamed-chunk-323-1.pdf
## unnamed-chunk-323-2.pdf
## unnamed-chunk-323-3.pdf
## unnamed-chunk-324-1.pdf
## unnamed-chunk-324-2.pdf
## unnamed-chunk-324-3.pdf
## unnamed-chunk-325-1.pdf
## unnamed-chunk-325-2.pdf
## unnamed-chunk-325-3.pdf
## unnamed-chunk-326-1.pdf
## unnamed-chunk-326-2.pdf
## unnamed-chunk-326-3.pdf
## unnamed-chunk-327-1.pdf
## unnamed-chunk-327-2.pdf
## unnamed-chunk-327-3.pdf
## unnamed-chunk-329-1.pdf
## unnamed-chunk-329-2.pdf
## unnamed-chunk-329-3.pdf
## unnamed-chunk-330-1.pdf
## unnamed-chunk-330-2.pdf
## unnamed-chunk-330-3.pdf
## unnamed-chunk-331-1.pdf
## unnamed-chunk-331-2.pdf
## unnamed-chunk-331-3.pdf
## unnamed-chunk-332-1.pdf
## unnamed-chunk-332-2.pdf
## unnamed-chunk-332-3.pdf
## unnamed-chunk-333-1.pdf
## unnamed-chunk-333-2.pdf
## unnamed-chunk-333-3.pdf
## unnamed-chunk-338-1.pdf
## unnamed-chunk-338-2.pdf
## unnamed-chunk-339-1.pdf
## unnamed-chunk-339-2.pdf
## unnamed-chunk-340-1.pdf
## unnamed-chunk-340-2.pdf
## unnamed-chunk-341-1.pdf
## unnamed-chunk-341-2.pdf
## unnamed-chunk-342-1.pdf
## unnamed-chunk-343-1.pdf
## unnamed-chunk-344-1.pdf
## unnamed-chunk-344-2.pdf
## unnamed-chunk-345-1.pdf
## unnamed-chunk-345-2.pdf
## unnamed-chunk-346-1.pdf
## unnamed-chunk-346-2.pdf
## unnamed-chunk-347-1.pdf
## unnamed-chunk-348-1.pdf
## unnamed-chunk-349-1.pdf
## unnamed-chunk-350-1.pdf
## unnamed-chunk-351-1.pdf
## unnamed-chunk-352-1.pdf
## unnamed-chunk-352-2.pdf
## unnamed-chunk-353-1.pdf
## unnamed-chunk-353-2.pdf
## unnamed-chunk-354-1.pdf
## unnamed-chunk-355-1.pdf
## 
## ./Ciencia-de-Datos-con-R_files:
## figure-html
## 
## ./Ciencia-de-Datos-con-R_files/figure-html:
## unnamed-chunk-1-1.png
## unnamed-chunk-132-1.png
## unnamed-chunk-133-1.png
## unnamed-chunk-133-2.png
## unnamed-chunk-134-1.png
## unnamed-chunk-135-1.png
## unnamed-chunk-135-2.png
## unnamed-chunk-138-1.png
## unnamed-chunk-139-1.png
## unnamed-chunk-140-1.png
## unnamed-chunk-141-1.png
## unnamed-chunk-141-2.png
## unnamed-chunk-142-1.png
## unnamed-chunk-142-2.png
## unnamed-chunk-143-1.png
## unnamed-chunk-143-2.png
## unnamed-chunk-144-1.png
## unnamed-chunk-201-1.png
## unnamed-chunk-2-1.png
## unnamed-chunk-255-1.png
## unnamed-chunk-316-1.png
## unnamed-chunk-316-2.png
## unnamed-chunk-316-3.png
## unnamed-chunk-317-1.png
## unnamed-chunk-317-2.png
## unnamed-chunk-317-3.png
## unnamed-chunk-319-1.png
## unnamed-chunk-319-2.png
## unnamed-chunk-319-3.png
## unnamed-chunk-3-1.png
## unnamed-chunk-320-1.png
## unnamed-chunk-320-2.png
## unnamed-chunk-320-3.png
## unnamed-chunk-321-1.png
## unnamed-chunk-321-2.png
## unnamed-chunk-321-3.png
## unnamed-chunk-322-1.png
## unnamed-chunk-322-2.png
## unnamed-chunk-322-3.png
## unnamed-chunk-323-1.png
## unnamed-chunk-323-2.png
## unnamed-chunk-323-3.png
## unnamed-chunk-3-2.png
## unnamed-chunk-337-1.png
## unnamed-chunk-337-2.png
## unnamed-chunk-339-1.png
## unnamed-chunk-339-2.png
## unnamed-chunk-3-3.png
## unnamed-chunk-340-1.png
## unnamed-chunk-341-1.png
## unnamed-chunk-343-1.png
## unnamed-chunk-343-2.png
## unnamed-chunk-344-1.png
## unnamed-chunk-344-2.png
## unnamed-chunk-346-1.png
## unnamed-chunk-347-1.png
## unnamed-chunk-349-1.png
## unnamed-chunk-350-1.png
## unnamed-chunk-351-1.png
## unnamed-chunk-351-2.png
## unnamed-chunk-352-1.png
## unnamed-chunk-353-1.png
## unnamed-chunk-358-1.png
## unnamed-chunk-359-1.png
## unnamed-chunk-360-1.png
## unnamed-chunk-360-2.png
## unnamed-chunk-360-3.png
## unnamed-chunk-361-1.png
## unnamed-chunk-361-2.png
## unnamed-chunk-361-3.png
## unnamed-chunk-361-4.png
## unnamed-chunk-362-1.png
## unnamed-chunk-362-2.png
## unnamed-chunk-363-1.png
## unnamed-chunk-363-2.png
## unnamed-chunk-364-1.png
## unnamed-chunk-364-2.png
## unnamed-chunk-365-1.png
## unnamed-chunk-365-2.png
## unnamed-chunk-365-3.png
## unnamed-chunk-365-4.png
## unnamed-chunk-367-1.png
## unnamed-chunk-368-1.png
## unnamed-chunk-369-1.png
## unnamed-chunk-371-10.png
## unnamed-chunk-371-1.png
## unnamed-chunk-371-2.png
## unnamed-chunk-371-3.png
## unnamed-chunk-371-4.png
## unnamed-chunk-371-5.png
## unnamed-chunk-371-6.png
## unnamed-chunk-371-7.png
## unnamed-chunk-371-8.png
## unnamed-chunk-371-9.png
## unnamed-chunk-372-1.png
## unnamed-chunk-372-2.png
## unnamed-chunk-372-3.png
## unnamed-chunk-372-4.png
## unnamed-chunk-372-5.png
## unnamed-chunk-372-6.png
## unnamed-chunk-373-1.png
## unnamed-chunk-373-2.png
## unnamed-chunk-373-3.png
## unnamed-chunk-375-1.png
## unnamed-chunk-375-2.png
## unnamed-chunk-376-1.png
## unnamed-chunk-376-2.png
## unnamed-chunk-378-1.png
## unnamed-chunk-379-1.png
## unnamed-chunk-380-1.png
## unnamed-chunk-380-2.png
## unnamed-chunk-380-3.png
## unnamed-chunk-380-4.png
## unnamed-chunk-381-1.png
## unnamed-chunk-381-2.png
## unnamed-chunk-382-1.png
## unnamed-chunk-383-1.png
## unnamed-chunk-384-1.png
## unnamed-chunk-385-1.png
## unnamed-chunk-387-1.png
## unnamed-chunk-388-1.png
## unnamed-chunk-389-1.png
## unnamed-chunk-389-2.png
## unnamed-chunk-389-3.png
## unnamed-chunk-390-1.png
## unnamed-chunk-391-1.png
## unnamed-chunk-391-2.png
## unnamed-chunk-391-3.png
## unnamed-chunk-392-1.png
## unnamed-chunk-392-2.png
## unnamed-chunk-392-3.png
## unnamed-chunk-393-1.png
## unnamed-chunk-393-2.png
## unnamed-chunk-395-1.png
## unnamed-chunk-395-2.png
## unnamed-chunk-396-1.png
## unnamed-chunk-396-2.png
## unnamed-chunk-398-1.png
## unnamed-chunk-402-1.png
## unnamed-chunk-403-1.png
## unnamed-chunk-403-2.png
## unnamed-chunk-403-3.png
## unnamed-chunk-404-1.png
## unnamed-chunk-404-2.png
## unnamed-chunk-405-1.png
## unnamed-chunk-405-2.png
## unnamed-chunk-406-1.png
## unnamed-chunk-406-2.png
## unnamed-chunk-406-3.png
## unnamed-chunk-406-4.png
## unnamed-chunk-406-5.png
## unnamed-chunk-407-1.png
## unnamed-chunk-407-2.png
## unnamed-chunk-407-3.png
## unnamed-chunk-408-1.png
## unnamed-chunk-408-2.png
## unnamed-chunk-408-3.png
## unnamed-chunk-409-1.png
## unnamed-chunk-409-2.png
## unnamed-chunk-409-3.png
## unnamed-chunk-410-1.png
## unnamed-chunk-410-2.png
## unnamed-chunk-411-1.png
## unnamed-chunk-411-2.png
## unnamed-chunk-411-3.png
## unnamed-chunk-412-1.png
## unnamed-chunk-412-2.png
## unnamed-chunk-413-1.png
## unnamed-chunk-413-2.png
## unnamed-chunk-413-3.png
## unnamed-chunk-413-4.png
## unnamed-chunk-4-1.png
## unnamed-chunk-4-2.png
## unnamed-chunk-4-3.png
## unnamed-chunk-4-4.png
## unnamed-chunk-5-1.png
## unnamed-chunk-5-2.png
## unnamed-chunk-6-1.png
## unnamed-chunk-6-2.png
## 
## ./Datos:
## attendance.xls
## bmi.R
## clean.xlsx
## edequality.dta
## films.sql
## florida.dta
## hotdogs.txt
## international.sav
## mbta.xlsx
## person.sav
## potatoes.csv
## potatoes.txt
## renamed.xlsx
## students2.R
## students.R
## summary.xlsx
## swimming_pools.csv
## urbanpop_nonames.xlsx
## urbanpop.xlsx
## Viva el Software Libre.png
## 
## ./figure:
## Subsetting_listas.png
## 
## ./temporal:
## 06-importar-tratamiento-casos-estudio.Rmd
## 09-unir-datos-dplyr.Rmd
## 10-introducción-analisis-datos-SQL.Rmd
## 10-machine-learning.Rmd
## 11-r-markdown.Rmd
## 14-machine-learning-toolbox.Rmd
## 15-web-mapping.Rmd

To help you know what is what, ls has another flag -F that prints a / after the name of every directory and a * after the name of every runnable program. Run ls with the two flags, -R and -F, and the absolute path to your home directory to see everything it contains. (The order of the flags doesn’t matter, but the directory name must come last.)
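A small sketch shows what the -F markers look like. It lists a scratch directory instead of the home directory so the output stays short; the directory, the file names, and the run.sh script are assumptions of this example.

```shell
# Scratch tree: one subdirectory, one data file, and one executable script.
work=$(mktemp -d)
mkdir "$work/seasonal"
touch "$work/seasonal/autumn.csv"
printf '#!/bin/sh\necho hello\n' > "$work/run.sh"
chmod +x "$work/run.sh"

# Flags first, then the absolute path of the directory to list.
listing=$(ls -R -F "$work")

# Directories are shown with a trailing / and runnable programs with a trailing *.
echo "$listing"
```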

11.2.6 Getting help for commands

To find out what commands do, people used to use the man command (short for “manual”). For example, the command man head brings up this information:

man head
## HEAD(1)                          User Commands                         HEAD(1)
## 
## NAME
##        head - output the first part of files
## 
## SYNOPSIS
##        head [OPTION]... [FILE]...
## 
## DESCRIPTION
##        Print  the  first  10 lines of each FILE to standard output.  With more
##        than one FILE, precede each with a header giving the file name.
## 
##        With no FILE, or when FILE is -, read standard input.
## 
##        Mandatory arguments to long options are  mandatory  for  short  options
##        too.
## 
##        -c, --bytes=[-]NUM
##               print  the  first  NUM bytes of each file; with the leading '-',
##               print all but the last NUM bytes of each file
## 
##        -n, --lines=[-]NUM
##               print the first NUM lines instead of  the  first  10;  with  the
##               leading '-', print all but the last NUM lines of each file
## 
##        -q, --quiet, --silent
##               never print headers giving file names
## 
##        -v, --verbose
##               always print headers giving file names
## 
##        -z, --zero-terminated
##               line delimiter is NUL, not newline
## 
##        --help display this help and exit
## 
##        --version
##               output version information and exit
## 
##        NUM may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000,
##        M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for  T,  P,
##        E, Z, Y.
## 
## AUTHOR
##        Written by David MacKenzie and Jim Meyering.
## 
## REPORTING BUGS
##        GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
##        Report head translation bugs to <http://translationproject.org/team/>
## 
## COPYRIGHT
##        Copyright  ©  2017  Free Software Foundation, Inc.  License GPLv3+: GNU
##        GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
##        This is free software: you are free  to  change  and  redistribute  it.
##        There is NO WARRANTY, to the extent permitted by law.
## 
## SEE ALSO
##        tail(1)
## 
##        Full documentation at: <http://www.gnu.org/software/coreutils/head>
##        or available locally via: info '(coreutils) head invocation'
## 
## GNU coreutils 8.28               January 2018                          HEAD(1)

man automatically invokes less, so you may need to press spacebar to page through the information and :q to quit.

The one-line description under NAME tells you briefly what the command does, and the summary under SYNOPSIS shows how to call it. Anything that is optional is shown in square brackets [...], either/or alternatives are separated by |, and things that can be repeated are followed by ..., so the synopsis head [OPTION]... [FILE]... is telling you that you can give head any number of options and any number of filenames. The options themselves, such as a line count with -n or a byte count with -c, are listed under DESCRIPTION.

11.2.7 Selecting columns from a file

head and tail let you select rows from a text file. If you want to select columns, you can use the command cut. It has several options (use man cut to explore them), but the most common is something like:

cut -f 2-5,8 -d , values.csv

which means “select columns 2 through 5 and column 8, using comma as the separator”. cut uses -f (meaning “fields”) to specify columns and -d (meaning “delimiter”) to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.

cut is a simple-minded command. In particular, it doesn’t understand quoted strings. If, for example, your file is:

Name,Age
"Johel,Ranjit",28
"Sharma,Rupinder",26

then:

cut -f 2 -d , everyone.csv

will produce:

Age
Ranjit"
Rupinder"

rather than everyone’s age, because it will think the comma between last and first names is a column separator.
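
The behaviour described above can be reproduced with a minimal sketch. It assumes a throwaway directory created with mktemp and rebuilds everyone.csv from the example; the extra call with -f 3 is an illustration added here, not part of the original text.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Recreate the example file from the text.
printf '%s\n' 'Name,Age' '"Johel,Ranjit",28' '"Sharma,Rupinder",26' \
    > "$workdir/everyone.csv"

# cut splits at every comma, including the one inside the quotes,
# so field 2 holds the first names rather than the ages.
second_field=$(cut -f 2 -d , "$workdir/everyone.csv")
echo "$second_field"

# In this particular file the ages end up in field 3. The header line
# has only two fields, so it produces an empty line.
third_field=$(cut -f 3 -d , "$workdir/everyone.csv")
echo "$third_field"

rm -r "$workdir"
```

For data with quoted fields, a tool that understands CSV quoting is a safer choice than cut.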

11.2.8 Repeating commands

One of the biggest advantages of using the shell is that it makes it easy for you to do things over again. If you run some commands, you can then press the up-arrow key to cycle back through them. You can also use the left and right arrow keys and the delete key to edit them. Pressing return will then run the modified command.

Even better, history will print a list of commands you have run recently. Each one is preceded by a serial number to make it easy to re-run particular commands: just type !55 to re-run the 55th command in your history (if you have that many). You can also re-run a command by typing an exclamation mark followed by the command’s name, such as !head or !cut, which will re-run the most recent use of that command.

11.2.9 Selecting lines containing specific values

head and tail select rows, cut selects columns, and grep selects lines according to what they contain. In its simplest form, grep takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text. For example, grep bicuspid seasonal/winter.csv prints lines from winter.csv that contain “bicuspid”.

grep can search for patterns as well; we will explore those in the next course. What’s more important right now is some of grep’s more common flags:

  • -c: print a count of matching lines rather than the lines themselves
  • -h: do not print the names of files when searching multiple files
  • -i: ignore case (e.g., treat “Regression” and “regression” as matches)
  • -l: print the names of files that contain matches, not the matches
  • -n: print line numbers for matching lines
  • -v: invert the match, i.e., only show lines that don’t match
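
A short sketch of some of these flags in action. The file spring.csv and its contents are hypothetical sample data in the Date,Tooth layout, created in a temporary directory just for this illustration.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical sample data in the Date,Tooth layout.
printf '%s\n' 'Date,Tooth' '2017-01-25,wisdom' '2017-02-19,canine' \
    '2017-02-24,canine' '2017-03-14,incisor' > "$workdir/spring.csv"

# -c: print how many lines match instead of the lines themselves.
canine_count=$(grep -c canine "$workdir/spring.csv")
echo "$canine_count"

# -n: prefix each matching line with its line number.
numbered=$(grep -n incisor "$workdir/spring.csv")
echo "$numbered"

# -i: ignore case, so TOOTH still matches the header line.
header=$(grep -i TOOTH "$workdir/spring.csv")
echo "$header"

# -v: invert the match, keeping every line except the header.
records=$(grep -v Tooth "$workdir/spring.csv")
echo "$records"

rm -r "$workdir"
```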

The SEE ALSO section of the manual page for cut refers to a command called paste that can be used to combine data files instead of cutting them up.

Read the manual page for paste, and then run paste to combine the autumn and winter data files in a single table using a comma as a separator. What’s wrong with the output from a data analysis point of view?

11.3 Combining tools

The real power of the Unix shell lies not in the individual commands, but in how easily they can be combined to do new things. This chapter will show you how to use this power to select the data you want, and introduce commands for sorting values and removing duplicates.

11.3.1 Storing a command’s output

All of the tools you have seen so far let you name input files. Most don’t have an option for naming an output file because they don’t need one. Instead, you can use redirection to save any command’s output anywhere you want. If you run this command:

head -n 5 seasonal/summer.csv

it prints the first 5 lines of the summer data on the screen. If you run this command instead:

head -n 5 seasonal/summer.csv > top.csv

nothing appears on the screen. Instead, head’s output is put in a new file called top.csv. You can take a look at that file’s contents using cat:

cat top.csv

The greater-than sign > tells the shell to redirect head’s output to a file. It isn’t part of the head command; instead, it works with every shell command that produces output.

11.3.2 How can I use a command’s output as an input?

Suppose you want to get lines from the middle of a file. More specifically, suppose you want to get lines 3-5 from one of our data files. You can start by using head to get the first 5 lines and redirect that to a file, and then use tail to select the last 3:

head -n 5 seasonal/winter.csv > top.csv
tail -n 3 top.csv

A quick check confirms that this is lines 3-5 of our original file, because it is the last 3 lines of the first 5.

11.3.3 What’s a better way to combine commands?

Using redirection to combine commands has two drawbacks:

  • It leaves a lot of intermediate files lying around (like top.csv).
  • The commands to produce your final result are scattered across several lines of history.

The shell provides another tool, called a pipe, that solves both of these problems at once. Once again, start by running head:

head -n 5 seasonal/summer.csv

Instead of sending head’s output to a file, add a vertical bar and the tail command without a filename:

head -n 5 seasonal/summer.csv | tail -n 3

The pipe symbol tells the shell to use the output of the command on the left as the input to the command on the right.
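
As a rough comparison of the two approaches, the sketch below extracts lines 3-5 once through an intermediate file and once through a pipe. The file sample.txt and its contents are hypothetical, created only for this illustration.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical seven-line file whose lines are easy to recognise.
printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' 'line 5' 'line 6' 'line 7' \
    > "$workdir/sample.txt"

# Redirection: two commands and an intermediate file.
head -n 5 "$workdir/sample.txt" > "$workdir/top.txt"
via_file=$(tail -n 3 "$workdir/top.txt")

# Pipe: one command, no intermediate file.
via_pipe=$(head -n 5 "$workdir/sample.txt" | tail -n 3)
echo "$via_pipe"

rm -r "$workdir"
```

Both variables end up holding the same three lines; the pipe version simply leaves nothing behind to clean up.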

11.3.4 How can I combine many commands?

You can chain any number of commands together. For example, this command:

cut -d , -f 1 seasonal/spring.csv | grep -v Date | head -n 10

will:

  • select the first column from the spring data;
  • remove the header line containing the word "Date"; and
  • select the first 10 lines of actual data.

11.3.5 How can I count the records in a file?

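The command wc (short for “word count”) prints the number of lines, words, and bytes in a file. You can make it print only one of these using -l, -w, or -c respectively. (Strictly speaking, -c counts bytes; use -m to count characters, which gives a different result when the text contains multi-byte characters.)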
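
A minimal sketch of wc on a small file. The file spring.csv and its contents are hypothetical; input is read through < so that wc prints only the number, and tr strips the padding that some versions of wc add.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical three-line file: one header and two records.
printf '%s\n' 'Date,Tooth' '2017-01-25,wisdom' '2017-02-19,canine' \
    > "$workdir/spring.csv"

# -l counts lines, -w counts whitespace-separated words, -c counts bytes.
line_count=$(wc -l < "$workdir/spring.csv" | tr -d ' ')
word_count=$(wc -w < "$workdir/spring.csv" | tr -d ' ')
byte_count=$(wc -c < "$workdir/spring.csv" | tr -d ' ')
echo "$line_count $word_count $byte_count"

# Counting records rather than lines: drop the header first.
record_count=$(grep -v Tooth "$workdir/spring.csv" | wc -l | tr -d ' ')
echo "$record_count"

rm -r "$workdir"
```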

11.3.6 How can I specify many files at once?

Most shell commands will work on multiple files if you give them multiple filenames. For example, you can get the first column from all of the seasonal data files at once like this:

cut -d , -f 1 seasonal/winter.csv seasonal/spring.csv seasonal/summer.csv seasonal/autumn.csv

But typing the names of many files over and over is a bad idea: it wastes time, and sooner or later you will either leave a file out or repeat a file’s name. To make your life better, the shell allows you to use wildcards to specify a list of files with a single expression. The most common wildcard is *, which means “match zero or more characters”. Using it, we can shorten the cut command above to this:

cut -d , -f 1 seasonal/*

or:

cut -d , -f 1 seasonal/*.csv

11.3.7 What other wildcards can I use?

The shell has other wildcards as well, though they are less commonly used:

  • ? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt.
  • [...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt.
  • {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt,*.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf. Do not put spaces after the commas: the shell would treat them as separators between words.
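
The sketch below tries the ? and [...] wildcards on a handful of empty placeholder files whose names follow the examples in the list; the files and the temporary directory are created only for this illustration. The last command shows what happens when nothing matches, which is an observation added here rather than something the text covers.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Empty placeholder files with the names used in the text, plus a PDF.
touch "$workdir/2016.txt" "$workdir/2017.txt" "$workdir/2018.txt" \
      "$workdir/2017-01.txt" "$workdir/report.pdf"

# ? matches exactly one character, so 2017-01.txt is excluded.
single_char=$(cd "$workdir" && echo 201?.txt)
echo "$single_char"

# [78] matches one character that is either 7 or 8.
bracket=$(cd "$workdir" && echo 201[78].txt)
echo "$bracket"

# By default, a pattern that matches nothing is passed through unchanged.
no_match=$(cd "$workdir" && echo *.xlsx)
echo "$no_match"

rm -r "$workdir"
```

Running echo with a wildcard is a cheap way to preview what a pattern expands to before handing it to a command that changes files.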

11.3.8 Sorting lines of text

As its name suggests, sort puts data in order. By default it does this in ascending alphabetical order, but the flags -n and -r can be used to sort numerically and reverse the order of its output, while -b tells it to ignore leading blanks and -f tells it to fold case (i.e., be case-insensitive). Pipelines often use grep to get rid of unwanted records and then sort to put the remaining records in order.
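
To illustrate the difference between the default order and -n, here is a small sketch on a hypothetical file of numbers. LC_ALL=C is set for the first call as an assumption of this example, so that the alphabetical order does not depend on the machine's locale.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical file with one number per line.
printf '%s\n' 10 9 100 2 > "$workdir/numbers.txt"

# Default: text is compared character by character, so "10" comes before "9".
alphabetical=$(LC_ALL=C sort "$workdir/numbers.txt")
echo "$alphabetical"

# -n: compare as numbers.
numeric=$(sort -n "$workdir/numbers.txt")
echo "$numeric"

# -n -r: numeric order, largest first.
reversed=$(sort -n -r "$workdir/numbers.txt")
echo "$reversed"

rm -r "$workdir"
```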

11.3.9 Removing duplicate lines

Another command that is often used with sort is uniq, whose job is to remove duplicated lines. More specifically, it removes adjacent duplicated lines. If a file contains:

2017-07-03
2017-07-03
2017-08-03
2017-08-03

then uniq will produce:

2017-07-03
2017-08-03

but if it contains:

2017-07-03
2017-08-03
2017-07-03
2017-08-03

then uniq will print all four lines. The reason is that uniq is built to work with very large files. In order to remove non-adjacent duplicates from a file, it would have to keep the whole file in memory (or at least, all the unique lines seen so far). By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.

Adding the -c flag (uniq -c) prefixes each line with the number of times it occurs.
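
A small sketch of why uniq is usually paired with sort, using the interleaved dates from the example above. The file name dates.txt and the temporary directory are assumptions made for this illustration.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# The interleaved example from the text: no two equal lines are adjacent.
printf '%s\n' 2017-07-03 2017-08-03 2017-07-03 2017-08-03 > "$workdir/dates.txt"

# uniq alone removes nothing, because the duplicates are not adjacent.
unsorted=$(uniq "$workdir/dates.txt")
echo "$unsorted"

# Sorting first brings equal lines together, so uniq can collapse them.
deduplicated=$(sort "$workdir/dates.txt" | uniq)
echo "$deduplicated"

# -c adds a count in front of each remaining line.
counted=$(sort "$workdir/dates.txt" | uniq -c)
echo "$counted"

rm -r "$workdir"
```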

11.3.10 How can I save the output of a pipe?

The shell lets us redirect the output of a sequence of piped commands:

cut -d , -f 2 seasonal/*.csv | grep -v Tooth > teeth-only.txt

However, > must appear at the end of the pipeline: if we try to use it in the middle, like this:

cut -d , -f 2 seasonal/*.csv > teeth-only.txt | grep -v Tooth

then all of the output from cut is written to teeth-only.txt, so there is nothing left for grep: it receives no input and prints nothing.

11.3.11 How can I stop a running program?

The commands and scripts that you have run so far have all executed quickly, but some tasks will take minutes, hours, or even days to complete. You may also mistakenly leave out a filename, so that a command sits waiting for keyboard input and appears to hang. If you decide that you don’t want a program to keep running, you can type Ctrl + C to end it. This is often written ^C in Unix documentation; note that the ‘c’ can be lower-case.

11.4 Batch processing

Most shell commands will process many files at once. This chapter will show you how to make your own pipelines do that. Along the way, you will see how the shell uses variables to store information.

11.4.1 How does the shell store information?

Like other programs, the shell stores information in variables. Some of these, called environment variables, are available all the time. Environment variables’ names are conventionally written in upper case, and a few of the more commonly-used ones are shown below.

Variable   Purpose                             Value
HOME       User’s home directory               /home/repl
PWD        Present working directory           Same as pwd command
SHELL      Which shell program is being used   /bin/bash
USER       User’s ID                           repl

To get a complete list (which is quite long), you can type set in the shell.

Use set and grep with a pipe to display the value of HISTFILESIZE, which determines how many old commands are stored in your command history. What is its value?

11.4.2 Printing a variable’s value

A simpler way to find a variable’s value is to use a command called echo, which prints its arguments. Typing

echo hello DataCamp!

prints

hello DataCamp!

If you try to use it to print a variable’s value like this:

echo USER

it will print the variable’s name, USER.

To get the variable’s value, you must put a dollar sign $ in front of it. Typing

echo $USER

prints

repl

This is true everywhere: to get the value of a variable called X, you must write $X. (This is so that the shell can tell whether you mean “a file named X” or “the value of a variable named X”.)
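
The difference between a name and its value can be checked with a short sketch. It uses PWD rather than USER, on the assumption that PWD is set in any shell session, and also shows the set and grep combination as another way to look a variable up.

```shell
# Without the dollar sign, echo just receives the literal word PWD.
literal=$(echo PWD)
echo "$literal"

# With the dollar sign, the shell substitutes the variable's value first.
value=$(echo "$PWD")
echo "$value"

# set lists every variable; grep keeps only the line for the one asked for.
set | grep '^PWD='
```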

11.4.3 How else does the shell store information?

The other kind of variable is called a shell variable, which is like a local variable in a programming language.

To create a shell variable, you simply assign a value to a name:

training=seasonal/summer.csv

without any spaces before or after the = sign. Once you have done this, you can check the variable’s value with:

echo $training

seasonal/summer.csv

11.4.4 How can I repeat a command many times?

Shell variables are also used in loops, which repeat commands many times. If we run this command:

for filetype in gif jpg png; do echo $filetype; done

it produces:

gif
jpg
png

Notice these things about the loop:

  • The structure is for ...variable... in ...list... ; do ...body... ; done
  • The list of things the loop is to process (in our case, the words gif, jpg, and png).
  • The variable that keeps track of which thing the loop is currently processing (in our case, filetype).
  • The body of the loop that does the processing (in our case, echo $filetype).

Notice that the body uses $filetype to get the variable’s value instead of just filetype, just like it does with any other shell variable. Also notice where the semi-colons go: the first one comes between the list and the keyword do, and the second comes between the body and the keyword done.

11.4.5 How can I repeat a command once for each file?

You can always type in the names of the files you want to process when writing the loop, but it’s usually better to use wildcards. Try running this loop in the console:

for filename in seasonal/*.csv; do echo $filename; done

It prints:

seasonal/autumn.csv
seasonal/spring.csv
seasonal/summer.csv
seasonal/winter.csv

because the shell expands seasonal/*.csv to be a list of four filenames before it runs the loop.
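
To see the expansion at work, the sketch below builds a hypothetical seasonal directory with four CSV files and one unrelated file, then runs the same loop. The directory, its contents, and notes.txt are made up for this illustration.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)
mkdir "$workdir/seasonal"

# Four hypothetical data files, each with just a header line.
for season in autumn spring summer winter; do
    printf 'Date,Tooth\n' > "$workdir/seasonal/$season.csv"
done

# A file that the *.csv pattern should skip.
touch "$workdir/seasonal/notes.txt"

# The shell replaces seasonal/*.csv with the matching names before
# the loop starts, so the body runs once per CSV file.
listing=$(
    cd "$workdir" || exit 1
    for filename in seasonal/*.csv; do echo "$filename"; done
)
echo "$listing"

rm -r "$workdir"
```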

11.4.6 How can I record the names of a set of files?

People often set a variable using a wildcard expression to record a list of filenames. For example, if you define datasets like this:

datasets=seasonal/*.csv

you can display the files’ names later using:

for filename in $datasets; do echo $filename; done

This saves typing and makes errors less likely.

If you run these two commands in your home directory, how many lines of output will they print?

files=seasonal/*.csv
for f in $files; do echo $f; done

11.4.7 A variable’s name versus its value

A common mistake is to forget to use $ before the name of a variable. When you do this, the shell uses the name you have typed rather than the value of that variable.

A more common mistake for experienced users is to mis-type the variable’s name. For example, if you define datasets like this:

datasets=seasonal/*.csv

and then type:

echo $datsets

the shell doesn’t print anything, because datsets (without the second “a”) isn’t defined.
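
A brief sketch of the misspelling case. The final check with ${datsets:?} is an addition that the text does not mention: it is standard parameter expansion that fails loudly when a variable is unset or empty, and it runs in a subshell here so the failure does not end the session.

```shell
# The pattern is stored as text; it is only expanded when used unquoted.
datasets=seasonal/*.csv

# Correct name: quoting shows the stored pattern exactly.
correct=$(echo "$datasets")
echo "$correct"

# Misspelled name: no such variable exists, so it expands to nothing.
misspelled=$(echo "$datsets")
echo "[$misspelled]"

# Optional guard: ${name:?} reports an error instead of silently
# expanding to an empty string.
if ( : "${datsets:?}" ) 2>/dev/null; then
    check=defined
else
    check=undefined
fi
echo "$check"
```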

If you were to run these two commands in your home directory, what output would be printed?

files=seasonal/*.csv
for f in files; do echo $f; done

11.4.8 How can I run many commands in a single loop?

Printing filenames is useful for debugging, but the real purpose of loops is to do things with multiple files. This loop prints the second line of each data file:

for file in seasonal/*.csv; do head -n 2 $file | tail -n 1; done

It has the same structure as the other loops you have already seen: all that’s different is that its body is a pipeline of two commands instead of a single command.

11.4.9 Why shouldn’t I use spaces in filenames?

It’s easy and sensible to give files multi-word names like July 2017.csv when you are using a graphical file explorer. However, this causes problems when you are working in the shell. For example, suppose you wanted to rename July 2017.csv to be 2017 July data.csv. You cannot type:

mv July 2017.csv 2017 July data.csv

because it looks to the shell as though you are trying to move four files called July, 2017.csv, 2017, and July (again) into a directory called data.csv. Instead, you have to quote the files’ names so that the shell treats each one as a single parameter:

mv 'July 2017.csv' '2017 July data.csv'
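
The rename can be tried safely with a sketch like the following, which works on an empty placeholder file in a temporary directory. The closing loop is an extra illustration of the same rule applied to variables.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/July 2017.csv"

# Unquoted, the shell would split the names into five separate arguments:
#   mv July 2017.csv 2017 July data.csv
# Quoted, each name is a single argument, so the rename works.
(cd "$workdir" && mv 'July 2017.csv' '2017 July data.csv')

remaining=$(ls "$workdir")
echo "$remaining"

# The same applies to variables: quote "$f" so the spaces are preserved.
for f in "$workdir"/*.csv; do
    line_count=$(wc -l < "$f" | tr -d ' ')
    echo "$line_count"
done

rm -r "$workdir"
```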

11.4.10 How can I do many things in a single loop?

The loops you have seen so far all have a single command or pipeline in their body, but a loop can contain any number of commands. To tell the shell where one ends and the next begins, you must separate them with semi-colons:

for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done

seasonal/autumn.csv
2017-01-05,canine
seasonal/spring.csv
2017-01-25,wisdom
seasonal/summer.csv
2017-01-11,canine
seasonal/winter.csv
2017-01-03,bicuspid

11.5 Creating new tools

History lets you repeat things with just a few keystrokes, and pipes let you combine existing commands to create new ones. In this chapter, you will see how to go one step further and create new commands of your own.

11.5.1 Editing a file

Unix has a bewildering variety of text editors. For this course, we will use a simple one called Nano. If you type nano filename, it will open filename for editing (or create it if it doesn’t already exist). You can move around with the arrow keys, delete characters using backspace, and do other operations with control-key combinations:

  • Ctrl + K: delete a line.
  • Ctrl + U: un-delete a line.
  • Ctrl + O: save the file ('O' stands for 'output').
  • Ctrl + X: exit the editor.

11.5.2 How can I record what I just did?

When you are doing a complex analysis, you will often want to keep a record of the commands you used. You can do this with the tools you have already seen:

  • Run history.
  • Pipe its output to tail -n 10 (or however many recent steps you want to save).
  • Redirect that to a file called something like figure-5.history.

This is better than writing things down in a lab notebook because it is guaranteed not to miss any steps. It also illustrates the central idea of the shell: simple tools that produce and consume lines of text can be combined in a wide variety of ways to solve a broad range of problems.

history | tail -n 3 > steps.txt

11.5.3 How can I save commands to re-run later?

You have been using the shell interactively so far. But since the commands you type in are just text, you can store them in files for the shell to run over and over again. To start exploring this powerful capability, put the following command in a file called headers.sh:

head -n 1 seasonal/*.csv

This command selects the first row from each of the CSV files in the seasonal directory. Once you have created this file, you can run it by typing:

bash headers.sh

This tells the shell (which is just a program called bash) to run the commands contained in the file headers.sh, which produces the same output as running the commands directly.

In short, bash runs the commands stored in a shell script file (*.sh).
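
A rough way to check that claim is to create headers.sh and compare its output with the command typed directly. The sketch assumes a temporary directory with two hypothetical data files, and that bash is installed under that name.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)
mkdir "$workdir/seasonal"

# Two hypothetical data files with a header and one record each.
printf 'Date,Tooth\n2017-01-05,canine\n' > "$workdir/seasonal/autumn.csv"
printf 'Date,Tooth\n2017-01-25,wisdom\n' > "$workdir/seasonal/spring.csv"

# Store the command in a file. The single quotes keep the * literal,
# so the wildcard is expanded later, when the script runs.
printf '%s\n' 'head -n 1 seasonal/*.csv' > "$workdir/headers.sh"

# Run the script, then the same command directly, from the same directory.
from_script=$(cd "$workdir" && bash headers.sh)
direct=$(cd "$workdir" && head -n 1 seasonal/*.csv)
echo "$from_script"

rm -r "$workdir"
```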

11.5.4 How can I re-use pipes?

A file full of shell commands is called a shell script, or sometimes just a “script” for short. Scripts don’t have to have names ending in .sh, but this lesson will use that convention to help you keep track of which files are scripts.

Scripts can also contain pipes. For example, if all-dates.sh contains this line:

cut -d , -f 1 seasonal/*.csv | grep -v Date | sort | uniq

then:

bash all-dates.sh > dates.out

will extract the unique dates from the seasonal data files and save them in dates.out.

11.5.5 How can I pass filenames to scripts?

This is like writing a function: you have to tell it what to work on.

A script that processes specific files is useful as a record of what you did, but one that allows you to process any files you want is more useful. To support this, you can use the special expression $@ (dollar sign immediately followed by at-sign) to mean “all of the command-line parameters given to the script”. For example, if unique-lines.sh contains this:

sort $@ | uniq

then when you run:

bash unique-lines.sh seasonal/summer.csv

the shell replaces $@ with seasonal/summer.csv and processes one file. If you run this:

bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv

it processes two data files, and so on.
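
The sketch below recreates unique-lines.sh and calls it with one file and then with two. The input files first.txt and second.txt are hypothetical, and the script quotes "$@", a small change from the version in the text so that filenames containing spaces are passed through intact.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Two hypothetical input files with overlapping lines.
printf '%s\n' b a b > "$workdir/first.txt"
printf '%s\n' c a > "$workdir/second.txt"

# The script body; "$@" stands for every filename given on the command line.
printf '%s\n' 'sort "$@" | uniq' > "$workdir/unique-lines.sh"

# One argument: only first.txt is processed.
one_file=$(bash "$workdir/unique-lines.sh" "$workdir/first.txt")
echo "$one_file"

# Two arguments: both files are sorted together before uniq runs.
two_files=$(bash "$workdir/unique-lines.sh" "$workdir/first.txt" "$workdir/second.txt")
echo "$two_files"

rm -r "$workdir"
```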

11.5.6 How can I process a single argument?

As well as $@, the shell lets you use $1, $2, and so on to refer to specific command-line parameters. You can use this to write commands that feel simpler or more natural than the shell’s. For example, you can create a script called column.sh that selects a single column from a CSV file when the user provides the filename as the first parameter and the column as the second:

cut -d , -f $2 $1

and then run it using:

bash column.sh seasonal/autumn.csv 1

Notice how the script uses the two parameters in reverse order.
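
A minimal sketch of column.sh in use, assuming a temporary directory and a hypothetical three-line autumn.csv. The parameters are quoted inside the script, which is an addition to the one-liner shown in the text.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical data file with two columns.
printf '%s\n' 'Date,Tooth' '2017-01-05,canine' '2017-01-17,wisdom' \
    > "$workdir/autumn.csv"

# $1 is the filename and $2 is the column number.
printf '%s\n' 'cut -d , -f "$2" "$1"' > "$workdir/column.sh"

first_column=$(bash "$workdir/column.sh" "$workdir/autumn.csv" 1)
echo "$first_column"

second_column=$(bash "$workdir/column.sh" "$workdir/autumn.csv" 2)
echo "$second_column"

rm -r "$workdir"
```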

11.5.7 How can one shell script do many things?

Our shell scripts so far have had a single command or pipe, but a script can contain many lines of commands. For example, you can create one that tells you how many records are in the shortest and longest of your data files, i.e., the range of your datasets’ lengths.
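
One possible shape for such a script is sketched below. The name range.sh, the three input files, and the choice of wc, grep, sort, and head are assumptions made for illustration rather than the only way to do it. It relies on wc labelling its summary line with the word total, so it should be called with at least two files whose paths do not contain that word.

```shell
# Throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Three hypothetical data files with 3, 5, and 7 lines.
printf '%s\n' a b c > "$workdir/short.csv"
printf '%s\n' a b c d e > "$workdir/medium.csv"
printf '%s\n' a b c d e f g > "$workdir/long.csv"

# Two pipelines in one script: the first prints the smallest line count,
# the second the largest. grep -v total drops the summary line from wc.
cat > "$workdir/range.sh" <<'EOF'
wc -l "$@" | grep -v total | sort -n | head -n 1
wc -l "$@" | grep -v total | sort -n -r | head -n 1
EOF

range=$(bash "$workdir/range.sh" "$workdir"/*.csv)
echo "$range"

rm -r "$workdir"
```

Each output line starts with a line count followed by the name of the file it belongs to.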

Note that in Nano, “copy and paste” is achieved by navigating to the line you want to copy, pressing CTRL + K to cut the line, then CTRL + U twice to paste two copies of it.

11.5.8 How can I write loops in a shell script?

Shell scripts can also contain loops. You can write them using semi-colons, or split them across lines without semi-colons to make them more readable:

# Print the first and last data records of each file.
for filename in $@
do
    head -n 2 $filename | tail -n 1
    tail -n 1 $filename
done

(You don’t have to indent the commands inside the loop, but doing so makes things clearer.)

The first line of this script is a comment to tell readers what the script does. Comments start with the # character and run to the end of the line. Your future self will thank you for adding brief explanations like the one shown here to every script you write.

11.5.9 What happens when I don’t provide filenames?

A common mistake in shell scripts (and interactive commands) is to put filenames in the wrong place. If you type:

tail -n 3

then since tail hasn’t been given any filenames, it waits to read input from your keyboard. This means that if you type:

head -n 5 | tail -n 3 somefile.txt

then tail goes ahead and prints the last three lines of somefile.txt, but head waits forever for keyboard input, since it wasn’t given a filename and there isn’t anything ahead of it in the pipeline.