If you are reading this blog you have likely heard about Big Data. By my definition it is not only about the quantity of data, but more about its structure. I would call it “Messy Data”, as the data is not perfectly structured. Before analyzing such data you need to load it into your analysis software and structure it for analysis.

When it comes to data analysis there are three competing languages: Matlab, R and Python. The choice largely depends on the task at hand. In this short Matlab vs. R tutorial let me compare R to Matlab when it comes to dealing with mixed data (aka messy, big data). What I call mixed data is data in which different data types refer to the same observation. For example, take an observation of the company “Apple Inc.” on date 2015-7-24. Information about the company could be its stock price, volume, and ticker. As you can see, company and ticker are of non-numeric type, date is of date type, and price and volume are of numeric type. How do Matlab and R accommodate such mixed types?

**Data types**

*R way*

R has the data type “data.frame”. It allows easy storage and manipulation of mixed-type data: each column has its own type, and each row is one observation.
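As a minimal sketch of what this looks like, the observation described above can be stored as follows. The column names and all values here are made up for illustration and are not taken from the AAPL.csv file used later.

```r
# Hypothetical values, for illustration only (not taken from AAPL.csv)
obs <- data.frame(
  Date    = as.Date(c("2015-07-23", "2015-07-24")),
  Company = c("Apple Inc.", "Apple Inc."),
  Price   = c(125.0, 124.5),
  Volume  = c(50000000, 42000000),
  Ticker  = factor(c("AAPL", "AAPL")),
  stringsAsFactors = FALSE  # keep Company as character in older R versions
)

# One type per column, all stored in a single variable
column_types <- sapply(obs, class)
print(column_types)

# Filtering rows keeps every column of an observation together
latest <- obs[obs$Date == as.Date("2015-07-24"), ]
print(latest)
```

Because rows are selected as a whole, the price in `latest` always belongs to the same observation as its date and ticker.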

*Matlab way*

Do you know what “Mat” in Matlab stands for? Contrary to popular belief it is not mathematics, but matrix. It is MATrix LABoratory. The language is intended for matrix manipulation, i.e. numeric data. Until Matlab version R2013b it was not possible to store mixed data in one variable that is also easy to manipulate. There is the structure type, however each field in a structure is updated independently, and therefore element i in one field might not refer to the same observation as element i in another field. My guess is that the Matlab developers decided to catch up with the capabilities of R by adding the type “table”.

**Loading data**

*R way*

```r
AAPL <- read.csv(file="http://jbrazys.com/wp-content/uploads/data/AAPL.csv",
                 colClasses = c("Date","character","numeric","numeric","factor"))
```

Show names of columns:

```r
names(AAPL)
```
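To see what the `colClasses` argument does without depending on the download, here is a small sketch that reads a CSV from an inline string. The column names and values are assumptions made for illustration; the real file may differ.

```r
# Hypothetical CSV content; the header names are assumptions, not the real file
csv_text <- "Date,Company,Price,Volume,Ticker
2015/07/23,Apple Inc.,125.0,50000000,AAPL
2015/07/24,Apple Inc.,124.5,42000000,AAPL"

# colClasses assigns one type per column, in file order
toy <- read.csv(text = csv_text,
                colClasses = c("Date", "character", "numeric", "numeric", "factor"))

toy_names <- names(toy)          # column names taken from the header row
toy_types <- sapply(toy, class)  # resulting type of each column
print(toy_names)
print(toy_types)
```

The dates arrive as proper `Date` values rather than text, so they can be sorted and compared directly.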

*Matlab way*

It is not possible to read the file directly from a URL, so you need to download it first.

```matlab
urlwrite('http://jbrazys.com/wp-content/uploads/data/AAPL.csv','AAPL.csv')
AAPL = readtable('AAPL.csv','Format','%{yyyy/MM/dd}D%s%f%f%C');
```

Show names of columns:

```matlab
AAPL.Properties.VariableNames
```

Note that reading the dates as dates (%{yyyy/MM/dd}D) is only possible from version R2014b onwards. %C stands for categorical data, %s for a string, and %f for a floating point number.

**Ordering data**

*R way*

```r
AAPL <- AAPL[order(AAPL$Date),]
```

*Matlab way*

```matlab
AAPL = sortrows(AAPL,'Date','ascend');
```

**Showing data**

*R way*

```r
# show first 3 rows
AAPL[1:3,]

# show first 3 observations of Price
AAPL$Price[1:3]
```

*Matlab way*

Show first 3 rows:

```matlab
AAPL(1:3,:)
```

Show first 3 observations of Price:

```matlab
AAPL.Price(1:3)
```

**Manipulation of data**

A common task is computing some sort of transformation of cross-sectional data. For example, compute the time series of the cross-sectional standard deviation of price and of total volume. Since we have no cross-section yet, we first need to load data that covers more than one stock.

*R way*

```r
# load data
stock_data <- read.csv(file="http://jbrazys.com/wp-content/uploads/data/stock_data.csv",
                       colClasses = c("Date","character","numeric","numeric","factor"))

library(plyr) # package for data.frame manipulation

summarized_data <- ddply(stock_data, .(Date), summarize,
                         stdev_price=sd(Price), total_volume=sum(Volume))
```

What ddply does is split the data into groups and compute the specified transformation for each group. In this case, for each unique Date in stock_data it creates the variables *stdev_price* and *total_volume.* The function can be __any__ R base or user-defined function.
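To make the split, apply and combine steps concrete, here is a rough base R sketch of the same idea on a made-up data set. This is not how plyr is implemented internally; the tickers and numbers are hypothetical.

```r
# Hypothetical cross-section: two stocks observed on two dates
stock_toy <- data.frame(
  Date   = as.Date(c("2015-07-23", "2015-07-23", "2015-07-24", "2015-07-24")),
  Ticker = c("AAA", "BBB", "AAA", "BBB"),
  Price  = c(10, 14, 11, 17),
  Volume = c(100, 200, 150, 250)
)

# 1. Split: one small data.frame per unique date
groups <- split(stock_toy, stock_toy$Date)

# 2. Apply: compute a different transformation for each column within each group
per_date <- lapply(groups, function(g) {
  data.frame(Date         = g$Date[1],
             stdev_price  = sd(g$Price),
             total_volume = sum(g$Volume))
})

# 3. Combine: stack the one-row results back into a single data.frame
summarized_toy <- do.call(rbind, per_date)
rownames(summarized_toy) <- NULL
print(summarized_toy)
```

The result has one row per date, which is the shape the `ddply` call produces for the real data.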

*Matlab way*

```matlab
urlwrite('http://jbrazys.com/wp-content/uploads/data/stock_data.csv','stock_data.csv')
stock_data = readtable('stock_data.csv','Format','%{yyyy/MM/dd}D%s%f%f%C');
```

This is a nice data table that is similar to a data.frame in R. However, the nice format of the table is not fully compatible with Matlab functions. For example, to compute grouped statistics we can run the function varfun(), however the grouping variable must be categorical, numerical, logical or string. Therefore, if we would like to group by date, it will refuse to work. So we need a workaround: let's read the Date column as a string this time.

```matlab
stock_data = readtable('stock_data.csv','Format','%s%s%f%f%C');
```

Another quirk of varfun() is that it is currently not possible to apply a different transformation to each column: varfun applies the same function to all selected columns. Therefore the only way is to resort to a multi-step procedure.

Compute standard deviation of price:

```matlab
summarized_data1 = varfun(@std, stock_data, ...
    'InputVariables', 'Price',...
    'GroupingVariables','Date');
```

Compute sum of Volume:

```matlab
summarized_data2 = varfun(@sum, stock_data, ...
    'InputVariables', 'Volume',...
    'GroupingVariables','Date');
```

Join the data and keep only the columns that we asked for:

```matlab
summarized_data = join(summarized_data1, summarized_data2,'Keys', 'Date')
summarized_data(:,[1 3 5])
```

Summarizing: Matlab is catching up with the functionality of R, however some features are still clumsy. For this kind of grouped analysis Matlab requires multiple steps, whereas R can do the same in one step. Creating multiple temporary variables (or files) makes the code and the analysis environment unnecessarily crowded.

Although in this post I do not discuss the issues that come with the size of big data, R has packages for large data sets (for example data.table) that are designed to handle both mixed data types and a large number of observations without running out of memory.

Rogier Potter van Loon: Have you worked with MATLAB’s “datastore” yet? I got to test it with the new R2015a version at ESE and find it is a huge leap in working with big/messy data.

Justinas Brazys (post author): Thanks for the suggestion! I have not tried the Matlab datastore object yet. However, at a glance datastore is very similar to datatable, with improved handling of large data sets. Are the issues of the function varfun() eliminated? Namely, grouping by date and creating columns with different transformations?

Rogier Potter van Loon: I’m not familiar with datatable or varfun(), unfortunately, but I do know that datastore allows for very easy handling of the different column types (date, int8, boolean, string, etc.).