How to combine multiple data sources?

Question

I have two data sets. Data set A: data about lifestyles, health conditions and SOCIOECONOMIC and Data set B: data about travel behavior and also SOCIOECONOMIC. We don't know whether they are the same respondents (there is no ID or similar information to match respondents), but we do know these data sets are made from two different surveys. Since both data sets have SOCIOECONOMIC variables, how can I combine these two data sets? Is there any statistical tool to do this? Is there any software that has tools to do this? Please, if applicable, suggest some references. Thanks.

William L. · Accepted Answer

Before I answer your question, I'm going to start by warning that using SOCIOECONOMIC status (SES) as your column for matching people runs a high risk of incorrect matches as the size of your data increases. The reason for this is that no factor excludes two individuals from reporting the same SOCIOECONOMIC score.

That said, if you're intend on combining based on this term, what you need is an INNER JOIN. This is coded differently depending on which language you are using, but the basic function creates a new dataset by taking your first dataset and looking for where it has matches in your specified id column (SES) in the second dataset. An inner join will only include rows which match in both of your datasets. If you want to make sure that you don't lose rows that have no match between your datasets use a FULL JOIN. I won't go into FULL JOINs here as it sounds like you're using matched data.

Here are some examples of how to code this in different languages.

SQL:

SELECT *

FROM <insert the name of your first dataset here>

INNER JOIN <insert the name of your second dataset here>

ON <dataset1>.<matching column> = <dataset2>.<matching column>;

Example: SELECT * FROM Table1 INNER JOIN Table2 ON Table1.Col1 = Table2.Col2;

R: I recommend using dplyr because it's a streamlined way to manipulate data

library(dplyr)

<new dataset name> <- <name of first dataset> %>%

inner_join(<name of second dataset>, by = c("<column from first dataset>" = "<column from second

dataset>"))

Example: Table 3 <- Table 1 %>% inner_join( Table 2, by = c("Col1" = "Col2"))

Python:

import pandas as pd

pd.merge(<first dataset name>, <second dataset name>, left_on='<first datasets matching column>',

right_on='<second dataset's matching column>')

Example: pd.merge(Table1, Table2, left_on='Col1', right_on='Col2')

How to combine multiple data sources?

1 Expert Answer

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

How to assign seats to the passengers maintaining social distancing?

Write a greedy predictor program for multivariate regression.

Does management generally get heavily involved with the technical details of a linear programming study?

Should I choose Computer Science or Data Science ?

What all do you have to become proficient at to learn just predictive analytics?

RECOMMENDED TUTORS

IXL

Rosetta Stone

Education.com

TPT

Vocabulary.com

ABCya

SpanishDictionary.com

Inglés.com

Emmersion

How to combine multiple data sources?

1 Expert Answer

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

How to assign seats to the passengers maintaining social distancing?

Write a greedy predictor program for multivariate regression.

Does management generally get heavily involved with the technical details of a linear programming study?

Should I choose Computer Science or Data Science ?

What all do you have to become proficient at to learn just predictive analytics?

RECOMMENDED TUTORS

find an online tutor