Why MobiSurvStd?
Consider the following example: you want to compare the average travel time over a day, between men and women, for various territories / years.
EMC² Survey
With the EMC² standard, you can write a single code that can be run on any EMC² survey (with only minor changes to filenames). For example, for the survey of Brest 2018 and using the polars Python library, the code would look like:
# === Read average travel time by gender for EMC2 surveys. ===
import polars as pl
# Read the person (personne) file with `;` as value separator.
pers = pl.read_csv(
"Csv/Fichiers_Standard/brest_2018_std_pers.csv",
separator=";",
)
# Read the trip (déplacements) file with `;` as value separator.
depl = pl.read_csv(
"Csv/Fichiers_Standard/brest_2018_std_depl.csv",
separator=";",
)
print(
depl
# Join the two DataFrames using the 4 index columns.
# (Notice the different column name between the two DataFrames.)
.join(
pers,
left_on=["DMET", "ZFD", "ECH", "PER"],
right_on=["PMET", "ZFP", "ECH", "PER"],
)
# Group trips by gender of the person (column "P2").
.group_by("P2")
.agg(
# Compute the average travel time (column "D9"), weighted by sample
# weight (column "COEP").
(pl.col("D9") * pl.col("COEP")).sum()
/ pl.col("COEP").sum()
)
.sort("P2")
)
The code would print (1 is for men, 2 is for women):
shape: (2, 2)
┌─────┬───────────┐
│ P2 ┆ D9 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═══════════╡
│ 1 ┆ 18.145294 │
│ 2 ┆ 15.915749 │
└─────┴───────────┘
EGT2020 Survey
Now, assume that you want to compare with the values for Île-de-France. You would need to use the EGT2020 (or EGT2010) format and write a different but similar code:
# === Read average travel time by gender for EGT2020 survey. ===
import polars as pl
# Read the person (individu) file with `;` as value separator, proper encoding
# and infer_schema_length=10000 so that variable dtypes are correctly read.
pers = pl.read_csv(
"Csv/b_individu_egt1820.csv",
separator=";",
encoding="latin1",
infer_schema_length=10000,
)
# Read the trip (déplacements) file with `;` as value separator and proper
# encoding.
depl = pl.read_csv(
"Csv/c_deplacement_egt1820.csv",
separator=";",
encoding="latin1",
)
print(
depl
# Join the two DataFrames using the 2 index columns.
.join(pers, on=["IDCEREMA", "NP"])
# Group trips by gender of the person (column "SEXE").
.group_by("SEXE")
.agg(
# Compute the average travel time (column "DUREE"), weighted by sample
# weight (column "POIDSI").
(pl.col("DUREE") * pl.col("POIDSI")).sum()
/ pl.col("POIDSI").sum()
)
.sort("SEXE")
)
The code would print (1 is for men, 2 is for women):
shape: (2, 2)
┌──────┬───────────┐
│ SEXE ┆ DUREE │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞══════╪═══════════╡
│ 1 ┆ 26.753822 │
│ 2 ┆ 23.716359 │
└──────┴───────────┘
MobiSurvStd
Instead of having to write 2 (or more) different codes, MobiSurvStd allows you to write a single cleaner code that can run on many travel surveys. With MobiSurvStd, the code to compute average travel time by gender would be:
# === Read average travel time by gender with MobiSurvStd. ===
import polars as pl
# Read the persons parquet file.
pers = pl.read_parquet("persons.parquet")
# Read the trips parquet file.
trips = pl.read_parquet("trips.parquet")
print(
trips
# Join the two DataFrames using the "person_id" column.
.join(pers, on="person_id")
# Group trips by gender of the person (column "woman").
.group_by("woman")
.agg(
# Compute the average travel time (column "travel_time"), weighted by
# sample weight (column "sample_weight_surveyed").
(pl.col("travel_time") * pl.col("sample_weight_surveyed")).sum()
/ pl.col("sample_weight_surveyed").sum()
)
.sort("woman")
)
The code would print (for the Brest 2018 EMC²):
┌───────┬─────────────┐
│ woman ┆ travel_time │
│ --- ┆ --- │
│ bool ┆ f64 │
╞═══════╪═════════════╡
│ false ┆ 18.145294 │
│ true ┆ 15.915749 │
└───────┴─────────────┘