[SciPy-User] peer review of scientific software
Sun Jun 2 13:00:54 CDT 2013
Thomas Kluyver wrote:
>'type of users' might have been a more accurate phrase, but it has an
>unfortunate negative ring that I wanted to avoid. There are a lot of people
>doing important data analysis in quite risky and hard-to-maintain ways.
>Using spreadsheets where some simple code might be more reliable is one
>symptom of that, and there have been a couple of major examples from
>economics where spreadsheet errors led to serious mistakes.
>The discussion is revolving roughly around whether and how we can push
>those users towards better tools and methods, like coding, version control
Thanks for the overview, Thomas. I read all the emails on the subject and will comment briefly, for the sake of participation, although the topic is huge.
I don't have experience with critical modelling, but I do data analysis with historical data and am learning as I go.
If we speak about errors, I think most of them, as taught in a numerical analysis course, come down to the human factor: not understanding data types, and the variety of data sources representing the same data differently. A trivial example: SQL and netCDF databases represent the same data in different formats, and similarly for other data sources, which may in turn be just plain-text dumps. If that is handled correctly and the user is familiar with the tool in use, there shouldn't be any surprises.
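A minimal sketch of that point, with hypothetical table and column names: the same data held in an SQL database and in a plain-text dump should agree, and a small validation step catches format and type surprises early (CSV stores everything as text, so values must be converted before comparing).

```python
import csv
import io
import sqlite3

# In-memory SQLite table standing in for the SQL source (names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("A", 1.5), ("B", 2.25)])

# A plain-text (CSV) dump that is supposed to hold the same data.
dump = "station,value\nA,1.5\nB,2.25\n"

sql_rows = {(s, v) for s, v in conn.execute("SELECT station, value FROM readings")}
csv_rows = {(row["station"], float(row["value"]))   # convert text back to float
            for row in csv.DictReader(io.StringIO(dump))}

# The validation step: both representations must agree exactly.
assert sql_rows == csv_rows, "sources disagree -- check types/format"
```

The same idea scales up: whatever the formats involved, a cheap cross-check between two representations of the same data is the part that prevents silent surprises.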
If it is of any interest, I thought I'd generalize my usual workflow as a single-user example (I hope it's not useless):
- collecting data: if the data is not directly available I use Python and, depending on the source, do validation. I don't change the format if it's not necessary.
- pre-processing: if I preprocess (usually with Python), I store the data in an SQL server.
- using data: a single set or multiple datasets in PowerPivot (limited only by the amount of RAM), where DAX allows calculations on pivoted-view values. I haven't yet found another tool that allows such diverse views in so little time.
- post-processing: when needed I export results to CSV, usually just to load into a numpy array and plot with Matplotlib, or for 3D viewing in VisIt or Gephi.
- versioning: the data in the source database(s) stays intact, and all calculations can be saved to a file (with values) and opened again even if the data source is not available.
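The post-processing step above can be sketched like this; the file contents and column names are hypothetical stand-ins for an exported results file:

```python
import io
import numpy as np

# Stand-in for an exported CSV of results (in practice: open("results.csv")).
exported = io.StringIO("time,value\n0,1.0\n1,2.5\n2,4.0\n")

# Skip the header row; each remaining row becomes a (time, value) pair.
data = np.loadtxt(exported, delimiter=",", skiprows=1)

times, values = data[:, 0], data[:, 1]

# The arrays are now ready for e.g. matplotlib.pyplot.plot(times, values)
# or any further numpy computation.
print(values.mean())  # -> 2.5
```

Keeping the CSV export as the boundary between tools means each stage (spreadsheet, numpy, plotting) stays independent, which fits the versioning point: the source data is never touched.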
So I use Excel mainly for data manipulation, with Python back and forth, plus additional tools for 3D visualization.
I never liked learning version-control systems, and I'm happy with my current scheme.