refactoring, seminar technique, search by modeling

In our parallel working session this morning (where collaborators gather in my office to work together), Montet, Bedell, and I worked out a refactor of the RV code they have been working on, in order to make it more efficient and easier to maintain. It looked briefly like a big headache and challenge, but in the end the refactor got completely done today. Somehow it is brutal to contemplate a refactor, but in the end it is almost always a good idea (and much easier than expected). I'm one to talk: I don't write much code directly myself these days.

Sarah Pearson (Columbia) gave the NYU Astro Seminar today. It was an excellent talk on what we learn about the Milky Way from stellar streams. She did exactly the right thing of spending more than half of the talk on necessary context, before describing her own results. She got the level of this context just right for the audience, so by the time she was talking about what she has done (which involves chaos on the one hand, and perturbations from the bar on the other), it was comprehensible and relevant for everyone. I wish I could boil down “good talk structure” to some simple points, but I feel like it is very context-dependent. Of course one thing that's great about the NYU Astro Seminar is that we are an interactive audience, so the speaker knows where the audience is.

After lunch I had a great and too-short discussion with Robyn Sanderson (Caltech), continuing ideas that came up on Wednesday about the search for halo substructure. We discussed the point that when you transform the data into something like action space (or indeed apply any non-linear transformation to the data), the measurement uncertainties become crazy and almost impossible to marginalize or even visualize, let alone account for properly in a scientific analysis. So then we discussed whether we could instead search for substructure by transforming orbits into the data space and associating data points with orbits, in the space where the data uncertainties are simple. As Sanderson pointed out, that's Schwarzschild modeling. It might be a great idea for substructure search.
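
The problem with non-linear transformations can be seen in a tiny Monte Carlo. This is a toy sketch with made-up numbers (not any real Gaia pipeline or our actual project): a symmetric Gaussian uncertainty on a parallax stays simple in the data space, but after even the mildest non-linear map (here, tangential velocity) the propagated distribution is skewed and hard to summarize; actions are far worse.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up star: parallax 1.0 +/- 0.2 mas, proper motion 10.0 +/- 0.1 mas/yr.
plx = rng.normal(1.0, 0.2, size=10000)
pm = rng.normal(10.0, 0.1, size=10000)

# Non-linear map: tangential velocity in km/s (4.74 converts mas/yr / mas).
vt = 4.74 * pm / plx

def skewness(x):
    """Sample skewness: near zero for a Gaussian."""
    return np.mean((x - x.mean())**3) / np.std(x)**3

# The data-space uncertainty is symmetric; the transformed one is not:
print(skewness(plx))   # ~0: still Gaussian
print(skewness(vt))    # strongly positive: the "ugly noodle"
```

The same exercise with a full action computation produces distributions that are not even unimodal, which is the argument for comparing models to data in the observable space.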


theory of anomalies

Today was a low-research day, because [reality]. However, Kate Storey-Fisher (NYU) and I had a great discussion with Josh Ruderman (NYU) about anomalies in the LSS. As my loyal reader knows, we are looking at constructing a statistically valid, safe search for deviations from the cosmological model in the large-scale structure. That search is going to focus on the overlap (if there is any overlap) between anomalies that are robust to systematic problems with the data (that is, anomalies that can't be mocked by reasonable adjustments to our beliefs about our selection function) and anomalies that live in spaces suggested or predicted by theoretical ideas about non-standard cosmological theories. In particular, we are imagining theories in which the dark sector does interesting things at late times. We didn't make concrete plans in this meeting, except to read through the literatures on late decays of the dark matter, dark radiation, and other kinds of dark–dark interactions that could be happening in the current era.


actions or observables? forbidden planet radii

The highlight today of our Gaia DR2 prep meeting was a plenary argument (recall that this meeting is supposed to be parallel working, not plenary discussion, at least not mainly) about how to find halo substructure in the data. Belokurov (Cambridge) and Evans (Cambridge) showed some nice results of searching for substructure in something close to the raw data. We argued about the value of transforming to a space of invariants. The invariants are awesome, because clustering is long-lived and stark there. But the invariants are terrible because (a) computing them introduces unnecessarily wrong assumptions into the problem and (b) simple uncertainties in the data space become arbitrarily ugly noodles in the action space. We discussed whether there are intermediate approaches that get the good things about working in observables, without too many of the bad things of working in the actions. We didn't make specific plans, but many good ideas hit the board.

Stars group meeting contained too many results to describe them all! It was great, and busy. But the stand-out result for me (and this is just me!) was a beautiful result by Vincent Van Eylen (Leiden) on exoplanet radii. As my loyal reader knows, the most common kinds of planets are not Earths or Neptunes, but something in between, variously called super-Earths and mini-Neptunes. It turns out that even this class bifurcates, with a bimodal distribution—there really is a difference between super-Earths and mini-Neptunes, and little in between. Now Van Eylen shows that this gap really looks like it goes exactly to zero: There is a range of planet radii that really don't exist in the world. Note to reader: This effect probably depends on host star and many other things, but it is incredibly clear in this particular sample. Cool thing: The forbidden radii are a function of radius, and the forbidden zone was (loosely) predicted before it was observed. Just incredible. Van Eylen's super-power: Revision of asteroseismic stellar radii to get much more precision on stars and therefore on the transiting planets they host. What a result.


you never really understand a model until you implement it

Eilers (MPIA) and I discussed puzzling results she was getting, in which she could fit just about any data (including insanely random data) with the Gaussian Process latent variable model (GPLVM), but got no predictive power on new data. We realized that we were missing a term in the model: We need to constrain the latent variables with a prior (or regularization), otherwise the latent variables can go off to crazy corners of space and the data points have (effectively) nothing to do with one another. Whew! This all justifies a point we have been making for a while, which is that you never really understand a model until you implement it.
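
The missing term can be caricatured in a few lines. This is a deliberately trivial toy (an identity mapping, nothing like the real GPLVM code): give every datum its own latent variable, and without a prior the maximum-likelihood latents simply memorize the data, even pure noise; a Gaussian prior (L2 regularization) shrinks them and restores the coupling between data points.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=30)   # pure noise: there is nothing to learn

def neg_log_post(z, y, lam):
    """Unit-variance Gaussian likelihood plus (optional) Gaussian prior
    on the per-datum latents z; lam = 0 means no prior."""
    return 0.5 * np.sum((y - z)**2) + 0.5 * lam * np.sum(z**2)

# With no prior the optimum is z = y: a perfect "fit" to random numbers.
z_noprior = y.copy()
# With a unit Gaussian prior (lam = 1) the optimum shrinks: z = y / (1 + lam).
lam = 1.0
z_prior = y / (1.0 + lam)

chi2_noprior = np.sum((y - z_noprior)**2)
chi2_prior = np.sum((y - z_prior)**2)
print(chi2_noprior)  # exactly 0: memorization, hence no predictive power
print(chi2_prior)    # nonzero: the prior stops the latents running away
```

In the real GPLVM the mapping is flexible rather than trivial, which makes the memorization failure harder to spot, but the fix has the same form: a prior term on the latents in the objective.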


modeling the heck out of the atmosphere

The day started with planning between Bedell (Flatiron), Foreman-Mackey (Flatiron), and me about a possible tri-linear model for stellar spectra. The model is that the star has a spectrum, which is drawn from a subspace in spectral space and Doppler shifted, and the star is subject to telluric absorption, which is also drawn from a subspace in spectral space and Doppler shifted. The idea is to learn the telluric subspace using all the data ever taken with a spectrograph (HARPS, in this case). But of course the idea behind that is to account for the tellurics by fitting them simultaneously and thereby getting better radial velocities. This was all in preparation for Ben Montet (Chicago), who arrived later in the day for a two-week visit.
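
For concreteness, here is one way such a forward model might be structured. Everything below (function names, grids, the single-line demo spectrum) is invented for illustration; this is not the code we planned, just the multiplicative structure: a spectrum drawn from each subspace, each evaluated at its own Doppler shift.

```python
import numpy as np

C = 299792.458  # speed of light, km/s

def shifted_spectrum(wave_obs, rest_wave, basis, coeffs, rv):
    """Evaluate a spectrum drawn from a linear subspace (rows of `basis`,
    defined on `rest_wave`, weighted by `coeffs`) at observed wavelengths
    Doppler shifted by rv (km/s): lambda_obs = lambda_rest * (1 + rv / C)."""
    template = coeffs @ basis
    return np.interp(wave_obs / (1.0 + rv / C), rest_wave, template)

def trilinear_model(wave_obs, rest_wave, star_basis, a, tell_basis, b,
                    rv_star, rv_tell):
    """Observed flux = (stellar spectrum from one subspace, at the stellar
    RV) times (telluric spectrum from another subspace, at its own shift)."""
    star = shifted_spectrum(wave_obs, rest_wave, star_basis, a, rv_star)
    tell = shifted_spectrum(wave_obs, rest_wave, tell_basis, b, rv_tell)
    return star * tell

# Tiny demo: one basis spectrum each, zero velocities, so the model is
# just the stellar line profile times a flat telluric spectrum.
rest_wave = np.linspace(5000.0, 5010.0, 1001)
line = 1.0 - 0.5 * np.exp(-0.5 * ((rest_wave - 5005.0) / 0.1)**2)
model = trilinear_model(rest_wave, rest_wave, line[None, :], np.array([1.0]),
                        np.ones((1, rest_wave.size)), np.array([1.0]),
                        0.0, 0.0)
```

The "tri-linear" part is that the model is linear in the stellar coefficients, linear in the telluric coefficients, and (locally) linear in the small velocity shifts, which makes the giant all-the-data-ever fit tractable.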

At lunch time, Mike Blanton (NYU) gave the CCPP brown-bag talk about SDSS-V. He did a nice job of explaining how you measure the composition of ionized gas by looking at its thermal state. And much more!


detailed abundances of pairs; coherent red-giant modes

In the morning I sat in on a meeting of the GALAH team, who are preparing for a data release to precede Gaia DR2. In that meeting, Jeffrey Simpson (USyd) showed me GALAH results on the Oh et al comoving pairs of stars. He finds that pairs from the Oh sample that are confirmed to have the same radial velocity (and are therefore likely to be truly comoving) have similar detailed element abundances, and the ones that aren't, don't. So awesome! But interestingly he doesn't find that the non-confirmed pairs are as different as randomly chosen stars from the sample. That's interesting, and suggests that we should make (or should have made) a carefully constructed null sample for A/B testing etc. Definitely for Gaia DR2!

In the afternoon, I joined the USyd asteroseismology group meeting. We discussed classification of seismic spectra using neural networks (I advised against) or kernel SVM (I advised in favor). We also discussed using very narrow (think: coherent) modes in red-giant stars to find binaries. This is like what my host Simon Murphy (USyd) does for delta-Scuti stars, but we would not have enough data to phase up little chunks of spectrum: We would have to do one huge simultaneous fit. I love that idea, infinitely! I asked them to give me a KIC number.

I gave two talks today, making it six talks (every one very different) in five days! I spoke about the pros and cons of machine learning (or what is portrayed as machine learning on TV) as my final Hunstead Lecture at the University of Sydney. I ended up being very negative on neural networks in comparison to Gaussian processes, at least for astrophysics applications. In my second talk, I spoke about de-noising Gaia data at Macquarie University. I got great crowds and good feedback at both places. It's been an exhausting but absolutely excellent week.


mixture of factor analyzers; centroiding stars

On this, day four of my Hunstead Lectures, Andy Casey (Monash) came into town, which was absolutely great. We talked about many things, including the mixture-of-factor-analyzers model, which is a good and under-used model in astrophysics. I think (if I remember correctly) that it can be generalized to heteroskedastic and missing data too. We also talked about using machine learning to interpolate models, and future projects with The Cannon.
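
The heart of the model, if I have it right, is a Gaussian with low-rank-plus-diagonal covariance per mixture component. A minimal numpy sketch (my notation, not Casey's code) of the component density, which also shows why heteroskedastic data are easy: each star's measurement variances can simply be added to the diagonal.

```python
import numpy as np

def fa_loglike(X, mu, W, psi):
    """Per-star log-likelihood under one factor analyzer:
    x ~ N(mu, W W^T + diag(psi)), with W the (D, K) factor loadings.
    For heteroskedastic data, add each star's measurement variances
    to psi before calling."""
    D = X.shape[1]
    C = W @ W.T + np.diag(psi)          # low-rank + diagonal covariance
    _, logdet = np.linalg.slogdet(C)
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(C), diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def mfa_loglike(X, weights, mus, Ws, psis):
    """Mixture of factor analyzers: log-sum-exp over components."""
    comps = np.stack([np.log(w) + fa_loglike(X, mu, W, psi)
                      for w, mu, W, psi in zip(weights, mus, Ws, psis)])
    m = comps.max(axis=0)
    return m + np.log(np.sum(np.exp(comps - m), axis=0))
```

Fitting the weights, means, loadings, and diagonals is then an EM loop; with per-star variances folded into psi (and, with a little bookkeeping, marginalization over missing dimensions) the same likelihood handles realistic catalogs.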

At lunch I sat with Peter Tuthill (Sydney) and Kieran Larkin (Sydney) who are working on a project design that would permit measurement of the separation between two (nearby) stars to better than one millionth of a pixel. It's a great project; the designs they are thinking about involve making a very large, but very finely featured point-spread function, so that hundreds or thousands of pixels are importantly involved in the positional measurements. We discussed various directions of optimization.
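
The information argument behind that design can be made quantitative with the Cramér–Rao bound. In a toy 1-D sketch (invented PSFs and numbers, not their actual optical design), per-pixel Gaussian noise gives Fisher information sum_p (dm_p/dx0)^2 / sigma^2 for the centroid x0, and a broad PSF with fine fringes beats a broad smooth one by a large factor, because many pixels then carry steep gradients.

```python
import numpy as np

def centroid_crlb(psf, x0, flux, sigma_pix, pixels, eps=1e-4):
    """Cramér-Rao lower bound on a 1-D centroid: Fisher information is
    sum_p (d m_p / d x0)^2 / sigma^2 for model m_p = flux * psf(p - x0);
    the derivative is taken numerically for generality."""
    dm = flux * (psf(pixels - (x0 + eps)) - psf(pixels - (x0 - eps))) / (2 * eps)
    return 1.0 / np.sqrt(np.sum(dm**2) / sigma_pix**2)

pixels = np.arange(-30.0, 31.0)

def broad_smooth(x):
    """Wide, featureless Gaussian PSF (sigma = 5 pixels)."""
    return np.exp(-0.5 * (x / 5.0)**2) / (np.sqrt(2.0 * np.pi) * 5.0)

def broad_fringed(x):
    """Same broad envelope, modulated by fine fringes: similar total flux,
    but many steep gradients spread across the detector."""
    return broad_smooth(x) * (1.0 + np.cos(2.0 * x))

crlb_smooth = centroid_crlb(broad_smooth, 0.0, 1e5, 1.0, pixels)
crlb_fringed = centroid_crlb(broad_fringed, 0.0, 1e5, 1.0, pixels)
print(crlb_smooth, crlb_fringed)  # the fringed PSF is far more precise
```

At high flux the bound scales as 1/flux times 1/(typical gradient), which is why a large but finely featured PSF, involving hundreds or thousands of pixels, is the route to micro-pixel separations.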

My talk today was about The Cannon and the relationships between methods that are thought of as “machine learning” and the kinds of data analyses that I think will win in the long run.


MCMC, asteroseismology, delta-Scutis

Today I gave the third of my five talks in five days, as part of my Hunstead Lectures at Sydney. I spoke about MCMC sampling. A lot of what I said was a subset of things we write in our recent manual on MCMC. At the end of the talk there was some nice discussion of detailed balance, with contributions from Tuthill (USyd) and Sharma (USyd).

At lunch I grilled asteroseismology guru Tim Bedding (USyd) about measuring the large frequency difference delta-nu in a stellar light curve. My position is that you ought to be able to do this without explicitly taking a Fourier Transform, but rather as some kind of mathematical operation on the data. That is, I am guessing that there is a very good and clever frequentist estimator for it. Bedding expressed the view that there already is such a thing, in that there are methods for automatically generating delta-nu values. They do take a Fourier Transform under the hood, but they are nonetheless good frequentist estimators. But I want to work on sparser data, like Gaia and LSST light curves. I need to understand this all better. We also talked about how it is possible for a gastrophysics-y star to have oscillations with quality factors better than 10^5. Many stars do!
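
One standard automated route (a toy sketch with invented numbers, not Bedding's actual pipeline) is the autocorrelation of the power spectrum: a comb of modes spaced by delta-nu produces autocorrelation peaks at lags that are multiples of delta-nu, so the strongest non-zero-lag peak is a cheap frequentist estimator of the spacing.

```python
import numpy as np

# Fake light curve: nine modes spaced by delta_nu around nu_max, plus noise.
dt = 60.0                              # 1-minute cadence, in seconds
t = np.arange(0.0, 30 * 86400.0, dt)   # 30 days
delta_nu = 50.0e-6                     # Hz (invented value)
nu_max = 600.0e-6                      # Hz (invented value)
rng = np.random.default_rng(1)
flux = rng.normal(0.0, 0.5, t.size)
for k in range(-4, 5):
    flux += np.cos(2 * np.pi * (nu_max + k * delta_nu) * t
                   + rng.uniform(0, 2 * np.pi))

# Power spectrum; keep only the low-frequency end containing the modes.
power = np.abs(np.fft.rfft(flux))**2
dfreq = 1.0 / (t.size * dt)
power = power[:3000]

# Autocorrelation of the power spectrum; the peak away from zero lag
# sits at a lag of delta_nu.
p = power - power.mean()
acf = np.correlate(p, p, mode='full')[p.size - 1:]
lo, hi = int(0.5 * delta_nu / dfreq), int(1.5 * delta_nu / dfreq)
delta_nu_est = (lo + np.argmax(acf[lo:hi])) * dfreq
print(delta_nu_est * 1e6, "uHz")   # close to the input 50 uHz
```

On sparse, irregular Gaia- or LSST-like sampling the FFT step breaks down, which is exactly the regime where a cleverer estimator, operating directly on the time-domain data, would be needed.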

That's all highly relevant to the work of Simon Murphy (USyd), who finds binary stars by looking at phase drifts in highly coherent delta-Scuti star oscillations. He and I spent an afternoon hacking on models for one of his delta-Scuti stars, with the hopes of measuring the quality factor Q and also maybe exploring new and more information-preserving methods for finding the binary companions. This method of finding binaries has similar sensitivity to astrometric methods, which makes it very relevant to the binaries that Gaia will discover.


noise, calibration, and GALAH

Today I gave my second of five Hunstead Lectures at University of Sydney. It was about finding planets in the Kepler and K2 data, using our non-stationary Gaussian Process or linear model as a noise model. This is the model we wrote up in our Research Note of the AAS. In the question period, the question of confirmation or validation of planets came up. It is very real that the only way to validate most tiny planets is to make predictions for other data. But when will we have data more sensitive than Kepler? This is a significant problem for much of bleeding-edge astronomy.

Early in the morning I had a long call with Jason Wright (PSU) and Bedell (Flatiron) about the assessment of the calibration programs for extreme-precision RV surveys. My position is that it is possible to assess the end-to-end error budget in a data-driven way. That is, we can use ideas from causal inference to figure out what parts of the RV noise are coming from telescope plus instrument plus software. Wright didn't agree: He believes that large parts of the error budget can't be seen or calibrated. I guess we better start writing some kind of paper here.

In the afternoon I had a great discussion with Buder (MPIA), Sharma (USyd), and Bland-Hawthorn (USyd) about the current status of detailed elemental abundance measurements in GALAH. The element–element plots look fantastic, and clear trends and high precision are evident, just looking at the data. To extract these abundances, Buder has made a clever variant of The Cannon which makes use of the residuals away from a low-dimensional model to measure the detailed abundances. They are planning on doing a large data release in April.


five talks in five days

On the plane to Sydney, I started an outline for a paper with Bedell (Flatiron) on detailed elemental abundances, and the dimensionality or interpretability of the elemental subspace. I also started to plan the five talks I am going to give in five days as the Hunstead Lecturer. On arrival I went straight to University of Sydney and started lecturing. My first talk was on fitting a line to data, with a concentration on the assumptions and their role in setting procedures. That is, I emphasized that you shouldn't choose a procedure by which you fit your data: You should choose a set of assumptions you are willing to make about your data. Once you do that, the procedure will flow from the assumptions. After my talk I had a great lunch with graduate students at Sydney. The range of research around the table was remarkable. I plan to spend some of the week learning about asteroseismology.
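
To make the assumptions-first point concrete, here is a toy sketch (echoing the line-fitting lecture, not any particular code of mine): if you are willing to assume a linear model and known, independent Gaussian errors on y only, then the procedure is not chosen, it is derived; maximum likelihood becomes weighted least squares with a closed-form solution.

```python
import numpy as np

def fit_line(x, y, sigma_y):
    """Maximum-likelihood line fit under the stated assumptions:
    y = m x + b with known, independent Gaussian errors sigma_y on y only.
    Under exactly these assumptions, ML *is* weighted least squares."""
    A = np.vander(x, 2)                    # design matrix, columns [x, 1]
    w = 1.0 / sigma_y**2                   # inverse variances as weights
    ATA = A.T @ (w[:, None] * A)
    ATy = A.T @ (w * y)
    m, b = np.linalg.solve(ATA, ATy)
    return m, b

# Noiseless demo data, to show the closed form recovers the truth.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
m, b = fit_line(x, y, np.ones_like(x))
print(m, b)   # slope 2, intercept 1 on this noiseless demo
```

Change the assumptions (outliers, errors in x, unknown variances) and this procedure is no longer justified; the fix is a new likelihood, not a tweak to the old procedure.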


best-ever detailed abundances

In Friday parallel-working session, Bedell (Flatiron) showed me all 900-ish plots of every element against every element for her sample of 80 Solar twins. Incredible. Outrageous precision, and outrageous structure. And it is a beautiful case where you can just see the precision directly in the figures: There are clearly real features at very small scales. And hugely informative structures. This is the ideal data set for addressing something that has been interesting me for a while: What is the dimensionality of the chemical-abundance space? And can we see different nucleosynthetic processes directly in the data?
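
The dimensionality question has a cheap first-cut answer that I can sketch with a made-up simulation (not Bedell's data): generate abundances for many stars from a K-dimensional subspace plus per-measurement noise, and count the principal components that stand out above the noise floor.

```python
import numpy as np

rng = np.random.default_rng(3)
n_stars, n_elements, K_true = 300, 30, 3
processes = rng.normal(size=(K_true, n_elements))      # fake "yield" vectors
amplitudes = rng.normal(size=(n_stars, K_true))        # per-star amounts
noise = 0.01 * rng.normal(size=(n_stars, n_elements))  # 0.01-dex errors
X = amplitudes @ processes + noise

# Principal components: how many stand out above the measurement noise?
_, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
var = s**2 / n_stars
n_sig = int(np.sum(var > 10 * 0.01**2))   # threshold well above noise floor
print(n_sig)   # recovers K_true = 3
```

On real data the interesting part is everything this toy sweeps under the rug: the noise is heteroskedastic, the processes need not be orthogonal, and the physically meaningful basis (the nucleosynthetic processes) is a rotation of whatever PCA returns.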

Late in the day, Jim Peebles (Princeton) gave the Astro Seminar. He spoke about three related issues in numerical simulations of galaxies: They make bulges that are too large and round; they make halos that have too many stars; and they don't create a strong enough bimodality between disks and spheroids. There were many galaxy-simulators in the audience, so it was a lively talk, and a very lively dinner afterwards.


combinatoric options for a paper

I had my weekly call with Bonaca (Harvard), about information theory and cold stellar streams. We discussed which streams we should be considering in our paper. We have combinatoric choices, because there are N streams and K Milky-Way parameters; we could constrain any combination of parameters with any combination of streams! And it is even worse than that, because we are talking about basis-function expansions for the Milky-Way potential, which means that K is tending to infinity! We tentatively decided to do something fairly comprehensive and live with the fact that we won't be able to fully interpret it with finite page charges.


circumbinary planets, next-gen EPRV

The Gaia DR2 workshop and Stars Group meeting were both very well attended! At the former, Price-Whelan (Princeton) showed us PyGaia, a tool from Anthony Brown's group in Leiden to simulate the measurement properties of the Gaia Mission. It is really a noise model. And incredibly useful, and easy to use.

In the Stars meeting, so many things! Andrew Mann (Columbia) spoke about the reality or controversies around Planet 9, which also got us arguing about claims of extra-solar asteroids. Kopytova (ASU) described her project to sensitively find chemical-abundance anomalies among stars with companions, and asked the audience to help find ways that false effects could be created. Her method is very safe, so it would take a near-conspiracy, I think, but Brewer (Yale) disagreed. Veselin Kostov (Goddard) talked about searching for circumbinary planets. This is a good idea! He has found a few in Kepler and believes there are more hidden. It is interesting for TESS for a number of reasons, one of which is that you can sometimes infer the period of the exoplanet from only a short stretch of transit data (much shorter than the period), by capitalizing on a double transit across the binary.

Didier Queloz (Cambridge) was in town for the day. Bedell (Flatiron) and I discussed with him next-generation projects for HARPS and new HARPS-like instruments. He is pushing for extended campaigns on limited sets of bright stars. I like this idea for its statistical and experimental-design simplicity! But (as he notes) it is hard to get the heterogeneous community behind such big projects. He has a project to pitch, however, if people are looking to buy in to new data sources. He, Bedell, and I discussed what we know about limits to precision in this kind of work. We aren't far apart, in that we all agree that HARPS (and its competitors) are extremely well calibrated machines, calibrated to much better than the end-to-end precision obtained.


searches for anomalies

Today Kate Storey-Fisher (NYU) and I met with Mike Blanton (NYU) and Zhongxu Zhai (NYU) to discuss possible projects that Storey-Fisher and I have been talking about. We are thinking about trying to systematize (and pre-register) the search for anomalies in cosmological surveys. The idea (which is still vague) is to somehow lexicographically order all anomalies we could search for, and then search, such that we can keep exquisite track of the number of independent hypotheses we have checked.

Blanton and Zhai had some advice for us. One category of advice was around systematics: Anomalies and systematics in the data might appear similar! So we should think about anomalies that are somehow least sensitive to these systematics. One good thing is that we are working at the home of many of the tools that we need to make these assessments. Another category of advice was to think about what anomalies are motivated by questions of theory in the dark sector, in galaxy formation, or in the initial conditions. Theory-inspired (if not predicted) anomalies are more productive, in a scientific-literature sense, than randomly specified anomalies. We are close to being able to specify a project!
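
The reason the exquisite bookkeeping matters can be shown in a few lines (a generic look-elsewhere sketch, not our planned statistic): if you scan N independent statistics, the chance of a greater-than-3-sigma "anomaly" somewhere grows quickly, so the detection threshold must be corrected for the number of hypotheses checked.

```python
from math import erf, sqrt

def p_single(nsigma):
    """Two-sided Gaussian tail probability for one pre-registered test."""
    return 1.0 - erf(nsigma / sqrt(2.0))

def p_anywhere(nsigma, n_tests):
    """Probability that at least one of n_tests independent null
    statistics exceeds nsigma: the look-elsewhere inflation."""
    return 1.0 - (1.0 - p_single(nsigma))**n_tests

print(p_anywhere(3.0, 1))      # ~0.0027: one test, the usual 3-sigma rate
print(p_anywhere(3.0, 1000))   # ~0.93: a 3-sigma "anomaly" is expected
```

Pre-registering the ordered list of anomaly searches is what makes n_tests a known number, so a correction like this (or a less conservative cousin) can actually be applied.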


detailed abundances and stellar companions

Taisiya Kopytova arrived in NYC for a few days to work on stellar abundances and orbital companions. Her project is very well designed: She has a set of red-giant stars in APOGEE that are known to have companions. For each of these stars, she has found a set of matched stars—matched in stellar parameters—that don't have companions (or at least none that are detectable). She then compares the detailed chemical abundances between the two samples. The approach is extremely conservative and very robust to problems in the data: For a false effect to appear, it would have to be an effect that causes a companion to be detected (or not detected)! And she finds signals.
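
The matched-control step can be sketched in a few lines (my toy illustration with invented stars and tolerances, not Kopytova's code): for each companion host, take the nearest star in a scaled stellar-parameter space, without replacement.

```python
import numpy as np

def match_controls(targets, pool, scales):
    """For each target star (rows of stellar parameters, e.g. Teff, logg,
    [Fe/H]), pick the nearest pool star in scaled parameter space,
    without replacement. A sketch of the matched-sample design."""
    used = set()
    matches = []
    for t in targets:
        d2 = np.sum(((pool - t) / scales)**2, axis=1)
        j = next(j for j in np.argsort(d2) if j not in used)
        used.add(j)
        matches.append(j)
    return np.array(matches)

# Made-up demo: two "companion hosts" and a small control pool.
targets = np.array([[4800.0, 2.5, 0.0], [5000.0, 3.0, -0.2]])
pool = np.array([[4810.0, 2.4, 0.05],
                 [5600.0, 4.5, 0.3],
                 [4990.0, 3.1, -0.15]])
scales = np.array([100.0, 0.2, 0.1])   # Teff (K), logg, [Fe/H] tolerances
idx = match_controls(targets, pool, scales)
print(idx)   # -> [0 2]: each target gets its close analog
```

Adding signal-to-noise as a fourth matching coordinate is then just one more column and one more entry in the scales.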

One disturbing thing is that we find effects that depend on signal-to-noise, and we get slightly different results when we use APOGEE DR13 versus DR14 data. So we might need to match on signal-to-noise as well as on stellar parameters.