A polynomial regression model for the loudness of massive black-hole binaries
Black holes are abundant in the Universe and they come in different sizes, from those as massive as our Sun to those as massive as the Milky Way. The most massive reside at the centre of galaxies and go by the name of supermassive black holes. Discerning their origin is a key challenge of gravitational-wave astrophysics (and an overall cool topic to spend time contemplating!).
It has been suggested that supermassive black holes could form from collisions of smaller, but still massive, black holes. The latter have been dubbed intermediate-mass black holes, and they have not yet been detected because their signals fall in frequency bands that elude both gravitational-wave and electromagnetic observations.
In this project, I ask to what extent future gravitational-wave interferometers could detect binaries with intermediate-mass black holes (IMBHs), thus paving the way to decipher the origin of supermassive black holes. I perform a scan of the parameter space and produce estimates of the signal-to-noise ratio of IMBH binaries, which I fit with a polynomial regression model that achieves $R^2 = 0.996$.
Table of Contents
- Building the dataset
- Training and validation of the model
- How to use the model and thoughts for the future
Building the dataset
The signal-to-noise ratio (SNR), denoted as $\rho$, is a measure of the loudness of a signal compared to a noisy background. It is widely used in gravitational-wave (GW) astronomy to indicate when a signal from a binary could be detected: usually a value $\rho > 8$ is taken as the threshold for a positive detection. As a result, it is natural to consider the SNR as the target variable of my investigation.
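For context, the optimal SNR in a single detector is commonly defined through the matched-filter inner product of the signal with itself,
\[\rho^2 = 4 \int_0^{\infty} \frac{|\tilde{h}(f)|^2}{S_n(f)}\, df,\]
where $\tilde{h}(f)$ is the frequency-domain waveform and $S_n(f)$ is the detector's noise power spectral density; for a network of detectors, the individual SNRs add in quadrature.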
Many software packages have been developed to calculate $\rho$ for a given detector configuration. In this investigation I use $\texttt{gwbench}$ by Borhanian (2020), which is particularly well suited to calculating SNRs for next-generation detectors (and much more!).
I am interested in exploring how the SNR varies as a function of the masses of the black holes. Since the scope of the investigation is to check the detectability of IMBHs with next-generation ground-based detectors, I restrict the masses to the range $10^2\, M_\odot < m_{1,2} < 10^4\, M_\odot$ (where $M_\odot$ stands for "solar masses"). Values below $10^2\, M_\odot$ would not be considered IMBHs, while values above $10^4\, M_\odot$ are too high to be detected with next-generation ground-based detectors.
Another parameter whose effect I want to investigate is the redshift $z$, namely a measure of the distance of the binary. If we detect IMBHs far away, at redshifts $z \sim 10$ or even $15$, they could point to a dynamical origin of supermassive black holes. This is simply because at those redshifts there is not enough matter for primordial clumps of mass to accrete and form supermassive black holes. In this investigation, I consider redshifts $0.5 < z < 15$.
To recap, the features I am interested in can be divided into a target variable $y$ and a set of predictors $X$ as follows:
\[y = \rho; \quad X = \{m_1, m_2, z\}. \tag{1}\]

Let me note that other parameters, such as the spins $\chi_{1,2}$ of the black holes, the binary inclination $\iota$, the right ascension "ra" and declination "dec" parameters, the polarization angle $\psi$, and the angle $\varphi_0$ defining the direction of the observer, could all change the SNR $\rho$ considerably. I fold in this uncertainty by using an "averaged" SNR for the target variable,
\[y = \rho = \frac{1}{N} \int d\chi_{1,z}\, d\chi_{2,z}\, d\iota\, d\varphi_0\, d{\rm ra}\, d{\rm dec}\, d\psi \; \rho_\mathrm{net}, \tag{2}\]

where $\rho_\mathrm{net}$ is the SNR obtained with $\texttt{gwbench}$ for a network of detectors and for a given configuration of all the parameters mentioned above. For illustration purposes only, in this investigation I use a network of two American Cosmic Explorer detectors and the European Einstein Telescope.
Finally, I create the dataset by scanning the parameter space of $X$ while averaging the SNR over many choices of the parameters in equation (2). The procedure involves calculating a large number of SNRs. Even though each calculation only takes seconds, I found it very useful to parallelize the code and use GPUs from the "Rockfish" cluster at Johns Hopkins University.
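As a rough sketch of the averaging step (the actual gwbench calls and the exact parameter distributions are not shown here, so the callable `snr_fn` and the uniform/isotropic draws below are illustrative assumptions):

import numpy as np

def averaged_snr(snr_fn, m1, m2, z, n_samples=100, seed=0):
    """Average the network SNR over randomly drawn extrinsic parameters.

    `snr_fn` is a user-supplied callable (e.g. a thin wrapper around gwbench)
    that returns rho_net for one full parameter configuration.
    """
    rng = np.random.default_rng(seed)
    snrs = []
    for _ in range(n_samples):
        snrs.append(
            snr_fn(
                m1=m1, m2=m2, z=z,
                chi1z=rng.uniform(-1, 1),            # aligned spin components
                chi2z=rng.uniform(-1, 1),
                iota=np.arccos(rng.uniform(-1, 1)),  # isotropic inclination
                phi0=rng.uniform(0, 2 * np.pi),      # observer/phase angle
                ra=rng.uniform(0, 2 * np.pi),        # right ascension
                dec=np.arcsin(rng.uniform(-1, 1)),   # declination (isotropic sky)
                psi=rng.uniform(0, np.pi),           # polarization angle
            )
        )
    return np.mean(snrs)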
The procedure creates a dataset with 2994 rows and 4 columns, matching equation (1), which I load as follows.
import pandas as pd

df = pd.read_pickle("./dataset_SNRavg.pkl")
df.columns = ['m1', 'm2', 'z', 'SNR']
Training and validation of the model
The Jupyter notebook with the exploratory data analysis can be found here.
To help the fitting procedure, I choose variables that are easier to work with. The component masses $m_1$ and $m_2$ are features that can vary by orders of magnitude. As commonly done in GW astrophysics, I opt for sampling the log of the total mass and the symmetric mass ratio $\nu$ instead. These are defined as
\[\log_{10}(M_\text{tot}) = \log_{10}(m_1 + m_2), \quad \nu = \frac{m_1 m_2}{(m_1 + m_2)^2}.\]

In this vein, I also take the log of the (averaged) SNR. The new variables are all of order unity, and as such they lead to less variability for the fit to capture and better performance overall. The relevant code lines are:
import numpy as np

# Combine the m1 and m2 columns into a total mass and a symmetric mass ratio column.
df['Mtot'] = df.loc[:, ['m1', 'm2']].sum(axis=1)
df['massratio'] = (df['m1'] * df['m2']) / (df['Mtot']**2)  # symmetric mass ratio nu
# Take the log10 of the SNR and of the total mass for easier fits.
df['logSNR'] = np.log10(df['SNR'])
df['logMtot'] = np.log10(df['Mtot'])
# Keep only the total mass, mass ratio, redshift and log of the SNR.
df = df[['logMtot', 'massratio', 'z', 'logSNR']]
Whenever a binary is so massive that its signal moves out of the detectors' frequency band, it can no longer be detected. In this case I artificially set the SNR to 0 when preparing the dataset with $\texttt{gwbench}$. However, zeroes lead to problems and unwanted infinities when taking the log of the SNR. Because of this, I remove cases with low SNRs, $\rho < 0.1$, altogether, as they would not be informative for the analysis anyway.
# Remove the values with very low SNRs
rho_min = 0.1
df = df[df.logSNR > np.log10(rho_min)]
Before fitting the target variable $y = \rho$, I split the dataset with the $\texttt{train_test_split}$ function from $\texttt{scikit-learn}$, keeping 70% of the data for training and leaving the rest as test data.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, random_state=2, test_size=0.3)
variables = ['logMtot', 'massratio', 'z']
X_train, X_test = train[variables], test[variables]
y_train, y_test = train.logSNR, test.logSNR
Finally, I train a polynomial regression model of fifth degree:
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

# Construct the degree-5 polynomial feature transformer.
poly = PolynomialFeatures(degree=5)
# Transform the data so that it can be accepted by a linear model fit.
X_train_new = poly.fit_transform(X_train)
X_test_new = poly.transform(X_test)
# Fit the model as if it were linear regression.
model = linear_model.LinearRegression()
polyfit = model.fit(X_train_new, y_train)
y_pred = polyfit.predict(X_test_new)
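For reference, the quoted $R^2$ can be evaluated on the held-out test set with $\texttt{scikit-learn}$'s `r2_score`; this is a minimal sketch, as the exact evaluation code is not shown above.

from sklearn.metrics import r2_score

# Coefficient of determination on the held-out test set.
print(f"R^2 on test data: {r2_score(y_test, y_pred):.3f}")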
The $R^2$ for this model is really good at 0.996. At this stage, however, it is also important to check whether the fifth-degree polynomial is overfitting the data. As a check, I have generated a mock set of 3300 SNRs produced by feeding random masses and redshifts into $\texttt{gwbench}$. For this validation, I do not average the SNR. My aim is to check how well the fit can predict the SNRs when only the masses and redshift are given as input. (These are the most common outputs of astrophysical codes that predict the abundance of IMBHs.)
Using the same masses and redshifts that generated the mock SNRs, I predict the SNRs with the model above. Then, I plot the mock SNRs against the predicted ones.
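Concretely, predicting an SNR for a new binary requires passing its raw parameters through the same transformations used in training. Here is a minimal sketch, where `m1_mock`, `m2_mock` and `z_mock` are hypothetical arrays holding the mock binaries' parameters:

import numpy as np
import pandas as pd

def predict_snr(m1, m2, z, poly, polyfit):
    """Map raw masses and redshift through the training transformations
    and return the predicted (averaged) SNR."""
    mtot = m1 + m2
    X = pd.DataFrame({
        'logMtot': np.log10(mtot),
        'massratio': (m1 * m2) / mtot**2,
        'z': z,
    })
    # The model predicts log10(SNR), so exponentiate to recover the SNR.
    return 10 ** polyfit.predict(poly.transform(X))

# Hypothetical usage with the mock arrays:
# snr_pred = predict_snr(m1_mock, m2_mock, z_mock, poly, polyfit)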
It is interesting to investigate whether the predictions correctly place the binaries above the SNR threshold $\rho > 8$ (orange dashed lines), which dictates whether a binary is broadly detectable or not. The false positives and negatives populate quadrants I and IV, respectively: overall, the false negatives are only 0.98% of the entire population, while the false positives are only 0.58%. This highlights the model's good accuracy when it comes to classifying signals as detectable or not by near-future gravitational-wave experiments.
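The false-positive and false-negative fractions can be computed in a few lines; in this sketch, `snr_mock` and `snr_pred` stand in for the arrays of mock and predicted SNRs from the comparison above:

import numpy as np

def detection_confusion(snr_mock, snr_pred, threshold=8.0):
    """Fractions of binaries misclassified around the detection threshold."""
    detectable = snr_mock > threshold
    predicted = snr_pred > threshold
    n = len(snr_mock)
    false_pos = np.sum(~detectable & predicted) / n   # predicted detectable, actually not
    false_neg = np.sum(detectable & ~predicted) / n   # missed detectable binaries
    return false_pos, false_neg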
How to use the model and thoughts for the future
I have created a simple library here that can be called to predict the SNR. To do this, the total mass $M_\text{tot}$, symmetric mass ratio $0 < \nu < 0.25$ and redshift $z$ of the binary must be given as input:
# Simple usage to predict the SNR given masses and redshift.
from xgsnr import *

# Choose the parameters here.
Mtot = 1e3  # Total mass in solar masses
eta = 0.2   # Symmetric mass ratio (between 0 and 0.25)
z = 1       # Redshift
# Predict the SNR.
SNR_fit(Mtot, eta, z)
The fits can be used in conjunction with astrophysical simulations to assess when IMBHs can be detected. While the model was validated with a specific detector network, comprising two Cosmic Explorers and the Einstein Telescope, it can be generalized to other configurations with little effort.
I recommend using the model to classify whether a binary with known masses and distance is detectable with next-generation ground-based detectors. The false positives and negatives are only 0.58% and 0.98% of the binaries used for validation, which suggests the model is well suited for this task. This highlights that the model could be a very useful tool to explore the science case for next-generation experiments. Work is underway to perform these studies.